Oops, sorry, wrong thread. RC1 did NOT pass. Will close the RC2 thread in a minute. Karl
On Mon, Sep 22, 2014 at 6:31 AM, Karl Wright <[email protected]> wrote:

> Three +1's, >72 hours. Vote passes!
>
> Karl
>
> On Mon, Sep 22, 2014 at 5:05 AM, Erlend Garåsen <[email protected]> wrote:
>
>> I'm able to fetch documents from www.duo.uio.no using file-based
>> synchronization, so there are no network problems.
>>
>> Anyway, I'll continue to test RC2. Even though I'm not able to use
>> Zookeeper-based synchronization on that host, I may find other
>> bugs/problems.
>>
>> Erlend
>>
>> On 22.09.14 10:39, Erlend Garåsen wrote:
>>
>>> I can verify a possible network problem by using file-based
>>> synchronization instead.
>>>
>>> I'll do that right away and test RC2 as well, even though you already
>>> have three +1's.
>>>
>>> The three other jobs I started before I left my office on Thursday
>>> all completed successfully.
>>>
>>> Erlend
>>>
>>> On 19.09.14 12:27, Karl Wright wrote:
>>>
>>>> Well, it's crawled fine overnight, with no issues whatsoever. I'm
>>>> using a Zookeeper setup, with MCF 1.7.1 RC1.
>>>>
>>>> I still maintain you've got something broken with the network in
>>>> your production machine.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Well, FWIW it is still crawling perfectly. I'll let it run until done.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen <[email protected]> wrote:
>>>>>
>>>>>> I know. I spent a lot of time creating the rules, which seem to
>>>>>> index what we really want. Your observation is correct. Crawling
>>>>>> Dspace repositories is very difficult; there are a lot of nonsense
>>>>>> pages we need to filter out.
>>>>>>
>>>>>> We have crawled this host for the last two years using file-based
>>>>>> synch.
>>>>>>
>>>>>> I'm planning a new approach, i.e. using a connector etc.
>>>>>>
>>>>>> E
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On 18. sep.
>>>>>> 2014, at 22:35, "Karl Wright" <[email protected]> wrote:
>>>>>>
>>>>>>> Ok, I started this crawl. It fetched and processed robots.txt
>>>>>>> perfectly. And then I saw the following: lots of fetches of fairly
>>>>>>> good-sized documents, with very few ingestions. The documents that
>>>>>>> did not ingest look like this:
>>>>>>>
>>>>>>> https://www.duo.uio.no/handle/10852/163/discover?order=DESC&r...pp=100&sort_by=dc.date.issued_dt
>>>>>>>
>>>>>>> I think your index inclusion rules may be excluding most of the
>>>>>>> content.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks -- I will probably not be able to get to this further
>>>>>>>> until tonight anyhow.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I tried to fetch documents by using curl from our prod server
>>>>>>>>> just in case a webmaster had blocked access. No problem. Maybe I
>>>>>>>>> should ask the webmaster of that host anyway, just to be sure.
>>>>>>>>>
>>>>>>>>> The interrupted message may have been caused by an abort of that
>>>>>>>>> job.
>>>>>>>>>
>>>>>>>>> I think I should just stop the problematic job and start the
>>>>>>>>> other three remaining jobs instead. I bet they will all
>>>>>>>>> complete. Ideally we shouldn't crawl www.duo.uio.no at all since
>>>>>>>>> it's a Dspace resource. I have just contacted someone who is
>>>>>>>>> indexing Dspace resources. I guess a Dspace connector is a
>>>>>>>>> better approach.
>>>>>>>>>
>>>>>>>>> Below you'll find some parameters.
>>>>>>>>>
>>>>>>>>> REPOSITORY CONNECTION
>>>>>>>>> ---------------------
>>>>>>>>> Throttling -> max connections: 30
>>>>>>>>> Throttling -> Max fetches/min: 100
>>>>>>>>> Bandwidth -> max connections: 25
>>>>>>>>> Bandwidth -> max kbytes/sec: 8000
>>>>>>>>> Bandwidth -> max fetches/min: 20
>>>>>>>>>
>>>>>>>>> JOB SETTINGS
>>>>>>>>> ------------
>>>>>>>>>
>>>>>>>>> Hop filters: Keep forever
>>>>>>>>>
>>>>>>>>> Seeds: https://www.duo.uio.no/
>>>>>>>>>
>>>>>>>>> Exclude from crawl:
>>>>>>>>> # Exclude some file types:
>>>>>>>>> \.gif$
>>>>>>>>> \.GIF$
>>>>>>>>> \.jpeg$
>>>>>>>>> \.JPEG$
>>>>>>>>> \.jpg$
>>>>>>>>> \.JPG$
>>>>>>>>> \.png$
>>>>>>>>> \.PNG$
>>>>>>>>> \.mpg$
>>>>>>>>> \.MPG$
>>>>>>>>> \.mpeg$
>>>>>>>>> \.MPEG$
>>>>>>>>> \.exe$
>>>>>>>>> \.bmp$
>>>>>>>>> \.BMP$
>>>>>>>>> \.mov$
>>>>>>>>> \.MOV$
>>>>>>>>> \.wmf$
>>>>>>>>> \.css$
>>>>>>>>> \.ico$
>>>>>>>>> \.ICO$
>>>>>>>>> \.mp2$
>>>>>>>>> \.mp3$
>>>>>>>>> \.mp4$
>>>>>>>>> \.wmv$
>>>>>>>>> \.tif$
>>>>>>>>> \.tiff$
>>>>>>>>> \.avi$
>>>>>>>>> \.ogg$
>>>>>>>>> \.ogv$
>>>>>>>>> \.zip$
>>>>>>>>> \.gz$
>>>>>>>>> \.psd$
>>>>>>>>>
>>>>>>>>> # TIKA-1011
>>>>>>>>> \.mhtml$
>>>>>>>>>
>>>>>>>>> # Exclude log files:
>>>>>>>>> \.log$
>>>>>>>>> \.logfile$
>>>>>>>>>
>>>>>>>>> # In general, do not allow indexing of DUO search results:
>>>>>>>>> https?://www\.duo\.uio\.no/sok/search.*
>>>>>>>>>
>>>>>>>>> # Other elements in DUO that should be excluded:
>>>>>>>>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>>>>>>>>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>>>>>>>>>
>>>>>>>>> # Skip locale settings - makes duplicates:
>>>>>>>>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>>>>>>>>>
>>>>>>>>> # Temporarily skip PDFs since we are indexing abstracts:
>>>>>>>>> https://www\.duo\.uio\.no/bitstream/handle/.+
>>>>>>>>>
>>>>>>>>> # Skip full item record:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>>>>>>>>> # New URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>>>>>>>>>
>>>>>>>>> # Skip all navigations but "start with letter":
>>>>>>>>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>>>>>>>>>
>>>>>>>>> # Skip search:
>>>>>>>>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>>>>>>>>> # New URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/discover\?.*
>>>>>>>>> https://www\.duo\.uio\.no/search-filter\?.*
>>>>>>>>>
>>>>>>>>> # Skip statistics:
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*/statistics$
>>>>>>>>>
>>>>>>>>> Exclude from index:
>>>>>>>>> # Exclude front page - no valuable info and we have QL:
>>>>>>>>> https?://www\.duo\.uio\.no/$
>>>>>>>>>
>>>>>>>>> # Do not index navigation, but follow:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>>>>>>>>> # New URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>>>>>>>>>
>>>>>>>>> # Exclude short ids (four digits or fewer), probably category listings:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>>>>>>>>> # New URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>>>>>>>>>
>>>>>>>>> Thanks for looking at this!
>>>>>>>>>
>>>>>>>>> BTW: Within an hour, I will be away from my computer and cannot
>>>>>>>>> test anymore until Monday. I'm leaving Oslo for some days, but I
>>>>>>>>> will still be able to read and answer emails.
>>>>>>>>>
>>>>>>>>> Erlend
>>>>>>>>>
>>>>>>>>> On 18.09.14 13:43, Karl Wright wrote:
>>>>>>>>>
>>>>>>>>>> Hi Erlend,
>>>>>>>>>>
>>>>>>>>>> The "Interrupted: null" message with a -104 code means only
>>>>>>>>>> that the fetch was interrupted by something. Unfortunately, the
>>>>>>>>>> message is not clear about what the cause of the interruption is.
>>>>>>>>>> This is unrelated to Zookeeper; but I agree that it is
>>>>>>>>>> suspicious that many such interruptions appear right after
>>>>>>>>>> robots is parsed.
>>>>>>>>>>
>>>>>>>>>> One cause of a -104 is when the target server forcibly drops
>>>>>>>>>> the connection, so an InterruptedIOException is thrown. Having
>>>>>>>>>> a look at the timestamps for the fetch messages, it looks
>>>>>>>>>> believable that you might have exceeded some predetermined
>>>>>>>>>> limit on that machine. They're all within a few milliseconds of
>>>>>>>>>> each other. When a robots file needs to be read, ManifoldCF
>>>>>>>>>> creates an event for that, and the urls blocked by that event
>>>>>>>>>> will all be 'fetchable' as soon as the event is released.
>>>>>>>>>> Perhaps your throttling needs to be adjusted now that the rate
>>>>>>>>>> limit bug has been fixed?
>>>>>>>>>>
>>>>>>>>>> I won't be able to work with this without at least your
>>>>>>>>>> crawling parameters for the server in question. I can ping that
>>>>>>>>>> server, so if you would like I can try crawling that server
>>>>>>>>>> from here.
>>>>>>>>>>
>>>>>>>>>> For zookeeper, I would still try to either increase your tick
>>>>>>>>>> count to maybe 10000, or better yet, find out why you
>>>>>>>>>> periodically lose the ability to transmit pings from MCF to
>>>>>>>>>> your zookeeper process.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 18.09.14 13:00, Karl Wright wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Erlend,
>>>>>>>>>>>>
>>>>>>>>>>>> please can you also add the manifoldcf log as well?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I will, but it includes entries from RC0 as well.
>>>>>>>>>>>
>>>>>>>>>>> MCF works perfectly using the other jobs for the other hosts.
>>>>>>>>>>> Take a look at the following once again. MCF is being
>>>>>>>>>>> interrupted:
>>>>>>>>>>>
>>>>>>>>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Interrupted: null
>>>>>>>>>>>
>>>>>>>>>>> You can find this entry near the other regarding the
>>>>>>>>>>> robots.txt file:
>>>>>>>>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>>>>>>>>
>>>>>>>>>>> Erlend
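
[Editor's note: for readers unfamiliar with how "Exclude from crawl" lists like Erlend's behave, here is a minimal sketch of the matching logic. It assumes each non-comment line is a regular expression and that a URL is excluded when any expression finds a match anywhere in it; the function name and the small rule subset are illustrative, not ManifoldCF's actual code, and the exact match semantics should be checked against the web connector documentation.]

```python
import re

# A small, illustrative subset of the exclusion rules posted in the thread.
EXCLUDE_FROM_CRAWL = [
    r"\.gif$",
    r"https?://www\.duo\.uio\.no/sok/search.*",
    r"https://www\.duo\.uio\.no/discover\?.*",
    r"https://www\.duo\.uio\.no/handle/.*\?show=full$",
]

def is_excluded(url: str, rules=EXCLUDE_FROM_CRAWL) -> bool:
    """Return True if any rule matches somewhere in the URL (re.search)."""
    return any(re.search(rule, url) for rule in rules)

# Top-level discover pages are excluded by the rules above, but the
# per-handle discover URLs Karl saw being fetched are not (note that the
# handle-scoped discover rule is commented out in Erlend's list):
print(is_excluded("https://www.duo.uio.no/discover?rpp=100"))                     # True
print(is_excluded("https://www.duo.uio.no/handle/10852/163/discover?order=DESC")) # False
```

This also illustrates why the commented-out `#https://www\.duo\.uio\.no/handle/.*/discover\?.*` line matters: without it, the handle-scoped discover pages remain crawlable.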

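[Editor's note: Karl's advice to "increase your tick count to maybe 10000" most plausibly refers to ZooKeeper's tickTime setting, the basic time unit in milliseconds from which session timeouts are derived; a larger value tolerates more missed heartbeats before a client session such as MCF's is expired. A hedged sketch of the change in zoo.cfg, where every value other than tickTime is a common default or placeholder rather than anything taken from this thread:]

```
# zoo.cfg -- illustrative sketch only.
# Raising tickTime from the usual 2000 ms default makes ZooKeeper more
# tolerant of delayed pings before it expires the MCF session.
tickTime=10000
dataDir=/var/lib/zookeeper   # hypothetical path
clientPort=2181
```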