Three +1's, >72 hours. Vote passes!

Karl
On Mon, Sep 22, 2014 at 5:05 AM, Erlend Garåsen <[email protected]> wrote:

> I'm able to fetch documents from www.duo.uio.no using file-based
> synchronization, so there are no network problems.
>
> Anyway, I'll continue to test RC2. Even though I'm not able to use
> Zookeeper-based synchronization on that host, I may find other
> bugs/problems.
>
> Erlend
>
> On 22.09.14 10:39, Erlend Garåsen wrote:
>
>> I can verify whether there really is a network problem by using
>> file-based synchronization instead.
>>
>> I'll do that right away and test RC2 as well, even though you already
>> have three +1's.
>>
>> The three other jobs I started before I left my office on Thursday
>> all completed successfully.
>>
>> Erlend
>>
>> On 19.09.14 12:27, Karl Wright wrote:
>>
>>> Well, it's crawled fine overnight, with no issues whatsoever. I'm
>>> using a Zookeeper setup, with MCF 1.7.1 RC1.
>>>
>>> I still maintain you've got something broken with the network on your
>>> production machine.
>>>
>>> Karl
>>>
>>> On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Well, FWIW it is still crawling perfectly. I'll let it run until done.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen
>>>> <[email protected]> wrote:
>>>>
>>>>> I know. I spent a lot of time creating the rules, which seem to index
>>>>> what we really want. Your observation is correct: crawling Dspace
>>>>> repositories is very difficult, and there are a lot of nonsense pages
>>>>> we need to filter out.
>>>>>
>>>>> We have crawled this host for the last two years using file-based synch.
>>>>>
>>>>> I'm planning a new approach, i.e. using a connector etc.
>>>>>
>>>>> E
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On 18. sep. 2014, at 22:35, "Karl Wright" <[email protected]> wrote:
>>>>>
>>>>>> Ok, I started this crawl. It fetched and processed robots.txt
>>>>>> perfectly. And then I saw the following: lots of fetches of fairly
>>>>>> good-sized documents, with very few ingestions. The documents that
>>>>>> did not ingest look like this:
>>>>>>
>>>>>> https://www.duo.uio.no/handle/10852/163/discover?order=DESC&r...pp=100&sort_by=dc.date.issued_dt
>>>>>>
>>>>>> I think your index inclusion rules may be excluding most of the
>>>>>> content.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks -- I will probably not be able to get to this further until
>>>>>>> tonight anyhow.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> I tried to fetch documents using curl from our prod server, just in
>>>>>>>> case a webmaster had blocked access. No problem. Maybe I should ask
>>>>>>>> the webmaster of that host anyway, just to be sure.
>>>>>>>>
>>>>>>>> The interrupted message may have been caused by an abort of that job.
>>>>>>>>
>>>>>>>> I think I should just stop the problematic job and start the other
>>>>>>>> three remaining jobs instead. I bet they will all complete. Ideally
>>>>>>>> we shouldn't crawl www.duo.uio.no at all since it's a Dspace
>>>>>>>> resource. I have just contacted someone who is indexing Dspace
>>>>>>>> resources. I guess a Dspace connector is a better approach.
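[The curl spot check Erlend describes above is easy to script. A minimal sketch using only the JDK; the class name and timeout values are illustrative, not from the thread:

import java.net.HttpURLConnection;
import java.net.URL;

// Minimal reachability probe, along the lines of the curl check described
// above: fetch the seed URL and report the HTTP status code.
public class FetchProbe {
  public static void main(String[] args) throws Exception {
    HttpURLConnection conn =
        (HttpURLConnection) new URL("https://www.duo.uio.no/").openConnection();
    conn.setConnectTimeout(10_000);  // fail fast if the network path is broken
    conn.setReadTimeout(10_000);
    System.out.println("HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}

Running this from both the working host and the failing host would help separate a network-path problem from a crawler problem, which is the distinction being debated in this thread.]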
>>>>>>>> Below you'll find some parameters.
>>>>>>>>
>>>>>>>> REPOSITORY CONNECTION
>>>>>>>> ---------------------
>>>>>>>> Throttling -> max connections: 30
>>>>>>>> Throttling -> max fetches/min: 100
>>>>>>>> Bandwidth -> max connections: 25
>>>>>>>> Bandwidth -> max kbytes/sec: 8000
>>>>>>>> Bandwidth -> max fetches/min: 20
>>>>>>>>
>>>>>>>> JOB SETTINGS
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> Hop filters: Keep forever
>>>>>>>>
>>>>>>>> Seeds: https://www.duo.uio.no/
>>>>>>>>
>>>>>>>> Exclude from crawl:
>>>>>>>> # Exclude some file types:
>>>>>>>> \.gif$
>>>>>>>> \.GIF$
>>>>>>>> \.jpeg$
>>>>>>>> \.JPEG$
>>>>>>>> \.jpg$
>>>>>>>> \.JPG$
>>>>>>>> \.png$
>>>>>>>> \.PNG$
>>>>>>>> \.mpg$
>>>>>>>> \.MPG$
>>>>>>>> \.mpeg$
>>>>>>>> \.MPEG$
>>>>>>>> \.exe$
>>>>>>>> \.bmp$
>>>>>>>> \.BMP$
>>>>>>>> \.mov$
>>>>>>>> \.MOV$
>>>>>>>> \.wmf$
>>>>>>>> \.css$
>>>>>>>> \.ico$
>>>>>>>> \.ICO$
>>>>>>>> \.mp2$
>>>>>>>> \.mp3$
>>>>>>>> \.mp4$
>>>>>>>> \.wmv$
>>>>>>>> \.tif$
>>>>>>>> \.tiff$
>>>>>>>> \.avi$
>>>>>>>> \.ogg$
>>>>>>>> \.ogv$
>>>>>>>> \.zip$
>>>>>>>> \.gz$
>>>>>>>> \.psd$
>>>>>>>>
>>>>>>>> # TIKA-1011
>>>>>>>> \.mhtml$
>>>>>>>>
>>>>>>>> # Exclude log files:
>>>>>>>> \.log$
>>>>>>>> \.logfile$
>>>>>>>>
>>>>>>>> # In general, do not allow indexing of DUO search results:
>>>>>>>> https?://www\.duo\.uio\.no/sok/search.*
>>>>>>>>
>>>>>>>> # Other DUO elements to be excluded:
>>>>>>>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>>>>>>>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>>>>>>>>
>>>>>>>> # Skip locale settings - makes duplicates:
>>>>>>>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>>>>>>>>
>>>>>>>> # Temporarily skip PDFs since we are indexing abstracts:
>>>>>>>> https://www\.duo\.uio\.no/bitstream/handle/.+
>>>>>>>>
>>>>>>>> # Skip full item record:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>>>>>>>> # New URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>>>>>>>>
>>>>>>>> # Skip all navigations but "start with letter":
>>>>>>>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>>>>>>>>
>>>>>>>> # Skip search:
>>>>>>>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>>>>>>>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>>>>>>>> # New URL structure:
>>>>>>>> https://www\.duo\.uio\.no/discover\?.*
>>>>>>>> https://www\.duo\.uio\.no/search-filter\?.*
>>>>>>>>
>>>>>>>> # Skip statistics:
>>>>>>>> https://www\.duo\.uio\.no/handle/.*/statistics$
>>>>>>>>
>>>>>>>> Exclude from index:
>>>>>>>> # Exclude front page - no valuable info and we have QL:
>>>>>>>> https?://www\.duo\.uio\.no/$
>>>>>>>>
>>>>>>>> # Do not index navigation pages, but follow them:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>>>>>>>> # New URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>>>>>>>>
>>>>>>>> # Exclude ids of four digits or fewer, probably category listings:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>>>>>>>> # New URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>>>>>>>>
>>>>>>>> Thanks for looking at this!
>>>>>>>>
>>>>>>>> BTW: Within an hour I will be away from my computer and cannot test
>>>>>>>> any more until Monday. I'm leaving Oslo for some days, but I will
>>>>>>>> still be able to read and answer emails.
>>>>>>>>
>>>>>>>> Erlend
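[A quick way to test Karl's theory that the rules exclude most of the content is to run the suspect URL through the patterns. This is only a sketch, not ManifoldCF's own evaluation code: it assumes the rules are applied as java.util.regex patterns with find() semantics, and the sample URL approximates the discover URL Karl quoted, whose query string is truncated above:

import java.util.regex.Pattern;

// Sketch: run a sample URL against the "exclude from index" navigation rules
// listed above, assuming java.util.regex patterns matched with find().
public class RuleCheck {
  public static void main(String[] args) {
    String[] excludeFromIndex = {
        "https://www\\.duo\\.uio\\.no/handle/\\d{9}/\\d+/.+",  // old URL structure
        "https://www\\.duo\\.uio\\.no/handle/\\d+/\\d+/.+"     // new URL structure
    };
    // Approximation of the URL from Karl's message (query string truncated
    // in the original mail).
    String url = "https://www.duo.uio.no/handle/10852/163/discover?order=DESC";
    for (String rule : excludeFromIndex) {
      boolean hit = Pattern.compile(rule).matcher(url).find();
      System.out.println((hit ? "excluded by: " : "not matched: ") + rule);
    }
  }
}

Under those assumptions the discover page is caught by the "do not index navigation" rule for the new URL structure, while the matching "exclude from crawl" rule is commented out, which would explain pages being fetched but not ingested.]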
>>>>>>>> On 18.09.14 13:43, Karl Wright wrote:
>>>>>>>>
>>>>>>>>> Hi Erlend,
>>>>>>>>>
>>>>>>>>> The "Interrupted: null" message with a -104 code means only that the
>>>>>>>>> fetch was interrupted by something. Unfortunately, the message is not
>>>>>>>>> clear about what the cause of the interruption is. This is unrelated
>>>>>>>>> to Zookeeper; but I agree that it is suspicious that many such
>>>>>>>>> interruptions appear right after robots.txt is parsed.
>>>>>>>>>
>>>>>>>>> One cause of a -104 is when the target server forcibly drops the
>>>>>>>>> connection, so that an InterruptedIOException is thrown. Having a
>>>>>>>>> look at the timestamps for the fetch messages, it looks believable
>>>>>>>>> that you might have exceeded some predetermined limit on that
>>>>>>>>> machine. They're all within a few milliseconds of each other. When a
>>>>>>>>> robots file needs to be read, ManifoldCF creates an event for that,
>>>>>>>>> and the urls blocked by that event all become fetchable as soon as
>>>>>>>>> the event is released. Perhaps your throttling needs to be adjusted
>>>>>>>>> now that the rate limit bug has been fixed?
>>>>>>>>>
>>>>>>>>> I won't be able to work on this without at least your crawling
>>>>>>>>> parameters for the server in question. I can ping that server, so if
>>>>>>>>> you would like, I can try crawling that server from here.
>>>>>>>>>
>>>>>>>>> For Zookeeper, I would still try to either increase your tick count
>>>>>>>>> to maybe 10000 or, better yet, find out why you periodically lose
>>>>>>>>> the ability to transmit pings from MCF to your Zookeeper process.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On 18.09.14 13:00, Karl Wright wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Erlend,
>>>>>>>>>>>
>>>>>>>>>>> Please can you add the manifoldcf log as well?
>>>>>>>>>>
>>>>>>>>>> Yes, I will, but it includes entries from RC0 as well.
>>>>>>>>>>
>>>>>>>>>> MCF works perfectly with the other jobs for the other hosts. Take a
>>>>>>>>>> look at the following once again. MCF is being interrupted:
>>>>>>>>>>
>>>>>>>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Interrupted: null
>>>>>>>>>>
>>>>>>>>>> You can find this entry near the one regarding the robots.txt file:
>>>>>>>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>>>>>>>
>>>>>>>>>> Erlend
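[For reference, the "tick count" Karl mentions above is the standard tickTime parameter in ZooKeeper's zoo.cfg. A sketch of the suggested change; the dataDir path and port are placeholder examples, not values from the thread:

# zoo.cfg -- raise tickTime from the 2000 ms default to 10000 ms, as suggested.
# ZooKeeper negotiates session timeouts between 2*tickTime and 20*tickTime by
# default, so raising it also widens the window before a slow or missed ping
# from MCF is treated as a lost session.
tickTime=10000
dataDir=/var/lib/zookeeper   # example path; keep your existing value
clientPort=2181

This buys tolerance for slow pings, but, as Karl notes, it treats the symptom; finding out why pings from MCF to ZooKeeper stall periodically is the better fix.]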
