Three +1's, >72 hours. Vote passes!

Karl
On Mon, Sep 22, 2014 at 5:05 AM, Erlend Garåsen <[email protected]> wrote:

> I'm able to fetch documents from www.duo.uio.no using file-based
> synchronization, so there are no network problems.
>
> Anyway, I'll continue to test RC2. Even though I'm not able to use
> Zookeeper-based synchronization on that host, I may find other
> bugs/problems.
>
> Erlend
>
> On 22.09.14 10:39, Erlend Garåsen wrote:
>
>> I can verify whether there really is a network problem by using
>> file-based synchronization instead.
>>
>> I'll do that right away and test RC2 as well, even though you already
>> have three +1's.
>>
>> The three other jobs I started before I left my office on Thursday
>> all completed successfully.
>>
>> Erlend
>>
>> On 19.09.14 12:27, Karl Wright wrote:
>>
>>> Well, it's crawled fine overnight, with no issues whatsoever. I'm
>>> using a Zookeeper setup, with MCF 1.7.1 RC1.
>>>
>>> I still maintain you've got something broken with the network on your
>>> production machine.
>>>
>>> Karl
>>>
>>> On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Well, FWIW it is still crawling perfectly. I'll let it run until done.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen
>>>> <[email protected]> wrote:
>>>>
>>>>> I know. I spent a lot of time creating the rules, which seem to index
>>>>> what we really want. Your observation is correct: crawling Dspace
>>>>> repositories is very difficult, and there are a lot of nonsense pages
>>>>> we need to filter out.
>>>>>
>>>>> We have crawled this host for the last two years using file-based synch.
>>>>>
>>>>> I'm planning a new approach, i.e. using a connector etc.
>>>>>
>>>>> E
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On 18. sep. 2014, at 22:35, "Karl Wright" <[email protected]> wrote:
>>>>>
>>>>>> Ok, I started this crawl. It fetched and processed robots.txt
>>>>>> perfectly. And then I saw the following: lots of fetches of fairly
>>>>>> good-sized documents, with very few ingestions. The documents that
>>>>>> did not ingest look like this:
>>>>>>
>>>>>> https://www.duo.uio.no/handle/10852/163/discover?order=DESC&r...pp=100&sort_by=dc.date.issued_dt
>>>>>>
>>>>>> I think your index inclusion rules may be excluding most of the
>>>>>> content.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks -- I will probably not be able to get to this further until
>>>>>>> tonight anyhow.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> I tried to fetch documents using curl from our prod server, just in
>>>>>>>> case a webmaster had blocked access. No problem. Maybe I should ask
>>>>>>>> the webmaster of that host anyway, just to be sure.
>>>>>>>>
>>>>>>>> The interrupted message may have been caused by an abort of that job.
>>>>>>>>
>>>>>>>> I think I should just stop the problematic job and start the other
>>>>>>>> three remaining jobs instead. I bet they will all complete. Ideally
>>>>>>>> we shouldn't crawl www.duo.uio.no at all since it's a Dspace
>>>>>>>> resource. I have just contacted someone who is indexing Dspace
>>>>>>>> resources. I guess a Dspace connector is a better approach.
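[The curl spot check Erlend describes above is easy to script. A minimal sketch using only the JDK; the class name and timeout values are illustrative, not from the thread:

import java.net.HttpURLConnection;
import java.net.URL;

// Minimal reachability probe, along the lines of the curl check described
// above: fetch the seed URL and report the HTTP status code.
public class FetchProbe {
  public static void main(String[] args) throws Exception {
    HttpURLConnection conn =
        (HttpURLConnection) new URL("https://www.duo.uio.no/").openConnection();
    conn.setConnectTimeout(10_000);  // fail fast if the network path is broken
    conn.setReadTimeout(10_000);
    System.out.println("HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}

Running this from both the working host and the failing host would help separate a network-path problem from a crawler problem, which is the distinction being debated in this thread.]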
>>>>>>>> Below you'll find some parameters.
>>>>>>>>
>>>>>>>> REPOSITORY CONNECTION
>>>>>>>> ---------------------
>>>>>>>> Throttling -> max connections: 30
>>>>>>>> Throttling -> max fetches/min: 100
>>>>>>>> Bandwidth -> max connections: 25
>>>>>>>> Bandwidth -> max kbytes/sec: 8000
>>>>>>>> Bandwidth -> max fetches/min: 20
>>>>>>>>
>>>>>>>> JOB SETTINGS
>>>>>>>> ------------
>>>>>>>>
>>>>>>>> Hop filters: Keep forever
>>>>>>>>
>>>>>>>> Seeds: https://www.duo.uio.no/
>>>>>>>>
>>>>>>>> Exclude from crawl:
>>>>>>>> # Exclude some file types:
>>>>>>>> \.gif$
>>>>>>>> \.GIF$
>>>>>>>> \.jpeg$
>>>>>>>> \.JPEG$
>>>>>>>> \.jpg$
>>>>>>>> \.JPG$
>>>>>>>> \.png$
>>>>>>>> \.PNG$
>>>>>>>> \.mpg$
>>>>>>>> \.MPG$
>>>>>>>> \.mpeg$
>>>>>>>> \.MPEG$
>>>>>>>> \.exe$
>>>>>>>> \.bmp$
>>>>>>>> \.BMP$
>>>>>>>> \.mov$
>>>>>>>> \.MOV$
>>>>>>>> \.wmf$
>>>>>>>> \.css$
>>>>>>>> \.ico$
>>>>>>>> \.ICO$
>>>>>>>> \.mp2$
>>>>>>>> \.mp3$
>>>>>>>> \.mp4$
>>>>>>>> \.wmv$
>>>>>>>> \.tif$
>>>>>>>> \.tiff$
>>>>>>>> \.avi$
>>>>>>>> \.ogg$
>>>>>>>> \.ogv$
>>>>>>>> \.zip$
>>>>>>>> \.gz$
>>>>>>>> \.psd$
>>>>>>>>
>>>>>>>> # TIKA-1011
>>>>>>>> \.mhtml$
>>>>>>>>
>>>>>>>> # Exclude log files:
>>>>>>>> \.log$
>>>>>>>> \.logfile$
>>>>>>>>
>>>>>>>> # In general, do not allow indexing of DUO search results:
>>>>>>>> https?://www\.duo\.uio\.no/sok/search.*
>>>>>>>>
>>>>>>>> # Other DUO elements to be excluded:
>>>>>>>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>>>>>>>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>>>>>>>>
>>>>>>>> # Skip locale settings - makes duplicates:
>>>>>>>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>>>>>>>>
>>>>>>>> # Temporarily skip PDFs since we are indexing abstracts:
>>>>>>>> https://www\.duo\.uio\.no/bitstream/handle/.+
>>>>>>>>
>>>>>>>> # Skip full item record:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>>>>>>>> # New URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>>>>>>>>
>>>>>>>> # Skip all navigations but "start with letter":
>>>>>>>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>>>>>>>>
>>>>>>>> # Skip search:
>>>>>>>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>>>>>>>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>>>>>>>> # New URL structure:
>>>>>>>> https://www\.duo\.uio\.no/discover\?.*
>>>>>>>> https://www\.duo\.uio\.no/search-filter\?.*
>>>>>>>>
>>>>>>>> # Skip statistics:
>>>>>>>> https://www\.duo\.uio\.no/handle/.*/statistics$
>>>>>>>>
>>>>>>>> Exclude from index:
>>>>>>>> # Exclude front page - no valuable info and we have QL:
>>>>>>>> https?://www\.duo\.uio\.no/$
>>>>>>>>
>>>>>>>> # Do not index navigation pages, but follow them:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>>>>>>>> # New URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>>>>>>>>
>>>>>>>> # Exclude ids of four digits or fewer, probably category listings:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>>>>>>>> # New URL structure:
>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>>>>>>>>
>>>>>>>> Thanks for looking at this!
>>>>>>>>
>>>>>>>> BTW: Within an hour I will be away from my computer and cannot test
>>>>>>>> any more until Monday. I'm leaving Oslo for some days, but I will
>>>>>>>> still be able to read and answer emails.
>>>>>>>>
>>>>>>>> Erlend
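[A quick way to test Karl's theory that the rules exclude most of the content is to run the suspect URL through the patterns. This is only a sketch, not ManifoldCF's own evaluation code: it assumes the rules are applied as java.util.regex patterns with find() semantics, and the sample URL approximates the discover URL Karl quoted, whose query string is truncated above:

import java.util.regex.Pattern;

// Sketch: run a sample URL against the "exclude from index" navigation rules
// listed above, assuming java.util.regex patterns matched with find().
public class RuleCheck {
  public static void main(String[] args) {
    String[] excludeFromIndex = {
        "https://www\\.duo\\.uio\\.no/handle/\\d{9}/\\d+/.+",  // old URL structure
        "https://www\\.duo\\.uio\\.no/handle/\\d+/\\d+/.+"     // new URL structure
    };
    // Approximation of the URL from Karl's message (query string truncated
    // in the original mail).
    String url = "https://www.duo.uio.no/handle/10852/163/discover?order=DESC";
    for (String rule : excludeFromIndex) {
      boolean hit = Pattern.compile(rule).matcher(url).find();
      System.out.println((hit ? "excluded by: " : "not matched: ") + rule);
    }
  }
}

Under those assumptions the discover page is caught by the "do not index navigation" rule for the new URL structure, while the matching "exclude from crawl" rule is commented out, which would explain pages being fetched but not ingested.]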
>>>>>>>> On 18.09.14 13:43, Karl Wright wrote:
>>>>>>>>
>>>>>>>>> Hi Erlend,
>>>>>>>>>
>>>>>>>>> The "Interrupted: null" message with a -104 code means only that the
>>>>>>>>> fetch was interrupted by something. Unfortunately, the message is not
>>>>>>>>> clear about what the cause of the interruption is. This is unrelated
>>>>>>>>> to Zookeeper; but I agree that it is suspicious that many such
>>>>>>>>> interruptions appear right after robots.txt is parsed.
>>>>>>>>>
>>>>>>>>> One cause of a -104 is when the target server forcibly drops the
>>>>>>>>> connection, so that an InterruptedIOException is thrown. Having a
>>>>>>>>> look at the timestamps for the fetch messages, it looks believable
>>>>>>>>> that you might have exceeded some predetermined limit on that
>>>>>>>>> machine. They're all within a few milliseconds of each other. When a
>>>>>>>>> robots file needs to be read, ManifoldCF creates an event for that,
>>>>>>>>> and the urls blocked by that event all become fetchable as soon as
>>>>>>>>> the event is released. Perhaps your throttling needs to be adjusted
>>>>>>>>> now that the rate limit bug has been fixed?
>>>>>>>>>
>>>>>>>>> I won't be able to work on this without at least your crawling
>>>>>>>>> parameters for the server in question. I can ping that server, so if
>>>>>>>>> you would like, I can try crawling that server from here.
>>>>>>>>>
>>>>>>>>> For Zookeeper, I would still try to either increase your tick count
>>>>>>>>> to maybe 10000 or, better yet, find out why you periodically lose
>>>>>>>>> the ability to transmit pings from MCF to your Zookeeper process.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On 18.09.14 13:00, Karl Wright wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Erlend,
>>>>>>>>>>>
>>>>>>>>>>> Please can you add the manifoldcf log as well?
>>>>>>>>>>
>>>>>>>>>> Yes, I will, but it includes entries from RC0 as well.
>>>>>>>>>>
>>>>>>>>>> MCF works perfectly with the other jobs for the other hosts. Take a
>>>>>>>>>> look at the following once again. MCF is being interrupted:
>>>>>>>>>>
>>>>>>>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Interrupted: null
>>>>>>>>>>
>>>>>>>>>> You can find this entry near the one regarding the robots.txt file:
>>>>>>>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>>>>>>>
>>>>>>>>>> Erlend
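[For reference, the "tick count" Karl mentions above is the standard tickTime parameter in ZooKeeper's zoo.cfg. A sketch of the suggested change; the dataDir path and port are placeholder examples, not values from the thread:

# zoo.cfg -- raise tickTime from the 2000 ms default to 10000 ms, as suggested.
# ZooKeeper negotiates session timeouts between 2*tickTime and 20*tickTime by
# default, so raising it also widens the window before a slow or missed ping
# from MCF is treated as a lost session.
tickTime=10000
dataDir=/var/lib/zookeeper   # example path; keep your existing value
clientPort=2181

This buys tolerance for slow pings, but, as Karl notes, it treats the symptom; finding out why pings from MCF to ZooKeeper stall periodically is the better fix.]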
