Oops, sorry, wrong thread. RC1 did NOT pass. Will close the RC2 thread in a minute. Karl
On Mon, Sep 22, 2014 at 6:31 AM, Karl Wright <[email protected]> wrote:

> Three +1's, >72 hours. Vote passes!
>
> Karl
>
> On Mon, Sep 22, 2014 at 5:05 AM, Erlend Garåsen <[email protected]> wrote:
>
>> I'm able to fetch documents from www.duo.uio.no using file-based
>> synchronization, so there are no network problems.
>>
>> Anyway, I'll continue to test RC2. Even though I'm not able to use
>> Zookeeper-based synchronization on that host, I may find other
>> bugs/problems.
>>
>> Erlend
>>
>> On 22.09.14 10:39, Erlend Garåsen wrote:
>>
>>> I can verify a possible network problem by using file-based
>>> synchronization instead.
>>>
>>> I'll do that right away and test RC2 as well, even though you already
>>> have three +1's.
>>>
>>> The three other jobs I started before I left my office on Thursday
>>> all completed successfully.
>>>
>>> Erlend
>>>
>>> On 19.09.14 12:27, Karl Wright wrote:
>>>
>>>> Well, it's crawled fine overnight, with no issues whatsoever. I'm
>>>> using a Zookeeper setup, with MCF 1.7.1 RC1.
>>>>
>>>> I still maintain you've got something broken with the network in
>>>> your production machine.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Well, FWIW it is still crawling perfectly. I'll let it run until done.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen <[email protected]> wrote:
>>>>>
>>>>>> I know. I spent a lot of time creating the rules, which seem to
>>>>>> index what we really want. Your observation is correct. Crawling
>>>>>> Dspace repositories is very difficult; there are a lot of nonsense
>>>>>> pages we need to filter out.
>>>>>>
>>>>>> We have crawled this host for the last two years using file-based
>>>>>> synch.
>>>>>>
>>>>>> I'm planning a new approach, i.e. using a connector etc.
>>>>>>
>>>>>> E
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On 18. sep.
>>>>>> 2014, at 22:35, "Karl Wright" <[email protected]> wrote:
>>>>>>
>>>>>>> Ok, I started this crawl. It fetched and processed robots.txt
>>>>>>> perfectly. And then I saw the following: lots of fetches of fairly
>>>>>>> good-sized documents, with very few ingestions. The documents that
>>>>>>> did not ingest look like this:
>>>>>>>
>>>>>>> https://www.duo.uio.no/handle/10852/163/discover?order=DESC&r...pp=100&sort_by=dc.date.issued_dt
>>>>>>>
>>>>>>> I think your index inclusion rules may be excluding most of the
>>>>>>> content.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks -- I will probably not be able to get to this further
>>>>>>>> until tonight anyhow.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I tried to fetch documents by using curl from our prod server
>>>>>>>>> just in case a webmaster had blocked access. No problem. Maybe I
>>>>>>>>> should ask the webmaster of that host anyway, just to be sure.
>>>>>>>>>
>>>>>>>>> The interrupted message may have been caused by an abort of that
>>>>>>>>> job.
>>>>>>>>>
>>>>>>>>> I think I should just stop the problematic job and start the
>>>>>>>>> other three remaining jobs instead. I bet they will all
>>>>>>>>> complete. Ideally we shouldn't crawl www.duo.uio.no at all since
>>>>>>>>> it's a Dspace resource. I have just contacted someone who is
>>>>>>>>> indexing Dspace resources. I guess a Dspace connector is a
>>>>>>>>> better approach.
>>>>>>>>>
>>>>>>>>> Below you'll find some parameters.
>>>>>>>>>
>>>>>>>>> REPOSITORY CONNECTION
>>>>>>>>> ---------------------
>>>>>>>>> Throttling -> max connections: 30
>>>>>>>>> Throttling -> Max fetches/min: 100
>>>>>>>>> Bandwidth -> max connections: 25
>>>>>>>>> Bandwidth -> max kbytes/sec: 8000
>>>>>>>>> Bandwidth -> max fetches/min: 20
>>>>>>>>>
>>>>>>>>> JOB SETTINGS
>>>>>>>>> ------------
>>>>>>>>>
>>>>>>>>> Hop filters: Keep forever
>>>>>>>>>
>>>>>>>>> Seeds: https://www.duo.uio.no/
>>>>>>>>>
>>>>>>>>> Exclude from crawl:
>>>>>>>>> # Exclude some file types:
>>>>>>>>> \.gif$
>>>>>>>>> \.GIF$
>>>>>>>>> \.jpeg$
>>>>>>>>> \.JPEG$
>>>>>>>>> \.jpg$
>>>>>>>>> \.JPG$
>>>>>>>>> \.png$
>>>>>>>>> \.PNG$
>>>>>>>>> \.mpg$
>>>>>>>>> \.MPG$
>>>>>>>>> \.mpeg$
>>>>>>>>> \.MPEG$
>>>>>>>>> \.exe$
>>>>>>>>> \.bmp$
>>>>>>>>> \.BMP$
>>>>>>>>> \.mov$
>>>>>>>>> \.MOV$
>>>>>>>>> \.wmf$
>>>>>>>>> \.css$
>>>>>>>>> \.ico$
>>>>>>>>> \.ICO$
>>>>>>>>> \.mp2$
>>>>>>>>> \.mp3$
>>>>>>>>> \.mp4$
>>>>>>>>> \.wmv$
>>>>>>>>> \.tif$
>>>>>>>>> \.tiff$
>>>>>>>>> \.avi$
>>>>>>>>> \.ogg$
>>>>>>>>> \.ogv$
>>>>>>>>> \.zip$
>>>>>>>>> \.gz$
>>>>>>>>> \.psd$
>>>>>>>>>
>>>>>>>>> # TIKA-1011
>>>>>>>>> \.mhtml$
>>>>>>>>>
>>>>>>>>> # Exclude log files:
>>>>>>>>> \.log$
>>>>>>>>> \.logfile$
>>>>>>>>>
>>>>>>>>> # In general, do not allow indexing of DUO search results:
>>>>>>>>> https?://www\.duo\.uio\.no/sok/search.*
>>>>>>>>>
>>>>>>>>> # Other elements in DUO that should be excluded:
>>>>>>>>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>>>>>>>>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>>>>>>>>>
>>>>>>>>> # Skip locale settings - makes duplicates:
>>>>>>>>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>>>>>>>>>
>>>>>>>>> # Temporarily skip PDFs since we are indexing abstracts:
>>>>>>>>> https://www\.duo\.uio\.no/bitstream/handle/.+
>>>>>>>>>
>>>>>>>>> # Skip full item record:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>>>>>>>>> # New URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>>>>>>>>>
>>>>>>>>> # Skip all navigations but "start with letter":
>>>>>>>>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>>>>>>>>>
>>>>>>>>> # Skip search:
>>>>>>>>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>>>>>>>>> # New URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/discover\?.*
>>>>>>>>> https://www\.duo\.uio\.no/search-filter\?.*
>>>>>>>>>
>>>>>>>>> # Skip statistics:
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*/statistics$
>>>>>>>>>
>>>>>>>>> Exclude from index:
>>>>>>>>> # Exclude front page - no valuable info and we have QL:
>>>>>>>>> https?://www\.duo\.uio\.no/$
>>>>>>>>>
>>>>>>>>> # Do not index navigation, but follow:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>>>>>>>>> # New URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>>>>>>>>>
>>>>>>>>> # Exclude short ids (four digits or fewer), probably category listings:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>>>>>>>>> # New URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>>>>>>>>>
>>>>>>>>> Thanks for looking at this!
>>>>>>>>>
>>>>>>>>> BTW: Within an hour, I will be away from my computer and cannot
>>>>>>>>> test anymore until Monday. I'm leaving Oslo for some days, but I
>>>>>>>>> will still be able to read and answer emails.
>>>>>>>>>
>>>>>>>>> Erlend
>>>>>>>>>
>>>>>>>>> On 18.09.14 13:43, Karl Wright wrote:
>>>>>>>>>
>>>>>>>>>> Hi Erlend,
>>>>>>>>>>
>>>>>>>>>> The "Interrupted: null" message with a -104 code means only
>>>>>>>>>> that the fetch was interrupted by something. Unfortunately, the
>>>>>>>>>> message is not clear about what the cause of the interruption is.
>>>>>>>>>> This is unrelated to Zookeeper; but I agree that it is
>>>>>>>>>> suspicious that many such interruptions appear right after
>>>>>>>>>> robots is parsed.
>>>>>>>>>>
>>>>>>>>>> One cause of a -104 is when the target server forcibly drops
>>>>>>>>>> the connection, so an InterruptedIOException is thrown. Having
>>>>>>>>>> a look at the timestamps for the fetch messages, it looks
>>>>>>>>>> believable that you might have exceeded some predetermined
>>>>>>>>>> limit on that machine. They're all within a few milliseconds of
>>>>>>>>>> each other. When a robots file needs to be read, ManifoldCF
>>>>>>>>>> creates an event for that, and the urls blocked by that event
>>>>>>>>>> will all be 'fetchable' as soon as the event is released.
>>>>>>>>>> Perhaps your throttling needs to be adjusted now that the rate
>>>>>>>>>> limit bug has been fixed?
>>>>>>>>>>
>>>>>>>>>> I won't be able to work with this without at least your
>>>>>>>>>> crawling parameters for the server in question. I can ping that
>>>>>>>>>> server, so if you would like I can try crawling that server
>>>>>>>>>> from here.
>>>>>>>>>>
>>>>>>>>>> For zookeeper, I would still try to either increase your tick
>>>>>>>>>> count to maybe 10000, or better yet, find out why you
>>>>>>>>>> periodically lose the ability to transmit pings from MCF to
>>>>>>>>>> your zookeeper process.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On 18.09.14 13:00, Karl Wright wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Erlend,
>>>>>>>>>>>>
>>>>>>>>>>>> please can you also add the manifoldcf log as well?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I will, but it includes entries from RC0 as well.
>>>>>>>>>>>
>>>>>>>>>>> MCF works perfectly using the other jobs for the other hosts.
>>>>>>>>>>> Take a look at the following once again. MCF is being
>>>>>>>>>>> interrupted:
>>>>>>>>>>>
>>>>>>>>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Interrupted: null
>>>>>>>>>>>
>>>>>>>>>>> You can find this entry near the other regarding the
>>>>>>>>>>> robots.txt file:
>>>>>>>>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>>>>>>>>
>>>>>>>>>>> Erlend
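
[Editor's note: for readers unfamiliar with how "Exclude from crawl" lists like Erlend's behave, here is a minimal sketch of the matching logic. It assumes each non-comment line is a regular expression and that a URL is excluded when any expression finds a match anywhere in it; the function name and the small rule subset are illustrative, not ManifoldCF's actual code, and the exact match semantics should be checked against the web connector documentation.]

```python
import re

# A small, illustrative subset of the exclusion rules posted in the thread.
EXCLUDE_FROM_CRAWL = [
    r"\.gif$",
    r"https?://www\.duo\.uio\.no/sok/search.*",
    r"https://www\.duo\.uio\.no/discover\?.*",
    r"https://www\.duo\.uio\.no/handle/.*\?show=full$",
]

def is_excluded(url: str, rules=EXCLUDE_FROM_CRAWL) -> bool:
    """Return True if any rule matches somewhere in the URL (re.search)."""
    return any(re.search(rule, url) for rule in rules)

# Top-level discover pages are excluded by the rules above, but the
# per-handle discover URLs Karl saw being fetched are not (note that the
# handle-scoped discover rule is commented out in Erlend's list):
print(is_excluded("https://www.duo.uio.no/discover?rpp=100"))                     # True
print(is_excluded("https://www.duo.uio.no/handle/10852/163/discover?order=DESC")) # False
```

This also illustrates why the commented-out `#https://www\.duo\.uio\.no/handle/.*/discover\?.*` line matters: without it, the handle-scoped discover pages remain crawlable.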

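[Editor's note: Karl's advice to "increase your tick count to maybe 10000" most plausibly refers to ZooKeeper's tickTime setting, the basic time unit in milliseconds from which session timeouts are derived; a larger value tolerates more missed heartbeats before a client session such as MCF's is expired. A hedged sketch of the change in zoo.cfg, where every value other than tickTime is a common default or placeholder rather than anything taken from this thread:]

```
# zoo.cfg -- illustrative sketch only.
# Raising tickTime from the usual 2000 ms default makes ZooKeeper more
# tolerant of delayed pings before it expires the MCF session.
tickTime=10000
dataDir=/var/lib/zookeeper   # hypothetical path
clientPort=2181
```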