subject:"Issue when fetching with multiple threads"

Re: Issue when fetching with multiple threads

2015-09-16 Thread Alex Wang

Hi Sebastian, Sorry for the delay. Unfortunately, we can't afford to set fetcher.threads.per.queue = 1, since it is taking many hours to crawl a site with about 1000 pages, even if I set fetcher.server.delay = 0. I have to somehow make the multi-threaded fetching work. I made Http.getResponse(..)

Re: Issue when fetching with multiple threads

2015-09-10 Thread Sebastian Nagel

Hi Alex,

> I will make all methods "synchronized" except for setConf() in Http.java.

This may help but it will effectively disable any parallelism in the fetcher. After a quick look at the form authentication of protocol-httpclient: looks like the login is done every connection / every time getRe

Re: Issue when fetching with multiple threads

2015-09-08 Thread Alex Wang

Hi Sebastian, Thanks for shedding some light on this issue. At least I now know where to focus in order to fix it. In the segment, the login form is contained with different URLs. Sometimes, even pages that do not need authentication can have the login form indexed in Solr instead of the actual p

Re: Issue when fetching with multiple threads

2015-09-08 Thread Sebastian Nagel

Hi Alex,

> Some of the pages on the site require login. I have enabled
> HttpFormAuthentication in the protocol-httpclient plugin. However, looks
> like the login page title gets indexed into Solr instead of the actual
> page's title.

Does this mean that one segment contains multiple records und

Re: Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang

I might have identified the issue, but have no idea how to solve it. Some of the pages on the site require login. I have enabled HttpFormAuthentication in the protocol-httpclient plugin. However, it looks like the login page title gets indexed into Solr instead of the actual page's title. Anybody h

Re: Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang

Thanks Julien for your suggestion! I ran the readseg command and examined the dump. The title for the particular HTML page was indeed fetched and parsed correctly even in multithreaded fetching mode. So it looks like the problem occurred somewhere after the parsing and/or during indexing. Do you ha

Re: Issue when fetching with multiple threads

2015-09-03 Thread Julien Nioche

Hi Alex,

You can use the segment reader to check the binary content and data extracted from the parse (`./nutch readseg ...`). This should at least give you some insights into where things might have gone wrong.

HTH

Julien

On 3 September 2015 at 16:13, Alex Wang wrote:
> Hi,
>
> We are using N

Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang

Hi, We are using Nutch 1.9 to crawl an internal website and index the content into Solr 3.5. What we found is that the page titles indexed for certain HTML pages are wrong. For example, the "Contact us" page has "Login" as its page title in the Solr index. This only happens when we use multiple threads t


8 matches
