Re: Issue when fetching with multiple threads

2015-09-16 Thread Alex Wang
Hi Sebastian, Sorry for the delay. Unfortunately, we can't afford to set fetcher.threads.per.queue = 1, since it is taking many hours to crawl a site with about 1000 pages, even if I set fetcher.server.delay = 0. I have to somehow make the multi-threaded fetching work. I made Http.getResponse(..)

Re: Issue when fetching with multiple threads

2015-09-10 Thread Sebastian Nagel
Hi Alex, > I will make all methods "synchronized" except for setConf() in Http.java. This may help but it will effectively disable any parallelism in the fetcher. After a quick look at the form authentication of protocol-httpclient: looks like the login is done every connection / every time getRe
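
To illustrate the trade-off raised above, here is a minimal Java sketch, not Nutch code. It assumes the form login only needs to happen once per crawl, so only the login step is guarded by a lock and the fetches themselves stay parallel. The `Session` class, `ensureLoggedIn()`, `fetch()`, `runFetches()` and the example URLs are all hypothetical stand-ins; the real protocol-httpclient plugin may be structured quite differently.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LoginOnceDemo {

    /** Hypothetical stand-in for a shared HTTP session; not a Nutch class. */
    static class Session {
        private final AtomicInteger loginCount = new AtomicInteger();
        private final Object loginLock = new Object();
        private volatile boolean loggedIn = false;

        /** Only the login step is serialized, and it runs at most once. */
        void ensureLoggedIn() {
            if (loggedIn) {
                return; // fast path: no lock once the login has happened
            }
            synchronized (loginLock) {
                if (!loggedIn) {
                    loginCount.incrementAndGet(); // placeholder for the real form POST
                    loggedIn = true;
                }
            }
        }

        /** Fetches do not hold the lock, so fetcher threads are not serialized. */
        String fetch(String url) {
            ensureLoggedIn();
            return "content of " + url; // placeholder for the real HTTP request
        }

        int loginCount() {
            return loginCount.get();
        }
    }

    /** Fetches the given number of hypothetical pages on a fixed thread pool. */
    static int runFetches(Session session, int threads, int pages)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger fetched = new AtomicInteger();
        for (int i = 0; i < pages; i++) {
            final String url = "http://example.invalid/page" + i;
            pool.submit(() -> {
                if (session.fetch(url).endsWith(url)) {
                    fetched.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return fetched.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Session session = new Session();
        int fetched = runFetches(session, 8, 100);
        System.out.println("fetched=" + fetched + " logins=" + session.loginCount());
    }
}
```

By contrast, marking `fetch()` itself `synchronized` would let only one thread fetch at a time, which is the loss of parallelism described in the message above.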

Re: Issue when fetching with multiple threads

2015-09-08 Thread Alex Wang
Hi Sebastian, Thanks for shedding some light on this issue. At least I now know where to focus in order to fix it. In the segment, the login form is stored under different URLs. Sometimes, even pages that do not need authentication can have the login form indexed in Solr instead of the actual p

Re: Issue when fetching with multiple threads

2015-09-08 Thread Sebastian Nagel
Hi Alex, > Some of the pages on the site require login. I have enabled > HttpFormAuthentication in the protocol-httpclient plugin. However, looks > like the login page title gets indexed into Solr instead of the actual > page's title. Does this mean that one segment contains multiple records und

Re: Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang
I might have identified the issue, but have no idea how to solve it. Some of the pages on the site require login. I have enabled HttpFormAuthentication in the protocol-httpclient plugin. However, looks like the login page title gets indexed into Solr instead of the actual page's title. Anybody h

Re: Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang
Thanks Julien for your suggestion! I ran the readseg command and examined the dump. The title for the particular HTML page was indeed fetched and parsed correctly, even in multithreaded fetching mode. So it looks like the problem occurred somewhere after the parsing and/or during indexing. Do you ha

Re: Issue when fetching with multiple threads

2015-09-03 Thread Julien Nioche
Hi Alex, You can use the segment reader to check the binary content and data extracted from the parse (`./nutch readseg ...`). This should at least give you some insights into where things might have gone wrong. HTH Julien On 3 September 2015 at 16:13, Alex Wang wrote: > Hi, > > We are using N

Issue when fetching with multiple threads

2015-09-03 Thread Alex Wang
Hi, We are using Nutch 1.9 to crawl an internal website and index the content into Solr 3.5. What we found is that the page titles indexed for certain HTML pages are wrong. For example, the "Contact us" page has "Login" as its page title in the Solr index. This only happens when we use multiple threads t