Hi Sebastian,
Sorry for the delay. Unfortunately, we can't afford to set
fetcher.threads.per.queue = 1, since it takes many hours to crawl a site
with about 1000 pages, even if I set fetcher.server.delay = 0. I have to
somehow make the multi-threaded fetching work.
I made Http.getResponse(..)
Hi Alex,
> I will make all methods "synchronized" except for setConf() in Http.java.
This may help but it will effectively disable any parallelism in the
fetcher.
After a quick look at the form authentication of protocol-httpclient:
looks like the login is done every connection / every time getRe
Hi Sebastian,
Thanks for shedding some light on this issue. At least I now know where to
focus in order to fix it.
In the segment, the login form appears under different URLs. Sometimes,
even pages that do not need authentication can have the login form indexed
in Solr instead of the actual p
Hi Alex,
> Some of the pages on the site require login. I have enabled
> HttpFormAuthentication in the protocol-httpclient plugin. However, it looks
> like the login page title gets indexed into Solr instead of the actual
> page's title.
Does this mean that one segment contains multiple records und
I might have identified the issue, but have no idea how to solve it.
Some of the pages on the site require login. I have enabled
HttpFormAuthentication in the protocol-httpclient plugin. However, it looks
like the login page title gets indexed into Solr instead of the actual
page's title.
Anybody h
Thanks Julien for your suggestion! I ran the readseg command and examined
the dump. The title for the particular html page was indeed fetched and
parsed correctly even in multithread fetching mode. So it looks like the
problem occurred somewhere after the parsing and/or during indexing. Do
you ha
Hi Alex
You can use the segment reader to check the binary content and data
extracted from the parse (`./nutch readseg ...`). This should at least give
you some insights into where things might have gone wrong.
HTH
Julien
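
A minimal sketch of the readseg check suggested above, assuming it is run from the Nutch runtime directory. The segment path and output directory are placeholders, and the awk helper assumes the text dump marks each record with a `URL:: ` line and each parsed title with a `Title: ` line; check that against your own dump before relying on it.

```shell
# Dump only the parse data of one segment as plain text.
# $1 = segment directory (placeholder), $2 = output directory (placeholder).
# The text dump is expected in "$2/dump".
dump_segment() {
  bin/nutch readseg -dump "$1" "$2" \
    -nocontent -nofetch -nogenerate -noparse -noparsetext
}

# Print "URL<TAB>Title" for every record found in a readseg text dump.
# Assumes "URL:: <url>" and "Title: <title>" lines, as described above.
list_titles() {
  awk '
    /^URL:: /  { url = substr($0, 7) }
    /^Title: / { print url "\t" substr($0, 8) }
  ' "$1"
}
```

Running `list_titles` on the dump after a multi-threaded fetch shows which URLs, if any, were stored with the login page's title, which helps separate a fetch or parse problem from an indexing problem.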
On 3 September 2015 at 16:13, Alex Wang wrote:
> Hi,
>
> We are using N
Hi,
We are using Nutch 1.9 to crawl an internal website, and index the content
to Solr 3.5. What we found is that the page titles indexed for certain HTML
pages are wrong. For example, the "Contact us" page has "Login" as its page
title in the Solr index. This only happens when we use multiple threads t