Cannot crawl problem

2011-07-16 Thread Kelvin
Dear all, I was able to get Nutch 1.2 working previously. I have now done a clean install of Nutch 1.2, and I strictly followed the instructions at http://wiki.apache.org/nutch/NutchTutorialPre1.3 But now I have encountered the problem below. Why is that? Do we need to set up Tomcat in order

Fetcher thread time out

2011-07-16 Thread Markus Jelsma
Hi, With large map output the task tracker can time out (no progress update during merge). Using io.sort.factor I can tune the merge phase to proceed a bit faster. Yet it can still time out when the cluster is very busy etc. I've increased the task timeout but now it also takes longer to get

Re: Cannot crawl problem

2011-07-16 Thread Kelvin
Dear all, Just to update, I have solved my problem. Apparently, we also need to edit conf/crawl-urlfilter.txt, in addition to conf/regex-urlfilter.txt. Can we amend this page: http://wiki.apache.org/nutch/NutchTutorialPre1.3 I am sure many others have encountered the same problem as me.

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

2011-07-16 Thread lewis john mcgibbney
Hi Gabriele, At first this seems like a plausible argument, however my question concerns what Nutch would do if we wished to change the Solr core to index to. If we removed this functionality from the crawldb there would be no way to determine what Nutch was to fetch and what it wasn't.

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

2011-07-16 Thread Gabriele Kahlout
On Sat, Jul 16, 2011 at 1:29 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Gabriele, At first this seems like a plausible argument, Indeed, I think it could be a FAQ. Shall I add it to the Nutch wiki? however my question concerns what Nutch would do if we wished to change

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

2011-07-16 Thread lewis john mcgibbney
Please feel free to add this to the wiki as it is a question that will undoubtedly arise in the future. Lewis On Sat, Jul 16, 2011 at 12:37 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote: On Sat, Jul 16, 2011 at 1:29 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

2011-07-16 Thread Julien Nioche
Gabriele, What you are describing could be done with Nutch 2.0 by adding a SOLR backend to GORA. SOLR would be used to store the webtable and, provided that you set up the schema accordingly, you could index the appropriate fields for searching. I think there were plans to add SOLR as a GORA backend.

Re: Is it possible to crawl yahoo answer?

2011-07-16 Thread Kelvin
Hi Tamanjit, Thank you for your help. I tried your suggestion, but it crawls every normal URL except URLs of this type: answers.yahoo.com/question/index;_ylt=AtKz1xss1AS6RGeAQTFz1kyf5HNG;_ylv=3?qid=20110715030336AAzXnNs I also tried this suggestion by
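
One possible explanation, offered only as a guess since the message is cut off: the stock conf/regex-urlfilter.txt shipped with Nutch contains the rule -[?*!@=], which rejects URLs containing characters that suggest a query string. The minimal sketch below, assuming that rule is still in the active filter file, checks the URL from the message against the same character class. The class and method names are hypothetical, and the http:// scheme is added for illustration.

```java
import java.util.regex.Pattern;

class SkipRuleSketch {
    // Character class taken from the "-[?*!@=]" rule; assumed to be present
    // in the filter file that the crawl actually uses.
    private static final Pattern PROBABLE_QUERY = Pattern.compile("[?*!@=]");

    // Returns true if the URL contains any of the characters the rule rejects.
    static boolean wouldBeSkipped(String url) {
        return PROBABLE_QUERY.matcher(url).find();
    }

    public static void main(String[] args) {
        // URL from the message, with an http:// scheme added as an assumption.
        String question = "http://answers.yahoo.com/question/index;"
                + "_ylt=AtKz1xss1AS6RGeAQTFz1kyf5HNG;_ylv=3?qid=20110715030336AAzXnNs";
        String plain = "http://answers.yahoo.com/";

        System.out.println(wouldBeSkipped(question)); // true: contains '=' and '?'
        System.out.println(wouldBeSkipped(plain));    // false: none of ? * ! @ =
    }
}
```

If this is the cause, the rule would need to be relaxed or commented out in whichever filter file the crawl reads before URLs of this type can pass.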

Re: running tests from the command line

2011-07-16 Thread lewis john mcgibbney
Further to this, I have been working on a JIRA ticket for this [1]. If you could, can you please test it? I will test it shortly as well, and hopefully we can get this committed soon. Thank you [1] https://issues.apache.org/jira/browse/NUTCH-672 On Tue, Jul 12, 2011 at 9:36 PM, lewis john mcgibbney

Re: modifying parse implementation

2011-07-16 Thread Cam Bazz
Hello, I did not understand ParseData.parseData. In ParseData there are getContentMeta and getParseMeta. There is also a getMeta(String string), and it appears that there is no setter for this. There is also setParseMeta, but it appears content meta is not settable. Best Regards, C.B. On

Re: skipping invalid segments nutch 1.3

2011-07-16 Thread Leo Subscriptions
I've used crawl to ensure the config is correct and I don't get any errors, so I must be doing something wrong with the individual steps, but I can't see what.

Re: modifying parse implementation

2011-07-16 Thread Joye
Hello, You could put the features into ParseData by calling parseData.getParseMeta().set(features, valueOfFeatures); When you want to use them, call parseData.getParseMeta().get(features) to get them out, the same way you would use a Java Map. There is no need to call a setter method. :-) Regards, Joey
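
The pattern described above can be sketched in isolation as follows. To stay self-contained, the sketch uses small stand-in classes rather than the Nutch ones: Meta and ParseDataStandIn are hypothetical, as is the "features" key. It only shows why no setter is needed on the outer object: the getter returns a mutable container, so values can be written into it directly.

```java
import java.util.HashMap;
import java.util.Map;

class ParseMetaSketch {
    /** Hypothetical stand-in for a string-keyed metadata container; not the Nutch class. */
    static class Meta {
        private final Map<String, String> values = new HashMap<>();

        void set(String name, String value) { values.put(name, value); }

        String get(String name) { return values.get(name); }
    }

    /** Hypothetical stand-in for ParseData: exposes its metadata through a getter only. */
    static class ParseDataStandIn {
        private final Meta parseMeta = new Meta();

        Meta getParseMeta() { return parseMeta; }
    }

    public static void main(String[] args) {
        ParseDataStandIn parseData = new ParseDataStandIn();

        // Write through the getter: the returned container is mutable.
        parseData.getParseMeta().set("features", "valueOfFeatures");

        // Read it back later the same way, much like Map.get.
        System.out.println(parseData.getParseMeta().get("features"));
    }
}
```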