Re: HTTP error 400

2012-05-11 Thread Markus Jelsma
Ah, that means don't use the crawl command and do a little shell scripting to execute the separate crawl cycle commands; see the Nutch wiki for examples. And don't do solrdedup. Search the Solr wiki for deduplication. Cheers. On Fri, 11 May 2012 07:39:36 +0300, Tolga to...@ozses.net wrote:
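A minimal sketch of what scripting the separate crawl cycle commands might look like, written in Python rather than shell so the steps are easy to test. It assumes the usual Nutch 1.x subcommands (`generate`, `fetch`, `parse`, `updatedb`) and that segment directories are named by timestamp; the `bin/nutch` path, the function names, and the `top_n` default are illustrative assumptions, not taken from the message. Indexing into Solr is deliberately left out, since its arguments differ between Nutch versions.

```python
import os
import subprocess

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def newest_segment(segments_dir):
    """Return the most recent segment directory.

    Assumes segments are named by timestamp (yyyyMMddHHmmss), so the
    lexicographically largest directory name is the newest one.
    """
    names = sorted(
        name for name in os.listdir(segments_dir)
        if os.path.isdir(os.path.join(segments_dir, name))
    )
    if not names:
        raise RuntimeError("no segments found in " + segments_dir)
    return os.path.join(segments_dir, names[-1])


def run_cycle(crawldb, segments_dir, top_n=1000, run=subprocess.check_call):
    """Run one generate/fetch/parse/updatedb cycle and return the segment used.

    `run` is called once per command with an argv list; pass a stub
    (for example a list's append method) to do a dry run.
    """
    run([NUTCH, "generate", crawldb, segments_dir, "-topN", str(top_n)])
    segment = newest_segment(segments_dir)
    run([NUTCH, "fetch", segment])
    run([NUTCH, "parse", segment])
    run([NUTCH, "updatedb", crawldb, segment])
    return segment
```

After a one-time `bin/nutch inject crawl/crawldb urls`, calling `run_cycle("crawl/crawldb", "crawl/segments")` in a loop gives one crawl round per iteration. Passing `run=commands.append` with an empty list `commands` records the commands instead of executing them.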

Re: Running nutch in eclipse

2012-05-11 Thread Vijith
Ok I got it. It's there in the userlog folder in each of the slaves. On Fri, May 11, 2012 at 12:28 AM, Vijith vijithkv...@gmail.com wrote: Regarding the Nutch log, I think I missed out something while running the job. In Eclipse I have given the following as VM arguments - -Dhadoop.log.dir=logs

Indexing HTML metatags from Nutch into Solr

2012-05-11 Thread ML mail
Hello, I am using Nutch 1.4 with Solr 3.6.0 and would like to get the HTML keywords and description metatags indexed into Solr. On the Nutch side I have followed http://wiki.apache.org/nutch/IndexMetatags to get Nutch parsing and extracting the metatags (using index-metatags and

Re: Separate logger for nutch

2012-05-11 Thread Vijith
I have tried with a separate logger and a PrintWriter object to do this. It works in local mode but not in deploy mode. I am running the Nutch job file. It's running and generating the Hadoop log without any errors. But the files are not created on any of the nodes. On Fri, May 11, 2012 at 3:07

Re: Separate logger for nutch

2012-05-11 Thread Ferdy Galema
When running Hadoop in deploy mode the actual tasks are run by the MapReduce framework, so you have to check the MapReduce user logs. Either use the JobTracker interface or check them directly on the nodes in HADOOP_HOME/logs/userlogs or something like that. On Fri, May 11, 2012 at 1:11 PM, Vijith
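For checking the logs directly on a node, a small sketch like the following could list every task log file. It assumes the `logs/userlogs` location mentioned above (which the message itself gives only tentatively) and that each task attempt has its own subdirectory of log files; the function name is hypothetical.

```python
import os


def find_task_logs(hadoop_home):
    """Return a sorted list of all files under <hadoop_home>/logs/userlogs.

    The directory layout is an assumption; adjust the path if your
    Hadoop installation keeps task logs elsewhere.
    """
    root = os.path.join(hadoop_home, "logs", "userlogs")
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling `find_task_logs(os.environ["HADOOP_HOME"])` on each slave returns the paths to inspect; an empty list means nothing was found under that directory.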

Re: Separate logger for nutch

2012-05-11 Thread Ferdy Galema
There is: every task gets run in a temporary working directory. But in general the output is cleaned up after the task completes. If you want to save side data you have to figure out a workaround. This page should give you a few pointers:

Re: Crawl-tool for iterative crawling?

2012-05-11 Thread Matthias Paul
I was confused by this tutorial: http://wiki.apache.org/nutch/NutchTutorial Reading this page one might come to the conclusion that the crawl tool can't do iterative crawling, because under "3.2 Using Individual Commands for Whole-Web Crawling" there's the sentence This also permits ... incremental

Re: Crawl-tool for iterative crawling?

2012-05-11 Thread Lewis John Mcgibbney
If you would like, I could add you to the moderators group and you can word it how you wish. Please sign up to Jira, give me your Jira username on this page, and I will happily add you to the group. On the other hand, if you don't wish to do this, then please reply here with your suggestion and

Re: Separate logger for nutch

2012-05-11 Thread Markus Jelsma
Hi, Nutch uses Log4j and with it you can write log output from different classes or different log levels to different output files. I'm sure this will work with Nutch in local mode, so I believe you can make it happen with Hadoop, but it may be tricky, or not possible. Cheers On Fri, 11 May

Re: Indexing HTML metatags from Nutch into Solr

2012-05-11 Thread Ing. Eyeris Rodriguez Rueda
Hello, I am using the index-metatags plugin (I suppose that you have the index-metatags plugin in Nutch's plugins folder). First you need to include in nutch-site something like this |index-(basic|anchor|metatags|more)| also you need to include the metadata names that you want to index (in this file also):

Re: HTML documents with TXT extension

2012-05-11 Thread Bai Shen
I keep forgetting about the parsechecker. I'll have to take a look and see what it kicks out. And I've already changed Solr, I was just looking at what I could do with Nutch as well. Thanks. On Tue, May 8, 2012 at 8:44 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Nutch should

Re: Indexing HTML metatags from Nutch into Solr

2012-05-11 Thread ML mail
Hi, Actually I have already done all that, as I followed the Nutch Wiki for this purpose: http://wiki.apache.org/nutch/IndexMetatags Now your suggestion about cleaning my segments as well as solr index then re-index is a good idea. Could you just help me on the commands to achieve these 3

Re: Indexing HTML metatags from Nutch into Solr

2012-05-11 Thread Ing. Eyeris Rodriguez Rueda
Hi. I only have the index-metatags plugin in my nutch-site.xml and it is functioning successfully. I also tried parse-metatags without a positive result and finally don't use it. Also make sure that your schema in Nutch is the same as in Solr. If your index is not big you can erase the folder of your

Nutchgora - SQLTransientConnectionException?

2012-05-11 Thread Ramsel Ruiz
I just checked out nutchgora on Wednesday, and I'm getting exceptions trying to run an initial crawl. I found some threads regarding this issue but I'm not sure if it was ever solved:

Re: Nutchgora - SQLTransientConnectionException?

2012-05-11 Thread Lewis John Mcgibbney
Hi Ramsel, It would be great if you could provide what configuration you have included, and also whether or not you are keeping up to date with HEAD. This is most likely something to do with your HSQLDB configuration not matching between the server and the gora.properties configuration. Lewis On Sat, May