How to stop hadoop fetch job?

2011-07-19 Thread Александр Кожевников
Hi! I've added a huge fetch job without a time limit or topN. Now I want to stop it, but I don't want to lose my already-fetched data. The job is running on a Hadoop cluster. Nutch 1.3, Hadoop 0.20. Can anyone help me? Thanks

Re: Nutch War file

2011-07-19 Thread El-Glabro
Hi to all, I too am trying to configure Nutch and Solr for the first time these days, and I am also curious to know whether it's possible to grep some content from the pages fetched by Nutch. Thanks in advance, El-Glabro

Re: How to stop hadoop fetch job?

2011-07-19 Thread Markus Jelsma
Use hadoop job -kill id. There is no way to safely interrupt a Nutch 1.x fetcher; you are going to lose the data.
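For reference, the kill sequence Markus describes looks like this on Hadoop 0.20 (the job id below is a made-up example; list your own jobs first to find the real one):

```shell
# List running jobs to find the fetcher's job id
hadoop job -list

# Kill the fetch job; with Nutch 1.x this discards the partially
# fetched output of the interrupted segment
hadoop job -kill job_201107190001_0042
```

Segments that finished fetching before the kill remain usable; only the interrupted segment's output is lost.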

Re: OutlinkExtractor, configure schema in regex

2011-07-19 Thread Markus Jelsma
Hi Julien, On Tuesday 19 July 2011 11:20:30 Julien Nioche wrote: Hi Markus On 18 July 2011 23:46, Markus Jelsma markus.jel...@openindex.io wrote: I've modified the regular expression in OutlinkExtractor not to allow URI schemes other than http:// and I can confirm a significant increase

How to use Lucene to index Nutch 1.3 data

2011-07-19 Thread Kelvin
Dear all, After crawling using Nutch 1.3, I realise that my /crawl folder does not contain a folder /index. Is there any way to create a Lucene index from the /crawl folder? Thank you for your help. Best regards, Kelvin

Re: How to use Lucene to index Nutch 1.3 data

2011-07-19 Thread Александр Кожевников
Kelvin, You should build the index using Solr: $ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
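Spelling out the arguments of that command (the paths assume the default "crawl" output directory used elsewhere in this thread):

```shell
# Index into Solr: the arguments are the Solr core URL, the crawldb
# (fetch state and scores), the linkdb (inverted links, used for anchor
# text), and one or more segment directories to index.
bin/nutch solrindex http://127.0.0.1:8983/solr/ \
  crawl/crawldb crawl/linkdb crawl/segments/*
```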

Score is rising in every recrawl

2011-07-19 Thread Marek Bachmann
Hi List, while I was crawling a set of 2000 pages a couple of times, I noticed that the page scores are getting higher and higher every time a crawl cycle finished. (There are no new pages discovered, only known pages are recrawled) Is that behaviour correct? Thanks, Marek

Re: Score is rising in every recrawl

2011-07-19 Thread Markus Jelsma
It's a known problem with OPIC scoring. Maybe you can migrate to using WebGraph? It's really powerful and is recalculated on each run.
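A sketch of the WebGraph workflow Markus suggests, as a replacement for the inflating OPIC scores (directory names are illustrative; check bin/nutch's usage output for your version, and note the LinkRank scoring plugin must be enabled for the new scores to be used):

```shell
# Build the web graph from all segments, compute LinkRank over it, and
# write the resulting scores back into the crawldb.
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
bin/nutch linkrank -webgraphdb crawl/webgraphdb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
```

Unlike OPIC, LinkRank is recomputed from scratch over the whole link graph on each run, so scores converge instead of growing with every recrawl.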

Re: Score is rising in every recrawl

2011-07-19 Thread Marek Bachmann
On 19.07.2011 15:03, Markus Jelsma wrote: It's a familiar problem of OPIC scoring. Maybe you can migrate to using WebGraph? It's really powerful and is recalculated each time. Thanks for this

Re: reparsing an already parsed segment.

2011-07-19 Thread Markus Jelsma
You cannot reparse a segment, IIRC. On Monday 18 July 2011 23:26:20 Cam Bazz wrote: Hello, Any ideas on how I can reparse a segment? I am running experiments, and it is taking a very long time for each inject/generate/fetch/parse cycle. Best Regards, -C.B.

Re: reparsing an already parsed segment.

2011-07-19 Thread Julien Nioche
You can. Simply delete parse_text, parse_data and crawl_parse from the segment before calling the parse command on it.
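Julien's recipe as a command sequence (the segment name below is a made-up example; substitute your real timestamped segment directory):

```shell
# Re-parse a segment: remove the previous parse output, then parse again.
SEGMENT=crawl/segments/20110719150000
rm -rf "$SEGMENT/parse_text" "$SEGMENT/parse_data" "$SEGMENT/crawl_parse"
bin/nutch parse "$SEGMENT"
```

This avoids repeating the whole inject/generate/fetch cycle when only the parsing configuration has changed.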

Re: Question about solrclean

2011-07-19 Thread Marek Bachmann
Hi Markus, just for the record: today I ran the test for deleting 404 pages in Solr again. This time the URL hadn't disappeared from the crawldb. It works fine: solrclean removed it from the index as expected. But it also always wants to remove the pages which could never be fetched and

Re: Nutch War file

2011-07-19 Thread El-Glabro
I have solved my problem with this: $ bin/nutch readseg -get crawl_urls/segments/XX http://thepagethatyouwanttosee/uri/ On 18/07/11 23:10, Sethi, Parampreet wrote: Hey Lewis, Thanks for the quick reply. I have setup Nutch with Solr and I am able to index the documents in solr server. How
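Besides fetching a single URL's record as shown above, readseg can dump a whole segment to plain text, which also answers the earlier question about grepping content out of fetched pages (the segment name and search phrase below are illustrative):

```shell
# Dump an entire segment to text, then grep the dump for content.
bin/nutch readseg -dump crawl/segments/20110719150000 dump_out
grep -r "some phrase" dump_out
```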

Re: How to use Lucene to index Nutch 1.3 data

2011-07-19 Thread Kelvin
Hi Александр, Thank you for your reply, but I am not using Solr. How do I use Lucene to create an index of the /crawl folder? I went to the Lucene website, but it only explains how to index local files and HTML.

get summary without use stored content

2011-07-19 Thread tamara_nus
Hello, I want to get the summary, but without storing the content field: for example by reading the content from the segment, given the segment and URL (I don't know if that is possible). I use Nutch 1.3. Serenity

SolrDeleteDuplicates error

2011-07-19 Thread Kelvin
Sorry for the multiple postings. I am trying out Nutch 1.3, which requires Solr for indexing. I tried to crawl and index with Solr with this simple command: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 10 But why does it give me the following error? Thank you for your kind

Re: SolrDeleteDuplicates error

2011-07-19 Thread Markus Jelsma
The solrdedup job completes without failure; it is the solrindex job that's actually failing. See your hadoop.log and check Solr's output.

Re: get summary without use stored content

2011-07-19 Thread Markus Jelsma
Nutch 1.3 doesn't do this anymore. Search has been delegated to Solr, which has its own good highlighting capabilities.

Re: SolrDeleteDuplicates error

2011-07-19 Thread Kelvin
Hi Markus and Sethi, Thank you for your replies. I was stuck, but the following works for me: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/* May I know where the Lucene index directory for the /crawl folder is? I would like to use

Re: reparsing an already parsed segment.

2011-07-19 Thread Cam Bazz
Hello, And how about solrindex? Does Nutch mark what is indexed and what is not, or is it a read-only segment-to-Solr operation? Best.

selective crawl

2011-07-19 Thread Cam Bazz
Hello, If I were to identify certain pages as pages of interest in the parse-html plugin, how can I index only the pages I mark as interesting and exclude the rest? However, I still have to be able to extract outlinks from the non-interesting pages. What would be the correct approach to do that? Best.

Re: reparsing an already parsed segment.

2011-07-19 Thread Markus Jelsma
It's per segment.

Re: selective crawl

2011-07-19 Thread Markus Jelsma
So you still want to crawl and parse (for outlinks) but not index. Maybe use a parse filter to mark a page as interesting (perhaps by adding a flag to its metadata) and an indexing filter that conditionally indexes pages based on that mark.

Re: Custom HTMLParseFilter when using Tika

2011-07-19 Thread Alexander Aristov
Have you associated the content type with your parser? Best Regards Alexander Aristov On 19 July 2011 21:55, dietric diet...@gmail.com wrote: When using Tika as your HTML parser in Nutch 1.2, how is it possible to use a custom HTMLParseFilter to process parsed content? Log files confirm that

Re: Custom HTMLParseFilter when using Tika

2011-07-19 Thread dietric
Alexander Aristov wrote: Have you associated the content type with your parser? I don't want to replace the Tika parser, just implement a filter. Thanks, Dietrich

Re: Custom HTMLParseFilter when using Tika

2011-07-19 Thread dietric
Alexander Aristov wrote: In fact every filter is invoked. But you must turn on your plugin in nutch-site.xml. If you have done all this then you need to debug why it's not invoked. Maybe a previously invoked filter returns null. I did register the plugin in nutch-site.xml. I was trying

Re: How to use Lucene to index Nutch 1.3 data

2011-07-19 Thread lewis john mcgibbney
Hi Kelvin, I see you are posting on a couple of threads regarding the Lucene index generated by Nutch, which, as you correctly point out, is not there. It is no longer possible to create a Lucene index from Nutch 1.3, as all searching has been shifted to Solr; therefore Nutch 1.3 has no use for a

Nutch bugs up when starting

2011-07-19 Thread Chance Callahan
Whenever I start Nutch, I get the following error: 2011-07-20 01:40:49,744 INFO server Copying /user/hdfs/bin/nutch-1.2.jar -> /tmp/jobsub-eAdLAn/work/tmp.jar 2011-07-20 01:40:50,179 INFO server all_clusters: [hadoop.job_tracker.LiveJobTracker object at 0x94237ec,

FATAL fetcher.Fetcher: Fetcher: java.lang.NullPointerException

2011-07-19 Thread Chance Callahan
Whenever I start Nutch, I get the following error: 2011-07-20 01:40:49,744 INFO server Copying /user/hdfs/bin/nutch-1.2.jar -> /tmp/jobsub-eAdLAn/work/tmp.jar 2011-07-20 01:40:50,179 INFO server all_clusters: [hadoop.job_tracker.LiveJobTracker object at 0x94237ec,

Re: Updating Tika in Nutch

2011-07-19 Thread Mattmann, Chris A (388J)
Hey Fernando, Would be great to get a JIRA issue and patch to bring Nutch 1.4-branch up to date with the latest Tika based on your experience. Thanks for your help! Cheers, Chris On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote: Hi, You were right, it is enough to provide the right