Hi!
I've started a huge fetch job without a time limit or topN. Now I want to stop it,
but I don't want to lose my fetched data.
Job is running on hadoop cluster.
Nutch 1.3, Hadoop 0.20
Can anyone help me?
Thanks
Hi all, I am also trying to configure Nutch and Solr for the first
time these days, and I am curious to know whether it's possible to
grep some content from the pages fetched by Nutch.
Thanks in advance
El-Glabro
On 18/07/11 23:10, Sethi, Parampreet wrote:
Hey Lewis,
Thanks for the
Use hadoop job -kill <id>. There is no way to safely interrupt a Nutch 1.x
fetcher. You are going to lose the data.
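For reference, the kill sequence on the cluster looks roughly like this on Hadoop 0.20 (the job ID below is just an example; take the real one from the -list output):

```shell
# List running jobs to find the fetch job's ID
hadoop job -list

# Kill it by ID (job_YYYYMMDDHHMM_NNNN is the usual 0.20 form)
hadoop job -kill job_201107190001_0001
```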
Hi!
I've started a huge fetch job without a time limit or topN. Now I want to stop
it, but I don't want to lose my fetched data. The job is running on a hadoop
cluster.
Nutch 1.3, Hadoop
Hi Julien,
On Tuesday 19 July 2011 11:20:30 Julien Nioche wrote:
Hi Markus
On 18 July 2011 23:46, Markus Jelsma markus.jel...@openindex.io wrote:
I've modified the regular expression in OutlinkExtractor to disallow URI
schemes other than http://, and I can confirm a significant increase
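For anyone wanting to sanity-check that kind of scheme filtering outside Nutch, here is a quick shell illustration (this is not the exact OutlinkExtractor pattern, just the same idea):

```shell
# Keep only http:// URLs from a candidate list; https, ftp and mailto are dropped
printf '%s\n' 'http://example.com/a' 'https://example.com/b' \
  'ftp://example.com/c' 'mailto:user@example.com' \
  | grep -E '^http://'
# prints: http://example.com/a
```

Note that '^http://' also rejects https://, since the colon must follow "http" immediately.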
Dear all,
After crawling using Nutch 1.3, I realise that my /crawl folder does not
contain an /index folder.
Is there any way to create a Lucene index from the /crawl folder?
Thank you for your help.
Best regards,
Kelvin
Kelvin,
You should build the index using Solr:
$ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb
crawl/segments/*
19.07.2011, 15:07, Kelvin k...@yahoo.com.sg:
Dear all,
After crawling using Nutch 1.3, I realise that my /crawl folder does not
contain folder /index.
Is
Hi List,
while crawling a set of 2000 pages a couple of times, I noticed
that the page scores get higher and higher every time a crawl
cycle finishes. (No new pages are discovered; only known pages are
recrawled.)
Is that behaviour correct?
Thanks,
Marek
It's a familiar problem with OPIC scoring. Maybe you can migrate to
WebGraph? It's really powerful and is recalculated each time.
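If I remember the 1.x commands correctly, the WebGraph cycle looks roughly like this (the crawl/* paths are hypothetical; check bin/nutch's usage output for the exact flags):

```shell
# Build/update the link graph from your segments
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb

# Run the LinkRank analysis over the graph
bin/nutch linkrank -webgraphdb crawl/webgraphdb

# Write the computed scores back into the crawldb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
```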
On Tuesday 19 July 2011 15:04:01 Marek Bachmann wrote:
Hi List,
while I was crawling a set of 2000 pages a couple of times, I noticed
that the page scores
On 19.07.2011 15:16, Markus Jelsma wrote:
On Tuesday 19 July 2011 15:14:31 Marek Bachmann wrote:
On 19.07.2011 15:03, Markus Jelsma wrote:
It's a familiar problem of OPIC-scoring. Maybe you can migrate to using
WebGraph? It's really powerful and is recalculated each time.
Thanks for this
You cannot reparse a segment IIRC.
On Monday 18 July 2011 23:26:20 Cam Bazz wrote:
Hello,
Any ideas on how I can reparse a segment? I am running experiments -
and it is taking way long time for each inject/generate/fetch/parse
cycle.
Best Regards,
-C.B.
--
Markus Jelsma - CTO -
You can. Simply delete parse_text, parse_data and crawl_parse from the
segment before calling the parse command on it.
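On HDFS that would be something along these lines (the segment path is hypothetical):

```shell
SEG=crawl/segments/20110719123456

# Drop the existing parse output so the segment can be parsed again
hadoop fs -rmr $SEG/parse_text $SEG/parse_data $SEG/crawl_parse

# Re-run the parser over the segment
bin/nutch parse $SEG
```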
On 19 July 2011 14:57, Markus Jelsma markus.jel...@openindex.io wrote:
You cannot reparse a segment IIRC.
On Monday 18 July 2011 23:26:20 Cam Bazz wrote:
Hello,
Any
Hi Markus,
just for notice:
today I ran the test for deleting 404 pages in Solr again. This time the
URL hadn't disappeared from the crawldb.
It works fine. Solrclean removed it from the index as expected.
But it also always wants to remove pages that could never be fetched
and
I have solved my problem with this:
$ bin/nutch readseg -get crawl_urls/segments/XX
http://thepagethatyouwanttosee/uri/
On 18/07/11 23:10, Sethi, Parampreet wrote:
Hey Lewis,
Thanks for the quick reply. I have setup Nutch with Solr and I am able to
index the documents in solr server. How
Hi Александр,
Thank you for your reply, but I am not using Solr. How do I use Lucene to
create an index of the /crawl folder?
I went to the Lucene website, but it only explains how to index local files and
HTML.
From: Александр Кожевников b37hr3...@yandex.ru
To:
Hello,
I want to get the summary, but without storing the content field,
e.g. by using the segment: reading the content via segment and URL
(I don't know if I can do that).
I use Nutch 1.3 ...
Serenity
Sorry for the multiple postings. I am trying out Nutch 1.3, which requires Solr
for indexing.
I tried to crawl and index with Solr using this simple command:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 10
But why does it give me the following error? Thank you for your kind
The solrdedup job completes without failure, it is the solrindex job that's
actually failing. See your hadoop.log and check Solr's output.
On Tuesday 19 July 2011 17:23:51 Kelvin wrote:
Sorry for the multiple postings. I am trying out nutch 1.3, which requires
solr for indexing
I try to
Nutch 1.3 doesn't do this anymore. Search has been delegated to Solr and it
has its own good highlighting capabilities.
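For example, highlighted snippets can be requested straight from Solr at query time (field and host are illustrative; hl, hl.fl and hl.snippets are standard Solr highlighting parameters):

```shell
# Ask Solr to return up to two highlighted snippets per match from the content field
curl 'http://localhost:8983/solr/select?q=nutch&hl=true&hl.fl=content&hl.snippets=2'
```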
On Tuesday 19 July 2011 10:55:16 tamara_nus wrote:
Hello,
i want to get the summary but without stored the content field .
like using segment to do that, by reading the
Hi Markus and Sethi,
Thank you for your reply. I was stuck, tried the following, and it works for
me:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb
crawl/segments/*
May I know where the Lucene index directory for the /crawl folder is? I would
like to use
Hello,
And what about solrindex? Does Nutch mark what is indexed and what is not, or
is it a read-only segment-to-Solr operation?
best.
On Tue, Jul 19, 2011 at 5:05 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
You can. Simply delete parse_text, parse_data and crawl_parse from
Hello,
If I were to identify certain pages as pages of interest in the
parse-html plugin, how can I index only the pages I mark as interesting
and exclude the rest? However, I still have to be able to extract outlinks
from the uninteresting pages.
What would be the correct approach?
Best.
It's per segment.
Hello,
And how about solrindex? does nutch mark what is indexed and what is not,
or is it a read-only from segment to solr operation?
best.
On Tue, Jul 19, 2011 at 5:05 PM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
You can. Simply delete parse_text,
So you still want to crawl and parse (for outlinks) but not index. Maybe use
a parse filter to mark a page as interesting (perhaps by adding it to the
metadata) and an indexing filter that conditionally indexes pages based on
that mark.
Hello,
If I were to identify certain pages
Have you associated a content type with your parser?
Best Regards
Alexander Aristov
On 19 July 2011 21:55, dietric diet...@gmail.com wrote:
When using Tika as your HTML parser in Nutch 1.2, how is it possible to use
a custom HTMLParseFilter to process parsed content? Log files confirm that
Alexander Aristov wrote:
Have you associated content type to your parser?
I don't want to replace the Tika parser, just implement a filter.
Thanks
Dietrich
--
View this message in context:
http://lucene.472066.n3.nabble.com/Custom-HTMLParseFilter-when-using-Tika-tp3183207p3183264.html
Alexander Aristov wrote:
In fact every filter is invoked. But you must turn on your plugin in
nutch-site.xml.
If you have done all this, then you need to debug why yours is not invoked.
Maybe a previously invoked filter returns null.
I did register the plugin in nutch-site.xml. I was trying
Hi Kelvin,
I see you are posting on a couple of threads with regards to the Lucene
index generated by Nutch which you correctly point out is not there. It is
not possible to create a Lucene index from Nutch 1.3 anymore as all
searching has been shifted to Solr therefore Nutch 1.3 has no use for a
Whenever I start Nutch, I get the following error:
2011-07-20 01:40:49,744 INFO server Copying
/user/hdfs/bin/nutch-1.2.jar-/tmp/jobsub-eAdLAn/work/tmp.jar
2011-07-20 01:40:50,179 INFO server all_clusters:
[hadoop.job_tracker.LiveJobTracker object at 0x94237ec,
Hey Fernando,
Would be great to get a JIRA issue and patch to bring
Nutch 1.4-branch up to date with the latest Tika
based on your experience.
Thanks for your help!
Cheers,
Chris
On Jul 19, 2011, at 4:48 PM, Fernando Arreola wrote:
Hi,
You were right, it is enough to provide the right