RE: Suffix URLFilter not working

2013-06-12 Thread Markus Jelsma
We happily use that filter just as it is shipped with Nutch. Just enabling it 
in plugin.includes works for us. To ease testing you can run a few URLs through 
your filters with bin/nutch org.apache.nutch.net.URLFilterChecker.
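For example, something like this (the exact flags may differ per version; run the
class without arguments to print its usage) pipes a URL through all enabled filters
and shows whether it is accepted or rejected:

  echo "http://www.example.com/report.pdf" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined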
 
 
-Original message-
 From:Bai Shen baishen.li...@gmail.com
 Sent: Wed 12-Jun-2013 14:32
 To: user@nutch.apache.org
 Subject: Suffix URLFilter not working
 
 I'm dealing with a lot of file types that I don't want to index.  I was
 originally using the regex filter to exclude them but it was getting out of
 hand.
 
 I changed my plugin includes from
 
 urlfilter-regex
 
 to
 
 urlfilter-(regex|suffix)
 
 I've tried using both the default urlfilter-suffix.txt file via adding the
 extensions I don't want and making my own file that starts with + and
 includes the extensions I do want.
 
 Neither of these approaches seems to work.  I continue to get URLs added to
 the database that contain extensions I don't want.  Even adding a
 urlfilter.order setting to my nutch-site.xml doesn't help.
 
 I don't see any obvious bugs in the code, so I'm a bit stumped.  Any
 suggestions for what else to look at?
 
 Thanks.
 


RE: HTMLParseFilter equivalent in Nutch 2.2 ???

2013-06-12 Thread Markus Jelsma
I think for Nutch 2.x HTMLParseFilter was renamed to ParseFilter. This is not 
the case for 1.x, see NUTCH-1482.

 https://issues.apache.org/jira/browse/NUTCH-1482
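If it helps, a very rough skeleton of a 2.x parse filter could look like the
following. The signature is quoted from memory, so verify it against the 2.2
apidocs, and the package/class names here are made up:

  package org.example.nutch;                       // hypothetical package/class

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseFilter;
  import org.apache.nutch.storage.WebPage;
  import org.w3c.dom.DocumentFragment;

  import java.util.Collection;
  import java.util.Collections;

  public class MyParseFilter implements ParseFilter {

    private Configuration conf;

    @Override
    public Parse filter(String url, WebPage page, Parse parse,
                        HTMLMetaTags metaTags, DocumentFragment doc) {
      // Inspect `doc` (e.g. with XPath) here and stash the extracted values in
      // the page metadata so your indexing filter can pick them up later.
      return parse;
    }

    @Override
    public Collection<WebPage.Field> getFields() {
      // Declare which WebPage fields this filter needs; empty if none.
      return Collections.emptySet();
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
  }

The plugin also needs the usual plugin.xml/build.xml wiring and an entry in
plugin.includes, like any other parse filter.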

 
 
-Original message-
 From:Tony Mullins tonymullins...@gmail.com
 Sent: Wed 12-Jun-2013 14:37
 To: user@nutch.apache.org
 Subject: HTMLParseFilter equivalent in Nutch 2.2 ???
 
 Hi ,
 
 If I go to http://wiki.apache.org/nutch/AboutPlugins it shows that
 HTMLParseFilter is the extension point for adding custom metadata to HTML, and
 its filter method's signature is 'public ParseResult filter(Content
 content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment
 doc)', but that is from the 1.4 API docs.
 
 I am on Nutch 2.2 and there is no class named HTMLParseFilter in the 2.2
 API docs:
 http://nutch.apache.org/apidocs-2.2/allclasses-noframe.html.
 
 So please tell me which class to use in the 2.2 API for adding my custom rule
 to extract some data from an HTML page (is it ParseFilter?) and add it to the
 HTML metadata, so that later I can add it to my Solr index using an indexing
 filter plugin.
 
 
 Thanks,
 Tony.
 


RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
We don't use Boilerpipe anymore, so there is no point in sharing. Just set these two 
configuration options in nutch-site.xml:

  <property>
    <name>tika.use_boilerpipe</name>
    <value>true</value>
  </property>
  <property>
    <name>tika.boilerpipe.extractor</name>
    <value>ArticleExtractor</value>
  </property>

and it should work.
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Tue 11-Jun-2013 01:42
 To: user user@nutch.apache.org
 Subject: Re: using Tika within Nutch to remove boiler plates?
 
 Marcus, do you mind sharing a sample nutch-site.xml?
 
 
 On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Those settings belong to nutch-site. Enable BP and set the correct
  extractor and it should work just fine.
 
 
  -Original message-
   From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
   Sent: Sun 09-Jun-2013 20:47
   To: user@nutch.apache.org
   Subject: Re: using Tika within Nutch to remove boiler plates?
  
   Hi Joe,
   I've not used this feature, it would be great if one of the others could
   chime in here.
   From what I can infer from the correspondence on the issue, and the
   available patches, you should be applying the most recent one uploaded by
   Markus [0] as your starting point. This is dated as 22/11/2011.
  
   On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com wrote:
  
   
One of the comments mentioned the following:
   
tika.use_boilerpipe=true
tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
   
which part the code is it referring to?
   
   
   You will see this included in one of the earlier patches uploaded by
  Markus
   on 11/05/2011 [1]
  
  
   
Also, within the current Nutch config, should I focus on
  parse-plugin.xml?
   
   
    Look at the other patches and also Gabriele's comments. You may most likely
    need to alter something, but AFAICT the work has been done.. it's just a case
    of pulling together several contributions.
  
   Maybe you should look at the patch for 2.x (uploaded most recently by
   Roland) and see what is going on there.
  
   hth
  
   [0]
  
  https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
   [1]
  
  https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
  
 
 


RE: Data Extraction from 100+ different sites...

2013-06-11 Thread Markus Jelsma
Hi,

Yes, you should write a plugin that has a parse filter and an indexing filter. To 
ease maintenance you would want to have a file per host/domain containing XPath 
expressions, far easier than switch statements that need to be recompiled. The 
indexing filter would then index the field values extracted by your parse 
filter.

Cheers,
Markus 
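For example (the file name and syntax are entirely up to you), you could keep one
small rules file per host, something like conf/extractors/www.example.com.rules:

  # field = XPath expression
  title = //h1[@class='product-title']/text()
  price = //span[@id='price']/text()
  sku   = //table[@class='details']//td[@id='sku']/text()

The parse filter loads the file that matches the host of the page being parsed and
evaluates the expressions; the indexing filter then maps the resulting fields to
your index.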
 
-Original message-
 From:Tony Mullins tonymullins...@gmail.com
 Sent: Tue 11-Jun-2013 16:07
 To: user@nutch.apache.org
 Subject: Data Extraction from 100+ different sites...
 
 Hi,
 
 I have 100+ different sites ( and may be more will be added in near
 future), I have to crawl them and extract my required information from each
 site. So each site would have its own extraction rule ( XPaths).
 
 So far I have seen there is no built-in mechanism in Nutch to fulfill my
 requirement and I may  have to write custom HTMLParserFilter extension and
 IndexFilter plugin.
 
 And I may have to write 100+ switch cases in my plugin to handle the
 extraction rules of each site.
 
 Is this the best way to handle my requirement, or is there a better way to
 handle it?
 
 Thanks for your support and help.
 
 Tony.
 


RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
Yes, Boilerpipe is complex and difficult to adapt. It also requires you to 
preset an extraction algorithm which is impossible for us. I've created an 
extractor instead that works for most pages and ignores stuff like news 
overviews and major parts of homepages. It's also tightly coupled with our date 
extractor (based on [1]) and language detector (based on LangDetect) and image 
extraction.

In many cases Boilerpipe's ArticleExtractor will work very well, but date 
extraction such as NUTCH-141 won't do the trick as it only works on the extracted 
text as a whole and does not consider page semantics.

[1]: https://issues.apache.org/jira/browse/NUTCH-1414

-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Tue 11-Jun-2013 18:06
 To: user user@nutch.apache.org
 Subject: Re: using Tika within Nutch to remove boiler plates?
 
 Any particular reason why you don't use boilerpipe any more? So what do you
 suggest as an alternative?
 
 
 On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  we don't use Boilerpipe anymore so no point in sharing. Just set those two
  configuration options in nutch-site.xml as
 
   <property>
     <name>tika.use_boilerpipe</name>
     <value>true</value>
   </property>
   <property>
     <name>tika.boilerpipe.extractor</name>
     <value>ArticleExtractor</value>
   </property>
 
  and it should work
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Tue 11-Jun-2013 01:42
   To: user user@nutch.apache.org
   Subject: Re: using Tika within Nutch to remove boiler plates?
  
   Marcus, do you mind sharing a sample nutch-site.xml?
  
  
   On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma
   markus.jel...@openindex.iowrote:
  
Those settings belong to nutch-site. Enable BP and set the correct
extractor and it should work just fine.
   
   
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Sun 09-Jun-2013 20:47
 To: user@nutch.apache.org
 Subject: Re: using Tika within Nutch to remove boiler plates?

 Hi Joe,
 I've not used this feature, it would be great if one of the others
  could
 chime in here.
 From what I can infer from the correspondence on the issue, and the
 available patches, you should be applying the most recent one
  uploaded by
 Markus [0] as your starting point. This is dated as 22/11/2011.

 On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com
  wrote:

 
  One of the comments mentioned the following:
 
  tika.use_boilerpipe=true
  tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
 
  which part the code is it referring to?
 
 
 You will see this included in one of the earlier patches uploaded by
Markus
 on 11/05/2011 [1]


 
  Also, within the current Nutch config, should I focus on
parse-plugin.xml?
 
 
 Look at the other patches and also Gabriele's comments. You may most
likely
  need to alter something but AFAICT the work has been done.. it's just
  a
case
 of pulling together several contributions.

 Maybe you should look at the patch for 2.x (uploaded most recently by
 Roland) and see what is going on there.

 hth

 [0]

   
  https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
 [1]

   
  https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch

   
  
 
 


RE: Data Extraction from 100+ different sites...

2013-06-11 Thread Markus Jelsma
You can use URLUtil in that parse filter to determine which host/domain you 
are on and lazy-load the file with expressions for that host. Just keep a 
Map<hostname, List<expressions>> in your object and load the lists of expressions 
on demand.
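As a rough, untested sketch of the lazy-loading part (the class name and the
one-file-per-host layout are made up; inside your HtmlParseFilter you would call
rulesFor() with the host you get from URLUtil):

  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  /** Lazily loads per-host lists of XPath expressions, one file per host. */
  public class XPathRuleCache {

    private final File ruleDir;
    private final Map<String, List<String>> rulesByHost = new HashMap<String, List<String>>();

    public XPathRuleCache(File ruleDir) {
      this.ruleDir = ruleDir;
    }

    /** Returns the expressions for a host, reading "host.rules" only on first use. */
    public synchronized List<String> rulesFor(String host) {
      List<String> rules = rulesByHost.get(host);
      if (rules != null) {
        return rules;
      }
      rules = new ArrayList<String>();
      File file = new File(ruleDir, host + ".rules");
      if (file.exists()) {
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
          String line;
          while ((line = in.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty() && !line.startsWith("#")) {
              rules.add(line);
            }
          }
        } catch (IOException e) {
          // Treat an unreadable file as "no rules"; log this in a real implementation.
        }
      }
      rulesByHost.put(host, rules);     // cache the (possibly empty) list
      return rules;
    }
  }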
 
-Original message-
 From:Tony Mullins tonymullins...@gmail.com
 Sent: Tue 11-Jun-2013 18:59
 To: user@nutch.apache.org
 Subject: Re: Data Extraction from 100+ different sites...
 
 Hi Markus,
 
 I couldn't understand how I can avoid switch cases in your suggested
 idea.
 
 I would have one plugin which will implement HtmlParseFilter, and I would
 have to check the current URL by getting content.getUrl(), and this all will
 be happening in the same class, so I would have to add switch cases... I
 could add the XPath expressions for each site in separate files, but to get the
 XPath expressions I would have to decide which file to read, and for that I
 would have to put that logic in a switch case.
 
 Please correct me if I am getting this all wrong !!!
 
 And I think this is a common requirement for web crawling solutions, getting
 custom data from pages... so aren't there any such Nutch plugins
 available on the web?
 
 Thanks,
 Tony.
 
 
 On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Hi,
 
  Yes, you should write a plugin that has a parse filter and indexing
  filter. To ease maintenance you would want to have a file per host/domain
   containing XPath expressions, far easier than switch statements that need
  to be recompiled. The indexing filter would then index the field values
  extracted by your parse filter.
 
  Cheers,
  Markus
 
  -Original message-
   From:Tony Mullins tonymullins...@gmail.com
   Sent: Tue 11-Jun-2013 16:07
   To: user@nutch.apache.org
   Subject: Data Extraction from 100+ different sites...
  
   Hi,
  
   I have 100+ different sites ( and may be more will be added in near
   future), I have to crawl them and extract my required information from
  each
   site. So each site would have its own extraction rule ( XPaths).
  
   So far I have seen there is no built-in mechanism in Nutch to fulfill my
   requirement and I may  have to write custom HTMLParserFilter extension
  and
   IndexFilter plugin.
  
   And I may have to write 100+ switch cases in my plugin to handle the
   extraction rules of each site
  
   Is this the best way to handle my requirement or there is any better way
  to
   handle it ?
  
    Thanks for your support and help.
  
   Tony.
  
 
 


RE: using Tika within Nutch to remove boiler plates?

2013-06-11 Thread Markus Jelsma
In my opinion Boilerpipe is the most effective free and open source tool for 
the job :)

It does require some patching (see linked issues) and manual upgrade to 
Boilerpipe 1.2.0.
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Tue 11-Jun-2013 21:19
 To: user user@nutch.apache.org
 Subject: Re: using Tika within Nutch to remove boiler plates?
 
 So what in your opinion is the most effective way of removing boilerplates
 in Nutch crawls?
 
 
 On Tue, Jun 11, 2013 at 12:12 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Yes, Boilerpipe is complex and difficult to adapt. It also requires you to
  preset an extraction algorithm which is impossible for us. I've created an
  extractor instead that works for most pages and ignores stuff like news
  overviews and major parts of homepages. It's also tightly coupled with our
  date extractor (based on [1]) and language detector (based on LangDetect)
  and image extraction.
 
  In many cases boilerpipe's articleextractor will work very well but date
  extraction such as NUTCH-141 won't do the trick as it only works on
  extracted text as a whole and does not consider page semantics.
 
  [1]: https://issues.apache.org/jira/browse/NUTCH-1414
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Tue 11-Jun-2013 18:06
   To: user user@nutch.apache.org
   Subject: Re: using Tika within Nutch to remove boiler plates?
  
   Any particular reason why you don't use boilerpipe any more? So what do
  you
   suggest as an alternative?
  
  
   On Tue, Jun 11, 2013 at 5:41 AM, Markus Jelsma
   markus.jel...@openindex.iowrote:
  
we don't use Boilerpipe anymore so no point in sharing. Just set those
  two
configuration options in nutch-site.xml as
   
    <property>
      <name>tika.use_boilerpipe</name>
      <value>true</value>
    </property>
    <property>
      <name>tika.boilerpipe.extractor</name>
      <value>ArticleExtractor</value>
    </property>
   
and it should work
   
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Tue 11-Jun-2013 01:42
 To: user user@nutch.apache.org
 Subject: Re: using Tika within Nutch to remove boiler plates?

 Marcus, do you mind sharing a sample nutch-site.xml?


 On Mon, Jun 10, 2013 at 1:42 AM, Markus Jelsma
 markus.jel...@openindex.iowrote:

  Those settings belong to nutch-site. Enable BP and set the correct
  extractor and it should work just fine.
 
 
  -Original message-
   From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
   Sent: Sun 09-Jun-2013 20:47
   To: user@nutch.apache.org
   Subject: Re: using Tika within Nutch to remove boiler plates?
  
   Hi Joe,
   I've not used this feature, it would be great if one of the
  others
could
   chime in here.
   From what I can infer from the correspondence on the issue, and
  the
   available patches, you should be applying the most recent one
uploaded by
   Markus [0] as your starting point. This is dated as 22/11/2011.
  
   On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com
  
wrote:
  
   
One of the comments mentioned the following:
   
tika.use_boilerpipe=true
tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
   
which part the code is it referring to?
   
   
   You will see this included in one of the earlier patches
  uploaded by
  Markus
   on 11/05/2011 [1]
  
  
   
Also, within the current Nutch config, should I focus on
  parse-plugin.xml?
   
   
   Look at the other patches and also Gabriele's comments. You may
  most
  likely
    need to alter something but AFAICT the work has been done.. it's
  just
a
  case
   of pulling together several contributions.
  
   Maybe you should look at the patch for 2.x (uploaded most
  recently by
   Roland) and see what is going on there.
  
   hth
  
   [0]
  
 
   
  https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
   [1]
  
 
   
  https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
  
 

   
  
 
 


RE: using Tika within Nutch to remove boiler plates?

2013-06-10 Thread Markus Jelsma
Those settings belong to nutch-site. Enable BP and set the correct extractor 
and it should work just fine.
 
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Sun 09-Jun-2013 20:47
 To: user@nutch.apache.org
 Subject: Re: using Tika within Nutch to remove boiler plates?
 
 Hi Joe,
 I've not used this feature, it would be great if one of the others could
 chime in here.
 From what I can infer from the correspondence on the issue, and the
 available patches, you should be applying the most recent one uploaded by
 Markus [0] as your starting point. This is dated as 22/11/2011.
 
 On Sun, Jun 9, 2013 at 11:00 AM, Joe Zhang smartag...@gmail.com wrote:
 
 
  One of the comments mentioned the following:
 
  tika.use_boilerpipe=true
  tika.boilerpipe.extractor=ArticleExtractor|CanolaExtractor
 
  which part the code is it referring to?
 
 
 You will see this included in one of the earlier patches uploaded by Markus
 on 11/05/2011 [1]
 
 
 
  Also, within the current Nutch config, should I focus on parse-plugin.xml?
 
 
 Look at the other patches and also Gabriele's comments. You may most likely
  need to alter something but AFAICT the work has been done.. it's just a case
 of pulling together several contributions.
 
 Maybe you should look at the patch for 2.x (uploaded most recently by
 Roland) and see what is going on there.
 
 hth
 
 [0]
 https://issues.apache.org/jira/secure/attachment/12504736/NUTCH-961-1.5-1.patch
 [1]
 https://issues.apache.org/jira/secure/attachment/12478927/NUTCH-961-1.3-tikaparser1.patch
 


RE: Generator -adddays

2013-05-31 Thread Markus Jelsma
Please don't break existing scripts; support both lower- and uppercase.

Markus

 
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Fri 31-May-2013 19:11
 To: user@nutch.apache.org
 Subject: Re: Generator -adddays
 
 Seems like a small cli syntax bug.
 Please submit a patch and we can commit.
 Thanks
 Lewis
 
 On Friday, May 31, 2013, Bai Shen baishen.li...@gmail.com wrote:
  Two quick questions.
 
  1. Why is the parameter -adddays and not -addDays?
  2. Should it be changed to match the other parameters or is it another
  referer?
 
  Thanks.
 
 
 -- 
 *Lewis*
 


RE: How to achieve different fetcher.server.delay configuration for different hosts/sub domains?

2013-05-28 Thread Markus Jelsma
You can either use robots.txt or modify the Fetcher. The Fetcher has a 
FetchItemQueue for each queue, which also records the CrawlDelay for that queue. 
A FetchItemQueue is created by FetchItemQueues.getFetchItemQueue(), and that is 
where it sets the CrawlDelay for the queue. You can add a lookup table there that 
looks up the CrawlDelay for a given queue id (host, domain or IP).
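The lookup table itself can be trivial. A sketch of the idea (this is not existing
Nutch code, and the tab-separated file format is made up) that you would consult at
the point where the queue is created:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  /** Maps a queue id (host, domain or IP) to a crawl delay in milliseconds. */
  public class CrawlDelayTable {

    private final Map<String, Long> delays = new HashMap<String, Long>();
    private final long defaultDelay;

    /** Reads lines of the form "queueId<TAB>delayMillis" from the given file. */
    public CrawlDelayTable(String file, long defaultDelay) throws IOException {
      this.defaultDelay = defaultDelay;
      BufferedReader in = new BufferedReader(new FileReader(file));
      try {
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.trim().split("\t");
          if (parts.length == 2) {
            delays.put(parts[0], Long.parseLong(parts[1]));
          }
        }
      } finally {
        in.close();
      }
    }

    /** Falls back to the configured default when a queue id has no entry. */
    public long delayFor(String queueId) {
      Long delay = delays.get(queueId);
      return delay != null ? delay.longValue() : defaultDelay;
    }
  }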

 
 
-Original message-
 From:vivekvl vive...@yahoo.com
 Sent: Tue 28-May-2013 16:01
 To: user@nutch.apache.org
 Subject: How to achieve different fetcher.server.delay configuration for 
 different hosts/sub domains?
 
 I have a problem in configuring fetcher.server.delay for my crawl. Some of
 the subdomains need fetcher.server.delay to be high and some need it to
 be lower. Is there a straightforward way to achieve this? If yes,
 what configuration do I need to make?
 
 If this is not going to be simple, is there any workaround to achieve this?
 
 Thanks,
 Vivek
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-achieve-different-fetcher-server-delay-configuration-for-different-hosts-sub-domains-tp4066505.html
 Sent from the Nutch - User mailing list archive at Nabble.com.
 


Fetcher corrupting some segments

2013-05-27 Thread Markus Jelsma
Hi,

For some reason the fetcher sometimes produces corrupt, unreadable segments. It 
then exits with exceptions like "problem advancing post" or a negative array 
size exception, etc. 

java.lang.RuntimeException: problem advancing post rec#702
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1225)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:250)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:246)
at 
org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1431)
at 
org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1392)
at 
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:520)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at org.apache.hadoop.io.Text.readString(Text.java:402)
at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
at org.apache.nutch.parse.ParseImpl.readFields(ParseImpl.java:70)
at 
org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at 
org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1282)
at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1222)
... 7 more
2013-05-26 22:41:41,344 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: 
Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1520)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1556)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1529)

These errors produce the following exception when trying to index.

java.io.IOException: IO error in map input file 
file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-0
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:242)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: 
file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-0 at 2620416
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
at 
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at 
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at 
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
at 
org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
... 5 more

Is there any way we can debug this? The error is usually related to Nutch 
reading metadata, but since we cannot read the metadata, I cannot know what 
data is causing the issue :) Any hints on how to tackle these issues?

Markus


RE: rewriting urls that are index

2013-04-22 Thread Markus Jelsma
Hi,

The 1.x indexer takes a -normalize parameter and there you can rewrite your 
URLs. Judging from your patterns the RegexURLNormalizer should be sufficient. 
Make sure you use the config file containing that pattern only when indexing, 
otherwise the rewritten URLs will end up in the CrawlDB and segments. Use 
urlnormalizer.regex.file to specify the file or pass patterns directly using 
urlnormalizer.regex.rules.

Cheers,
Markus
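For this particular rewrite, the rule in the regex-normalize.xml file that
urlnormalizer.regex.file points at could look roughly like this (untested; also
check that no other normalizer strips the '#' fragment again):

  <regex>
    <pattern>^(http://www\.example\.com/)article=</pattern>
    <substitution>$1kb#article=</substitution>
  </regex>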
 
 
-Original message-
 From:Niels Boldt nielsbo...@gmail.com
 Sent: Mon 22-Apr-2013 15:56
 To: user@nutch.apache.org
 Subject: rewriting urls that are index
 
 Hi,
 
 We are crawling a site using nutch 1.6 and indexing into solr.
 
 However, we need to rewrite the URLs that are indexed in the following way.
 
 For instance, Nutch crawls a page http://www.example.com/article=xxx but
 when moving data to the index we would like to use the URL
 
 http://www.example.com/kb#article=xxx
 
 instead. So when we get data from Solr it will show links to
 http://www.example.com/kb#article=xxx instead
 of http://www.example.com/article=xxx
 
 Is that possible to do by creating a plugin that extends the UrlNormalizer,
 eg
 
 http://nutch.apache.org/apidocs-1.4/org/apache/nutch/net/URLNormalizer.html
 
 Or is it better to add a new indexed property that we use.
 
 Best Regards
 Niels
 


RE: Period-terminated hostnames

2013-04-18 Thread Markus Jelsma
Rodney,

Those are valid URLs but you clearly don't need them. You can either use 
filters to get rid of them or normalize them away. Use the 
org.apache.nutch.net.URLNormalizerChecker or URLFilterChecker tools to test 
your config.

Markus
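For the trailing dot, a regex-normalize.xml rule along these lines may do the
trick (untested, so do run a few of the offending URLs through the checker tool
first):

  <regex>
    <pattern>^(https?://[^/]+?)\.(/|$)</pattern>
    <substitution>$1$2</substitution>
  </regex>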

 
 
-Original message-
 From:Rodney Barnett barn...@ploughman-analytics.com
 Sent: Thu 18-Apr-2013 22:31
 To: user@nutch.apache.org
 Subject: Period-terminated hostnames
 
 I'm using nutch 1.6 to crawl a variety of web pages/sites and I'm finding
 that my solr database contains pairs of near-duplicate entries where the
 main difference is that one contains a period after the hostname in the id.
 For example:
 
 entry 1: id: http://example.com/
 
 entry 2: id: http://example.com./
 
  
 
 I can't find any references to this issue.  Has anyone else noticed this?
 Is there a good way to correct this?
 
  
 
 I've added an entry to regex-normalize.xml to remove the period, but I'm not
 sure yet whether it works.  Is there a good way to test the url normalizer
 configuration?
 
  
 
 I tracked the source of some of these urls back to hyperlinks extracted from
 PDF files where the hyperlink doesn't seem to have the period but the linked
 text is followed by a period.  For example:
 {link}http://example.com{/link}.; where the curly braces indicate the
 hyperlink boundaries.  The command nutch parsechecker reports that the
 outlink is http://example.com. for this case.
 
  
 
 Thanks for any assistance.
 
  
 
 Rodney
 
 


RE: How to Continue to Crawl with Nutch Even An Error Occurs?

2013-03-20 Thread Markus Jelsma
If Nutch exits with an error then the segment is bad, but a failing thread is not 
an error that leads to a failed segment. This means the segment is properly 
fetched, just that some records failed. Those records will be eligible for 
refetch.

Assuming you use the crawl command, the updatedb command will be successful so 
there should be no issue here. What's the problem?
 
 
-Original message-
 From:kamaci furkankam...@gmail.com
 Sent: Wed 20-Mar-2013 23:48
 To: user@nutch.apache.org
 Subject: How to Continue to Crawl with Nutch Even An Error Occurs?
 
 When I crawl with Nutch and an error occurs (e.g. when one of the threads doesn't
 finish in time) it stops crawling and exits.
 
 Is there any configuration to continue crawling even when such an error
 occurs in Nutch?
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-Continue-to-Crawl-with-Nutch-Even-An-Error-Occurs-tp4049567.html
 Sent from the Nutch - User mailing list archive at Nabble.com.
 


RE: Does Nutch Checks Whether A Page crawled before or not

2013-03-20 Thread Markus Jelsma
The CrawlDB contains information on all URLs and their status, e.g. what HTTP 
code they got, the interval, some metadata and their fetch time. Use the 
readdb command to inspect a specific URL.
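For example (crawl/crawldb being whatever your crawldb path is):

  bin/nutch readdb crawl/crawldb -url http://www.example.com/

prints the stored status, fetch time, interval, score and metadata for that one 
URL, and 'bin/nutch readdb crawl/crawldb -stats' gives the overall counts per 
status.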

 
 
-Original message-
 From:kamaci furkankam...@gmail.com
 Sent: Wed 20-Mar-2013 23:52
 To: user@nutch.apache.org
 Subject: Re: Does Nutch Checks Whether A Page crawled before or not
 
 Where does Nutch stores that information?
 
 2013/3/21 Markus Jelsma-2 [via Lucene] 
 ml-node+s472066n4049568...@n3.nabble.com
 
   Nutch selects records that are eligible for fetch. It's either due to a
   transient failure or because the fetch interval has expired. This means
   that failed fetches due to network issues are refetched within 24 hours.
   Successfully fetched pages are only refetched if the current time exceeds
   the previous fetchTime + interval.
 
 
 
  -Original message-
 
    From: kamaci [hidden email]
 
    Sent: Wed 20-Mar-2013 23:46
    To: [hidden email]
   Subject: Does Nutch Checks Whether A Page crawled before or not
  
    Let's assume that I am crawling wikipedia.org with depth 1 and topN 1.
    After it finishes crawling I rerun that command, and after it finishes, again
    and again. What happens? Does Nutch skip previously fetched pages or try to
    crawl the same pages again?
  
  
  
   --
   View this message in context:
  http://lucene.472066.n3.nabble.com/Does-Nutch-Checks-Whether-A-Page-crawled-before-or-not-tp4049564.html
   Sent from the Nutch - User mailing list archive at Nabble.com.
  
 
 
 
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Does-Nutch-Checks-Whether-A-Page-crawled-before-or-not-tp4049564p4049569.html
 Sent from the Nutch - User mailing list archive at Nabble.com.


RE: [WELCOME] Feng Lu as Apache Nutch PMC and Committer

2013-03-18 Thread Markus Jelsma
Feng Lu, welcome! :)

 
 
-Original message-
 From:Julien Nioche lists.digitalpeb...@gmail.com
 Sent: Mon 18-Mar-2013 13:23
 To: user@nutch.apache.org
 Cc: d...@nutch.apache.org
 Subject: Re: [WELCOME] Feng Lu as Apache Nutch PMC and Committer
 
 Hi Feng, 
 
 Congratulations on becoming a committer and welcome! 
  
 [...]
 
  
 A problem that has been troubling me for a long time is: what is the target of
 Nutch 1.x? Is Nutch 1.x just a transitional version towards Nutch 2.x, or can
 they coexist because Nutch 1.x has a different data processing method
 from Nutch 2.x?
 
 The latter. It's not so much the processing method that differs, as they are 
 very similar, but the way the data is stored.
  
  like Julien said, Nutch 1.x is great for batch processing and
 2.x large scale processing. 
 
 Hmm, I don't think I said that. Both are batch orientated and 1.x is probably 
 better at large scale processing than 2.x (at least currently) 
  
 Perhaps with more and more people use NoSql as
 their back-end DB, the developers should focus more on the development of
 Nutch 2.x, ensure its stability and improve its function.
 
 IMHO it's not that the developers should focus on this or that. I see it more 
 as an evolutionary process where things get improved because they are used in 
 the first place or get derelict and abandoned if there is no interest from 
 users.  If as you say  people prefer to have a SQL backend instead of the 
 sequential HDFS data structures then there will be more contributions and as 
 a result 2.x will be improved. 
 
 Julien
  
 
 -- 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 


RE: keep all pages from a domain in one slice

2013-03-05 Thread Markus Jelsma
Hi

You can't do this with -slice but you can merge segments and filter them. This 
would mean you'd have to merge the segments for each domain. But that's far too 
much work. Why do you want to do this? There may be better ways of achieving 
your goal.

 
 
-Original message-
 From:Jason S jason.stu...@gmail.com
 Sent: Tue 05-Mar-2013 22:18
 To: user@nutch.apache.org
 Subject: keep all pages from a domain in one slice
 
 Hello,
 
 I seem to remember seeing a discussion about this in the past but I can't 
 seem to find it in the archives.
 
 When using mergesegs -slice, is it possible to keep all the pages from a 
 domain in the same slice?  I have just been messing around with this 
 functionality (Nutch 1.6), and it seems like the records are simply split 
 after the counter has reached the slice size specified, sometimes splitting 
 the records from a single domain over multiple slices. 
 
 How can I segregate a domain to a single slice?
 
 Thanks in advance,
 
 ~Jason


RE: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Markus Jelsma
The default heap size of 1G is just enough for a parsing fetcher with 10 
threads. The only problem that may arise is too large and complicated PDF files 
or very large HTML files. If you generate fetch lists of a reasonable size 
there won't be a problem most of the time. And if you want to crawl a lot, then 
just generate more small segments.

If there is a bug it's most likely the parser eating memory and not 
releasing it. 
 
-Original message-
 From:Tejas Patil tejas.patil...@gmail.com
 Sent: Sun 03-Mar-2013 22:19
 To: user@nutch.apache.org
 Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new 
 native thread
 
 I agree with Sebastian. It was a crawl in local mode and not over a
 cluster. The intended crawl volume is huge and if we dont override the
 default heap size to some decent value, there is high possibility of facing
 an OOM.
 
 
 On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi 
 chitturikira...@gmail.comwrote:
 
   If you find the time you should trace the process.
   Seems to be either a misconfiguration or even a bug.
  
   I will try to track this down soon with the previous configuration. Right
  now, i am just trying to get data crawled by Monday.
 
  Kiran.
 
 
Luckily, you should be able to retry via bin/nutch parse ...
Then trace the system and the Java process to catch the reason.
   
Sebastian
   
On 03/02/2013 08:13 PM, kiran chitturi wrote:
Sorry, i am looking to crawl 400k documents with the crawl. I said
  400
   in
my last message.
   
   
On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi 
chitturikira...@gmail.comwrote:
   
Hi!
   
I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 2.8GHz.
   
Last night i started a crawl on local mode for 5 seeds with the
  config
given below. If the crawl goes well, it should fetch a total of 400
documents. The crawling is done on a single host that we own.
   
Config
-
   
fetcher.threads.per.queue - 2
fetcher.server.delay - 1
fetcher.throughput.threshold.pages - -1
   
crawl script settings

timeLimitFetch- 30
numThreads - 5
topN - 1
mapred.child.java.opts=-Xmx1000m
   
   
I have noticed today that the crawl has stopped due to an error and
  i
have
found the below error in logs.
   
2013-03-01 21:45:03,767 INFO  parse.ParseSegment - Parsed (0ms):
http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
2013-03-01 21:45:03,790 WARN  mapred.LocalJobRunner -
  job_local_0001
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:658)
at
   
   
  
  java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
at
   
   
  
  java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
at
   
   
  
  java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
at
   
   
  
  java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
at
org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
at
    org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
at
org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
at
org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
at
  org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at
   
   
  org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
(END)
   
   
   
Did anyone run in to the same issue ? I am not sure why the new
  native
thread is not being created. The link here says [0] that it might
  due
   to
the limitation of number of processes in my OS. Will increase them
   solve
the issue ?
   
   
[0] - http://ww2.cs.fsu.edu/~czhang/errors.html
   
Thanks!
   
--
Kiran Chitturi
   
   
   
   
   
   
   
   
  
  
 
 
  --
  Kiran Chitturi
 
 


RE: a lot of threads spinwaiting

2013-03-01 Thread Markus Jelsma
Hi,

Regarding politeness, 3 threads per queue is not really polite :)

Cheers

 
 
-Original message-
 From:jc jvizu...@gmail.com
 Sent: Fri 01-Mar-2013 15:08
 To: user@nutch.apache.org
 Subject: Re: a lot of threads spinwaiting
 
 Hi Roland and lufeng,
 
 Thank you very much for your replies, I already tested lufeng advice, with
 results pretty much as expected.
 
 By the way, my nutch installation is based on 2.1 version with hbase as
 crawldb storage
 
 Roland, maybe the fetcher.server.delay param has something to do with that as
 well. I set it to 3 secs; would setting it to 0 be impolite?
 
 All info you provided has helped me a lot, only one issue remains unfixed
 yet, there are more than 60 URLs from different hosts in my seed file, and
 only 20 queues, things may seem that all other 40 hosts have no more URLs to
 generate, but I really haven't seen any URL coming from those hosts since
 the creation of the crawldb.
 
 Based on my poor experience following params would allow a number of 60
 queues for my vertical crawl, am I missing something?
 
 topN = 1 million
 fetcher.threads.per.queue = 3
 fetcher.threads.per.host = 3 (just in case, I remember you told me to use
 per.queue instead)
 fetcher.threads.fetch = 200
 seed urls of different hosts = 60 or more (regex-urlfilter.txt allows only
 urls from these hosts, they're all there, I checked)
 crawldb record count  1 million
 
 Thanks again for all your help
 
 Regards,
 JC
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/a-lot-of-threads-spinwaiting-tp4043801p4043988.html
 Sent from the Nutch - User mailing list archive at Nabble.com.
 


RE: Nutch Incremental Crawl

2013-02-27 Thread Markus Jelsma
The default or the injected interval? The default interval can be set in the 
config (see nutch-default for example). Per-URL intervals can be set using the 
injector: URL\tnutch.fixedFetchInterval=86400 
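In an actual seeds file that is simply a tab between the URL and the metadata, one
record per line, for example (the whitespace here must be a real tab):

  http://www.example.com/    nutch.fixedFetchInterval=86400

You can append more key=value pairs (e.g. nutch.score) the same way, each separated 
by a tab.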
 
 
-Original message-
 From:David Philip davidphilipshe...@gmail.com
 Sent: Wed 27-Feb-2013 06:21
 To: user@nutch.apache.org
 Subject: Re: Nutch Incremental Crawl
 
 Hi all,
 
   Thank you very much for the replies. Very useful information to
 understand how incremental crawling can be achieved.
 
 Dear Markus:
 Can you please tell me how I override this fetch interval, in case I
 need to fetch the page before the interval has passed?
 
 
 
 Thanks very much
 - David
 
 
 
 
 
 On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  If you want records to be fetched at a fixed interval its easier to inject
  them with a fixed fetch interval.
 
  nutch.fixedFetchInterval=86400
 
 
 
  -Original message-
   From:kemical mickael.lume...@gmail.com
   Sent: Thu 14-Feb-2013 10:15
   To: user@nutch.apache.org
   Subject: Re: Nutch Incremental Crawl
  
   Hi David,
  
   You can also consider setting shorter fetch interval time with nutch
  inject.
   This way you'll set higher score (so the url is always taken in priority
   when you generate a segment) and a fetch.interval of 1 day.
  
   If you have a case similar to me, you'll often want some homepage fetch
  each
   day but not their inlinks. What you can do is inject all your seed urls
   again (assuming those url are only homepages).
  
   #change nutch option so existing urls can be injected again in
   conf/nutch-default.xml or conf/nutch-site.xml
   db.injector.update=true
  
   #Add metadata to update score/fetch interval
   #the following line will concat to each line of your seed urls files with
   the new score / new interval
   perl -pi -e 's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=8'
   [your_seed_url_dir]/*
  
   #run command
   bin/nutch inject crawl/crawldb [your_seed_url_dir]
  
   Now, the following crawl will take your urls in top priority and crawl
  them
   once a day. I've used my situation to illustrate the concept but i guess
  you
   can tweek params to fit your needs.
  
   This way is useful when you want a regular fetch on some urls, if it's
   occured rarely i guess freegen is the right choice.
  
   Best,
   Mike
  
  
  
  
  
   --
   View this message in context:
  http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
   Sent from the Nutch - User mailing list archive at Nabble.com.
  
 
 


RE: Nutch Incremental Crawl

2013-02-27 Thread Markus Jelsma
You can simply reinject the records. You can overwrite and/or update the 
current record. See the db.injector.update and db.injector.overwrite settings. 
 
-Original message-
 From:David Philip davidphilipshe...@gmail.com
 Sent: Wed 27-Feb-2013 11:23
 To: user@nutch.apache.org
 Subject: Re: Nutch Incremental Crawl
 
 Hi Markus, I meant overriding the injected interval.. How to override the
 injected fetch interval?
 While crawling, the fetch interval was set to 30 days (default). Now I want to
 re-fetch the same site (that is, to force a re-fetch) and not wait for the fetch
 interval (30 days).. how can we do that?
 
 
 Feng Lu : Thank you for the reference link.
 
 Thanks - David
 
 
 
 On Wed, Feb 27, 2013 at 3:22 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  The default or the injected interval? The default interval can be set  in
  the config (see nutch-default for example). Per URL's can be set using the
  injector: URL\tnutch.fixedFetchInterval=86400
 
 
  -Original message-
   From:David Philip davidphilipshe...@gmail.com
   Sent: Wed 27-Feb-2013 06:21
   To: user@nutch.apache.org
   Subject: Re: Nutch Incremental Crawl
  
   Hi all,
  
 Thank you very much for the replies. Very useful information to
   understand how incremental crawling can be achieved.
  
   Dear Markus:
   Can you please tell me how do I over ride this fetch interval , incase
  if I
   require to fetch the page before the time interval is passed?
  
  
  
   Thanks very much
   - David
  
  
  
  
  
   On Thu, Feb 14, 2013 at 2:57 PM, Markus Jelsma
   markus.jel...@openindex.iowrote:
  
If you want records to be fetched at a fixed interval its easier to
  inject
them with a fixed fetch interval.
   
nutch.fixedFetchInterval=86400
   
   
   
-Original message-
 From:kemical mickael.lume...@gmail.com
 Sent: Thu 14-Feb-2013 10:15
 To: user@nutch.apache.org
 Subject: Re: Nutch Incremental Crawl

 Hi David,

 You can also consider setting shorter fetch interval time with nutch
inject.
 This way you'll set higher score (so the url is always taken in
  priority
 when you generate a segment) and a fetch.interval of 1 day.

 If you have a case similar to me, you'll often want some homepage
  fetch
each
 day but not their inlinks. What you can do is inject all your seed
  urls
 again (assuming those url are only homepages).

 #change nutch option so existing urls can be injected again in
 conf/nutch-default.xml or conf/nutch-site.xml
 db.injector.update=true

 #Add metadata to update score/fetch interval
 #the following line will concat to each line of your seed urls files
  with
 the new score / new interval
 perl -pi -e
  's/^(.*)\n$/\1\tnutch.score=100\tnutch.fetchInterval=8'
 [your_seed_url_dir]/*

 #run command
 bin/nutch inject crawl/crawldb [your_seed_url_dir]

 Now, the following crawl will take your urls in top priority and
  crawl
them
 once a day. I've used my situation to illustrate the concept but i
  guess
you
 can tweek params to fit your needs.

 This way is useful when you want a regular fetch on some urls, if
  it's
 occured rarely i guess freegen is the right choice.

 Best,
 Mike





 --
 View this message in context:
   
  http://lucene.472066.n3.nabble.com/Nutch-Incremental-Crawl-tp4037903p4040400.html
 Sent from the Nutch - User mailing list archive at Nabble.com.

   
  
 
 


RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Markus Jelsma
Yes, it will support that until you run out of memory. But having a million 
expressions is not going to work nicely. If you have a lot of expressions but 
can divide them into domains I would patch the filter so it will only execute 
the expressions that are for a specific domain. 
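A sketch of that idea (plain Java, not an actual patch against the regex filter;
class and method names are made up): key the compiled patterns by host and only
run the matching group plus a small set of global rules per URL:

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.regex.Pattern;

  /** Groups regex URL-filter rules by host so only a small subset runs per URL. */
  public class GroupedRegexRules {

    private final Map<String, List<Pattern>> rulesByHost = new HashMap<String, List<Pattern>>();
    private final List<Pattern> globalRules = new ArrayList<Pattern>();

    /** Pass host == null to register a rule that applies to every URL. */
    public void add(String host, String regex) {
      List<Pattern> list = (host == null) ? globalRules : rulesByHost.get(host);
      if (list == null) {
        list = new ArrayList<Pattern>();
        rulesByHost.put(host, list);
      }
      list.add(Pattern.compile(regex));
    }

    /** Returns the rules that apply to this URL: its host's group plus the global ones. */
    public List<Pattern> rulesFor(String url) {
      List<Pattern> result = new ArrayList<Pattern>(globalRules);
      try {
        List<Pattern> hostRules = rulesByHost.get(new URL(url).getHost());
        if (hostRules != null) {
          result.addAll(hostRules);
        }
      } catch (MalformedURLException e) {
        // Unparsable URL: only the global rules apply.
      }
      return result;
    }
  }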
 
-Original message-
 From:Danilo Fernandes dan...@kelsorfernandes.com.br
 Sent: Tue 26-Feb-2013 11:31
 To: user@nutch.apache.org
 Subject: RE: regex-urlfilter file for multiple domains
 
 Tejas, do you have any idea about how many rules can I use in the file?
 
 Probably I will work with 1M regex for differentes URLs.
 
 Nutch will support that?


RE: regex-urlfilter file for multiple domains

2013-02-26 Thread Markus Jelsma
No, there is no feature for that. You would have to patch it up yourself. It 
shouldn't be very hard. 
 
-Original message-
 From:Danilo Fernandes dan...@kelsorfernandes.com.br
 Sent: Tue 26-Feb-2013 11:37
 To: user@nutch.apache.org
 Subject: RE: regex-urlfilter file for multiple domains
 
   
 
 Yes, my first option is different files for different domains.
 The point is how can I link the files with each domain? Do I need to make
 some changes in the Nutch code, or does the project have a feature to do
 that?
 
 On Tue, 26 Feb 2013 10:33:37 +, Markus Jelsma wrote: 
 
  Yes, it will support that until you run out of memory. But having a million
  expressions is not going to work nicely. If you have a lot of expressions but
  can divide them into domains I would patch the filter so it will only execute
  the expressions that are for a specific domain.
 
 -Original message-
  From: Danilo Fernandes
  Sent: Tue 26-Feb-2013 11:31
  To: user@nutch.apache.org
  Subject: RE: regex-urlfilter file for multiple domains
 
  Tejas, do you have any idea about how many rules can I use in the file?
 
  Probably I will work with 1M regex for different URLs.
 
  Nutch will support that?
 
 


RE: Nutch status info on each domain individually

2013-02-25 Thread Markus Jelsma
Well, you can always use the DomainStatistics utility to get the raw numbers on 
hosts, domains and TLDs, but this won't tell you whether a domain has been 
fully crawled because the crawl frontier can always change.

You can be sure that everything (disregarding URL filters) has been crawled if 
no more records are selected before fetched records become eligible again for 
refetch (default interval).

NUTCH-1325 does a better job of providing stats for hosts than the current 
DomainStatistics, but it's uncommitted. It'll work though.

https://issues.apache.org/jira/browse/NUTCH-1325
 
-Original message-
 From:Tejas Patil tejas.patil...@gmail.com
 Sent: Mon 25-Feb-2013 20:46
 To: user@nutch.apache.org
 Subject: Re: Nutch status info on each domain individually
 
  I can't think of any existing Nutch utility which can be used here. Maybe dumping
  the crawldb and then grepping over it would sound reasonable if the number
  of hosts is large and the crawldb is small. This will be a bad idea if it
  has to be done after every Nutch cycle on a large crawldb.
  
  If you are ready to write some small code, then it can become easy:
  1. Write some code to query the index so that you don't have to do that
  manually. OR
  2. Write a map-reduce job to read the crawldb wherein the mapper emits the
  host of each URL.
  
  #1 is the better deal in terms of execution time.
 
 Thanks,
 Tejas Patil
 
 
 On Mon, Feb 25, 2013 at 11:28 AM, imehesz imeh...@gmail.com wrote:
 
  hello,
 
  I can finally run Nutch (+Solr) with JAVA, my only question left is, how
  can
  I make sure if a particular domain has been crawled?
 
  Let's say I have 300 sites to crawl and index.
  So far my work-around was to execute a simple Solr query for each domain
  URL, and see if the indexing timestamp in the Solr DB is greater then the
  Nutch crawling start date-time. It works, but I'm curious if there is a
  better way to do this.
 
  thanks,
  --iM
 
 
 
  --
  View this message in context:
  http://lucene.472066.n3.nabble.com/Nutch-status-info-on-each-domain-individually-tp4042815.html
  Sent from the Nutch - User mailing list archive at Nabble.com.
 
 


RE: Differences between 2.1 and 1.6

2013-02-25 Thread Markus Jelsma
Something seems to be missing here. It's clear that 1.x has more features and 
is a lot more stable than 2.x. Nutch 2.x can theoretically perform a lot better 
if you are going to crawl on a very large scale, but I still haven't seen any 
numbers to support this assumption. Nutch 1.x can easily deal with many 
millions of records and deal with billions if you throw some hardware at it. 

Most users are not going to crawl millions of records. In that case I 
personally choose 1.x. I prefer the stability and predictability over some 
performance you are not likely going to need anyway. 

Besides our large 1.x research cluster we still use 1.x in production for all 
our customers, running locally on a 2-core 512MB RAM VPS with a crawldb of over 
5 million records, and it runs fine, fast, and keeps up with newly discovered 
URLs. The only significant improvements were a better scoring filter and 
integrating indexing in the fetcher.
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Mon 25-Feb-2013 23:37
 To: user@nutch.apache.org
 Subject: Re: Differences between 2.1 and 1.6
 
 Hi Danilo,
 
 You can check out the architecture changes here
 http://wiki.apache.org/nutch/#Nutch_2.x
 
 Nutch trunk (1.7-SNAPSHOT) is here
 http://svn.apache.org/repos/asf/nutch/trunk/
 
 2.x is here
 http://svn.apache.org/repos/asf/nutch/branches/2.x/
 
 On Mon, Feb 25, 2013 at 1:56 PM, Danilo Fernandes 
 dan...@kelsorfernandes.com.br wrote:
 
  Hi everyone,
 
  Somebody can tell me about differences between 2.1 and 1.6?
 
  The SVN trunk is 1.* or 2.*?
 
  Thanks,
  Danilo Fernandes
 
 
 
 
 -- 
 *Lewis*
 


RE: Crawl script numberOfRounds

2013-02-19 Thread Markus Jelsma
Yes. 
 
-Original message-
 From:Amit Sela am...@infolinks.com
 Sent: Tue 19-Feb-2013 13:40
 To: user@nutch.apache.org
 Subject: Crawl script "numberOfRounds"
 
 Is the crawl script's numberOfRounds argument the equivalent of the depth
 argument in the crawl command?
 
 Thanks.
 


RE: fields in solrindex-mapping.xml

2013-02-16 Thread Markus Jelsma
Those are added by IndexerMapReduce (or 2.x equivalent) and index-basic. They 
contain the crawl datum's signature, the time stamp (see index-basic) and crawl 
datum score. If you think you don't need them, you can safely omit them. 
 
-Original message-
 From:alx...@aim.com alx...@aim.com
 Sent: Sat 16-Feb-2013 19:21
 To: user@nutch.apache.org
 Subject: Re: fields in solrindex-mapping.xml
 
 Hi Lewis,
 
 Why do we need to include digest, tstamp, boost and batchid fields in 
 solrindex?
 
 Thanks.
 Alex.
 
  
 
  
 
  
 
 -Original Message-
 From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 To: user user@nutch.apache.org
 Sent: Fri, Feb 15, 2013 4:21 pm
 Subject: Re: fields in solrindex-mapping.xml
 
 
 Hi Alex,
 OK so we can certainly remove segment from 2.x solr-index-mapping.xml. It
 would however be nice to replace this with the appropriate batchId.
 Can someone advise where the 'segment' field currently comes from in trunk?
 That way we can at least map the field to the batchId equivalent in 2.x
 
 Thank you
 Lewis
 
 On Fri, Feb 15, 2013 at 2:23 PM, alx...@aim.com wrote:
 
  Hi Lewis,
 
  If I exclude one of the fileds tstamp, digest, and boost from
  solindex-mapping and schema.xml, solrindex gives error
 
  SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=com.yahoo:http/]
  unknown field 'tstamp'
 
  for each of above fields, except segment.
 
  Alex.
 
 
 
 
 
 
 
  -Original Message-
  From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  To: user user@nutch.apache.org
  Sent: Thu, Feb 14, 2013 8:34 pm
  Subject: Re: fields in solrindex-mapping.xml
 
 
  Hi Alex,
 Tstamp represents fetch time, used for deduplication.
  Boost is for scoring-opic and link. This is required in 2.x as well.
  I don't have the code right now, but you can try removing digest and
  segment. To me they both look legacy.
  There is a wiki page on index structure which you can consult and/or add to
  should you wish.
  Thank you
  Lewis
 
  On Thursday, February 14, 2013,  alx...@aim.com wrote:
   Hello,
  
   I see that there are
  
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
  
   fields in addition to title, host and content ones in nutch-2.x'
  solr-mapping.xml. I thought tstamp may be needed for sorting documents.
  What about the other fields,
   segment, boost and digest? Can someone explain, why these fields are
  included in solr-mapping.xml?
  
  
   Thanks.
   Alex.
  
  
  
 
  --
  *Lewis*
 
 
 
 
 
 -- 
 *Lewis*
 
  
 


RE: Nutch identifier while indexing.

2013-02-13 Thread Markus Jelsma
You can use the subcollection indexing filter to set a value for URLs that 
match a string. With it you can distinguish them even if they are on the same host 
and domain.
 
-Original message-
 From:mbehlok m_beh...@hotmail.com
 Sent: Wed 13-Feb-2013 21:20
 To: user@nutch.apache.org
 Subject: Re: Nutch identifier while indexing.
 
 wish it was that simple:
 
 SitaA = www.myDomain.com/index.aspx?site=1
 
 SitaB = www.myDomain.com/index.aspx?site=2
 
 SitaC = www.myDomain.com/index.aspx?site=3
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285p4040323.html
 Sent from the Nutch - User mailing list archive at Nabble.com.
 


RE: DiskChecker$DiskErrorException

2013-02-11 Thread Markus Jelsma
Hi - also, is there enough space in your /tmp directory?

Cheers

 
 
-Original message-
 From:Alexei Korolev alexei.koro...@gmail.com
 Sent: Mon 11-Feb-2013 09:27
 To: user@nutch.apache.org
 Subject: DiskChecker$DiskErrorException
 
 Hello,
 
 Already twice I got this error:
 
 2013-02-08 15:26:11,674 WARN  mapred.LocalJobRunner - job_local_0001
  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
  taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out
  in any of the configured local directories
 at
 org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
 at
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
 at
 org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1443)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 2013-02-08 15:26:12,515 ERROR fetcher.Fetcher - Fetcher:
 java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
 at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1204)
 at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1240)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1213)
 
 I've checked in google, but no luck. I run nutch 1.4 locally and have a
 plenty of free space on disk.
 I would much appreciate for some help.
 
 Thanks.
 
 
 -- 
 Alexei A. Korolev
 


RE: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-09 Thread Markus Jelsma
A parsing fetcher does everything in the mapper. Please check the output() 
method around line 1012 onwards:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup

Parsing, signature, outlink processing (using code in ParseOutputFormat) all 
happens there.

Cheers,
Markus
 
 
-Original message-
 From:Weilei Zhang zhan...@gmail.com
 Sent: Sat 09-Feb-2013 23:40
 To: user@nutch.apache.org
 Subject: Re: performance question: fetcher and parser in separate map/reduce 
 jobs?
 
 This is indeed helpful. Thanks Lewis.
 
 On Wed, Feb 6, 2013 at 6:50 PM, Lewis John Mcgibbney
 lewis.mcgibb...@gmail.com wrote:
  I've eventually added this to our FAQ's
 
  http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
 
  This should explain for you.
  Lewis
 
  On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang zhan...@gmail.com wrote:
 
  Hi
  I have a performance question:
  why fetcher and parser is staged in two separate jobs instead of one?
  Intuitively, parser can be included as a part of fetcher reducer,  is
  it? This seems to be more efficient.
  Thanks
  --
  Best Regards
  -Weilei
 
 
 
 
  --
  *Lewis*
 
 
 
 -- 
 Best Regards
 -Weilei
 


RE: performance question: fetcher and parser in separate map/reduce jobs?

2013-02-09 Thread Markus Jelsma
Oh, i'd like to add that the biggest problem is memory and the possibility for 
a parser to hang, consume resources, time out everything else and destroy 
the segment.
 
 
-Original message-
 From:Weilei Zhang zhan...@gmail.com
 Sent: Sat 09-Feb-2013 23:40
 To: user@nutch.apache.org
 Subject: Re: performance question: fetcher and parser in separate map/reduce 
 jobs?
 
 This is indeed helpful. Thanks Lewis.
 
 On Wed, Feb 6, 2013 at 6:50 PM, Lewis John Mcgibbney
 lewis.mcgibb...@gmail.com wrote:
  I've eventually added this to our FAQ's
 
  http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
 
  This should explain for you.
  Lewis
 
  On Wed, Feb 6, 2013 at 6:31 PM, Weilei Zhang zhan...@gmail.com wrote:
 
  Hi
  I have a performance question:
  why fetcher and parser is staged in two separate jobs instead of one?
  Intuitively, parser can be included as a part of fetcher reducer,  is
  it? This seems to be more efficient.
  Thanks
  --
  Best Regards
  -Weilei
 
 
 
 
  --
  *Lewis*
 
 
 
 -- 
 Best Regards
 -Weilei
 


RE: Best Practice to optimize Parse reduce step / ParseoutputFormat

2013-02-08 Thread Markus Jelsma


 
 
-Original message-
 From:kemical mickael.lume...@gmail.com
 Sent: Fri 08-Feb-2013 10:53
 To: user@nutch.apache.org
 Subject: Best Practice to optimize Parse reduce step / ParseoutputFormat
 
 Hi,
 
  I've been looking for some time now into the reasons why the parse reduce step takes
  a lot of time, and I've found lots of different suggestions but not much feedback
  on which of them actually work.
 
 
  First, here is a list of the threads I've found, plus patch NUTCH-1314:
 
 http://lucene.472066.n3.nabble.com/Parse-reduce-slow-as-a-snail-td3296865.html
 http://lucene.472066.n3.nabble.com/ParseSegment-taking-a-long-time-to-finish-td3758053.html
 http://lucene.472066.n3.nabble.com/ParseSegment-slow-reduce-phase-td612119.html
 https://issues.apache.org/jira/browse/NUTCH-1314
 
 Here are some questions about what i've found on them:
 
  - It seems that parse reduce time is mainly due to long urls
  = Can anyone who has excluded long urls (with a patch, a regex or whatever)
  confirm that they now get better performance?

Most certainly!

 
  - The normalizing step occurs before filtering:
  = If so, is there still a point in filtering urls with a regex (like the
  -^.{350,}$ expression)?

The sooner you can reject long URL's, the better.
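
For example, with the stock regex-urlfilter.txt a rule like this near the top 
(rules are evaluated top to bottom and the first match wins) drops over-long 
URLs before they are ever generated for fetching:

  # reject very long URLs
  -^.{350,}$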

 
  - The patch NUTCH-1314 seems to apply when you parse with parse-html
  = I'm using boilerpipe with patch NUTCH-961; will NUTCH-1314 work
  with it? (I guess not.) And what change should I make? (I'm quite afraid to write
  a patch/plugin myself.)

It will help a little but i don't think you'll win much vs. filtering by regex 
filter.

 
  This is not an exhaustive list of questions, so if you have questions and/or
  recommendations, please add them.
 
 
 
 Sorry to start a new thread since it could have been added as an answer to
 my last one:
 http://lucene.472066.n3.nabble.com/Very-long-time-just-before-fetching-and-just-after-parsing-td4037673.html
 but i think the title of this one could be useful for more people (mine was
 too specific)
 
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Best-Practice-to-optimize-Parse-reduce-step-ParseoutputFormat-tp4039200.html
 Sent from the Nutch - User mailing list archive at Nabble.com.
 


RE: Could not find any valid local directory for output/file.out

2013-02-08 Thread Markus Jelsma
The /tmp directory is not cleaned up IIRC. You're safe to empty it as long as 
you don't have a job running ;)
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Fri 08-Feb-2013 20:48
 To: user@nutch.apache.org
 Subject: Re: Could not find any valid local directory for output/file.out
 
 +1
 This is a ridiculous size of tmp for a crawldb of minimal size.
 There is clearly something wrong
 
 On Friday, February 8, 2013, Tejas Patil tejas.patil...@gmail.com wrote:
  I dont think there is any such property. Maybe its time for you to cleanup
  /tmp :)
 
  Thanks,
  Tejas Patil
 
 
  On Fri, Feb 8, 2013 at 11:16 AM, Eyeris Rodriguez Rueda eru...@uci.cu
 wrote:
 
   Hi Lewis and Tejas again.
   I have pointed the hadoop.tmp.dir property but Nutch is still consuming too
   much space for me.
   Is it possible to reduce the space Nutch uses in my tmp folder with some
   property of the fetcher process? I always get an exception because the hard
   disk is full. My crawldb only has 150 MB, not more, but my tmp folder
   keeps growing without control until 60 GB, and it fails at that point.
   Please, any help?
 
 
 
 
  - Mensaje original -
  De: Eyeris Rodriguez Rueda eru...@uci.cu
  Para: user@nutch.apache.org
  Enviados: Viernes, 8 de Febrero 2013 10:45:52
  Asunto: Re: Could not find any valid local directory for output/file.out
 
  Thanks a lot. lewis and tejas, you are very helpfull for me.
  It function ok, I have pointed to another partition and ok.
  Problem solved.
 
 
 
 
 
  - Mensaje original -
  De: Tejas Patil tejas.patil...@gmail.com
  Para: user@nutch.apache.org
  Enviados: Jueves, 7 de Febrero 2013 16:32:33
  Asunto: Re: Could not find any valid local directory for output/file.out
 
  On Thu, Feb 7, 2013 at 12:47 PM, Eyeris Rodriguez Rueda eru...@uci.cu
  wrote:
 
   Thank to all for your replies.
   If i want to change the default location for hadoop job(/tmp), where i
  can
   do that ?, because my nutch-site.xml not include nothing pointing to
  /tmp.
  
  Add this property to nutch-site.xml with appropriate value:
 
  <property>
    <name>hadoop.tmp.dir</name>
    <value>XX</value>
  </property>
 
 
 
    So I have read about Nutch and Hadoop but I'm not sure I understand it
    all. Is it possible to use Nutch 1.5.1 in distributed mode?
 
  yes
 
 
    In this case, what do I need to do for that? I would really appreciate your
    answer because I can't find good documentation on this topic.
  
  For distributed mode, Nutch is called from runtime/deploy. The conf files
  should be modified in runtime/local/conf, not in $NUTCH_HOME/conf.
  So modify the runtime/local/conf/nutch-site.xml to set
  http.agent.nameproperly.  I am assuming that the hadoop setup is in
  place and hadoop
  variables are exported. Now, run the nutch commands from runtime/deploy.
 
  Thanks,
  Tejas Patil
 
  
  
  
   - Mensaje original -
   De: Tejas Patil tejas.patil...@gmail.com
   Para: user@nutch.apache.org
   Enviados: Jueves, 7 de Febrero 2013 14:04:26
   Asunto: Re: Could not find any valid local directory for
 output/file.out
  
   Nutch jobs are executed by Hadoop. /tmp is the default location used
 by
   hadoop to store temporary data required for a job. If you dont
 over-ride
   hadoop.tmp.dir in any config file, it will use /tmp by default. In your
   case, /tmp doesnt have ample space left so better over-ride that
 property
   and point it to some other location which has ample space.
  
   Thanks,
   Tejas Patil
  
  
   On Thu, Feb 7, 2013 at 10:38 AM, Eyeris Rodriguez Rueda eru...@uci.cu
   wrote:
  
Thanks lewis by your answer.
My doubt is why /tmp is increasing while crawl process is doing, and
  why
nutch use that folder. Im using nutch 1.5.1 in single mode and my
 nutch
site not have properties hadoop.tmp.dir. I need reduce the space used
  for
that folder because I only have 40 GB for nutch machine and 50 GB for
   solr
machine. Please some advice or expla
 
 -- 
 *Lewis*
 


RE: Could not find any valid local directory for output/file.out

2013-02-08 Thread Markus Jelsma
Hadoop stores temporary files there such as shuffled map output data, you need 
it! But you can rm -rf it after a complete crawl cycle. Do not clear it while a 
job is running, it's going to miss its temp files.
 
-Original message-
 From:Eyeris Rodriguez Rueda eru...@uci.cu
 Sent: Fri 08-Feb-2013 20:53
 To: user@nutch.apache.org
 Subject: Re: Could not find any valid local directory for output/file.out
 
  I'm using Ubuntu Server 12.04 only for Nutch; I have assigned 40 GB for this.
  Is /tmp needed for the Nutch crawl process? Or can I set up a crontab to delete
  the /tmp content without causing problems for the Nutch crawl?
 
 
 
 
 - Mensaje original -
 De: Tejas Patil tejas.patil...@gmail.com
 Para: user@nutch.apache.org
 Enviados: Viernes, 8 de Febrero 2013 14:33:25
 Asunto: Re: Could not find any valid local directory for output/file.out
 
 I dont think there is any such property. Maybe its time for you to cleanup
 /tmp :)
 
 Thanks,
 Tejas Patil
 
 
 On Fri, Feb 8, 2013 at 11:16 AM, Eyeris Rodriguez Rueda eru...@uci.cuwrote:
 
   Hi Lewis and Tejas again.
   I have pointed the hadoop.tmp.dir property but Nutch is still consuming too much
   space for me.
   Is it possible to reduce the space Nutch uses in my tmp folder with some
   property of the fetcher process? I always get an exception because the hard
   disk is full. My crawldb only has 150 MB, not more, but my tmp folder
   keeps growing without control until 60 GB, and it fails at that point.
   Please, any help?
 
 
 
 
  - Mensaje original -
  De: Eyeris Rodriguez Rueda eru...@uci.cu
  Para: user@nutch.apache.org
  Enviados: Viernes, 8 de Febrero 2013 10:45:52
  Asunto: Re: Could not find any valid local directory for output/file.out
 
  Thanks a lot. lewis and tejas, you are very helpfull for me.
  It function ok, I have pointed to another partition and ok.
  Problem solved.
 
 
 
 
 
  - Mensaje original -
  De: Tejas Patil tejas.patil...@gmail.com
  Para: user@nutch.apache.org
  Enviados: Jueves, 7 de Febrero 2013 16:32:33
  Asunto: Re: Could not find any valid local directory for output/file.out
 
  On Thu, Feb 7, 2013 at 12:47 PM, Eyeris Rodriguez Rueda eru...@uci.cu
  wrote:
 
   Thank to all for your replies.
   If i want to change the default location for hadoop job(/tmp), where i
  can
   do that ?, because my nutch-site.xml not include nothing pointing to
  /tmp.
  
  Add this property to nutch-site.xml with appropriate value:
 
  <property>
    <name>hadoop.tmp.dir</name>
    <value>XX</value>
  </property>
 
 
 
    So I have read about Nutch and Hadoop but I'm not sure I understand it
    all. Is it possible to use Nutch 1.5.1 in distributed mode?
 
  yes
 
 
    In this case, what do I need to do for that? I would really appreciate your answer
    because I can't find good documentation on this topic.
  
  For distributed mode, Nutch is called from runtime/deploy. The conf files
  should be modified in runtime/local/conf, not in $NUTCH_HOME/conf.
  So modify the runtime/local/conf/nutch-site.xml to set
  http.agent.nameproperly.  I am assuming that the hadoop setup is in
  place and hadoop
  variables are exported. Now, run the nutch commands from runtime/deploy.
 
  Thanks,
  Tejas Patil
 
  
  
  
   - Mensaje original -
   De: Tejas Patil tejas.patil...@gmail.com
   Para: user@nutch.apache.org
   Enviados: Jueves, 7 de Febrero 2013 14:04:26
   Asunto: Re: Could not find any valid local directory for output/file.out
  
   Nutch jobs are executed by Hadoop. /tmp is the default location used by
   hadoop to store temporary data required for a job. If you dont over-ride
   hadoop.tmp.dir in any config file, it will use /tmp by default. In your
   case, /tmp doesnt have ample space left so better over-ride that property
   and point it to some other location which has ample space.
  
   Thanks,
   Tejas Patil
  
  
   On Thu, Feb 7, 2013 at 10:38 AM, Eyeris Rodriguez Rueda eru...@uci.cu
   wrote:
  
Thanks lewis by your answer.
My doubt is why /tmp is increasing while crawl process is doing, and
  why
nutch use that folder. Im using nutch 1.5.1 in single mode and my nutch
site not have properties hadoop.tmp.dir. I need reduce the space used
  for
that folder because I only have 40 GB for nutch machine and 50 GB for
   solr
machine. Please some advice or explanation will be accepted.
Thanks for your time.
   
   
   
- Mensaje original -
De: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Para: user@nutch.apache.org
Enviados: Jueves, 7 de Febrero 2013 13:06:11
Asunto: Re: Could not find any valid local directory for
  output/file.out
   
Hi,
   
   
   
  
  https://wiki.apache.org/nutch/NutchGotchas#DiskErrorException_while_fetching
   
On Thursday, February 7, 2013, Eyeris Rodriguez Rueda eru...@uci.cu
wrote:
 Hi all.
 I have a problem when i do a crawl for few hour or days, im using
  nutch
1.5.1 and solr 3.6, but the crawl process fails and i dont know how to
   fix
   

RE: increase the number of fetches at agiven time on nutch 1.6 or 2.1

2013-01-28 Thread Markus Jelsma
Try setting -numFetchers N on the generator. 
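
For example (paths are illustrative):

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -numFetchers 4

Note that this only has an effect on a real cluster; in local mode the 
generator always produces exactly one partition.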
 
-Original message-
 From:Sourajit Basak sourajit.ba...@gmail.com
 Sent: Mon 28-Jan-2013 11:57
 To: user@nutch.apache.org
 Subject: Re: increase the number of fetches at agiven time on nutch 1.6 or 2.1
 
 A higher number of per host threads, etc might not be useful if the
 bandwidth doesn't scale out. I have a different observation though.
 
 We run nutch on a hadoop cluster. Even as we added new machines to the
 cluster, the fetch phase only creates two tasks. (the original number of
 nodes when we started) Why is it so ? I have checked that the tasks do get
 spawned in the newly added nodes.
 We have this setting in hadoop mapred-site.xml
   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>20</value>
   </property>
 
 We have planned to double the number of websites and see if it still
  doesn't spawn tasks on each node. I will keep this forum updated with our
 results. In the meantime, can anyone point out if we have missed any
 particular configuration ?
 
 Thanks,
 Sourajit
 
 
 
 On Mon, Jan 28, 2013 at 10:35 AM, Tejas Patil tejas.patil...@gmail.comwrote:
 
  Hey Peter,
 
  I am guessing that you have just increased the global thread count. Have
  you even increased fetcher.threads.per.host ? This will improve the crawl
  rate as multiple threads can attack the same site. Dont make it too high or
  else the system will get overloaded. The nutch wiki has an article [0]
  about the potential reasons for slow crawls and some good suggestions.
 
  [0] : https://wiki.apache.org/nutch/OptimizingCrawls
 
  Thanks,
  Tejas Patil
 
 
  On Sun, Jan 27, 2013 at 8:08 PM, peterbarretto peterbarrett...@gmail.com
  wrote:
 
   I tried increasing the numbers of threads to 50 but the speed is not
   affected
  
  
   I tried changing the partition.url.mode value to byDomain and
   fetcher.queue.mode to byDomain but still it does not help the speed.
   It seems to get urls from 2 domains now and the other domains are not
   getting crawled. Is this due to the url score? if so how do i crawl urls
   from all the domains?
  
  
   lewis john mcgibbney wrote
Increase number of threads when fetching
Also please see nutch-deault.xml for paritioning of urls, if you know
   your
target domains you may wish to adapt the policy.
Lewis
   
On Sunday, January 27, 2013, peterbarretto lt;
  
peterbarretto08@
  
gt;
wrote:
I want to increase the number of urls fetched at a time in nutch. I
  have
around 10 websites to crawl. so how can i crawl all the sites at a
  time
   ?
right now i am fetching 1 site with a fetch delay of 2 second but it
  is
too
slow. How to concurrently fetch from different domain?
   
   
   
--
View this message in context:
   
  
  http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499.html
Sent from the Nutch - User mailing list archive at Nabble.com.
   
   
--
*Lewis*
  
  
  
  
  
   --
   View this message in context:
  
  http://lucene.472066.n3.nabble.com/increase-the-number-of-fetches-at-agiven-time-on-nutch-1-6-or-2-1-tp4036499p4036630.html
   Sent from the Nutch - User mailing list archive at Nabble.com.
  
 
 


RE: Solr dinamic fields

2013-01-28 Thread Markus Jelsma
Hi

 
 
-Original message-
 From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
 Sent: Mon 28-Jan-2013 17:01
 To: user@nutch.apache.org
 Subject: Solr dinamic fields
 
 Hi:
 
  I'm currently working on a platform for crawling a large number of PDF files.
  Using nutch (and tika) I'm able to extract and store the textual content of
  the files in solr, but now we want to be able to extract the content of
  the PDFs by page; this means we want to store several solr fields (one
  per page in the document). Is there any recommended way of accomplishing
  this in nutch/solr? With a parse plugin I could store the text from each
  page in the document's metadata; would anything else be needed?

Yes, make a custom indexing filter that reads your parsed metadata and adds 
page specific fields to NutchDocument. That should work fine.
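
A minimal sketch of such a filter against the 1.x API (class name and the 
"page_" key prefix are made up for illustration; the usual plugin.xml wiring is 
omitted):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  public class PageFieldIndexingFilter implements IndexingFilter {
    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      // copy every "page_N" entry the parse plugin stored in the parse
      // metadata into its own field on the NutchDocument
      for (String name : parse.getData().getParseMeta().names()) {
        if (name.startsWith("page_")) {
          doc.add(name, parse.getData().getParseMeta().get(name));
        }
      }
      return doc;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

On the Solr side a dynamic field pattern (e.g. page_*) can then catch those 
without declaring every page field explicitly.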

 
 slds
 --
 It is only in the mysterious equation of love that any 
 logical reasons can be found.
 Good programmers often confuse halloween (31 OCT) with 
 christmas (25 DEC)
 
 


RE: conditional indexing

2013-01-23 Thread Markus Jelsma
Hi - i've not yet committed a fix for:
https://issues.apache.org/jira/browse/NUTCH-1449

This will allow you to stop documents from being indexed from within your 
indexing filter. Order can be configured using the indexing.filter.order 
configuration directive (or something similarly named).
 
-Original message-
 From:Sourajit Basak sourajit.ba...@gmail.com
 Sent: Wed 23-Jan-2013 09:24
 To: user@nutch.apache.org
 Subject: conditional indexing
 
 We have an implementation of Indexing filter that runs side-by-side the
 indexer-basic plugin. How is the order determined ?
 Also, how do I do conditional indexing i.e. stop certain urls from being
 indexed ? I think I can apply a filter but that approach will not work
 since we index based on the page contents.
 


RE: Nutch support with regards to Deduplication and Document versioning

2013-01-23 Thread Markus Jelsma
If you use 1.x and don't merge segments you still have older versions of 
documents. There is no active versioning in Nutch 1x except segment naming and 
merging, if you use it.
 
-Original message-
 From:Tejas Patil tejas.patil...@gmail.com
 Sent: Wed 23-Jan-2013 09:25
 To: user@nutch.apache.org
 Subject: Re: Nutch support with regards to Deduplication and Document 
 versioning
 
 Hi Anand,
 Nutch will keep the latest content of a given url (based on the time when
 it was fetched). It wont store the old versions.
 
 Thanks,
 Tejas
 
 
 On Wed, Jan 23, 2013 at 12:12 AM, Anand Bhagwat abbhagwa...@gmail.comwrote:
 
  Hi,
  I want to know what kind of support does Nutch provides with regards to
  de-duplication and document versioning?
 
  Thanks,
  Anand.
 
 


RE: solrindex deleteGone vs solrclean

2013-01-23 Thread Markus Jelsma
Hi,

-deleteGone relies on segment information to delete records, which is faster 
and indeed somewhat on-the-fly. The solrclean command relies on CrawlDB information 
and will always work, even if you lost your segments or just periodically delete 
old segments.
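
For reference, if I remember the argument order correctly it's (paths illustrative):

  bin/nutch solrclean crawl/crawldb http://localhost:8983/solr

i.e. crawldb first, then the Solr URL.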

Cheers 
 
-Original message-
 From:Jason S jason.stu...@gmail.com
 Sent: Thu 24-Jan-2013 03:01
 To: user@nutch.apache.org
 Subject: solrindex deleteGone vs solrclean
 
 Hello,
 
 I'm curious about the difference between using -deleteGone with solrindex
 and the solrclean command.  From what I understand, they basically do the
 same thing except -deleteGone is more on the fly.  Is this correct?
 
 Is there any scenario where one would be more appropriate than the other?
 
 Thanks in advance!
 
 ~Jason
 


RE: Synthetic Tokens

2013-01-21 Thread Markus Jelsma
Hi,

In Nutch a `synthetic token` maps to a field/value pair.  You need an indexing 
filter to read the key/value pair from the parsed metadata and add it as a 
field/value pair to the NutchDocument. You may also need a custom parser filter 
to extract the data from somewhere and store it to the parsed metadata as 
key/value, which you then further process in your indexing filter.

Check out the index-basic and index-more plugins for examples.
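
The parse-filter side of that boils down to something like this 1.x 
HtmlParseFilter sketch (names and the metadata key are invented, plugin wiring 
omitted):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  public class SyntheticTokenParseFilter implements HtmlParseFilter {
    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      Parse parse = parseResult.get(content.getUrl());
      // derive whatever synthetic value you need (from the DOM, the text, ...)
      // and stash it in the parse metadata under a well-known key
      parse.getData().getParseMeta().set("synthetic.token", "some-derived-value");
      return parseResult;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

Your indexing filter then reads that key from parse.getData().getParseMeta() 
and adds it as a field/value pair to the NutchDocument.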

Cheers, 
 
-Original message-
 From:Jakub Moskal jakub.mos...@gmail.com
 Sent: Mon 21-Jan-2013 04:58
 To: user@nutch.apache.org
 Subject: Synthetic Tokens
 
 Hi,
 
 I would like to develop a plugin that creates synthetic tokens for
 some documents that are crawled by Nutch (as described here:
 http://www.ideaeng.com/synthetic-tokens-need-p2-0604). How can this be
 done in Nutch? Should I create a new field for every new synthetic
 token, or should I add them to metadata? I'm not quite sure how
 fields/metadata relate to the tokens described in the article.
 
 Thanks!
 Jakub
 


RE: Wrong ParseData in segment

2013-01-16 Thread Markus Jelsma
Sebastian!

I thought about that too since i do sometimes use class variables in some parse 
plugins such as storing the Parse object. However, i assumed the plugins were 
already in a thread-safe environment because each FetcherThread instance has 
its own instance of ParseUtil. 

I'll modify the plugins and see if it helps ;)

Thanks,
Markus 
 
-Original message-
 From:Sebastian Nagel wastl.na...@googlemail.com
 Sent: Wed 16-Jan-2013 18:38
 To: user@nutch.apache.org
 Subject: Re: Wrong ParseData in segment
 
 Hi Markus,
 
 right now I have seen this problem in a small test set of 20 documents:
 - various document types (HTML, PDF, XLS, zip, doc, ods)
 - small and quite large docs (up to 12 MB)
 - local docs via protocol-file
 - fetcher.parse = true
 - Nutch 1.4, local mode
 
  Somehow metadata from one doc slipped into another doc:
 - extracted by a custom HtmlParseFilter plugin (author, keywords, description)
 - reproducible, though not easily (3-5 trials to get one, rarely two
 wrong meta fields)
 - wrong parsemeta is definitely in the segment
 
 After adding more and more debug logs the stupid answer is:
 the custom plugin was not 100% thread-safe. Yes, it wasn't clear to me ;-):
 the same instance of a plugin may process two documents in parallel.
 I found also this thread (and NUTCH-496):
   
 http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html
 I didn't find any hint in the wiki (eg. in
 http://wiki.apache.org/nutch/WritingPluginExample),
 but I'll add one.
 
 Cheers,
 Sebastian
 
 
 2012/11/30 Markus Jelsma markus.jel...@openindex.io:
  Hi
 
  In our case it is really in the segment, and ends up in the index. Are 
  there any known issues with parse filters? In that filter we do set the 
  Parse object as class attribute but we reset it with the new Parse object 
  right after filter() is called.
 
  I also cannot think of the custom Tika ContentHandler to be the issue, a 
  new ContentHandler is created for each parse and passed to the 
  TeeContentHandler, just all other ContentHandlers.
 
  I assume an individual parse is completely isolated from another because 
  all those objects are created new for each record.
 
  Does anyone have a clue, however slight? Or any general tips on this, or 
  how to attempt to reproduce it?
 
 
  Thanks
 
  -Original message-
  From:Sebastian Nagel wastl.na...@googlemail.com
  Sent: Fri 30-Nov-2012 21:04
  To: user@nutch.apache.org
  Subject: Re: Wrong ParseData in segment
 
  Hi Markus,
 
  sounds somewhat similar to NUTCH-1252 but that was rather trivial
  and easy to reproduce.
 
  Sebastian
 
  2012/11/30 Markus Jelsma markus.jel...@openindex.io:
   Hi,
  
   We've got an issue where one in a few thousand records partially 
   contains another record's ParseMeta data. To be specific, record A ends 
   up with the ParseMeta data of record B that is added by one of our 
   custom parse plugins. I'm unsure as to where the problem really is 
   because the parse plugin receives data from a modified parser plugin 
   that in turn adds a custom Tika ContentHandler.
  
   Because i'm unable to reproduce this i had to inspect the code for 
   places where an object is reused but an attribute is not reset. To me, 
   that would be the most obvious problem, but until now i've been 
   unsuccessful in finding the issue!
  
   Regardless of how remote the chance is of someone having had some 
   similar issue: does anyone have some ideas to share?
  
   Thanks,
   Markus
 
 


RE: Wrong ParseData in segment

2013-01-16 Thread Markus Jelsma
Hi Sebastian,

Makes sense, i'll be sure to modify the parser plugins. Perhaps it would be 
worth trying to make sure a single thread uses a single instance. I don't know 
why it works the way it does; judging from the thread you pointed to, it's intended 
behaviour.

On the other hand, reusing parser plugin instances the way it's done now doesn't make too 
much sense. There's usually not a huge amount of data involved per single 
instance, so conserving heap space doesn't seem a reasonable justification.

Thanks,
Markus

 
 
-Original message-
 From:Sebastian Nagel wastl.na...@googlemail.com
 Sent: Wed 16-Jan-2013 22:04
 To: user@nutch.apache.org
 Subject: Re: Wrong ParseData in segment
 
 Hi Markus,
 
  However, i assumed the plugins were already in a thread-safe environment 
  because each
  FetcherThread instance has it's own instance of ParseUtil.
  I had similar assumptions but the debug output to investigate my problem is 
  straightforward
  (the numbers are object hash codes):
 
 2013-01-16 17:04:29,386 DEBUG parse.CustomParseFilter (instance=1639291161): 
 parsing file:.../1.xls
 2013-01-16 17:04:29,452 DEBUG parse.CustomParseFilter (instance=1639291161): 
 parsing file:.../2.doc
 2013-01-16 17:04:29,452 DEBUG parse.FieldExtractor - docfragm=1634712296: 
 node meta elem = 598132191
 2013-01-16 17:04:29,452 DEBUG parse.FieldExtractor - docfragm=1634712296: 
 author=Christina Maier
 2013-01-16 17:04:29,507 DEBUG parse.FieldExtractor - docfragm=1758166206: 
 node meta elem = 598132191
 2013-01-16 17:04:29,507 DEBUG parse.FieldExtractor - docfragm=1758166206: 
 author=Christina Maier
 
 The same parse filter instance processes two documents in parallel. The 
 plugin does a lot
 (extracting metadata, pruning content) and the documents are large and take 
 some time to process.
 Via a shared instance variable references to DOM nodes slipped from one call 
 of filter() to the other.
 
  Is there a possibility to ensure that every instance of ParseUtil has its 
  own plugin instances?
  It would be worth checking.
 
 Cheers,
 Sebastian
 
 
 On 01/16/2013 06:55 PM, Markus Jelsma wrote:
  Sebastian!
  
  I thought about that too since i do sometimes use class variables in some 
  parse plugins such as storing the Parse object. However, i assumed the 
  plugins were already in a thread-safe environment because each 
  FetcherThread instance has it's own instance of ParseUtil. 
  
  I'll modify the plugins and see if it helps ;)
  
  Thanks,
  Markus 
   
  -Original message-
  From:Sebastian Nagel wastl.na...@googlemail.com
  Sent: Wed 16-Jan-2013 18:38
  To: user@nutch.apache.org
  Subject: Re: Wrong ParseData in segment
 
  Hi Markus,
 
  right now I have seen this problem in a small test set of 20 documents:
  - various document types (HTML, PDF, XLS, zip, doc, ods)
  - small and quite large docs (up to 12 MB)
  - local docs via protocol-file
  - fetcher.parse = true
  - Nutch 1.4, local mode
 
  Somehow metadata from a one doc slipped into another doc:
  - extracted by a custom HtmlParseFilter plugin (author, keywords, 
  description)
  - reproducible, though not easily (3-5 trials to get one, rarely two
  wrong meta fields)
  - wrong parsemeta is definitely in the segment
 
  After adding more and more debug logs the stupid answer is:
  the custom plugin was not 100% thread-safe. Yes, it wasn't clear to me ;-):
  the same instance of a plugin may process two documents in parallel.
  I found also this thread (and NUTCH-496):

  http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12333.html
  I didn't find any hint in the wiki (eg. in
  http://wiki.apache.org/nutch/WritingPluginExample),
  but I'll add one.
 
  Cheers,
  Sebastian
 
 
  2012/11/30 Markus Jelsma markus.jel...@openindex.io:
  Hi
 
  In our case it is really in the segment, and ends up in the index. Are 
  there any known issues with parse filters? In that filter we do set the 
  Parse object as class attribute but we reset it with the new Parse object 
  right after filter() is called.
 
  I also cannot think of the custom Tika ContentHandler to be the issue, a 
  new ContentHandler is created for each parse and passed to the 
  TeeContentHandler, just all other ContentHandlers.
 
  I assume an individual parse is completely isolated from another because 
  all those objects are created new for each record.
 
  Does anyone have a clue, however slight? Or any general tips on this, or 
  how to attempt to reproduce it?
 
 
  Thanks
 
  -Original message-
  From:Sebastian Nagel wastl.na...@googlemail.com
  Sent: Fri 30-Nov-2012 21:04
  To: user@nutch.apache.org
  Subject: Re: Wrong ParseData in segment
 
  Hi Markus,
 
  sounds somewhat similar to NUTCH-1252 but that was rather trivial
  and easy to reproduce.
 
  Sebastian
 
  2012/11/30 Markus Jelsma markus.jel...@openindex.io:
  Hi,
 
  We've got an issue where one in a few thousand records partially 
  contains another record's ParseMeta data. To be specific, record A ends

RE: [ANNOUNCE] New Nutch committer and PMC : Tejas Patil

2013-01-14 Thread Markus Jelsma
Nice!

Thanks

 
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Mon 14-Jan-2013 20:28
 To: d...@nutch.apache.org
 Cc: user@nutch.apache.org
 Subject: Re: [ANNOUNCE] New Nutch committer and PMC : Tejas Patil
 
 Welcome aboard Tejas
 Best
 Lewis
 
 On Monday, January 14, 2013, Julien Nioche lists.digitalpeb...@gmail.com
 wrote:
  Dear all,
 
  It is my pleasure to announce that Tejas Patil has joined the Nutch PMC
 and is a new committer. Tejas, would you mind telling us about yourself,
 what you've done so far with Nutch, which areas you think you'd like to get
 involved, etc...
  Congratulations Tejas and welcome on board!
 
  BTW If you haven't done so please have a look at
 http://www.apache.org/dev/new-committers-guide.html. I expect that your
 account will be created within a few days after reception of the ICLA
 
  Best,
 
  Julien
 
  --
  http://digitalpebble.com/img/logo.gif
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
 
 -- 
 *Lewis*
 


RE: How segments is created?

2013-01-13 Thread Markus Jelsma


 
 
-Original message-
 From:Bayu Widyasanyata bwidyasany...@gmail.com
 Sent: Sun 13-Jan-2013 07:34
 To: user@nutch.apache.org
 Subject: Re: How segments is created?
 
 On Sun, Jan 13, 2013 at 12:47 PM, Tejas Patil tejas.patil...@gmail.comwrote:
 
 
  Well, if you know that the front page is updated frequently, set
  db.fetch.interval.default to lower value so that urls will be eligible
  for re-fetch sooner. By default, if a url is fetched successfully, it
  becomes eligible for re-fetching after 30 days
 
 
 Very clear!
 In summary,
  Nutch cannot identify whether a page has been updated, hence (if a page is updated
  frequently) we should set db.fetch.interval.default to a lower value so that
  the page is re-fetched sooner.

No, you can plug in another FetchSchedule that supports adjusting the interval 
based on whether a record is modified. See the AdaptiveFetchSchedule for an 
example.
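
Switching to it is just configuration in nutch-site.xml (values here are only 
examples; see nutch-default.xml for the full list of adaptive settings):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>86400</value> <!-- never re-fetch more often than once a day -->
  </property>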

 
 Thanks so much!
 -- 
 wassalam,
 [bayu]
 


RE: code changes not reflecting when deployed on hadoop

2012-12-27 Thread Markus Jelsma
Seems the job file is not deployed to all task trackers and i'm not sure why. 
Can you try using the nutch script to run your fetcher? 
 
-Original message-
 From:Sourajit Basak sourajit.ba...@gmail.com
 Sent: Thu 27-Dec-2012 13:29
 To: user@nutch.apache.org
 Subject: code changes not reflecting when deployed on hadoop
 
 We have made some changes to Fetcher (v1.5). However, when we build a .job
 (jar) and deploy it on hadoop it doesn't seem to pick up any changes. This
 is how we are running it.
 
  ./hadoop jar ../nutch/apache-nutch-1.5.1.job
 org.apache.nutch.fetcher.Fetcher segment on hdfs -threads 4
 
 However, if we modify any of the plugins, it picks up the changes properly.
 
 Initially, I doubted that our logic wasn't getting hit. To cross check, we
 removed Fetcher.class from the .job file and re-executed. Still it seems to
 run an old version of the code.
 
 I strongly suspect, I am missing out something which needs to be done after
 a code change.
 


RE: code changes not reflecting when deployed on hadoop

2012-12-27 Thread Markus Jelsma
It works the same as in local mode, just have the job file in the CWD. 
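
For example (illustrative; the segment path is on HDFS), with HADOOP_HOME set 
and the .job file sitting next to the script as it does in runtime/deploy:

  cd runtime/deploy
  bin/nutch fetch crawl/segments/20121227120000 -threads 10

The script picks up the job file and submits it via hadoop for you.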
 
-Original message-
 From:Sourajit Basak sourajit.ba...@gmail.com
 Sent: Thu 27-Dec-2012 14:51
 To: user@nutch.apache.org
 Subject: Re: code changes not reflecting when deployed on hadoop
 
 We are using hadoop 1.1
 
 On Thu, Dec 27, 2012 at 7:13 PM, Sourajit Basak 
 sourajit.ba...@gmail.comwrote:
 
  How do you use the nutch script on a cluster ?
 
 
  On Thu, Dec 27, 2012 at 6:25 PM, Markus Jelsma markus.jel...@openindex.io
   wrote:
 
  Can you try using the nutch script to run your fetcher?
 
 
 
 


RE: Nutch approach for DeadLinks

2012-12-26 Thread Markus Jelsma
Hi - Nutch 1.5 has a -deleteGone switch for the SolrIndexer job. This will 
delete permanent redirects and 404's that have been discovered during the 
crawl. 1.6 also has a -deleteRobotsNoIndex switch that will delete pages that have a 
robots meta tag with a noindex value.
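
For example (1.5+, paths illustrative; run bin/nutch solrindex without 
arguments to see the exact usage for your version):

  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/20121226* -deleteGone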
 
 
-Original message-
 From:David Philip davidphilipshe...@gmail.com
 Sent: Wed 26-Dec-2012 06:28
 To: user@nutch.apache.org
 Subject: Nutch approach for DeadLinks
 
 Hi  All,
 
  How does nutch work with dead links? Say, for example, there is a blog
  site being crawled today and all the blogs (documents) are indexed to solr.
  Tomorrow, one of the blogs is deleted, which means that the URL indexed
  yesterday is no longer working today. In such cases, how do we update the solr
  index so that this particular blog doesn't come up in search results?
  Recrawling the same site didn't delete this record in solr. How do we handle
  such cases? I am using the nutch 1.5.1 bin. Thanks, David
 


RE: About the version of the nutch

2012-12-24 Thread Markus Jelsma
Hi - it depends on the estimated size of your data and the available hardware. 
You can simply get the current 1.0.x stable or 1.1.x beta Hadoop version, both 
will run fine. The choice is which Nutch to use, 1.x is very stable and has 
more features and can be used for very large scale crawls although you might 
have to use a bit more hardware. 2.x is more efficient in writing and reading 
data but also less stable, you will run into more problems that divert you from 
your core tasks.

If you have a few powerful machines and your data is in the TB range 1.x is 
fine. If you like a challenge 2.x is the way to go. We process many TBs each 
month on just a few powerful machines and run a modified 1.x.  
 
-Original message-
 From:許懷文 k120861032...@gmail.com
 Sent: Mon 24-Dec-2012 18:17
 To: user@nutch.apache.org
 Subject: About the version of the nutch
 
 Dear Nutch Project Team:
 
  I am interested in Nutch and Hadoop and want to apply them to big
  data analysis, but I have some problems choosing their versions.
  I want to set up a search engine by myself, and I have chosen
  Hadoop+Nutch+Solr+HBase to implement it.
  Would you mind telling me suitable versions of them to set up? I will
  appreciate your kind reply and helpful suggestions.
 Thanks!
 Best regards,
 Kevin Hsu.
 


RE: shouldFetch rejected

2012-12-17 Thread Markus Jelsma
Hi - curTime does not exceed fetchTime, thus the record is not eligible for 
fetch.
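
If you just want to force it for testing, the generator has an -adddays switch 
that pretends the current time is N days later; the gap above is roughly 45 
days, so something like this would make the record eligible again (paths 
illustrative):

  bin/nutch generate crawl/1300/crawldb crawl/1300/segments -topN 400 -adddays 45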
 
 
-Original message-
 From:Jan Philippe Wimmer i...@jepse.net
 Sent: Mon 17-Dec-2012 13:31
 To: user@nutch.apache.org
 Subject: Re: shouldFetch rejected
 
 Hi again.
 
 i still have that issue. I start with a complete new crawl directory 
 structure and get the following error:
 
 -shouldFetch rejected 'http://www.lequipe.fr/Football/', 
 fetchTime=1359626286623, curTime=1355738313780
 
 Full-Log:
 crawl started in: /opt/project/current/crawl_project/nutch/crawl/1300
 rootUrlDir = /opt/project/current/crawl_project/nutch/urls/url_1300
 threads = 20
 depth = 3
 solrUrl=http://192.168.1.144:8983/solr/
 topN = 400
 Injector: starting at 2012-12-17 10:57:36
 Injector: crawlDb: 
 /opt/project/current/crawl_project/nutch/crawl/1300/crawldb
 Injector: urlDir: /opt/project/current/crawl_project/nutch/urls/url_1300
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2012-12-17 10:57:51, elapsed: 00:00:14
 Generator: starting at 2012-12-17 10:57:51
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: topN: 400
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: 
 /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
 Generator: finished at 2012-12-17 10:58:06, elapsed: 00:00:15
 Fetcher: Your 'http.agent.name' value should be listed first in 
 'http.robots.agents' property.
 Fetcher: starting at 2012-12-17 10:58:06
 Fetcher: segment: 
 /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
 Using queue mode : byHost
 Fetcher: threads: 20
 Fetcher: time-out divisor: 2
 QueueFeeder finished: total 1 records + hit by time limit :0
 Using queue mode : byHost
 Using queue mode : byHost
 fetching http://www.lequipe.fr/Football/
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Using queue mode : byHost
 -finishing thread FetcherThread, activeThreads=1
 Fetcher: throughput threshold: -1
 Fetcher: throughput threshold retries: 5
 -finishing thread FetcherThread, activeThreads=0
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0
 Fetcher: finished at 2012-12-17 10:58:13, elapsed: 00:00:07
 ParseSegment: starting at 2012-12-17 10:58:13
 ParseSegment: segment: 
 /opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759
 ParseSegment: finished at 2012-12-17 10:58:20, elapsed: 00:00:07
 CrawlDb update: starting at 2012-12-17 10:58:20
 CrawlDb update: db: 
 /opt/project/current/crawl_project/nutch/crawl/1300/crawldb
 CrawlDb update: segments: 
 [/opt/project/current/crawl_project/nutch/crawl/1300/segments/20121217105759]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: 404 purging: false
 CrawlDb update: Merging segment data into db.
 CrawlDb update: finished at 2012-12-17 10:58:33, elapsed: 00:00:13
 Generator: starting at 2012-12-17 10:58:33
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: topN: 400
 Generator: jobtracker is 'local', generating exactly one partition.
 -shouldFetch rejected 'http://www.lequipe.fr/Football/', 
 fetchTime=1359626286623, curTime=1355738313780
 Generator: 0 records selected for fetching, exiting ...
 Stopping at depth=1 - no more URLs to fetch.
 LinkDb: starting at 2012-12-17 10:58:40
 LinkDb: linkdb: 

RE: How to extend Nutch for article crawling

2012-12-17 Thread Markus Jelsma
The 1.x indexer can filter and normalize. 
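
If I remember the switches correctly, that is exposed on solrindex as -filter 
and -normalize, e.g. (paths illustrative):

  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/* -filter -normalize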
 
-Original message-
 From:Julien Nioche lists.digitalpeb...@gmail.com
 Sent: Mon 17-Dec-2012 15:11
 To: user@nutch.apache.org
 Subject: Re: How to extend Nutch for article crawling
 
 Hi
 
 See comments below
 
 
  1. Add article list pages into url/seed.txt
  Here's one problem. What I actually want to be indexed is the article
  pages, not the article list pages. But, if I don't allow the list page to
  be indexed, Nutch will do nothing because the list page is the entrance.
  So, how can I index only the article page without list pages?
 
 
 I think that the indexer can now filter URLs but can't remember whether it
 is for 1.x only or is in 2.x as well. Anyone?
 This would work if you can find a regular expression that captures the list
 pages. Another approach would be to tweak the indexer so that it skips
 documents containing an arbitrary metadatum (e.g. skip.indexing), this
 metadata would be set in a custom parser when processing the list pages.
 
 I think this would be a useful feature to have anyway. URL filters use the
 URL string only and having the option to skip based on metadata would be
 good IMHO
 
 
 
  2. Write a plugin to parse out the 'author', 'date', 'article body',
  'headline' and maybe other information from html.
  The 'Parser' plugin interface in Nutch 2.1 is:
  Parse getParse(String url, WebPage page)
  And the 'WebPage' class has some predefined attributs:
  public class WebPage extends PersistentBase {
//...
private Utf8 baseUrl;
// ...
private Utf8 title;
private Utf8 text;
// ...
  private Map<Utf8,ByteBuffer> metadata;
// ...
  }
 
  So, the only field I can put my specified attributes in is the
  'metadata'. Is it designed for this purpose?
  BTW, the Parser in trunk looks like: 'public ParseResult
  getParse(Content content)', and seems more reasonable for me.
 
 
 The extension point Parser is for low level parsing i.e extract text and
 metadata from binary formats, which is done typically by parse-tika. What
 you want to implement is an extension of ParseFilter and add your own
 entries to the parse metadata. The creative commons plugin should be a good
 example to get started
 
 
 
  3. After the articles are indexed into Solr, another application can query
  it by 'date' then store the article information into Mysql.
  My question here is: can Nutch store the article directly into Mysql?
  Or can I write a plugin to specify the index behavior?
 
 
 you could use the mysql backend in GORA (but it is broken AFAIK) and get
 the other application to use it, alternatively you could write a custom
 indexer that sends directly into MySQL but that would be a bit redundant.
 Do you need to use SOLR at all or is the aim to simply to store in MySQL?
 
 
 
  Is Nutch a good choice for my purpose? If not, do you guys suggest another
  good quality framework/library for me?
 
 
 You can definitely do that with Nutch. There are certainly other resources
 that could be used but they might also need a bit of customisation anyway
 
 HTH
 
 Julien
 
 
 -- 
 *
 *Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 


RE: identify domains from fetch lists taking lot of time.

2012-12-14 Thread Markus Jelsma
Hi - you have to get rid of those URL's via URL filters. If you cannot filter 
them out you can set the fetcher time limit (see nutch-default) to limit the 
time the fetcher runs, or set the fetcher minimum throughput (see 
nutch-default). The latter will abort the fetcher if fewer than N pages/second 
are fetched. The unfetched records will be fetched later on together with other 
queues. 
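
Something along these lines in nutch-site.xml (values are examples only):

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>180</value> <!-- abort the fetch after 3 hours -->
  </property>
  <property>
    <name>fetcher.throughput.threshold.pages</name>
    <value>1</value> <!-- abort when fewer than 1 page/second is fetched -->
  </property>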
 
-Original message-
 From:manubharghav manubharg...@gmail.com
 Sent: Fri 14-Dec-2012 07:39
 To: user@nutch.apache.org
 Subject: identify domains from fetch lists taking lot of time.
 
 Hi,
 
 I initiated a crawl on 200 domains till a depth of 5 with a topN of 1
 million.  A single domain extended my fetch time by a day as it kept
 generating outlinks to the same page with different urls( the parameters
 change, but the content remains same.)
 .http://www.awex.com.au/about-awex.html?s=___.So is there anyway
 to run the content dedup while fetching itself or are there any other steps
 to avoid such cases. The problem is that as the size of the fetch list is
 increasing the fetcher has a delay of say 3 seconds hitting the same server.
 This is causing the delay in the node and hence delaying the effective time
 of the crawl.
 
 
 Thanks in advance.
 Manu Reddy.
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/identify-domains-from-fetch-lists-taking-lot-of-time-tp4026942.html
 Sent from the Nutch - User mailing list archive at Nabble.com.
 


RE: fetcher partitioning

2012-12-10 Thread Markus Jelsma
Sourajit,

Looks fine at first glance. A partitioner does not partition between threads, 
only mappers. It also makes little sense because in the fetcher the number of 
threads can be set, plus the queue mode.

Can you open an issue and attach your patch? 

Thanks,

 
 
-Original message-
 From:Sourajit Basak sourajit.ba...@gmail.com
 Sent: Mon 10-Dec-2012 10:55
 To: user@nutch.apache.org
 Cc: Markus Jelsma markus.jel...@openindex.io
 Subject: Re: fetcher partitioning
 
 Could anyone review this patch for using a pluggable custom partitioner ?
 For the time, I have just copied over HashPartitioner impl. Need to 
 understand a bit more about Hadoop's partitioning.
 
 Can the group also comment if this RandomPartioner will distribute urls from 
 the same host across different fetcher threads ? Running in local mode, 
 doesn't seem to have any affect. 
 
 (My cluster is undergoing routine maintenance; need to wait for testing in 
 distributed mode)
 
 Best,
 Sourajit
 
 On Thu, Dec 6, 2012 at 11:21 AM, Sourajit Basak sourajit.ba...@gmail.com 
 mailto:sourajit.ba...@gmail.com  wrote:
 Ok. Give me some time. 
 
 On Thu, Dec 6, 2012 at 12:07 AM, Markus Jelsma markus.jel...@openindex.io 
 mailto:markus.jel...@openindex.io  wrote:
 
 
 
 
 -Original message-
  From:Sourajit Basak sourajit.ba...@gmail.com 
  mailto:sourajit.ba...@gmail.com 
  Sent: Wed 05-Dec-2012 18:16
  To: user@nutch.apache.org mailto:user@nutch.apache.org 
  Subject: fetcher partitioning
 
  Per my understanding, Nutch partitions urls based on either host, ip or
  domain. Is it possible to partition based on url patterns ?
 
  For e.g my company, a publishing house, is planning to expose its content
  like http://host/publicationA http://host/publicationA , 
  http://host/publicationB http://host/publicationB . etc. We wish to
  partition the fetching based on url patterns like /publicationA/* to a
  thread, /publicationB/* to another, etc.
 
  This will not only help us expedite indexing the content but also test the
  throughput of the site, though the second is an additional benefit we get
  by doing no extra work.
 
  We can attempt to modify the URLPartitioner, but that does not seem to be
  plug and play like the FetchSchedule. And would mean changes to the core.
 
 Indeed, you have to modify the partitioner to make this happen. You are free 
 to do so but you can also make it pluggable as fetch schedule via config and 
 provide a patch so it can be added to the Nutch sources.
 
 
  Any suggestions ?
 
  Best,
  Sourajit
 
 
 
 


RE: fetcher partitioning

2012-12-10 Thread Markus Jelsma


 
 
-Original message-
 From:Sourajit Basak sourajit.ba...@gmail.com
 Sent: Mon 10-Dec-2012 12:17
 To: user@nutch.apache.org
 Subject: Re: fetcher partitioning
 
 Markus,
 I will open an issue.
 
 But I am confused now. Does the partitioner have no effect on the fetchers
 ?

The partitioner decides which record ends up in which fetch list. When running 
locally, there is always one fetch list and one mapper to ingest that fetch 
list.

 Even if we allot 10 threads to the fetcher (all urls belonging to the same
 host), will each thread fetch its items simultaneously ?

That depends on the queue mode used. The fetcher organizes URL's in queues, and 
threads will just pick the next URL to fetch. URL's are either queued by host, 
ip or domain. See nutch-default for descriptions on which queue to use and how 
many threads per queue to set up.
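
For example, in nutch-site.xml (illustrative values):

  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value> <!-- or byDomain / byIP -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value> <!-- raise only if you are allowed to hit a host harder -->
  </property>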

 What is queue mode?
 
 Best,
 Sourajit
 
 On Mon, Dec 10, 2012 at 4:23 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Sourajit,
 
  Looks fine at a first glance. A partitioner does not partition between
  threads, only mappers. It also makes little sense because in the fetcher
  number of threads can be set plus the queue mode.
 
  Can you open an issue and attach your patch?
 
  Thanks,
 
 
 
  -Original message-
   From:Sourajit Basak sourajit.ba...@gmail.com
   Sent: Mon 10-Dec-2012 10:55
   To: user@nutch.apache.org
   Cc: Markus Jelsma markus.jel...@openindex.io
   Subject: Re: fetcher partitioning
  
   Could anyone review this patch for using a pluggable custom partitioner ?
   For the time, I have just copied over HashPartitioner impl. Need to
  understand a bit more about Hadoop's partitioning.
  
   Can the group also comment if this RandomPartioner will distribute urls
  from the same host across different fetcher threads ? Running in local
  mode, doesn't seem to have any affect.
  
   (My cluster is undergoing routine maintenance; need to wait for testing
  in distributed mode)
  
   Best,
   Sourajit
  
   On Thu, Dec 6, 2012 at 11:21 AM, Sourajit Basak 
  sourajit.ba...@gmail.com mailto:sourajit.ba...@gmail.com  wrote:
   Ok. Give me some time.
  
   On Thu, Dec 6, 2012 at 12:07 AM, Markus Jelsma 
  markus.jel...@openindex.io mailto:markus.jel...@openindex.io  wrote:
  
  
  
  
   -Original message-
From:Sourajit Basak sourajit.ba...@gmail.com mailto:
  sourajit.ba...@gmail.com 
Sent: Wed 05-Dec-2012 18:16
To: user@nutch.apache.org mailto:user@nutch.apache.org
Subject: fetcher partitioning
   
Per my understanding, Nutch partitions urls based on either host, ip or
domain. Is it possible to partition based on url patterns ?
   
For e.g my company, a publishing house, is planning to expose its
  content
like http://host/publicationA http://host/publicationA ,
  http://host/publicationB http://host/publicationB . etc. We wish to
partition the fetching based on url patterns like /publicationA/* to a
thread, /publicationB/* to another, etc.
   
This will not only help us expedite indexing the content but also test
  the
throughput of the site, though the second is an additional benefit we
  get
by doing no extra work.
   
We can attempt to modify the URLPartitioner, but that does not seem to
  be
plug and play like the FetchSchedule. And would mean changes to the
  core.
  
   Indeed, you have to modify the partitioner to make this happen. You are
  free to do so but you can also make it pluggable as fetch schedule via
  config and provide a patch so it can be added to the Nutch sources.
  
   
Any suggestions ?
   
Best,
Sourajit
   
  
  
  
 
 


RE: [ANNOUNCE] Apache Nutch 1.6 Released

2012-12-10 Thread Markus Jelsma
Thanks Lewis! :)

 
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Sat 08-Dec-2012 22:56
 To: annou...@apache.org; user@nutch.apache.org
 Cc: d...@nutch.apache.org
 Subject: [ANNOUNCE] Apache Nutch 1.6 Released
 
 Hi All,
 
 The Apache Nutch PMC are extremely pleased to announce the release of
 Apache Nutch v1.6. This release includes over 20 bug fixes, the same
 in improvements, as well as new functionalities including a new
 HostNormalizer, the ability to dynamically set fetchInterval by
 MIME-type and functional enhancements to the Indexer API inluding the
 normalization of URL's and the deletion of robots noIndex documents.
 Other notable improvements include the upgrade of key dependencies to
 Tika 1.2 and Automaton 1.11-8.
 
 A full PMC statement can be found here [0]
 
 The release can be found on official Apache mirrors [1] as well as
 sources in Maven Central [2]
 
 Thank you
 
 Lewis
 On Behalf of the Nutch PMC
 
 [0] http://s.apache.org/NFp
 [1] http://www.apache.org/dyn/closer.cgi/nutch/
 [2] http://search.maven.org/#artifactdetails|org.apache.nutch|nutch|1.6|jar
 
 -- 
 Lewis
 


RE: New Scoring

2012-12-05 Thread Markus Jelsma


 
 
-Original message-
 From:Pratik Garg saytopra...@gmail.com
 Sent: Wed 05-Dec-2012 19:17
 To: user@nutch.apache.org
 Cc: Chirag Goel goel.chi...@gmail.com
 Subject: New Scoring
 
 Hi,
 
 Nutch provides a default and new Scoring method for giving score to the
 pages. I have couple of questions
 
 * What is the difference between these two methods?

LinkRank is a power iteration algorithm like PageRank. It can be used 
incrementally and is very stable. OPIC has trouble with increments.

 * If I want to pass this data to solr during indexing , do I have to do
 anything extra.

The CrawlDB has a score field which is used to populate the boost field. With 
OPIC this is added via the scoring filter. If you use the LinkRank algorithm, 
make sure you call its ScoreUpdater tool, which writes the calculated scores 
back to the crawldb.
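
The usual sequence is roughly the following (paths illustrative; check the 
usage output of each command for your version):

  bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
  bin/nutch linkrank -webgraphdb crawl/webgraphdb
  bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb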

 * If I want to sort the results from solr based on this data , which field
 I should use?

the boost field.

 
 Thanks,
 Pratik
 


RE: hung threads in big nutch crawl process

2012-12-03 Thread Markus Jelsma
This page explains the individual steps:
http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling
 
 
-Original message-
 From:Eyeris Rodriguez Rueda eru...@uci.cu
 Sent: Mon 03-Dec-2012 21:08
 To: user@nutch.apache.org
 Subject: RE: hung threads in big nutch crawl process
 
  Thank you Markus for your answer.
  I have always used nutch from the console, running a complete cycle:
  bin/nutch crawl urls -dir crawl -depth 10 -topN 10 -solr 
  http://localhost:8080/solr
  Could you explain to me how to run the steps as separate processes? I was reading the wiki 
  but it didn't work for me because I don't understand the commands. I want to 
  use nutch in distributed mode; could you point me to good documentation for it?
 
 _
 Ing. Eyeris Rodriguez Rueda
 Teléfono:837-3370
 Universidad de las Ciencias Informáticas
 _
 
 -Mensaje original-
 De: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Enviado el: lunes, 03 de diciembre de 2012 1:42 PM
 Para: user@nutch.apache.org
 Asunto: RE: hung threads in big nutch crawl process
 
 Hi - Hadoop organizes some threads but in Nutch the only job that uses 
 threads is the fetcher. Parses are done using the executor service.
 
 It is very well possible that you have some regexes that are very complex and 
 Nutch can take a long time processing those, especially if you parse in the 
 fetcher job.
 
  You should run the Nutch jobs separately to find out which job is giving you 
  trouble.
 
 -Original message-
  From:Eyeris Rodriguez Rueda eru...@uci.cu
  Sent: Mon 03-Dec-2012 20:31
  To: user@nutch.apache.org
  Subject: hung threads in big nutch crawl process
  
  Hi all.
  I have detected that in big nutch crawl process(depth:10 topN:100 000) some 
  threads are hunged in some part of crawl cicle for example normalizing by 
  regex and fetching urls to.
  Im using nutch 1.5.1 and solr 3.6.
  Ram:2GB
  CPU:CoreI3.
  OS:Ubuntu 12.04(server)
  
  I have a doubt, How nutch manipulate the threads in a cicle of crawl 
  process ?.
  Is multithread the generation,fetching,parsing process ? 
  
  PD:Sorry for my english. Is not my native language.
 
 
 


RE: Fetch content inside nutch parse

2012-11-30 Thread Markus Jelsma
See how the indexchecker fetches URL's:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java?view=markup
 
 
-Original message-
 From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
 Sent: Fri 30-Nov-2012 16:46
 To: user@nutch.apache.org
 Subject: Fetch content inside nutch parse
 
 It's possible to use nutch fetcher inside a parse plugin? Or should some 
 third party library?
 
 slds
 --
 It is only in the mysterious equation of love that any 
 logical reasons can be found.
 Good programmers often confuse halloween (31 OCT) with 
 christmas (25 DEC)
 
 


RE: Indexing-time URL filtering again

2012-11-29 Thread Markus Jelsma
Please send us the regex file.
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Thu 29-Nov-2012 04:48
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again
 
 I made sure I got the most recent trunk, Markus. I don't understand why the
 problem persists.
 
 On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  I checked the code. You're probably not pointing it to a valid path or
  perhaps the build is wrong and you haven't used ant clean before building
  Nutch. If you keep having trouble you may want to check out trunk.
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Mon 26-Nov-2012 00:40
   To: user@nutch.apache.org
   Subject: Re: Indexing-time URL filtering again
  
   OK. I'm testing it. But like I said, even when I reduce the patterns to
  the
   simpliest form -., the problem still persists.
  
   On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma
   markus.jel...@openindex.iowrote:
  
It's taking input from stdin, enter some URL's to test it. You can add
  an
issue with reproducable steps.
   
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 25-Nov-2012 23:49
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again

 I ran the regex tester command you provided. It seems to be taking
forever
 (15 min + by now).

 On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang smartag...@gmail.com
  wrote:

  you mean the content my pattern file?
 
  well, even wehn I reduce it to simply -., the same problem still
pops up.
 
  On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma 
markus.jel...@openindex.io
   wrote:
 
  You seems to have an NPE caused by your regex rules, for some
  weird
  reason. If you can provide a way to reproduce you can file an
  issue in
  Jira. This NPE should also occur if your run the regex tester.
 
  nutch -Durlfilter.regex.file=path
org.apache.nutch.net.URLFilterChecker
  -allCombined
 
  In the mean time you can check if a rule causes the NPE.
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Sun 25-Nov-2012 23:26
   To: user@nutch.apache.org
   Subject: Re: Indexing-time URL filtering again
  
   the last few lines of hadoop.log:
  
   2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
   org.apache.nutch.indexer.anchor.AnchorIndexingFilter
   2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
   org.apache.nutch.indexer.metadata.MetadataIndexer
   2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner -
  job_local_0001
   java.lang.RuntimeException: Error in configuring object
   at
  
 
   
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
   at
  
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
   at
  
 
   
  org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
   at
  org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
   at
  org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at
  
   
  org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
   Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
   at
  
 
   
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at
  
 
   
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601)
   at
  
 
   
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
   ... 5 more
   Caused by: java.lang.RuntimeException: Error in configuring
  object
   at
  
 
   
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
   at
  
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
   at
  
 
   
  org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
   at
  org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
   ... 10 more
   Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
   at
  
 
   
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at
  
 
   
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601

RE: size of crawl

2012-11-29 Thread Markus Jelsma
Impossible to say but perhaps there are more non-200 fetched records. Carefully 
look at the fetcher logs and inspect the crawldb with the readdb -stats 
command. 
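
For example:

bin/nutch readdb crawl/crawldb -stats

shows how many records ended up as db_fetched, db_gone, db_redir_temp and so 
on, which usually explains a difference in the number of indexed documents.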
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Thu 29-Nov-2012 07:04
 To: user user@nutch.apache.org
 Subject: size of crawl
 
  With the same set of parameters (-depth 5 -topN 200), I run two different
 crawls:
 
 Crawl 1: 2 sites
 Crawl 2: 4 sites (superset of the 2 in Crawl1)
 
 However, I end up having much fewer docs in Crawl 2. Can anybody suggest
 the reason(s)?
 
 Thanks.
 
 Joe.
 


RE: Nutch efficiency and multiple single URL crawls

2012-11-29 Thread Markus Jelsma
As I said, you don't rebuild; you just overwrite the config file in the Hadoop 
config directory on the data nodes. Config files are looked up there as well. 
Just copy the file to the data nodes.
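
A rough sketch, assuming the file you change is regex-urlfilter.txt and the 
Hadoop config directory is /etc/hadoop/conf (hostnames and paths are 
placeholders, adjust them to your cluster):

for host in node1 node2 node3; do
  scp conf/regex-urlfilter.txt $host:/etc/hadoop/conf/
done

The copy on the node is then picked up instead of the one embedded in the job 
file, so there is no need to rebuild the job file per domain.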
 
-Original message-
 From:AC Nutch acnu...@gmail.com
 Sent: Thu 29-Nov-2012 05:38
 To: user@nutch.apache.org
 Subject: Re: Nutch efficiency and multiple single URL crawls
 
 Thanks for the help. Perhaps I am misunderstanding, what would be the
 proper way to leverage this? I am a bit new to Nutch 1.5.1, I've been using
 1.4 and have generally been using runtime/deploy/bin/nutch with a .job
 file. I notice things are done a bit differently in 1.5.1 with the lack of
 a nutch runtime and nutch deploy directories. How can I run a crawl while
 leveraging this functionality and not having to rebuild the job file each
 new crawl? More specifically, I'm picturing the following workflow...
 
 (1) update config file to restrict domain crawls - (2) run command that
 crawls a domain with changes from config file while not having to rebuild
 job file  - (3) index to Solr
 
 What would the (general) command be for step (2) is my question.
 
 On Mon, Nov 26, 2012 at 5:16 AM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Hi,
 
  Rebuilding the job file for each domain is not a good idea indeed, plus it
  adds the Hadoop overhead. But you don't have to, we write dynamic config
  files to each node's Hadoop configuration directory and it is picked up
  instead of the embedded configuration file.
 
  Cheers,
 
  -Original message-
   From:AC Nutch acnu...@gmail.com
   Sent: Mon 26-Nov-2012 06:50
   To: user@nutch.apache.org
   Subject: Nutch efficiency and multiple single URL crawls
  
   Hello,
  
   I am using Nutch 1.5.1 and I am looking to do something specific with
  it. I
   have a few million base domains in a Solr index, so for example:
   http://www.nutch.org, http://www.apache.org, http://www.whatever.cometc. I
   am trying to crawl each of these base domains in deploy mode and retrieve
   all of their sub-urls associated with that domain in the most efficient
  way
   possible. To give you an example of the workflow I am trying to achieve:
   (1) Grab a base domain, let's say http://www.nutch.org (2) Crawl the
  base
   domain for all URLs in that domain, let's say http://www.nutch.org/page1
  ,
   http://www.nutch.org/page2, http://www.nutch.org/page3, etc. etc. (3)
  store
   these results somewhere (perhaps another Solr instance) and (4) move on
  to
   the next base domain in my Solr index and repeat the process. Essentially
   just trying to grab all links associated with a page and then move on to
   the next page.
  
   The part I am having trouble with is ensuring that this workflow is
   efficient. The only way I can think to do this would be: (1) Grab a base
   domain from Solr from my shell script (simple enough) (2) Add an entry to
   regex-urlfilter with the domain I am looking to restrict the crawl to, in
   the example above that would be an entry that says to only keep sub-pages
   of http://www.nutch.org/ (3) Recreate the Nutch job file (~25 sec.) (4)
   Start the crawl for pages associated with a domain and do the indexing
  
   My issue is with step #3, AFAIK if I want to restrict a crawl to a
  specific
   domain I have to change regex-urlfilter and reload the job file. This is
  a
   pretty significant problem, since adding 25 seconds every single time I
   start a new base domain is going to add way too many seconds to my
  workflow
   (25 sec x a few million = way too much time). Finally the question...is
   there a way to add url filters on the fly when I start a crawl and/or
   restrict a crawl to a particular domain on the fly. OR can you think of a
   decent solution to the problem/am I missing something?
  
 
 


RE: Access crawled content or parsed data of previous crawled url

2012-11-29 Thread Markus Jelsma
Hi,

This is a difficult problem in MapReduce and because of the fact that one image 
URL may be embedded in many documents. There are various methods you could use 
to aggregate the records but none i can think of will work very well or are 
straightforward to implement.

I think the most straightforward and easiest to implement method is to create a 
new key/value pair to store the surrounding text for each image, and to do this 
during the parse. This would mean you have to emit a Text,Text pair for each 
image in every HTML page, with the image's URL as key and the surrounding text 
as value. You will have to modify the indexer to ingest that structure as well 
during indexing. This way existing CrawlDatums for existing images will end up 
in the reducer together with zero or more of your new key/value pairs. In 
IndexerMapReduce you can deal with them appropriately.

This method works well with MapReduce and requires not too much programming. 
The downside is that you cannot build a parse plugin and indexing plugin 
because they cannot handle your new key/value pair.

Good luck and let us know what you came up with :)
 
-Original message-
 From:Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
 Sent: Thu 29-Nov-2012 19:53
 To: user@nutch.apache.org
 Subject: Re: Access crawled content or parsed data of previous crawled url
 
 For now I don't see any form of accessing metadata for a previously parsed 
 document, I'm mistaken?
 
 - Mensaje original -
 De: alx...@aim.com
 Para: user@nutch.apache.org
 Enviados: Jueves, 29 de Noviembre 2012 13:38:43
 Asunto: Re: Access crawled content or parsed data of previous crawled url
 
 Hi,
 
 Unfortunately, my employer does not want me to disclose details of the plugin 
 at this time.
 
 Alex.
 
  
 
  
 
  
 
 -Original Message-
 From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
 To: user user@nutch.apache.org
 Sent: Wed, Nov 28, 2012 6:20 pm
 Subject: Re: Access crawled content or parsed data of previous crawled url
 
 
 Hi Alex:
 
 What you've done is basically what I'm try to accomplish: I'm trying to get 
 the 
 text surrounding the img tags to improve the image search engine we're 
 building 
 (this is done when the html page containing the img tag is parsed), and when 
 the 
 image url itself is parsed we generate thumbnails and extract some metadata. 
 But 
 how do you keep the this 2 pieces of data linked together inside your index 
 (solr in my case). Because the thing is that I'm getting two documents inside 
 solr (1. containing the text surrounding the img tag, and other document with 
 the thumbnail). So what brings me troubles is how when the thumbnail is being 
 generated can I get the surrounding text detecte when the html was parsed?
 
 Thanks a lot for all the replies!
 
 P.S: Alex, can you share some piece of code (if it's possible) of your 
 working 
 plugins? Or walk me through what you've came up with?
 
 - Mensaje original -
 De: alx...@aim.com
 Para: user@nutch.apache.org
 Enviados: Miércoles, 28 de Noviembre 2012 19:54:07
 Asunto: Re: Access crawled content or parsed data of previous crawled url
 
 It is not clear what you try to achieve. We have done something similar in 
 regard of indexing img tags. We retrieve img tag data while parsing the html 
 page  and keep it in a metadata and when parsing img url itself we create 
 thumbnail.
 
 hth.
 Alex.
 
 
 
 
 
 
 
 -Original Message-
 From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu
 To: user user@nutch.apache.org
 Sent: Wed, Nov 28, 2012 2:58 pm
 Subject: Re: Access crawled content or parsed data of previous crawled url
 
 
 Any documentation about crawldb api? I'm guessing the it shouldn't be so hard 
 to
 retrieve a documento by it's url (which is basically what I need. I'm also 
 open
 to any suggestion on this matter, so If any one has done something similar or
 has any thoughts on this and can share it, I'll be very grateful.
 
 Greetings!
 
 - Mensaje original -
 De: Stefan Scheffler sscheff...@avantgarde-labs.de
 Para: user@nutch.apache.org
 Enviados: Miércoles, 28 de Noviembre 2012 15:04:44
 Asunto: Re: Access crawled content or parsed data of previous crawled url
 
 Hi,
 I think, this is possible, because you can write a ParserPlugin which
 access the allready stored documents via the segments- /crawldb api.
 But i´m not sure how it will work exactly.
 
 Regards
 Stefan
 
 Re
 Am 28.11.2012 20:59, schrieb Jorge Luis Betancourt Gonzalez:
  Hi:
 
  For what I've seen in nutch plugins exist the philosophy of one 
  NutchDocument
 per url, but I was wondering if there is any way of accessing parsed/crawled
 content of a previous fetched/parsed url, let's say for instance that I've a
 HTML page with an image embedded: So the start point will be
 http://host.com/test.html which is the first document that get's 
 fetched/parsed
 then the OutLink extractor will detect the embedded image inside test.html and
 then add 

RE: The topN parameter in nutch crawl

2012-11-29 Thread Markus Jelsma
Nutch does none of those. If scoring is used, the records to fetch are ordered by 
score, and if there is no score the list is simply sorted alphabetically. With some 
tuning to a scoring filter you can do whatever you want, but in the end 
everything is going to be crawled (if there are enough resources).

What are you trying to do? If you're not going to process many millions of 
records it doesn't really matter because all records will be fetched within a 
reasonable amount of time. 
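
For example (paths are placeholders):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000

creates one new segment containing at most 1000 URLs, taken from the top of the 
sorted CrawlDB; the next generate run simply selects the next eligible batch.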
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Thu 29-Nov-2012 22:45
 To: user@nutch.apache.org
 Subject: Re: The "topN" parameter in nutch crawl
 
 How would you characterize the crawling algorithm? Depth-first,
 breath-first, or some heuristic-based?
 
 On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Hi,
 
  None of all three. the topN-parameter simply means that the generator will
  select up to N records to fetch for each time it is invoked. It's best to
  forget the notion of depth in crawling, it has little meaning in most
  cases. Usually one will just continously crawl until there are no more
  records to fetch.
 
  We continously invoke the crawler and tell it to do something. If there's
  nothing to do (but that never happens) we just invoke it again the next
  time.
 
  Cheers,
 
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Thu 29-Nov-2012 21:58
   To: user user@nutch.apache.org
  Subject: The "topN" parameter in nutch crawl
  
   Dear list,
  
   This parameter is causing me some confusion. To me, there are at 3
  possible
   meanings for topN:
  
   1. The branching factor at a given node
   2. **the maximum number of pages that will be retrieved at each level up
   to the depth (from the wiki), which seems to refer to the total of
   branching factors at any given level
   3. The size of the entire frontier/queue
  
   To me, (1) makes the most sense, and (3) is the easiest to implement
   programming-wise.
  
   If (2) is the actual implementation in nutch, it means the effective
   branching factor would be lower at deeper levels, correct?
  
   In this sense, in order to conduct a comprehensive crawl, if we have to
   trade off between depth and topN, we should probably favor larger
   topN? In other words, -depth 5 -topN 1000 would make more sense than
   -depth 10 -topN 100 for a comprehensive crawl, correct?
  
   Thanks!
  
 
 


RE: The topN parameter in nutch crawl

2012-11-29 Thread Markus Jelsma

-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Thu 29-Nov-2012 23:33
 To: user@nutch.apache.org
 Subject: Re: The "topN" parameter in nutch crawl
 
 I'm not sure I completely understand.
 
 Typically when we think about the crawling problem as one of graph
 traversal, the # of nodes visited would some exponential function.
 
 Are you saying that this is not true, and if we specify somehting like
 -depth 5 -topN 100, we'll at most visit 500 nodes?

Yes. Nutch generates fetch lists from the CrawlDB, which is nothing more than a 
sorted list of URLs (by score, then alphabetically). It just picks the first 
eligible URLs from the sorted list. You really should take a good look at the 
Generator code; it'll answer most of your questions.



 
 On Thu, Nov 29, 2012 at 3:03 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Nutch does neither. If scoring is used the records to fetch are ordered by
  score and if there is no score it's simply sorted alphabetically. With some
  tuning to a scoring filter you can do whatever you want but in the end
  everything is going to be crawled (if there are enough resources).
 
  What are you trying to do? If you're not going to process many millions of
  records it doesn't really matter because all records will be fetched within
  a reasonable amount of time.
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Thu 29-Nov-2012 22:45
   To: user@nutch.apache.org
  Subject: Re: The "topN" parameter in nutch crawl
  
   How would you characterize the crawling algorithm? Depth-first,
   breath-first, or some heuristic-based?
  
   On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma
   markus.jel...@openindex.iowrote:
  
Hi,
   
None of all three. the topN-parameter simply means that the generator
  will
select up to N records to fetch for each time it is invoked. It's best
  to
forget the notion of depth in crawling, it has little meaning in most
cases. Usually one will just continously crawl until there are no more
records to fetch.
   
We continously invoke the crawler and tell it to do something. If
  there's
nothing to do (but that never happens) we just invoke it again the next
time.
   
Cheers,
   
   
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Thu 29-Nov-2012 21:58
 To: user user@nutch.apache.org
  Subject: The "topN" parameter in nutch crawl

 Dear list,

 This parameter is causing me some confusion. To me, there are at 3
possible
 meanings for topN:

 1. The branching factor at a given node
 2. **the maximum number of pages that will be retrieved at each
  level up
 to the depth (from the wiki), which seems to refer to the total of
 branching factors at any given level
 3. The size of the entire frontier/queue

 To me, (1) makes the most sense, and (3) is the easiest to implement
 programming-wise.

 If (2) is the actual implementation in nutch, it means the effective
 branching factor would be lower at deeper levels, correct?

 In this sense, in order to conduct a comprehensive crawl, if we
  have to
 trade off between depth and topN, we should probably favor larger
 topN? In other words, -depth 5 -topN 1000 would make more sense
  than
 -depth 10 -topN 100 for a comprehensive crawl, correct?

 Thanks!

   
  
 
 


RE: trunk

2012-11-27 Thread Markus Jelsma
Trunk is a directory in svn in which actual development is happening:
http://svn.apache.org/viewvc/nutch/trunk/
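
To check it out and build a working copy (the checkout URL follows the usual 
ASF repository layout):

svn co http://svn.apache.org/repos/asf/nutch/trunk/ nutch-trunk
cd nutch-trunk
ant

The build produces runtime/local, which you can use just like an extracted 
release.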
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Tue 27-Nov-2012 01:46
 To: user user@nutch.apache.org
 Subject: trunk
 
 In a different thread, Markus suggested checking out trunk.
 
 The relationship between trunk and svn has been confusing to me. Can
 somebody provide a link to a tutorial, and offer advice on how to access
 nutch trunk?
 
 Thanks.
 


RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-27 Thread Markus Jelsma
Hi - are you sure you have tabs separating the target and the mapped mimes? Use 
the nutch indexchecker tool to quickly test if it works. 
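
For example, with any URL from your crawl:

bin/nutch indexchecker http://www.uci.cu/

prints the document fields exactly as they would be sent to Solr, so you can 
see immediately whether the mime-type mapping is being applied.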
 
-Original message-
 From:Eyeris Rodriguez Rueda eru...@uci.cu
 Sent: Tue 27-Nov-2012 21:18
 To: user@nutch.apache.org
 Subject: RE: problem with text/html content type of documents appears 
 application/xhtml+xml in solr index
 
 Hi. Markus.
 I was doing your recommendations but, my problem persist, some documents 
 still with application/xhtml+xml instead of text/html.
 I add the property to nutch-site.xml and make the 
 conf/contenttype-mapping.txt file
  <property>
    <name>moreIndexingFilter.mapMimeTypes</name>
    <value>true</value>
  </property>
 
 I'm using nutch 1.5.1. Tell me if I need to replace index-more.jar in plugin 
 directory with any fixed version ?
 
 
 
 


RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-27 Thread Markus Jelsma
 :http://blogs.prod.uci.cu/humanOS
 outlinks :http://blogs.prod.uci.cu/micro
 outlinks :http://blogs.prod.uci.cu/nova/
 outlinks :http://coj.uci.cu/general/about.xhtml
 outlinks :http://pgs.soporte.uci.cu
 outlinks :http://portal.albet.prod.uci.cu
 outlinks :http://portal.calisoft.prod.uci.cu
 outlinks :http://portal.cdae.prod.uci.cu
 outlinks :http://portal.cedin.prod.uci.cu
 outlinks :http://portal.cegel.prod.uci.cu
 outlinks :http://portal.ceige.prod.uci.cu
 outlinks :http://portal.cenia.prod.uci.cu
 outlinks :http://portal.cesim.prod.uci.cu
 outlinks :http://portal.cice.prod.uci.cu
 outlinks :http://portal.cidi.prod.uci.cu
 outlinks :http://portal.cised.prod.uci.cu
 outlinks :http://portal.datec.prod.uci.cu
 outlinks :http://portal.dgp.prod.uci.cu
 outlinks :http://portal.dt.prod.uci.cu
 outlinks :http://portal.fortes.prod.uci.cu
 outlinks :http://portal.frcav.cav.uci.cu
 outlinks :http://portal.frgrm.grm.uci.cu
 outlinks :http://portal.frhab.hab.uci.cu
 outlinks :http://portal.geitel.prod.uci.cu
 outlinks :http://portal.geysed.prod.uci.cu
 outlinks :http://portal.hlg.uci.cu
 outlinks :http://portal.isec.prod.uci.cu
 outlinks :http://portal.tlm.prod.uci.cu
 outlinks :http://portal.vcl.uci.cu/
 outlinks :http://postgresql.uci.cu
 outlinks :http://www.redmine.org/
 outlinks :http://www.redmine.org/guide
 contentLength :   5280
 
 and this is the page code that i check with firefox.
 
 !DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.0 Transitional//EN 
 http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd;
 html xmlns=http://www.w3.org/1999/xhtml; xml:lang=en
 head
 meta http-equiv=content-type content=text/html; charset=utf-8 /
 titleComunidades UCI/title
 continue
 
 
 
 I need to replace index-more.jar plugin ?
 
 
 
 
 - Mensaje original -
 De: Markus Jelsma markus.jel...@openindex.io
 Para: user@nutch.apache.org
 Enviados: Martes, 27 de Noviembre 2012 15:33:20
 Asunto: RE: problem with text/html content type of documents appears 
 application/xhtml+xml in solr index
 
 Hi - are you sure you have tabs separating the target and the mapped mimes? 
 Use the nutch indexchecker tool to quickly test if it works. 
  
 -Original message-
  From:Eyeris Rodriguez Rueda eru...@uci.cu
  Sent: Tue 27-Nov-2012 21:18
  To: user@nutch.apache.org
  Subject: RE: problem with text/html content type of documents appears 
  application/xhtml+xml in solr index
  
  Hi. Markus.
  I was doing your recommendations but, my problem persist, some documents 
  still with application/xhtml+xml instead of text/html.
  I add the property to nutch-site.xml and make the 
  conf/contenttype-mapping.txt file
  <property>
    <name>moreIndexingFilter.mapMimeTypes</name>
    <value>true</value>
  </property>
  
  I'm using nutch 1.5.1. Tell me if I need to replace index-more.jar in 
  plugin directory with any fixed version ?
 
 


RE: Indexing-time URL filtering again

2012-11-26 Thread Markus Jelsma
Building from source with ant produces a local runtime in runtime/local, which 
is the same as what you get when you extract an official release.
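
For example:

ant clean
ant

The default target builds runtime/local (and runtime/deploy for the job file); 
afterwards run the usual commands from there, e.g.

cd runtime/local
bin/nutch readdb crawl/crawldb -stats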
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Mon 26-Nov-2012 22:23
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again
 
 yes that's wht i've been doing. but ant itself won't produce the official
 binary release.
 
 On Mon, Nov 26, 2012 at 2:16 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  just ant will do the trick.
 
 
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Mon 26-Nov-2012 22:03
   To: user@nutch.apache.org
   Subject: Re: Indexing-time URL filtering again
  
   talking about ant, after ant clean, which ant target should i use?
  
   On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma
   markus.jel...@openindex.iowrote:
  
I checked the code. You're probably not pointing it to a valid path or
perhaps the build is wrong and you haven't used ant clean before
  building
Nutch. If you keep having trouble you may want to check out trunk.
   
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Mon 26-Nov-2012 00:40
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again

 OK. I'm testing it. But like I said, even when I reduce the patterns
  to
the
 simpliest form -., the problem still persists.

 On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:

  It's taking input from stdin, enter some URL's to test it. You can
  add
an
  issue with reproducable steps.
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Sun 25-Nov-2012 23:49
   To: user@nutch.apache.org
   Subject: Re: Indexing-time URL filtering again
  
   I ran the regex tester command you provided. It seems to be
  taking
  forever
   (15 min + by now).
  
   On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang smartag...@gmail.com
  
wrote:
  
you mean the content my pattern file?
   
well, even wehn I reduce it to simply -., the same problem
  still
  pops up.
   
On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma 
  markus.jel...@openindex.io
 wrote:
   
You seems to have an NPE caused by your regex rules, for some
weird
reason. If you can provide a way to reproduce you can file an
issue in
Jira. This NPE should also occur if your run the regex tester.
   
nutch -Durlfilter.regex.file=path
  org.apache.nutch.net.URLFilterChecker
-allCombined
   
In the mean time you can check if a rule causes the NPE.
   
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 25-Nov-2012 23:26
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again

 the last few lines of hadoop.log:

 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters -
  Adding
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters -
  Adding
 org.apache.nutch.indexer.metadata.MetadataIndexer
 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner -
job_local_0001
 java.lang.RuntimeException: Error in configuring object
 at

   
 
   
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
 at

 
  org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
 at

   
 
   
  org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
 at
   
  org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
 at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
 at

 
   
  org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: java.lang.reflect.InvocationTargetException
 at
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native
  Method)
 at

   
 
   
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

   
 
   
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at

   
 
   
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
 ... 5 more
 Caused by: java.lang.RuntimeException: Error in configuring
object
 at

   
 
   
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
No, this is not a bug. As I said, you need to either patch your Nutch or get the 
sources from trunk. The -filter parameter is not in your version. Check the 
patch manual if you don't know how it works.

$ cd trunk ; patch -p0 < file.patch
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 25-Nov-2012 08:42
 To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again
 
 This does seem a bug. Can anybody help?
 
 On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang smartag...@gmail.com wrote:
 
  Markus, could you advise? Thanks a lot!
 
 
  On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang smartag...@gmail.com wrote:
 
  I followed your instruction and applied the patch, Markus, but the
  problem still persists --- -filter is interpreted as a path by solrindex.
 
  On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma 
  markus.jel...@openindex.io wrote:
 
  Ah, i get it now. Please use trunk or patch your version with:
  https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Fri 23-Nov-2012 03:08
   To: user@nutch.apache.org
   Subject: Re: Indexing-time URL filtering again
  
   But Markus said it worked for him. I was really he could send his
  command
   line.
  
   On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney 
   lewis.mcgibb...@gmail.com wrote:
  
Is this a bug?
   
On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang smartag...@gmail.com
  wrote:
 Putting -filter between crawldb and segments, I sitll got the same
  thing:

 org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist:
 file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
 Input path does not exist:
 file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
 Input path does not exist:
 file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
 Input path does not exist:
 file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text

 On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:

 These are roughly the available parameters:

  Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
  [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
  [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
  [-deleteSkippedByIndexingFilter] [-filter] [-normalize]

 Having -filter at the end should work fine, if it, for some
  reason,
 doesn't work put it before the segment and after the crawldb and
  file an
 issue in jira, it works here if i have -filter at the end.

 Cheers

 -Original message-
  From:Joe Zhang smartag...@gmail.com
  Sent: Thu 22-Nov-2012 23:05
  To: Markus Jelsma markus.jel...@openindex.io; user 
 user@nutch.apache.org
  Subject: Re: Indexing-time URL filtering again
 
  Yes, I forgot to do that. But still, what exactly should the
  command
 look like?
 
  bin/nutch solrindex  -Durlfilter.regex.file=UrlFiltering.txt
 http://localhost:8983/solr/ http://localhost:8983/solr/
  .../crawldb/
 /segments/*  -filter
  this command would cause nutch to interpret -filter as a path.
 
  On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma 
 markus.jel...@openindex.io mailto:markus.jel...@openindex.io 
  wrote:
  Hi,
 
  I just tested a small index job that usually writes 1200
  records to
 Solr. It works fine if i specify -. in a filter (index nothing)
  and
point
 to it with -Durlfilter.regex.file=path like you do.  I assume you
  mean
by
 `it doesn't work` that it filters nothing and indexes all records
  from
the
 segment. Did you forget the -filter parameter?
 
  Cheers
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com mailto:
  smartag...@gmail.com

   Sent: Thu 22-Nov-2012 07:29
   To: user user@nutch.apache.org mailto:user@nutch.apache.org
  
   Subject: Indexing-time URL filtering again
  
   Dear List:
  
   I asked a similar question before, but I haven't solved the
  problem.
   Therefore I try to re-ask the question more clearly and seek
  advice.
  
   I'm using nutch 1.5.1 and solr 3.6.1 together. Things work
  fine at
the
   rudimentary level.
  
   The basic problem I face in crawling/indexing is that I need
  to
control
   which pages the crawlers should VISIT (so far through
   nutch/conf/regex-urlfilter.txt)
   and which pages are INDEXED by Solr. The latter are only a
  SUBSET of
 the
   former, and they are giving me headache.
  
   A real-life example would be: when we crawl CNN.com, we only
  want to
 index
   real content pages such as
  
   
  http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Markus Jelsma
Hi - trunk's more indexing filter can map mime types to any target. With it you 
can map both (x)html mimes to text/html or to `web page`.

https://issues.apache.org/jira/browse/NUTCH-1262

 
 
-Original message-
 From:Eyeris Rodriguez Rueda eru...@uci.cu
 Sent: Sun 25-Nov-2012 00:48
 To: user@nutch.apache.org
 Subject: problem with text/html content type of documents appears 
 application/xhtml+xml in solr index
 
 Hi.
 
 I have changed my nutch version from 1.4 to 1.5.1 and I have detected a 
 problem with content type of some document, some pages with text/html appears 
 in solr index with application/xhtml+xml , when I check the links the 
 navegator tell me that efectively is text/html.
 Any body can help me to fix this problem, I think change this content type 
 manually in solr index to text/html but is not a good way for me.
 Please any suggestion or advice will be accepted.
 
 
 


RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
You should provide the log output. 
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 25-Nov-2012 17:27
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again
 
 I actually checked out the most recent build from SVN, Release 1.6 -
 23/11/2012.
 
 The following command
 
 bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
 http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
 crawl/segments/*  -filter
 
 produced the following output:
 
 SolrIndexer: starting at 2012-11-25 16:19:29
 SolrIndexer: deleting gone documents: false
 SolrIndexer: URL filtering: true
 SolrIndexer: URL normalizing: false
 java.io.IOException: Job failed!
 
 Can anybody help?
 On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang smartag...@gmail.com wrote:
 
  How exactly do I get to trunk?
 
  I did download download NUTCH-1300-1.5-1.patch, and run the patch command
  correctly, and re-build nutch. But the problem still persists...
 
  On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma markus.jel...@openindex.io
   wrote:
 
  No, this is no bug. As i said, you need either to patch your Nutch or get
  the sources from trunk. The -filter parameter is not in your version. Check
  the patch manual if you don't know how it works.
 
  $ cd trunk ; patch -p0  file.patch
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Sun 25-Nov-2012 08:42
   To: Markus Jelsma markus.jel...@openindex.io; user 
  user@nutch.apache.org
   Subject: Re: Indexing-time URL filtering again
  
   This does seem a bug. Can anybody help?
  
   On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang smartag...@gmail.com
  wrote:
  
Markus, could you advise? Thanks a lot!
   
   
On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang smartag...@gmail.com
  wrote:
   
I followed your instruction and applied the patch, Markus, but the
problem still persists --- -filter is interpreted as a path by
  solrindex.
   
On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma 
markus.jel...@openindex.io wrote:
   
Ah, i get it now. Please use trunk or patch your version with:
https://issues.apache.org/jira/browse/NUTCH-1300 to enable
  filtering.
   
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Fri 23-Nov-2012 03:08
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again

 But Markus said it worked for him. I was really he could send his
command
 line.

 On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

  Is this a bug?
 
  On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang 
  smartag...@gmail.com
wrote:
   Putting -filter between crawldb and segments, I sitll got the
  same
thing:
  
   org.apache.hadoop.mapred.InvalidInputException: Input path
  does not
  exist:
   file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
   Input path does not exist:
   file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
   Input path does not exist:
   file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
   Input path does not exist:
   file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
  
   On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
   markus.jel...@openindex.iowrote:
  
   These are roughly the available parameters:
  
   Usage: SolrIndexer solr url crawldb [-linkdb linkdb]
[-hostdb
   hostdb] [-params k1=v1k2=v2...] (segment ... | -dir
segments)
   [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
   [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
  
   Having -filter at the end should work fine, if it, for some
reason,
   doesn't work put it before the segment and after the crawldb
  and
file an
   issue in jira, it works here if i have -filter at the end.
  
   Cheers
  
   -Original message-
From:Joe Zhang smartag...@gmail.com
Sent: Thu 22-Nov-2012 23:05
To: Markus Jelsma markus.jel...@openindex.io; user 
   user@nutch.apache.org
Subject: Re: Indexing-time URL filtering again
   
Yes, I forgot to do that. But still, what exactly should
  the
command
   look like?
   
bin/nutch solrindex
   -Durlfilter.regex.file=UrlFiltering.txt
   http://localhost:8983/solr/ http://localhost:8983/solr/
.../crawldb/
   /segments/*  -filter
this command would cause nutch to interpret -filter as a
  path.
   
On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma 
   markus.jel...@openindex.io mailto:
  markus.jel...@openindex.io 
wrote:
Hi,
   
I just tested a small index job that usually writes 1200
records to
   Solr. It works fine if i specify -. in a filter (index
  nothing)
and
  point
   to it with -Durlfilter.regex.file=path like

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Markus Jelsma
Hi - you need to enable mime-type mapping in Nutch config and define your 
mappings. Enable it with:

  <property>
    <name>moreIndexingFilter.mapMimeTypes</name>
    <value>true</value>
  </property>

and add the following to your mapping config:

cat conf/contenttype-mapping.txt 
# Target content type TAB type1 [TAB type2 ...]
text/html   application/xhtml+xml

This will map application/xhtml+xml to text/html when indexing documents to 
Solr. You can configure any arbitrary target such as `web page` or `document` 
for various similar content types.

Trunk has this feature. You can either patch your version or check out from 
trunk and compile Nutch yourself. Patching is very simple:

$ cd trunk ; patch -p0 < file.patch


-Original message-
 From:Eyeris Rodriguez Rueda eru...@uci.cu
 Sent: Sun 25-Nov-2012 20:42
 To: user@nutch.apache.org
 Subject: RE: problem with text/html content type of documents appears 
 application/xhtml+xml in solr index
 
 Thanks a lot Markus for your answer. My English is not so good.
 I was reading but i don’t know how to fix the problems yet. Could you explain 
 me in details the solution please. I was looking in conf directory but I 
 can't find how to map one mime types to another. I need to replace index-more 
 plugin ? 
 I was looking in the link that you suggest me and a saw a 
 NUTCH-1262-1.5-1.patch but I don’t know how to use that patch.
 Please tell me if I need to delete the index completely or there is a way to 
 replace an application/xhtml+xml to text/html in solr index.
 
 
 
 
 -Mensaje original-
 De: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Enviado el: domingo, 25 de noviembre de 2012 4:33 AM
 Para: user@nutch.apache.org
 Asunto: RE: problem with text/html content type of documents appears 
 application/xhtml+xml in solr index
 
 Hi - trunk's more indexing filter can map mime types to any target. With it 
 you can map both (x)html mimes to text/html or to `web page`.
 
 https://issues.apache.org/jira/browse/NUTCH-1262
 
  
 -Original message-
  From:Eyeris Rodriguez Rueda eru...@uci.cu
  Sent: Sun 25-Nov-2012 00:48
  To: user@nutch.apache.org
  Subject: problem with text/html content type of documents appears 
  application/xhtml+xml in solr index
  
  Hi.
  
  I have changed my nutch version from 1.4 to 1.5.1 and I have detected a 
  problem with content type of some document, some pages with text/html 
  appears in solr index with application/xhtml+xml , when I check the links 
  the navegator tell me that efectively is text/html.
  Any body can help me to fix this problem, I think change this content type 
  manually in solr index to text/html but is not a good way for me.
  Please any suggestion or advice will be accepted.
 
 
 


RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
You seem to have an NPE caused by your regex rules, for some weird reason. If 
you can provide a way to reproduce it you can file an issue in Jira. This NPE 
should also occur if you run the regex tester:

nutch -Durlfilter.regex.file=<path> org.apache.nutch.net.URLFilterChecker 
-allCombined

In the meantime you can check whether a rule causes the NPE.
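
For example, to test a single URL against your rules (the checker reads URLs 
from stdin; the URL is just an example taken from this thread):

echo "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html" | \
  nutch -Durlfilter.regex.file=<path> org.apache.nutch.net.URLFilterChecker -allCombined

A leading + in the output means the URL is accepted, a leading - means it is 
filtered out.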
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 25-Nov-2012 23:26
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again
 
 the last few lines of hadoop.log:
 
 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
 org.apache.nutch.indexer.metadata.MetadataIndexer
 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
 java.lang.RuntimeException: Error in configuring object
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
 ... 5 more
 Caused by: java.lang.RuntimeException: Error in configuring object
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
 at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
 ... 10 more
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
 ... 13 more
 Caused by: java.lang.NullPointerException
 at java.io.Reader.init(Reader.java:78)
 at java.io.BufferedReader.init(BufferedReader.java:94)
 at java.io.BufferedReader.init(BufferedReader.java:109)
 at
 org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
 at
 org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
 at
 org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
 at org.apache.nutch.net.URLFilters.init(URLFilters.java:57)
 at
 org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
 ... 18 more
 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job
 failed!
 
 
 On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  You should provide the log output.
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Sun 25-Nov-2012 17:27
   To: user@nutch.apache.org
   Subject: Re: Indexing-time URL filtering again
  
   I actually checked out the most recent build from SVN, Release 1.6 -
   23/11/2012.
  
   The following command
  
   bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
   http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
   crawl/segments/*  -filter
  
   produced the following output:
  
   SolrIndexer: starting at 2012-11-25 16:19:29
   SolrIndexer: deleting gone documents: false
   SolrIndexer: URL filtering: true
   SolrIndexer: URL normalizing: false
   java.io.IOException: Job failed!
  
   Can anybody help?
   On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang smartag...@gmail.com wrote:
  
How exactly do I get to trunk?
   
I did download download NUTCH-1300-1.5-1.patch, and run the patch
  command
correctly, and re-build nutch. But the problem still persists...
   
On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma 
  markus.jel...@openindex.io
 wrote:
   
No, this is no bug. As i said, you need either to patch your Nutch or
  get
the sources from trunk. The -filter parameter is not in your version.
  Check
the patch manual

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
It's taking input from stdin; enter some URLs to test it. You can add an issue 
with reproducible steps.
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 25-Nov-2012 23:49
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again
 
 I ran the regex tester command you provided. It seems to be taking forever
 (15 min + by now).
 
 On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang smartag...@gmail.com wrote:
 
  you mean the content my pattern file?
 
  well, even wehn I reduce it to simply -., the same problem still pops up.
 
  On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma markus.jel...@openindex.io
   wrote:
 
  You seems to have an NPE caused by your regex rules, for some weird
  reason. If you can provide a way to reproduce you can file an issue in
  Jira. This NPE should also occur if your run the regex tester.
 
  nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker
  -allCombined
 
  In the mean time you can check if a rule causes the NPE.
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Sun 25-Nov-2012 23:26
   To: user@nutch.apache.org
   Subject: Re: Indexing-time URL filtering again
  
   the last few lines of hadoop.log:
  
   2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
   org.apache.nutch.indexer.anchor.AnchorIndexingFilter
   2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
   org.apache.nutch.indexer.metadata.MetadataIndexer
   2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
   java.lang.RuntimeException: Error in configuring object
   at
  
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
   at
   org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
   at
  
  org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
   at
  org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
   at
   org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
   Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
  
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at
  
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601)
   at
  
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
   ... 5 more
   Caused by: java.lang.RuntimeException: Error in configuring object
   at
  
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
   at
   org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
   at
  
  org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
   at
  org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
   ... 10 more
   Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
  
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at
  
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601)
   at
  
  org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
   ... 13 more
   Caused by: java.lang.NullPointerException
   at java.io.Reader.init(Reader.java:78)
   at java.io.BufferedReader.init(BufferedReader.java:94)
   at java.io.BufferedReader.init(BufferedReader.java:109)
   at
  
  org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
   at
  
  org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
   at
  
  org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
   at org.apache.nutch.net.URLFilters.init(URLFilters.java:57)
   at
  
  org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
   ... 18 more
   2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException:
  Job
   failed!
  
  
   On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
   markus.jel...@openindex.iowrote:
  
You should provide the log output.
   
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 25-Nov-2012 17:27
 To: user@nutch.apache.org
 Subject: Re: Indexing-time URL filtering again

 I actually checked out the most recent build from SVN, Release 1.6 -
 23/11/2012.

 The following command

 bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
 http://localhost:8983/solr/ crawl

RE: Indexing-time URL filtering again

2012-11-22 Thread Markus Jelsma
Hi,

I just tested a small index job that usually writes 1200 records to Solr. It 
works fine if I specify -. in a filter (index nothing) and point to it with 
-Durlfilter.regex.file=path like you do. I assume you mean by `it doesn't 
work` that it filters nothing and indexes all records from the segment. Did you 
forget the -filter parameter?
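
With a trunk or patched build where -filter is available (see NUTCH-1300), the 
full command would look roughly like this, with paths as examples:

bin/nutch solrindex -Durlfilter.regex.file=/path/to/index-filter.txt \
  http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/* -filter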

Cheers 
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Thu 22-Nov-2012 07:29
 To: user user@nutch.apache.org
 Subject: Indexing-time URL filtering again
 
 Dear List:
 
 I asked a similar question before, but I haven't solved the problem.
 Therefore I try to re-ask the question more clearly and seek advice.
 
 I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
 rudimentary level.
 
 The basic problem I face in crawling/indexing is that I need to control
 which pages the crawlers should VISIT (so far through
 nutch/conf/regex-urlfilter.txt)
 and which pages are INDEXED by Solr. The latter are only a SUBSET of the
 former, and they are giving me headache.
 
 A real-life example would be: when we crawl CNN.com, we only want to index
 real content pages such as
 http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
 When we start the crawling from the root, we can't specify tight
 patterns (e.g., +^http://([a-z0-9]*\.)*
 cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*) in nutch/conf/regex-urlfilter.txt,
 because the pages on the path between root and content pages do not satisfy
 such patterns. Putting such patterns in nutch/conf/regex-urlfilter.txt
 would severely jeopardize the coverage of the crawl.
 
 The closest solution I've got so far (courtesy of Markus) was this:
 
 nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
 
  but unfortunately I haven't been able to make it work for me. The content
 of the urlfilter.regex.file is what I thought correct --- something like
 the following:
 
 +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
 -.
 
 Everything seems quite straightforward. Am I doing anything wrong here? Can
 anyone advise? I'd greatly appreciate.
 
 Joe
 


RE: doubts about some propierties on nutch-site.xml file

2012-11-22 Thread Markus Jelsma
See: 
http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html

 
 
-Original message-
 From:Eyeris Rodriguez Rueda eru...@uci.cu
 Sent: Fri 23-Nov-2012 03:29
 To: user@nutch.apache.org
 Subject: doubts about some propierties on nutch-site.xml file
 
 Hi all.
 
 I have some doubts about some properties on nutch-site.xml file, i will 
 appreciated if anybody can explain me about it function in detail.
 
 Im using nutch 1.5.1 and solr 3.6.
 
 First
 <property>
   <name>db.fetch.schedule.adaptive.inc_rate</name>
   <value>0.4</value>
   <description>If a page is unmodified, its fetchInterval will be
   increased by this rate. This value should not
   exceed 0.5, otherwise the algorithm becomes unstable.</description>
 </property>
 
 In this case, how is the fetchInterval modified by this value? What does that 
 mean?
 *
 Second
 <property>
   <name>db.fetch.schedule.adaptive.sync_delta</name>
   <value>true</value>
   <description>If true, try to synchronize with the time of page change.
   by shifting the next fetchTime by a fraction (sync_rate) of the difference
   between the last modification time, and the last fetch time.</description>
 </property>
 I can't understand this property.
 **
 regards.
 
 


RE: Best practices for running Nutch

2012-11-19 Thread Markus Jelsma
Hi

-Original message-
 From:kiran chitturi chitturikira...@gmail.com
 Sent: Sun 18-Nov-2012 18:38
 To: user@nutch.apache.org
 Subject: Best practices for running Nutch
 
 Hi!
 
 I have been running crawls using Nutch for 13000 documents (protocol http)
 on a single machine and it goes on to take 2-3 days to get finished. I am
 using 2.x version of Nutch.
 
 I use a depth of 20 and topN of 1000 (2000) when i initiate the 'sh
 bin/nutch crawl -depth 20 -topN 1000'.
 
 I keep running in to Exceptions after one day. Sometimes its
 
 
- Memory Exception : Heap Space (after the parsing of the documents)

After parsing the documents? That should be during updatedb but are you sure? 
That job hardly ever runs out of memory. 

- Mysql Connection Error (because the crawler went on to fetch 10,000
documents after the command 'sh bin/nutch crawl -continue -depth 10 -topN
700' as the crawl failed because
 
 I increased the heap space and increased the timeout.
 
 I am wondering what are the best practices to run Nutch crawls. Is a full
 crawl a good thing to do or should i do it in steps (generate, fetch,
 parse, updatedb) ?

Separate steps are good for debugging and give you more control.  
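
A rough sketch of one cycle with the separate tools (this is the 1.x/trunk 
command syntax; the 2.x tools differ slightly, check bin/nutch for the exact 
usage of your version):

bin/nutch inject crawl/crawldb urls/
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s

Repeat generate/fetch/parse/updatedb as often as you need and index the 
segments afterwards.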

 
 Also how do i choose the value of the parameters, even if i give topN as
 700 the fetcher goes to fetch 3000 documents. What parameters have high
 impact on the running time of the crawl ?

Are you sure? The generator (at least in trunk) honors the topN parameter and 
will not generate more than specified. Keep in mind that using the crawl script 
and the depth parameter you're multiplying topN by depth.

 
 All these options might be system based and need not have general values
 which work for everyone.
 
 I am wondering what are things that Nutch Users and Developers follow here
 when running big crawls ?

What is a big crawl? 13,000 documents are very easy to manage on a very small 
machine running locally. If you're downloading from one or a few hosts it's 
expected to take a very long time due to crawler politeness: don't download 
faster than one page every 5 seconds unless you're allowed to or own the 
host. If you do own a host or are allowed to, you can decrease the delay or 
increase the number of threads per queue (host, domain or IP).
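
If you do own the hosts or have permission, these are the knobs I mean 
(nutch-site.xml, values only as an illustration):

<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>4</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>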

 
 Some of the exceptions come after 1 or 2 days of running the crawler, so
 its getting hard to know how to fix them before hand.

I'm not sure this applies to you because I don't know what you mean by `running 
crawler`; never run the fetcher for longer than an hour or so.

 Are there any common exceptions that Nutch can run in to frequently ?

The usual exceptions are network errors.

 
 Is there any documentation for Nutch practices ? I have seen people crawls
 go for a long time because of the filtering sometimes.

I'm not sure, but the best thing to do on this list is not to talk about the crawl 
as a whole (e.g. my crawl fails or takes too long) but to talk about the separate 
jobs. We can't tell what's wrong if someone only says a crawl is taking long, 
because a crawl consists of separate steps.

 
 Sorry for the long email.
 
 Thank you,
 -- 
 Kiran Chitturi
 


RE: custom plugin's constructor unable to access hadoop conf

2012-11-16 Thread Markus Jelsma
That's because the Configuration object is not yet set in the constructor. You can 
access it after setConf() has been called, so defer any configuration-dependent 
work from the constructor to this method:

  public void setConf(Configuration conf) {
this.conf = conf;
  }
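
A minimal sketch of what that looks like in a plugin (the class and property 
names are made up, it only shows the relevant parts):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class MyParseFilter implements HtmlParseFilter {

  private Configuration conf;
  private String[] tags;

  public MyParseFilter() {
    // conf is still null here, don't touch it
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // safe to read configuration properties now
    tags = conf.getStrings("myplugin.tags", "h1", "title");
  }

  public Configuration getConf() {
    return conf;
  }

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // use this.conf / tags here
    return parseResult;
  }
}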

 
 
-Original message-
 From:Sourajit Basak sourajit.ba...@gmail.com
 Sent: Fri 16-Nov-2012 11:28
 To: user@nutch.apache.org
 Subject: custom plugin's constructor unable to access hadoop conf
 
 In my custom HtmlParseFilter plugin, I am getting a NPE on trying to access
 the hadoop Configuration object in the plugin constructor. Is this a known
 behavior ?
 


RE: site-specific crawling policies

2012-11-16 Thread Markus Jelsma
You can override some URL filter paths in nutch-site.xml or with command line 
options (per tool), such as bin/nutch fetch -Durlfilter.regex.file=bla. You can 
also set NUTCH_HOME and keep everything separate if you're running it locally. 
On Hadoop you'll need separate job files.
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Fri 16-Nov-2012 18:35
 To: user@nutch.apache.org
 Subject: Re: site-specific crawling policies
 
 That's easy to do. But what about the configuration files? The same
 nutchs-site.xml, urlfiter files will be read.
 
 On Fri, Nov 16, 2012 at 3:28 AM, Sourajit Basak 
 sourajit.ba...@gmail.comwrote:
 
  Group related sites together and use separate crawldb, segment
  directories.
 
  On Fri, Nov 16, 2012 at 9:40 AM, Joe Zhang smartag...@gmail.com wrote:
 
   So how exactly do I set up different nutch instances then?
  
   On Thu, Nov 15, 2012 at 7:52 PM, Lewis John Mcgibbney 
   lewis.mcgibb...@gmail.com wrote:
  
Hi Joe,
   
In all honesty, it might sound slightly optimistic, it may also depend
upon the size and calibre of the different sites/domains but if you
are attempting a depth first, domain specific crawl, then maybe
separate Nutch instances will be your friend...
   
Lewis
   
   
On Thu, Nov 15, 2012 at 11:53 PM, Joe Zhang smartag...@gmail.com
   wrote:
 well, these are all details. The bigger question is, how to seperate
   the
 crawling policy of site A from that of site B?

 On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak 
sourajit.ba...@gmail.comwrote:

 You probably need to customize parse-metatags plugin.

 I think you go ahead and include all possible metatags. And take
  care
   of
 missing metatags in solr.

 On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang smartag...@gmail.com
wrote:

  I understand conf/regex-urlfilter.txt; I can put domain names into
   the
 URL
  patterns.
 
  But what about meta tags? What if I want to parse out different
  meta
tags
  for different sites?
 
  On Wed, Nov 14, 2012 at 1:33 AM, Sourajit Basak 
 sourajit.ba...@gmail.com
  wrote:
 
   1) For parsing  indexing customized meta tags enable 
  configure
 plugin
   parse-metatags
  
   2) There are several filters of url, like regex based. For
  regex,
the
   patterns are specified via conf/regex-urlfilter.txt
  
   On Wed, Nov 14, 2012 at 1:33 PM, Tejas Patil 
tejas.patil...@gmail.com
   wrote:
  
While defining url patterns, have the domain name in it so
  that
you
 get
site/domain specific rules. I don't know about configuring
  meta
tags.
   
Thanks,
Tejas
   
   
On Tue, Nov 13, 2012 at 11:34 PM, Joe Zhang 
   smartag...@gmail.com

   wrote:
   
 How to enforce site-specific crawling policies, i.e,
  different
URL
 patterns, meta tags, etc. for different websites to be
   crawled?
I
 got
   the
 sense that multiple instances of nutch are needed? Is it
correct?
 If
   yes,
 how?

   
  
 

   
   
   
--
Lewis
   
  
 
 


RE: re-Crawl re-fetch all pages each time

2012-11-15 Thread Markus Jelsma
Hi - this should not happen. The only thing I can imagine is that the update 
step doesn't succeed, but that would mean nothing is going to be indexed either. 
You can inspect a URL using the readdb tool; check it before and after.


-Original message-
 From:vetus ve...@isac.cat
 Sent: Thu 15-Nov-2012 15:41
 To: user@nutch.apache.org
 Subject: re-Crawl re-fetch all pages each time
 
 Hello,
 
 I have a problem...
 
 I'm trying to index a small domain, and I'm using
 org.apache.nutch.crawl.Crawler to do it. The problem, is that after the
 crawler has indexed all the pages of the domain, I execute the crawler
 again... and It fetch all the pages again althoug the fetch interval has not
 expired...
 This is wrong because it generates a lot of connections...
 
 I'm using the default config and this is the command that I execute:
 
 org.apache.nutch.crawl.Crawler  -depth 1 -threads 1 -topN 5
 
 Can you help me? please
 
 Thanks
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/re-Crawl-re-fetch-all-pages-each-time-tp4020464.html
 Sent from the Nutch - User mailing list archive at Nabble.com.
 


RE: adding custom metadata to CrawlDatum during parse

2012-11-14 Thread Markus Jelsma
Hi - Sure, check the db.parsemeta.to.crawldb configuration directive. 
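
A rough example (the key name is made up): in your parse filter set something 
like

  parse.getData().getParseMeta().set("myKey", "myValue");

and tell Nutch to carry that key over in nutch-site.xml:

<property>
  <name>db.parsemeta.to.crawldb</name>
  <value>myKey</value>
</property>

The value then ends up in the CrawlDatum's metadata in the crawldb after the 
parse/updatedb steps.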
 
-Original message-
 From:Sourajit Basak sourajit.ba...@gmail.com
 Sent: Wed 14-Nov-2012 08:10
 To: user@nutch.apache.org
 Subject: adding custom metadata to CrawlDatum during parse
 
 Is it possible to add custom metadata (preferably via plugins) to the
 CrawlDatum of the url during parse or its associated filter phases ?
 
 It seems you can do so if you parse along with fetch. That too will require
 modifications to Fetcher.java;
 Have I missed out any better way to accomplish ?
 
 Sourajit
 


RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Markus Jelsma
In trunk the modified time is based on whether or not the signature has 
changed. It makes little sense to rely on HTTP headers because almost no CMS 
implements them correctly, and they mess (or allow being messed with on purpose) 
with an adaptive schedule.

https://issues.apache.org/jira/browse/NUTCH-1341
 
 
-Original message-
 From:j.sulli...@thomsonreuters.com j.sulli...@thomsonreuters.com
 Sent: Tue 13-Nov-2012 11:13
 To: user@nutch.apache.org
 Subject: RE: How to find ids of pages that have been newly crawled or 
 modified after a given date with Nutch 2.1
 
 I think the modifiedTime comes from the http headers if available, if not it 
 is left empty.  In other words it is the time the content was last modified 
 according to the source if available and if not available it is left blank.  
 Depending on what Jacob is trying to achieve the one line patch at 
 https://issues.apache.org/jira/browse/NUTCH-1475 might be what he needs (or 
 might not be).
 
 James
 
 -Original Message-
 From: Ferdy Galema [mailto:ferdy.gal...@kalooga.com] 
 Sent: Tuesday, November 13, 2012 6:31 PM
 To: user@nutch.apache.org
 Subject: Re: How to find ids of pages that have been newly crawled or 
 modified after a given date with Nutch 2.1
 
 Hi,
 
 There might be something wrong with the field modifiedTime. I'm not sure how 
 well you can rely on this field (with the default or the adaptive scheduler).
 
 If you want to get to the bottom of this, I suggest debugging or running 
 small crawls to test the behaviour. In case something doesn't work as 
 expected, please repost here or open a Jira.
 
 Ferdy.
 
 On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk jacob.s...@gmail.com wrote:
 
  Hi,
 
  If this question has already been answered please forgive me and point 
  me to the appropriate thread.
 
  I'd like to be able to find the ids of all new pages crawled by nutch 
  or pages modified since a fixed point in the past.
 
  I'm using Nutch 2.1 with MySQL as the back-end and it seems like the 
  appropriate back-end query should be something like:
 
   select id from webpage where (prevFetchTime=null  fetchTime=X) 
  or (modifiedTime = X )
 
  where X is some point in the past.
 
  What I've found is that modifiedTime is always null.  I am using the
  adaptive scheduler and the default md5 signature class.   I've tried both
  re-injecting seed URLs as well as not, it seems to make no difference.
   modifiedTime remains null.
 
  I am most grateful for any help or advise.  If my nutc-hsite.xml fiel 
  would help I can forward it along.
 
  Thanks,
  jacob
 
 


RE: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Markus Jelsma
In trunk you can use the Inlink and Inlinks classes: the first for each inlink 
and the latter to add the Inlink objects to.

Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch"));

The inlink URL is the key in the key/value pair so you won't see that one.
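
For the dedup test, a rough sketch (it assumes the anchorIndexingFilter.deduplicate 
property and the trunk IndexingFilter API; put it inside a test method that 
declares throws Exception, the classes come from org.apache.nutch.crawl, 
org.apache.nutch.indexer(.anchor), org.apache.nutch.parse, org.apache.nutch.util 
and org.apache.hadoop.io):

Configuration conf = NutchConfiguration.create();
conf.setBoolean("anchorIndexingFilter.deduplicate", true);
AnchorIndexingFilter filter = new AnchorIndexingFilter();
filter.setConf(conf);

Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://example.org/1", "anchor text 1"));
inlinks.add(new Inlink("http://example.org/2", "anchor text 1"));
inlinks.add(new Inlink("http://example.org/3", "anchor text 2"));

NutchDocument doc = filter.filter(new NutchDocument(),
    new ParseImpl("text", new ParseData()), new Text("http://example.org/"),
    new CrawlDatum(), inlinks);
// with deduplication enabled you would expect only two values in the
// "anchor" field of doc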
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Mon 12-Nov-2012 16:29
 To: user@nutch.apache.org
 Subject: Simulating 2.x's page.putToInlinks() in trunk
 
 Hi,
 
 I'm attempting to test the AnchorIndexingFilter by adding numerous
 inlinks and their anchor text then check whether the deduplication is
 working sufficiently.
 
 Can someone show me how I simulate the following using the trunk API
 
 // This is 2.x API
 WebPage page = new WebPage();
 page.putToInlinks(new Utf8($inlink1), new Utf8($anchor_text1));
 page.putToInlinks(new Utf8($inlink2), new Utf8($anchor_text1));
 page.putToInlinks(new Utf8($inlink3), new Utf8($anchor_text2));
 
 If anchor deduplication is set to boolean true value then we could
 only allow two anchor entries for the page inlinks. I wish therefore
 to simulate this in trunk API using Inlinks, Inlink or
 NutchDocument.add function however I am stuck...
 
 Thank you very much in advance for any help.
 
 Best
 
 Lewis
 
 -- 
 Lewis
 


RE: very slow generator step

2012-11-12 Thread Markus Jelsma
Hi - Please use the -noFilter option. It is usually useless to filter in the 
generator because the URLs have already been filtered in the parse step and/or 
update step.
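
For example (crawldb and segments paths are whatever you use):

bin/nutch generate crawl/crawldb crawl/segments -topN 3000 -noFilter

There is also a -noNorm option if you want to skip normalization at generate 
time as well.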

 
 
-Original message-
 From:Mohammad wrk mhd...@yahoo.com
 Sent: Mon 12-Nov-2012 18:43
 To: user@nutch.apache.org
 Subject: very slow generator step
 
 Hi,
 
 The generator time has gone from 8 minutes to 106 minutes few days ago and 
 stayed there since then. AFAIK, I haven't made any configuration changes 
 recently (attached you can find some of the configurations that I thought 
 might be related). 
 
 A quick CPU sampling shows that most of the time is spent on 
 java.util.regex.Matcher.find(). Since I'm using default regex configurations 
 and my crawldb has only 3,052,412 urls, I was wondering if this is a known 
 issue with nutch-1.5.1 ?
 
 Here are some more information that might help:
 
 = Generator logs
 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting at 
 2012-11-09 03:14:50
 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting 
 best-scoring urls due for fetch.
 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering: true
 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: normalizing: true
 2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
 2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator: jobtracker is 
 'local', generating exactly one partition.
 2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator: Partitioning 
 selected urls for politeness.
 2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment: 
 segments/20121109032340
 2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished at 
 2012-11-09 03:23:47, elapsed: 00:08:56
 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting at 
 2012-11-09 05:35:14
 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting 
 best-scoring urls due for fetch.
 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering: true
 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: normalizing: true
 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
 2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator: jobtracker is 
 'local', generating exactly one partition.
 2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator: Partitioning 
 selected urls for politeness.
 2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment: 
 segments/20121109072143
 2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished at 
 2012-11-09 07:21:51, elapsed: 01:46:36
 
 = CrawlDb statistics
 CrawlDb statistics start: ./crawldb
 Statistics for CrawlDb: ./crawldb
 TOTAL urls:3052412
 retry 0:3047404
 retry 1:338
 retry 2:1192
 retry 3:822
 retry 4:336
 retry 5:2320
 min score:0.0
 avg score:0.015368268
 max score:48.608
 status 1 (db_unfetched):2813249
 status 2 (db_fetched):196717
 status 3 (db_gone):14204
 status 4 (db_redir_temp):10679
 status 5 (db_redir_perm):17563
 CrawlDb statistics: done
 
 = System info
 Memory: 4 GB
 CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4 
 Available diskspace: 171.7 GB
 OS: Release 12.10 (quantal) 64-bit
 
 
 Thanks,
 Mohammad
 


RE: very slow generator step

2012-11-12 Thread Markus Jelsma
You may need to change your expressions, but the automaton filter is very fast. 
Not all features of traditional regexes are supported:
http://wiki.apache.org/nutch/RegexURLFiltersBenchs
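
To try it, swap the regex filter for the automaton filter in plugin.includes 
(the value below is only an example of what that could look like) and port your 
expressions to automaton-urlfilter.txt:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-automaton|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

You can point it to a different file with urlfilter.automaton.file, analogous 
to urlfilter.regex.file.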

 
 
-Original message-
 From:Mohammad wrk mhd...@yahoo.com
 Sent: Mon 12-Nov-2012 22:17
 To: user@nutch.apache.org
 Subject: Re: very slow generator step
 
 
 
 That's a good thinking. I have never used url-filter automation. Where can I 
 find more info?
 
 Thanks,
 Mohammad
 
 
  From: Julien Nioche lists.digitalpeb...@gmail.com
 To: user@nutch.apache.org; Mohammad wrk mhd...@yahoo.com 
 Sent: Monday, November 12, 2012 12:38:44 PM
 Subject: Re: very slow generator step
  
 Could be that a particularly long and tricky URL got into your crawldb and
 put the regex into a spin. I'd use the url-filter automaton instead as it
 is much faster. Would be interesting to know what caused the regex to take
 so much time, in case you fancy a bit of debugging ;-)
 
 Julien
 
 On 12 November 2012 20:29, Mohammad wrk mhd...@yahoo.com wrote:
 
  Thanks for the tip. It went down to 2 minutes :-)
 
  What I don't understand is that how come everything was working fine with
  the default configuration for about 4 days and all of a sudden one crawl
  causes a jump of 100 minutes?
 
  Cheers,
  Mohammad
 
 
  
   From: Markus Jelsma markus.jel...@openindex.io
  To: user@nutch.apache.org user@nutch.apache.org
  Sent: Monday, November 12, 2012 11:19:11 AM
  Subject: RE: very slow generator step
 
  Hi - Please use the -noFilter option. It is usually useless to filter in
  the generator because they've already been filtered in the parse step and
  or update step.
 
 
 
  -Original message-
   From:Mohammad wrk mhd...@yahoo.com
   Sent: Mon 12-Nov-2012 18:43
   To: user@nutch.apache.org
   Subject: very slow generator step
  
   Hi,
  
   The generator time has gone from 8 minutes to 106 minutes few days ago
  and stayed there since then. AFAIK, I haven't made any configuration
  changes recently (attached you can find some of the configurations that I
  thought might be related).
  
   A quick CPU sampling shows that most of the time is spent on
  java.util.regex.Matcher.find(). Since I'm using default regex
  configurations and my crawldb has only 3,052,412 urls, I was wondering if
  this is a known issue with nutch-1.5.1 ?
  
   Here are some more information that might help:
  
   = Generator logs
   2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting at
  2012-11-09 03:14:50
   2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting
  best-scoring urls due for fetch.
   2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering:
  true
   2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: normalizing:
  true
   2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
   2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator: jobtracker is
  'local', generating exactly one partition.
   2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator: Partitioning
  selected urls for politeness.
   2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment:
  segments/20121109032340
   2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished at
  2012-11-09 03:23:47, elapsed: 00:08:56
   2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting at
  2012-11-09 05:35:14
   2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting
  best-scoring urls due for fetch.
   2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering:
  true
   2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: normalizing:
  true
   2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
   2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator: jobtracker is
  'local', generating exactly one partition.
   2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator: Partitioning
  selected urls for politeness.
   2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment:
  segments/20121109072143
   2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished at
  2012-11-09 07:21:51, elapsed: 01:46:36
  
   = CrawlDb statistics
   CrawlDb statistics start: ./crawldb
   Statistics for CrawlDb: ./crawldb
   TOTAL urls:3052412
   retry 0:3047404
   retry 1:338
   retry 2:1192
   retry 3:822
   retry 4:336
   retry 5:2320
   min score:0.0
   avg score:0.015368268
   max score:48.608
   status 1 (db_unfetched):2813249
   status 2 (db_fetched):196717
   status 3 (db_gone):14204
   status 4 (db_redir_temp):10679
   status 5 (db_redir_perm):17563
   CrawlDb statistics: done
  
   = System info
   Memory: 4 GB
   CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4
   Available diskspace: 171.7 GB
   OS: Release 12.10 (quantal) 64-bit
  
  
   Thanks,
   Mohammad
  
 
 
 
 
 -- 
 *
 *Open Source Solutions for Text

RE: Tika Parsing not working in the latest version of 2.X?

2012-11-08 Thread Markus Jelsma
Try cleaning your build. 
 
-Original message-
 From:j.sulli...@thomsonreuters.com j.sulli...@thomsonreuters.com
 Sent: Thu 08-Nov-2012 07:23
 To: user@nutch.apache.org
 Subject: Tika Parsing not working in the latest version of 2.X?
 
 Just tried the latest 2.X after being away for a while. Tika parsing doesn't 
 seem to be working.
 
 Exception in thread main java.lang.NoSuchMethodError: 
 org.apache.tika.mime.MediaType.set([Lorg/apache/tika/mime/MediaType;)Ljava/util/Set;
 at 
 org.apache.tika.parser.crypto.Pkcs7Parser.getSupportedTypes(Pkcs7Parser.java:52)
 at org.apache.nutch.parse.tika.TikaConfig.init(TikaConfig.java:149)
 at 
 org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:210)
 at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:203)
 at 
 org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
 at org.apache.nutch.parse.ParserFactory.getFields(ParserFactory.java:209)
 at org.apache.nutch.parse.ParserJob.getFields(ParserJob.java:193)
 at org.apache.nutch.fetcher.FetcherJob.getFields(FetcherJob.java:142)
 at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:184)
 at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:219)
 at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:301)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:307)
 Exception in thread main java.lang.NoSuchMethodError: 
 org.apache.tika.mime.MediaType.set([Lorg/apache/tika/mime/MediaType;)Ljava/util/Set;
 at 
 org.apache.tika.parser.crypto.Pkcs7Parser.getSupportedTypes(Pkcs7Parser.java:52)
 at org.apache.nutch.parse.tika.TikaConfig.init(TikaConfig.java:149)
 at 
 org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:210)
 at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:203)
 at 
 org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
 at org.apache.nutch.parse.ParserFactory.getFields(ParserFactory.java:209)
 at org.apache.nutch.parse.ParserJob.getFields(ParserJob.java:193)
 at org.apache.nutch.parse.ParserJob.run(ParserJob.java:245)
 at org.apache.nutch.parse.ParserJob.parse(ParserJob.java:259)
 at org.apache.nutch.parse.ParserJob.run(ParserJob.java:302)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.parse.ParserJob.main(ParserJob.java:306)
 
 


RE: URL filtering: crawling time vs. indexing time

2012-11-04 Thread Markus Jelsma
Just try it. With -D you can override Nutch and Hadoop configuration properties.



 
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 04-Nov-2012 06:07
 To: user user@nutch.apache.org
 Subject: Re: URL filtering: crawling time vs. indexing time
 
 Markus, I don't see -D as a valid command parameter for solrindex.
 
 On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma
 markus.jel...@openindex.iowrote:
 
  Ah, i understand now.
 
  The indexer tool can filter as well in 1.5.1 and if you enable the regex
  filter and set a different regex configuration file when indexing vs.
  crawling you should be good to go.
 
  You can override the default configuration file by setting
  urlfilter.regex.file and point it to the regex file you want to use for
  indexing. You can set it via nutch solrindex -Durlfilter.regex.file=/path
  http://solrurl/ ...
 
  Cheers
 
  -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Fri 02-Nov-2012 17:55
   To: user@nutch.apache.org
   Subject: Re: URL filtering: crawling time vs. indexing time
  
   I'm not sure I get it. Again, my problem is a very generic one:
  
   - The patterns in regex-urlfitler.txt, howevery exotic they are, they
   control ***which URLs to visit***.
   - Generally speaking, the set of ULRs to be indexed into solr is only a
   ***subset*** of the above.
  
   We need a way to specify crawling filter (which is regex-urlfitler.txt)
  vs.
   indexing filter, I think.
  
   On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux r...@teorem.fr wrote:
  
You have still several possibilities here :
1) find a way to seed the crawl with the URLs containing the links to
  the
leaf pages (sometimes it is possible with a simple loop)
2) create regex for each step of the scenario going to the leaf page,
  in
order to limit the crawl to necessary pages only. Use the $ sign at
  the end
of your regexp to limit the match of regexp like http://([a-z0-9]*\.)*
mysite.com.
   
   
Le 2 nov. 2012 à 17:22, Joe Zhang smartag...@gmail.com a écrit :
   
 The problem is that,

 - if you write regex such as: +^http://([a-z0-9]*\.)*mysite.com,
  you'll
end
 up indexing all the pages on the way, not just the leaf pages.
 - if you write specific regex for
 http://www.mysite.com/level1pattern/level2pattern/pagepattern.html,
  and
you
 start crawling at mysite.com, you'll get zero results, as there is
  no
match.

 On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma 
markus.jel...@openindex.iowrote:

 -Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Fri 02-Nov-2012 10:04
 To: user@nutch.apache.org
 Subject: URL filtering: crawling time vs. indexing time

 I feel like this is a trivial question, but I just can't get my
  ahead
 around it.

 I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at
  the
 rudimentary level.

 If my understanding is correct, the regex-es in
 nutch/conf/regex-urlfilter.txt control  the crawling behavior, ie.,
which
 URLs to visit or not in the crawling process.

 Yes.


 On the other hand, it doesn't seem artificial for us to only want
certain
 pages to be indexed. I was hoping to write some regular
  expressions as
 well
 in some config file, but I just can't find the right place. My
  hunch
 tells
 me that such things should not require into-the-box coding. Can
  anybody
 help?

 What exactly do you want? Add your custom regular expressions? The
 regex-urlfilter.txt is the place to write them to.


 Again, the scenario is really rather generic. Let's say we want to
crawl
 http://www.mysite.com. We can use the regex-urlfilter.txt to skip
loops
 and
 unncessary file types etc., but only expect to index pages with
  URLs
 like:
 http://www.mysite.com/level1pattern/level2pattern/pagepattern.html
  .

 To do this you must simply make sure your regular expressions can do
this.


 Am I too naive to expect zero Java coding in this case?

 No, you can achieve almost all kinds of exotic filtering with just
  the
URL
 filters and the regular expressions.

 Cheers


   
   
  
 
 


RE: timestamp in nutch schema

2012-11-04 Thread Markus Jelsma
Hi - the timestamp is just the time at which a page is indexed. It is not very 
useful except for deduplication. If you want to index some publishing date you 
must first identify the source of that date and get it out of the webpages. It's 
possible to use og:date or other meta tags, or perhaps other sources, but to 
do so you must create a custom parse filter.

Meta tags can be indexed without creating a custom parse filter. If you don't 
trust websites or need special (re)formatting or checking logic you need to 
make a parse filter for it.
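
If plain meta tag indexing is enough, something along these lines in 
nutch-site.xml should do it, together with parse-metatags and index-metadata 
added to plugin.includes (double-check the exact property names and separators 
against nutch-default.xml for your version, I'm writing this from memory):

<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>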

I've also built a date parsing filter to retrieve dates in various formats from 
free text, check Jira for a patch for the dateparsefilter. It's an older 
version but still works well.

-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Sun 04-Nov-2012 05:44
 To: user user@nutch.apache.org
 Subject: timestamp in nutch schema
 
 My understanding is that the timestamp stores crawling time. Is there any
 way to get nutch to parse out the publishing time of webpages and store
 such info in timestamp or some other field?
 


RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Fri 02-Nov-2012 10:04
 To: user@nutch.apache.org
 Subject: URL filtering: crawling time vs. indexing time
 
 I feel like this is a trivial question, but I just can't get my ahead
 around it.
 
 I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
 rudimentary level.
 
 If my understanding is correct, the regex-es in
 nutch/conf/regex-urlfilter.txt control  the crawling behavior, ie., which
 URLs to visit or not in the crawling process.

Yes.

 
 On the other hand, it doesn't seem artificial for us to only want certain
 pages to be indexed. I was hoping to write some regular expressions as well
 in some config file, but I just can't find the right place. My hunch tells
 me that such things should not require into-the-box coding. Can anybody
 help?

What exactly do you want? Add your custom regular expressions? The 
regex-urlfilter.txt is the place to write them to.

 
 Again, the scenario is really rather generic. Let's say we want to crawl
 http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and
 unncessary file types etc., but only expect to index pages with URLs like:
 http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.

To do this you must simply make sure your regular expressions can do this.

 
 Am I too naive to expect zero Java coding in this case?

No, you can achieve almost all kinds of exotic filtering with just the URL 
filters and the regular expressions.

Cheers
 


RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
Ah, I understand now.

The indexer tool can filter as well in 1.5.1 and if you enable the regex filter 
and set a different regex configuration file when indexing vs. crawling you 
should be good to go.

You can override the default configuration file by setting urlfilter.regex.file 
and point it to the regex file you want to use for indexing. You can set it via 
nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...

Cheers
 
-Original message-
 From:Joe Zhang smartag...@gmail.com
 Sent: Fri 02-Nov-2012 17:55
 To: user@nutch.apache.org
 Subject: Re: URL filtering: crawling time vs. indexing time
 
 I'm not sure I get it. Again, my problem is a very generic one:
 
 - The patterns in regex-urlfitler.txt, howevery exotic they are, they
 control ***which URLs to visit***.
 - Generally speaking, the set of ULRs to be indexed into solr is only a
 ***subset*** of the above.
 
 We need a way to specify crawling filter (which is regex-urlfitler.txt) vs.
 indexing filter, I think.
 
 On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux r...@teorem.fr wrote:
 
  You have still several possibilities here :
  1) find a way to seed the crawl with the URLs containing the links to the
  leaf pages (sometimes it is possible with a simple loop)
  2) create regex for each step of the scenario going to the leaf page, in
  order to limit the crawl to necessary pages only. Use the $ sign at the end
  of your regexp to limit the match of regexp like http://([a-z0-9]*\.)*
  mysite.com.
 
 
  Le 2 nov. 2012 à 17:22, Joe Zhang smartag...@gmail.com a écrit :
 
   The problem is that,
  
   - if you write regex such as: +^http://([a-z0-9]*\.)*mysite.com, you'll
  end
   up indexing all the pages on the way, not just the leaf pages.
   - if you write specific regex for
   http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and
  you
   start crawling at mysite.com, you'll get zero results, as there is no
  match.
  
   On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma 
  markus.jel...@openindex.iowrote:
  
   -Original message-
   From:Joe Zhang smartag...@gmail.com
   Sent: Fri 02-Nov-2012 10:04
   To: user@nutch.apache.org
   Subject: URL filtering: crawling time vs. indexing time
  
   I feel like this is a trivial question, but I just can't get my ahead
   around it.
  
   I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
   rudimentary level.
  
   If my understanding is correct, the regex-es in
   nutch/conf/regex-urlfilter.txt control  the crawling behavior, ie.,
  which
   URLs to visit or not in the crawling process.
  
   Yes.
  
  
   On the other hand, it doesn't seem artificial for us to only want
  certain
   pages to be indexed. I was hoping to write some regular expressions as
   well
   in some config file, but I just can't find the right place. My hunch
   tells
   me that such things should not require into-the-box coding. Can anybody
   help?
  
   What exactly do you want? Add your custom regular expressions? The
   regex-urlfilter.txt is the place to write them to.
  
  
   Again, the scenario is really rather generic. Let's say we want to
  crawl
   http://www.mysite.com. We can use the regex-urlfilter.txt to skip
  loops
   and
   unncessary file types etc., but only expect to index pages with URLs
   like:
   http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
  
   To do this you must simply make sure your regular expressions can do
  this.
  
  
   Am I too naive to expect zero Java coding in this case?
  
   No, you can achieve almost all kinds of exotic filtering with just the
  URL
   filters and the regular expressions.
  
   Cheers
  
  
 
 
 


RE: Information about compiling?

2012-11-01 Thread Markus Jelsma
Hi,

There are binary versions of 1.5.1 but not 2.x.
http://apache.xl-mirror.nl/nutch/1.5.1/

About the scripts: you have to build Nutch and then go to the runtime/local 
directory to run bin/nutch.
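
Roughly (the crawl arguments are just an example):

ant runtime
cd runtime/local
bin/nutch crawl urls -dir crawl -depth 3 -topN 5

where urls is a directory containing your seed list; ant runtime builds both 
runtime/local and runtime/deploy, and runtime/local/conf is the configuration 
that bin/nutch actually uses.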

Cheers
 
 
-Original message-
 From:Dr. Thomas Zastrow p...@thomas-zastrow.de
 Sent: Thu 01-Nov-2012 10:45
 To: user@nutch.apache.org
 Subject: Information about compiling?
 
 Dear all,
 
 I found the following tutorial on the web:
 
 http://wiki.apache.org/nutch/NutchTutorial
 
 It starts with a binary version of Nutch. Unfortunateley, I didn't  
 found any binary version, just the source code on the web page? So, I  
 downloaded the latest version and compiled it with ant. Everything  
 seems to work, but I'm a little bit confused about the paths and how I  
 should go on?
 
 Following the tutorial, I have to change some files, but they exist in  
 several versions:
 
   find . -iname regex-urlfilter.txt
 ./runtime/local/conf/regex-urlfilter.txt
 ./conf/regex-urlfilter.txt
 
 The same goes for the nutch command, I'm not sure which one is the  
 right one. When I execute /src/bin/nutch with the following parameters:
 
 ./nutch crawl /opt/crawls/ -dir /opt/crawls/ -depth 3 -topN 5
 
 I got an error which I understand that the script can not find the jar files:
 
 Exception in thread main java.lang.NoClassDefFoundError:  
 org/apache/nutch/crawl/Crawler
 Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawler
  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 Could not find the main class: org.apache.nutch.crawl.Crawler.   
 Program will exit.
 
 
 Any help would be nice ;-)
 
 Best regards and thank you for the software!
 
 Tom
 
 
 -- 
 Dr. Thomas Zastrow
 Süsser Str. 5
 72074 Tübingen
 
 www.thomas-zastrow.de
 


RE: [crawler-common] infoQ article Apache Nutch 2 Features and Product Roadmap

2012-11-01 Thread Markus Jelsma
Cheers! 
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Thu 01-Nov-2012 18:30
 To: user@nutch.apache.org
 Subject: Re: [crawler-common] infoQ article Apache Nutch 2 Features and 
 Product Roadmap
 
 Nice one Julien. Its nothing short of a privilege to be part of the various
 communities and working alongside you guys.
 
 Have a great night.
 
 Lewis
 
 On Thu, Nov 1, 2012 at 11:39 AM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
 
  Hi all,
 
  Apologies for cross posting. Srini Penchikala has just published an
  interview with me about Nutch 2 on InfoQ at
  http://www.infoq.com/articles/nioche-apache-nutch2. Several projects are
  mentioned in relation to Nutch, hence the CC.
 
  The views and opinions expressed are entirely mine and do not reflect any
  official position of the Nutch PMC ;-)
 
  Thanks
 
  Julien
 
  --
  *
  *Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
   --
  You received this message because you are subscribed to the Google Groups
  crawler-commons group.
  Visit this group at
  http://groups.google.com/group/crawler-commons?hl=en-US.
 
 
 
 
 
 
 -- 
 *Lewis*
 


RE: fetch time

2012-10-27 Thread Markus Jelsma
Hi - Yes, the fetch time is the time when the record is eligible for fetch 
again.
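
For a single record you can also do something like

  bin/nutch readdb crawl/crawldb -url http://www.example.com/

(path and URL are placeholders) which prints the CrawlDatum including its fetch 
time and fetch interval.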

Cheers,

 
 
-Original message-
 From:Stefan Scheffler sscheff...@avantgarde-labs.de
 Sent: Sat 27-Oct-2012 14:49
 To: user@nutch.apache.org
 Subject: fetch time
 
 Hi,
 When i dump out the crawl db, there is a fetch entry for each url, which 
 is over one month in the future...
 
 Fetch time: Mon Nov 26 06:09:43 CET 2012
 
 Does this mean, this is the next time of fetching?
 
 Regards Stefan
 


RE: Format of content file in segments?

2012-10-27 Thread Markus Jelsma
Hi Морозов,

It's a directory containing Hadoop map file(s) that store key/value pairs. 
Hadoop's Text class is the key and Nutch's Content class is the value. You 
need Hadoop to easily process the files:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup
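
A rough sketch of reading it outside of Nutch (paths are an example; note that 
bin/nutch readseg -dump does the same thing more conveniently):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class DumpContent {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // the MapFile's data file inside the segment's content directory
    Path data = new Path("crawl/segments/20121027120000/content/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content content = new Content();
    while (reader.next(url, content)) {
      System.out.println(url + "\t" + content.getContentType()
          + "\t" + content.getContent().length + " bytes");
    }
    reader.close();
  }
}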

Cheers,
Markus
 
 
-Original message-
 From:Морозов Евгений ant...@yandex.ru
 Sent: Sat 27-Oct-2012 18:32
 To: user@nutch.apache.org
 Subject: Format of "content" file in segments?
 
 Where can I find the format of the content file in a segment directory?
 Either source code or documentation. I'm looking at reading it with a
 program external to nutch.
 
 regards, keanta
 


RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi,

You cannot recover the mapper output as far as I know. But anyway, one should 
never have a fetcher running for three days. It's far better to generate a 
large amount of smaller segments and fetch them sequentially. If an error 
occurs, only a small portion is affected. We never run fetchers for more than 
one hour, instead we run many in a row and sometimes concurrently.
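
For example, you can cut the fetch list into many small segments in one go and 
put a hard cap on the fetch time (values are only an illustration):

bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -maxNumSegments 20

and in nutch-site.xml:

<property>
  <name>fetcher.timelimit.mins</name>
  <value>60</value>
</property>

so a fetch job gives up on whatever is left in its queues after an hour instead 
of running for days.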

Cheers,

 
-Original message-
 From:Mohammad wrk mhd...@yahoo.com
 Sent: Fri 26-Oct-2012 00:47
 To: user@nutch.apache.org
 Subject: How to recover data from /tmp/hadoop-myuser
 
 Hi,
 
 
 
 My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) failed, 
 after 3 days, with the error below. Under the segment folder 
 (./segments/20121021205343/) there is only generated fetch list 
 (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of data. 
 I was wondering if there is a way to recover this data and parse the segment?
 
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
 valid local directory for output/file.out
 
         at 
 org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
         at 
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
         at 
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
         at 
 org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
         at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
         at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
         at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: 
 Job failed!
         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
 
 
 Thanks,
 Mohammad


RE: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread Markus Jelsma
Hi,
 
-Original message-
 From:kiran chitturi chitturikira...@gmail.com
 Sent: Thu 25-Oct-2012 20:49
 To: user@nutch.apache.org
 Subject: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type 
 application/pdf
 
 Hi,
 
 i have built Nutch 2.x in eclipse using this tutorial (
 http://wiki.apache.org/nutch/RunNutchInEclipse) and with some modifications.
 
 Its able to parse html files successfully but when it comes to pdf files it
 says 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't retrieve Tika
 parser for mime-type application/pdf
 
 Is there anything wrong with my eclipse configuration? I am looking to
 debug some  things in nutch, so i am working with eclipse and nutch.
 
 Do i need to point any libraries for eclipseto recognize tika parsers for
 application/pdf type ?
 
 What exactly is the reason for this type of error to appear for only pdf
 files and not html files ? I am using recent nutch 2.x which has tika
 upgraded to 1.2

This is possible if the PDFBox dependency is not found anywhere or is wrongly 
mapped in Tika's plugin.xml. The above error can also happen if you happen to 
have a tika-parsers-VERSION.jar in your runtime/local/lib directory, for some 
strange reason.

 
 I would like some help here and would like to know if anyone has
 encountered similar problem with eclipse, nutch 2.x and parsing
 application/pdf files ?
 
 Many Thanks,
 -- 
 Kiran Chitturi
 


RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi - there's a similar entry already; however, the fetcher.done part doesn't 
seem to be correct. I can see no reason why that would ever work, as Hadoop temp 
files are simply not copied to the segment if the job fails. There's also no notion 
of a fetcher.done file in trunk.

http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F

 
 
-Original message-
 From:Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Sent: Fri 26-Oct-2012 15:15
 To: user@nutch.apache.org
 Subject: Re: How to recover data from /tmp/hadoop-myuser
 
 I really think this should be in the FAQ's?
 
 http://wiki.apache.org/nutch/FAQ
 
 On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma
 markus.jel...@openindex.io wrote:
  Hi,
 
  You cannot recover the mapper output as far as i know. But anyway, one 
  should never have a fetcher running for three days. It's far better to 
  generate a large amount of smaller segments and fetch them sequentially. If 
  an error occurs, only a small portion is affected. We never run fetchers 
  for more than one hour, instead we run many in a row and sometimes 
  concurrently.
 
  Cheers,
 
 
  -Original message-
  From:Mohammad wrk mhd...@yahoo.com
  Sent: Fri 26-Oct-2012 00:47
  To: user@nutch.apache.org
  Subject: How to recover data from /tmp/hadoop-myuser
 
  Hi,
 
 
 
  My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) 
  failed, after 3 days, with the error below. Under the segment folder 
  (./segments/20121021205343/) there is only generated fetch list 
  (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of 
  data. I was wondering if there is a way to recover this data and parse the 
  segment?
 
  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
  valid local directory for output/file.out
 
  at 
  org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
  at 
  org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
  at 
  org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
  at 
  org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
  at 
  org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
  at 
  org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
  at 
  org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
  2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: 
  java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
  at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
 
 
  Thanks,
  Mohammad
 
 
 
 -- 
 Lewis
 


RE: RegEx URL Normalizer

2012-10-22 Thread Markus Jelsma
Hi,

Check the normalizer at the bottom of the template; it uses the lookbehind 
operator to remove duplicate slashes while leaving the two slashes of the 
protocol alone.

Cheers,

http://svn.apache.org/viewvc/nutch/trunk/conf/regex-normalize.xml.template?view=markup
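
From memory the rule looks roughly like this (&lt; being the XML escape for <; 
check the template itself for the exact pattern):

<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>

i.e. collapse a run of slashes to a single slash unless it directly follows a 
colon, which keeps the :// of the protocol intact. The same pattern/substitution 
format is what you would use for your own conditional rules in 
regex-normalize.xml.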
 
 
-Original message-
 From:Magnús Skúlason magg...@gmail.com
 Sent: Mon 22-Oct-2012 00:34
 To: user@nutch.apache.org
 Cc: dkavr...@gmail.com; Markus Jelsma markus.jel...@openindex.io
 Subject: Re: RegEx URL Normalizer
 
 Hi,
 
 I am interested in doing this i.e. only strip out parameters from url
 if some other string is found as well, in my case it will be a domain
 name. I am using 1.5.1 but I am unfamiliar with the look-behind
 operator.
 
 Does anyone have a sample of how this is done?
 
 best regards,
 Magnus
 
 On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke
 alexander.fahlke.mailingli...@googlemail.com wrote:
  Thanks guys!
 
  @Dinçer: This does not check if the URL contains document.py. :(
 
  @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
  RegexURLNormalizer. ;)
 
--  regexNormalize(String urlString, String scope) { ...
 
It now simple stupid checks if urlString contains document.py and then
  cuts out the unwanted stuff.
I made this is even configurable via nutch-site.xml.
 
 
  Nutch 1.4 would be better for this. Maybe in the next project.
 
 
  BR
 
  On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal dkavr...@gmail.com wrote:
 
  Hi Alexander,
 
  Would this one work? (I am far away from a Nutch installation to test)
 
  (?:[?](?:Date|Sort|Page|pos|anz)=[^?]+|([?](?:Name|Art|Blank|nr)=[^?]*))
 
 Don't forget to use &amp; instead of & in the regex.
 
  Best,
  Dinçer
 
 
  2011/9/5 Alexander Fahlke alexander.fahlke.mailingli...@googlemail.com
 
  Hi!
 
  I have problems with the right setup of the RegExURLNormalizer. It should
  strip out some parameters for a specific script.
  Only pages where document.py is present should be normalized.
 
  Here is an example:
 
   Input:
 
  http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
   Output:
 
  http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
 
  Date, Sort, Page, pos, anz are the parameters to be stripped out.
 
  I tried it with the following setup:
 
   ([;_]?((?i)l|j|bv_)?((?i)date|
  sort|page|pos|anz)=.*?)(\?|&|#|$)
 
 
  How to tell nutch to use this regex only for pages with document.py?
 
 
  BR
 
  --
  Alexander Fahlke
  Software Development
  www.informera.de
 
 
 
 
 
  --
  Alexander Fahlke
  Software Development
  www.informera.de
 


RE: Best practice to index a large crawl through Solr?

2012-10-22 Thread Markus Jelsma
Hi - Hadoop can write more records per second than Solr can analyze and store, 
especially with multiple reducers (threads in Solr). SolrCloud is notoriously 
slow when it comes to indexing compared to a stand-alone setup. However, this 
should not be a problem at all as you're not dealing with millions of records. 
Trying to tie HBase as a backend to Solr is not a good idea at all. The best 
and fastest storage for Solr is a local disk with MMapDirectory enabled (the 
default in recent versions) and plenty of RAM. Keep in mind that Solr keeps 
several parts of the index in memory, and more if it can, and it is very 
efficient at doing that.

With only a few million records it's easy and fast enough to run Hadoop locally 
(or pseudo if you can) and have a single Solr node running.
 
-Original message-
 From:Thilina Gunarathne cset...@gmail.com
 Sent: Mon 22-Oct-2012 22:35
 To: user@nutch.apache.org
 Subject: Re: Best practice to index a large crawl through Solr?
 
 Hi Alex,
 Thanks again for the information.
 
 My current requirement is to implement a  simple searching application for
 a publication. Our current data sizes probably would not exceed the amount
 of records you mentioned and for now, we should be fine with a single Solr
 instance. I'm going to check out the SolrCloud for our future needs.
 
 Hm, so you are thinking Nutch - HBase - Solr - HBase, that does
 sound pretty crazy.
 I agree :).. Unfortunately (or may be luckily) I do not have much time to
 invest on this and I'll probably have to rely on the existing tools, rather
 than trying to reinvent the wheels :)..
 
 thanks,
 Thilina
 
 
 On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres 
 acace...@hyperiongray.com wrote:
 
  No problem. Wrt to your first question, Solr would actually be storing
  this data locally. Solr sharding actually uses its own mechanism
  called SolrCloud. I'd recommend checking it out here:
  http://wiki.apache.org/solr/SolrCloud, it seems cool though I have not
  used it myself.
 
  Hm, so you are thinking Nutch - HBase - Solr - HBase, that does
  sound pretty crazy. You can most definitely find a more efficient way
  to do this, either by going to HBase directly from the start (I
  wouldn't do so personally) or just using Solr. It might be good to
  know what kind of application you are looking to build and asking more
  specifically.
 
  Alex
 
  On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne cset...@gmail.com
  wrote:
   Hi Alex,
   Thanks for the very fast response :)..
  
   It sort of depends on your purpose and the amount of data. I currently
   have a single Solr instance (~1GB of memory, 2 processors on the
   server) serving almost ~3,700,000 records from Nutch and it's still
   working great for me. If you have around that I'd say a single Solr
   instance is OK, depending on if you are planning on making your data
   publicly available or not.
  
   This is very useful information. In this case, would the Solr instance be
   retrieving and storing all the data locally or is it still using the
  Nutch
   data store to retrieve the actual content while serving the queries?
  
  
   If you're creating something larger of some sort, Solr 4.0, which
   supports sharding natively would be a great option (I think it's still
   in Beta, but if you're feeling brave...). This is especially true if
   you are creating a search engine of some sort, or would like easily
   searchable data.
  
   That's interesting. I'll check that out. By any chance, do you know
  whether
   the Solr sharding is using the HDFS to store the data or is it using it's
   own infrastructure?
  
  
   I would imagine doing this directly from HBase would not be a great
   option, as Nutch is storing the data in the format that is convenient
   for Nutch itself to use, and not so much in a format that it is
   friendly for you to reuse for your own purposes.
  
   I was actually thinking  of a scenario where we would use Solr to index
  the
   data and storing the resultant index in HBase.  Then using the HBase
   directly to perform simple index lookups..  Please pardon my lack of
   knowledge on Nutch and Solr, if the above sounds ludicrous :)..
  
   thanks,
   Thilina
  
  
   IMO your best bet is going to try out Solr 4.0.
  
   Alex
  
   On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne cset...@gmail.com
   wrote:
Dear All,
What would be the best practice to index a large crawl using Solr? The
crawl is performed on a multi node Hadoop cluster using HBase as the
  back
end.. Would Solr become a bottleneck if we use just a single Solr
   instance?
 Is it possible to store the indexed data on HBase and to serve them
  from
the HBase it self?
   
thanks a lot,
Thilina
   
--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
  
  
  
   --
   ___
  
   Alejandro Caceres
   Hyperion Gray, LLC
   Owner/CTO
  
  
  
  
   --
   

RE: Best practice to index a large crawl through Solr?

2012-10-22 Thread Markus Jelsma
Hi 
 
-Original message-
 From:Thilina Gunarathne cset...@gmail.com
 Sent: Tue 23-Oct-2012 00:38
 To: user@nutch.apache.org
 Subject: Re: Best practice to index a large crawl through Solr?
 
 Hi Markus,
 Thanks a lot for the info.
 
 Hi - Hadoop can write more records per second than Solr can analyze and
  store,  especially with multiple reducers (threads in Solr). SolrCloud is
  notoriously slow when it comes to indexing compared to a stand-alone setup.
 
 Can this be overcome by using the Nutch Solrindex job for indexing? In
 other words, does the Solr becomes a bottleneck for the SolrIndex job?

Nutch trunk can only write to a single Solr URL, and if you have more than a few 
reducers Solr is the bottleneck. But that should not be a problem when dealing 
with a few million records. It is a matter of minutes.

 
 Out of curiosity, does SolrCloud supports any data locality when loading
 data from Nutch? For an example, if I'm co-locating SolrCloud on the same
 nodes that are running Hadoop/HBase, can SolrCloud work with the local
 region servers to load the data?  Eventually, we would have to process
 millions of records and I'm just wondering whether the communication
 between Nutch and Solr would be a huge bottleneck.

Data locality is more of a thing for distributed processing: moving the program 
to the data on the assumption that it's cheaper in terms of bandwidth. That 
does not apply to SolrCloud; it works with hash ranges based on your ID and 
then routes documents to a specific shard (see the SolrCloud wiki page referred 
to in this thread). If you want a stable and performing Nutch and Solr cluster you 
must separate them. Both have specific resource requirements and should not run 
on the same node. If you mix them, it is hard to provide a reliable service.

We operate one Nutch cluster and several Solr clusters with a lot of documents 
and don't worry about the bottleneck. Based on my experience I think you 
should not worry too much at this point about Solr being an indexing 
bottleneck; you can scale out if it becomes a problem.

A significant improvement for very large-scale indexing from a Nutch cluster to 
a SolrCloud cluster is NUTCH-1377, but it's tedious to implement. We don't yet 
need it because the bottleneck is insignificant for now, even with many millions 
of documents. Unless you are going to work with A LOT of records this should not 
be a big problem for the next few months.

https://issues.apache.org/jira/browse/NUTCH-1377

 
 thanks,
 Thilina
 
 
  However, this should not be a problem at all as you're not dealing with
  millions of records. Trying to tie HBase as a backend to Solr is not a good
  idea at all. The best and fastest storage for Solr is a local disk with
  MMapDirectory enabled (the default in recent versions) and plenty of RAM.
  Keep in mind that Solr keeps several parts of the index in memory, and
  others when it can, and it is very efficient at doing that.
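
For illustration, a rough solrconfig.xml sketch of pinning the directory
implementation; solr.MMapDirectoryFactory is assumed to be available in your
Solr 4.x build, and the stock default (NRTCachingDirectoryFactory) already
memory-maps the index on 64-bit JVMs, so this is usually not needed:

  <!-- solrconfig.xml: force memory-mapped index storage (assumes Solr 4.x) -->
  <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>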
 
  With only a few million records it's easy and fast enough to run Hadoop
  locally (or pseudo-distributed if you can) and have a single Solr node running.
 
  -Original message-
   From:Thilina Gunarathne cset...@gmail.com
   Sent: Mon 22-Oct-2012 22:35
   To: user@nutch.apache.org
   Subject: Re: Best practice to index a large crawl through Solr?
  
   Hi Alex,
   Thanks again for the information.
  
   My current requirement is to implement a simple searching application
   for a publication. Our current data sizes probably would not exceed the
   amount of records you mentioned, and for now we should be fine with a
   single Solr instance. I'm going to check out SolrCloud for our future needs.
  
   Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
   sound pretty crazy.
   I agree :).. Unfortunately (or maybe luckily) I do not have much time to
   invest in this and I'll probably have to rely on the existing tools,
   rather than trying to reinvent the wheel :)..
  
   thanks,
   Thilina
  
  
   On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres 
   acace...@hyperiongray.com wrote:
  
 No problem. With regard to your first question, Solr would actually be storing
 this data locally. Solr sharding actually uses its own mechanism
 called SolrCloud. I'd recommend checking it out here:
 http://wiki.apache.org/solr/SolrCloud; it seems cool, though I have not
 used it myself.
   
 Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
 sound pretty crazy. You can most definitely find a more efficient way
 to do this, either by going to HBase directly from the start (I
 wouldn't do so personally) or just using Solr. It might be good to
 know what kind of application you are looking to build and to ask more
 specific questions.
   
Alex
   
On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne cset...@gmail.com
  
wrote:
 Hi Alex,
 Thanks for the very fast response :)..

 It sort of depends on your purpose and the amount of 

RE: Fetcher Thread

2012-10-18 Thread Markus Jelsma
Hi Ye,
 
-Original message-
 From:Ye T Thet yethura.t...@gmail.com
 Sent: Thu 18-Oct-2012 15:46
 To: user@nutch.apache.org
 Subject: Fetcher Thread
 
 Hi Folks,
 
 I have two questions about the fetcher threads in Nutch. The value of
 fetcher.threads.fetch in the configuration file determines the number of
 threads Nutch will use to fetch. Of course threads.per.host is also
 used for politeness.
 
 I set 100 for fetcher.threads.fetch and 2 for threads.per.host. So far
 in my development I have been using only one Linux box to fetch, so it
 is clear that Nutch would fetch 100 URLs at a time provided that the
 threads.per.host criterion is met.
 
 The questions are:
 
 1. What if I crawl on a Hadoop cluster with 5 Linux boxes and set
 fetcher.threads.fetch to 100? Would Nutch fetch 100 URLs at a time or 500 (5 x
 100) at a time?

All nodes are isolated and don't know what the others are doing. So if you set 
the threads to 100 for each machine, each machine will run with 100 threads.

 
 2. Any advice on formulating the optimum fetcher.threads.fetch and
 threads.per.host for a Hadoop cluster with 5 Linux boxes (Amazon EC2 medium
 instances, 3.7 GB memory)? I would be crawling around 10,000 (10k) web
 sites.

I think threads per host must not exceed 1 for most websites, out of politeness. 
You can set the number of fetch threads as high as you like; it only takes more 
memory. If you parse in the fetcher as well, you can run far fewer threads.
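
As a sketch, the corresponding nutch-site.xml could look like the snippet below; 
the property names follow this thread, and newer Nutch releases use 
fetcher.threads.per.queue instead of fetcher.threads.per.host, so check 
conf/nutch-default.xml for your version:

  <!-- nutch-site.xml: many fetch threads overall, 1 per host for politeness -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>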

 
 Thanks,
 
 Ye
 

