RE: adding custom metadata to CrawlDatum during parse

2012-11-14 Thread Markus Jelsma
Hi - Sure, check the db.parsemeta.to.crawldb configuration directive. 
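For illustration, a parse filter along these lines can put a value into the parse
metadata; if that key is then listed in db.parsemeta.to.crawldb it should be carried
over to the CrawlDatum. This is only a rough sketch: the key "myplugin.category" is
made up for the example and the extension-point signature can differ slightly
between Nutch versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class CategoryParseFilter implements HtmlParseFilter {

  private Configuration conf;

  // Store a custom value in the parse metadata; "myplugin.category" is a
  // made-up key used here for illustration only.
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse != null) {
      Metadata parseMeta = parse.getData().getParseMeta();
      parseMeta.set("myplugin.category", "example-value");
    }
    return parseResult;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

With db.parsemeta.to.crawldb set to myplugin.category in nutch-site.xml, the key
should show up in the CrawlDatum metadata after the next update of the CrawlDb.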
 
-Original message-
> From:Sourajit Basak 
> Sent: Wed 14-Nov-2012 08:10
> To: user@nutch.apache.org
> Subject: adding custom metadata to CrawlDatum during parse
> 
> Is it possible to add custom metadata (preferably via plugins) to the
> CrawlDatum of the URL during parse or its associated filter phases?
> 
> It seems you can do so if you parse along with fetch, but that would require
> modifications to Fetcher.java.
> Is there a better way to accomplish this that I have missed?
> 
> Sourajit
> 


RE: Simulating 2.x's page.putToInlinks() in trunk

2012-11-13 Thread Markus Jelsma
In trunk you can use the Inlink and Inlinks classes: the first represents a single 
inlink and the latter is the container you add the Inlink objects to.  

Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://nutch.apache.org/", "Apache Nutch"));

The inlinked URL is the key in the key/value pair, so you won't see that one.
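A slightly fuller sketch (class names as in trunk at the time; the URLs and anchors
are placeholders) of the trunk-side equivalent of the 2.x putToInlinks() calls:

import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;

public class InlinksExample {
  // Build one Inlink per (from-URL, anchor) pair, mirroring page.putToInlinks().
  public static Inlinks buildInlinks() {
    Inlinks inlinks = new Inlinks();
    inlinks.add(new Inlink("http://example.com/page1", "anchor_text1"));
    inlinks.add(new Inlink("http://example.com/page2", "anchor_text1")); // same anchor, new URL
    inlinks.add(new Inlink("http://example.com/page3", "anchor_text2"));
    return inlinks;
  }
}

The resulting Inlinks object can then be passed to AnchorIndexingFilter.filter(doc,
parse, url, datum, inlinks); with anchor deduplication switched on, only two
distinct anchor values should end up on the document.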
 
-Original message-
> From:Lewis John Mcgibbney 
> Sent: Mon 12-Nov-2012 16:29
> To: user@nutch.apache.org
> Subject: Simulating 2.x's page.putToInlinks() in trunk
> 
> Hi,
> 
> I'm attempting to test the AnchorIndexingFilter by adding numerous
> inlinks and their anchor text then check whether the deduplication is
> working sufficiently.
> 
> Can someone show me how I simulate the following using the trunk API
> 
> // This is 2.x API
> WebPage page = new WebPage();
> page.putToInlinks(new Utf8("$inlink1"), new Utf8("$anchor_text1"));
> page.putToInlinks(new Utf8("$inlink2"), new Utf8("$anchor_text1"));
> page.putToInlinks(new Utf8("$inlink3"), new Utf8("$anchor_text2"));
> 
> If anchor deduplication is set to boolean true value then we could
> only allow two anchor entries for the page inlinks. I wish therefore
> to simulate this in trunk API using Inlinks, Inlink or
> NutchDocument.add function however I am stuck...
> 
> Thank you very much in advance for any help.
> 
> Best
> 
> Lewis
> 
> -- 
> Lewis
> 


RE: How to find ids of pages that have been newly crawled or modified after a given date with Nutch 2.1

2012-11-13 Thread Markus Jelsma
In trunk the modified time is based on whether or not the signature has changed. 
It makes little sense to rely on HTTP headers because almost no CMS implements 
them correctly, and they can interfere with an adaptive schedule (or be 
manipulated on purpose to game it).

https://issues.apache.org/jira/browse/NUTCH-1341
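The idea described above (bump the modified time only when the content signature
changes between fetches, rather than trusting Last-Modified headers) can be
sketched in a few lines; plain MD5 stands in here for Nutch's own signature
classes, so this is illustration rather than the actual trunk code:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class ModifiedTimeCheck {

  // Compute a simple MD5 content signature (Nutch ships its own Signature implementations).
  static byte[] signature(byte[] content) throws NoSuchAlgorithmException {
    return MessageDigest.getInstance("MD5").digest(content);
  }

  // Keep the previous modified time unless the signature actually changed.
  static long modifiedTime(byte[] oldSig, byte[] newSig, long previousModifiedTime,
      long fetchTime) {
    return Arrays.equals(oldSig, newSig) ? previousModifiedTime : fetchTime;
  }
}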
 
 
-Original message-
> From:j.sulli...@thomsonreuters.com 
> Sent: Tue 13-Nov-2012 11:13
> To: user@nutch.apache.org
> Subject: RE: How to find ids of pages that have been newly crawled or 
> modified after a given date with Nutch 2.1
> 
> I think the modifiedTime comes from the HTTP headers if available; otherwise it 
> is left empty.  In other words, it is the time the content was last modified 
> according to the source, when the source provides it.  
> Depending on what Jacob is trying to achieve the one line patch at 
> https://issues.apache.org/jira/browse/NUTCH-1475 might be what he needs (or 
> might not be).
> 
> James
> 
> -Original Message-
> From: Ferdy Galema [mailto:ferdy.gal...@kalooga.com] 
> Sent: Tuesday, November 13, 2012 6:31 PM
> To: user@nutch.apache.org
> Subject: Re: How to find ids of pages that have been newly crawled or 
> modified after a given date with Nutch 2.1
> 
> Hi,
> 
> There might be something wrong with the field modifiedTime. I'm not sure how 
> well you can rely on this field (with the default or the adaptive scheduler).
> 
> If you want to get to the bottom of this, I suggest debugging or running 
> small crawls to test the behaviour. In case something doesn't work as 
> expected, please repost here or open a Jira.
> 
> Ferdy.
> 
> On Mon, Nov 12, 2012 at 8:18 PM, Jacob Sisk  wrote:
> 
> > Hi,
> >
> > If this question has already been answered please forgive me and point 
> > me to the appropriate thread.
> >
> > I'd like to be able to find the ids of all new pages crawled by nutch 
> > or pages modified since a fixed point in the past.
> >
> > I'm using Nutch 2.1 with MySQL as the back-end and it seems like the 
> > appropriate back-end query should be something like:
> >
> >  "select id from webpage where (prevFetchTime=null & fetchTime>="X") 
> > or (modifiedTime >= "X" )
> >
> > where "X" is some point in the past.
> >
> > What I've found is that modifiedTime is always null.  I am using the
> > adaptive scheduler and the default md5 signature class.   I've tried both
> > re-injecting seed URLs as well as not, it seems to make no difference.
> >  modifiedTime remains null.
> >
> > I am most grateful for any help or advice.  If my nutch-site.xml file 
> > would help I can forward it along.
> >
> > Thanks,
> > jacob
> >
> 


RE: very slow generator step

2012-11-12 Thread Markus Jelsma
You may need to change your expressions, but the automaton filter is performant. 
Not all features of traditional regexes are supported.
http://wiki.apache.org/nutch/RegexURLFiltersBenchs
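To illustrate how one long, tricky URL can put java.util.regex into a spin (the
java.util.regex.Matcher.find() hotspot mentioned in the quoted message below),
here is a small standalone demonstration of catastrophic backtracking with a
nested quantifier. It is not Nutch code and the shipped filter rules are more
involved, but an automaton-based filter avoids this whole class of problem
because a DFA never backtracks:

import java.util.regex.Pattern;

public class BacktrackDemo {
  public static void main(String[] args) {
    // A long, repetitive URL that cannot match the pattern below.
    StringBuilder sb = new StringBuilder("http://example.com/");
    for (int i = 0; i < 28; i++) {
      sb.append('a');
    }
    String url = sb.toString();

    // Nested quantifiers force the engine to try an exponential number of
    // partitions before giving up; each extra 'a' roughly doubles the time.
    Pattern catastrophic = Pattern.compile("/(a+)+b");

    long start = System.nanoTime();
    boolean found = catastrophic.matcher(url).find();
    System.out.printf("found=%b in %.1f ms%n", found,
        (System.nanoTime() - start) / 1e6);
  }
}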

 
 
-Original message-
> From:Mohammad wrk 
> Sent: Mon 12-Nov-2012 22:17
> To: user@nutch.apache.org
> Subject: Re: very slow generator step
> 
> 
> 
> That's a good thinking. I have never used url-filter automation. Where can I 
> find more info?
> 
> Thanks,
> Mohammad
> 
> 
>  From: Julien Nioche 
> To: user@nutch.apache.org; Mohammad wrk  
> Sent: Monday, November 12, 2012 12:38:44 PM
> Subject: Re: very slow generator step
>  
> Could be that a particularly long and tricky URL got into your crawldb and
> put the regex into a spin. I'd use the url-filter automaton instead as it
> is much faster. Would be interesting to know what caused the regex to take
> so much time, in case you fancy a bit of debugging ;-)
> 
> Julien
> 
> On 12 November 2012 20:29, Mohammad wrk  wrote:
> 
> > Thanks for the tip. It went down to 2 minutes :-)
> >
> > What I don't understand is that how come everything was working fine with
> > the default configuration for about 4 days and all of a sudden one crawl
> > causes a jump of 100 minutes?
> >
> > Cheers,
> > Mohammad
> >
> >
> > 
> >  From: Markus Jelsma 
> > To: "user@nutch.apache.org" 
> > Sent: Monday, November 12, 2012 11:19:11 AM
> > Subject: RE: very slow generator step
> >
> > Hi - Please use the -noFilter option. It is usually useless to filter in
> > the generator because they've already been filtered in the parse step and
> > or update step.
> >
> >
> >
> > -Original message-
> > > From:Mohammad wrk 
> > > Sent: Mon 12-Nov-2012 18:43
> > > To: user@nutch.apache.org
> > > Subject: very slow generator step
> > >
> > > Hi,
> > >
> > > The generator time has gone from 8 minutes to 106 minutes few days ago
> > and stayed there since then. AFAIK, I haven't made any configuration
> > changes recently (attached you can find some of the configurations that I
> > thought might be related).
> > >
> > > A quick CPU sampling shows that most of the time is spent on
> > java.util.regex.Matcher.find(). Since I'm using default regex
> > configurations and my crawldb has only 3,052,412 urls, I was wondering if
> > this is a known issue with nutch-1.5.1 ?
> > >
> > > Here are some more information that might help:
> > >
> > > = Generator logs
> > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting at
> > 2012-11-09 03:14:50
> > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting
> > best-scoring urls due for fetch.
> > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering:
> > true
> > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: normalizing:
> > true
> > > 2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
> > > 2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator: jobtracker is
> > 'local', generating exactly one partition.
> > > 2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator: Partitioning
> > selected urls for politeness.
> > > 2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment:
> > segments/20121109032340
> > > 2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished at
> > 2012-11-09 03:23:47, elapsed: 00:08:56
> > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting at
> > 2012-11-09 05:35:14
> > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting
> > best-scoring urls due for fetch.
> > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering:
> > true
> > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: normalizing:
> > true
> > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
> > > 2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator: jobtracker is
> > 'local', generating exactly one partition.
> > > 2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator: Partitioning
> > selected urls for politeness.
> > > 2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment:
> > segments/20121109072143
> > > 2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished at
> > 2012-11

RE: very slow generator step

2012-11-12 Thread Markus Jelsma
Hi - Please use the -noFilter option. It is usually useless to filter in the 
generator because they've already been filtered in the parse step and or update 
step.

 
 
-Original message-
> From:Mohammad wrk 
> Sent: Mon 12-Nov-2012 18:43
> To: user@nutch.apache.org
> Subject: very slow generator step
> 
> Hi,
> 
> The generator time has gone from 8 minutes to 106 minutes a few days ago and 
> has stayed there since then. AFAIK, I haven't made any configuration changes 
> recently (attached you can find some of the configurations that I thought 
> might be related). 
> 
> A quick CPU sampling shows that most of the time is spent on 
> java.util.regex.Matcher.find(). Since I'm using default regex configurations 
> and my crawldb has only 3,052,412 urls, I was wondering if this is a known 
> issue with nutch-1.5.1 ?
> 
> Here is some more information that might help:
> 
> = Generator logs
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting at 
> 2012-11-09 03:14:50
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting 
> best-scoring urls due for fetch.
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering: true
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: normalizing: true
> 2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
> 2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator: jobtracker is 
> 'local', generating exactly one partition.
> 2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator: Partitioning 
> selected urls for politeness.
> 2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment: 
> segments/20121109032340
> 2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished at 
> 2012-11-09 03:23:47, elapsed: 00:08:56
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting at 
> 2012-11-09 05:35:14
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting 
> best-scoring urls due for fetch.
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering: true
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: normalizing: true
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
> 2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator: jobtracker is 
> 'local', generating exactly one partition.
> 2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator: Partitioning 
> selected urls for politeness.
> 2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment: 
> segments/20121109072143
> 2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished at 
> 2012-11-09 07:21:51, elapsed: 01:46:36
> 
> = CrawlDb statistics
> CrawlDb statistics start: ./crawldb
> Statistics for CrawlDb: ./crawldb
> TOTAL urls:3052412
> retry 0:3047404
> retry 1:338
> retry 2:1192
> retry 3:822
> retry 4:336
> retry 5:2320
> min score:0.0
> avg score:0.015368268
> max score:48.608
> status 1 (db_unfetched):2813249
> status 2 (db_fetched):196717
> status 3 (db_gone):14204
> status 4 (db_redir_temp):10679
> status 5 (db_redir_perm):17563
> CrawlDb statistics: done
> 
> = System info
> Memory: 4 GB
> CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4 
> Available diskspace: 171.7 GB
> OS: Release 12.10 (quantal) 64-bit
> 
> 
> Thanks,
> Mohammad
> 


RE: Slides of Nutch talk at ApacheCon EU 2012

2012-11-09 Thread Markus Jelsma
Thanks!
 
-Original message-
> From:Julien Nioche 
> Sent: Fri 09-Nov-2012 10:50
> To: user@nutch.apache.org
> Subject: Slides of Nutch talk at ApacheCon EU 2012
> 
> Hi guys,
> 
> For those of you who could not make it to the ApacheCon in Sinsheim, here
> are the slides of my talk on Nutch
> http://www.slideshare.net/digitalpebble/large-scale-crawling-with-apache-nutch
> 
> The presentations have been filmed, will share the link where it is
> available
> 
> Thanks
> 
> Julien
> 
> -- 
> *
> *Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 


RE: Tika Parsing not working in the latest version of 2.X?

2012-11-08 Thread Markus Jelsma
Try cleaning your build. 
 
-Original message-
> From:j.sulli...@thomsonreuters.com 
> Sent: Thu 08-Nov-2012 07:23
> To: user@nutch.apache.org
> Subject: Tika Parsing not working in the latest version of 2.X?
> 
> Just tried the latest 2.X after being away for a while. Tika parsing doesn't 
> seem to be working.
> 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.tika.mime.MediaType.set([Lorg/apache/tika/mime/MediaType;)Ljava/util/Set;
> at 
> org.apache.tika.parser.crypto.Pkcs7Parser.getSupportedTypes(Pkcs7Parser.java:52)
> at org.apache.nutch.parse.tika.TikaConfig.(TikaConfig.java:149)
> at 
> org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:210)
> at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:203)
> at 
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
> at org.apache.nutch.parse.ParserFactory.getFields(ParserFactory.java:209)
> at org.apache.nutch.parse.ParserJob.getFields(ParserJob.java:193)
> at org.apache.nutch.fetcher.FetcherJob.getFields(FetcherJob.java:142)
> at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:184)
> at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:219)
> at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:301)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:307)
> Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.tika.mime.MediaType.set([Lorg/apache/tika/mime/MediaType;)Ljava/util/Set;
> at 
> org.apache.tika.parser.crypto.Pkcs7Parser.getSupportedTypes(Pkcs7Parser.java:52)
> at org.apache.nutch.parse.tika.TikaConfig.(TikaConfig.java:149)
> at 
> org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:210)
> at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:203)
> at 
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
> at org.apache.nutch.parse.ParserFactory.getFields(ParserFactory.java:209)
> at org.apache.nutch.parse.ParserJob.getFields(ParserJob.java:193)
> at org.apache.nutch.parse.ParserJob.run(ParserJob.java:245)
> at org.apache.nutch.parse.ParserJob.parse(ParserJob.java:259)
> at org.apache.nutch.parse.ParserJob.run(ParserJob.java:302)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.parse.ParserJob.main(ParserJob.java:306)
> 
> 


RE: timestamp in nutch schema

2012-11-04 Thread Markus Jelsma
Hi - the timestamp is just the time at which a page is being indexed. Not very 
useful except for deduplication. If you want to index a publishing date you must 
first identify the source of that date and extract it from the webpages. It's 
possible to use og:date or other meta tags, or perhaps other sources, but to do 
so you must create a custom parse filter.

Plain meta tags can be indexed without creating a custom parse filter. But if you 
don't trust websites, or need special (re)formatting or checking logic, you need 
to write a parse filter for it.

I've also built a date parsing filter that retrieves dates in various formats from 
free text; check Jira for the dateparsefilter patch. It's an older version but 
still works well.
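As a rough sketch of that (re)formatting and checking step, written here as an
indexing filter for brevity although it could equally live in a parse filter; it
assumes some parser has already left a raw date string in the parse metadata under
a made-up key "publish.date", and the exact extension-point signatures vary between
versions:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class PublishDateIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    String raw = parse.getData().getParseMeta().get("publish.date");
    if (raw != null) {
      try {
        // Only index values we can actually parse; don't trust the site blindly.
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd");
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        doc.add("publish_date", out.format(in.parse(raw.trim())));
      } catch (ParseException e) {
        // Unparseable date: leave the field out rather than index junk.
      }
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}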

-Original message-
> From:Joe Zhang 
> Sent: Sun 04-Nov-2012 05:44
> To: user 
> Subject: timestamp in nutch schema
> 
> My understanding is that the timestamp stores crawling time. Is there any
> way to get nutch to parse out the publishing time of webpages and store
> such info in timestamp or some other field?
> 


RE: URL filtering: crawling time vs. indexing time

2012-11-04 Thread Markus Jelsma
Just try it. With -D you can override Nutch and Hadoop configuration properties.



 
 
-Original message-
> From:Joe Zhang 
> Sent: Sun 04-Nov-2012 06:07
> To: user 
> Subject: Re: URL filtering: crawling time vs. indexing time
> 
> Markus, I don't see "-D" as a valid command parameter for solrindex.
> 
> On Fri, Nov 2, 2012 at 11:37 AM, Markus Jelsma
> wrote:
> 
> > Ah, i understand now.
> >
> > The indexer tool can filter as well in 1.5.1 and if you enable the regex
> > filter and set a different regex configuration file when indexing vs.
> > crawling you should be good to go.
> >
> > You can override the default configuration file by setting
> > urlfilter.regex.file and point it to the regex file you want to use for
> > indexing. You can set it via nutch solrindex -Durlfilter.regex.file=/path
> > http://solrurl/ ...
> >
> > Cheers
> >
> > -Original message-
> > > From:Joe Zhang 
> > > Sent: Fri 02-Nov-2012 17:55
> > > To: user@nutch.apache.org
> > > Subject: Re: URL filtering: crawling time vs. indexing time
> > >
> > > I'm not sure I get it. Again, my problem is a very generic one:
> > >
> > > - The patterns in regex-urlfitler.txt, howevery exotic they are, they
> > > control ***which URLs to visit***.
> > > - Generally speaking, the set of ULRs to be indexed into solr is only a
> > > ***subset*** of the above.
> > >
> > > We need a way to specify crawling filter (which is regex-urlfitler.txt)
> > vs.
> > > indexing filter, I think.
> > >
> > > On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux  wrote:
> > >
> > > > You have still several possibilities here :
> > > > 1) find a way to seed the crawl with the URLs containing the links to
> > the
> > > > leaf pages (sometimes it is possible with a simple loop)
> > > > 2) create regex for each step of the scenario going to the leaf page,
> > in
> > > > order to limit the crawl to necessary pages only. Use the $ sign at
> > the end
> > > > of your regexp to limit the match of regexp like http://([a-z0-9]*\.)*
> > > > mysite.com.
> > > >
> > > >
> > > > Le 2 nov. 2012 à 17:22, Joe Zhang  a écrit :
> > > >
> > > > > The problem is that,
> > > > >
> > > > > - if you write regex such as: +^http://([a-z0-9]*\.)*mysite.com,
> > you'll
> > > > end
> > > > > up indexing all the pages on the way, not just the leaf pages.
> > > > > - if you write specific regex for
> > > > > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html,
> > and
> > > > you
> > > > > start crawling at mysite.com, you'll get zero results, as there is
> > no
> > > > match.
> > > > >
> > > > > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <
> > > > markus.jel...@openindex.io>wrote:
> > > > >
> > > > >> -Original message-
> > > > >>> From:Joe Zhang 
> > > > >>> Sent: Fri 02-Nov-2012 10:04
> > > > >>> To: user@nutch.apache.org
> > > > >>> Subject: URL filtering: crawling time vs. indexing time
> > > > >>>
> > > > >>> I feel like this is a trivial question, but I just can't get my
> > ahead
> > > > >>> around it.
> > > > >>>
> > > > >>> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at
> > the
> > > > >>> rudimentary level.
> > > > >>>
> > > > >>> If my understanding is correct, the regex-es in
> > > > >>> nutch/conf/regex-urlfilter.txt control  the crawling behavior, ie.,
> > > > which
> > > > >>> URLs to visit or not in the crawling process.
> > > > >>
> > > > >> Yes.
> > > > >>
> > > > >>>
> > > > >>> On the other hand, it doesn't seem artificial for us to only want
> > > > certain
> > > > >>> pages to be indexed. I was hoping to write some regular
> > expressions as
> > > > >> well
> > > > >>> in some config file, but I just can't find the right place. My
> > hunch
> > > > >> tells
> > > > >>> me that such things should not require into-the-box coding. Can
> > anybody
> > > > >>> help?
> > > > >>
> > > > >> What exactly do you want? Add your custom regular expressions? The
> > > > >> regex-urlfilter.txt is the place to write them to.
> > > > >>
> > > > >>>
> > > > >>> Again, the scenario is really rather generic. Let's say we want to
> > > > crawl
> > > > >>> http://www.mysite.com. We can use the regex-urlfilter.txt to skip
> > > > loops
> > > > >> and
> > > > >>> unncessary file types etc., but only expect to index pages with
> > URLs
> > > > >> like:
> > > > >>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html
> > .
> > > > >>
> > > > >> To do this you must simply make sure your regular expressions can do
> > > > this.
> > > > >>
> > > > >>>
> > > > >>> Am I too naive to expect zero Java coding in this case?
> > > > >>
> > > > >> No, you can achieve almost all kinds of exotic filtering with just
> > the
> > > > URL
> > > > >> filters and the regular expressions.
> > > > >>
> > > > >> Cheers
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
> 


RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
Ah, i understand now.

The indexer tool can filter as well in 1.5.1, and if you enable the regex filter 
and set a different regex configuration file for indexing than for crawling you 
should be good to go.

You can override the default configuration file by setting urlfilter.regex.file 
and pointing it to the regex file you want to use for indexing. You can set it via 
nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
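For illustration, such an index-time regex file (the file name and the patterns
below are examples, not shipped defaults) could keep only the leaf pages, since
the first matching rule wins:

# accept only the leaf pages that should reach Solr
+^http://www\.mysite\.com/[^/]+/[^/]+/[^/]+\.html$
# reject everything else at indexing time
-.

Crawling keeps using the normal conf/regex-urlfilter.txt, while solrindex picks up
the stricter file through -Durlfilter.regex.file.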

Cheers
 
-Original message-
> From:Joe Zhang 
> Sent: Fri 02-Nov-2012 17:55
> To: user@nutch.apache.org
> Subject: Re: URL filtering: crawling time vs. indexing time
> 
> I'm not sure I get it. Again, my problem is a very generic one:
> 
> - The patterns in regex-urlfilter.txt, however exotic they are, they
> control ***which URLs to visit***.
> - Generally speaking, the set of URLs to be indexed into solr is only a
> ***subset*** of the above.
> 
> We need a way to specify a crawling filter (which is regex-urlfilter.txt) vs.
> indexing filter, I think.
> 
> On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux  wrote:
> 
> > You have still several possibilities here :
> > 1) find a way to seed the crawl with the URLs containing the links to the
> > leaf pages (sometimes it is possible with a simple loop)
> > 2) create regex for each step of the scenario going to the leaf page, in
> > order to limit the crawl to necessary pages only. Use the $ sign at the end
> > of your regexp to limit the match of regexp like http://([a-z0-9]*\.)*
> > mysite.com.
> >
> >
> > Le 2 nov. 2012 à 17:22, Joe Zhang  a écrit :
> >
> > > The problem is that,
> > >
> > > - if you write regex such as: +^http://([a-z0-9]*\.)*mysite.com, you'll
> > end
> > > up indexing all the pages on the way, not just the leaf pages.
> > > - if you write specific regex for
> > > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and
> > you
> > > start crawling at mysite.com, you'll get zero results, as there is no
> > match.
> > >
> > > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <
> > markus.jel...@openindex.io>wrote:
> > >
> > >> -Original message-
> > >>> From:Joe Zhang 
> > >>> Sent: Fri 02-Nov-2012 10:04
> > >>> To: user@nutch.apache.org
> > >>> Subject: URL filtering: crawling time vs. indexing time
> > >>>
> > >>> I feel like this is a trivial question, but I just can't get my ahead
> > >>> around it.
> > >>>
> > >>> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> > >>> rudimentary level.
> > >>>
> > >>> If my understanding is correct, the regex-es in
> > >>> nutch/conf/regex-urlfilter.txt control  the crawling behavior, ie.,
> > which
> > >>> URLs to visit or not in the crawling process.
> > >>
> > >> Yes.
> > >>
> > >>>
> > >>> On the other hand, it doesn't seem artificial for us to only want
> > certain
> > >>> pages to be indexed. I was hoping to write some regular expressions as
> > >> well
> > >>> in some config file, but I just can't find the right place. My hunch
> > >> tells
> > >>> me that such things should not require into-the-box coding. Can anybody
> > >>> help?
> > >>
> > >> What exactly do you want? Add your custom regular expressions? The
> > >> regex-urlfilter.txt is the place to write them to.
> > >>
> > >>>
> > >>> Again, the scenario is really rather generic. Let's say we want to
> > crawl
> > >>> http://www.mysite.com. We can use the regex-urlfilter.txt to skip
> > loops
> > >> and
> > >>> unncessary file types etc., but only expect to index pages with URLs
> > >> like:
> > >>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
> > >>
> > >> To do this you must simply make sure your regular expressions can do
> > this.
> > >>
> > >>>
> > >>> Am I too naive to expect zero Java coding in this case?
> > >>
> > >> No, you can achieve almost all kinds of exotic filtering with just the
> > URL
> > >> filters and the regular expressions.
> > >>
> > >> Cheers
> > >>>
> > >>
> >
> >
> 


RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
 
-Original message-
> From:Joe Zhang 
> Sent: Fri 02-Nov-2012 17:55
> To: user@nutch.apache.org
> Subject: Re: URL filtering: crawling time vs. indexing time
> 
> I'm not sure I get it. Again, my problem is a very generic one:
> 
I think i get it.

> - The patterns in regex-urlfilter.txt, however exotic they are, they
> control ***which URLs to visit***.

They also control which URLs NOT to visit. If a regex is prefixed with a + the 
URL passes, but prefix it with a - and the URL is filtered out. You have to 
filter out the URLs you don't want, and whatever remains you can let through 
with another regex.

> - Generally speaking, the set of URLs to be indexed into solr is only a
> ***subset*** of the above.
> 
> We need a way to specify a crawling filter (which is regex-urlfilter.txt) vs.
> indexing filter, I think.
> 
> On Fri, Nov 2, 2012 at 9:29 AM, Rémy Amouroux  wrote:
> 
> > You have still several possibilities here :
> > 1) find a way to seed the crawl with the URLs containing the links to the
> > leaf pages (sometimes it is possible with a simple loop)
> > 2) create regex for each step of the scenario going to the leaf page, in
> > order to limit the crawl to necessary pages only. Use the $ sign at the end
> > of your regexp to limit the match of regexp like http://([a-z0-9]*\.)*
> > mysite.com.
> >
> >
> > Le 2 nov. 2012 à 17:22, Joe Zhang  a écrit :
> >
> > > The problem is that,
> > >
> > > - if you write regex such as: +^http://([a-z0-9]*\.)*mysite.com, you'll
> > end
> > > up indexing all the pages on the way, not just the leaf pages.
> > > - if you write specific regex for
> > > http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and
> > you
> > > start crawling at mysite.com, you'll get zero results, as there is no
> > match.
> > >
> > > On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <
> > markus.jel...@openindex.io>wrote:
> > >
> > >> -Original message-
> > >>> From:Joe Zhang 
> > >>> Sent: Fri 02-Nov-2012 10:04
> > >>> To: user@nutch.apache.org
> > >>> Subject: URL filtering: crawling time vs. indexing time
> > >>>
> > >>> I feel like this is a trivial question, but I just can't get my ahead
> > >>> around it.
> > >>>
> > >>> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> > >>> rudimentary level.
> > >>>
> > >>> If my understanding is correct, the regex-es in
> > >>> nutch/conf/regex-urlfilter.txt control  the crawling behavior, ie.,
> > which
> > >>> URLs to visit or not in the crawling process.
> > >>
> > >> Yes.
> > >>
> > >>>
> > >>> On the other hand, it doesn't seem artificial for us to only want
> > certain
> > >>> pages to be indexed. I was hoping to write some regular expressions as
> > >> well
> > >>> in some config file, but I just can't find the right place. My hunch
> > >> tells
> > >>> me that such things should not require into-the-box coding. Can anybody
> > >>> help?
> > >>
> > >> What exactly do you want? Add your custom regular expressions? The
> > >> regex-urlfilter.txt is the place to write them to.
> > >>
> > >>>
> > >>> Again, the scenario is really rather generic. Let's say we want to
> > crawl
> > >>> http://www.mysite.com. We can use the regex-urlfilter.txt to skip
> > loops
> > >> and
> > >>> unncessary file types etc., but only expect to index pages with URLs
> > >> like:
> > >>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
> > >>
> > >> To do this you must simply make sure your regular expressions can do
> > this.
> > >>
> > >>>
> > >>> Am I too naive to expect zero Java coding in this case?
> > >>
> > >> No, you can achieve almost all kinds of exotic filtering with just the
> > URL
> > >> filters and the regular expressions.
> > >>
> > >> Cheers
> > >>>
> > >>
> >
> >
> 


RE: URL filtering: crawling time vs. indexing time

2012-11-02 Thread Markus Jelsma
-Original message-
> From:Joe Zhang 
> Sent: Fri 02-Nov-2012 10:04
> To: user@nutch.apache.org
> Subject: URL filtering: crawling time vs. indexing time
> 
> I feel like this is a trivial question, but I just can't get my head
> around it.
> 
> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> rudimentary level.
> 
> If my understanding is correct, the regex-es in
> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
> URLs to visit or not in the crawling process.

Yes.

> 
> On the other hand, it doesn't seem artificial for us to only want certain
> pages to be indexed. I was hoping to write some regular expressions as well
> in some config file, but I just can't find the right place. My hunch tells
> me that such things should not require into-the-box coding. Can anybody
> help?

What exactly do you want? Add your custom regular expressions? The 
regex-urlfilter.txt is the place to write them to.

> 
> Again, the scenario is really rather generic. Let's say we want to crawl
> http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and
> unnecessary file types etc., but only expect to index pages with URLs like:
> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.

To do this you must simply make sure your regular expressions can do this.

> 
> Am I too naive to expect zero Java coding in this case?

No, you can achieve almost all kinds of exotic filtering with just the URL 
filters and the regular expressions.

Cheers
> 


RE: [crawler-common] infoQ article Apache Nutch 2 Features and Product Roadmap

2012-11-01 Thread Markus Jelsma
Cheers! 
 
-Original message-
> From:Lewis John Mcgibbney 
> Sent: Thu 01-Nov-2012 18:30
> To: user@nutch.apache.org
> Subject: Re: [crawler-common] infoQ article Apache Nutch 2 Features and 
> Product Roadmap
> 
> Nice one Julien. Its nothing short of a privilege to be part of the various
> communities and working alongside you guys.
> 
> Have a great night.
> 
> Lewis
> 
> On Thu, Nov 1, 2012 at 11:39 AM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
> 
> > Hi all,
> >
> > Apologies for cross posting. Srini Penchikala has just published an
> > interview with me about Nutch 2 on InfoQ at
> > http://www.infoq.com/articles/nioche-apache-nutch2. Several projects are
> > mentioned in relation to Nutch, hence the CC.
> >
> > The views and opinions expressed are entirely mine and do not reflect any
> > official position of the Nutch PMC ;-)
> >
> > Thanks
> >
> > Julien
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
> >  --
> > You received this message because you are subscribed to the Google Groups
> > "crawler-commons" group.
> > Visit this group at
> > http://groups.google.com/group/crawler-commons?hl=en-US.
> >
> >
> >
> 
> 
> 
> -- 
> *Lewis*
> 


RE: Information about compiling?

2012-11-01 Thread Markus Jelsma
Hi,

There are binary versions of 1.5.1 but not 2.x.
http://apache.xl-mirror.nl/nutch/1.5.1/

About the scripts, you have to build nutch and then go to runtime/local 
directory to run bin/nutch. 

Cheers
 
 
-Original message-
> From:Dr. Thomas Zastrow 
> Sent: Thu 01-Nov-2012 10:45
> To: user@nutch.apache.org
> Subject: Information about compiling?
> 
> Dear all,
> 
> I found the following tutorial on the web:
> 
> http://wiki.apache.org/nutch/NutchTutorial
> 
> It starts with a binary version of Nutch. Unfortunately, I didn't  
> find any binary version, just the source code on the web page. So I  
> downloaded the latest version and compiled it with "ant". Everything  
> seems to work, but I'm a little bit confused about the paths and how I  
> should proceed.
> 
> Following the tutorial, I have to change some files, but they exist in  
> several versions:
> 
>   find . -iname regex-urlfilter.txt
> ./runtime/local/conf/regex-urlfilter.txt
> ./conf/regex-urlfilter.txt
> 
> The same goes for the "nutch" command, I'm not sure which one is the  
> right one. When I execute /src/bin/nutch with the following parameters:
> 
> ./nutch crawl /opt/crawls/ -dir /opt/crawls/ -depth 3 -topN 5
> 
> I got an error which I understand that the script can not find the jar files:
> 
> Exception in thread "main" java.lang.NoClassDefFoundError:  
> org/apache/nutch/crawl/Crawler
> Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawler
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> Could not find the main class: org.apache.nutch.crawl.Crawler.   
> Program will exit.
> 
> 
> Any help would be nice ;-)
> 
> Best regards and thank you for the software!
> 
> Tom
> 
> 
> -- 
> Dr. Thomas Zastrow
> Süsser Str. 5
> 72074 Tübingen
> 
> www.thomas-zastrow.de
> 


RE: Format of "content" file in segments?

2012-10-27 Thread Markus Jelsma
Hi Морозов,

It's a directory containing Hadoop map file(s) that store key/value pairs. 
Hadoop's Text class is the key and Nutch's Content class is the value. You would 
need Hadoop to easily process the files:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup
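A minimal sketch of such an external reader, assuming the usual layout (the data
files under content/part-XXXXX are SequenceFiles with a Text key holding the URL
and a Content value; part file names and paths may differ in your setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.nutch.protocol.Content;

public class DumpSegmentContent {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // e.g. segments/20121021205343/content/part-00000/data
    Path data = new Path(args[0]);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text key = new Text();
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
      Content content = (Content) value;
      System.out.println(key + "\t" + content.getContentType()
          + "\t" + content.getContent().length + " bytes");
    }
    reader.close();
  }
}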

Cheers,
Markus
 
 
-Original message-
> From:Морозов Евгений 
> Sent: Sat 27-Oct-2012 18:32
> To: user@nutch.apache.org
> Subject: Format of "content" file in segments?
> 
> Where can I find the format of the content file in a segment directory?
> Either source code or documentation. I'm looking at reading it with a
> program external to nutch.
> 
> regards, keanta
> 


RE: fetch time

2012-10-27 Thread Markus Jelsma
Hi - Yes, the fetch time is the time when the record is eligible for fetch 
again.

Cheers,

 
 
-Original message-
> From:Stefan Scheffler 
> Sent: Sat 27-Oct-2012 14:49
> To: user@nutch.apache.org
> Subject: fetch time
> 
> Hi,
> When i dump out the crawl db, there is a fetch entry for each url, which 
> is over one month in the future...
> 
> Fetch time: Mon Nov 26 06:09:43 CET 2012
> 
> Does this mean, this is the next time of fetching?
> 
> Regards Stefan
> 


RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi - there's a similar entry already; however, the fetcher.done part doesn't 
seem to be correct. I can see no reason why that would ever work, as Hadoop temp 
files are simply not copied to the segment if it fails. There's also no notion 
of a fetcher.done file in trunk.

http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F

 
 
-Original message-
> From:Lewis John Mcgibbney 
> Sent: Fri 26-Oct-2012 15:15
> To: user@nutch.apache.org
> Subject: Re: How to recover data from /tmp/hadoop-myuser
> 
> I really think this should be in the FAQ's?
> 
> http://wiki.apache.org/nutch/FAQ
> 
> On Fri, Oct 26, 2012 at 2:10 PM, Markus Jelsma
>  wrote:
> > Hi,
> >
> > You cannot recover the mapper output as far as i know. But anyway, one 
> > should never have a fetcher running for three days. It's far better to 
> > generate a large amount of smaller segments and fetch them sequentially. If 
> > an error occurs, only a small portion is affected. We never run fetchers 
> > for more than one hour, instead we run many in a row and sometimes 
> > concurrently.
> >
> > Cheers,
> >
> >
> > -Original message-
> >> From:Mohammad wrk 
> >> Sent: Fri 26-Oct-2012 00:47
> >> To: user@nutch.apache.org
> >> Subject: How to recover data from /tmp/hadoop-myuser
> >>
> >> Hi,
> >>
> >>
> >>
> >> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) 
> >> failed, after 3 days, with the error below. Under the segment folder 
> >> (./segments/20121021205343/) there is only generated fetch list 
> >> (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of 
> >> data. I was wondering if there is a way to recover this data and parse the 
> >> segment?
> >>
> >> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> >> valid local directory for output/file.out
> >>
> >> at 
> >> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
> >> at 
> >> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
> >> at 
> >> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
> >> at 
> >> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
> >> at 
> >> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
> >> at 
> >> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
> >> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >> at 
> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: 
> >> java.io.IOException: Job failed!
> >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
> >> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
> >> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
> >>
> >>
> >> Thanks,
> >> Mohammad
> 
> 
> 
> -- 
> Lewis
> 


RE: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type application/pdf

2012-10-26 Thread Markus Jelsma
Hi,
 
-Original message-
> From:kiran chitturi 
> Sent: Thu 25-Oct-2012 20:49
> To: user@nutch.apache.org
> Subject: Nutch 2.x Eclipse: Can't retrieve Tika parser for mime-type 
> application/pdf
> 
> Hi,
> 
> i have built Nutch 2.x in eclipse using this tutorial (
> http://wiki.apache.org/nutch/RunNutchInEclipse) and with some modifications.
> 
> Its able to parse html files successfully but when it comes to pdf files it
> says 2012-10-25 14:37:05,071 ERROR tika.TikaParser - Can't retrieve Tika
> parser for mime-type application/pdf
> 
> Is there anything wrong with my eclipse configuration? I am looking to
> debug some  things in nutch, so i am working with eclipse and nutch.
> 
> Do i need to point any libraries for eclipseto recognize tika parsers for
> application/pdf type ?
> 
> What exactly is the reason for this type of error to appear for only pdf
> files and not html files ? I am using recent nutch 2.x which has tika
> upgraded to 1.2

This is possible if the PDFBox dependency is not found anywhere or is wrongly 
mapped in Tika's plugin.xml. The above error can also happen if you happen to 
have a tika-parsers-VERSION.jar in your runtime/local/lib directory, for some 
strange reason.

> 
> I would like some help here and would like to know if anyone has
> encountered similar problem with eclipse, nutch 2.x and parsing
> application/pdf files ?
> 
> Many Thanks,
> -- 
> Kiran Chitturi
> 


RE: How to recover data from /tmp/hadoop-myuser

2012-10-26 Thread Markus Jelsma
Hi,

You cannot recover the mapper output as far as i know. But anyway, one should 
never have a fetcher running for three days. It's far better to generate a 
large number of smaller segments and fetch them sequentially. If an error 
occurs, only a small portion is affected. We never run fetchers for more than 
one hour; instead we run many in a row and sometimes concurrently.

Cheers,

 
-Original message-
> From:Mohammad wrk 
> Sent: Fri 26-Oct-2012 00:47
> To: user@nutch.apache.org
> Subject: How to recover data from /tmp/hadoop-myuser
> 
> Hi,
> 
> 
> 
> My fetch cycle (nutch fetch ./segments/20121021205343/ -threads 25) failed, 
> after 3 days, with the error below. Under the segment folder 
> (./segments/20121021205343/) there is only generated fetch list 
> (crawl_generate) and no content. However /tmp/hadoop-myuser/ has 96G of data. 
> I was wondering if there is a way to recover this data and parse the segment?
> 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any 
> valid local directory for output/file.out
> 
>         at 
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>         at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>         at 
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>         at 
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>         at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1640)
>         at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1323)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2012-10-24 14:43:29,671 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: 
> Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1318)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1354)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1327)
> 
> 
> Thanks,
> Mohammad


RE: problems with image dynamic fields in nutch 1.4

2012-10-24 Thread Markus Jelsma
Hi - I don't know the specific field names, but you can check them using the 
parsechecker tool; it prints all detected metadata. 
 
-Original message-
> From:Jorge Luis Betancourt Gonzalez 
> Sent: Wed 24-Oct-2012 16:05
> To: user@nutch.apache.org
> Subject: Re: problems with image dynamic fields in nutch 1.4
> 
> And how can I know the name of the fields generated by the tika parser? there 
> are any prefix used?
> 
> Greetings,
> 
> On Oct 24, 2012, at 10:02 AM, Markus Jelsma  
> wrote:
> 
> > Hi - you need a custom indexing filter that adds the fields from parsemeta 
> > to the document.
> > 
> > Cheers,
> > 
> > 
> > 
> > -Original message-
> >> From:Eyeris Rodriguez Rueda 
> >> Sent: Wed 24-Oct-2012 14:59
> >> To: user@nutch.apache.org
> >> Subject: problems with image dynamic fields in nutch 1.4 
> >> 
> >> Hi all.
> >> I have a problem when I try to crawl images, specifically with dynamic
> >> fields of that images.
> >> When I do a crawl, nutch is ignoring this dynamic fields.
> >> When I upload manually some images directly  to solr index, solr's  tika is
> >> capable to extract some metadata in dynamic fields like width, height,
> >> content-type, but with nutch crawl those fields are ignored.
> >> I have tried to put in static in solr and nutch schema but continue without
> >> results, here is my schema and solrindex-mapping, Im using nutch 1.4 and
> >> solr 3.6 . Some help or advice will be appreciated.
> >> 
> >> Schema.xml
> >> 
> >>   
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >>
> >>
> >>
> >>
> >>
> >>
> >> >> default="NOW"/>
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >>
> >> 
> >>  
> >>   >> multiValued="false"/>
> >>   >> multiValued="true" /> 
> >>   >> multiValued="false" /> 
> >>  
> >>   >> multiValued="false" />
> >> 
> >> 
> >>  >> multiValued="false"/>
> >>  >> multiValued="false"/>
> >>  >> multiValued="false"/>
> >>  >> multiValued="false"/>
> >>  >> multiValued="false"/>
> >> 
> >> 
> >>  
> >> 
> >> 
> >> 
> >> id
> >> 
> >> 
> >> name
> >> 
> >> 
> >> 
> >> 
> >> 
> >> ***
> >> Solrindex-mapping
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >> 
> >>
> >>id
> >> 
> >> 
> >> 
> >> 
> >> 
> >> _
> >> Ing. Eyeris Rodriguez Rueda
> >> Teléfono:837-3370
> >> Universidad de las Ciencias Informáticas
> >> _
> >> 
> >> 
> >> 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
> >> INFORMATICAS...
> >> CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
> >> 
> >> http://www.uci.cu
> >> http://www.facebook.com/universidad.uci
> >> http://www.flickr.com/photos/universidad_uci
> >> 
> > 
> > 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
> > INFORMATICAS...
> > CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
> > 
> > http://www.uci.cu
> > http://www.facebook.com/universidad.uci
> > http://www.flickr.com/photos/universidad_uci
> 
> 
> 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
> INFORMATICAS...
> CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
> 
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
> 


RE: problems with image dynamic fields in nutch 1.4

2012-10-24 Thread Markus Jelsma
Hi - you need a custom indexing filter that adds the fields from parsemeta to 
the document.
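A rough sketch of what such an indexing filter could look like; the parse-metadata
key names and the target field names below are assumptions for this example, and
the extension-point details vary between versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class ImageMetaIndexingFilter implements IndexingFilter {

  // Parse-metadata keys to copy onto the document; check the parsechecker
  // output for the names Tika actually produces for your images.
  private static final String[] KEYS = { "width", "height", "Content-Type" };

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    for (String key : KEYS) {
      String value = parse.getData().getParseMeta().get(key);
      if (value != null) {
        doc.add("img_" + key.toLowerCase(), value);
      }
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

The plugin also has to be registered in plugin.includes, and the new fields have to
exist in the Solr schema (or match a dynamic field) before they show up in the index.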

Cheers,

 
 
-Original message-
> From:Eyeris Rodriguez Rueda 
> Sent: Wed 24-Oct-2012 14:59
> To: user@nutch.apache.org
> Subject: problems with image dynamic fields in nutch 1.4 
> 
> Hi all.
> I have a problem when I try to crawl images, specifically with the dynamic
> fields of those images.
> When I do a crawl, nutch is ignoring these dynamic fields.
> When I manually upload some images directly to the solr index, solr's tika is
> capable of extracting some metadata into dynamic fields like width, height,
> content-type, but with a nutch crawl those fields are ignored.
> I have tried declaring them as static fields in the solr and nutch schemas but
> still get no results. Here are my schema and solrindex-mapping; I'm using nutch
> 1.4 and solr 3.6. Some help or advice will be appreciated.
> 
> Schema.xml
> 
>   
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>  default="NOW"/>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>   
>multiValued="false"/>
>multiValued="true" /> 
>multiValued="false" /> 
>   
>multiValued="false" />
> 
> 
>  multiValued="false"/>
>  multiValued="false"/>
>  multiValued="false"/>
>  multiValued="false"/>
>  multiValued="false"/>
> 
> 
>   
>  
> 
>  
>  id
> 
>  
>  name
> 
>  
>  
> 
> 
> ***
> Solrindex-mapping
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>   
>   id
> 
> 
> 
> 
> 
> _
> Ing. Eyeris Rodriguez Rueda
> Teléfono:837-3370
> Universidad de las Ciencias Informáticas
> _
> 
> 
> 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
> INFORMATICAS...
> CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
> 
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
> 


RE: Crawling Time

2012-10-23 Thread Markus Jelsma
Hi - this is printed to the command line and log for each individual job.
 
-Original message-
> From:Stefan Scheffler 
> Sent: Tue 23-Oct-2012 14:39
> To: user@nutch.apache.org
> Subject: Crawling Time
> 
> Hello,
> Is there a possibility to check how long a whole crawl took, after it is 
> finished?
> 
> Regards
> Stefan
> 
> -- 
> Stefan Scheffler
> Avantgarde Labs GmbH
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: sscheff...@avantgarde-labs.de
> 
> 


RE: Best practice to index a large crawl through Solr?

2012-10-22 Thread Markus Jelsma
Hi 
 
-Original message-
> From:Thilina Gunarathne 
> Sent: Tue 23-Oct-2012 00:38
> To: user@nutch.apache.org
> Subject: Re: Best practice to index a large crawl through Solr?
> 
> Hi Markus,
> Thanks a lot for the info.
> 
> Hi - Hadoop can write more records per second than Solr can analyze and
> > store,  especially with multiple reducers (threads in Solr). SolrCloud is
> > notoriously slow when it comes to indexing compared to a stand-alone setup.
> 
> Can this be overcome by using the Nutch Solrindex job for indexing? In
> other words, does Solr become a bottleneck for the SolrIndex job?

Nutch trunk can only write to a single Solr URL and if you have more than a few 
reducers Solr is the bottleneck. But that should not be a problem when dealing 
with a few milliion records. It is a matter of minutes.

> 
> Out of curiosity, does SolrCloud supports any data locality when loading
> data from Nutch? For an example, if I'm co-locating SolrCloud on the same
> nodes that are running Hadoop/HBase, can SolrCloud work with the local
> region servers to load the data?  Eventually, we would have to process
> millions of records and I'm just wondering whether the communication
> between Nutch and Solr would be a huge bottleneck.

Data locality is more a thing for distributed processing, moving the program 
to the data on the assumption that it's cheaper in terms of bandwidth. That 
does not apply to SolrCloud; it works with hash ranges based on your ID and 
then routes documents to a specific shard (see the SolrCloud wiki page referred 
to in this thread). If you want a stable and performant Nutch and Solr cluster you 
must separate them. Both have specific resource requirements and should not run 
on the same node. If you mix them, it is hard to provide a reliable service.

We operate one Nutch cluster and several Solr clusters with a lot of documents 
and don't worry about the bottleneck. Based on my experience i think you 
should not worry too much at this point about Solr being an indexing bottleneck; 
you can scale out if it becomes a problem.

A significant improvement for very large scale indexing from a Nutch cluster to 
a SolrCloud cluster is NUTCH-1377, but it's tedious to implement. Right now we 
don't need it because the bottleneck is insignificant, even with 
many millions of documents. Unless you are going to work with A LOT of records 
this should not be a big problem for the next few months.

https://issues.apache.org/jira/browse/NUTCH-1377

> 
> thanks,
> Thilina
> 
> 
> > However, this should not be a problem at all as your not dealing with
> > millions of records. Trying to tie HBase as a backend to Solr is not a good
> > idea at all. The best and fastest storage for Solr is a disk and
> > MMappedDirectory enabled (default in recent version) and plenty of RAM.
> > Keep in mind that Solr keeps several parts of the index in memory and
> > others if it can and it is very efficient in doing that.
> >
> > With only a few million records it's easy and fast enough to run Hadoop
> > locally (or pseudo if you can) and have a single Solr node running.
> >
> > -Original message-
> > > From:Thilina Gunarathne 
> > > Sent: Mon 22-Oct-2012 22:35
> > > To: user@nutch.apache.org
> > > Subject: Re: Best practice to index a large crawl through Solr?
> > >
> > > Hi Alex,
> > > Thanks again for the information.
> > >
> > > My current requirement is to implement a  simple searching application
> > for
> > > a publication. Our current data sizes probably would not exceed the
> > amount
> > > of records you mentioned and for now, we should be fine with a single
> > Solr
> > > instance. I'm going to check out the SolrCloud for our future needs.
> > >
> > > >Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> > > >sound pretty crazy.
> > > I agree :).. Unfortunately (or may be luckily) I do not have much time to
> > > invest on this and I'll probably have to rely on the existing tools,
> > rather
> > > than trying to reinvent the wheels :)..
> > >
> > > thanks,
> > > Thilina
> > >
> > >
> > > On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres <
> > > acace...@hyperiongray.com> wrote:
> > >
> > > > No problem. Wrt to your first question, Solr would actually be storing
> > > > this data locally. Solr sharding actually uses its own mechanism
> > > > called SolrCloud. I'd recommend checking it out here:
> > > > http://wiki.apache.org/solr/SolrCloud, it seems cool though I have not
> > > > used it myself.
> > > >
> > > > Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> > > > sound pretty crazy. You can most definitely find a more efficient way
> > > > to do this, either by going to HBase directly from the start (I
> > > > wouldn't do so personally) or just using Solr. It might be good to
> > > > know what kind of application you are looking to build and asking more
> > > > specifically.
> > > >
> > > > Alex
> > > >
> > > > On Mon, Oct 22, 2012 at 3:48 PM, T

RE: Best practice to index a large crawl through Solr?

2012-10-22 Thread Markus Jelsma
Hi - Hadoop can write more records per second than Solr can analyze and store, 
especially with multiple reducers (threads in Solr). SolrCloud is notoriously 
slow when it comes to indexing compared to a stand-alone setup. However, this 
should not be a problem at all as you're not dealing with millions of records. 
Trying to tie HBase as a backend to Solr is not a good idea at all. The best 
and fastest storage for Solr is a local disk with MMapDirectory enabled (the 
default in recent versions) and plenty of RAM. Keep in mind that Solr keeps 
several parts of the index in memory, and more if it can, and is very efficient 
at doing that.

With only a few million records it's easy and fast enough to run Hadoop locally 
(or pseudo if you can) and have a single Solr node running.
 
-Original message-
> From:Thilina Gunarathne 
> Sent: Mon 22-Oct-2012 22:35
> To: user@nutch.apache.org
> Subject: Re: Best practice to index a large crawl through Solr?
> 
> Hi Alex,
> Thanks again for the information.
> 
> My current requirement is to implement a  simple searching application for
> a publication. Our current data sizes probably would not exceed the amount
> of records you mentioned and for now, we should be fine with a single Solr
> instance. I'm going to check out the SolrCloud for our future needs.
> 
> >Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> >sound pretty crazy.
> I agree :).. Unfortunately (or may be luckily) I do not have much time to
> invest on this and I'll probably have to rely on the existing tools, rather
> than trying to reinvent the wheels :)..
> 
> thanks,
> Thilina
> 
> 
> On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres <
> acace...@hyperiongray.com> wrote:
> 
> > No problem. Wrt to your first question, Solr would actually be storing
> > this data locally. Solr sharding actually uses its own mechanism
> > called SolrCloud. I'd recommend checking it out here:
> > http://wiki.apache.org/solr/SolrCloud, it seems cool though I have not
> > used it myself.
> >
> > Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> > sound pretty crazy. You can most definitely find a more efficient way
> > to do this, either by going to HBase directly from the start (I
> > wouldn't do so personally) or just using Solr. It might be good to
> > know what kind of application you are looking to build and asking more
> > specifically.
> >
> > Alex
> >
> > On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne 
> > wrote:
> > > Hi Alex,
> > > Thanks for the very fast response :)..
> > >
> > > It sort of depends on your purpose and the amount of data. I currently
> > >> have a single Solr instance (~1GB of memory, 2 processors on the
> > >> server) serving almost ~3,700,000 records from Nutch and it's still
> > >> working great for me. If you have around that I'd say a single Solr
> > >> instance is OK, depending on if you are planning on making your data
> > >> publicly available or not.
> > >>
> > > This is very useful information. In this case, would the Solr instance be
> > > retrieving and storing all the data locally or is it still using the
> > Nutch
> > > data store to retrieve the actual content while serving the queries?
> > >
> > >
> > >> If you're creating something larger of some sort, Solr 4.0, which
> > >> supports sharding natively would be a great option (I think it's still
> > >> in Beta, but if you're feeling brave...). This is especially true if
> > >> you are creating a search engine of some sort, or would like easily
> > >> searchable data.
> > >>
> > > That's interesting. I'll check that out. By any chance, do you know
> > whether
> > > the Solr sharding is using the HDFS to store the data or is it using it's
> > > own infrastructure?
> > >
> > >
> > >> I would imagine doing this directly from HBase would not be a great
> > >> option, as Nutch is storing the data in the format that is convenient
> > >> for Nutch itself to use, and not so much in a format that it is
> > >> friendly for you to reuse for your own purposes.
> > >>
> > > I was actually thinking  of a scenario where we would use Solr to index
> > the
> > > data and storing the resultant index in HBase.  Then using the HBase
> > > directly to perform simple index lookups..  Please pardon my lack of
> > > knowledge on Nutch and Solr, if the above sounds ludicrous :)..
> > >
> > > thanks,
> > > Thilina
> > >
> > >
> > >> IMO your best bet is going to try out Solr 4.0.
> > >>
> > >> Alex
> > >>
> > >> On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne 
> > >> wrote:
> > >> > Dear All,
> > >> > What would be the best practice to index a large crawl using Solr? The
> > >> > crawl is performed on a multi node Hadoop cluster using HBase as the
> > back
> > >> > end.. Would Solr become a bottleneck if we use just a single Solr
> > >> instance?
> > >> >  Is it possible to store the indexed data on HBase and to serve them
> > from
> > >> > the HBase it self?
> > >> >
> > >> > thanks a lot,
> > >> > Thilin

RE: RegEx URL Normalizer

2012-10-22 Thread Markus Jelsma
Hi,

Check the bottom normalizer: it uses the lookbehind operator to remove double
slashes, except the first two (the ones after the protocol).

Cheers,

http://svn.apache.org/viewvc/nutch/trunk/conf/regex-normalize.xml.template?view=markup
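
To illustrate the idea (this is a sketch along the lines of the template rule, not a
copy of it): a negative lookbehind keeps the slashes that follow a colon, so the
protocol's // survives while later runs of slashes are collapsed. Note that < has to
be XML-escaped inside the pattern element:

<regex>
  <!-- collapse runs of slashes not preceded by ':' (sketch) -->
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>

For the conditional case (only strip parameters when some literal such as a domain is
present in the URL), keep in mind that Java only supports bounded lookbehind, so it is
usually easier to match the literal itself in the pattern and put it back through a
capture group in the substitution.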
 
 
-Original message-
> From:Magnús Skúlason 
> Sent: Mon 22-Oct-2012 00:34
> To: user@nutch.apache.org
> Cc: dkavr...@gmail.com; Markus Jelsma 
> Subject: Re: RegEx URL Normalizer
> 
> Hi,
> 
> I am interested in doing this i.e. only strip out parameters from url
> if some other string is found as well, in my case it will be a domain
> name. I am using 1.5.1 but I am unfamiliar with the look-behind
> operator.
> 
> Does anyone have a sample of how this is done?
> 
> best regards,
> Magnus
> 
> On Thu, Sep 8, 2011 at 12:14 PM, Alexander Fahlke
>  wrote:
> > Thanks guys!
> >
> > @Dinçer: This does not check if the URL contains "document.py". :(
> >
> > @Markus: Unfortunately I have to use nutch-1.2 so I decided to customize
> > RegexURLNormalizer. ;)
> >
> >   -->  regexNormalize(String urlString, String scope) { ...
> >
> >   It now simple stupid checks if urlString contains "document.py" and then
> > cuts out the unwanted stuff.
> >   I made this is even configurable via nutch-site.xml.
> >
> >
> > Nutch 1.4 would be better for this. Maybe in the next project.
> >
> >
> > BR
> >
> > On Wed, Sep 7, 2011 at 2:34 PM, Dinçer Kavraal  wrote:
> >
> >> Hi Alexander,
> >>
> >> Would this one work? (I am far away from a Nutch installation to test)
> >>
> >> (?:[&?](?:Date|Sort|Page|pos|anz)=[^&?]+|([?&](?:Name|Art|Blank|nr)=[^&?]*))
> >>
> >> Don't forget to use & instead of & in the regex.
> >>
> >> Best,
> >> Dinçer
> >>
> >>
> >> 2011/9/5 Alexander Fahlke 
> >>
> >>> Hi!
> >>>
> >>> I have problems with the right setup of the RegExURLNormalizer. It should
> >>> strip out some parameters for a specific script.
> >>> Only pages where "document.py" is present should be normalized.
> >>>
> >>> Here is an example:
> >>>
> >>>  Input:
> >>>
> >>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&Date=2000&Sort=1&Page=109&nr=16519&pos=1644&anz=1952&Blank=1.pdf
> >>>  Output:
> >>>
> >>> http://www.example.com/cgi-bin/category/document.py?Name=Alex&Art=en&nr=16519&Blank=1.pdf
> >>>
> >>> Date, Sort, Page, pos, anz are the parameters to be stripped out.
> >>>
> >>> I tried it with the following setup:
> >>>
> >>>  ([;_]?((?i)l|j|bv_)?((?i)date|
> >>> sort|page|pos|anz)=.*?)(\?|&|#|$)
> >>>
> >>>
> >>> How to tell nutch to use this regex only for pages with "document.py"?
> >>>
> >>>
> >>> BR
> >>>
> >>> --
> >>> Alexander Fahlke
> >>> Software Development
> >>> www.informera.de
> >>>
> >>
> >>
> >
> >
> > --
> > Alexander Fahlke
> > Software Development
> > www.informera.de
> 


RE: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread Markus Jelsma
You would have to check the generator code to make sure. But why would you want
to distribute the queue for a single domain over multiple mappers? A single locally
running mapper without parsing, on a low-end machine, can easily fetch 20-40
records per second from the same domain (if the site allows it). At that
speed you can easily fetch a few million records in a day or so.

-Original message-
> From:shri_s_ram 
> Sent: Thu 18-Oct-2012 23:11
> To: user@nutch.apache.org
> Subject: RE: Nutch generate fetch lists for a single domain (but with 
> multiple urls) crawl
> 
> Thanks.. But I thought there would be a way around it..
> Is it possible even to have multiple fetch lists generated (for this
> problem) at all by tweaking some parameters?
> 
> [I am thinking of something like partition.url.mode - byRandom]  
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-generate-fetch-lists-for-a-single-domain-but-with-multiple-urls-crawl-tp4014573p4014582.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread Markus Jelsma
Hi - the generator tool partitions URLs by host, domain or IP address, so for a single
domain they'll all end up in the same fetch list. Since you're doing only one domain
there is no need to run additional mappers. If you want to crawl them as fast as you
can (and you are allowed to do that), use only one mapper and increase the
number of threads and the number of threads per queue.

Keep in mind that it is considered impolite to crawl a host with too many
threads and too little delay between successive fetches. You can do it if you
own the host or have an agreement to do so. Reuters.com won't appreciate many
URLs fetched with 10 threads and no delay.
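
For reference, the knobs involved look roughly like this in nutch-site.xml (a sketch
with made-up values; property names as in recent 1.x releases, adjust to whatever the
site owner allows):

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>

Since everything ends up in one queue, fetcher.threads.per.queue is what actually
bounds the concurrency against that single domain.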
 
-Original message-
> From:shri_s_ram 
> Sent: Thu 18-Oct-2012 22:40
> To: user@nutch.apache.org
> Subject: Nutch generate fetch lists for a single domain (but with multiple 
> urls) crawl
> 
> Hi I am using Apache Nutch to crawl a website (say reuters.com). My seed urls
> are like the following
> 1.http://www.reuters.com/news/archive?view=page&page=1&pageSize=10,
> 2.http://www.reuters.com/news/archive?view=page&page=2&pageSize=10.. Now
> when I use the crawl command with 100 mapred.map.tasks parameter and
> partition.url.mode - byHost, Nutch generates 100 fetch lists but only one of
> them has all the urls. This in turn meant that out of 100 fetch jobs one of
> them takes a long time (actually all the time) I need to fetch urls from the
> same domain (but different urls) in multiple fetch jobs. Can someone help me
> out with the parameter setting for the same? Is this possible? Cheers
> Shriram Sridharan 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Nutch-generate-fetch-lists-for-a-single-domain-but-with-multiple-urls-crawl-tp4014573.html
> Sent from the Nutch - User mailing list archive at Nabble.com.


RE: Fetcher Thread

2012-10-18 Thread Markus Jelsma
Hi Ye,
 
-Original message-
> From:Ye T Thet 
> Sent: Thu 18-Oct-2012 15:46
> To: user@nutch.apache.org
> Subject: Fetcher Thread
> 
> Hi Folks,
> 
> I have two questions about the Fetcher Thread in Nutch. The value
> fetcher.threads.fetch in configuration file determines the number of
> threads the Nutch would use to fetch. Of course threads.per.host is also
> used for politeness.
> 
> I set 100 for fetcher.threads.fetch and 2 for threads.per.host value. So
> far on my development I have been using only one linux box to fetch thus it
> is clear that Nutch would fetch 100 urls at time provided that the
> threads.per.host criteria is met.
> 
> The questions are:
> 
> 1. What if I crawl on a hadoop cluster with with 5 linux box and set the
> fetcher.threads.fetch to 100? Would Nutch fetch 100 url at time or 500 (5 x
> 100) at time?

All nodes are isolated and don't know what the others are doing. So if you set
the threads to 100 for each machine, each machine will run 100 threads, i.e. 500
in total across 5 nodes.

> 
> 2. Any advise on formulating optimum fetcher.threads.fetch and
> threads.per.host for a hadoop cluster with 5 linux box (Amazon EC2 medium
> instance, 3.7 GB memory). I would be crawling around 10,000 (10k) web
> sites.

I think threads per host should not exceed 1 for most websites, out of politeness.
You can set the total number of threads as high as you like; it only takes more
memory. If you parse in the fetcher as well, run far fewer threads.

> 
> Thanks,
> 
> Ye
> 


RE: Search in specific website

2012-10-16 Thread Markus Jelsma
There should be a host field if the more-indexing filter plugin is enabled,
which it is by default. Therefore ?q=content:"find me"&fq=host:www.example.org
should work correctly. 
 
-Original message-
> From:Lewis John Mcgibbney 
> Sent: Tue 16-Oct-2012 15:27
> To: user@nutch.apache.org
> Subject: Re: Search in specific website
> 
> Hi Tolga,
> 
> As Alejandro noted, we are fundamentally talking about correct Solr
> query syntax to ensure you get the granularity you require in order to
> retrieve the data/content you require... so far you've mentioned
> little regarding the fields you're creating on the Nutch side so this
> justifies my comment to head over to Solr lists...
> 
> hth
> 
> Lewis
> 
> On Tue, Oct 16, 2012 at 2:01 PM, Tolga  wrote:
> > Solr sent me to Nutch list, but okay. Thanks,
> >
> > On 10/16/2012 02:27 PM, Lewis John Mcgibbney wrote:
> >>
> >> Hi Tolga,
> >>
> >> Please take this to the Solr user@ list.
> >>
> >> Thank you
> >>
> >> Lewis
> >>
> >> On Tue, Oct 16, 2012 at 12:13 PM, Tolga  wrote:
> >>>
> >>> Hi,
> >>>
> >>> I've tried url:fass\.sabanciuniv\.edu AND content:this, and I got results
> >>> from both my URLs. What to do?
> >>>
> >>> Regards,
> >>>
> >>>
> >>> On 10/13/2012 12:48 AM, Alejandro Caceres wrote:
> 
>  Once you've indexed it with Solr this can be done using Solr Query
>  Syntax. Essentially what you're asking boils down to a Solr question.
>  In your example situation you could do something like this in Solr:
> 
>  url:example\.net AND content:
> 
>  ...or something of the sort. This will search a url with example.net
>  in it for whatever content you're looking for. Is this what you are
>  looking for? If not we need more details of what you have tried and
>  what issues you are having.
> 
>  On Fri, Oct 12, 2012 at 5:03 PM, Tolga  wrote:
> >
> > Not really. Let me elaborate. If I pass it multiple URLs such as
> > http://example.com, example.net and example.org, how can I search only
> > in
> > net?
> >
> > Regards,
> >
> > On 12 October 2012 23:55, Tejas Patil  wrote:
> >
> >> Hi Tolga,
> >>
> >> For searching a specific content from a specific website, crawl it
> >> first,
> >> then index it and search for the term after loading indexes over Solr.
> >> Does that really answer your question ?
> >>
> >> Thanks,
> >> Tejas
> >>
> >> On Fri, Oct 12, 2012 at 12:55 PM, Tolga  wrote:
> >>
> >>> Hi,
> >>>
> >>> I use nutch to crawl my website and index to solr. However, how can I
> >>> search for piece of content in a specific website? I use multiple
> >>> URL's
> >>>
> >>> Regards,
> >>>
> 
> >>
> >>
> >
> 
> 
> 
> -- 
> Lewis
> 


RE: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

2012-10-16 Thread Markus Jelsma
No, it doesn't work because of the old PDFBox version you are using. You need 
Tika 1.2 or higher.

 
 
-Original message-
> From:kiran chitturi 
> Sent: Tue 16-Oct-2012 01:32
> To: user@nutch.apache.org
> Subject: Re: nutch - Status: failed(2,200): 
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> 
> When i tried the command 'sh bin/nutch parsechecker
> http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf' the
> logs (hadoop.log) says
> 
> parse.ParserFactory - The parsing plugins:
> > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > plugin.includes system property, and all claim to support the content type
> > application/pdf, but they are not mapped to it  in the parse-plugins.xml
> > file
> > 2012-10-15 19:04:23,733 WARN  pdfparser.PDFParser - Parsing Error,
> > Skipping Object
> > java.io.IOException: expected='endstream' actual=''
> > org.apache.pdfbox.io.PushBackInputStream@215983b7
> > at
> > org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:530)
> > at
> > org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
> > at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
> > at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
> > at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:123)
> > at
> > org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:96)
> > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> > at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> > at
> > java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > at java.lang.Thread.run(Thread.java:680)
> > 2012-10-15 19:04:23,734 WARN  pdfparser.XrefTrailerResolver - Did not
> > found XRef object at specified startxref position 0
> > 2012-10-15 19:04:23,933 INFO  crawl.SignatureFactory - Using Signature
> > impl: org.apache.nutch.crawl.MD5Signature
> > 2012-10-15 19:04:23,944 INFO  parse.ParserChecker - parsing:
> > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
> 
> 
> Does this has anything to do with content limit, or is this other kind of
> error ?
> 
> Thanks for the help.
> 
> Regards,
> Kiran.
> 
> On Mon, Oct 15, 2012 at 5:20 PM, Markus Jelsma
> wrote:
> 
> > Hi,
> >
> > It complains about not finding a Tika parser for the content type, did you
> > modify parse-plugins.xml? I can run it with a vanilla 1.4 but it fails
> > because of PDFbox. I can parse it successfully with trunk, 1.5 is not going
> > to work, not because it cannot find the TikaParser for PDFs but becasue
> > PDFBox cannot handle it.
> >
> > Cheers,
> >
> >
> > -Original message-
> > > From:kiran chitturi 
> > > Sent: Mon 15-Oct-2012 21:58
> > > To: user@nutch.apache.org
> > > Subject: nutch - Status: failed(2,200):
> > org.apache.nutch.parse.ParseException: Unable to successfully parse content
> > >
> > > Hi,
> > >
> > > I am trying to parse pdf files using nutch and its failing everytime with
> > > the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
> > > Unable to successfully parse content' in both nutch 1.5 and 2.x series
> > when
> > > i do the command 'sh bin/nutch parsechecker
> > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
> > >
> > > The hadoop.log looks like this
> > >
> > > >
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.host = null
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.port = 8080
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.timeout = 1
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.content.limit = -1
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.agent = My Nutch
> > > > Spider/Nutch-2.2-SNAPSHOT
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept.language =
> > > > en-us,en-gb,en;q=0.7,*;q=0.3
> > > > 2012-10-15 1

RE: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

2012-10-15 Thread Markus Jelsma
We use a modified version of trunk with updated dependencies but it works with 
1.5 as well. Having Tika 1.2 or higher fixes the problem. We did change the 
parse-plugins and the Tika plugin's plugin.xml but it comes down to pointing 
the PDF content type to the Tika parser, which is enabled by default.
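
For completeness, the parse-plugins.xml change boils down to a mapping like the sketch
below (the alias entry is only needed if it is not already present in the stock file):

<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>

<aliases>
  <alias name="parse-tika"
         extension-id="org.apache.nutch.parse.tika.TikaParser" />
</aliases>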
 
-Original message-
> From:kiran chitturi 
> Sent: Mon 15-Oct-2012 23:59
> To: user@nutch.apache.org
> Subject: Re: nutch - Status: failed(2,200): 
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> 
> What configuration did you use in nutch-site.xml ? Does it work for you
> with the 2.x version ?
> 
> Thanks,
> Kiran.
> 
> On Mon, Oct 15, 2012 at 5:20 PM, Markus Jelsma
> wrote:
> 
> > Hi,
> >
> > It complains about not finding a Tika parser for the content type, did you
> > modify parse-plugins.xml? I can run it with a vanilla 1.4 but it fails
> > because of PDFbox. I can parse it successfully with trunk, 1.5 is not going
> > to work, not because it cannot find the TikaParser for PDFs but becasue
> > PDFBox cannot handle it.
> >
> > Cheers,
> >
> >
> > -Original message-
> > > From:kiran chitturi 
> > > Sent: Mon 15-Oct-2012 21:58
> > > To: user@nutch.apache.org
> > > Subject: nutch - Status: failed(2,200):
> > org.apache.nutch.parse.ParseException: Unable to successfully parse content
> > >
> > > Hi,
> > >
> > > I am trying to parse pdf files using nutch and its failing everytime with
> > > the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
> > > Unable to successfully parse content' in both nutch 1.5 and 2.x series
> > when
> > > i do the command 'sh bin/nutch parsechecker
> > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
> > >
> > > The hadoop.log looks like this
> > >
> > > >
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.host = null
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.port = 8080
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.timeout = 1
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.content.limit = -1
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.agent = My Nutch
> > > > Spider/Nutch-2.2-SNAPSHOT
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept.language =
> > > > en-us,en-gb,en;q=0.7,*;q=0.3
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept =
> > > > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > > > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - parsing:
> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
> > > > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - contentType:
> > > > application/pdf
> > > > 2012-10-15 15:43:36,858 INFO  crawl.SignatureFactory - Using Signature
> > > > impl: org.apache.nutch.crawl.MD5Signature
> > > > 2012-10-15 15:43:36,904 INFO  parse.ParserFactory - The parsing
> > plugins:
> > > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > > > plugin.includes system property, and all claim to support the content
> > type
> > > > application/pdf, but they are not mapped to it  in the
> > parse-plugins.xml
> > > > file
> > > > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika
> > parser
> > > > for mime-type application/pdf
> > > > 2012-10-15 15:43:36,969 WARN  parse.ParseUtil - Unable to successfully
> > > > parse content
> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of
> > > > type application/pdf
> > >
> > >
> > > The config file nutch-site.xml is as below:
> > >
> > >  
> > > >
> > > > 
> > > > 
> > > > 
> > > > 
> > > >  http.agent.name
> > > >  My Nutch Spider
> > > > 
> > > >
> > > > 
> > > > plugin.folders
> > > >
> > /Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins
> > > > 
> > > > 
> > > >
> > > > 
> > > > plugin.includes
> > > > 
> > > >
> > protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
> > > > 
> > > > 
> > >

RE: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

2012-10-15 Thread Markus Jelsma
Tika 1.2 has not yet been committed to the 2.x branch so it won't work in any 
case for this specific file. You can help in confirming the ticket so it can be 
committed.

https://issues.apache.org/jira/browse/NUTCH-1433
 
 
-Original message-
> From:kiran chitturi 
> Sent: Mon 15-Oct-2012 23:54
> To: user@nutch.apache.org
> Subject: Re: nutch - Status: failed(2,200): 
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> 
> I did not change parse-plugins.xml at all. I am using the 2.x branch.
> 
> Many Thanks,
> Kiran.
> 
> On Mon, Oct 15, 2012 at 5:20 PM, Markus Jelsma
> wrote:
> 
> > Hi,
> >
> > It complains about not finding a Tika parser for the content type, did you
> > modify parse-plugins.xml? I can run it with a vanilla 1.4 but it fails
> > because of PDFbox. I can parse it successfully with trunk, 1.5 is not going
> > to work, not because it cannot find the TikaParser for PDFs but becasue
> > PDFBox cannot handle it.
> >
> > Cheers,
> >
> >
> > -Original message-
> > > From:kiran chitturi 
> > > Sent: Mon 15-Oct-2012 21:58
> > > To: user@nutch.apache.org
> > > Subject: nutch - Status: failed(2,200):
> > org.apache.nutch.parse.ParseException: Unable to successfully parse content
> > >
> > > Hi,
> > >
> > > I am trying to parse pdf files using nutch and its failing everytime with
> > > the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
> > > Unable to successfully parse content' in both nutch 1.5 and 2.x series
> > when
> > > i do the command 'sh bin/nutch parsechecker
> > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
> > >
> > > The hadoop.log looks like this
> > >
> > > >
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.host = null
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.port = 8080
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.timeout = 1
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.content.limit = -1
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.agent = My Nutch
> > > > Spider/Nutch-2.2-SNAPSHOT
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept.language =
> > > > en-us,en-gb,en;q=0.7,*;q=0.3
> > > > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept =
> > > > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > > > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - parsing:
> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
> > > > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - contentType:
> > > > application/pdf
> > > > 2012-10-15 15:43:36,858 INFO  crawl.SignatureFactory - Using Signature
> > > > impl: org.apache.nutch.crawl.MD5Signature
> > > > 2012-10-15 15:43:36,904 INFO  parse.ParserFactory - The parsing
> > plugins:
> > > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > > > plugin.includes system property, and all claim to support the content
> > type
> > > > application/pdf, but they are not mapped to it  in the
> > parse-plugins.xml
> > > > file
> > > > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika
> > parser
> > > > for mime-type application/pdf
> > > > 2012-10-15 15:43:36,969 WARN  parse.ParseUtil - Unable to successfully
> > > > parse content
> > > > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of
> > > > type application/pdf
> > >
> > >
> > > The config file nutch-site.xml is as below:
> > >
> > >  
> > > >
> > > > 
> > > > 
> > > > 
> > > > 
> > > >  http.agent.name
> > > >  My Nutch Spider
> > > > 
> > > >
> > > > 
> > > > plugin.folders
> > > >
> > /Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins
> > > > 
> > > > 
> > > >
> > > > 
> > > > plugin.includes
> > > > 
> > > >
> > protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
> > > > 
> > > > 
> > > > 
> > > > 
> > > > metatags.names
> > > > *
> > >

RE: nutch - Status: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

2012-10-15 Thread Markus Jelsma
Hi,

It complains about not finding a Tika parser for the content type; did you
modify parse-plugins.xml? I can run it with a vanilla 1.4 but it fails because
of PDFBox. I can parse it successfully with trunk. 1.5 is not going to work,
not because it cannot find the TikaParser for PDFs but because PDFBox cannot
handle it.

Cheers,
 
 
-Original message-
> From:kiran chitturi 
> Sent: Mon 15-Oct-2012 21:58
> To: user@nutch.apache.org
> Subject: nutch - Status: failed(2,200): 
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> 
> Hi,
> 
> I am trying to parse pdf files using nutch and its failing everytime with
> the status 'Status: failed(2,200): org.apache.nutch.parse.ParseException:
> Unable to successfully parse content' in both nutch 1.5 and 2.x series when
> i do the command 'sh bin/nutch parsechecker
> http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf'.
> 
> The hadoop.log looks like this
> 
> >
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.host = null
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.proxy.port = 8080
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.timeout = 1
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.content.limit = -1
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.agent = My Nutch
> > Spider/Nutch-2.2-SNAPSHOT
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept.language =
> > en-us,en-gb,en;q=0.7,*;q=0.3
> > 2012-10-15 15:43:32,323 INFO  http.Http - http.accept =
> > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - parsing:
> > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf
> > 2012-10-15 15:43:36,851 INFO  parse.ParserChecker - contentType:
> > application/pdf
> > 2012-10-15 15:43:36,858 INFO  crawl.SignatureFactory - Using Signature
> > impl: org.apache.nutch.crawl.MD5Signature
> > 2012-10-15 15:43:36,904 INFO  parse.ParserFactory - The parsing plugins:
> > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > plugin.includes system property, and all claim to support the content type
> > application/pdf, but they are not mapped to it  in the parse-plugins.xml
> > file
> > 2012-10-15 15:43:36,967 ERROR tika.TikaParser - Can't retrieve Tika parser
> > for mime-type application/pdf
> > 2012-10-15 15:43:36,969 WARN  parse.ParseUtil - Unable to successfully
> > parse content
> > http://scholar.lib.vt.edu/ejournals/JTE/v23n2/pdf/katsioloudis.pdf of
> > type application/pdf
> 
> 
> The config file nutch-site.xml is as below:
> 
>  
> >
> > 
> > 
> > 
> > 
> >  http.agent.name
> >  My Nutch Spider
> > 
> >
> > 
> > plugin.folders
> > /Users/kiranch/Documents/workspace/nutch-2.x/runtime/local/plugins
> > 
> > 
> >
> > 
> > plugin.includes
> > 
> > protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)
> > 
> > 
> > 
> > 
> > metatags.names
> > *
> >  Names of the metatags to extract, separated by;.
> >   Use '*' to extract all metatags. Prefixes the names with 'metatag.'
> >   in the parse-metadata. For instance to index description and keywords,
> >   you need to activate the plugin index-metadata and set the value of the
> >   parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
> > 
> > 
> > 
> >   index.parse.md
> >   
> > dc.creator,dc.bibliographiccitation,dcterms.issued,content-type,dcterms.bibliographiccitation,dc.format,dc.type,dc.language,dc.contributor,originalcharencoding,dc.publisher,dc.title,charencodingforconversion
> > 
> >   
> >   Comma-separated list of keys to be taken from the parse metadata to
> > generate fields.
> >   Can be used e.g. for 'description' or 'keywords' provided that these
> > values are generated
> >   by a parser (see parse-metatags plugin)
> >   
> > 
> > 
> > http.content.limit
> > -1
> > 
> > 
> >
> > Are there any configuration settings that i need to do to work with pdf
> files ? I have parsed them before and crawled but i am not sure which is
> causing the error now.
> 
> Can someone please point the cause of the errors above ?
> 
> Many Thanks,
> -- 
> Kiran Chitturi
> 


RE: issue about tika parse

2012-10-15 Thread Markus Jelsma
See the Tika formats page for more info:

http://tika.apache.org/1.2/formats.html#Microsoft_Office_document_formats
 
 
-Original message-
> From:宾军志 
> Sent: Mon 15-Oct-2012 12:14
> To: user@nutch.apache.org
> Subject: Re: issue about tika parse
> 
> Hi Tejas,
> 
> Thanks for your information. This issue has been resolved as your
> instruction.
> 
> BTW: currently which versions of MS office is supported by nutch?
> 
> BR,
> 
> Rock Bin
> 
> 2012/10/15 Tejas Patil 
> 
> > This can happen due to either of these:
> >
> > 1. This is probably due to the content having been trimmed during the
> > fetching. Try setting  http.content.limit to a larger
> > value
> > .
> > 2. If the file is huge, try increasing your parser.timeout
> > setting
> > .
> >
> > thanks,
> > Tejas Patil
> >
> > On Sun, Oct 14, 2012 at 7:50 PM, 宾军志  wrote:
> >
> > > Hi All,
> > >
> > > Currently I already have done the installation of nutch2.1 with hbase and
> > > it work well with html parsing.
> > > But when I try to parse a word document I got the below exception:
> > >
> > > 2012-10-14 17:56:04,686 INFO  crawl.SignatureFactory - Using Signature
> > > impl: org.apache.nutch.crawl.MD5Signature
> > > 2012-10-14 17:56:05,026 INFO  mapreduce.GoraRecordReader -
> > > gora.buffer.read.limit = 1
> > > 2012-10-14 17:56:05,048 INFO  mapreduce.GoraRecordWriter -
> > > gora.buffer.write.limit = 1
> > > 2012-10-14 17:56:05,054 INFO  crawl.SignatureFactory - Using Signature
> > > impl: org.apache.nutch.crawl.MD5Signature
> > > 2012-10-14 17:56:05,077 INFO  parse.ParserJob - Parsing
> > >
> > >
> > http://www.g12e.com/upload/html/2012/6/25/zhangw8867520120625134724265971.doc
> > > 2012-10-14 17:56:05,077 INFO  parse.ParserFactory - The parsing plugins:
> > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > > plugin.includes system property, and all claim to support the content
> > type
> > > application/x-tika-msoffice, but they are not mapped to it  in the
> > > parse-plugins.xml file
> > > 2012-10-14 17:56:05,164 ERROR tika.TikaParser - Error parsing
> > >
> > >
> > http://www.g12e.com/upload/html/2012/6/25/zhangw8867520120625134724265971.doc
> > > java.io.IOException: Invalid header signature; read 0x,
> > > expected 0xE11AB1A1E011CFD0
> > > at org.apache.poi.poifs.storage.HeaderBlock.(HeaderBlock.java:140)
> > > at org.apache.poi.poifs.storage.HeaderBlock.(HeaderBlock.java:115)
> > > at
> > >
> > >
> > org.apache.poi.poifs.filesystem.NPOIFSFileSystem.(NPOIFSFileSystem.java:265)
> > > at
> > >
> > org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
> > > at org.apache.nutch.parse.tika.TikaParser.getParse(Unknown Source)
> > > at org.apache.nutch.parse.ParseCallable.call(Unknown Source)
> > > at org.apache.nutch.parse.ParseCallable.call(Unknown Source)
> > > at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> > > at java.util.concurrent.FutureTask.run(Unknown Source)
> > > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> > > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> > > at java.lang.Thread.run(Unknown Source)
> > >
> > > Then I download this document to my local and try tika parse by command:
> > > ./bin/nutch plugin parse-tika
> > > org.apache.nutch.parse.tika.TikaParser
> > zhangw8867520120625134724265971.doc
> > > This command worked well.
> > >
> > > Anyone has idea about it?
> > >
> > > BR,
> > >
> > > Rock Bin
> > >
> >
> 


RE: solrj 4 integeration in nutch-1.* versions

2012-10-15 Thread Markus Jelsma
https://issues.apache.org/jira/browse/NUTCH-1377
 
 
-Original message-
> From:Julien Nioche 
> Sent: Mon 15-Oct-2012 10:01
> To: user@nutch.apache.org
> Subject: Re: solrj 4 integeration in nutch-1.* versions
> 
> Why don't you open a JIRA issue and contribute a patch for it? I think it
> would make sense to move to SOLR 4 and that would be a valuable contribution
> 
> Thanks
> 
> Julien
> 
> On 14 October 2012 08:06, nutch.bu...@gmail.com wrote:
> 
> > Hi
> > Are there any plans to integrate  solrj 4 (instead of 3.4.1), in nutch-1.*
> > series?
> > I'm using nutch-1.4 and solr 4.
> > It seems that solrj 3.4 can handle sending basic indexing requests to solr
> > 4
> > server - without any code changes.
> >
> > Though, if I want to use solrj 4 classes as CloudSolrServer, I need to
> > change  SolrWriter.java.
> >
> > Should be noted that solrj comes with httpclient 4, so seems that while
> > integrating this, changes should be made to other parts of the code that
> > use
> > httpclient 3.1
> >
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/solrj-4-integeration-in-nutch-1-versions-tp4013590.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> 
> 
> 
> -- 
> *
> *Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 


RE: Anchor text of current URL

2012-10-08 Thread Markus Jelsma
Hi - did you run the invertlinks program over your segments before indexing? 
 
-Original message-
> From:chethan 
> Sent: Mon 08-Oct-2012 04:28
> To: user@nutch.apache.org
> Subject: Anchor text of current URL
> 
> Hi,
> 
> In an indexing filter, is there a way to figure out the Anchor text from
> which the current URL/document originated from? I tried the inlinks but
> that seems to be null.
> 
> public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
> CrawlDatum datum, Inlinks inlinks) IndexingException {
> 
> *//Need to know the anchor text from which the current document
> originated from at this point*
> 
> }
> 
> Thanks
> Chethan
> 


RE: [ANNOUNCE] Apache Nutch 2.1 Released

2012-10-05 Thread Markus Jelsma
Cheers!

 
 
-Original message-
> From:Mattmann, Chris A (388J) 
> Sent: Fri 05-Oct-2012 18:39
> To:  
> Cc: user@nutch.apache.org
> Subject: Re: [ANNOUNCE] Apache Nutch 2.1 Released
> 
> Great job everyone!
> 
> Cheers,
> Chris
> 
> On Oct 5, 2012, at 9:29 AM, Julien Nioche wrote:
> 
> > Thanks Lewis and well done everyone!
> > Enjoy your week end
> > 
> > Julien
> > 
> > On 5 October 2012 16:12, lewis john mcgibbney  wrote:
> > Good Afternoon Everyone,
> > 
> > The Apache Nutch PMC are very pleased to announce the release of
> > Apache Nutch v2.1. This release continues to provide Nutch users with
> > a simplified Nutch distribution building on the 2.x development drive
> > which is growing in popularity amongst the community. As well as
> > addressing ~20 bugs this release also offers improved properties for
> > better Solr configuration, upgrades to various Gora dependencies and
> > the introduction of the option to build indexes in elastic search,
> > amongst various others.
> > 
> > A full PMC Announcement can be seen here [0]
> > 
> > Thanks you, have a great weekend on behalf of the Nutch community.
> > 
> > Lewis
> > 
> > [0] http://nutch.apache.org/#05+October+2012+-+Apache+Nutch+v2.1+Released
> > 
> > 
> > 
> > -- 
> > 
> > Open Source Solutions for Text Engineering
> > 
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> > 
> 
> 
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
> 
> 


RE: nutch-2.0 generate in deploy mode

2012-10-02 Thread Markus Jelsma
Hi - I don't know 2.0, but Hadoop's MapReduce is likely just taking advantage of
multiple CPU cores. 
 
-Original message-
> From:alx...@aim.com 
> Sent: Tue 02-Oct-2012 04:15
> To: user@nutch.apache.org
> Subject: nutch-2.0  generate in  deploy mode
> 
> Hello,
> 
> I use nutch-2.0 with hadoop-0.20.2. bin/nutch generate  command takes 87% of 
> cpu  in deploy mode versus 18% in local mode.
> Any ideas how to fix this issue?
> 
> Thanks.
> Alex.
> 


RE: priorised/scored fetching

2012-10-02 Thread Markus Jelsma
Hi - There's nothing like that yet. What you can do is run a custom URL filter
for the generate step that allows only HTML files, and use your standard URL
filter for the other steps.
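
A minimal sketch of that idea, assuming the regex URL filter is enabled: keep a second
rules file for the HTML-only rounds (the file name below is made up) and point
urlfilter.regex.file at it only when running the generate job, for example via a -D
override on the command line or a separate conf dir if your version does not pass
generic options through.

# generate-only-regex-urlfilter.txt (hypothetical): skip binary document types
-\.(pdf|doc)$
+.

Once the HTML rounds are done, switch to a variant that admits pdf, then doc, for the
later generate runs, while the standard filter keeps applying to fetch/parse/update.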

 
 
-Original message-
> From:Stefan Scheffler 
> Sent: Tue 02-Oct-2012 09:24
> To: user@nutch.apache.org
> Subject: priorised/scored fetching
> 
> Hi.
> I crawl a webdatabase for *.html, *.pdf and *.doc documents, with a 
> given topN. I want nutch to fetch first all of the html documents, then 
> pdf and at last doc, because html is more important than pdf and so on.
> Is there a way to make nutch follow such rules (maybe with a scoring 
> algorithm)?
> 
> Regards
> Stefan
> 
> -- 
> Stefan Scheffler
> Avantgarde Labs GbR
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: sscheff...@avantgarde-labs.de
> 
> 


RE: Parsing/Indexing alt tag

2012-10-01 Thread Markus Jelsma
You can write a simple parse filter plugin. With the NodeWalker you can walk 
all nodes of the DOM and get the alt attribute for img tags.

// Inside your HtmlParseFilter implementation; needs org.apache.nutch.util.NodeWalker,
// org.w3c.dom.Node/NamedNodeMap/Attr and java.util.HashMap.
NodeWalker walker = new NodeWalker(doc);
while (walker.hasNext()) {
  Node currentNode = walker.nextNode();
  if (currentNode.getNodeType() == Node.ELEMENT_NODE
      && "img".equalsIgnoreCase(currentNode.getNodeName())) {
    HashMap<String,String> atts = getAttributes(currentNode);
    String alt = atts.get("alt");
    // add the alt text to the parse metadata here so an indexing filter can pick it up
  }
}

protected HashMap<String,String> getAttributes(Node node) {
  HashMap<String,String> attribMap = new HashMap<String,String>();

  NamedNodeMap attributes = node.getAttributes();

  for (int i = 0; i < attributes.getLength(); i++) {
    Attr attribute = (Attr) attributes.item(i);
    attribMap.put(attribute.getName().toLowerCase(), attribute.getValue());
  }

  return attribMap;
}

-Original message-
> From:Alexandre 
> Sent: Mon 01-Oct-2012 15:05
> To: user@nutch.apache.org
> Subject: Re: Parsing/Indexing alt tag
> 
> Hi Patrick,
> 
> I have the same Problem.
> Did you find a way to parse the alt attributes without rewrite a complet
> parse plugin?
> 
> Alex.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Parsing-Indexing-alt-tag-tp3999540p4011181.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: Indexing Exception

2012-09-24 Thread Markus Jelsma
It is hardcoded to process the `content` field only but it could be changed to 
process any string field. 
 
-Original message-
> From:Stefan Scheffler 
> Sent: Mon 24-Sep-2012 12:27
> To: user@nutch.apache.org
> Subject: Re: Indexing Exception
> 
> Hey,
> Thank you i used this method in the meantime for me and it worked fine.
> Is there a general way to do the encoding to utf8 to this field in 
> Nutchg as well?
> 
> On 24.09.2012 12:04, Markus Jelsma wrote:
> > Hi Stefan,
> >
> > You can take the stripNonCharCodepoints() method and pass your content 
> > through it. It should fix the problem.
> >
> > Cheers,
> >   
> > -Original message-
> >> From:Stefan Scheffler 
> >> Sent: Mon 24-Sep-2012 11:23
> >> To: user@nutch.apache.org
> >> Subject: Re: Indexing Exception
> >>
> >> Hey Markus. you gave me the right hint.
> >> Additionally to the normally content field i added a field fullcontent,
> >> which simply holds the html document of the relevant content field,
> >> because we need this in a later proccessing step. This field is not
> >> encoded like the content field. I realised this with an own
> >> ParsingFilter, which stores it in to  the ParseResult and then an
> >> Indexingfilter merges it into the NutchDocument.
> >>
> >> Is there a way to do this better or just do the encoding to the
> >> fullcontent like to the content?
> >>
> >> Regards
> >> Stefan
> >> On 24.09.2012 10:41, Markus Jelsma wrote:
> >>> It was fixed for the content field with 1016. Can you pinpoint the 
> >>> problematic field?
> >>> https://issues.apache.org/jira/browse/NUTCH-1016
> >>>
> >>>
> >>>
> >>> -Original message-
> >>>> From:Stefan Scheffler 
> >>>> Sent: Mon 24-Sep-2012 10:37
> >>>> To: user@nutch.apache.org
> >>>> Subject: Re: Indexing Exception
> >>>>
> >>>> nutch 1.5, solr 3.6
> >>>> On 24.09.2012 10:34, Markus Jelsma wrote:
> >>>>> Hi - What version?
> >>>>>
> >>>>> 
> >>>>> 
> >>>>> -Original message-
> >>>>>> From:Stefan Scheffler 
> >>>>>> Sent: Mon 24-Sep-2012 10:29
> >>>>>> To: user@nutch.apache.org
> >>>>>> Subject: Indexing Exception
> >>>>>>
> >>>>>> Hello,
> >>>>>> I have a strange Problem. While indexing a crawl to solr i got the
> >>>>>> following exception
> >>>>>>
> >>>>>> java.lang.RuntimeException: [was class java.io.CharConversionException]
> >>>>>> Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
> >>>>>> at
> >>>>>> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >>>>>> at
> >>>>>> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> >>>>>> at
> >>>>>> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> >>>>>> at
> >>>>>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >>>>>> at 
> >>>>>> org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
> >>>>>> at 
> >>>>>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
> >>>>>> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
> >>>>>> at
> >>>>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
> >>>>>> at
> >>>>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
> >>>>>> at
> >>>>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
> >>>>>> at
> >>>>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> >>>>>> at
> >>>>>> org.mortbay.jetty.servlet.ServletHandler$Ca

RE: Indexing Exception

2012-09-24 Thread Markus Jelsma
Hi Stefan,

You can take the stripNonCharCodepoints() method and pass your content through 
it. It should fix the problem.
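
As a rough sketch of what such a method does (an approximation, not the exact Nutch
implementation): drop code points that are illegal in XML 1.0, such as 0xFFFE, before
the value goes into the NutchDocument.

// Approximation of a stripNonCharCodepoints-style cleaner; keeps only characters
// that are legal in XML 1.0 (plus surrogates, assuming they come in valid pairs).
public static String stripNonCharCodepoints(String input) {
  StringBuilder retval = new StringBuilder(input.length());
  for (int i = 0; i < input.length(); i++) {
    char c = input.charAt(i);
    if (c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
      retval.append(c);
    }
  }
  return retval.toString();
}

Apply it to the fullcontent value in your indexing filter before the field is added
to the document.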

Cheers,
 
-Original message-
> From:Stefan Scheffler 
> Sent: Mon 24-Sep-2012 11:23
> To: user@nutch.apache.org
> Subject: Re: Indexing Exception
> 
> Hey Markus. you gave me the right hint.
> Additionally to the normally content field i added a field fullcontent, 
> which simply holds the html document of the relevant content field, 
> because we need this in a later proccessing step. This field is not 
> encoded like the content field. I realised this with an own 
> ParsingFilter, which stores it in to  the ParseResult and then an 
> Indexingfilter merges it into the NutchDocument.
> 
> Is there a way to do this better or just do the encoding to the 
> fullcontent like to the content?
> 
> Regards
> Stefan
> On 24.09.2012 10:41, Markus Jelsma wrote:
> > It was fixed for the content field with 1016. Can you pinpoint the 
> > problematic field?
> > https://issues.apache.org/jira/browse/NUTCH-1016
> >
> >   
> >   
> > -Original message-
> >> From:Stefan Scheffler 
> >> Sent: Mon 24-Sep-2012 10:37
> >> To: user@nutch.apache.org
> >> Subject: Re: Indexing Exception
> >>
> >> nutch 1.5, solr 3.6
> >> On 24.09.2012 10:34, Markus Jelsma wrote:
> >>> Hi - What version?
> >>>
> >>>
> >>>
> >>> -Original message-
> >>>> From:Stefan Scheffler 
> >>>> Sent: Mon 24-Sep-2012 10:29
> >>>> To: user@nutch.apache.org
> >>>> Subject: Indexing Exception
> >>>>
> >>>> Hello,
> >>>> I have a strange Problem. While indexing a crawl to solr i got the
> >>>> following exception
> >>>>
> >>>> java.lang.RuntimeException: [was class java.io.CharConversionException]
> >>>> Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
> >>>>at
> >>>> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >>>>at
> >>>> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> >>>>at
> >>>> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> >>>>at
> >>>> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >>>>at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
> >>>>at 
> >>>> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
> >>>>at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
> >>>>at
> >>>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
> >>>>at
> >>>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >>>>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
> >>>>at
> >>>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
> >>>>at
> >>>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> >>>>at
> >>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >>>>at
> >>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >>>>at
> >>>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >>>>at
> >>>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> >>>>at
> >>>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >>>>at
> >>>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >>>>at
> >>>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >>>>at
> >>>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >>>>at
> >>>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> >>>>at org.mortbay.jetty.Server.handle(Server.java:326)
> >>>>at
> >>>> org.mortbay.jetty.Htt

RE: External domain redirection with db.ignore.external.links=true

2012-09-24 Thread Markus Jelsma
Hi - You can use the domain URL filter to manually whitelist domains.  
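
A sketch of that setup, assuming the urlfilter-domain plugin (property and default file
name as in 1.x; double-check your version): enable urlfilter-domain in plugin.includes,
leave db.ignore.external.links off, and list every domain you want to stay inside,
including the redirect targets such as ikea.com, one per line in the file below.

<property>
  <name>urlfilter.domain.file</name>
  <value>domain-urlfilter.txt</value>
</property>

With the domain filter in place, URLs outside the listed domains are filtered out, so
you get the same containment without losing the cross-domain redirects you do want.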
 
-Original message-
> From:Alexandre 
> Sent: Mon 24-Sep-2012 09:19
> To: user@nutch.apache.org
> Subject: External domain redirection with db.ignore.external.links=true
> 
> Hi,
> 
> I've a question concerning redirection to external domain.
> I crawl different websites, but I don't want to crawl external links. For
> that I used the option 
> db.ignore.external.links=true
> It's working fine. But my problem is, that the websites using redirection to
> an external domain are not crawled.
> For exemple:
> http://www.ikea.at  is redirected to http://www.ikea.com/at/de/ and my
> crawler ignore this website because of the option
> db.ignore.external.links=true.
> 
> A solution could be to use directly the url  http://www.ikea.com/at/de/ in
> the seed list, but this is not an option for me, because I can not change
> this list.
> 
> Is there any possibility in Nutch to authorize to crawl websites that are
> redirected to external domains, and ignore external links?
> 
> Thank for your help,
> 
> Alex.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/External-domain-redirection-with-db-ignore-external-links-true-tp4009783.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: Indexing Exception

2012-09-24 Thread Markus Jelsma
It was fixed for the content field with NUTCH-1016. Can you pinpoint the problematic
field?
https://issues.apache.org/jira/browse/NUTCH-1016

 
 
-Original message-
> From:Stefan Scheffler 
> Sent: Mon 24-Sep-2012 10:37
> To: user@nutch.apache.org
> Subject: Re: Indexing Exception
> 
> nutch 1.5, solr 3.6
> On 24.09.2012 10:34, Markus Jelsma wrote:
> > Hi - What version?
> >
> >   
> >   
> > -Original message-
> >> From:Stefan Scheffler 
> >> Sent: Mon 24-Sep-2012 10:29
> >> To: user@nutch.apache.org
> >> Subject: Indexing Exception
> >>
> >> Hello,
> >> I have a strange Problem. While indexing a crawl to solr i got the
> >> following exception
> >>
> >> java.lang.RuntimeException: [was class java.io.CharConversionException]
> >> Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
> >>   at
> >> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> >>   at
> >> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> >>   at
> >> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> >>   at
> >> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >>   at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
> >>   at 
> >> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
> >>   at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
> >>   at
> >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
> >>   at
> >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
> >>   at
> >> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
> >>   at
> >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> >>   at
> >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> >>   at
> >> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> >>   at
> >> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> >>   at
> >> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> >>   at
> >> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> >>   at
> >> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >>   at
> >> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> >>   at
> >> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> >>   at
> >> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> >>   at org.mortbay.jetty.Server.handle(Server.java:326)
> >>   at
> >> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> >>   at
> >> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
> >>   at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
> >>   at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
> >>   at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >>   at
> >> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
> >>   at
> >> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> >> Caused by: java.io.CharConversionException: Invalid UTF-8 character
> >> 0xfffe at char #6886708, byte #11578429)
> >> ...
> >>
> >> It seems to be an encoding exception. Is there a way to avoid this?
> >>
> >> Regards
> >> Stefan
> >>
> >> -- 
> >> Stefan Scheffler
> >> Avantgarde Labs GmbH
> >> Löbauer Straße 19, 01099 Dresden
> >> Telefon: + 49 (0) 351 21590834
> >> Email: sscheff...@avantgarde-labs.de
> >>
> >>
> 
> 
> -- 
> Stefan Scheffler
> Avantgarde Labs GmbH
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: sscheff...@avantgarde-labs.de
> 
> 


RE: Indexing Exception

2012-09-24 Thread Markus Jelsma
Hi - What version?

 
 
-Original message-
> From:Stefan Scheffler 
> Sent: Mon 24-Sep-2012 10:29
> To: user@nutch.apache.org
> Subject: Indexing Exception
> 
> Hello,
> I have a strange Problem. While indexing a crawl to solr i got the 
> following exception
> 
> java.lang.RuntimeException: [was class java.io.CharConversionException] 
> Invalid UTF-8 character 0xfffe at char #6886708, byte #11578429)
>  at 
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>  at 
> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
>  at 
> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
>  at 
> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
>  at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
>  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
>  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>  at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
>  at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
>  at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>  at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>  at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>  at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>  at 
> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>  at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>  at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>  at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>  at org.mortbay.jetty.Server.handle(Server.java:326)
>  at 
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>  at 
> org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
>  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
>  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
>  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>  at 
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
>  at 
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> Caused by: java.io.CharConversionException: Invalid UTF-8 character 
> 0xfffe at char #6886708, byte #11578429)
> ...
> 
> It seems to be an encoding exception. Is there a way to avoid this?
> 
> Regards
> Stefan
> 
> -- 
> Stefan Scheffler
> Avantgarde Labs GmbH
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: sscheff...@avantgarde-labs.de
> 
> 


RE: problem with big crawl process

2012-09-20 Thread Markus Jelsma
Hi

Petición incorrecta ("bad request") is an HTTP 400 Bad Request? Check your Solr log;
there must be something there.

Cheers

 
 
-Original message-
> From:Eyeris Rodriguez Rueda 
> Sent: Thu 20-Sep-2012 16:23
> To: user@nutch.apache.org
> Subject: problem with big crawl process
> 
> Hi all.
> I have a problem when i try to do a big crawl process, specifically when the 
> topN paremeter is bigger than 1000.
> Im using nutch 1.4 and solr 3.4 in a pc with this features
> Intel CoreI3,Ram 2GB, HD 160 GB.
> 
> the problem is an exception(java.io.IOException: Job failed!
> ) in the moment to add documents in solr index, but I dont know how to fix 
> this, I have reduced the solr.commit.size from 1000 to 250, but the problems 
> still happening, please any idea or recomendation or way to solve will be 
> appreciated
> 
> 
> this is a part of my hadoop.log file
> 
> 2012-09-20 09:46:57,148 INFO  solr.SolrMappingReader - source: language dest: 
> language
> 2012-09-20 09:46:57,148 INFO  solr.SolrMappingReader - source: url dest: url
> 2012-09-20 09:46:57,704 INFO  solr.SolrWriter - Adding 250 documents
> 2012-09-20 09:46:59,974 INFO  solr.SolrWriter - Adding 250 documents
> 2012-09-20 09:47:01,578 INFO  solr.SolrWriter - Adding 250 documents
> 2012-09-20 09:47:02,137 INFO  solr.SolrWriter - Adding 250 documents
> 2012-09-20 09:47:02,816 INFO  solr.SolrWriter - Adding 250 documents
> 2012-09-20 09:47:03,272 WARN  mapred.LocalJobRunner - job_local_0030
> org.apache.solr.common.SolrException: Petición incorrecta
> 
> Petición incorrecta
> 
> request: http://localhost:8080/solr/update?wt=javabin&version=2
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>   at 
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>   at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
>   at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:81)
>   at 
> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
>   at 
> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
>   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
>   at 
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:166)
>   at 
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:51)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
> 2012-09-20 09:47:03,447 ERROR solr.SolrIndexer - java.io.IOException: Job 
> failed!
> 2012-09-20 09:47:03,448 INFO  solr.SolrDeleteDuplicates - 
> SolrDeleteDuplicates: starting at 2012-09-20 09:47:03
> 2012-09-20 09:47:03,448 INFO  solr.SolrDeleteDuplicates - 
> SolrDeleteDuplicates: Solr url: http://localhost:8080/solr
> 2012-09-20 09:47:05,039 INFO  solr.SolrDeleteDuplicates - 
> SolrDeleteDuplicates: finished at 2012-09-20 09:47:05, elapsed: 00:00:01
> 2012-09-20 09:47:05,040 INFO  crawl.Crawl - crawl finished: crawl
> 
> 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS 
> INFORMATICAS...
> CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION
> 
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
> 


RE: Recrawling and segment cleanup

2012-09-19 Thread Markus Jelsma


 
 
-Original message-
> From:Alexandre 
> Sent: Wed 19-Sep-2012 13:18
> To: user@nutch.apache.org
> Subject: Recrawling and segment cleanup
> 
> Hi,
> 
> we currently encounter a little problem with the segment folders created
> during crawling.
> 
> Our situation is like follows:
> We try to set up a Nutch crawler who is crawling / recrwaling on a regular
> basis with a fixed depth. How to establish this is already clear for us and
> working as intended.
> (http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-td4008320.html)
> 
> Our general solution looks (from the process point of view) like this:
> 
>   1. Inject
>   Loop Recrawl {
>   Loop (depth) {
> 2. Generate
> 3. Fetch
> 4. Parse
> 5. UpdateDB
>   }
> 6. InvertLinks
> 7. SOLRIndex
> 8. SOLRDeup
>   }
> 
> The problem we now got, is that there is a new segment (folder) created for
> each crawl / recrawl and each depth loop (which is in fact nothing else then
> a normal crawl).
> 
> Our main question now is, 
>1) when can we delete / eventually merge these segment folders and

You can merge them whenever you want. We merge all segments daily and monthly 
because we may have to reindex occasionally.

>2) what are they used for in the future.

They are only used for reindexing or for rebuilding data structures such as the 
crawldb, webgraph or linkdb.

> 
> For now we automatically delete all segement folders after each complete
> crawl (after each step 8.SOLRDeup) and it seems to work fine for us. Does
> this even make sense?

Sure. If you don't need them.

> 
> I think we have to admit that we are not entirely aware of what kind of
> information is contained within the crawl DB and the segment folder.

All databases contain key/value pairs. The CrawlDB contains the state of every 
URL and the segments contain structures such as the generated fetch list, info 
on the fetched records, parse data (outlinks and such) and parsed text. All this 
information is key/value based.
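
As a quick illustration, here is a rough sketch (plain Hadoop API, local runtime, 
default path names assumed) that reads the <url, CrawlDatum> pairs straight out 
of the CrawlDb:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class DumpCrawlDb {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // data file of the first partition; a real job would iterate all part-* dirs
    Path data = new Path("crawl/crawldb/current/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      System.out.println(url + "\t" + CrawlDatum.getStatusName(datum.getStatus()));
    }
    reader.close();
  }
}

(bin/nutch readdb gives you the same data with more options.)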

> 
> Thanks a lot for your help in advance and kind regards,
> Alex
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Recrawling-and-segment-cleanup-tp4008865.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: Relative urls - outlinks

2012-09-18 Thread Markus Jelsma
No, relative URLs are resolved in both parser plugins. You can try to disable 
that manually. There's no way to remove them from the CrawlDB except some clever 
filtering; they're absolute now.

 
 
-Original message-
> From:webdev1977 
> Sent: Tue 18-Sep-2012 15:24
> To: user@nutch.apache.org
> Subject: Relative urls - outlinks
> 
> Is there anyway to keep nutch from generating outlinks for any RELATIVE urls? 
> I basically don't want to use ANY relative urls that I find.. 
> 
> Then the next question is how do I get them out of my crawldb :-)
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Relative-urls-outlinks-tp4008601.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: nutch dedup on content of the html

2012-09-13 Thread Markus Jelsma
It depends on the implementation you use, configured in your nutch-site.xml:

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/MD5Signature.java?view=markup

http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/TextProfileSignature.java?view=markup
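
If neither fits, a custom Signature is small to write. A rough sketch (the class 
name is made up) that hashes only the parsed text, so pages with identical text 
but different URLs get the same signature; it would be selected with the 
db.signature.class property in nutch-site.xml:

import org.apache.hadoop.io.MD5Hash;
import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

public class TextOnlySignature extends Signature {
  @Override
  public byte[] calculate(Content content, Parse parse) {
    // hash only the extracted text, ignoring markup and headers
    String text = (parse == null || parse.getText() == null) ? "" : parse.getText();
    return MD5Hash.digest(text).getDigest();
  }
}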
 
-Original message-
> From:kiran chitturi 
> Sent: Thu 13-Sep-2012 23:48
> To: user@nutch.apache.org
> Subject: nutch dedup on content of the html
> 
> Hi,
> 
> When crawling and indexing the documents I have seen that Nutch is creating a
> signature and running dedup on Solr, which it shows as digest.
> 
> Can anyone point how the signature is computed, is it based on the entire
> text in the file ?
> 
> Can i create signature based on only one field like 'content' so that solr
> can dedup files with same content but different urls ?
> 
> Many Thanks for your help,
> -- 
> Kiran Chitturi
> 


RE: Help needed on Large scale single domain crawling ( Multiple country / Multilanguage / user type ) CGI urls

2012-09-12 Thread Markus Jelsma
Hi Martin,
 
-Original message-
> From:Martin Louis 
> Sent: Wed 12-Sep-2012 11:46
> To: Markus Jelsma 
> Cc: user@nutch.apache.org
> Subject: Re: Help needed on Large scale single domain crawling ( Multiple 
> country / Multilanguage / user type ) CGI urls
> 
> Thanks Markus for your answers, I will try them and post back, but one 
> question remains in my mind: 
> 
> I can hack the HTTP connection for POST authentication, but I have multiple login 
> credentials (user types) for the website. What will be the approach to 
> re-run Nutch crawling based on different login credentials, as I also want 
> to search based on user types? So the info has to be captured in a Nutch field 
> somehow. Any suggestions?

This is tricky. Perhaps running separate crawls will do the trick, but make sure 
the URLs are not identical, otherwise your index will contain overwritten 
items. If the URLs are unique you can have one crawl and use a marker in the URL 
to decide how to log in.

> 
> > Is there a way i can capture cookie information into nutch as a field  ?

Cookies are saved in the Content metadata in the segment. You can use Nutch's 
parsechecker tool to see exactly what is saved; the content metadata should 
contain the cookie.
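
For example, a fragment of a custom HtmlParseFilter could pick it up like this 
(the "cookie" parse metadata key is made up, untested sketch):

// content is the org.apache.nutch.protocol.Content passed to filter(...)
String cookie = content.getMetadata().get("Set-Cookie");
if (cookie != null) {
  // copy it into the parse metadata so later steps (e.g. indexing) can see it
  parseResult.get(content.getUrl()).getData().getParseMeta().add("cookie", cookie);
}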

> 
> > Any recommendations for the CGI issue ? Any part of code that can be hacked 
> > to append http params to the URL that nutch stores ; so that stored URLS 
> > will be different.

I think it's best for your application to generate distinct URL's. Otherwise it 
may be too difficult and you may run into unexpected problems.

 OR 
> Can i set up multiple nutch instances for each country i support ?

Yes, but again, if the URL's are not unique, the indexed URL's will be 
overwritten.

> OR 
> Does nutch allows some kind of grouping ? ( like "Collections" and "Front 
> ends" in GSA ) 

Are you talking about queries? Solr can do some kind of grouping.

> 
> 
> Thanks 
> Martin 
> 
> On Tue, Sep 11, 2012 at 6:30 PM, Markus Jelsma  <mailto:markus.jel...@openindex.io> > wrote:
> Hello Martin,
> 
> -Original message-
> > From:Martin Louis mailto:mail.lo...@gmail.com> >
> > Sent: Mon 10-Sep-2012 16:41
> > To: user@nutch.apache.org <mailto:user@nutch.apache.org> 
> > Subject: Help needed on Large scale single domain crawling ( Multiple 
> > country / Multilanguage / user type ) CGI urls
> >
> > Hi Guys
> >
> > I am a JAVA engineer, trying to set up an environment with all the features
> > of GSA and more to address large website needs;
> >
> > My website
> > > Works mostly on CGI commands to redirect to pages
> >  (like, ?cmd=_services-page )
> > > Multiple counties ( means; they are different for different counties as
> > products we sell to different countries are different ); reachable by sub
> > URL: *mydomain.com/ <http://mydomain.com/> / *.
> > > Supports multiple local languages for each country.
> > > Each country can have users having multiple types of accounts ( we
> > support 2-3 types of users in each country based on the service level like
> > "free user" / "premium user" ) and the content for them will vary.
> >
> > *What will be the best approach to crawl this website for a good "site wide
> > search" experience both for logged -in and out users with relevant content.*
> >
> > Below are the questions with me
> >
> > 1. If i keep my *seed to be "mydomain.com <http://mydomain.com> "*  and 
> > initiate a crawl on entire
> > site
> >   >Q. How can i capture "//"   as a field in NUTCH ) during
> > crawl ?
> 
> Depends on where the country-code is located, is it a HTTP element? If so, 
> you must create a custom HTML parse filter and look for it in the DOM. Is is 
> part of the URL? Then you can still do it with an HTML parse filter or 
> indexing filter as they both have access to the URL and you can look it up.
> 
> >   >Q. How can i crawl language specific pages and index it
> >             -  Same CGI command ( like : ?cmd=_login-run ) is used for all
> > languages in a country
> >             -  Language flip done by setting a cookie in the website
> 
> This is not going to work. The URL must be unique, see below.
> 
> >
> > 3. My website support different types of accounts and the content can be
> > different for each type of account for same CGI ?cmd
> >      > Q. How to group based on account types used to crawl.
> 
> Very tricky. You must make sure the URLs are not identical. Different 
> content for the same URL will not work in Nutch because the URL is the key in 
> all of Nutch's databases.

RE: Parallelize Fetching Phase

2012-09-12 Thread Markus Jelsma
Please share the relevant part of the log.
 
-Original message-
> From:Matteo Simoncini 
> Sent: Wed 12-Sep-2012 11:23
> To: user@nutch.apache.org
> Subject: Re: Parallelize Fetching Phase
> 
> I've got another issue.
> 
> I'm crawling a single domain and I have fetcher.queue.mode set to "byHost".
> I'm using 20 threads so I set fetcher.threads.per.queue to 20. But I get a
> NullPointerException.
> 
> But setting it to 10 threads, it works fine.
> 
> Can someone explain to me why Nutch has this behavior?
> 
> Sorry for bothering you and thanks very much for your help.
> 
> Matteo
> 
> 2012/9/11 Markus Jelsma 
> 
> > Hi
> >
> > -Original message-
> > > From:Matteo Simoncini 
> > > Sent: Tue 11-Sep-2012 14:41
> > > To: user@nutch.apache.org
> > > Subject: Parallelize Fetching Phase
> > >
> > > Hi everyone,
> > >
> > > I'm running nutch 1.5.1 using a script I created, but there is
> > > a significant slowdown in the fetching phase.
> > > My script uses 20 thread to fetch. Here is the fetch istruction:
> > >
> > > bin/nutch fetch $segment -threads 20
> > >
> > > It works, but it seems they are all fetching the same URL. Here is the
> > log:
> >
> > Not same URL but same host or domain. The fetcher uses either host, domain
> > or IP queues. If you have only one domain or host them setting
> > fetcher.queue.mode is useless. Instead, you would have to increase
> > fetcher.threads.per.queue.
> >
> >
> > >
> > > fetching
> > >
> > http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> > > fetching
> > >
> > http://www.eclap.eu/drupal/?q=en-US/node/2867/og/forum&sort=asc&order=Created
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> > > fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996
> > > -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > > -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> > > ...
> > >
> > > Is there a way to make each thread crawl a different URL?
> > >
> >
> 


RE: Help needed on Large scale single domain crawling ( Multiple country / Multilanguage / user type ) CGI urls

2012-09-11 Thread Markus Jelsma
Hello Martin, 
 
-Original message-
> From:Martin Louis 
> Sent: Mon 10-Sep-2012 16:41
> To: user@nutch.apache.org
> Subject: Help needed on Large scale single domain crawling ( Multiple country 
> / Multilanguage / user type ) CGI urls
> 
> Hi Guys
> 
> I am a JAVA engineer, trying to set up an environment with all the features
> of GSA and more to address large website needs;
> 
> My website
> > Works mostly on CGI commands to redirect to pages
>  (like, ?cmd=_services-page )
> > Multiple counties ( means; they are different for different counties as
> products we sell to different countries are different ); reachable by sub
> URL: *mydomain.com// *.
> > Supports multiple local languages for each country.
> > Each country can have users having multiple types of accounts ( we
> support 2-3 types of users in each country based on the service level like
> "free user" / "premium user" ) and the content for them will vary.
> 
> *What will be the best approach to crawl this website for a good "site wide
> search" experience both for logged -in and out users with relevant content.*
> 
> Below are the questions with me
> 
> 1. If i keep my *seed to be "mydomain.com"*  and initiate a crawl on entire
> site
>   >Q. How can i capture "//"   as a field in NUTCH ) during
> crawl ?

It depends on where the country code is located. Is it an HTML element? If so, you 
must create a custom HTML parse filter and look for it in the DOM. Is it part 
of the URL? Then you can still do it with an HTML parse filter or an indexing 
filter, as they both have access to the URL and you can look it up.
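
As a rough illustration, an indexing filter could look roughly like this (the 
class name, the "country" field name and the /<cc>/ URL layout are all 
assumptions; plugin descriptor and any extra boilerplate omitted):

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class CountryIndexingFilter implements IndexingFilter {
  // assumes URLs like http://mydomain.com/<cc>/...
  private static final Pattern CC = Pattern.compile("^https?://[^/]+/([a-z]{2})/");
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    Matcher m = CC.matcher(url.toString());
    if (m.find()) {
      doc.add("country", m.group(1)); // map this field in schema.xml / solrindex-mapping.xml
    }
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}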

>   >Q. How can i crawl language specific pages and index it
> -  Same CGI command ( like : ?cmd=_login-run ) is used for all
> languages in a country
> -  Language flip done by setting a cookie in the website

This is not going to work. The URL must be unique, see below.

> 
> 3. My website support different types of accounts and the content can be
> different for each type of account for same CGI ?cmd
>  > Q. How to group based on account types used to crawl.

Very tricky. You must make sure the URLs are not identical. Different content 
for the same URL will not work in Nutch because the URL is the key in all of 
Nutch's databases. You can get different content for the same URL by sending 
different HTTP headers, but in Nutch's database you will just overwrite the 
`other content` for the URL.

> 
> 4. How can i do a post ( form authentication ), i know i can hack HTTP
> connection, but above grouping of crawl based on authentication is blocking
> me.

Indeed, hack into the HTTP protocol plugin you're using. Nutch cannot do this 
by default.

> 
> 
> 
> Thanks in advance, for any of your valuable suggestion to my problem
> 
> -- 
> - Martin
> 


RE: Parallelize Fetching Phase

2012-09-11 Thread Markus Jelsma
Hi
 
-Original message-
> From:Matteo Simoncini 
> Sent: Tue 11-Sep-2012 14:41
> To: user@nutch.apache.org
> Subject: Parallelize Fetching Phase
> 
> Hi everyone,
> 
> I'm running nutch 1.5.1 using a script I created, but there is
> a significant slowdown in the fetching phase.
> My script uses 20 threads to fetch. Here is the fetch instruction:
> 
> bin/nutch fetch $segment -threads 20
> 
> It works, but it seems they are all fetching the same URL. Here is the log:

Not the same URL but the same host or domain. The fetcher uses either host, domain or 
IP queues. If you have only one domain or host, then setting fetcher.queue.mode 
is useless. Instead, you would have to increase fetcher.threads.per.queue.


> 
> fetching
> http://www.eclap.eu/drupal/?q=en-US/node/65229/og/forum&sort=asc&order=Topic
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=414
> fetching
> http://www.eclap.eu/drupal/?q=en-US/node/2867/og/forum&sort=asc&order=Created
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=413
> fetching http://www.eclap.eu/drupal/?q=zh-hans/node/103996
> -activeThreads=20, spinWaiting=19, fetchQueues.totalSize=412
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> -activeThreads=20, spinWaiting=20, fetchQueues.totalSize=412
> ...
> 
> Is there a way to make each thread crawl a different URL?
> 


RE: Escaping URL during redirection

2012-09-08 Thread Markus Jelsma
You mean the redirects followed by the fetcher (if enabled) are not passed 
through the filters and normalizers? You can open an issue for that and if 
possible provide a patch for trunk. An example of the fetcher following 
filtered and normalized outlinks can be found in the fetcher around line 1036.
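
Roughly, such a patch would push the redirect target through the same machinery 
before following it, something along these lines (a sketch, not the actual 
Fetcher code; uses org.apache.nutch.net.URLNormalizers and URLFilters):

private String filterNormalizeRedirect(String redirUrl,
    URLNormalizers normalizers, URLFilters filters) {
  try {
    String url = normalizers.normalize(redirUrl, URLNormalizers.SCOPE_FETCHER);
    return url == null ? null : filters.filter(url); // null means: drop the redirect
  } catch (Exception e) {
    return null;
  }
}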
 
 
-Original message-
> From:remi tassing 
> Sent: Sat 08-Sep-2012 19:34
> To: user@nutch.apache.org
> Subject: Escaping URL during redirection
> 
> Hi guys,
> 
> I'm not quite sure how to make Nutch follow the normalizer regular
> expressions during redirection. I see some URLs are not properly escaped.
> 
> Any help?
> 
> Remi
> 


RE: Query SolrIndex for Id

2012-09-08 Thread Markus Jelsma
Special characters must be escaped:
http://lucene.apache.org/core/3_6_1/queryparsersyntax.html#Escaping%20Special%20Characters
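
If you query from Java with SolrJ you can let ClientUtils do the escaping. A 
small sketch ("id" is the uniqueKey in the default Nutch schema; the Solr URL is 
just an example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;

public class FindByUrl {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String url = "http://nutch.apache.org/";
    // escapes :, / and the other Lucene query syntax characters
    SolrQuery q = new SolrQuery("id:" + ClientUtils.escapeQueryChars(url));
    System.out.println(solr.query(q).getResults());
  }
}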

 
 
-Original message-
> From:Alaak 
> Sent: Sat 08-Sep-2012 15:26
> To: user@nutch.apache.org
> Subject: Query SolrIndex for Id
> 
> Hi,
> 
> I have a problem with the way Nutch (1.6 trunk) stores the pages id to 
> the Solr (3.6.1) index. It uses the pages URL. However due to special 
> characters in the URL like ":" or "/" it is impossible to use the URL as 
> query parameter to access documents on Solr. Of course it is possible to 
> apply URL encoding, but it seems Solr does not invert the encoding on 
> Server side and thus is unable to match the queried URL to the URL in 
> the index. So does anyone know of a way to retrieve documents from a 
> Nutch created Solr Index by id?
> 
> Thanks and Regards
> 


RE: Keeping an externally created field in solr.

2012-09-08 Thread Markus Jelsma

Don't delete the crawl db, that's pointless. You can either delete the whole 
segment or remove all but crawl_generate and try again. You should delete the 
segment if you've successfully crawled another segment after that segment 
because it'll contain the same URLs.

 
 
-Original message-
> From:Alaak 
> Sent: Sat 08-Sep-2012 10:43
> To: user@nutch.apache.org
> Cc: Markus Jelsma 
> Subject: Re: Keeping an externally created field in solr.
> 
> Hi,
> 
> Ok. Thanks. Then I guess I will follow your last proposal and read the 
> value from the Solr Index if the URL is already there.
> 
> Am Sa 08 Sep 2012 00:11:41 CEST schrieb Markus Jelsma:
> >
> > No, but you could modify the indexer to do so. Or make use of Solr's 
> > new capability of updating specific fields. You could also modifiy 
> > that indexer plugin to fetch the value for that field from some source 
> > you have prior to indexing. I think the latter is the easiest to make 
> > but it only works for fields specifically set by Nutch.
> >
> > -Original message-
> >>
> >> From:Alaak 
> >> Sent: Sat 08-Sep-2012 00:08
> >> To: user@nutch.apache.org
> >> Subject: Keeping an externally created field in solr.
> >>
> >> Hi,
> >>
> >> I have an external program which changes a field for some websites
> >> within my Solr index. Nutch sets this field to a default value using a
> >> plugin on indexing a page. My problem now is that nutch resets the field
> >> for already indexed pages as well, when it updates those pages. Do I
> >> have any possibility to tell Nutch it should not touch that field if it
> >> already exists within the Solr Index?
> >>
> >> Thanks and Regards
> 


RE: Keeping an externally created field in solr.

2012-09-07 Thread Markus Jelsma
No, but you could modify the indexer to do so. Or make use of Solr's new 
capability of updating specific fields. You could also modify that indexer 
plugin to fetch the value for that field from some source you have prior to 
indexing. I think the latter is the easiest to build, but it only works for 
fields specifically set by Nutch.
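
A rough sketch of that last option (everything here is illustrative: the field 
name "myfield" and the helper method; solr is a SolrJ CommonsHttpSolrServer 
instance created in the plugin's setConf):

private String existingValue(CommonsHttpSolrServer solr, String url) throws Exception {
  SolrQuery q = new SolrQuery("id:" + ClientUtils.escapeQueryChars(url));
  q.setFields("myfield");
  SolrDocumentList hits = solr.query(q).getResults();
  if (hits.getNumFound() > 0 && hits.get(0).getFieldValue("myfield") != null) {
    return hits.get(0).getFieldValue("myfield").toString(); // keep the external value
  }
  return null; // not indexed yet, let the plugin set its default
}

Inside the indexing filter you would call this and only write the default when 
it returns null. One Solr query per document is slow, so for anything but small 
crawls you would batch or cache these lookups.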
 
-Original message-
> From:Alaak 
> Sent: Sat 08-Sep-2012 00:08
> To: user@nutch.apache.org
> Subject: Keeping an externally created field in solr.
> 
> Hi,
> 
> I have an external program which changes a field for some websites 
> within my Solr index. Nutch sets this field to a default value using a 
> plugin on indexing a page. My problem now is that nutch resets the field 
> for already indexed pages as well, when it updates those pages. Do I 
> have any possibility to tell Nutch it should not touch that field if it 
> already exists within the Solr Index?
> 
> Thanks and Regards
> 


RE: Errors when indexing to Solr

2012-09-07 Thread Markus Jelsma
-Original message-
> From:Fournier, Danny G 
> Sent: Fri 07-Sep-2012 14:46
> To: user@nutch.apache.org
> Subject: RE: Errors when indexing to Solr
> 
> I've tried crawling with nutch-1.6-SNAPSHOT.jar and got the following
> error:
> 
> [root@w7sp1-x64 nutch]# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> solrUrl is not set, indexing will be skipped...
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> solrUrl=null
> topN = 5
> Injector: starting at 2012-09-07 08:41:06
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2012-09-07 08:41:21, elapsed: 00:00:14
> Generator: starting at 2012-09-07 08:41:21
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20120907084129
> Generator: finished at 2012-09-07 08:41:36, elapsed: 00:00:15
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2012-09-07 08:41:36
> Fetcher: segment: crawl/segments/20120907084129
> Using queue mode : byHost
> Fetcher: threads: 10
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Exception in thread "main" java.io.IOException: Job failed!
>   at
> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
>   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1332)
>   at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Please post the relevant log

> 
> I then tried to crawl with 1.5.1 (which was successful) and INDEX with
> 1.6-SNAPSHOT. I got this error:
> 
> [root@w7sp1-x64 nutch]# bin/nutch solrindex
> http://127.0.0.1:8080/solr/core2 crawl/crawldb -linkdb crawl/linkdb
> SolrIndexer: starting at 2012-09-07 09:05:21
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: false
> SolrIndexer: URL normalizing: false
> org.apache.solr.common.SolrException: Not Found
> 
> Not Found
> 
> request: http://127.0.0.1:8080/solr/core2/update

This is not a Nutch error; there simply is no Solr running at that URL (404), or it 
is badly configured.

> 
> > -Original Message-
> > From: Fournier, Danny G [mailto:danny.fourn...@dfo-mpo.gc.ca]
> > Sent: September 6, 2012 4:15 PM
> > To: user@nutch.apache.org
> > Subject: Errors when indexing to Solr
> > 
> > I'm getting two different errors while trying to index Nutch crawls to
> > Solr. I'm running with:
> > 
> > - CentOS 6.3 VM (Virtualbox) (in host Windows XP)
> > - Solr 3.6.1
> > - Nutch 1.5.1
> > 
> > It would seem that NUTCH-1251 comes rather close to solving my issue?
> > Which would mean that I would have to compile Nutch 1.6 to fix this?
> > 
> > Error #1 - When indexing directly to Solr
> > 
> > Command: bin/nutch crawl urls -solr http://localhost:8080/solr/core2
> > -depth 3 -topN 5
> > 
> > Error:  Exception in thread "main" java.io.IOException:
> > org.apache.solr.client.solrj.SolrServerException: Error executing
> query
> > 
> > SolrIndexer: starting at 2012-09-06 14:30:11
> > Indexing 8 documents
> > java.io.IOException: Job failed!
> > SolrDeleteDuplicates: starting at 2012-09-06 14:30:55
> > SolrDeleteDuplicates: Solr url: http://localhost:8080/solr/core2
> > Exception in thread "main" java.io.IOException:
> > org.apache.solr.client.solrj.SolrServerException: Error executing
> > query
> > at
> >
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSp
> > lits(SolrDeleteDuplicates.java:200)
> > at
> > org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
> > at
> > org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
> > at
> > org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:416)
> > at
> > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInform
> > atio
> > n.java:1121)
> > at
> >
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
> > at
> > org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
> > at
> > org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
> > at
> >
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDupli
> > cates.java:373)
> > at
> >
> org.apache.nutch.indexer.solr.SolrDelete

RE: Need some directions

2012-09-03 Thread Markus Jelsma
We ignore false positives for now. A common solution is to maintain a set 
of known false positives and check that set for membership before looking 
at the bloom filter.
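
A minimal sketch of that idea, here with Guava's BloomFilter (any bloom filter 
implementation would do; the sizing numbers are arbitrary):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;

public class SeenUrls {
  private final BloomFilter<CharSequence> seen =
      BloomFilter.create(Funnels.stringFunnel(Charset.forName("UTF-8")), 10000000, 0.001);
  private final Set<String> knownFalsePositives = new HashSet<String>();

  public boolean alreadyCrawled(String url) {
    // URLs verified to be false positives are whitelisted first
    if (knownFalsePositives.contains(url)) {
      return false;
    }
    return seen.mightContain(url);
  }

  public void markCrawled(String url) { seen.put(url); }
  public void markFalsePositive(String url) { knownFalsePositives.add(url); }
}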
 
-Original message-
> From:Vijith 
> Sent: Mon 03-Sep-2012 13:01
> To: Markus Jelsma 
> Subject: Re: Need some directions
> 
> I tried with bloom filters. It's working fine for my sample site. So how did 
> you handle false positives then?
> I am working on it as part of a training assignment. I thought this would be 
> a good starting point to learn the Nutch code base.
> 
> On Fri, Aug 31, 2012 at 7:20 PM, Markus Jelsma  <mailto:markus.jel...@openindex.io> > wrote:
> 
> -Original message-
> > From:Vijith mailto:vijithkv...@gmail.com> >
> > Sent: Fri 31-Aug-2012 15:44
> > To: d...@nutch.apache.org <mailto:d...@nutch.apache.org> 
> > Subject: Re: Need some directions
> >
> > I have tried running nutch with a sample site with two different urls 
> > redirecting to a common resource.
> > I could not find any clues, from hadoop.log, where the common resource is 
> > parsed multiple times.
> > Could some one please explain the exact scenario that creates this bug.
> 
> In the Jira comment you said it fetched page4 twice now.
> 
> >
> > And how does this bug relates to NUTCH-1184 ? 
> 
It relates to 1184 because if URLs in the same fetch list link to a common 
page, it can be followed as well.
> 
> We solved this issue by keeping a list of crawled URLs in an external bloom 
> filter.
> 
> >
> > On Thu, Aug 30, 2012 at 11:44 AM, Vijith  > <mailto:vijithkv...@gmail.com> <mailto:vijithkv...@gmail.com 
> > <mailto:vijithkv...@gmail.com> > > wrote:
> > Hi all, 
> >
> > I am new to dev... I am working on NUTCH-1150...
> > I would like to get some directions before I can start... Right now I am 
> > going through the Fetcher.java code...
> >
> > --
> > . . . . . thanks & regards
> >
> > Vijith V.
> >
> >
> >
> >
> >
> > --
> > . . . . . thanks & regards
> >
> > Vijith V.
> >
> >
> >
> 
> 
> 
> -- 
> . . . . . thanks & regards
> 
> Vijith V.
> 
> 
> 


RE: Crawl a whole domain with indicization

2012-08-29 Thread Markus Jelsma

There is nothing wrong with your script; how many URLs are generated depends on 
your data. The difference between your script and the crawl command (both 
are almost identical) could also be explained by the state of your CrawlDb.


 
 
-Original message-
> From:Matteo Simoncini 
> Sent: Thu 30-Aug-2012 00:16
> To: user@nutch.apache.org
> Subject: Crawl a whole domain with indicization
> 
> Hi,
> 
> I'm using Nutch version 1.5. My problem is to crawl every URL in a domain.
> I also want to index everything using Solr but, instead of doing that at
> the end of the process, since it is a very big domain, I would like to call
> the Solr indexing command every X URLs (for example let's say every
> 1 URL).
> 
> So far all I was able to do is this script:
> 
> #!/bin/bash
> # inject the initial seed into crawlDB
> bin/nutch inject test/crawldb urls
> 
> # initialization of the variables
> counter=1
> error=0
> 
> #while there is no error
> while [ $error -ne 1 ]
> do
> 
> # crawl 500 URL
> 
> echo [ Script ] Starting generating phase
> 
> bin/nutch generate test/crawldb test/segments -topN 1
> 
> 
> if [ $? -ne 0 ]
> 
> then
> 
> echo [ Script ] Stopping: No more URLs to fetch.
> 
> error=1
> 
> break
> 
> fi
> 
> segment=`ls -d test/segments/2* | tail -1`
> 
> 
> #fetching phase
> 
> echo [ Script ] Starting fetching phase
> 
> bin/nutch fetch $segment -threads 20
> 
> if [ $? -ne 0 ]
> 
> then
> 
> echo [ Script ] Fetch $segment failed. Deleting it.
> 
> rm -rf $segment
> 
> continue
> 
> fi
> 
> #parsing phase
> 
> echo [ Script ] Starting parsing phase
> 
> bin/nutch parse $segment
> 
> 
> #updateDB phase
> 
> echo [ Script ] Starting updateDB phase
> 
> bin/nutch updatedb test/crawldb $segment
> 
> 
> #indicizing with solr
> 
> bin/nutch invertlinks test/linkdb -dir test/segments
> 
> bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb
> test/linkdb test/segments/*
> 
> done
> 
> 
> but it seems to not work. In fact crawling using the command:
> 
> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 20
> 
> 
> and testing on the apache.org domain I get more URLs than using the script
> (command: 1676, script: 1658)
> Can anyone tell me what's wrong with my script? Is there a better way to
> solve my problem?
> 
> Thanks,
> 
> Matteo
> 


RE: Need to transfer Parse metadata obtained in HtmlParseFilter.filter() to the CrawlDb

2012-08-29 Thread Markus Jelsma
Hi

Check the db.parsemeta.to.crawldb parameter. It'll send the listed parse 
metadata keys to the CrawlDatum metadata.
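
For example, a fragment of a custom HtmlParseFilter (detectLanguage() is a 
placeholder for your own logic; "LangId" is the key you already use). With 
db.parsemeta.to.crawldb=LangId the value is copied into the CrawlDatum metadata 
during updatedb:

public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  String langId = detectLanguage(content); // your own detection logic
  if (langId != null) {
    parseResult.get(content.getUrl()).getData().getParseMeta().set("LangId", langId);
  }
  return parseResult;
}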

Cheers

 
 
-Original message-
> From:Safdar Kureishy 
> Sent: Wed 29-Aug-2012 21:26
> To: user@nutch.apache.org
> Subject: Need to transfer Parse metadata obtained in HtmlParseFilter.filter() 
> to the CrawlDb
> 
> Hi,
> 
> I've built a custom HtmlParseFilter and am doing custom language
> identification in the filter() API. Here, I am able to set the relevant
> lang id properties on a ParseResult object via getParseMeta().put("LangId",
> id). I am also able to retrieve these properties in my custom
> ScoringFilter, for use during distributeScoreToOutlinks(). However, what I
> also need is to persist this data as as metadata in the relevant CrawlDb
> record (i.e., in the CrawlDatum.getMetadata() data structure). My intent,
> from all this, is finally to be able to write custom Hadoop jobs to gather
> language distribution statistics directy from the Crawldb (without having
> to do any joins on the ParseText, Content, ParseData types). The only way I
> see this being possible, is if each URL's CrawlDatum also has the lang-id
> in its metadata.
> 
> This is turning out to be a challenge. I first tried transfering the parse
> properties in my custom ScoringFilter.distributeScoreToOutlinks() API,
> because that API offers access to the ParseResut as well as an "adjust"
> CrawlDatum parameter for updating the CrawlDb (according to the Javadocs).
> However, doing that is not updating the crawldb. Then, in the newsgroup
> archives, I stumbled upon a thread about the
> "db.max.outlinks.per.page"property being used by the ParseOutputFormat
> class to do exactly the same
> property transfer at a different stage of the crawl cycle, but that doesn't
> work either.
> 
> So, I'm writing to the newsgroup hoping someone could give me specific
> advice on which API I should override, or which configuration setting I
> should change, so as to transfer custom parse-time metadata to the CrawlDb.
> 
> Thanks in advance.
> 
> Cheers,
> Safdar
> 


RE: Content of size X was truncated to Y

2012-08-27 Thread Markus Jelsma
Please see the http.content.limit parameter; it defaults to 65536 bytes, and 
setting it to -1 disables truncation.
 
 
-Original message-
> From:Alaak 
> Sent: Mon 27-Aug-2012 16:39
> To: user@nutch.apache.org
> Subject: Content of size X was truncated to Y
> 
> Hi
> 
> I receive messages similar to "Content of size 109690 was truncated 
> to 64937".
> 
> I fear this results in incomplete content within my index. So my 
> question is, what causes this truncation and is there a possibility to 
> disable it?
> 
> Regards
> 


RE: bin/nutch

2012-08-27 Thread Markus Jelsma
Yes, you need to build it first (run 'ant runtime' in the source directory); the 
script will then be in runtime/(local|deploy)/bin/.

Cheers

 
 
-Original message-
> From:Tolga 
> Sent: Mon 27-Aug-2012 08:53
> To: user@nutch.apache.org
> Subject: bin/nutch
> 
> Hi,
> 
> I can find only src/bin/nutch in 2.0-src.zip. There's no bin/nutch. Does 
> that mean I have to compile it?
> 
> Regards,
> 


RE: Extracting non anchored URLs from page

2012-08-26 Thread Markus Jelsma
Nutch's parser falls back to Nutch's OutlinkExtractor if the underlying parser did 
not yield any outlinks.
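
If you do want those URLs, one option is a custom HTML parse filter that runs 
OutlinkExtractor over the raw markup (or over specific attribute values) and 
merges the result into the page's outlinks. A tiny sketch of just the 
extraction step:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.OutlinkExtractor;
import org.apache.nutch.util.NutchConfiguration;

public class ExtractFromAttributes {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    String snippet =
        "<option value=\"http://craweledsite.blogspot.com/2007/09/blog-post_10.html\"/>text 3";
    // OutlinkExtractor finds bare URLs in any string, regardless of markup
    for (Outlink o : OutlinkExtractor.getOutlinks(snippet, conf)) {
      System.out.println(o.getToUrl());
    }
  }
}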
 
-Original message-
> From:Ye T Thet 
> Sent: Sun 26-Aug-2012 18:09
> To: user@nutch.apache.org
> Subject: Extracting non anchored URLs from page
> 
> Hi Folks,
> 
> I am using nutch (1.2 and 1.5) to crawl some website.
> 
> The short question is: is there any way, or a plugin, to extract URLs
> which are not in anchor tags in a page?
> 
> The long question:
> 
> The crawler is not extracting some of the URLs from the page. After
> investigation I noticed that the URLs are technically not links, i.e. not
> inside anchor elements. The URLs are inside the value attributes of other
> HTML tags used by JavaScript.
> 
> Following is the snippet of the contents.
> 
> 
> 
> 
>  name="bloglinkselect">
> text 1
>  value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text
> 2
>  value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text
> 3
> 
> 
> 
> 
> 
> As mentioned above, the URLs are not in html anchor tags. but rather valid
> urls used by javascripts when the user clicks the items.  Thus resulting
> those address are not crawled. To make the matter worse, there is no site
> map or index page where such urls can be reached other than the above
> mentioned links.
> 
> Has anyone encounter such cases and have figure out the solution? Any tips
> or direction would be great.
> 
> Thanks,
> 
> Ye
> 


RE: running main() in plugins?

2012-08-26 Thread Markus Jelsma
See: https://issues.apache.org/jira/browse/NUTCH-961

 
 
-Original message-
> From:Shaya Potter 
> Sent: Sun 26-Aug-2012 17:59
> To: user@nutch.apache.org
> Subject: Re: running main() in plugins?
> 
> It could be the "magic" (i.e. analysis) that Nutch is doing in the 
> background gets rid of most of the cruft, I'm just playing around on my 
> own trying to see how I can get the best text to analyze, and in many 
> cases, there's a lot of cruft and I was wondering if Nutch did anything 
> to remove said cruft (headers, footers, sidebars)
> 
> What I'm doing now for my experiments is relatively heavyweight: 
> I'm applying the readability algorithm to web pages before I index them 
> into my Lucene database. Probably not the best idea for Nutch though.
> 
> With that said, if Nutch is doing more processing than a jsoup 
> Document.text() operation, the question is why?  (some might be obvious, 
> metadata, getting outbound links)
> 
> On 08/26/2012 08:55 AM, Lewis John Mcgibbney wrote:
> > Hi Shaya,
> >
> > Can you elaborate? The plugin has been around for a good while. If you
> > have suggestions to improve they are very welcome.
> >
> > Thanks
> >
> > On Sun, Aug 26, 2012 at 1:41 PM, Shaya Potter  wrote:
> >> ok, so it seems that Nutch isn't doing much different (at least from a
> >> smattering of tests I've done) than Jsoup's Document.text() ability (from
> >> what I can tell at least, perhaps only some issues with spacing between
> >> elements).
> >>
> >> On 08/26/2012 06:28 AM, Lewis John Mcgibbney wrote:
> >>>
> >>> You can easily run any plugin from the terminal using
> >>>
> >>> ./bin/nutch plugin
> >>>
> >>> in the case of the HtmlParser main() method you would want to do
> >>>
> >>> ./bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser
> >>> $pathToLocalFile
> >>>
> >>> You have actually identified an improvement which we could do with
> >>> having in the main() method for this class e.g.
> >>>
> >>> 1) When the arguments are not correctly specified it should print a
> >>> usage message to std out explaining the correct plugin usage as with
> >>> more or less every other plugin. Currently we just get a nasty stack
> >>> like the following
> >>>
> >>> Exception in thread "main" java.lang.reflect.InvocationTargetException
> >>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>  at
> >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>  at
> >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>  at java.lang.reflect.Method.invoke(Method.java:597)
> >>>  at
> >>> org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
> >>> Caused by: java.io.FileNotFoundException:
> >>> http:/www.trancearoundtheworld.com (No such file or directory)
> >>>  at java.io.FileInputStream.open(Native Method)
> >>>  at java.io.FileInputStream.(FileInputStream.java:120)
> >>>  at
> >>> org.apache.nutch.parse.html.HtmlParser.main(HtmlParser.java:274)
> >>>  ... 5 more
> >>>
> >>> 2) The plugin main method only enables you to parse local files an
> >>> improvement would be to add functionality similar to the parserchecker
> >>> as highlighted by Sourajit
> >>>
> >>> If you wish to add these functions then please open a Jira issue, the
> >>> contribution would be great.
> >>>
> >>> Thanks
> >>>
> >>> Lewis
> >>>
> >>> On Sun, Aug 26, 2012 at 4:18 AM, Shaya Potter  wrote:
> 
>  I'm trying to run the main function in HtmlParser (just to see test how
>  Nutch's parser works compared to others) and I can't see to figure out
>  how
>  to get it to run.
> 
> 
>  http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java?revision=1356339&view=markup
> 
>  when I run it naively, I get an error
> 
>  Exception in thread "main" java.lang.RuntimeException:
>  org.apache.nutch.parse.HtmlParseFilter not found.
>    at
>  org.apache.nutch.parse.HtmlParseFilters.(HtmlParseFilters.java:55)
> 
>  in looking at HtmlParseFilters, I see that it throws the runtime
>  exception
>  if it can't find any HtmlParseFilter classes, however, I can't seem to
>  figure out how to make it able to find them (I see the jar's in the
>  plugins
>  dir, but do they have to be registered?  could the main() in HtmlParser
>  ever
>  work as is?
> 
>  any pointers would be appreciated.
> 
>  thanks.
> 
> >>>
> >>>
> >>>
> >>
> >
> >
> >
> 


RE: recrawl a URL?

2012-08-24 Thread Markus Jelsma
No, the CrawlDatum's status field will be set to db_notmodified if the 
signatures match regardless of the HTTP headers. The header only sets a 
fetch_notmodified but it is not relevant for the db_* status.

 
 
-Original message-
> From:alx...@aim.com 
> Sent: Fri 24-Aug-2012 20:14
> To: user@nutch.apache.org; max.dzy...@comintelli.com
> Subject: Re: recrawl a URL?
> 
> This will work only for URLs that have If-Modified-Since headers. But most 
> URLs do not have this header.
> 
> Thanks.
> Alex. 
>  
> 
>  
> 
>  
> 
> -----Original Message-
> From: Max Dzyuba 
> To: Markus Jelsma ; user 
> Sent: Fri, Aug 24, 2012 9:02 am
> Subject: RE: recrawl a URL?
> 
> 
> Thanks again! I'll have to test it more then in my 1.5.1.
> 
> 
> 
> 
> Best regards,
> MaxMarkus Jelsma  wrote:Hmm, i had to look it up 
> but 
> it is supported in 1.5 and 1.5.1:
> 
> http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup
> 
> 
> -Original message-
> > From:Max Dzyuba 
> > Sent: Fri 24-Aug-2012 17:35
> > To: Markus Jelsma ; user@nutch.apache.org
> > Subject: RE: recrawl a URL?
> > 
> > Thank you for the reply. Does it mean that it is not supported in latest 
> stable release of Nutch?
> > 
> > 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> > Sent: den 24 augusti 2012 17:21
> > To: user@nutch.apache.org; Max Dzyuba
> > Subject: RE: recrawl a URL?
> > 
> > Hi,
> > 
> > Trunk has a feature for this: indexer.skip.notmodified
> > 
> > Cheers 
> >  
> > -Original message-
> > > From:Max Dzyuba 
> > > Sent: Fri 24-Aug-2012 17:19
> > > To: user@nutch.apache.org
> > > Subject: recrawl a URL?
> > > 
> > > Hello everyone,
> > > 
> > >  
> > > 
> > > I run a crawl command every day, but I don't want Nutch to submit an 
> > > update to Solr if a particular page hasn't changed. How do I achieve 
> > > that? Right now the value of db.fetch.interval.default doesn't seem to 
> > > help prevent the crawl since the updates are submitted to Solr as if 
> > > the page has been changed. I know for sure that the page has not been 
> > > changed. This happens for every new crawl command.
> > > 
> > >  
> > > 
> > >  
> > > 
> > > Thanks in advance,
> > > 
> > > Max
> > > 
> > > 
> > 
> > 
> 
> 
>  
> 


RE: LINK RANK & CRAWL DATUM SCORE

2012-08-24 Thread Markus Jelsma
Hi,

The CrawlDatum's score field is added to the document via the `boost` field; 
this is not a document boost. You'll have to boost on the field manually to see 
the LinkRank value in effect. You can do this with a function query or a boost 
query.
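
For example, with the (e)dismax parser (Solr 3.1+) a multiplicative boost on 
that field looks roughly like this in SolrJ (the qf fields are just an example):

// fragment: build the query with org.apache.solr.client.solrj.SolrQuery
SolrQuery q = new SolrQuery("your query terms");
q.set("defType", "edismax");
q.set("qf", "title^2 content");  // example query fields
q.set("boost", "boost");         // multiply relevance by the boost field Nutch filled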

Cheers,
Markus
 
 
-Original message-
> From:parnab kumar 
> Sent: Fri 24-Aug-2012 20:50
> To: user@nutch.apache.org
> Subject: LINK RANK & CRAWL DATUM SCORE
> 
> Hi All,
> 
>  I need to clarify a concept. After we run LinkRank, each URL gets a
> score based on its link rank/page rank value. This score is updated in the
> crawl db. Is this score used as a document boost when we run the
> indexer? This is because unless we index we do not see any effect of link
> ranking. If it is not used as a document boost, then where is this score
> used? Can anyone throw some light on this?
> 
> Thanks ,
> Parnab
> 


RE: recrawl a URL?

2012-08-24 Thread Markus Jelsma
Hmm, I had to look it up, but it is supported in 1.5 and 1.5.1:

http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup
 
 
-Original message-
> From:Max Dzyuba 
> Sent: Fri 24-Aug-2012 17:35
> To: Markus Jelsma ; user@nutch.apache.org
> Subject: RE: recrawl a URL?
> 
> Thank you for the reply. Does it mean that it is not supported in latest 
> stable release of Nutch?
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: den 24 augusti 2012 17:21
> To: user@nutch.apache.org; Max Dzyuba
> Subject: RE: recrawl a URL?
> 
> Hi,
> 
> Trunk has a feature for this: indexer.skip.notmodified
> 
> Cheers 
>  
> -Original message-
> > From:Max Dzyuba 
> > Sent: Fri 24-Aug-2012 17:19
> > To: user@nutch.apache.org
> > Subject: recrawl a URL?
> > 
> > Hello everyone,
> > 
> >  
> > 
> > I run a crawl command every day, but I don't want Nutch to submit an 
> > update to Solr if a particular page hasn't changed. How do I achieve 
> > that? Right now the value of db.fetch.interval.default doesn't seem to 
> > help prevent the crawl since the updates are submitted to Solr as if 
> > the page has been changed. I know for sure that the page has not been 
> > changed. This happens for every new crawl command.
> > 
> >  
> > 
> >  
> > 
> > Thanks in advance,
> > 
> > Max
> > 
> > 
> 
> 


RE: recrawl a URL?

2012-08-24 Thread Markus Jelsma
Hi,

Trunk has a feature for this: indexer.skip.notmodified

Cheers 
 
-Original message-
> From:Max Dzyuba 
> Sent: Fri 24-Aug-2012 17:19
> To: user@nutch.apache.org
> Subject: recrawl a URL?
> 
> Hello everyone,
> 
>  
> 
> I run a crawl command every day, but I don't want Nutch to submit an update
> to Solr if a particular page hasn't changed. How do I achieve that? Right
> now the value of db.fetch.interval.default doesn't seem to help prevent the
> crawl since the updates are submitted to Solr as if the page has been
> changed. I know for sure that the page has not been changed. This happens
> for every new crawl command.
> 
>  
> 
>  
> 
> Thanks in advance,
> 
> Max
> 
> 


RE: Two questions about Nutch

2012-08-22 Thread Markus Jelsma
Hi weishenyun,

See inline:

Markus
 
 
-Original message-
> From:weishenyun 
> Sent: Wed 22-Aug-2012 11:02
> To: d...@nutch.apache.org
> Subject: Two questions about Nutch
> 
> Hi everyone here:
>   I have two questions which confused me for weeks. If anyone here can
> help me, thanks so much!
>   The first one: I know that Nutch won't store the HTTP code at all.
> Instead, it encodes it as a single status byte. If Nutch fetches a bad link
> whose HTTP status is not 200 (e.g. 203, 307, 404, ...), or fetches a link which
> is robots-denied or throttled by the website because of frequent fetching, how
> can we distinguish between these conditions from that status byte (e.g.
> db_status_gone, db_redir_temp)?

Only in the fetcher can you distinguish between HTTP status codes and non-HTTP 
statuses such as being denied by robots or a problem with the robots crawl 
delay.

>   Second, I know a little about the ranking & scoring mechanism in Nutch. I
> know the LinkRank algorithm is the main algorithm. The LinkRank algorithm is
> just a single score factor in the index system of Nutch; what are the other
> factors affecting indexing and search in Nutch?

We also use LinkRank to aggregate a score per host and use that host 
score to select a master host when deduplicating hosts. The host among the 
duplicates with the highest score prevails and the others are removed.

> The webgraph has not yet been
> ported to the GORA-based API in Nutch 2.0. What is the result if we index
> and search in Nutch 2.0?

You would still have decent or good search results if you configured your 
weights properly. Keep in mind that LinkRank is not meant for scoring URLs 
within a domain or host but across domains, so it's more of an internet-scale 
scoring algorithm.

We don't use LinkRank for our site search services.

> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Two-questions-about-Nutch-tp4002589.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> 


RE: Happy 10th Birthday Nutch!

2012-08-21 Thread Markus Jelsma
Hehehe, nice! 

Cheers
 
-Original message-
> From:Jérôme Charron 
> Sent: Tue 21-Aug-2012 23:58
> To: d...@nutch.apache.org
> Cc: user@nutch.apache.org
> Subject: Re: Happy 10th Birthday Nutch!
> 
> Oups! Sorry...
> This one should be OK: http://statigr.am/p/254365383887354210_4414285
> ;)
> 
> 
> On Tue, Aug 21, 2012 at 11:40 PM, Markus Jelsma  <mailto:markus.jel...@openindex.io> > wrote:
> Hi Jérôme,
> 
> It asks for a login.
> 
> Cheers
> 
> 
> 
> -Original message-
> > From:Jérôme Charron  > <mailto:jerome.char...@gmail.com> >
> > Sent: Tue 21-Aug-2012 22:22
> > To: user@nutch.apache.org <mailto:user@nutch.apache.org> 
> > Cc: mailto:d...@nutch.apache.org> > 
> > mailto:d...@nutch.apache.org> >
> > Subject: Re: Happy 10th Birthday Nutch!
> >
> > My small contribution to Nutch birthday...
> > http://statigr.am/viewer.php#/detail/254365383887354210_4414285 
> >
> > Cheers,
> > Jérôme
> >
> > On Fri, Aug 10, 2012 at 1:44 AM, Mattmann, Chris A (388J) 
> > <chris.a.mattm...@jpl.nasa.gov> wrote:
> > Super cool. Proud to have been around since 2005 (7 of them!)
> >
> > :)
> >
> > Cheers,
> > Chris
> >
> > On Aug 9, 2012, at 1:31 PM, Lewis John Mcgibbney wrote:
> >
> > > Nice one Julien
> > >
> > > I'm going to update the site with this as its a pretty huge milestone
> > > @Apache and a lot of projects and current developers owe a lot to the
> > > great work done by all you guys over the years.
> > >
> > > Thank you for sharing.
> > >
> > > Lewis
> > >
> > > On Thu, Aug 9, 2012 at 8:56 AM, Julien Nioche
> > > <lists.digitalpeb...@gmail.com> wrote:
> > >> Doug Cutting on twitter :
> > >> https://twitter.com/cutting/status/233415059798372353
> > >>
> > >> *RT @StefanGroschupf: Happy 10th birthday#Nutch! Registered at 
> > >> sourceforce
> > >> august 2002. Turned out to be quite a game changer. #Hadoop
> > >> *
> > >> Happy birthday Nutch and thanks to all contributors past and present!
> > >>
> > >> Julien
> > >>
> > >> --
> > >>
> > >> Open Source Solutions for Text Engineering
> > >>
> > >> http://digitalpebble.blogspot.com/
> > >> http://www.digitalpebble.com
> > >> http://twitter.com/digitalpebble
> > >
> > >
> > >
> > > --
> > > Lewis
> >
> >
> > ++
> > Chris Mattmann, Ph.D.
> > Senior Computer Scientist
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 171-266B, Mailstop: 171-246
> > Email: chris.a.mattm...@nasa.gov
> > WWW:   http://sunset.usc.edu/~mattmann/
> > ++
> > Adjunct Assistant Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++
> >
> >
> >
> >
> > --
> > 
> > @jcharron <http://www.twitter.com/jcharron 
> > <http://www.twitter.com/jcharron> >
> > http://motre.ch/ <ht

RE: Happy 10th Birthday Nutch!

2012-08-21 Thread Markus Jelsma
Hi Jérôme,

It asks for a login.

Cheers

 
 
-Original message-
> From:Jérôme Charron 
> Sent: Tue 21-Aug-2012 22:22
> To: user@nutch.apache.org
> Cc:  
> Subject: Re: Happy 10th Birthday Nutch!
> 
> My small contribution to Nutch birthday...
> http://statigr.am/viewer.php#/detail/254365383887354210_4414285 
>  
> 
> Cheers,
> Jérôme
> 
> On Fri, Aug 10, 2012 at 1:44 AM, Mattmann, Chris A (388J) 
> mailto:chris.a.mattm...@jpl.nasa.gov> > wrote:
> Super cool. Proud to have been around since 2005 (7 of them!)
> 
> :)
> 
> Cheers,
> Chris
> 
> On Aug 9, 2012, at 1:31 PM, Lewis John Mcgibbney wrote:
> 
> > Nice one Julien
> >
> > I'm going to update the site with this as its a pretty huge milestone
> > @Apache and a lot of projects and current developers owe a lot to the
> > great work done by all you guys over the years.
> >
> > Thank you for sharing.
> >
> > Lewis
> >
> > On Thu, Aug 9, 2012 at 8:56 AM, Julien Nioche
> > mailto:lists.digitalpeb...@gmail.com> > 
> > wrote:
> >> Doug Cutting on twitter :
> >> https://twitter.com/cutting/status/233415059798372353
> >>
> >> *RT @StefanGroschupf: Happy 10th birthday#Nutch! Registered at sourceforce
> >> august 2002. Turned out to be quite a game changer. #Hadoop
> >> *
> >> Happy birthday Nutch and thanks to all contributors past and present!
> >>
> >> Julien
> >>
> >> --
> >>
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/  
> >> http://www.digitalpebble.com  
> >> http://twitter.com/digitalpebble  
> >
> >
> >
> > --
> > Lewis
> 
> 
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov  
> WWW:   http://sunset.usc.edu/~mattmann/  
> ++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
> 
> 
> 
> 
> -- 
> 
> @jcharron  
> http://motre.ch/  
> http://jcharron.posterous.com/  
> http://www.shopreflex.fr/  
> http://www.staragora.com/  
> 
>   
>  
> 
> Hi


RE: what's mean this values?

2012-08-20 Thread Markus Jelsma
Those counts are the sum of the fetched pages for that host. 210661 are fetched 
in total and 427773 are unfetched.
 
 
-Original message-
> From:Alexei Korolev 
> Sent: Mon 20-Aug-2012 13:38
> To: user@nutch.apache.org
> Subject: what's mean this values?
> 
> Hello,
> 
> I tried to google about it, but without luck. I run this command:
> 
> nutch domainstats crawl/crawldb/current temp host
> 
> and then have following output:
> 
> 469   ttt.in.ua
> 12 aaa.com.ua
> 210661  FETCHED
> 427773  NOT_FETCHED
> 4238 .ru
> 1  all4.com.ua
> 17844   amtist.ru
> 4092 aptrrr.ru
> 
> Anybody could explore for me what's mean this values? And why I have
> FETCHED and NOT FETCHED in the middle of this list?
> 
> Thanks.
> 
> -- 
> Alexei A. Korolev
> 


RE: Nutch 2.0 and Sitemap

2012-08-17 Thread Markus Jelsma
Hi,

There is no support for sitemaps in either 1.x or 2.x but you're more than 
welcome to provide any patches if you have them.

Cheers

 
 
-Original message-
> From:Prashant Dave 
> Sent: Fri 17-Aug-2012 17:23
> To: user@nutch.apache.org
> Subject: Nutch 2.0 and Sitemap
> 
> Is there sitemap support in nutch 2.0. For a given website, first get
> and parse the sitemap file and crawl only those urls that are in the
> sitemap? I read online that this functionality was in the roadmap but
> have not been able to find out if sitemap is supported.
> 
> Thanks
> 
> Prashant
> 


RE: recrawling

2012-08-17 Thread Markus Jelsma
hi,

Pages will be recrawled when they're eligible (last fetch time + interval). To 
force it you can use the -adddays switch on the generator tool. 
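
Roughly, the eligibility test and the effect of -adddays come down to something 
like this (a sketch of the idea only, not the actual Generator code):

  import org.apache.nutch.crawl.CrawlDatum;

  public class EligibilitySketch {
    // An entry is due when its next fetch time (last fetch time + interval)
    // has passed; -adddays simply shifts "now" forward so more entries become due.
    public static boolean isDue(CrawlDatum datum, int addDays) {
      long curTime = System.currentTimeMillis() + addDays * 24L * 60 * 60 * 1000;
      return datum.getFetchTime() <= curTime;
    }
  }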


 
 
-Original message-
> From:Stefan Scheffler 
> Sent: Fri 17-Aug-2012 11:54
> To: user@nutch.apache.org
> Subject: recrawling
> 
> Hello,
> How can i do a recrawling to an existing crawldb?
> 
> With friendly regards
> Stefan Scheffler
> 
> -- 
> Stefan Scheffler
> Avantgarde Labs GmbH
> Löbauer Straße 19, 01099 Dresden
> Telefon: + 49 (0) 351 21590834
> Email: sscheff...@avantgarde-labs.de
> 
> 


RE: Cached page (like google) with hits highlighted

2012-08-16 Thread Markus Jelsma
Tika has a PDF2XHTML.java in the PDF parser but I think the standard 
PDFParser.java is executed for the MIME-type. In ParseTika.java we ask 
TikaConfig for the parser of a given MIME-type. To quickly test if it works 
like that you can try to hack in TikaParser and load PDF2XHTML instead of 
getting the parser via TikaConfig.

You can also tell the CompositeParser, obtained in Tika via 
TikaConfig.getParser(), to map the PDF2XHTML parser to the PDF MIME-type 
through its setParsers(Map<MediaType, Parser> parsers) method. By reading the 
code I think that should work.
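
A rough sketch of that remapping idea is below. It assumes you wrap the 
PDF2XHTML logic in your own org.apache.tika.parser.Parser implementation (the 
pdfXhtmlParser parameter is just a placeholder for it):

  import java.util.Map;
  import org.apache.tika.config.TikaConfig;
  import org.apache.tika.mime.MediaType;
  import org.apache.tika.parser.CompositeParser;
  import org.apache.tika.parser.Parser;

  public class PdfParserRemap {
    public static Parser remap(TikaConfig config, Parser pdfXhtmlParser) {
      // TikaConfig.getParser() returns the configured composite parser
      CompositeParser composite = (CompositeParser) config.getParser();
      // getParsers() hands back a copy of the MIME-type -> parser map
      Map<MediaType, Parser> parsers = composite.getParsers();
      // point application/pdf at our own parser instead of the default one
      parsers.put(MediaType.application("pdf"), pdfXhtmlParser);
      composite.setParsers(parsers);
      return composite;
    }
  }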
 
 
-Original message-
> From:webdev1977 
> Sent: Thu 16-Aug-2012 12:51
> To: user@nutch.apache.org
> Subject: Re: Cached page (like google) with hits highlighted
> 
> Thanks Julien and Markus for all your help.
> 
> I poked around the code some more yesterday and it seems like the markup is
> just not getting in the DocumentFragment.  All I get (for word and pdf) is
> just one html tag with the text of the document in between.  Maybe something
> is not using parse-tika properly (somewhere in the nutch implementation of
> the parser?)
> 
> The same two documents give me tons of markup using the tika-app gui.  The
> versions are the same.  I am out of ideas, anyone, anyone? 
> 
> Thanks!
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001593.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: Cached page (like google) with hits highlighted

2012-08-15 Thread Markus Jelsma
No, it doesn't come with Nutch. You can download Tika 1.2 or build trunk from 
source.

Code looks fine. But you might want to check the headings plugin; it uses the 
NodeWalker to make things easier:
http://svn.apache.org/viewvc/nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java?revision=1349233&view=markup
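
The walking pattern used there boils down to something like this (a minimal 
sketch, not the actual plugin code):

  import org.apache.nutch.util.NodeWalker;
  import org.w3c.dom.DocumentFragment;
  import org.w3c.dom.Node;

  public class HeadingSketch {
    // collect the text of the first h1 element found in the parsed DOM
    public static String firstH1(DocumentFragment doc) {
      NodeWalker walker = new NodeWalker(doc);
      while (walker.hasNext()) {
        Node node = walker.nextNode();
        if (node.getNodeType() == Node.ELEMENT_NODE
            && "h1".equalsIgnoreCase(node.getNodeName())) {
          return node.getTextContent().trim();
        }
      }
      return null;
    }
  }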

 
 
-Original message-
> From:webdev1977 
> Sent: Wed 15-Aug-2012 19:00
> To: user@nutch.apache.org
> Subject: RE: Cached page (like google) with hits highlighted
> 
> Does the 1.4 version of nutch have tika-app?  Also..maybe I am not using the
> DocumentFragment object properly?  Below is a summary version of my code:
> 
> public ParseResult filter(Content content, ParseResult parseResult,
>HTMLMetaTags metaTags, DocumentFragment doc) {
> 
>for (int x = 0; x < doc.getChildNodes().getLength(); x++) {
>
>  System.out.println("xml node name" +
> doc.getChildNodes().item(x).getNodeName());
>  System.out.println("xml node value" +
> doc.getChildNodes().item(x).getNodeValue());
>  System.out.println("xml text content" +
> doc.getChildNodes().item(x).getTextContent());
> 
>   }
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001440.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: Cached page (like google) with hits highlighted

2012-08-15 Thread Markus Jelsma
Hmm, I would also expect PDF and office documents to have at least paragraph 
and heading tags in Tika's XHTML representation. You can test if it's true with 
java -jar tika-app -x . I think it was -x; use --help to see all options.
 
 
-Original message-
> From:webdev1977 
> Sent: Wed 15-Aug-2012 18:22
> To: user@nutch.apache.org
> Subject: RE: Cached page (like google) with hits highlighted
> 
> Thanks Markus!
> 
> So after some testing and walking the DocumentFragment, I see that all I get
> is one node:
> 
> some content here and here
> 
> 
> I guess I expected to see more from a PDF/word document (like H1 tags, etc)
> that would help make the xhtml format more readable.
> 
> Am I missing something? Do I have to do anything special to the
> DocumentFragment to format it?
> 
> Thanks!
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374p4001434.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: how to add raw HTML field to Solr

2012-08-15 Thread Markus Jelsma
The easiest non-Java approach would be using Nutch's SegmentReader tool to 
extract the HTML from your segments and store it somewhere you can access 
it easily.
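
If a little Java is acceptable after all, something along these lines should 
also work; this is a sketch that assumes a single-part segment on the local 
filesystem and skips error handling:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.util.ReflectionUtils;
  import org.apache.nutch.protocol.Content;

  public class DumpRawContent {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // raw fetched bytes live under <segment>/content/part-00000/data
      Path data = new Path(args[0], "content/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(url, value)) {
        Content content = (Content) value;
        // content.getContent() holds the raw HTML bytes you could store elsewhere
        System.out.println(url + "\t" + content.getContent().length + " bytes");
      }
      reader.close();
    }
  }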

 
 
-Original message-
> From:Max Dzyuba 
> Sent: Wed 15-Aug-2012 17:00
> To: user@nutch.apache.org
> Subject: how to add raw HTML field to Solr
> 
> Hello everyone,
> 
>  
> 
> I have Nutch installed and running just fine. Nutch submits the crawl
> results to Solr for indexing. I need to have a separate field in Solr
> document that would hold raw HTML. At the moment, the "content" field holds
> the parsed text from the page only.
> 
>  
> 
> From what I read, it's impossible to do what I need without writing your own
> plugin. I don't know Java that well. What would be the easiest way to
> approach this task?
> 
>  
> 
>  
> 
> Thank you in advance,
> 
> Max
> 
> 


RE: Cached page (like google) with hits highlighted

2012-08-15 Thread Markus Jelsma
Hi,

You can catch the XML in a Parse Filter by walking over the DocumentFragment 
that is passed. It should contain the proper mark up. 

Cheers,

 
 
-Original message-
> From:webdev1977 
> Sent: Wed 15-Aug-2012 14:09
> To: user@nutch.apache.org
> Subject: Cached page (like google) with hits highlighted
> 
> Hello Everyone!
> 
> I am up and running with my nutch 1.4 /solr 3.3  architecture and am looking
> to add a few new features.  
> 
> My users want the ability to view their solr results as xhtml with the hits
> highlighted in the document.  So a word document/pdf would become an XHTML
> version first.
> 
> I see that Tika can produce XHTML but I don't see a way to integrate that
> with the parsing that nutch does in the parse-tika plugin.  Seems like the
> results sent to solr for the "content" field are just the text of the
> document.  
> 
> Is there a way to do this?
> 
> Thanks!
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Cached-page-like-google-with-hits-highlighted-tp4001374.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: Does Nutch2.0 implement webgraph?

2012-08-15 Thread Markus Jelsma
Here's the issue but there is no work yet:
https://issues.apache.org/jira/browse/NUTCH-875
 
-Original message-
> From:weishenyun 
> Sent: Wed 15-Aug-2012 12:55
> To: user@nutch.apache.org
> Subject: Does Nutch2.0 implement webgraph?
> 
> I have used nutch2.0 for one month, but I can't find webgraph class in the
> source code. And also I can't use the webgraph function.My question is that:
> does Nutch2.0 implement webgraph? If not, when will that part be added.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Does-Nutch2-0-implement-webgraph-tp4001354.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: NUTCH-1443

2012-08-14 Thread Markus Jelsma
We don't have a currency field. Please use one of the schemas shipped by Nutch:
http://svn.apache.org/viewvc/nutch/trunk/conf/
 
 
-Original message-
> From:Sourajit Basak 
> Sent: Tue 14-Aug-2012 07:40
> To: user@nutch.apache.org
> Subject: Re: NUTCH-1443
> 
> was using ...
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/collection1/conf/schema.xml
> 
> probably this was for sol 4.0 alpha ... replacing the one in trunk took
> care of stemmer NCDE
> 
> Solr starts up, but with this one ...
> 
> SEVERE: org.apache.solr.common.SolrException: Unknown fieldtype 'currency'
> specified on field *_c
> 
> 
> On Mon, Aug 13, 2012 at 7:09 PM, Markus Jelsma
> wrote:
> 
> > All that was changed in rev 1367786 was the schema version which was
> > incorrect. Which schema did you replace with the updated one?
> >
> > Anyway, you probably haven't included non-standard stemmers from contrib
> > in your Solr lib dir. All snowball variants work out of the box.
> >
> > -Original message-
> > > From:Sourajit Basak 
> > > Sent: Mon 13-Aug-2012 15:34
> > > To: user@nutch.apache.org
> > > Subject: NUTCH-1443
> > >
> > > Replacing the schema.xml from
> > > https://issues.apache.org/jira/browse/NUTCH-1443 on a Solr 3.4
> > installation
> > > reports NoClassDefs on many non-English stemmers.
> > > Any ideas ?
> > >
> >
> 


RE: WWW wide crawling using nutch

2012-08-13 Thread Markus Jelsma
Hi,

I won't try to estimate the size of the public internet but i may have some 
useful figures. A standard dual core machine with 2GB RAM can process about 15 
records per second in ideal conditions with a parsing fetcher without storing 
content. But this doesn't include indexing time, webgraph building or linkrank 
calculation. So we could achieve only about 10 records per second on average.

Another cluster with 16 cores and 16GB RAM each gives much better results so 
not only more hardware is better but more powerful hardware as well. With it we 
could, in ideal conditions, fetch and parse about 500 records per machine per 
second. When taking the other jobs into account it drops to an average of 300 
records per second per machine.

Under normal conditions it is between 150 and 250. With these figures you would 
have only a fraction of the internet after a year, without even revisiting pages, 
even if you had a hundred powerful machines.

It's also impossible to do with a standard Nutch as you will quickly run into a 
lot of trouble with useless pages and crawler traps. Another very significant 
problem is duplicate websites such as www and non-www pages, but these 
duplicates come in many more exotic varieties. You also have to manage 
extremely large blacklists (many millions) of dead hosts and prevent 
those from polluting your CrawlDB; the set of dead URLs can quickly grow very large.

Crawling the internet means managing a lot of crap.

Good luck
Markus
 
-Original message-
> From:Ryan L. Sun 
> Sent: Mon 13-Aug-2012 20:58
> To: user@nutch.apache.org
> Subject: WWW wide crawling using nutch
> 
> Hi all,
> 
> I'm looking for some estimate/stat regarding WWW wide crawling using
> nutch (or 10%/20% of WWW). What kind of hardware do u need and how
> long it takes to finish one round of search?
> 
> TIA.
> 


RE: NUTCH-1443

2012-08-13 Thread Markus Jelsma
All that was changed in rev 1367786 was the schema version which was incorrect. 
Which schema did you replace with the updated one?

Anyway, you probably haven't included non-standard stemmers from contrib in 
your Solr lib dir. All snowball variants work out of the box.
 
-Original message-
> From:Sourajit Basak 
> Sent: Mon 13-Aug-2012 15:34
> To: user@nutch.apache.org
> Subject: NUTCH-1443
> 
> Replacing the schema.xml from
> https://issues.apache.org/jira/browse/NUTCH-1443 on a Solr 3.4 installation
> reports NoClassDefs on many non-English stemmers.
> Any ideas ?
> 


RE: MoreIndexingFilter plugin failing with NPE

2012-08-13 Thread Markus Jelsma
Strange, no content type, that should not happen. Anyway, you can open an issue 
in Jira for this.
Please mention your Nutch version. 

I cannot replicate it with trunk.

Also, is it being parsed at all? 
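
For reference, the kind of guard that seems to be missing would look something 
like this (just a sketch of the shape with hypothetical names, not the actual 
MoreIndexingFilter code; contentType is whatever value was read from the 
content or parse metadata):

  public class ContentTypeGuard {
    // defensive split of "primary/sub"; returns null when the type is unknown
    public static String[] splitContentType(String contentType) {
      if (contentType == null) {
        return null;             // nothing known, so don't add type fields at all
      }
      String[] parts = contentType.split("/", 2);
      String primary = parts[0];
      String sub = parts.length > 1 ? parts[1] : "";
      return new String[] { primary, sub };
    }
  }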
 
-Original message-
> From:Bai Shen 
> Sent: Mon 13-Aug-2012 15:12
> To: user@nutch.apache.org
> Subject: MoreIndexingFilter plugin failing with NPE
> 
> MoreIndexingFilter is failing with an NPE when trying to index
> http://spiderbites.nytimes.com/
> 
> The contentType comes back as null.  There is a check for this in order to
> determine which MIME command to run.
> 
> However, when you check to see if the content type needs to be split into
> sub parts, there is no check and it throws an NPE.
> 


RE: CHM Files and Tika

2012-08-09 Thread Markus Jelsma
Hmm, I'm not sure but maybe we don't include all Tika parser deps in our 
build.xml?

 
 
-Original message-
> From:Sebastian Nagel 
> Sent: Thu 09-Aug-2012 23:18
> To: user@nutch.apache.org
> Subject: Re: CHM Files and Tika
> 
> Hi Jan,
> 
> confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
> can parse chm. The chm parsers are in tika-parser*.jar which is contained
> in the Nutch package.
> 
> Any ideas?
> 
> Sebastian
> 
> On 08/08/2012 12:03 PM, Jan Riewe wrote:
> > Hey there,
> > 
> > i try to parse CHM (Microsoft Help Files) with Nucht, but i get a:
> > 
> > Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
> > 
> > i've tried version 1.4 (tika 0.10) and 1.51 from nutch (tika 1.1) which
> > should be able to parse those files
> > https://issues.apache.org/jira/browse/TIKA-245
> > 
> > In the tika-mimetypes.xml i do find a entry related to
> > application/vnd.ms-htmlhelp
> > 
> > Does anyone ever ran into the same issues and knows how to fix that?
> > 
> > Bye
> > Jan
> > 
> 
> 


RE: Happy 10th Birthday Nutch!

2012-08-09 Thread Markus Jelsma

Nice!
 
 
-Original message-
> From:Ferdy Galema 
> Sent: Thu 09-Aug-2012 10:12
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Subject: Re: Happy 10th Birthday Nutch!
> 
> Cheers!
> 
> On Thu, Aug 9, 2012 at 9:56 AM, Julien Nioche  > wrote:
> 
> > Doug Cutting on twitter :
> > https://twitter.com/cutting/status/233415059798372353
> >
> > *RT @StefanGroschupf: Happy 10th birthday#Nutch! Registered at sourceforce
> > august 2002. Turned out to be quite a game changer. #Hadoop
> > *
> > Happy birthday Nutch and thanks to all contributors past and present!
> >
> > Julien
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
> 


RE: crawling site without www

2012-08-08 Thread Markus Jelsma

If it starts to redirect and you are on the wrong side of the redirect, you're 
in trouble. But with the HostNormalizer you can then renormalize all URLs to 
the host that is being redirected to.
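
A stripped-down normalizer sketch of that idea (not the actual HostNormalizer, 
and with the host names hard-coded purely for illustration) would be:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLNormalizer;

  public class WwwHostNormalizer implements URLNormalizer {
    private Configuration conf;

    // rewrite the bare host to the www host the site redirects to
    public String normalize(String url, String scope) {
      return url.replaceFirst("^http://mobile365\\.ru", "http://www.mobile365.ru");
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

You would still need the usual plugin wrapping (plugin.xml, build.xml) and to 
list it in plugin.includes and urlnormalizer.order for it to be picked up.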
 
 
-Original message-
> From:Alexei Korolev 
> Sent: Wed 08-Aug-2012 15:55
> To: user@nutch.apache.org
> Subject: Re: crawling site without www
> 
> > You can use the HostURLNormalizer for this task or just crawl the www OR
> > the non-www, not both.
> >
> 
> I'm trying to crawl only version without www. As I see, I can remove www.
> using proper configured regex-normalize.xml.
> But will it work if mobile365.ru redirect on www.mobile365.ru (it's very
> common situation in web)
> 
> Thanks.
> 
> Alexei
> 


RE: crawling site without www

2012-08-08 Thread Markus Jelsma


 
 
-Original message-
> From:Alexei Korolev 
> Sent: Wed 08-Aug-2012 15:43
> To: user@nutch.apache.org
> Subject: Re: crawling site without www
> 
> Hi, Sebastian
> 
> Seems you are right. I have db.ignore.external.links is true.
> But how to configure nutch for processing mobile365.ru and www.mobile365 as
> single site?

You can use the HostURLNormalizer for this task or just crawl the www OR the 
non-www, not both.

> 
> Thanks.
> 
> On Tue, Aug 7, 2012 at 10:58 PM, Sebastian Nagel  > wrote:
> 
> > Hi Alexei,
> >
> > I tried a crawl with your scrip fragment and Nutch 1.5.1
> > and the URLs http://mobile365.ru as seed. It worked,
> > see annotated log below.
> >
> > Which version of Nutch do you use?
> >
> > Check the property db.ignore.external.links (default is false).
> > If true the link from mobile365.ru to www.mobile365.ru
> > is skipped.
> >
> > Look into your crawldb (bin/nutch readdb)
> >
> > Check your URL filters with
> >  bin/nutch org.apache.nutch.net.URLFilterChecker
> >
> > Finally, send the nutch-site.xml and every configuration
> > file you changed.
> >
> > Good luck,
> > Sebastian
> >
> > % nutch inject crawl/crawldb seed.txt
> > Injector: starting at 2012-08-07 20:31:00
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: seed.txt
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2012-08-07 20:31:15, elapsed: 00:00:15
> >
> > % nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > Generator: starting at 2012-08-07 20:31:23
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: crawl/crawldb/segments/20120807203131
> > Generator: finished at 2012-08-07 20:31:39, elapsed: 00:00:15
> >
> > # Note: personally, I would prefer not to place segments (also linkdb)
> > #   in the crawldb/ folder.
> >
> > % s1=`ls -d crawl/crawldb/segments/* | tail -1`
> >
> > % nutch fetch $s1
> > Fetcher: starting at 2012-08-07 20:32:00
> > Fetcher: segment: crawl/crawldb/segments/20120807203131
> > Using queue mode : byHost
> > Fetcher: threads: 10
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > Using queue mode : byHost
> > fetching http://mobile365.ru/
> > Using queue mode : byHost
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > -finishing thread FetcherThread, activeThreads=1
> > Using queue mode : byHost
> > Using queue mode : byHost
> > Fetcher: throughput threshold: -1
> > -finishing thread FetcherThread, activeThreads=1
> > Fetcher: throughput threshold retries: 5
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2012-08-07 20:32:08, elapsed: 00:00:07
> >
> > % nutch parse $s1
> > ParseSegment: starting at 2012-08-07 20:32:12
> > ParseSegment: segment: crawl/crawldb/segments/20120807203131
> > Parsed (10ms):http://mobile365.ru/
> > ParseSegment: finished at 2012-08-07 20:32:20, elapsed: 00:00:07
> >
> > % nutch updatedb crawl/crawldb/ $s1
> > CrawlDb update: starting at 2012-08-07 20:32:24
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/crawldb/segments/20120807203131]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > CrawlDb update: 404 purging: false
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2012-08-07 20:32:38, elapsed: 00:00:13
> >
> > # see whether the outlink is now in crawldb:
> > % nutch readdb crawl/crawldb/ -stats
> > CrawlDb statistics start: crawl/crawldb/
> > Statistics for CrawlDb: crawl/crawldb/
> > TOTAL urls: 2
> > retry 0:2
> > min score:  1.0
> > avg score:  1.0
> > max score:  1.0
> > status 1 (db_unfetched):1
> > status 2 (db_fetched):  1
> > CrawlDb statistics: done
> > # => yes: http://mobile365.ru/ is fetched, outlink found
> >
> > %nutch generate crawl/crawldb crawl/crawldb/segments -adddays 0
> > Generator: starting at 2012-08-07 20:32:58
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating 

RE: Can I only add url in a specified div to the fetch list with nutch?

2012-08-03 Thread Markus Jelsma
Hi,

Outlinks are added to the ParseData object before being passed to an 
HtmlParseFilter. In an HtmlParseFilter plugin you can obtain the Outlinks and 
remove those you don't want.

   Outlink[] outlinks = parseResult.get(content.getUrl()).getData().getOutlinks();

Use the setOutlinks() method to write your processed list to the ParseData.
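
Put together inside the filter, that could look roughly like this (a sketch; 
keep() stands for whatever test decides which outlinks you want, e.g. the ones 
found inside your div):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.nutch.parse.Outlink;
  import org.apache.nutch.parse.ParseData;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;

  public class OutlinkPruner {
    // keep only the outlinks we are interested in
    public static void prune(Content content, ParseResult parseResult) {
      ParseData data = parseResult.get(content.getUrl()).getData();
      List<Outlink> kept = new ArrayList<Outlink>();
      for (Outlink outlink : data.getOutlinks()) {
        if (keep(outlink)) {                 // your own test, e.g. on outlink.getToUrl()
          kept.add(outlink);
        }
      }
      data.setOutlinks(kept.toArray(new Outlink[kept.size()]));
    }

    private static boolean keep(Outlink outlink) {
      return true;                           // placeholder decision
    }
  }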

Cheers,
 
 
-Original message-
> From:刘?? 
> Sent: Fri 03-Aug-2012 15:45
> To: user@nutch.apache.org
> Subject: Can I only add url in a specified div to the fetch list with nutch?
> 
> Such as the title, I want crawl a page with many urls, but only the ones in
> a specified div are meaningful to me. So I want to write a plugin to filter
> it, but I don't know which extension point should I choose.
> 
> The htmlparser filter can get the html content, but seems like process
> after the "add to fetch list" operation. And the urlfilter can control the
> fetch list, but I cant get the html content in it.
> 
> Look forward to any helpful replies, thx.
> 


RE: Nutch 1.5.1 Solr 3.6.1 Error

2012-07-31 Thread Markus Jelsma
The issue is now fixed in trunk and the 2.x branch. 
 
-Original message-
> From:Markus Jelsma 
> Sent: Tue 31-Jul-2012 23:22
> To: user@nutch.apache.org
> Subject: RE: Nutch 1.5.1 Solr 3.6.1 Error
> 
> I see, the schema version is very wrong and will not work properly with Solr:
> https://issues.apache.org/jira/browse/NUTCH-1443
> 
>  
>  
> -Original message-
> > From:Caklovic, Nenad 
> > Sent: Tue 31-Jul-2012 23:11
> > To: user@nutch.apache.org
> > Subject: RE: Nutch 1.5.1 Solr 3.6.1 Error
> > 
> > Hi,
> > 
> > I had the same problem. I resolved it by modifying nutch version line in 
> > schema.xml file from:
> > 
> > 
> > to:
> > 
> > 
> > It seems that the problem is that the schema parser treats this field as a 
> > floating-point type, which doesn't like seeing two periods in a floating-point 
> > number.
> > 
> > Best,
> > Nenad
> > 
> > 
> > -Original Message-
> > From: Kevin Reiss [mailto:kevin.re...@gmail.com] 
> > Sent: Tuesday, July 31, 2012 1:23 PM
> > To: user@nutch.apache.org
> > Subject: Re: Nutch 1.5.1 Solr 3.6.1 Error
> > 
> > Hi,
> > 
> > Thanks for the suggestion, but I am actually am using the "conf/schema.xml"
> > file that ships with nutch 1.5.1.
> > http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/conf/schema.xml?view=markup.
> > I've just copied it over into my solr instance's conf directory as 
> > suggested in step six off the Nutch Tutorial 
> > http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch.
> > 
> > On Tue, Jul 31, 2012 at 3:10 PM, Alejandro Caceres < 
> > acace...@hyperiongray.com> wrote:
> > 
> > > Hey Kevin,
> > >
> > > Check your "schema.xml" file. This file specifies what fields Sole 
> > > "knows about" when indexing. I suspect you have not edited it to 
> > > include what nutch is trying to index. Do a search online for a 
> > > nutch-specific Solr schema.xml file, or alternatively check the Solr 
> > > logs to see which fields are missing.
> > >
> > > On Mon, Jul 30, 2012 at 6:11 PM, Kevin Reiss 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm working through the Nutch Tutorial using version 1.5.1 and am 
> > > > getting an error when using the schema.xml file for solr provided 
> > > > with nutch to finish the final step at 
> > > > http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_N
> > > > utch
> > > .
> > > > I'm working with Solr 3.6.1 using the "1.6.0_33" version of the JDK. 
> > > > The error is
> > > >
> > > > SEVERE: org.apache.solr.common.SolrException: Schema Parsing Failed:
> > > > multiple points
> > > > at 
> > > > org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:688)
> > > > at org.apache.solr.schema.IndexSchema.(IndexSchema.java:123)
> > > > at org.apache.solr.core.CoreContainer.create(CoreContainer.java:478)
> > > > at org.apache.solr.core.CoreContainer.load(CoreContainer.java:332)
> > > > at org.apache.solr.core.CoreContainer.load(CoreContainer.java:216)
> > > > at
> > > >
> > > >
> > > org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContaine
> > > r.java:161)
> > > > at
> > > >
> > > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.jav
> > > a:96)
> > > > at 
> > > > org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
> > > > at
> > > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > > 0)
> > > > at
> > > >
> > > >
> > > org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.jav
> > > a:713)
> > > > at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
> > > > at
> > > >
> > > >
> > > org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java
> > > :1282)
> > > > at
> > > > org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java
> > > > :518) at 
> > > > org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:49
> > > > 9)
> > > > at
> > > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > > 0)
> > > > at
> > > >
> > > >
> > > org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.
> > > java:152)
> > > > at
> > > >
> > > >
> > > org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHand
> > > lerCollection.java:156)
> > > > at
> > > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > > 0)
> > > > at
> > > >
> > > >
> > > org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.
> > > java:152)
> > > > at
> > > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > > 0)
> > > > at
> > > > org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java
> > > > :130) at org.mortbay.jetty.Server.doStart(Server.java:224)
> > > > at
> > > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > > 0)
> > > > at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> > > >
> > > >
> > > sun.reflect.NativeMethodAccessorImpl.in

RE: Nutch 1.5.1 Solr 3.6.1 Error

2012-07-31 Thread Markus Jelsma
I see, the schema version is very wrong and will not work properly with Solr:
https://issues.apache.org/jira/browse/NUTCH-1443

 
 
-Original message-
> From:Caklovic, Nenad 
> Sent: Tue 31-Jul-2012 23:11
> To: user@nutch.apache.org
> Subject: RE: Nutch 1.5.1 Solr 3.6.1 Error
> 
> Hi,
> 
> I had the same problem. I resolved it by modifying nutch version line in 
> schema.xml file from:
> 
> 
> to:
> 
> 
> It seems that the problem is that the schema parser treats this field as a 
> floating-point type, which doesn't like seeing two periods in a floating-point 
> number.
> 
> Best,
> Nenad
> 
> 
> -Original Message-
> From: Kevin Reiss [mailto:kevin.re...@gmail.com] 
> Sent: Tuesday, July 31, 2012 1:23 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch 1.5.1 Solr 3.6.1 Error
> 
> Hi,
> 
> Thanks for the suggestion, but I am actually am using the "conf/schema.xml"
> file that ships with nutch 1.5.1.
> http://svn.apache.org/viewvc/nutch/branches/branch-1.5.1/conf/schema.xml?view=markup.
> I've just copied it over into my solr instance's conf directory as suggested 
> in step six off the Nutch Tutorial 
> http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch.
> 
> On Tue, Jul 31, 2012 at 3:10 PM, Alejandro Caceres < 
> acace...@hyperiongray.com> wrote:
> 
> > Hey Kevin,
> >
> > Check your "schema.xml" file. This file specifies what fields Sole 
> > "knows about" when indexing. I suspect you have not edited it to 
> > include what nutch is trying to index. Do a search online for a 
> > nutch-specific Solr schema.xml file, or alternatively check the Solr 
> > logs to see which fields are missing.
> >
> > On Mon, Jul 30, 2012 at 6:11 PM, Kevin Reiss 
> > wrote:
> >
> > > Hi,
> > >
> > > I'm working through the Nutch Tutorial using version 1.5.1 and am 
> > > getting an error when using the schema.xml file for solr provided 
> > > with nutch to finish the final step at 
> > > http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_N
> > > utch
> > .
> > > I'm working with Solr 3.6.1 using the "1.6.0_33" version of the JDK. 
> > > The error is
> > >
> > > SEVERE: org.apache.solr.common.SolrException: Schema Parsing Failed:
> > > multiple points
> > > at 
> > > org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:688)
> > > at org.apache.solr.schema.IndexSchema.(IndexSchema.java:123)
> > > at org.apache.solr.core.CoreContainer.create(CoreContainer.java:478)
> > > at org.apache.solr.core.CoreContainer.load(CoreContainer.java:332)
> > > at org.apache.solr.core.CoreContainer.load(CoreContainer.java:216)
> > > at
> > >
> > >
> > org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContaine
> > r.java:161)
> > > at
> > >
> > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.jav
> > a:96)
> > > at 
> > > org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
> > > at
> > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > 0)
> > > at
> > >
> > >
> > org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.jav
> > a:713)
> > > at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
> > > at
> > >
> > >
> > org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java
> > :1282)
> > > at
> > > org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java
> > > :518) at 
> > > org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:49
> > > 9)
> > > at
> > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > 0)
> > > at
> > >
> > >
> > org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.
> > java:152)
> > > at
> > >
> > >
> > org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHand
> > lerCollection.java:156)
> > > at
> > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > 0)
> > > at
> > >
> > >
> > org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.
> > java:152)
> > > at
> > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > 0)
> > > at
> > > org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java
> > > :130) at org.mortbay.jetty.Server.doStart(Server.java:224)
> > > at
> > org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:5
> > 0)
> > > at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
> > >
> > >
> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > at
> > >
> > >
> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > at org.mortbay.start.Main.invokeMain(Main.java:194)
> > > at org.mortbay.start.Main.start(Main.java:534)
> > > at org.mortbay.start.Main.start(Main.java:441)
> > > at org.mortbay.start.Main.main(Main.java:119)
> > > Caused by: java.lang.NumberFormatException: multiple points
>

RE: Why won't my crawl ignore these urls?

2012-07-30 Thread Markus Jelsma
Hi,

Either your regex is wrong, you haven't updated the CrawlDB with the new 
filters and/or you disabled filtering in the Generator.

Cheers

 
 
-Original message-
> From:Ian Piper 
> Sent: Mon 30-Jul-2012 20:01
> To: user@nutch.apache.org
> Subject: Why won't my crawl ignore these urls?
> 
> Hi all,
> 
> I have been trying to get to the bottom of this problem for ages and cannot 
> resolve it - you're my last hope, Obi-Wan...
> 
> I have a job that crawls over a client's site. I want to exclude urls that 
> look like this:
> 
> http://[clientsite.net]/resources/type.aspx?type=[whatever] 
>  
> 
> and
> 
> http://[clientsite.net]/resources/topic.aspx?topic=[whatever] 
>  
> 
> 
> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
> 
> [...]
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
> [...]
> 
> Yet when I next run the crawl I see things like this:
> 
> fetching http://[clientsite.net]/resources/topic.aspx?topic=10 
>  
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
> [...]
> fetching http://[clientsite.net]/resources/type.aspx?type=2 
>  
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
> [...]
> 
> and the corresponding pages seem to appear in the final Solr index. So 
> clearly they are not being excluded.
> 
> Is anyone able to explain what I have missed? Any guidance much appreciated.
> 
> Thanks,
> 
> 
> Ian.
> -- 
> Dr Ian Piper
> Tellura Information Services - the web, document and information people
> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
> http://www.tellura.co.uk/  
> Creator of monickr: http://monickr.com  
> 01926 813736 | 07973 156616
> -- 
> 
> 
> 

RE: Args list for generate jobs

2012-07-30 Thread Markus Jelsma
hi 
 
-Original message-
> From:Bai Shen 
> Sent: Mon 30-Jul-2012 16:15
> To: user@nutch.apache.org
> Subject: Args list for generate jobs
> 
> Generate is one of the jobs that does not produce an argument list when run
> without any.  Is there a command to get it to print that?

$ bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

> 
> Also, where is the class that converts the command line arguments into the
> args object?

That's different in various classes. Most base jobs simply walk down the input 
string; the webgraph code uses the GNU command line parser.
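
For illustration, that GNU-style handling (it comes from Apache Commons CLI) 
looks roughly like this; the options here are made up and this is not the 
actual webgraph code:

  import org.apache.commons.cli.CommandLine;
  import org.apache.commons.cli.CommandLineParser;
  import org.apache.commons.cli.GnuParser;
  import org.apache.commons.cli.HelpFormatter;
  import org.apache.commons.cli.Options;

  public class ArgsSketch {
    public static void main(String[] args) throws Exception {
      Options options = new Options();
      options.addOption("help", false, "show this help message");
      options.addOption("webgraphdb", true, "the web graph database to use");

      CommandLineParser parser = new GnuParser();
      CommandLine line = parser.parse(options, args);
      if (line.hasOption("help") || !line.hasOption("webgraphdb")) {
        // this is what prints the usage/argument list
        new HelpFormatter().printHelp("ArgsSketch", options);
        return;
      }
      System.out.println("webgraphdb = " + line.getOptionValue("webgraphdb"));
    }
  }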

> 
> Thanks.
> 


RE: Nutch and Solr

2012-07-20 Thread Markus Jelsma
You don't run Solr on Hadoop, just separately from Nutch. Integration is the same 
as when running stand-alone; it's HTTP.
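
In other words, the Solr side is just an HTTP endpoint. With SolrJ that boils 
down to something like the sketch below (made-up field values and a local Solr 
URL; this is only to illustrate the HTTP connection, normally you would let 
the solrindex job do the indexing):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class SolrHttpSketch {
    public static void main(String[] args) throws Exception {
      // plain HTTP connection to a Solr core, no Hadoop involved
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "http://example.com/");
      doc.addField("content", "some parsed text");
      server.add(doc);
      server.commit();
    }
  }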

 
 
-Original message-
> From:shekhar sharma 
> Sent: Fri 20-Jul-2012 13:22
> To: user@nutch.apache.org
> Subject: Re: Nutch and Solr
> 
> But would that integration of solr and nutch will work on hadoop?
> I can run nutch on hadoop but i am not sure about solr...Let me try
> 
> Regards,
> Som
> 
> On Fri, Jul 20, 2012 at 4:46 PM, Jim Chandler wrote:
> 
> > Hi Som,
> >
> > for (1) I found the following useful.
> > http://wiki.apache.org/nutch/NutchTutorial
> >
> > as for (2) I've never used Lily soI can't speak to that.
> >
> > Jim
> >
> > On Fri, Jul 20, 2012 at 1:57 AM, shekhar sharma  > >wrote:
> >
> > > Hi All,
> > > I am running Nutch (1.5.1) on Hadoop and successfully crawled some pages.
> >  I
> > > would like to use Solr for indexing the crawled pages, but while going
> > > through lot of posts, i found so many information that i am lost..
> > > i have few questions..
> > >
> > > (1) Can i use the solr jar which comes with Nutch distribution? or do i
> > > have to install Solr seperately and integrate all of the three (nutch
> > solr
> > > and hadoop)
> > >
> > > (2) Is it good to use Lily (Hadoop,HBase and Solr)
> > >
> > > SOmewhere i read that distributing indexing is not possible, you need to
> > > write your own indexing logic ..
> > >
> > > Can you please suggest something?
> > >
> > > Regards,
> > > Som
> > >
> >
> 


RE: Nutch Content Filtering

2012-07-17 Thread Markus Jelsma
HtmlParseFilter or IndexingFilter. If you do want to parse and extract outlinks, 
use an indexing filter to deny pages from being indexed. If you just want to 
throw away the whole page and its outlinks when it does not contain your terms, 
then implement HtmlParseFilter. See plugins for examples.
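
A minimal sketch of the indexing-filter option might look like this (keyword 
and class name made up; depending on your Nutch version the interface may 
expect a couple more methods):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  public class KeywordIndexingFilter implements IndexingFilter {
    private Configuration conf;

    // returning null drops the page from the index; its outlinks are still followed
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) {
      String text = parse.getText();
      if (text == null || !text.toLowerCase().contains("mykeyword")) {
        return null;
      }
      return doc;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }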

 
 
-Original message-
> From:mausmust 
> Sent: Tue 17-Jul-2012 09:53
> To: user@nutch.apache.org
> Subject: Re: Nutch Content Filtering
> 
> Which interface i should use for implementing?
> 
> 
> On 07/17/2012 10:45 AM, Markus Jelsma wrote:
> > Hi,
> >
> > You can create a simple parse or index filter implementation, check for 
> > words in the content and act appropriately.
> >
> > Cheers
> >   
> 
> 
> 


RE: Nutch Content Filtering

2012-07-17 Thread Markus Jelsma
Hi,

You can create a simple parse or index filter implementation, check for words 
in the content and act appropriately.

Cheers 
 
-Original message-
> From:mausmust 
> Sent: Tue 17-Jul-2012 09:43
> To: user@nutch.apache.org
> Subject: Nutch Content Filtering
> 
> While Apache Nutch 1.3 is crawling pages, I want to analyze the content of 
> the page and, if the content contains some keywords, add the page for the 
> next steps, say indexing. If the content does not contain at least one 
> keyword, then just take the links from that page and ignore it. How can I 
> do that? Is there any filtering plugin available for this purpose? Thanks.
> 


RE: Problems on using nutch

2012-07-16 Thread Markus Jelsma
Hi,
 
 
-Original message-
> From:IT_ailen 
> Sent: Mon 16-Jul-2012 07:40
> To: user@nutch.apache.org
> Subject: Problems on using nutch
> 
> Hi there,
>  Recently I'm crawling some sites with Nutch, but there are several problems
> bothering me. I have searched some with Google and some forums like
> nutch-user, but still gotten little help. So I have to list them as
> following and hope you guys can do me a favor. Thanks~
>  1. Can Nutch be interrupted when it is crawling? If it can be interrupted,
> what's the exact handling logic after it resumes; if not, must I re-crawl
> the whole sites (oh, that would be a really huge re-work), or is there a
> better solution?

No, you cannot resume a Nutch 1.x crawl. If it is interrupted for some reason 
you must refetch all pages. This is not a problem if you work with small 
segment sizes.

>  2. How does the Nutch handle with some bad HTTP status like 307, 203?

I'm not sure. You may want to check the ProtocolStatus and lib-http code.

>  3. How does the crawl option depth work? For example, if I have crawled
> with a depth valued 3, what will the Nutch do when I re-crawl with
> "depth=3". Will it regenerate the destine list of URLs from the most recent
> segment or all of them and the file of original seeds?

It will follow outlinks to the 3rd depth from the original seeds.

>  4. What kind of influences will be made when I manually remove some
> subdirectories under the segments directory?
> I've searched these questions but don't get clear answers, so I hope you
> guys maybe tell me what in your opinions, or we can discuss them here.
> I'm reading the source code but that is a really huge work~~ 

You can delete them if you don't need them anymore.

> 
> -
> I'm what I am.
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problems-on-using-nutch-tp3995207.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 


RE: Error parsing html

2012-07-12 Thread Markus Jelsma
Please provide the whole log snippet. Is it an HTML file? Can the parser parse 
it, is it large?
 
 
-Original message-
> From:Sudip Datta 
> Sent: Thu 12-Jul-2012 23:47
> To: Markus Jelsma 
> Cc: user@nutch.apache.org
> Subject: Re: Error parsing html
> 
> In ParseUtil.java, it gets an Exception (Not TimoutException), while trying 
> to implement:
>  res = task.get(MAX_PARSE_TIME, TimeUnit.SECONDS);
> Wonder if that can help in getting closer to the solution.
> 
> Should I instead try Tika as well, which I believe also parses HTML? What 
> changes will be required for that?
> 
> Thanks again.
> 
> On Fri, Jul 13, 2012 at 1:40 AM, Markus Jelsma  <mailto:markus.jel...@openindex.io> > wrote:
> Seems correct indeed. Please check the logs, they may tell some more.
> 
> 
> 
> -Original message-
> > From:Sudip Datta mailto:pid...@gmail.com> >
> > Sent: Thu 12-Jul-2012 21:51
> > To: Markus Jelsma  > <mailto:markus.jel...@openindex.io> >
> > Cc: user@nutch.apache.org <mailto:user@nutch.apache.org> 
> > Subject: Re: Error parsing html
> >
> > Hi Markus,
> >
> > Yes, they seem to be rightly mapped:
> >
> > parse-plugins.xml reads:
> >
> > 
> >     
> > 
> >
> > and tika's plugin.xml reads:
> >
> >    > id="org.apache.nutch.parse.tika" name="TikaParser">
> >      > class="org.apache.nutch.parse.tika.TikaParser">
> >       
> >     
> >   
> >
> > This one
> > http://stackoverflow.com/questions/8784656/nutch-unable-to-successfully-parse-contentseems
> >  
> > <http://stackoverflow.com/questions/8784656/nutch-unable-to-successfully-parse-contentseems>
> >  
> > to have a similar problem but doesn't mention where in code he has an
> > error.
> >
> > Thanks,
> >
> > --Sudip.
> >
> > On Fri, Jul 13, 2012 at 12:19 AM, Markus Jelsma
> > mailto:markus.jel...@openindex.io> >wrote:
> >
> > > strange, check if text/html is mapped to parse-tika or parse-html in
> > > parse-plugins.xml. You may also want to check tika's plugin.xml, it must 
> > > be
> > > mapped to * or a regex of content types.
> > >
> > >
> > > -Original message-
> > > > From:Sudip Datta mailto:pid...@gmail.com> >
> > > > Sent: Thu 12-Jul-2012 20:36
> > > > To: user@nutch.apache.org <mailto:user@nutch.apache.org> 
> > > > Subject: Re: Error parsing html
> > > >
> > > > Nopes. That didn't help. In fact, I had added that entry minutes before
> > > > sending a mail to the group and after couple of hours of frustration in
> > > > trying to get the parser to work.
> > > >
> > > > On Thu, Jul 12, 2012 at 11:40 PM, Lewis John Mcgibbney <
> > > > lewis.mcgibb...@gmail.com <mailto:lewis.mcgibb...@gmail.com> > wrote:
> > > >
> > > > > For starters there is no parse-xhtml plugin unless of course this is a
> > > > > custom one you've written yourself.
> > > > >
> > > > > Unless this is the case then remove this from the plugin.includes
> > > > > property and re-spin it
> > > > >
> > > > > hth
> > > > >
> > > > > On Thu, Jul 12, 2012 at 7:00 PM, Sudip Datta  > > > > <mailto:pid...@gmail.com> > wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am using Nutch 1.4 and Solr. My crawls were working perfectly fine
> > > > > before
> > > > > > I made some changes to the SolrWriter (which I believe has nothing
> > > to do
> > > > > > with my problem). Since then, I am getting:
> > > > > >
> > > > > > WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully
> > > parse
> > > > > > content  of type text/html
> > > > > > INFO : org.apache.nutch.parse.ParseSegment - Parsing: 
> > > > > > WARN : org.apache.nutch.parse.ParseSegment - Error parsing:
> > > :
> > > > > > failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> > > > > > successfully parse content
> > > > > >
> > > > > > for any  that I try to crawl!
> > > > > >
> > > > > > My nutch-site.xml file reads:
> > > > > >
> > > > >
> > > protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> > > > > >
> > > > > > What could be going wrong?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > --Sudip.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Lewis
> > > > >
> > > >
> > >
> >
> 
> 


RE: Error parsing html

2012-07-12 Thread Markus Jelsma
Seems correct indeed. Please check the logs, they may tell some more. 

 
 
-Original message-
> From:Sudip Datta 
> Sent: Thu 12-Jul-2012 21:51
> To: Markus Jelsma 
> Cc: user@nutch.apache.org
> Subject: Re: Error parsing html
> 
> Hi Markus,
> 
> Yes, they seem to be rightly mapped:
> 
> parse-plugins.xml reads:
> 
> 
> 
> 
> 
> and tika's plugin.xml reads:
> 
>id="org.apache.nutch.parse.tika" name="TikaParser">
>  class="org.apache.nutch.parse.tika.TikaParser">
>   
> 
>   
> 
> This one
> http://stackoverflow.com/questions/8784656/nutch-unable-to-successfully-parse-contentseems
> to have a similar problem but doesn't mention where in code he has an
> error.
> 
> Thanks,
> 
> --Sudip.
> 
> On Fri, Jul 13, 2012 at 12:19 AM, Markus Jelsma
> wrote:
> 
> > strange, check if text/html is mapped to parse-tika or parse-html in
> > parse-plugins.xml. You may also want to check tika's plugin.xml, it must be
> > mapped to * or a regex of content types.
> >
> >
> > -Original message-
> > > From:Sudip Datta 
> > > Sent: Thu 12-Jul-2012 20:36
> > > To: user@nutch.apache.org
> > > Subject: Re: Error parsing html
> > >
> > > Nopes. That didn't help. In fact, I had added that entry minutes before
> > > sending a mail to the group and after couple of hours of frustration in
> > > trying to get the parser to work.
> > >
> > > On Thu, Jul 12, 2012 at 11:40 PM, Lewis John Mcgibbney <
> > > lewis.mcgibb...@gmail.com> wrote:
> > >
> > > > For starters there is no parse-xhtml plugin unless of course this is a
> > > > custom one you've written yourself.
> > > >
> > > > Unless this is the case then remove this from the plugin.includes
> > > > property and re-spin it
> > > >
> > > > hth
> > > >
> > > > On Thu, Jul 12, 2012 at 7:00 PM, Sudip Datta  wrote:
> > > > > Hi,
> > > > >
> > > > > I am using Nutch 1.4 and Solr. My crawls were working perfectly fine
> > > > before
> > > > > I made some changes to the SolrWriter (which I believe has nothing
> > to do
> > > > > with my problem). Since then, I am getting:
> > > > >
> > > > > WARN : org.apache.nutch.parse.ParseUtil - Unable to successfully
> > parse
> > > > > content  of type text/html
> > > > > INFO : org.apache.nutch.parse.ParseSegment - Parsing: 
> > > > > WARN : org.apache.nutch.parse.ParseSegment - Error parsing:
> > :
> > > > > failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> > > > > successfully parse content
> > > > >
> > > > > for any  that I try to crawl!
> > > > >
> > > > > My nutch-site.xml file reads:
> > > > >
> > > >
> > protocol-httpclient|urlfilter-regex|parse-(html|xhtml|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> > > > >
> > > > > What could be going wrong?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --Sudip.
> > > >
> > > >
> > > >
> > > > --
> > > > Lewis
> > > >
> > >
> >
> 

