Re: Nutch efficiency and multiple single URL crawls

2012-11-25 Thread Joe Zhang
what do you mean by the "job file"?

On Sun, Nov 25, 2012 at 10:43 PM, AC Nutch  wrote:

> Hello,
>
> I am using Nutch 1.5.1 and I am looking to do something specific with it. I
> have a few million base domains in a Solr index, so for example:
> http://www.nutch.org, http://www.apache.org, http://www.whatever.com etc.
> I
> am trying to crawl each of these base domains in deploy mode and retrieve
> all of their sub-urls associated with that domain in the most efficient way
> possible. To give you an example of the workflow I am trying to achieve:
> (1) Grab a base domain, let's say http://www.nutch.org (2) Crawl the base
> domain for all URLs in that domain, let's say http://www.nutch.org/page1,
> http://www.nutch.org/page2, http://www.nutch.org/page3, etc. etc. (3)
> store
> these results somewhere (perhaps another Solr instance) and (4) move on to
> the next base domain in my Solr index and repeat the process. Essentially
> just trying to grab all links associated with a page and then move on to
> the next page.
>
> The part I am having trouble with is ensuring that this workflow is
> efficient. The only way I can think to do this would be: (1) Grab a base
> domain from Solr from my shell script (simple enough) (2) Add an entry to
> regex-urlfilter with the domain I am looking to restrict the crawl to, in
> the example above that would be an entry that says to only keep sub-pages
> of http://www.nutch.org/ (3) Recreate the Nutch job file (~25 sec.) (4)
> Start the crawl for pages associated with a domain and do the indexing
>
> My issue is with step #3, AFAIK if I want to restrict a crawl to a specific
> domain I have to change regex-urlfilter and reload the job file. This is a
> pretty significant problem, since adding 25 seconds every single time I
> start a new base domain is going to add way too many seconds to my workflow
> (25 sec x a few million = way too much time). Finally the question...is
> there a way to add url filters on the fly when I start a crawl and/or
> restrict a crawl to a particular domain on the fly. OR can you think of a
> decent solution to the problem/am I missing something?
>


Nutch efficiency and multiple single URL crawls

2012-11-25 Thread AC Nutch
Hello,

I am using Nutch 1.5.1 and I am looking to do something specific with it. I
have a few million base domains in a Solr index, so for example:
http://www.nutch.org, http://www.apache.org, http://www.whatever.com etc. I
am trying to crawl each of these base domains in deploy mode and retrieve
all of their sub-urls associated with that domain in the most efficient way
possible. To give you an example of the workflow I am trying to achieve:
(1) Grab a base domain, let's say http://www.nutch.org (2) Crawl the base
domain for all URLs in that domain, let's say http://www.nutch.org/page1,
http://www.nutch.org/page2, http://www.nutch.org/page3, etc. etc. (3) store
these results somewhere (perhaps another Solr instance) and (4) move on to
the next base domain in my Solr index and repeat the process. Essentially
just trying to grab all links associated with a page and then move on to
the next page.

The part I am having trouble with is ensuring that this workflow is
efficient. The only way I can think to do this would be: (1) Grab a base
domain from Solr from my shell script (simple enough) (2) Add an entry to
regex-urlfilter with the domain I am looking to restrict the crawl to, in
the example above that would be an entry that says to only keep sub-pages
of http://www.nutch.org/ (3) Recreate the Nutch job file (~25 sec.) (4)
Start the crawl for pages associated with a domain and do the indexing

My issue is with step #3, AFAIK if I want to restrict a crawl to a specific
domain I have to change regex-urlfilter and reload the job file. This is a
pretty significant problem, since adding 25 seconds every single time I
start a new base domain is going to add way too many seconds to my workflow
(25 sec x a few million = way too much time). Finally the question...is
there a way to add url filters on the fly when I start a crawl and/or
restrict a crawl to a particular domain on the fly. OR can you think of a
decent solution to the problem/am I missing something?
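
For illustration only, the per-domain filter in step (2) would be a
regex-urlfilter.txt containing little more than an accept rule for the current
base domain plus a catch-all reject (the exact host pattern here is an
assumption, not from the original mail):

+^https?://([a-z0-9-]+\.)*nutch\.org/
-.

Swapping that accept line out for each new base domain is what forces the job
file rebuild described in step (3).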


Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
OK. I'm testing it. But like I said, even when I reduce the patterns to the
simplest form "-.", the problem still persists.

On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma
wrote:

> It's taking input from stdin; enter some URLs to test it. You can add an
> issue with reproducible steps.
>
> -Original message-
> > From:Joe Zhang 
> > Sent: Sun 25-Nov-2012 23:49
> > To: user@nutch.apache.org
> > Subject: Re: Indexing-time URL filtering again
> >
> > I ran the regex tester command you provided. It seems to be taking
> forever
> > (15 min + by now).
> >
> > On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang  wrote:
> >
> > > you mean the content my pattern file?
> > >
> > > well, even wehn I reduce it to simply "-.", the same problem still
> pops up.
> > >
> > > On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma <
> markus.jel...@openindex.io
> > > > wrote:
> > >
> > >> You seems to have an NPE caused by your regex rules, for some weird
> > >> reason. If you can provide a way to reproduce you can file an issue in
> > >> Jira. This NPE should also occur if your run the regex tester.
> > >>
> > >> nutch -Durlfilter.regex.file=path
> org.apache.nutch.net.URLFilterChecker
> > >> -allCombined
> > >>
> > >> In the mean time you can check if a rule causes the NPE.
> > >>
> > >> -Original message-
> > >> > From:Joe Zhang 
> > >> > Sent: Sun 25-Nov-2012 23:26
> > >> > To: user@nutch.apache.org
> > >> > Subject: Re: Indexing-time URL filtering again
> > >> >
> > >> > the last few lines of hadoop.log:
> > >> >
> > >> > 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
> > >> > org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > >> > 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
> > >> > org.apache.nutch.indexer.metadata.MetadataIndexer
> > >> > 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
> > >> > java.lang.RuntimeException: Error in configuring object
> > >> > at
> > >> >
> > >>
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> > >> > at
> > >> >
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> > >> > at
> > >> >
> > >>
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> > >> > at
> > >> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
> > >> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> > >> > at
> > >> >
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> > >> > Caused by: java.lang.reflect.InvocationTargetException
> > >> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> > >> > at
> > >> >
> > >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >> > at
> > >> >
> > >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >> > at java.lang.reflect.Method.invoke(Method.java:601)
> > >> > at
> > >> >
> > >>
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> > >> > ... 5 more
> > >> > Caused by: java.lang.RuntimeException: Error in configuring object
> > >> > at
> > >> >
> > >>
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> > >> > at
> > >> >
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> > >> > at
> > >> >
> > >>
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> > >> > at
> > >> org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> > >> > ... 10 more
> > >> > Caused by: java.lang.reflect.InvocationTargetException
> > >> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> > >> > at
> > >> >
> > >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >> > at
> > >> >
> > >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >> > at java.lang.reflect.Method.invoke(Method.java:601)
> > >> > at
> > >> >
> > >>
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> > >> > ... 13 more
> > >> > Caused by: java.lang.NullPointerException
> > >> > at java.io.Reader.(Reader.java:78)
> > >> > at java.io.BufferedReader.(BufferedReader.java:94)
> > >> > at java.io.BufferedReader.(BufferedReader.java:109)
> > >> > at
> > >> >
> > >>
> org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
> > >> > at
> > >> >
> > >>
> org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
> > >> > at
> > >> >
> > >>
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
> > >> > at
> org.apache.nutch.net.URLFilters.(URLFilters.java:57)
> > >> > at
> > >> >
> > >>
> org.apache.nutch.indexer.

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
It's taking input from stdin; enter some URLs to test it. You can add an issue 
with reproducible steps. 
 
-Original message-
> From:Joe Zhang 
> Sent: Sun 25-Nov-2012 23:49
> To: user@nutch.apache.org
> Subject: Re: Indexing-time URL filtering again
> 
> I ran the regex tester command you provided. It seems to be taking forever
> (15 min + by now).
> 
> On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang  wrote:
> 
> > you mean the content of my pattern file?
> >
> > well, even when I reduce it to simply "-.", the same problem still pops up.
> >
> > On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma  > > wrote:
> >
> >> You seems to have an NPE caused by your regex rules, for some weird
> >> reason. If you can provide a way to reproduce you can file an issue in
> >> Jira. This NPE should also occur if your run the regex tester.
> >>
> >> nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker
> >> -allCombined
> >>
> >> In the mean time you can check if a rule causes the NPE.
> >>
> >> -Original message-
> >> > From:Joe Zhang 
> >> > Sent: Sun 25-Nov-2012 23:26
> >> > To: user@nutch.apache.org
> >> > Subject: Re: Indexing-time URL filtering again
> >> >
> >> > the last few lines of hadoop.log:
> >> >
> >> > 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
> >> > org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> >> > 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
> >> > org.apache.nutch.indexer.metadata.MetadataIndexer
> >> > 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
> >> > java.lang.RuntimeException: Error in configuring object
> >> > at
> >> >
> >> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> >> > at
> >> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> >> > at
> >> >
> >> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> >> > at
> >> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
> >> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >> > at
> >> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >> > Caused by: java.lang.reflect.InvocationTargetException
> >> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> > at
> >> >
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> > at
> >> >
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> > at java.lang.reflect.Method.invoke(Method.java:601)
> >> > at
> >> >
> >> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> >> > ... 5 more
> >> > Caused by: java.lang.RuntimeException: Error in configuring object
> >> > at
> >> >
> >> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> >> > at
> >> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> >> > at
> >> >
> >> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> >> > at
> >> org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> >> > ... 10 more
> >> > Caused by: java.lang.reflect.InvocationTargetException
> >> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> > at
> >> >
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >> > at
> >> >
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> > at java.lang.reflect.Method.invoke(Method.java:601)
> >> > at
> >> >
> >> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> >> > ... 13 more
> >> > Caused by: java.lang.NullPointerException
> >> > at java.io.Reader.(Reader.java:78)
> >> > at java.io.BufferedReader.(BufferedReader.java:94)
> >> > at java.io.BufferedReader.(BufferedReader.java:109)
> >> > at
> >> >
> >> org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
> >> > at
> >> >
> >> org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
> >> > at
> >> >
> >> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
> >> > at org.apache.nutch.net.URLFilters.(URLFilters.java:57)
> >> > at
> >> >
> >> org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
> >> > ... 18 more
> >> > 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException:
> >> Job
> >> > failed!
> >> >
> >> >
> >> > On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
> >> > wrote:
> >> >
> >> > > You should provide the log output.
> >> > >
> >> > > -Original message-
> >> > > > From:Joe Zhang 
> >> > > > Sent: Sun 25-Nov-2012 17:27
> >> > > > To: user@nutch.apache.org
> >> > > > Sub

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
I ran the regex tester command you provided. It seems to be taking forever
(15 min + by now).

On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang  wrote:

> you mean the content of my pattern file?
>
> well, even when I reduce it to simply "-.", the same problem still pops up.
>
> On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma  > wrote:
>
>> You seem to have an NPE caused by your regex rules, for some weird
>> reason. If you can provide a way to reproduce it you can file an issue in
>> Jira. This NPE should also occur if you run the regex tester.
>>
>> nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker
>> -allCombined
>>
>> In the mean time you can check if a rule causes the NPE.
>>
>> -Original message-
>> > From:Joe Zhang 
>> > Sent: Sun 25-Nov-2012 23:26
>> > To: user@nutch.apache.org
>> > Subject: Re: Indexing-time URL filtering again
>> >
>> > the last few lines of hadoop.log:
>> >
>> > 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
>> > org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>> > 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
>> > org.apache.nutch.indexer.metadata.MetadataIndexer
>> > 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
>> > java.lang.RuntimeException: Error in configuring object
>> > at
>> >
>> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>> > at
>> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>> > at
>> >
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>> > at
>> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
>> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>> > at
>> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>> > Caused by: java.lang.reflect.InvocationTargetException
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > at java.lang.reflect.Method.invoke(Method.java:601)
>> > at
>> >
>> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>> > ... 5 more
>> > Caused by: java.lang.RuntimeException: Error in configuring object
>> > at
>> >
>> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>> > at
>> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>> > at
>> >
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>> > at
>> org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>> > ... 10 more
>> > Caused by: java.lang.reflect.InvocationTargetException
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > at java.lang.reflect.Method.invoke(Method.java:601)
>> > at
>> >
>> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>> > ... 13 more
>> > Caused by: java.lang.NullPointerException
>> > at java.io.Reader.(Reader.java:78)
>> > at java.io.BufferedReader.(BufferedReader.java:94)
>> > at java.io.BufferedReader.(BufferedReader.java:109)
>> > at
>> >
>> org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
>> > at
>> >
>> org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
>> > at
>> >
>> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>> > at org.apache.nutch.net.URLFilters.(URLFilters.java:57)
>> > at
>> >
>> org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
>> > ... 18 more
>> > 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException:
>> Job
>> > failed!
>> >
>> >
>> > On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
>> > wrote:
>> >
>> > > You should provide the log output.
>> > >
>> > > -Original message-
>> > > > From:Joe Zhang 
>> > > > Sent: Sun 25-Nov-2012 17:27
>> > > > To: user@nutch.apache.org
>> > > > Subject: Re: Indexing-time URL filtering again
>> > > >
>> > > > I actually checked out the most recent build from SVN, Release 1.6 -
>> > > > 23/11/2012.
>> > > >
>> > > > The following command
>> > > >
>> > > > bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
>> > > > http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
>> > > > crawl/segments/*  -filter
>> > > >
>> > > > produced the following output:
>> > > >
>> > > > SolrIndexer: starting at 2012-11-25 16:19:29
>> > 

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
you mean the content of my pattern file?

well, even when I reduce it to simply "-.", the same problem still pops up.

On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma
wrote:

> You seem to have an NPE caused by your regex rules, for some weird
> reason. If you can provide a way to reproduce it you can file an issue in
> Jira. This NPE should also occur if you run the regex tester.
>
> nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker
> -allCombined
>
> In the mean time you can check if a rule causes the NPE.
>
> -Original message-
> > From:Joe Zhang 
> > Sent: Sun 25-Nov-2012 23:26
> > To: user@nutch.apache.org
> > Subject: Re: Indexing-time URL filtering again
> >
> > the last few lines of hadoop.log:
> >
> > 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
> > org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
> > org.apache.nutch.indexer.metadata.MetadataIndexer
> > 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
> > java.lang.RuntimeException: Error in configuring object
> > at
> >
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> > at
> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> > at
> >
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> > at
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
> > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> > at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> > Caused by: java.lang.reflect.InvocationTargetException
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:601)
> > at
> >
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> > ... 5 more
> > Caused by: java.lang.RuntimeException: Error in configuring object
> > at
> >
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> > at
> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> > at
> >
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> > at
> org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> > ... 10 more
> > Caused by: java.lang.reflect.InvocationTargetException
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:601)
> > at
> >
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> > ... 13 more
> > Caused by: java.lang.NullPointerException
> > at java.io.Reader.(Reader.java:78)
> > at java.io.BufferedReader.(BufferedReader.java:94)
> > at java.io.BufferedReader.(BufferedReader.java:109)
> > at
> >
> org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
> > at
> >
> org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
> > at
> >
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
> > at org.apache.nutch.net.URLFilters.(URLFilters.java:57)
> > at
> >
> org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
> > ... 18 more
> > 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job
> > failed!
> >
> >
> > On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
> > wrote:
> >
> > > You should provide the log output.
> > >
> > > -Original message-
> > > > From:Joe Zhang 
> > > > Sent: Sun 25-Nov-2012 17:27
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Indexing-time URL filtering again
> > > >
> > > > I actually checked out the most recent build from SVN, Release 1.6 -
> > > > 23/11/2012.
> > > >
> > > > The following command
> > > >
> > > > bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
> > > > http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
> > > > crawl/segments/*  -filter
> > > >
> > > > produced the following output:
> > > >
> > > > SolrIndexer: starting at 2012-11-25 16:19:29
> > > > SolrIndexer: deleting gone documents: false
> > > > SolrIndexer: URL filtering: true
> > > > SolrIndexer: URL normalizing: false
> > > > java.io.IOException: Job failed!
> > > >
> > > > Can anybody help?
> > > > On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang 
> wrote:
> > > >
> > >

RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
You seem to have an NPE caused by your regex rules, for some weird reason. If 
you can provide a way to reproduce it you can file an issue in Jira. This NPE 
should also occur if you run the regex tester.

nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker 
-allCombined

In the mean time you can check if a rule causes the NPE.
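
Purely as an illustration (the rules path is an assumption), you can pipe the
URLs in instead of typing them:

echo "http://www.example.com/page1" | nutch -Durlfilter.regex.file=conf/regex-urlfilter.txt org.apache.nutch.net.URLFilterChecker -allCombined

Each input URL should come back prefixed with '+' if it passed all filters or
'-' if it was rejected.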
 
-Original message-
> From:Joe Zhang 
> Sent: Sun 25-Nov-2012 23:26
> To: user@nutch.apache.org
> Subject: Re: Indexing-time URL filtering again
> 
> the last few lines of hadoop.log:
> 
> 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.metadata.MetadataIndexer
> 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
> java.lang.RuntimeException: Error in configuring object
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 5 more
> Caused by: java.lang.RuntimeException: Error in configuring object
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> ... 10 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 13 more
> Caused by: java.lang.NullPointerException
> at java.io.Reader.<init>(Reader.java:78)
> at java.io.BufferedReader.<init>(BufferedReader.java:94)
> at java.io.BufferedReader.<init>(BufferedReader.java:109)
> at
> org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
> at
> org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
> at
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
> at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
> at
> org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
> ... 18 more
> 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job
> failed!
> 
> 
> On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
> wrote:
> 
> > You should provide the log output.
> >
> > -Original message-
> > > From:Joe Zhang 
> > > Sent: Sun 25-Nov-2012 17:27
> > > To: user@nutch.apache.org
> > > Subject: Re: Indexing-time URL filtering again
> > >
> > > I actually checked out the most recent build from SVN, Release 1.6 -
> > > 23/11/2012.
> > >
> > > The following command
> > >
> > > bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
> > > http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
> > > crawl/segments/*  -filter
> > >
> > > produced the following output:
> > >
> > > SolrIndexer: starting at 2012-11-25 16:19:29
> > > SolrIndexer: deleting gone documents: false
> > > SolrIndexer: URL filtering: true
> > > SolrIndexer: URL normalizing: false
> > > java.io.IOException: Job failed!
> > >
> > > Can anybody help?
> > > On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang  wrote:
> > >
> > > > How exactly do I get to trunk?
> > > >
> > > > I did download download NUTCH-1300-1.5-1.patch, and run the patch
> > command
> > > > correctly, and re-build nutch. But the problem still persists...
> > > >
> > > > On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma <
> > markus.jel...@openindex.io
> > > > > wrote:
> > > >
> > > >> No, this is no bug. As i said, you need either to patch your Nutch or
> > get
> > > 

Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
the last few lines of hadoop.log:

2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
... 10 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 13 more
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.BufferedReader.<init>(BufferedReader.java:94)
at java.io.BufferedReader.<init>(BufferedReader.java:109)
at
org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
at
org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
at
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
at
org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
... 18 more
2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job
failed!


On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma
wrote:

> You should provide the log output.
>
> -Original message-
> > From:Joe Zhang 
> > Sent: Sun 25-Nov-2012 17:27
> > To: user@nutch.apache.org
> > Subject: Re: Indexing-time URL filtering again
> >
> > I actually checked out the most recent build from SVN, Release 1.6 -
> > 23/11/2012.
> >
> > The following command
> >
> > bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
> > http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
> > crawl/segments/*  -filter
> >
> > produced the following output:
> >
> > SolrIndexer: starting at 2012-11-25 16:19:29
> > SolrIndexer: deleting gone documents: false
> > SolrIndexer: URL filtering: true
> > SolrIndexer: URL normalizing: false
> > java.io.IOException: Job failed!
> >
> > Can anybody help?
> > On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang  wrote:
> >
> > > How exactly do I get to trunk?
> > >
> > > I did download download NUTCH-1300-1.5-1.patch, and run the patch
> command
> > > correctly, and re-build nutch. But the problem still persists...
> > >
> > > On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma <
> markus.jel...@openindex.io
> > > > wrote:
> > >
> > >> No, this is no bug. As i said, you need either to patch your Nutch or
> get
> > >> the sources from trunk. The -filter parameter is not in your version.
> Check
> > >> the patch manual if you don't know how it works.
> > >>
> > >> $ cd trunk ; patch -p0 < file.patch
> > >>
> > >> -Original message-
> > >> > From:Joe Zhang 
> > >> > Sent: Sun 25-Nov-2012 08:42
> > >> > To: Markus Jelsma ; user <
> > >> user@nutch.apache.org>
> > >> > Subject: Re: Indexing-time URL filtering again
> > >> >
> > >> > This does seem a bug. Can anybody help?
> > >> >
> > >> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang 
> > >> wrote:
> > >> >
> > >> > > Markus, could you advise? Thanks a lot!
> > >> > >
> > >> > >
> > >> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang  >
> > >> wrote:
> > >> > >
> > >> > >> I

RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Markus Jelsma
Hi - you need to enable mime-type mapping in Nutch config and define your 
mappings. Enable it with:

<property>
  <name>moreIndexingFilter.mapMimeTypes</name>
  <value>true</value>
</property>

and add the following to your mapping config:

cat conf/contenttype-mapping.txt 
# Target content type  type1 [ type2 ...]
text/html   application/xhtml+xml

This will map application/xhtml+xml to text/html when indexing documents to 
Solr. You can configure any arbitrary target such as `web page` or `document` 
for various similar content types.

Trunk has this feature. You can either patch your version or check out from 
trunk and compile Nutch yourself. Patching is very simple:

$ cd trunk ; patch -p0 < file.patch


-Original message-
> From:Eyeris Rodriguez Rueda 
> Sent: Sun 25-Nov-2012 20:42
> To: user@nutch.apache.org
> Subject: RE: problem with text/html content type of documents appears 
> application/xhtml+xml in solr index
> 
> Thanks a lot Markus for your answer. My English is not so good.
> I was reading but i don’t know how to fix the problems yet. Could you explain 
> me in details the solution please. I was looking in conf directory but I 
> can't find how to map one mime types to another. I need to replace index-more 
> plugin ? 
> I was looking in the link that you suggest me and a saw a 
> NUTCH-1262-1.5-1.patch but I don’t know how to use that patch.
> Please tell me if I need to delete the index completely or there is a way to 
> replace an application/xhtml+xml to text/html in solr index.
> 
> 
> 
> 
> -Mensaje original-
> De: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Enviado el: domingo, 25 de noviembre de 2012 4:33 AM
> Para: user@nutch.apache.org
> Asunto: RE: problem with text/html content type of documents appears 
> application/xhtml+xml in solr index
> 
> Hi - trunk's more indexing filter can map mime types to any target. With it 
> you can map both (x)html mimes to text/html or to `web page`.
> 
> https://issues.apache.org/jira/browse/NUTCH-1262
> 
>  
> -Original message-
> > From:Eyeris Rodriguez Rueda 
> > Sent: Sun 25-Nov-2012 00:48
> > To: user@nutch.apache.org
> > Subject: problem with text/html content type of documents appears 
> > application/xhtml+xml in solr index
> > 
> > Hi.
> > 
> > I have changed my nutch version from 1.4 to 1.5.1 and I have detected a 
> > problem with the content type of some documents: some pages with text/html 
> > appear in the solr index with application/xhtml+xml, but when I check the links 
> > the browser tells me that it is effectively text/html.
> > Can anybody help me to fix this problem? I could change this content type 
> > manually in the solr index to text/html, but that is not a good way for me.
> > Please, any suggestion or advice will be accepted.
> 
> 
> 


RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
You should provide the log output. 
 
-Original message-
> From:Joe Zhang 
> Sent: Sun 25-Nov-2012 17:27
> To: user@nutch.apache.org
> Subject: Re: Indexing-time URL filtering again
> 
> I actually checked out the most recent build from SVN, Release 1.6 -
> 23/11/2012.
> 
> The following command
> 
> bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
> http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
> crawl/segments/*  -filter
> 
> produced the following output:
> 
> SolrIndexer: starting at 2012-11-25 16:19:29
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: true
> SolrIndexer: URL normalizing: false
> java.io.IOException: Job failed!
> 
> Can anybody help?
> On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang  wrote:
> 
> > How exactly do I get to trunk?
> >
> > I did download download NUTCH-1300-1.5-1.patch, and run the patch command
> > correctly, and re-build nutch. But the problem still persists...
> >
> > On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma  > > wrote:
> >
> >> No, this is no bug. As i said, you need either to patch your Nutch or get
> >> the sources from trunk. The -filter parameter is not in your version. Check
> >> the patch manual if you don't know how it works.
> >>
> >> $ cd trunk ; patch -p0 < file.patch
> >>
> >> -Original message-
> >> > From:Joe Zhang 
> >> > Sent: Sun 25-Nov-2012 08:42
> >> > To: Markus Jelsma ; user <
> >> user@nutch.apache.org>
> >> > Subject: Re: Indexing-time URL filtering again
> >> >
> >> > This does seem a bug. Can anybody help?
> >> >
> >> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang 
> >> wrote:
> >> >
> >> > > Markus, could you advise? Thanks a lot!
> >> > >
> >> > >
> >> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang 
> >> wrote:
> >> > >
> >> > >> I followed your instruction and applied the patch, Markus, but the
> >> > >> problem still persists --- "-filter" is interpreted as a path by
> >> solrindex.
> >> > >>
> >> > >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <
> >> > >> markus.jel...@openindex.io> wrote:
> >> > >>
> >> > >>> Ah, i get it now. Please use trunk or patch your version with:
> >> > >>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable
> >> filtering.
> >> > >>>
> >> > >>> -Original message-
> >> > >>> > From:Joe Zhang 
> >> > >>> > Sent: Fri 23-Nov-2012 03:08
> >> > >>> > To: user@nutch.apache.org
> >> > >>> > Subject: Re: Indexing-time URL filtering again
> >> > >>> >
> >> > >>> > But Markus said it worked for him. I was really he could send his
> >> > >>> command
> >> > >>> > line.
> >> > >>> >
> >> > >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <
> >> > >>> > lewis.mcgibb...@gmail.com> wrote:
> >> > >>> >
> >> > >>> > > Is this a bug?
> >> > >>> > >
> >> > >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <
> >> smartag...@gmail.com>
> >> > >>> wrote:
> >> > >>> > > > Putting -filter between crawldb and segments, I sitll got the
> >> same
> >> > >>> thing:
> >> > >>> > > >
> >> > >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path
> >> does not
> >> > >>> > > exist:
> >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> >> > >>> > > > Input path does not exist:
> >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> >> > >>> > > > Input path does not exist:
> >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> >> > >>> > > > Input path does not exist:
> >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
> >> > >>> > > >
> >> > >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
> >> > >>> > > > wrote:
> >> > >>> > > >
> >> > >>> > > >> These are roughly the available parameters:
> >> > >>> > > >>
> >> > >>> > > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
> >> > >>> > > >> [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
> >> > >>> > > >> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
> >> > >>> > > >> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> >> > >>> > > >>
> >> > >>> > > >> Having -filter at the end should work fine, if it, for some
> >> > >>> reason,
> >> > >>> > > >> doesn't work put it before the segment and after the crawldb
> >> and
> >> > >>> file an
> >> > >>> > > >> issue in jira, it works here if i have -filter at the end.
> >> > >>> > > >>
> >> > >>> > > >> Cheers
> >> > >>> > > >>
> >> > >>> > > >> -Original message-
> >> > >>> > > >> > From:Joe Zhang 
> >> > >>> > > >> > Sent: Thu 22-Nov-2012 23:05
> >> > >>> > > >> > To: Markus Jelsma ; user <
> >> > >>> > > >> user@nutch.apache.org>
> >> > >>> > > >> > Subject: Re: Indexing-time URL filtering again
> >> > >>> > > >> >
> >> > >>> > > >> > Yes, I forgot to do that. But still, what exactly should
> >> the
> >> > >>> command
> >> > >>> > > >> look like?
> >> > >>> > > >> >
> >> > >>> > > >> > bin/nutch solrindex
> >>  -Durlfilter.regex.file=UrlFiltering.txt
> >> > >>> > > >> http://localhost:8983/

Re: shouldFetch rejected

2012-11-25 Thread Sebastian Nagel
> But, i create a complete new crawl dir for every crawl.
Then all should work as expected.

> why the crawler set a "page to fetch" to rejected. Because obviously
> the crawler never saw this page before (because I deleted all the old crawl 
> dirs).
> In the crawl log i see many page to fetch, but at the end all of them are 
> rejected
Are you sure they aren't fetched at all? This debug log output in the Generator
mapper is also shown for URLs fetched in previous cycles. You should check the
complete log for the "rejected" URLs.
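
A minimal sketch of the clean-slate recrawl described in the quoted exchange
below (directory names are assumptions, adjust to your layout):

rm -r crawl/crawldb crawl/segments crawl/linkdb
bin/nutch inject crawl/crawldb urls/
# then generate / fetch / parse / updatedb as usual

If URLs still show up as rejected after that, the complete log should tell you
at which step they are dropped.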


On 11/24/2012 04:46 PM, Jan Philippe Wimmer wrote:
> Hey Sebastian! Thanks for your answer.
> 
> But, i create a complete new crawl dir for every crawl. In other words i just 
> have the crawl data of
> the current, running crawl-process. When i recrawl a urlset, i delete the old 
> crawl dir and create a
> new one. At the end of any crawl i index it to solr. So i keep all crawled 
> content in the index. I
> don't need any nutch crawl dirs, because i want to crawl all relevant pages 
> in every crawl process.
> again and again.
> 
> I totally don't understand why the crawler set a "page to fetch" to 
> rejected. Because obviously
> the crawler never saw this page before (because I deleted all the old crawl 
> dirs). In the crawl log
> I see many pages to fetch, but at the end all of them are rejected. Any ideas?
> 
> Am 24.11.2012 16:36, schrieb Sebastian Nagel:
>>> I want my crawler to crawl the complete page without setting up schedulers 
>>> at all. Every crawl
>>> process should crawl every page again without having setup wait intervals.
>> That's quite easy: remove all data and launch the crawl again.
>> - Nutch 1.x : remove crawldb, segments, and linkdb
>> - 2.x : drop 'webpage' (or similar, depends on the chosen data store)
>>
>> On 11/24/2012 12:17 PM, Jan Philippe Wimmer wrote:
>>> Hi there,
>>>
>>> how can i avoid the following error:
>>> -shouldFetch rejected 'http://www.page.com/shop', fetchTime=1356347311285, 
>>> curTime=1353755337755
>>>
>>> I want my crawler to crawl the complete page without setting up schedulers 
>>> at all. Every crawl
>>> process should crawl every page again without having setup wait intervals.
>>>
>>> Any soluti



RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Eyeris Rodriguez Rueda
Thanks a lot Markus for your answer. My English is not so good.
I was reading but I don't know how to fix the problem yet. Could you explain 
the solution to me in detail, please? I was looking in the conf directory but I 
can't find how to map one mime type to another. Do I need to replace the 
index-more plugin? 
I was looking at the link that you suggested and I saw a 
NUTCH-1262-1.5-1.patch, but I don't know how to use that patch.
Please tell me if I need to delete the index completely or if there is a way to 
replace application/xhtml+xml with text/html in the solr index.




-Mensaje original-
De: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Enviado el: domingo, 25 de noviembre de 2012 4:33 AM
Para: user@nutch.apache.org
Asunto: RE: problem with text/html content type of documents appears 
application/xhtml+xml in solr index

Hi - trunk's more indexing filter can map mime types to any target. With it you 
can map both (x)html mimes to text/html or to `web page`.

https://issues.apache.org/jira/browse/NUTCH-1262

 
-Original message-
> From:Eyeris Rodriguez Rueda 
> Sent: Sun 25-Nov-2012 00:48
> To: user@nutch.apache.org
> Subject: problem with text/html content type of documents appears 
> application/xhtml+xml in solr index
> 
> Hi.
> 
> I have changed my nutch version from 1.4 to 1.5.1 and I have detected a 
> problem with the content type of some documents: some pages with text/html appear 
> in the solr index with application/xhtml+xml, but when I check the links the 
> browser tells me that it is effectively text/html.
> Can anybody help me to fix this problem? I could change this content type 
> manually in the solr index to text/html, but that is not a good way for me.
> Please, any suggestion or advice will be accepted.




Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
I actually checked out the most recent build from SVN, Release 1.6 -
23/11/2012.

The following command

bin/nutch solrindex  -Durlfilter.regex.file=.UrlFiltering.txt
http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
crawl/segments/*  -filter

produced the following output:

SolrIndexer: starting at 2012-11-25 16:19:29
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: true
SolrIndexer: URL normalizing: false
java.io.IOException: Job failed!

Can anybody help?
On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang  wrote:

> How exactly do I get to trunk?
>
> I did download download NUTCH-1300-1.5-1.patch, and run the patch command
> correctly, and re-build nutch. But the problem still persists...
>
> On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma  > wrote:
>
>> No, this is no bug. As i said, you need either to patch your Nutch or get
>> the sources from trunk. The -filter parameter is not in your version. Check
>> the patch manual if you don't know how it works.
>>
>> $ cd trunk ; patch -p0 < file.patch
>>
>> -Original message-
>> > From:Joe Zhang 
>> > Sent: Sun 25-Nov-2012 08:42
>> > To: Markus Jelsma ; user <
>> user@nutch.apache.org>
>> > Subject: Re: Indexing-time URL filtering again
>> >
>> > This does seem a bug. Can anybody help?
>> >
>> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang 
>> wrote:
>> >
>> > > Markus, could you advise? Thanks a lot!
>> > >
>> > >
>> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang 
>> wrote:
>> > >
>> > >> I followed your instruction and applied the patch, Markus, but the
>> > >> problem still persists --- "-filter" is interpreted as a path by
>> solrindex.
>> > >>
>> > >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <
>> > >> markus.jel...@openindex.io> wrote:
>> > >>
>> > >>> Ah, i get it now. Please use trunk or patch your version with:
>> > >>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable
>> filtering.
>> > >>>
>> > >>> -Original message-
>> > >>> > From:Joe Zhang 
>> > >>> > Sent: Fri 23-Nov-2012 03:08
>> > >>> > To: user@nutch.apache.org
>> > >>> > Subject: Re: Indexing-time URL filtering again
>> > >>> >
>> > >>> > But Markus said it worked for him. I was really he could send his
>> > >>> command
>> > >>> > line.
>> > >>> >
>> > >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <
>> > >>> > lewis.mcgibb...@gmail.com> wrote:
>> > >>> >
>> > >>> > > Is this a bug?
>> > >>> > >
>> > >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <
>> smartag...@gmail.com>
>> > >>> wrote:
>> > >>> > > > Putting -filter between crawldb and segments, I sitll got the
>> same
>> > >>> thing:
>> > >>> > > >
>> > >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path
>> does not
>> > >>> > > exist:
>> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
>> > >>> > > > Input path does not exist:
>> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
>> > >>> > > > Input path does not exist:
>> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
>> > >>> > > > Input path does not exist:
>> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
>> > >>> > > >
>> > >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
>> > >>> > > > wrote:
>> > >>> > > >
>> > >>> > > >> These are roughly the available parameters:
>> > >>> > > >>
> > >>> > > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
> > >>> > > >> [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
>> > >>> > > >> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
>> > >>> > > >> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
>> > >>> > > >>
>> > >>> > > >> Having -filter at the end should work fine, if it, for some
>> > >>> reason,
>> > >>> > > >> doesn't work put it before the segment and after the crawldb
>> and
>> > >>> file an
>> > >>> > > >> issue in jira, it works here if i have -filter at the end.
>> > >>> > > >>
>> > >>> > > >> Cheers
>> > >>> > > >>
>> > >>> > > >> -Original message-
>> > >>> > > >> > From:Joe Zhang 
>> > >>> > > >> > Sent: Thu 22-Nov-2012 23:05
>> > >>> > > >> > To: Markus Jelsma ; user <
>> > >>> > > >> user@nutch.apache.org>
>> > >>> > > >> > Subject: Re: Indexing-time URL filtering again
>> > >>> > > >> >
>> > >>> > > >> > Yes, I forgot to do that. But still, what exactly should
>> the
>> > >>> command
>> > >>> > > >> look like?
>> > >>> > > >> >
>> > >>> > > >> > bin/nutch solrindex
>>  -Durlfilter.regex.file=UrlFiltering.txt
>> > >>> > > >> http://localhost:8983/solr/ 
>> > >>> .../crawldb/
>> > >>> > > >> /segments/*  -filter
>> > >>> > > >> > this command would cause nutch to interpret "-filter" as a
>> path.
>> > >>> > > >> >
>> > >>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <
>> > >>> > > >> markus.jel...@openindex.io > markus.jel...@openindex.io> >
>> > >>> wrote:
>> > >>> > > >> > Hi,
>> > >>> > > >> >
>> > >>> > > >> > I just tested a small index job that u

RE: How to generate directed graph image file using Nutch linkdb, webgraphdb etc.

2012-11-25 Thread A Geek

Hi, I would appreciate it if someone could give me some pointers on the following 
issue. Any pointers on how to use the Nutch webgraphdb, outlinks, inlinks, etc. 
for generating a directed graph would be helpful. Thanks in advance.
Thanks, DW

> From: dw...@live.com
> To: user@nutch.apache.org
> Subject: How to generate directed graph image file using Nutch linkdb, 
> webgraphdb etc.
> Date: Sun, 25 Nov 2012 09:52:47 +
> 
> 
> Hi All, I've been learning up Nutch 1.5 from last couple of weeks and so far 
> using these links: http://wiki.apache.org/nutch/NutchTutorial and 
> http://wiki.apache.org/nutch/NewScoringIndexingExample I'm able to crawl a 
> list of sites, with seed list of 1000 urls. I created the webgraphdb using 
> one of the segments then dumped the score for link ranking etc. I'm able to 
> see the link scores for URLS. I browsed the webgraphdb folders/subfolders 
> which contains : inlinks,  loops,  nodes,  outlinks,  routes  etc. I can 
> browse the file sitting in these folders but not able to understand anything 
> as they contains some URLs and some other related data in some unusual 
> characters. Basically, I want to generate a directed graph image or a 
> connectivity graph image for the crawled URLs using all the data. I would 
> appreciate any pointers in this regard. Is there any third party tool which 
> takes these data as input and generates a directed/connectivity graph for the 
> URLs which can be shown to give a visual understanding of connectivity 
> between the URLS. Please provide inputs in this direction. Thanks in advance. 
> 
> Thanks, DW  
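
One possible starting point, purely illustrative and not from this thread (the
dump layout and paths are assumptions): dump the linkdb to plain text and turn
the URL/inlink pairs into Graphviz DOT, e.g.

bin/nutch readlinkdb crawl/linkdb -dump linkdump
( echo 'digraph linkgraph {'
  awk '/^http/{to=$1} /fromUrl:/{print " \"" $2 "\" -> \"" to "\""}' linkdump/part-*
  echo '}' ) > graph.dot
dot -Tpng graph.dot -o graph.png    # requires Graphviz

The same idea should work for the webgraphdb inlinks/outlinks folders once they
are dumped to text.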
  

Re: mime type text/plain

2012-11-25 Thread Sourajit Basak
DEBUG tika.TikaParser - Using Tika parser
org.apache.tika.parser.txt.TXTParser for mime-type text/plain

The above indicates Tika is fired. But somehow I need to tell Tika to use
HtmlParser for mime-type text/plain. Have to dig into Tika docs.

Is it possible to do anything in Nutch ?

On Sun, Nov 25, 2012 at 7:27 PM, Sourajit Basak wrote:

> Some of my target webpages return a mime type of text/plain though they
> are htmls. I changed "http.accept" to include text/plain and configured
> both tika & parse-html to see if those can be parsed. However, both seem to
> produce no content.
>
> I changed parse-plugins.xml & the corresponding plugin.xml's to match this
> mime type.
>
> Has anyone encountered this problem ?
>
>
>


Re: scoring (v1.5)

2012-11-25 Thread Sourajit Basak
You're saying that linkrank doesn't have any effect on the subsequent
generate phase?

On Sun, Nov 25, 2012 at 6:14 PM, parnab kumar  wrote:

> Hi Sourajit,
>  I donno about nutch 1.5 but in nutch 1.4 the following happens i
> guess (probably the same applies for nutch 1.5 as well)  :
>
>   To create the webgraph you run the webgraph command . Scoring is not
> affected here . Next you need to run linkRank(this will compute the link
> rank scores by exploring the webgraph) . This replaces the old crawldb with
> a newCrawldb.. This new crawl db contains the new link-scores against each
> url . Scoring is not yet affected in the index . Next when you reindex the
> documents this scores from the newCrawldb is injected into the index as a
> static document boost. Now you can see the effect the link ranking in your
> application !!
>
> Anyone experienced in these .. please correct me if i am wrong  !!
>
> Thanks,
> Parnab
> IIT Kharagpur,India
>
> On Sun, Nov 25, 2012 at 2:15 PM, Sourajit Basak  >wrote:
>
> > In Nutch 1.5, during which phase (updatedb, solrindex, invertlinks or )
> > does scoring happen ? Do I explicitly use 'linkrank' ?
> >
> > Best,
> > Sourajit
> >
>


Re: Indexing-time URL filtering again

2012-11-25 Thread Joe Zhang
How exactly do I get to trunk?

I did download download NUTCH-1300-1.5-1.patch, and run the patch command
correctly, and re-build nutch. But the problem still persists...

On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma
wrote:

> No, this is no bug. As i said, you need either to patch your Nutch or get
> the sources from trunk. The -filter parameter is not in your version. Check
> the patch manual if you don't know how it works.
>
> $ cd trunk ; patch -p0 < file.patch
>
> -Original message-
> > From:Joe Zhang 
> > Sent: Sun 25-Nov-2012 08:42
> > To: Markus Jelsma ; user <
> user@nutch.apache.org>
> > Subject: Re: Indexing-time URL filtering again
> >
> > This does seem a bug. Can anybody help?
> >
> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang  wrote:
> >
> > > Markus, could you advise? Thanks a lot!
> > >
> > >
> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang 
> wrote:
> > >
> > >> I followed your instruction and applied the patch, Markus, but the
> > >> problem still persists --- "-filter" is interpreted as a path by
> solrindex.
> > >>
> > >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <
> > >> markus.jel...@openindex.io> wrote:
> > >>
> > >>> Ah, i get it now. Please use trunk or patch your version with:
> > >>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable
> filtering.
> > >>>
> > >>> -Original message-
> > >>> > From:Joe Zhang 
> > >>> > Sent: Fri 23-Nov-2012 03:08
> > >>> > To: user@nutch.apache.org
> > >>> > Subject: Re: Indexing-time URL filtering again
> > >>> >
> > >>> > But Markus said it worked for him. I was really he could send his
> > >>> command
> > >>> > line.
> > >>> >
> > >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <
> > >>> > lewis.mcgibb...@gmail.com> wrote:
> > >>> >
> > >>> > > Is this a bug?
> > >>> > >
> > >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <
> smartag...@gmail.com>
> > >>> wrote:
> > >>> > > > Putting -filter between crawldb and segments, I sitll got the
> same
> > >>> thing:
> > >>> > > >
> > >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path
> does not
> > >>> > > exist:
> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> > >>> > > > Input path does not exist:
> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> > >>> > > > Input path does not exist:
> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> > >>> > > > Input path does not exist:
> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
> > >>> > > >
> > >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
> > >>> > > > wrote:
> > >>> > > >
> > >>> > > >> These are roughly the available parameters:
> > >>> > > >>
> > >>> > > >> Usage: SolrIndexer   [-linkdb ]
> > >>> [-hostdb
> > >>> > > >> ] [-params k1=v1&k2=v2...] ( ... | -dir
> > >>> )
> > >>> > > >> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
> > >>> > > >> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> > >>> > > >>
> > >>> > > >> Having -filter at the end should work fine, if it, for some
> > >>> reason,
> > >>> > > >> doesn't work put it before the segment and after the crawldb
> and
> > >>> file an
> > >>> > > >> issue in jira, it works here if i have -filter at the end.
> > >>> > > >>
> > >>> > > >> Cheers
> > >>> > > >>
> > >>> > > >> -Original message-
> > >>> > > >> > From:Joe Zhang 
> > >>> > > >> > Sent: Thu 22-Nov-2012 23:05
> > >>> > > >> > To: Markus Jelsma ; user <
> > >>> > > >> user@nutch.apache.org>
> > >>> > > >> > Subject: Re: Indexing-time URL filtering again
> > >>> > > >> >
> > >>> > > >> > Yes, I forgot to do that. But still, what exactly should the
> > >>> command
> > >>> > > >> look like?
> > >>> > > >> >
> > >>> > > >> > bin/nutch solrindex -Durlfilter.regex.file=UrlFiltering.txt
> > >>> > > >> > http://localhost:8983/solr/ .../crawldb/ /segments/*  -filter
> > >>> > > >> > this command would cause nutch to interpret "-filter" as a
> path.
> > >>> > > >> >
> > >>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <
> > >>> > > >> markus.jel...@openindex.io 
> >
> > >>> wrote:
> > >>> > > >> > Hi,
> > >>> > > >> >
> > >>> > > >> > I just tested a small index job that usually writes 1200
> > >>> records to
> > >>> > > >> Solr. It works fine if i specify -. in a filter (index
> nothing)
> > >>> and
> > >>> > > point
> > >>> > > >> to it with -Durlfilter.regex.file=path like you do.  I assume
> you
> > >>> mean
> > >>> > > by
> > >>> > > >> `it doesn't work` that it filters nothing and indexes all
> records
> > >>> from
> > >>> > > the
> > >>> > > >> segment. Did you forget the -filter parameter?
> > >>> > > >> >
> > >>> > > >> > Cheers
> > >>> > > >> >
> > >>> > > >> > -Original message-
> > >>> > > >> > > From:Joe Zhang  > >>> smartag...@gmail.com>
> > >>> > > >
> > >>> > > >> > > Sent: Thu 22-Nov-2012 07:29
> > >>> > > >> > > To:

Re: scoring (v1.5)

2012-11-25 Thread parnab kumar
Hi Sourajit,
     I don't know about Nutch 1.5, but in Nutch 1.4 the following happens, I
believe (the same probably applies to Nutch 1.5 as well):

  To create the webgraph you run the webgraph command; scoring is not
affected at this point. Next you run linkrank, which computes the link-rank
scores by iterating over the webgraph. This replaces the old crawldb with a
new one, and the new crawldb holds the new link score for each URL. The index
is still not affected. Only when you reindex the documents are these scores
injected into the index as a static document boost; then you can see the
effect of the link ranking in your application!

Anyone with experience here, please correct me if I am wrong!

Thanks,
Parnab
IIT Kharagpur,India

On Sun, Nov 25, 2012 at 2:15 PM, Sourajit Basak wrote:

> In Nutch 1.5, during which phase (updatedb, solrindex, invertlinks or )
> does scoring happen ? Do I explicitly use 'linkrank' ?
>
> Best,
> Sourajit
>


RE: scoring (v1.5)

2012-11-25 Thread Markus Jelsma
Hi - Scoring filters can run in several stages, but the webgraph and linkrank
programs must be run separately. After the graph has been iterated over, you
can update your crawldb with the scores from the graph using the scoreupdater
program.
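
For reference, the full sequence looks roughly like this on Nutch 1.x (the
crawl/* paths are placeholders; check the usage output of each command on your
version for the exact options):

# build/refresh the web graph from your segments
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb

# run LinkRank over the graph (the computed scores stay inside the webgraphdb)
bin/nutch linkrank -webgraphdb crawl/webgraphdb

# copy the graph scores back into the crawldb
bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb

# reindex so the new scores are applied as static document boosts
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/segments/*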
 
-Original message-
> From:Sourajit Basak 
> Sent: Sun 25-Nov-2012 09:51
> To: user@nutch.apache.org
> Subject: scoring (v1.5)
> 
> In Nutch 1.5, during which phase (updatedb, solrindex, invertlinks or )
> does scoring happen ? Do I explicitly use 'linkrank' ?
> 
> Best,
> Sourajit
> 


RE: problem with text/html content type of documents appears application/xhtml+xml in solr index

2012-11-25 Thread Markus Jelsma
Hi - trunk's `more` indexing filter (the index-more plugin) can map MIME types
to any target. With it you can map both (X)HTML MIME types to text/html, or to
`web page`.

https://issues.apache.org/jira/browse/NUTCH-1262
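
A rough sketch of how that is wired up (the property name and mapping file
below are quoted from the NUTCH-1262 patch from memory, so treat them as
assumptions and verify against your checkout):

In nutch-site.xml, enable mapping in the index-more plugin:

  <property>
    <name>moreIndexingFilter.mapMimeTypes</name>
    <value>true</value>
  </property>

In conf/contenttype-mapping.txt, add a tab-separated line with the target type
and the source types to fold into it (check the patch for the exact column
order):

  text/html	application/xhtml+xml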

 
 
-Original message-
> From:Eyeris Rodriguez Rueda 
> Sent: Sun 25-Nov-2012 00:48
> To: user@nutch.apache.org
> Subject: problem with text/html content type of documents appears 
> application/xhtml+xml in solr index
> 
> Hi.
> 
> I have upgraded my Nutch version from 1.4 to 1.5.1 and I have detected a
> problem with the content type of some documents: some pages served as
> text/html appear in the Solr index as application/xhtml+xml, even though
> when I check the links the browser tells me they really are text/html.
> Can anybody help me fix this problem? I could change the content type
> manually in the Solr index to text/html, but that is not a good option for me.
> Any suggestion or advice will be accepted.
> 
> 
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSITY OF INFORMATICS
> SCIENCES...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
> 
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci
> 


RE: Indexing-time URL filtering again

2012-11-25 Thread Markus Jelsma
No, this is not a bug. As I said, you need to either patch your Nutch or get
the sources from trunk; the -filter parameter is not in your version. Check the
patch manual if you don't know how it works.

$ cd trunk ; patch -p0 < file.patch
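
Spelled out a bit more, as a sketch rather than an official recipe (the
repository URL and the `ant runtime` target match Nutch 1.x as far as I know;
UrlFiltering.txt and the crawl/* paths are placeholders from earlier in this
thread):

# either check out trunk, which already has the -filter option ...
svn co http://svn.apache.org/repos/asf/nutch/trunk nutch-trunk
cd nutch-trunk

# ... or patch your 1.5.1 source tree instead:
# patch -p0 < NUTCH-1300-1.5-1.patch

# rebuild; the runnable install lands under runtime/local (and runtime/deploy)
ant runtime

# then index with filtering enabled; urlfilter.regex.file points at your rules,
# e.g. a file containing  +^http://www\.nutch\.org/  and a final  -.
bin/nutch solrindex -Durlfilter.regex.file=UrlFiltering.txt \
  http://localhost:8983/solr/ crawl/crawldb crawl/segments/* -filter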
 
-Original message-
> From:Joe Zhang 
> Sent: Sun 25-Nov-2012 08:42
> To: Markus Jelsma ; user 
> Subject: Re: Indexing-time URL filtering again
> 
> This does seem a bug. Can anybody help?
> 
> On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang  wrote:
> 
> > Markus, could you advise? Thanks a lot!
> >
> >
> > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang  wrote:
> >
> >> I followed your instruction and applied the patch, Markus, but the
> >> problem still persists --- "-filter" is interpreted as a path by solrindex.
> >>
> >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <
> >> markus.jel...@openindex.io> wrote:
> >>
> >>> Ah, i get it now. Please use trunk or patch your version with:
> >>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
> >>>
> >>> -Original message-
> >>> > From:Joe Zhang 
> >>> > Sent: Fri 23-Nov-2012 03:08
> >>> > To: user@nutch.apache.org
> >>> > Subject: Re: Indexing-time URL filtering again
> >>> >
> >>> > But Markus said it worked for him. I was really hoping he could send
> >>> > his command line.
> >>> >
> >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <
> >>> > lewis.mcgibb...@gmail.com> wrote:
> >>> >
> >>> > > Is this a bug?
> >>> > >
> >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang 
> >>> wrote:
> >>> > > > Putting -filter between crawldb and segments, I still got the
> >>> > > > same thing:
> >>> > > >
> >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> >>> > > exist:
> >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> >>> > > > Input path does not exist:
> >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> >>> > > > Input path does not exist:
> >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> >>> > > > Input path does not exist:
> >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
> >>> > > >
> >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
> >>> > > > wrote:
> >>> > > >
> >>> > > >> These are roughly the available parameters:
> >>> > > >>
> >>> > > >> Usage: SolrIndexer   [-linkdb ]
> >>> [-hostdb
> >>> > > >> ] [-params k1=v1&k2=v2...] ( ... | -dir
> >>> )
> >>> > > >> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
> >>> > > >> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> >>> > > >>
> >>> > > >> Having -filter at the end should work fine, if it, for some
> >>> reason,
> >>> > > >> doesn't work put it before the segment and after the crawldb and
> >>> file an
> >>> > > >> issue in jira, it works here if i have -filter at the end.
> >>> > > >>
> >>> > > >> Cheers
> >>> > > >>
> >>> > > >> -Original message-
> >>> > > >> > From:Joe Zhang 
> >>> > > >> > Sent: Thu 22-Nov-2012 23:05
> >>> > > >> > To: Markus Jelsma ; user <
> >>> > > >> user@nutch.apache.org>
> >>> > > >> > Subject: Re: Indexing-time URL filtering again
> >>> > > >> >
> >>> > > >> > Yes, I forgot to do that. But still, what exactly should the
> >>> command
> >>> > > >> look like?
> >>> > > >> >
> >>> > > >> > bin/nutch solrindex -Durlfilter.regex.file=UrlFiltering.txt
> >>> > > >> > http://localhost:8983/solr/ .../crawldb/ /segments/*  -filter
> >>> > > >> > this command would cause nutch to interpret "-filter" as a path.
> >>> > > >> >
> >>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <
> >>> > > >> markus.jel...@openindex.io  >
> >>> wrote:
> >>> > > >> > Hi,
> >>> > > >> >
> >>> > > >> > I just tested a small index job that usually writes 1200
> >>> records to
> >>> > > >> Solr. It works fine if i specify -. in a filter (index nothing)
> >>> and
> >>> > > point
> >>> > > >> to it with -Durlfilter.regex.file=path like you do.  I assume you
> >>> mean
> >>> > > by
> >>> > > >> `it doesn't work` that it filters nothing and indexes all records
> >>> from
> >>> > > the
> >>> > > >> segment. Did you forget the -filter parameter?
> >>> > > >> >
> >>> > > >> > Cheers
> >>> > > >> >
> >>> > > >> > -Original message-
> >>> > > >> > > From:Joe Zhang  >>> smartag...@gmail.com>
> >>> > > >
> >>> > > >> > > Sent: Thu 22-Nov-2012 07:29
> >>> > > >> > > To: user mailto:user@nutch.apache.org>
> >>> >
> >>> > > >> > > Subject: Indexing-time URL filtering again
> >>> > > >> > >
> >>> > > >> > > Dear List:
> >>> > > >> > >
> >>> > > >> > > I asked a similar question before, but I haven't solved the
> >>> problem.
> >>> > > >> > > Therefore I try to re-ask the question more clearly and seek
> >>> advice.
> >>> > > >> > >
> >>> > > >> > > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work
> >>> fine at
> >>> > > the
> >>> > > >> > > rudimentary level.
> >>> > > >> > >
> >>> > > >> > >

Re: How to extract fetched files(pdf)?

2012-11-25 Thread hudvin
I found a better solution - Heritrix :). It just works, apart from the terrible
Spring config.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-extract-fetched-files-pdf-tp4022202p4022244.html
Sent from the Nutch - User mailing list archive at Nabble.com.


How to generate directed graph image file using Nutch linkdb, webgraphdb etc.

2012-11-25 Thread A Geek

Hi All,

I've been learning Nutch 1.5 over the last couple of weeks, so far using these
links: http://wiki.apache.org/nutch/NutchTutorial and
http://wiki.apache.org/nutch/NewScoringIndexingExample. I'm able to crawl a
list of sites with a seed list of 1000 URLs. I created the webgraphdb using
one of the segments and then dumped the scores for link ranking, so I can see
the link scores for the URLs. I browsed the webgraphdb folders/subfolders,
which contain: inlinks, loops, nodes, outlinks, routes, etc. I can open the
files sitting in these folders, but I am not able to understand much, as they
contain URLs and other related data in some unusual characters.

Basically, I want to generate a directed graph image, or a connectivity graph
image, for the crawled URLs using all this data. I would appreciate any
pointers in this regard. Is there a third-party tool that takes this data as
input and generates a directed/connectivity graph for the URLs, to give a
visual understanding of the connectivity between them? Please provide inputs
in this direction. Thanks in advance.
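
The kind of pipeline I have in mind, sketched under a few assumptions (I have
not run this end to end, and the edges.tsv intermediate would still need a
small massaging step, since the readlinkdb text dump lists each URL with its
inlinks underneath rather than as a flat edge list; tools like Graphviz or
Gephi can render such an edge list):

# dump the inverted link structure (built by invertlinks) to text
bin/nutch readlinkdb crawl/linkdb -dump linkdump

# once you have a two-column edges.tsv (source<TAB>target per line),
# turn it into Graphviz DOT and render a PNG
awk 'BEGIN { print "digraph nutch {" }
     { printf "  \"%s\" -> \"%s\";\n", $1, $2 }
     END { print "}" }' edges.tsv > graph.dot
dot -Tpng graph.dot -o graph.png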

Thanks, DW