Building from source with ant produces a local runtime in runtime/local; it is
the same as what you get when you extract an official release.
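
For reference, the full cycle looks roughly like this (a sketch; the checkout
directory name depends on the release or branch you use):

$ cd apache-nutch-1.6
$ ant clean        # clear out stale build artifacts first
$ ant              # default target; produces the local runtime
$ ls runtime/local # typically contains bin, conf, lib, plugins, ...

The bin/nutch script under runtime/local then behaves like the one in an
extracted binary release.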
 
-----Original message-----
> From:Joe Zhang <smartag...@gmail.com>
> Sent: Mon 26-Nov-2012 22:23
> To: user@nutch.apache.org
> Subject: Re: Indexing-time URL filtering again
> 
> Yes, that's what I've been doing. But "ant" itself won't produce the official
> binary release.
> 
> On Mon, Nov 26, 2012 at 2:16 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> > just ant will do the trick.
> >
> >
> >
> > -----Original message-----
> > > From:Joe Zhang <smartag...@gmail.com>
> > > Sent: Mon 26-Nov-2012 22:03
> > > To: user@nutch.apache.org
> > > Subject: Re: Indexing-time URL filtering again
> > >
> > > Talking about ant: after ant clean, which ant target should I use?
> > >
> > > On Mon, Nov 26, 2012 at 3:21 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >
> > > > I checked the code. You're probably not pointing it to a valid path, or
> > > > perhaps the build is wrong and you haven't used ant clean before building
> > > > Nutch. If you keep having trouble you may want to check out trunk.
> > > >
> > > > -----Original message-----
> > > > > From:Joe Zhang <smartag...@gmail.com>
> > > > > Sent: Mon 26-Nov-2012 00:40
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Indexing-time URL filtering again
> > > > >
> > > > > OK. I'm testing it. But like I said, even when I reduce the patterns to
> > > > > the simplest form "-.", the problem still persists.
> > > > >
> > > > > On Sun, Nov 25, 2012 at 3:59 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > >
> > > > > > It's taking input from stdin; enter some URLs to test it. You can file
> > > > > > an issue with reproducible steps.
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From:Joe Zhang <smartag...@gmail.com>
> > > > > > > Sent: Sun 25-Nov-2012 23:49
> > > > > > > To: user@nutch.apache.org
> > > > > > > Subject: Re: Indexing-time URL filtering again
> > > > > > >
> > > > > > > I ran the regex tester command you provided. It seems to be taking
> > > > > > > forever (15+ minutes by now).
> > > > > > >
> > > > > > > On Sun, Nov 25, 2012 at 3:28 PM, Joe Zhang <smartag...@gmail.com> wrote:
> > > > > > >
> > > > > > > > You mean the content of my pattern file?
> > > > > > > >
> > > > > > > > Well, even when I reduce it to simply "-.", the same problem still
> > > > > > > > pops up.
> > > > > > > >
> > > > > > > > On Sun, Nov 25, 2012 at 3:30 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > > > >
> > > > > > > >> You seem to have an NPE caused by your regex rules, for some
> > > > > > > >> weird reason. If you can provide a way to reproduce it, you can
> > > > > > > >> file an issue in Jira. This NPE should also occur if you run the
> > > > > > > >> regex tester:
> > > > > > > >>
> > > > > > > >> nutch -Durlfilter.regex.file=path org.apache.nutch.net.URLFilterChecker -allCombined
> > > > > > > >>
> > > > > > > >> In the meantime you can check if a rule causes the NPE.
> > > > > > > >>
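
For reference, URLFilterChecker reads URLs from standard input, which is why
it can appear to hang: it is simply waiting for input. A non-interactive
invocation would look roughly like this (sketch; the filter file path is a
placeholder):

$ echo "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html" | \
    bin/nutch -Durlfilter.regex.file=conf/my-filters.txt \
    org.apache.nutch.net.URLFilterChecker -allCombined

Accepted URLs should be echoed back with a leading +, rejected ones with a -.
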
> > > > > > > >> -----Original message-----
> > > > > > > >> > From:Joe Zhang <smartag...@gmail.com>
> > > > > > > >> > Sent: Sun 25-Nov-2012 23:26
> > > > > > > >> > To: user@nutch.apache.org
> > > > > > > >> > Subject: Re: Indexing-time URL filtering again
> > > > > > > >> >
> > > > > > > >> > The last few lines of hadoop.log:
> > > > > > > >> >
> > > > > > > >> > 2012-11-25 16:30:30,021 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > > > > > > >> > 2012-11-25 16:30:30,026 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> > > > > > > >> > 2012-11-25 16:30:30,218 WARN  mapred.LocalJobRunner - job_local_0001
> > > > > > > >> > java.lang.RuntimeException: Error in configuring object
> > > > > > > >> >         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> > > > > > > >> >         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> > > > > > > >> >         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> > > > > > > >> >         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
> > > > > > > >> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> > > > > > > >> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> > > > > > > >> > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > >> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > > > >> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > > > > > >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > > > >> >         at java.lang.reflect.Method.invoke(Method.java:601)
> > > > > > > >> >         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> > > > > > > >> >         ... 5 more
> > > > > > > >> > Caused by: java.lang.RuntimeException: Error in configuring object
> > > > > > > >> >         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
> > > > > > > >> >         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
> > > > > > > >> >         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> > > > > > > >> >         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> > > > > > > >> >         ... 10 more
> > > > > > > >> > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > >> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > > > >> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > > > > > >> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > > > >> >         at java.lang.reflect.Method.invoke(Method.java:601)
> > > > > > > >> >         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> > > > > > > >> >         ... 13 more
> > > > > > > >> > Caused by: java.lang.NullPointerException
> > > > > > > >> >         at java.io.Reader.<init>(Reader.java:78)
> > > > > > > >> >         at java.io.BufferedReader.<init>(BufferedReader.java:94)
> > > > > > > >> >         at java.io.BufferedReader.<init>(BufferedReader.java:109)
> > > > > > > >> >         at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
> > > > > > > >> >         at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
> > > > > > > >> >         at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
> > > > > > > >> >         at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
> > > > > > > >> >         at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:95)
> > > > > > > >> >         ... 18 more
> > > > > > > >> > 2012-11-25 16:30:30,568 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
> > > > > > > >> >
> > > > > > > >> > On Sun, Nov 25, 2012 at 3:08 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > > > >> >
> > > > > > > >> > > You should provide the log output.
> > > > > > > >> > >
> > > > > > > >> > > -----Original message-----
> > > > > > > >> > > > From:Joe Zhang <smartag...@gmail.com>
> > > > > > > >> > > > Sent: Sun 25-Nov-2012 17:27
> > > > > > > >> > > > To: user@nutch.apache.org
> > > > > > > >> > > > Subject: Re: Indexing-time URL filtering again
> > > > > > > >> > > >
> > > > > > > >> > > > I actually checked out the most recent build from SVN, Release 1.6 - 23/11/2012.
> > > > > > > >> > > >
> > > > > > > >> > > > The following command
> > > > > > > >> > > >
> > > > > > > >> > > > bin/nutch solrindex -Durlfilter.regex.file=.....UrlFiltering.txt http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter
> > > > > > > >> > > >
> > > > > > > >> > > > produced the following output:
> > > > > > > >> > > >
> > > > > > > >> > > > SolrIndexer: starting at 2012-11-25 16:19:29
> > > > > > > >> > > > SolrIndexer: deleting gone documents: false
> > > > > > > >> > > > SolrIndexer: URL filtering: true
> > > > > > > >> > > > SolrIndexer: URL normalizing: false
> > > > > > > >> > > > java.io.IOException: Job failed!
> > > > > > > >> > > >
> > > > > > > >> > > > Can anybody help?
> > > > > > > >> > > >
> > > > > > > >> > > > On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang <smartag...@gmail.com> wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > How exactly do I get to trunk?
> > > > > > > >> > > > >
> > > > > > > >> > > > > I did download NUTCH-1300-1.5-1.patch, ran the patch command
> > > > > > > >> > > > > correctly, and rebuilt Nutch. But the problem still persists...
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > >> No, this is no bug. As I said, you need to either patch your Nutch or
> > > > > > > >> > > > >> get the sources from trunk. The -filter parameter is not in your
> > > > > > > >> > > > >> version. Check the patch manual if you don't know how it works.
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> $ cd trunk ; patch -p0 < file.patch
> > > > > > > >> > > > >>
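
Applied to the 1.5.1 source tree with the patch file mentioned elsewhere in
this thread, the full sequence would be something like this (sketch; run it
from the directory containing build.xml, and rebuild afterwards so the change
actually lands in runtime/local):

$ cd apache-nutch-1.5.1
$ patch -p0 < NUTCH-1300-1.5-1.patch
$ ant clean
$ ant
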
> > > > > > > >> > > > >> -----Original message-----
> > > > > > > >> > > > >> > From:Joe Zhang <smartag...@gmail.com>
> > > > > > > >> > > > >> > Sent: Sun 25-Nov-2012 08:42
> > > > > > > >> > > > >> > To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
> > > > > > > >> > > > >> > Subject: Re: Indexing-time URL filtering again
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > This does seem like a bug. Can anybody help?
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang <smartag...@gmail.com> wrote:
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > > Markus, could you advise? Thanks a lot!
> > > > > > > >> > > > >> > >
> > > > > > > >> > > > >> > >
> > > > > > > >> > > > >> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang <smartag...@gmail.com> wrote:
> > > > > > > >> > > > >> > >
> > > > > > > >> > > > >> > >> I followed your instruction and applied the patch, Markus, but the
> > > > > > > >> > > > >> > >> problem still persists --- "-filter" is interpreted as a path by
> > > > > > > >> > > > >> > >> solrindex.
> > > > > > > >> > > > >> > >>
> > > > > > > >> > > > >> > >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > > > >> > > > >> > >>
> > > > > > > >> > > > >> > >>> Ah, I get it now. Please use trunk or patch your version with
> > > > > > > >> > > > >> > >>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
> > > > > > > >> > > > >> > >>>
> > > > > > > >> > > > >> > >>> -----Original message-----
> > > > > > > >> > > > >> > >>> > From:Joe Zhang <smartag...@gmail.com>
> > > > > > > >> > > > >> > >>> > Sent: Fri 23-Nov-2012 03:08
> > > > > > > >> > > > >> > >>> > To: user@nutch.apache.org
> > > > > > > >> > > > >> > >>> > Subject: Re: Indexing-time URL filtering again
> > > > > > > >> > > > >> > >>> >
> > > > > > > >> > > > >> > >>> > But Markus said it worked for him. I was really hoping he could
> > > > > > > >> > > > >> > >>> > send his command line.
> > > > > > > >> > > > >> > >>> >
> > > > > > > >> > > > >> > >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> > > > > > > >> > > > >> > >>> >
> > > > > > > >> > > > >> > >>> > > Is this a bug?
> > > > > > > >> > > > >> > >>> > >
> > > > > > > >> > > > >> > >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <smartag...@gmail.com> wrote:
> > > > > > > >> > > > >> > >>> > > > Putting -filter between crawldb and segments, I still got the same thing:
> > > > > > > >> > > > >> > >>> > > >
> > > > > > > >> > > > >> > >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> > > > > > > >> > > > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> > > > > > > >> > > > >> > >>> > > > Input path does not exist:
> > > > > > > >> > > > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> > > > > > > >> > > > >> > >>> > > > Input path does not exist:
> > > > > > > >> > > > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> > > > > > > >> > > > >> > >>> > > > Input path does not exist:
> > > > > > > >> > > > >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
> > > > > > > >> > > > >> > >>> > > >
> > > > > > > >> > > > >> > >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > > > >> > > > >> > >>> > > >
> > > > > > > >> > > > >> > >>> > > >> These are roughly the available parameters:
> > > > > > > >> > > > >> > >>> > > >>
> > > > > > > >> > > > >> > >>> > > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb <hostdb>]
> > > > > > > >> > > > >> > >>> > > >>        [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
> > > > > > > >> > > > >> > >>> > > >>        [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
> > > > > > > >> > > > >> > >>> > > >>        [-filter] [-normalize]
> > > > > > > >> > > > >> > >>> > > >>
> > > > > > > >> > > > >> > >>> > > >> Having -filter at the end should work fine. If, for some reason, it
> > > > > > > >> > > > >> > >>> > > >> doesn't, put it before the segment and after the crawldb and file an
> > > > > > > >> > > > >> > >>> > > >> issue in Jira; it works here if I have -filter at the end.
> > > > > > > >> > > > >> > >>> > > >>
> > > > > > > >> > > > >> > >>> > > >> Cheers
> > > > > > > >> > > > >> > >>> > > >>
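
Putting that usage line together with the filter file property, a complete
invocation would look something like this (sketch; the filter file path and
crawl directories are placeholders):

$ bin/nutch solrindex -Durlfilter.regex.file=/path/to/UrlFiltering.txt \
    http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/ \
    crawl/segments/* -filter

Note that -filter only exists in trunk/1.6 or a NUTCH-1300-patched build; an
unpatched 1.5.1 treats it as a segment path, which is exactly the
InvalidInputException shown above.
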
> > > > > > > >> > > > >> > >>> > > >> -----Original message-----
> > > > > > > >> > > > >> > >>> > > >> > From:Joe Zhang <smartag...@gmail.com>
> > > > > > > >> > > > >> > >>> > > >> > Sent: Thu 22-Nov-2012 23:05
> > > > > > > >> > > > >> > >>> > > >> > To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
> > > > > > > >> > > > >> > >>> > > >> > Subject: Re: Indexing-time URL filtering again
> > > > > > > >> > > > >> > >>> > > >> >
> > > > > > > >> > > > >> > >>> > > >> > Yes, I forgot to do that. But still, what exactly should the command look like?
> > > > > > > >> > > > >> > >>> > > >> >
> > > > > > > >> > > > >> > >>> > > >> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
> > > > > > > >> > > > >> > >>> > > >> >
> > > > > > > >> > > > >> > >>> > > >> > This command would cause nutch to interpret "-filter" as a path.
> > > > > > > >> > > > >> > >>> > > >> >
> > > > > > > >> > > > >> > >>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > > > > >> > > > >> > >>> > > >> > Hi,
> > > > > > > >> > > > >> > >>> > > >> >
> > > > > > > >> > > > >> > >>> > > >> > I just tested a small index job that usually writes 1200 records to
> > > > > > > >> > > > >> > >>> > > >> > Solr. It works fine if I specify -. in a filter (index nothing) and
> > > > > > > >> > > > >> > >>> > > >> > point to it with -Durlfilter.regex.file=path like you do. I assume by
> > > > > > > >> > > > >> > >>> > > >> > `it doesn't work` you mean that it filters nothing and indexes all
> > > > > > > >> > > > >> > >>> > > >> > records from the segment. Did you forget the -filter parameter?
> > > > > > > >> > > > >> > >>> > > >> >
> > > > > > > >> > > > >> > >>> > > >> > Cheers
> > > > > > > >> > > > >> > >>> > > >> >
> > > > > > > >> > > > >> > >>> > > >> > -----Original message-----
> > > > > > > >> > > > >> > >>> > > >> > > From:Joe Zhang <smartag...@gmail.com>
> > > > > > > >> > > > >> > >>> > > >> > > Sent: Thu 22-Nov-2012 07:29
> > > > > > > >> > > > >> > >>> > > >> > > To: user <user@nutch.apache.org>
> > > > > > > >> > > > >> > >>> > > >> > > Subject: Indexing-time URL filtering again
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > Dear List:
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > I asked a similar question before, but I haven't solved the problem.
> > > > > > > >> > > > >> > >>> > > >> > > Therefore I'll try to re-ask the question more clearly and seek advice.
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
> > > > > > > >> > > > >> > >>> > > >> > > rudimentary level.
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > The basic problem I face in crawling/indexing is that I need to control
> > > > > > > >> > > > >> > >>> > > >> > > which pages the crawlers should VISIT (so far through
> > > > > > > >> > > > >> > >>> > > >> > > nutch/conf/regex-urlfilter.txt) and which pages are INDEXED by Solr. The
> > > > > > > >> > > > >> > >>> > > >> > > latter are only a SUBSET of the former, and they are giving me a headache.
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > A real-life example would be: when we crawl CNN.com, we only want to index
> > > > > > > >> > > > >> > >>> > > >> > > "real content" pages such as
> > > > > > > >> > > > >> > >>> > > >> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1.
> > > > > > > >> > > > >> > >>> > > >> > > When we start the crawling from the root, we can't specify tight patterns
> > > > > > > >> > > > >> > >>> > > >> > > (e.g., +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*) in
> > > > > > > >> > > > >> > >>> > > >> > > nutch/conf/regex-urlfilter.txt, because the pages on the path between root
> > > > > > > >> > > > >> > >>> > > >> > > and content pages do not satisfy such patterns. Putting such patterns in
> > > > > > > >> > > > >> > >>> > > >> > > nutch/conf/regex-urlfilter.txt would severely jeopardize the coverage of
> > > > > > > >> > > > >> > >>> > > >> > > the crawl.
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > The closest solution I've got so far (courtesy of Markus) was this:
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > but unfortunately I haven't been able to make it work for me. The content
> > > > > > > >> > > > >> > >>> > > >> > > of the urlfilter.regex.file is what I thought "correct" --- something like
> > > > > > > >> > > > >> > >>> > > >> > > the following:
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> > > > > > > >> > > > >> > >>> > > >> > > -.
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > Everything seems quite straightforward. Am I doing anything wrong here?
> > > > > > > >> > > > >> > >>> > > >> > > Can anyone advise? I'd greatly appreciate it.
> > > > > > > >> > > > >> > >>> > > >> > >
> > > > > > > >> > > > >> > >>> > > >> > > Joe
> > > > > > > >> > > > >> > >>> > > >> > >
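
The two-stage setup described above comes down to two separate rule files (a
sketch; the file name and the exact broad pattern are illustrative). The
crawl-time file, nutch/conf/regex-urlfilter.txt, stays broad so intermediate
pages are still visited:

+^http://([a-z0-9]*\.)*cnn.com/
-.

The file passed via -Durlfilter.regex.file at indexing time carries the tight
pattern, so only dated article pages reach Solr:

+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.

Rules are applied top to bottom and the first match wins; the final "-."
rejects everything that no earlier + rule accepted.
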
> > > > > > > >> > > > >> > >>> > > --
> > > > > > > >> > > > >> > >>> > > Lewis
> > > > > > > >> > > > >> > >>> > >
