RE: Indexing-time URL filtering again

Markus Jelsma Sun, 25 Nov 2012 14:03:07 -0800
You should provide the log output. 
 
-----Original message-----
> From:Joe Zhang <smartag...@gmail.com>
> Sent: Sun 25-Nov-2012 17:27
> To: user@nutch.apache.org
> Subject: Re: Indexing-time URL filtering again
> 
> I actually checked out the most recent build from SVN, Release 1.6 -
> 23/11/2012.
> 
> The following command
> 
> bin/nutch solrindex  -Durlfilter.regex.file=.....UrlFiltering.txt
> http://localhost:8983/solr/ crawl/crawldb/ -linkdb crawl/linkdb/
> crawl/segments/*  -filter
> 
> produced the following output:
> 
> SolrIndexer: starting at 2012-11-25 16:19:29
> SolrIndexer: deleting gone documents: false
> SolrIndexer: URL filtering: true
> SolrIndexer: URL normalizing: false
> java.io.IOException: Job failed!
> 
> Can anybody help?
> On Sun, Nov 25, 2012 at 6:43 AM, Joe Zhang <smartag...@gmail.com> wrote:
> 
> > How exactly do I get to trunk?
> >
> > I did download NUTCH-1300-1.5-1.patch, ran the patch command
> > correctly, and rebuilt Nutch. But the problem still persists...
> >
> > On Sun, Nov 25, 2012 at 3:29 AM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> >> No, this is not a bug. As I said, you need to either patch your Nutch or get
> >> the sources from trunk. The -filter parameter is not in your version. Check
> >> the patch manual if you don't know how it works.
> >>
> >> $ cd trunk ; patch -p0 < file.patch
> >>
> >> -----Original message-----
> >> > From:Joe Zhang <smartag...@gmail.com>
> >> > Sent: Sun 25-Nov-2012 08:42
> >> > To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
> >> > Subject: Re: Indexing-time URL filtering again
> >> >
> >> > This does seem to be a bug. Can anybody help?
> >> >
> >> > On Sat, Nov 24, 2012 at 6:40 PM, Joe Zhang <smartag...@gmail.com> wrote:
> >> >
> >> > > Markus, could you advise? Thanks a lot!
> >> > >
> >> > >
> >> > > On Sat, Nov 24, 2012 at 12:49 AM, Joe Zhang <smartag...@gmail.com> wrote:
> >> > >
> >> > >> I followed your instructions and applied the patch, Markus, but the
> >> > >> problem still persists --- "-filter" is interpreted as a path by
> >> > >> solrindex.
> >> > >>
> >> > >> On Fri, Nov 23, 2012 at 12:39 AM, Markus Jelsma <
> >> > >> markus.jel...@openindex.io> wrote:
> >> > >>
> >> > >>> Ah, I get it now. Please use trunk or patch your version with
> >> > >>> https://issues.apache.org/jira/browse/NUTCH-1300 to enable filtering.
> >> > >>>
> >> > >>> -----Original message-----
> >> > >>> > From:Joe Zhang <smartag...@gmail.com>
> >> > >>> > Sent: Fri 23-Nov-2012 03:08
> >> > >>> > To: user@nutch.apache.org
> >> > >>> > Subject: Re: Indexing-time URL filtering again
> >> > >>> >
> >> > >>> > But Markus said it worked for him. I was really hoping he could send his
> >> > >>> > command line.
> >> > >>> >
> >> > >>> > On Thu, Nov 22, 2012 at 6:28 PM, Lewis John Mcgibbney <
> >> > >>> > lewis.mcgibb...@gmail.com> wrote:
> >> > >>> >
> >> > >>> > > Is this a bug?
> >> > >>> > >
> >> > >>> > > On Thu, Nov 22, 2012 at 10:13 PM, Joe Zhang <smartag...@gmail.com> wrote:
> >> > >>> > > > Putting -filter between crawldb and segments, I still got the same
> >> > >>> > > > thing:
> >> > >>> > > >
> >> > >>> > > > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_fetch
> >> > >>> > > > Input path does not exist:
> >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/crawl_parse
> >> > >>> > > > Input path does not exist:
> >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_data
> >> > >>> > > > Input path does not exist:
> >> > >>> > > > file:/home/tools/Nutch/apache-nutch-1.5.1/-filter/parse_text
> >> > >>> > > >
> >> > >>> > > > On Thu, Nov 22, 2012 at 3:11 PM, Markus Jelsma
> >> > >>> > > > <markus.jel...@openindex.io> wrote:
> >> > >>> > > >
> >> > >>> > > >> These are roughly the available parameters:
> >> > >>> > > >>
> >> > >>> > > >> Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-hostdb
> >> > >>> > > >> <hostdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>)
> >> > >>> > > >> [-noCommit] [-deleteGone] [-deleteRobotsNoIndex]
> >> > >>> > > >> [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> >> > >>> > > >>
> >> > >>> > > >> Having -filter at the end should work fine; it works here that way.
> >> > >>> > > >> If for some reason it doesn't, put it before the segment and after
> >> > >>> > > >> the crawldb, and file an issue in Jira.
> >> > >>> > > >>
> >> > >>> > > >> Cheers
> >> > >>> > > >>
> >> > >>> > > >> -----Original message-----
> >> > >>> > > >> > From:Joe Zhang <smartag...@gmail.com>
> >> > >>> > > >> > Sent: Thu 22-Nov-2012 23:05
> >> > >>> > > >> > To: Markus Jelsma <markus.jel...@openindex.io>; user <user@nutch.apache.org>
> >> > >>> > > >> > Subject: Re: Indexing-time URL filtering again
> >> > >>> > > >> >
> >> > >>> > > >> > Yes, I forgot to do that. But still, what exactly should the command
> >> > >>> > > >> > look like?
> >> > >>> > > >> >
> >> > >>> > > >> > bin/nutch solrindex -Durlfilter.regex.file=....UrlFiltering.txt
> >> > >>> > > >> > http://localhost:8983/solr/ .../crawldb/ ..../segments/* -filter
> >> > >>> > > >> >
> >> > >>> > > >> > This command would cause Nutch to interpret "-filter" as a path.
> >> > >>> > > >> >
> >> > >>> > > >> > On Thu, Nov 22, 2012 at 6:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >> > >>> > > >> > Hi,
> >> > >>> > > >> >
> >> > >>> > > >> > I just tested a small index job that usually writes 1200 records to
> >> > >>> > > >> > Solr. It works fine if I specify -. in a filter (index nothing) and
> >> > >>> > > >> > point to it with -Durlfilter.regex.file=path like you do. I assume
> >> > >>> > > >> > that by `it doesn't work` you mean that it filters nothing and
> >> > >>> > > >> > indexes all records from the segment. Did you forget the -filter
> >> > >>> > > >> > parameter?
> >> > >>> > > >> >
> >> > >>> > > >> > Cheers
> >> > >>> > > >> >
> >> > >>> > > >> > -----Original message-----
> >> > >>> > > >> > > From:Joe Zhang <smartag...@gmail.com>
> >> > >>> > > >> > > Sent: Thu 22-Nov-2012 07:29
> >> > >>> > > >> > > To: user <user@nutch.apache.org>
> >> > >>> > > >> > > Subject: Indexing-time URL filtering again
> >> > >>> > > >> > >
> >> > >>> > > >> > > Dear List:
> >> > >>> > > >> > >
> >> > >>> > > >> > > I asked a similar question before, but I haven't solved the problem.
> >> > >>> > > >> > > Therefore I am re-asking the question more clearly and seeking advice.
> >> > >>> > > >> > >
> >> > >>> > > >> > > I'm using Nutch 1.5.1 and Solr 3.6.1 together. Things work fine at the
> >> > >>> > > >> > > rudimentary level.
> >> > >>> > > >> > >
> >> > >>> > > >> > > The basic problem I face in crawling/indexing is that I need to control
> >> > >>> > > >> > > which pages the crawlers should VISIT (so far through
> >> > >>> > > >> > > nutch/conf/regex-urlfilter.txt)
> >> > >>> > > >> > > and which pages are INDEXED by Solr. The latter are only a SUBSET of the
> >> > >>> > > >> > > former, and they are giving me a headache.
> >> > >>> > > >> > >
> >> > >>> > > >> > > A real-life example would be: when we crawl CNN.com, we only want to index
> >> > >>> > > >> > > "real content" pages such as
> >> > >>> > > >> > > http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1 .
> >> > >>> > > >> > > When we start the crawling from the root, we can't specify tight
> >> > >>> > > >> > > patterns (e.g.,
> >> > >>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..* )
> >> > >>> > > >> > > in nutch/conf/regex-urlfilter.txt, because the pages on the path between
> >> > >>> > > >> > > root and content pages do not satisfy such patterns. Putting such
> >> > >>> > > >> > > patterns in nutch/conf/regex-urlfilter.txt
> >> > >>> > > >> > > would severely jeopardize the coverage of the crawl.
> >> > >>> > > >> > >
> >> > >>> > > >> > > The closest solution I've got so far (courtesy of Markus) was this:
> >> > >>> > > >> > >
> >> > >>> > > >> > > nutch solrindex -Durlfilter.regex.file=/path http://solrurl/ ...
> >> > >>> > > >> > >
> >> > >>> > > >> > > but unfortunately I haven't been able to make it work for me. The content
> >> > >>> > > >> > > of the urlfilter.regex.file is what I thought was "correct" --- something
> >> > >>> > > >> > > like the following:
> >> > >>> > > >> > >
> >> > >>> > > >> > > +^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
> >> > >>> > > >> > > -.
> >> > >>> > > >> > >
> >> > >>> > > >> > > Everything seems quite straightforward. Am I doing anything wrong here?
> >> > >>> > > >> > > Can anyone advise? I'd greatly appreciate it.
> >> > >>> > > >> > >
> >> > >>> > > >> > > Joe
> >> > >>> > > >> > >
> >> > >>> > > >> >
> >> > >>> > > >> >
> >> > >>> > > >>
> >> > >>> > >
> >> > >>> > >
> >> > >>> > >
> >> > >>> > > --
> >> > >>> > > Lewis
> >> > >>> > >
> >> > >>> >
> >> > >>>
> >> > >>
> >> > >>
> >> > >
> >> >
> >>
> >
> >
>
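
A side note for readers of this thread: the rule file quoted in the original
message can be sanity-checked offline before running solrindex. The sketch
below is not Nutch code. It is a minimal Python approximation that assumes the
regex URL filter reads rules top to bottom, lets the first matching rule decide
(+ accepts, - rejects), and rejects a URL that matches no rule. The names
load_rules and accepts are made up for illustration, the second test URL is a
made-up example, and Python's re module only approximates Java's regex engine.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found anywhere in the URL decides."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumed default when no rule matches


# The two rules quoted in the original message of this thread.
RULES = r"""
+^http://([a-z0-9]*\.)*cnn.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/..*
-.
"""

rules = load_rules(RULES)
article = "http://www.cnn.com/2012/11/21/us/bin-laden-burial/index.html?hpt=hp_t1"
section = "http://www.cnn.com/US/"  # hypothetical non-article URL

print(accepts(rules, article))  # True: matches the dated-article rule
print(accepts(rules, section))  # False: falls through to the catch-all '-.'
```

If both URLs behave as expected here, the rule file itself is probably not the
problem, and the -filter handling or the job log is the next thing to check.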