Hi!
I have now done the steps that Marcin proposed:
1. I ran the parser normally, based on my seed list.
2. I used the segment merger like this:
bin/nutch mergesegs temp -dir segments crawl_SAVE/segments/* -filter
where nutch-site.xml contains these parts:
<property>
<name>plugin.auto-activation</name>
<value>false</value>
<description>Defines if some plugins that are not activated regarding
the plugin.includes and plugin.excludes properties must be automatically
activated if they are needed by some activated plugins.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|lib-log4j|parse-pdf|scoring-opic</value>
<description>blabla</description>
</property>
These two entries guarantee that really no other plugins are used, only the
ones I mention in plugin.includes. From the hadoop.log file I found out that
the plugin 'parse-pdf' needs the plugin 'lib-log4j'. Furthermore, without the
inclusion of 'scoring-opic' and 'nutch-extensionpoints' the whole merging
(filtering) process does not work.
hadoop.log says the following about the plugins in use:
INFO segment.SegmentMerger - Merging 2 segments to temp/20070614143438
INFO segment.SegmentMerger - SegmentMerger: adding crawl_SAVE/segments/20070611204026
INFO segment.SegmentMerger - SegmentMerger: adding crawl_SAVE/segments/20070611211131
INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
INFO plugin.PluginRepository - Plugins: looking in: /home/kammerlander/crawler/nutch-0.8.1/plugins
INFO plugin.PluginRepository - Plugin Auto-activation mode: [false]
INFO plugin.PluginRepository - Registered Plugins:
INFO plugin.PluginRepository - Log4j (lib-log4j)
INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
INFO plugin.PluginRepository - Registered Extension-Points:
INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
With these entries in nutch-site.xml I get no errors during the merging
process, and everything seems to work fine.
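As an aside: since Marcin's step 2 relies on a URL filter to drop everything that is
not a PDF, I assume the -filter option of mergesegs only does something useful if a
urlfilter plugin is active too. A rough sketch of how I imagine that would look (the
plugin name urlfilter-regex and the exact rules below are my assumption, I have not
verified this yet):

<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|lib-log4j|parse-pdf|scoring-opic|urlfilter-regex</value>
<description>same plugins as above, plus the regex URL filter</description>
</property>

and in regex-urlfilter.txt something like:

# keep only URLs that end in .pdf
+\.pdf$
# reject everything else
-.

But maybe that is not needed here, I just wanted to mention it.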
3. Indexing the merged (filtered) documents:
After merging (filtering) the segments I do:
a) cp -a temp/* crawl/segments/
b) bin/nutch updatedb crawl/crawldb crawl/segments/*
c) bin/nutch invertlinks crawl/linkdb crawl/segments/*
d) bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
With the configuration above this now gives me a Java NullPointerException.
hadoop.log:
WARN mapred.LocalJobRunner - job_ldja5n
java.lang.NullPointerException
at org.apache.nutch.indexer.Indexer$2.write(Indexer.java:113)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:258)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
At first I thought OK... maybe there really is no PDF document contained in the
merged segments... but when I tried mergesegs without filtering, I found
plenty of PDF documents in the search.
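In case it is useful, this is how I would try to double-check whether any parsed PDF
content actually ended up in the filtered segment (the segment name is just the one
from the log above, and the readseg usage is from memory, so please treat it as a
sketch):

# show what the merged segment contains
bin/nutch readseg -list temp/20070614143438
# dump the segment to plain text and count entries whose content type is PDF
bin/nutch readseg -dump temp/20070614143438 readseg_out
grep -c "application/pdf" readseg_out/dump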
So, does anyone have an idea what's going wrong, or what I'm doing wrong?
kind regards
martin
Quoting Marcin Okraszewski <[EMAIL PROTECTED]>:
> This is what I do:
> 1. Run parser normally, without any limitations.
> 2. Use the segment merger with a URL filter which will filter out documents that
> are not PDF files. So this is in fact filtering the segment, not merging it.
> 3. Index the result (filtered) segment.
>
> I hope it works for you.
> Marcin Okraszewski
>
>
> > Hi Briggs, hi Ronny, hi all
> >
> > First of all, thanks for your help!!
> >
> > I tried both methods from you, Ronny, and additionally the one from Briggs.
> > They both have the same effect: if I remove all extensions not to be indexed,
> > as you both described (the only extension that remains is pdf), then the
> > crawler does not even parse a single site. The crawl just ends
> > with no page indexed.
> >
> > Let me try to describe my problem again: the crawler should really parse all
> > sites starting from the seed URLs, no matter which extension. But what the
> > crawler should then do is index only the PDF documents. All other sites
> > which are not PDF documents should not be listed, not fetched, and
> > therefore not saved.
> >
> > So the only final sites listed when I do a search in the Nutch search
> > engine should be URLs of PDFs... nothing else.
> >
> > I hope I made my problem a bit clearer ;)
> >
> > greetz
> > martin
> >
> >
> >
> > Quoting Briggs <[EMAIL PROTECTED]>:
> >
> > > Ronny, your way is probably better. See, I was only dealing with the
> > > fetched properties. But, in your case, you don't fetch it, which gets rid
> > > of all that wasted bandwidth.
> > >
> > > For dealing with types that can be dealt with via the file extension, this
> > > would probably work better.
> > >
> > >
> > > On 6/7/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> > > >
> > > >
> > > > Hi.
> > > >
> > > > Configure crawl-urlfilter.txt
> > > > Thus you want to add something like +\.pdf$
> > > > I guess another way would be to exclude all others.
> > > >
> > > > Try expanding the line below with html, doc, xls, ppt, etc
> > > >
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
> > > >
> > > > Or try including
> > > > +\.pdf$
> > > > #
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
> > > > followed by
> > > > -.
> > > >
> > > > Haven't tried it myself, but experiment some and I guess you'll figure it
> > > > out pretty soon.
> > > >
> > > > Regards,
> > > > Ronny
> > > >
> > > > -----Original Message-----
> > > > From: Martin Kammerlander [mailto:[EMAIL PROTECTED]
> > > >
> > > > Sent: 6 June 2007 20:30
> > > > To: [EMAIL PROTECTED]
> > > > Subject: indexing only special documents
> > > >
> > > >
> > > >
> > > > hi!
> > > >
> > > > I have a question: if I have, for example, some seed URLs and do a crawl
> > > > based on those seeds, and I then want to index only pages that contain,
> > > > for example, PDF documents, how can I do that?
> > > >
> > > > cheers
> > > > martin
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > "Conscious decisions by conscious minds are what make reality real"
> > >
> >
> >
> >
>
>