Hi!
I have now done the steps that Marcin proposed:
1. I ran the parser normally, based on my seed list.
2. I used the segment merger like this:
bin/nutch mergesegs temp -dir segments crawl_SAVE/segments/* -filter
where nutch-site.xml contains these parts:
<property>
<name>plugin.auto-activation</name>
<value>false</value>
<description>Defines if some plugins that are not activated regarding
the plugin.includes and plugin.excludes properties must be automatically
activated if they are needed by some activated plugins.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|lib-log4j|parse-pdf|scoring-opic</value>
<description>blabla</description>
</property>
These two entries guarantee that really no other plugins are used, only the
ones I mention in plugin.includes. From the hadoop.log file I found out that
the plugin 'parse-pdf' needs the plugin 'lib-log4j'. Furthermore, without the
inclusion of 'scoring-opic' and 'nutch-extensionpoints' the whole merging
(filtering) process does not work.
hadoop.log says the following about the plugins in use:
INFO segment.SegmentMerger - Merging 2 segments to temp/20070614143438
INFO segment.SegmentMerger - SegmentMerger: adding crawl_SAVE/segments/20070611204026
INFO segment.SegmentMerger - SegmentMerger: adding crawl_SAVE/segments/20070611211131
INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
INFO plugin.PluginRepository - Plugins: looking in: /home/kammerlander/crawler/nutch-0.8.1/plugins
INFO plugin.PluginRepository - Plugin Auto-activation mode: [false]
INFO plugin.PluginRepository - Registered Plugins:
INFO plugin.PluginRepository - Log4j (lib-log4j)
INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
INFO plugin.PluginRepository - Pdf Parse Plug-in (parse-pdf)
INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
INFO plugin.PluginRepository - Registered Extension-Points:
INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
With these entries in nutch-site.xml I get no errors during the merging
process, and everything seems to work fine.
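As an aside: since Marcin's step 2 relies on a URL filter to drop everything that is
not a PDF, I assume the -filter option of mergesegs only does something useful if a
urlfilter plugin is active too. A rough sketch of how I imagine that would look (the
plugin name urlfilter-regex and the exact rules below are my assumption, I have not
verified this yet):

<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|lib-log4j|parse-pdf|scoring-opic|urlfilter-regex</value>
<description>same plugins as above, plus the regex URL filter</description>
</property>

and in regex-urlfilter.txt something like:

# keep only URLs that end in .pdf
+\.pdf$
# reject everything else
-.

But maybe that is not needed here, I just wanted to mention it.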
3. Indexing the merged (filtered) documents:
After merging (filtering) the segments I do:
a) cp -a temp/* crawl/segments/
b) bin/nutch updatedb crawl/crawldb crawl/segments/*
c) bin/nutch invertlinks crawl/linkdb crawl/segments/*
d) bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
With the configuration above this now gives me a Java NullPointerException.
hadoop.log:
WARN mapred.LocalJobRunner - job_ldja5n
java.lang.NullPointerException
at org.apache.nutch.indexer.Indexer$2.write(Indexer.java:113)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:258)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
At first I thought OK... maybe there really is no PDF document contained in the
merged segments... but when I tried mergesegs without filtering, I found
plenty of PDF documents in the search.
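In case it is useful, this is how I would try to double-check whether any parsed PDF
content actually ended up in the filtered segment (the segment name is just the one
from the log above, and the readseg usage is from memory, so please treat it as a
sketch):

# show what the merged segment contains
bin/nutch readseg -list temp/20070614143438
# dump the segment to plain text and count entries whose content type is PDF
bin/nutch readseg -dump temp/20070614143438 readseg_out
grep -c "application/pdf" readseg_out/dump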
So, does anyone have an idea what's going wrong, or what I'm doing wrong?
kind regards
martin
Quoting Marcin Okraszewski <[EMAIL PROTECTED]>:
> This is what I do:
> 1. Run parser normally, without any limitations.
> 2. Use the segment merger with a URL filter which will filter out documents that
> are not PDF files. So this is in fact filtering the segment, not merging it.
> 3. Index the result (filtered) segment.
>
> I hope it works for you.
> Marcin Okraszewski
>
>
> > Hi Briggs, hi Ronny, hi all
> >
> > First of all, thanks for your help!!
> >
> > I tried both methods from you, Ronny, and additionally the one from Briggs.
> > They both have the same effect: if I remove all extensions not to be indexed,
> > as you both described (the only extension that remains is pdf), then the
> > crawler does not even parse a single site. The crawl just ends
> > with no page indexed.
> >
> > Let me try to describe my problem again: the crawler should really parse all
> > sites starting from the seed URLs, no matter which extension. But what the
> > crawler should then do is index only the PDF documents. All other sites
> > which are not PDF documents should not be listed, not fetched, and
> > therefore not saved.
> >
> > So the only final sites listed when I do a search in the Nutch search
> > engine should be URLs of PDFs... nothing else.
> >
> > I hope I made my problem a bit clearer ;)
> >
> > greetz
> > martin
> >
> >
> >
> > Quoting Briggs <[EMAIL PROTECTED]>:
> >
> > > Ronny, your way is probably better. See, I was only dealing with the
> > > fetched properties. But, in your case, you don't fetch it, which gets rid
> > > of all that wasted bandwidth.
> > >
> > > For dealing with types that can be dealt with via the file extension, this
> > > would probably work better.
> > >
> > >
> > > On 6/7/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> > > >
> > > >
> > > > Hi.
> > > >
> > > > Configure crawl-urlfilter.txt
> > > > Thus you want to add something like +\.pdf$
> > > > I guess another way would be to exclude all others.
> > > >
> > > > Try expanding the line below with html, doc, xls, ppt, etc
> > > >
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
> > > >
> > > > Or try including
> > > > +\.pdf$
> > > > #
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
> > > > followed by
> > > > -.
> > > >
> > > > Haven't tried it myself, but experiment some and I guess you'll figure it
> > > > out pretty soon.
> > > >
> > > > Regards,
> > > > Ronny
> > > >
> > > > -----Original Message-----
> > > > From: Martin Kammerlander [mailto:[EMAIL PROTECTED]
> > > >
> > > > Sent: 6 June 2007 20:30
> > > > To: [EMAIL PROTECTED]
> > > > Subject: indexing only special documents
> > > >
> > > >
> > > >
> > > > hi!
> > > >
> > > > I have a question: if I have, for example, some seed URLs and do a crawl
> > > > based on those seeds, and I then want to index only pages that contain,
> > > > for example, PDF documents, how can I do that?
> > > >
> > > > cheers
> > > > martin
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > "Conscious decisions by conscious minds are what make reality real"
> > >
> >
> >
> >
>
>