Re: nutch crawling with java (not shellscript)

Julien Nioche Wed, 14 Jan 2009 04:26:08 -0800

Matthias,

Have a look at Apache Tika. It provides a simple, unified API over PDFBox,
POI and similar parsers, plus MIME type detection.
That should greatly simplify your code.
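
On the application/octet-stream result: as far as I know, MimetypesFileTypeMap
only looks up the file name extension in its mime.types tables and falls back
to application/octet-stream when it finds no match, so it never inspects the
file content. Detection by content, which Tika does far more thoroughly, can
be sketched with the JDK alone. The class below is a minimal sketch and not
Tika's API: MagicSniffer and its detect method are hypothetical names, and the
label returned for OLE2 files is an illustrative placeholder, since the
container signature alone cannot tell Word from Excel or PowerPoint.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Tika, PDFBox or POI. */
final class MagicSniffer {
    // A well-formed PDF file starts with the ASCII bytes "%PDF-".
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Signature of the OLE2 container used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Guesses a type from the leading bytes of a file. */
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, OLE2)) {
            return "application/x-ole-storage"; // illustrative label (assumption)
        }
        return "application/octet-stream"; // unknown content
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
            && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] head = new byte[8];
            int n = 0;
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                int r;
                // Read up to 8 bytes; a single read() may return fewer.
                while (n < head.length
                        && (r = in.read(head, n, head.length - n)) > 0) {
                    n += r;
                }
            }
            System.out.println(name + ": " + detect(Arrays.copyOf(head, n)));
        }
    }
}
```

Once compiled, it can be run as "java MagicSniffer" followed by one or more
file names. The point of the design is only that the decision comes from the
bytes and not from the file name, so a renamed or extensionless PDF is still
recognised.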


Julien

2009/1/14 Matthias W. <[email protected]>

>
> Ok thanks!
>
> But I decided against using the nutch crawler.
>
> It is better for me to build the index directly with Lucene, because I do
> not need to crawl.
> (I'm also searching with Lucene)
>
> Now I use PDFBox to parse PDF documents and Apache POI for MS Office
> documents.
>
> There's one small problem remaining: MIME type checking.
> I tried this:
> String mimetype = new MimetypesFileTypeMap().getContentType( file );
> but I always get the type application/octet-stream.
>
> Does anybody know a good MIME type class in Java?
>
>
>
> Otis Gospodnetic-2 wrote:
> >
> > Hi Matthias,
> >
> > Several years ago when I did crawling/parsing/indexing of full-page
> > content for Simpy.com I used Nutch in exactly that manner.
> >
> > For example (this is outdated code, but you'll get the idea):
> >
> >         System.out.println("Urls to fetch: " + _urls.size());
> >
> >         if (_urls.size() == 0)
> >             return;
> >
> >         // clean up and prepare the FS
> >         prepareFS();
> >
> >         // create the URL file
> >         String urlFile = createURLFile();
> >
> >         // create the fetch list from the URL file
> >         createFetchList(urlFile);
> >
> >         // start the fetcher
> >         _segmentDir = getLastSegmentDirectory(_rootDir);
> >         String[] params = new String[] {   // THIS IS WHAT YOU ARE AFTER
> >             "-local",
> >             _segmentDir
> >         };
> >         org.apache.nutch.fetcher.Fetcher.main(params);   // THIS IS WHAT YOU ARE AFTER
> >
> >
> > If you look at the bin/nutch script, you will see it really just calls
> > Nutch's Java classes, so you just have to figure out what parameters those
> > classes take and then call them as above, or even more directly using
> > constructors and methods other than main.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> >> From: Matthias W. <[email protected]>
> >> To: [email protected]
> >> Sent: Tuesday, January 13, 2009 7:17:50 AM
> >> Subject: nutch crawling with java (not shellscript)
> >>
> >>
> >> Hi,
> >> is there a tutorial, or can anyone explain whether and how I can run the
> >> Nutch crawler from Java instead of the shell script?
> >> Furthermore, I don't need to crawl, because I've got a list of URLs (PDF,
> >> Word, Excel, ... documents) which I have to index.
> >> -> In my case Nutch only has to create the index from the URL list.
> >>
> >> So far I have a shell script which calls "bin/nutch crawl ..."
> >>
> >> But if it is possible, I want to use Java code instead of the "bin/nutch"
> >> crawl script.
> >>
> >> Are there Java classes and methods to do this?
> >>
> >> To make it clearer, this is how I imagine starting the crawl (or rather
> >> the indexing process):
> >>     "java Crawl"
> >> so that I can set the crawl options in the Java code and not in a
> >> shell script.
> >>
> >> Is this possible?
> >>
> >> Thanks!
> >> Matthias
> >> --
> >> View this message in context:
> >> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21454646.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com
