Thanks, I'll look at it.

But it doesn't support Office 2007 yet, does it?
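
For the MIME type check quoted below, one option that needs no extra library is to look at the first bytes of each file. This is only a minimal sketch under my own assumptions, not how Tika or any other library does it: the class name MimeSniffer and its methods are made up for illustration, and the type string returned for old binary Office files is a placeholder. Office 2007 files (.docx, .xlsx, .pptx) are ZIP containers, so this sketch can only report them as application/zip.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: guesses a MIME type from a file's leading "magic" bytes.
public class MimeSniffer {
    // PDF files start with the ASCII text "%PDF-".
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    // OLE2 compound files (binary .doc/.xls/.ppt) start with these 8 bytes.
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP archives, including Office 2007 documents, start with "PK\003\004".
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies an in-memory header; falls back to the generic binary type.
    public static String sniff(byte[] header) {
        if (startsWith(header, PDF)) {
            return "application/pdf";
        }
        if (startsWith(header, OLE2)) {
            // Placeholder type: the container alone does not say Word vs. Excel.
            return "application/x-ole-storage";
        }
        if (startsWith(header, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // Reads at most the first 8 bytes of the file and classifies them.
    public static String sniff(String path) throws IOException {
        byte[] buf = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            int r;
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0) {
                n += r;
            }
        }
        return sniff(Arrays.copyOf(buf, n));
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + ": " + sniff(path));
        }
    }
}
```

Running "java MimeSniffer some.pdf some.doc" prints one guessed type per file. To tell Word from Excel, or .docx from a plain ZIP, the container contents would have to be inspected as well, which this sketch does not attempt.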



Julien Nioche-4 wrote:
> 
> Matthias,
> 
> Have a look at Apache Tika. It provides a simple, unified API over PDFBox,
> POI, etc., plus MIME type detection facilities.
> That should greatly simplify your code.
> 
> Julien
> 
> 2009/1/14 Matthias W. <[email protected]>
> 
>>
>> Ok thanks!
>>
>> But I decided against using the nutch crawler.
>>
>> It is better for me to build the index directly with Lucene, because I
>> do not need to crawl.
>> (I'm also searching with Lucene)
>>
>> I now use PDFBox to parse PDF documents and Apache POI for MS
>> Office documents.
>>
>> There's one small problem remaining: MIME type checking.
>> I tried this:
>> String mimetype = new MimetypesFileTypeMap().getContentType( file );
>> but I always get the type application/octet-stream.
>>
>> Does anybody know a good MIME type class in Java?
>>
>>
>>
>> Otis Gospodnetic-2 wrote:
>> >
>> > Hi Matthias,
>> >
>> > Several years ago when I did crawling/parsing/indexing of full-page
>> > content for Simpy.com I used Nutch in exactly that manner.
>> >
>> > For example (this is outdated code, but you'll get the idea):
>> >
>> >         System.out.println("Urls to fetch: " + _urls.size());
>> >
>> >         if (_urls.size() == 0)
>> >             return;
>> >
>> >         // clean up and prepare the FS
>> >         prepareFS();
>> >
>> >         // create the URL file
>> >         String urlFile = createURLFile();
>> >
>> >         // create the fetch list from the URL file
>> >         createFetchList(urlFile);
>> >
>> >         // start the fetcher
>> >         _segmentDir = getLastSegmentDirectory(_rootDir);
>> >         String[] params = new String[] {   // THIS IS WHAT YOU ARE AFTER
>> >             "-local",
>> >             _segmentDir
>> >         };
>> >         org.apache.nutch.fetcher.Fetcher.main(params);   // THIS IS WHAT YOU ARE AFTER
>> >
>> >
>> > If you look at the bin/nutch script, you will see it really just calls
>> > Nutch's Java classes, so you just have to figure out what parameters those
>> > classes take and then call them as above, or even more directly using
>> > constructors and methods other than main.
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > ----- Original Message ----
>> >> From: Matthias W. <[email protected]>
>> >> To: [email protected]
>> >> Sent: Tuesday, January 13, 2009 7:17:50 AM
>> >> Subject: nutch crawling with java (not shellscript)
>> >>
>> >>
>> >> Hi,
>> >> Is there a tutorial, or can anyone explain whether and how I can run the
>> >> Nutch crawler from Java rather than with the shell script?
>> >> Furthermore, I don't need to crawl, because I have a list of URLs (PDF,
>> >> Word, Excel, ... documents) which I have to index.
>> >> -> In my case Nutch only has to create the index from the URL list.
>> >>
>> >> Until now I have used a shell script which calls "bin/nutch crawl ...".
>> >>
>> >> But if possible, I want to use Java code instead of the "bin/nutch"
>> >> crawl script.
>> >>
>> >> Are there Java classes and methods to do this?
>> >>
>> >> To make this clearer, this is how I imagine starting the crawl, or
>> >> rather the indexing process:
>> >>     "java Crawl"
>> >> so that I can set the crawl options in the Java code and not in a
>> >> shell script.
>> >>
>> >> Is this possible?
>> >>
>> >> Thanks!
>> >> Matthias
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21454646.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> DigitalPebble Ltd
> http://www.digitalpebble.com
> 
> 

-- 
View this message in context: 
http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21455529.html
Sent from the Nutch - User mailing list archive at Nabble.com.
