Thanks, I'll look at it. But it doesn't support Office 2007 yet, does it?
Julien Nioche-4 wrote:
>
> Matthias,
>
> Have a look at Apache Tika. It provides a simple and unified API over PDFBOX
> and POI etc... and Mimetype facilities.
> That should greatly simplify your code.
>
> Julien
>
> 2009/1/14 Matthias W. <[email protected]>
>
>> Ok thanks!
>>
>> But I decided against using the nutch crawler.
>>
>> It will be the better way to build the index directly with Lucene, because I
>> do not need to crawl.
>> (I'm also searching with Lucene)
>>
>> Now I use the parsers PDFBox for PDF-Documents and the Apache POI for MS
>> Office Documents.
>>
>> There's little problem remaining: Mimetype checking.
>> I tried this:
>> String mimetype = new MimetypesFileTypeMap().getContentType( file );
>> but I always get the type application/octet-stream
>>
>> Does anybody know a good Mimetype class in java?
>>
>>
>> Otis Gospodnetic-2 wrote:
>> >
>> > Hi Matthias,
>> >
>> > Several years ago when I did crawling/parsing/indexing of full-page
>> > content for Simpy.com I used Nutch in exactly that manner.
>> >
>> > For example (this is outdated code, but you'll get the idea):
>> >
>> > System.out.println("Urls to fetch: " + _urls.size());
>> >
>> > if (_urls.size() == 0)
>> >     return;
>> >
>> > // clean up and prepare the FS
>> > prepareFS();
>> >
>> > // create the URL file
>> > String urlFile = createURLFile();
>> >
>> > // create the fetch list from the URL file
>> > createFetchList(urlFile);
>> >
>> > // start the fetcher
>> > _segmentDir = getLastSegmentDirectory(_rootDir);
>> > String[] params = new String[] { // THIS IS WHAT YOU ARE AFTER
>> >     "-local",
>> >     _segmentDir
>> > };
>> > org.apache.nutch.fetcher.Fetcher.main(params); // THIS IS WHAT YOU ARE AFTER
>> >
>> >
>> > If you look at bin/nutch script, you will see it really just calls Nutch's
>> > Java classes, so you just have to figure out what parameters those classes
>> > take and then call them as above, or even more directly using ctor and
>> > methods other than main.
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> > ----- Original Message ----
>> >> From: Matthias W. <[email protected]>
>> >> To: [email protected]
>> >> Sent: Tuesday, January 13, 2009 7:17:50 AM
>> >> Subject: nutch crawling with java (not shellscript)
>> >>
>> >>
>> >> Hi,
>> >> is there a tutorial or can anyone explain if and how I can run the nutch
>> >> crawler via java and not with the shellscript?
>> >> Furthermore I don't need to crawl, because I've got a list of URLs (PDF,
>> >> Word, Excel, ... Documents) which I have to index
>> >> -> In my case nutch only has to create the index from the urls list.
>> >>
>> >> Till now I've got a shellscript which calls "bin/nutch crawl ..."
>> >>
>> >> But if it is possible, I want to use java code instead of the "bin/nutch"
>> >> crawlscript.
>> >>
>> >> Are there Java classes and methods to do this?
>> >>
>> >> For better understanding, my association to start the crawl respectively
>> >> the index process:
>> >> "java Crawl"
>> >> That I'm able to set options for crawling in the java code and not in a
>> >> shellscript.
>> >>
>> >> Is this possible?
>> >>
>> >> Thanks!
>> >> Matthias
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
>> >> Sent from the Nutch - User mailing list archive at Nabble.com.
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21454646.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com

--
View this message in context: http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21455529.html
Sent from the Nutch - User mailing list archive at Nabble.com.
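
On the Mimetype question quoted above: MimetypesFileTypeMap.getContentType() returns application/octet-stream whenever the file extension has no entry in its map, which would explain the result for PDF and Office files. A minimal sketch of an extension-based lookup using only the JDK follows. The class name ExtensionMimeTypes and the method guess() are hypothetical, not part of Lucene, Nutch, PDFBox, POI or Tika, and the table only covers the document types mentioned in this thread. It looks at the file name only, not the content, so it is no substitute for real content detection such as Tika's.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not an API of any library named in this thread.
class ExtensionMimeTypes {

    // Same fallback value that MimetypesFileTypeMap uses for unknown extensions.
    private static final String FALLBACK = "application/octet-stream";

    private static final Map<String, String> TYPES = new HashMap<String, String>();
    static {
        TYPES.put("pdf", "application/pdf");
        TYPES.put("doc", "application/msword");
        TYPES.put("xls", "application/vnd.ms-excel");
        TYPES.put("ppt", "application/vnd.ms-powerpoint");
        // Office 2007 (OOXML) formats
        TYPES.put("docx",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        TYPES.put("xlsx",
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
    }

    // Guess a MIME type from the file name's extension (case-insensitive).
    static String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension at all, or a trailing dot
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String type = TYPES.get(ext);
        return type != null ? type : FALLBACK;
    }

    public static void main(String[] args) {
        // Print the guessed type for each file name given on the command line.
        for (String name : args) {
            System.out.println(name + " -> " + guess(name));
        }
    }
}
```

The result of guess(file.getName()) could then be used to choose between the PDFBox and POI parsers before adding the extracted text to the Lucene index.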
