Matthias,

Have a look at Apache Tika. It provides a simple, unified API over PDFBox, POI and other parsers, and it also handles MIME type detection. That should greatly simplify your code.

Julien

2009/1/14 Matthias W. <[email protected]>

>
> Ok thanks!
>
> But I decided against using the nutch crawler.
>
> It will be the better way to build the index directly with Lucene, because I
> do not need to crawl.
> (I'm also searching with Lucene)
>
> Now I use the parsers PDFBox for PDF-Documents and the Apache POI for MS
> Office Documents.
>
> There's little problem remaining: Mimetype checking.
> I tried this:
> String mimetype = new MimetypesFileTypeMap().getContentType( file );
> but I always get the type application/octet-stream
>
> Does anybody know a good Mimetype class in java?
>
>
>
> Otis Gospodnetic-2 wrote:
> >
> > Hi Matthias,
> >
> > Several years ago when I did crawling/parsing/indexing of full-page
> > content for Simpy.com I used Nutch in exactly that manner.
> >
> > For example (this is outdated code, but you'll get the idea):
> >
> >     System.out.println("Urls to fetch: " + _urls.size());
> >
> >     if (_urls.size() == 0)
> >         return;
> >
> >     // clean up and prepare the FS
> >     prepareFS();
> >
> >     // create the URL file
> >     String urlFile = createURLFile();
> >
> >     // create the fetch list from the URL file
> >     createFetchList(urlFile);
> >
> >     // start the fetcher
> >     _segmentDir = getLastSegmentDirectory(_rootDir);
> >     String[] params = new String[] {    // THIS IS WHAT YOU ARE AFTER
> >         "-local",
> >         _segmentDir
> >     };
> >     org.apache.nutch.fetcher.Fetcher.main(params);  // THIS IS WHAT YOU ARE AFTER
> >
> >
> > If you look at bin/nutch script, you will see it really just calls Nutch's
> > Java classes, so you just have to figure out what parameters those classes
> > take and then call them as above, or even more directly using ctor and
> > methods other than main.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> >> From: Matthias W. <[email protected]>
> >> To: [email protected]
> >> Sent: Tuesday, January 13, 2009 7:17:50 AM
> >> Subject: nutch crawling with java (not shellscript)
> >>
> >>
> >> Hi,
> >> is there a tutorial or can anyone explain if and how I can run the nutch
> >> crawler via java and not with the shellscript?
> >> Furthermore I don't need to crawl, because I've got a list of URLs (PDF,
> >> Word, Excel, ... Documents) which I have to index
> >> -> In my case nutch only has to create the index from the urls list.
> >>
> >> Till now I've got a shellscript which calls "bin/nutch crawl ..."
> >>
> >> But if it is possible, I want to use java code instead of the "bin/nutch"
> >> crawlscript.
> >>
> >> Are there Java classes and methods to do this?
> >>
> >> For better understanding, my association to start the crawl respectively
> >> the index process:
> >> "java Crawl"
> >> That I'm able to set options for crawling in the java code and not in a
> >> shellscript.
> >>
> >> Is this possible?
> >>
> >> Thanks!
> >> Matthias
> >> --
> >> View this message in context:
> >> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21454646.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

--
DigitalPebble Ltd
http://www.digitalpebble.com
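
On the MIME type question raised in this thread: MimetypesFileTypeMap looks types up by file extension in its mime.types tables and falls back to application/octet-stream when it finds no entry, which would explain the result reported above. A detector that inspects the file content avoids that. The following is only a minimal sketch of the idea using the JDK alone, not Tika's API and not anyone's production code. The class name MimeSniffer, its methods and the choice of three signatures are assumptions made for illustration. The label returned for OLE2 files is likewise just a label for the container format, since .doc, .xls and .ppt cannot be told apart from the header alone.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: guesses a MIME type from the first bytes of a file. */
class MimeSniffer {

    // Well-known file signatures ("magic numbers").
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] OLE2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    /** Returns true if data begins with the given prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies a file by its leading bytes. */
    static String detect(byte[] header) {
        if (startsWith(header, PDF)) {
            return "application/pdf";            // candidate for PDFBox
        }
        if (startsWith(header, OLE2)) {
            return "application/x-ole-storage";  // legacy MS Office container, candidate for POI
        }
        if (startsWith(header, ZIP)) {
            return "application/zip";            // also the container of .docx/.xlsx
        }
        return "application/octet-stream";       // unknown
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static String detect(Path file) throws IOException {
        byte[] header = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (n < header.length && (r = in.read(header, n, header.length - n)) > 0) {
                n += r;
            }
        }
        return detect(Arrays.copyOf(header, n));
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + detect(Paths.get(arg)));
        }
    }
}
```

Run it with the files to check as arguments and use the result to pick the parser. A real detector such as the one in Tika knows far more signatures and can look inside containers, so this sketch is only meant to show why content-based detection succeeds where the extension lookup returned application/octet-stream.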
