Ok thanks!
But I decided against using the Nutch crawler.
Building the index directly with Lucene is the better approach for me,
because I do not need to crawl.
(I'm also searching with Lucene.)
For parsing I now use PDFBox for PDF documents and Apache POI for MS
Office documents.
There's one small problem remaining: MIME type checking.
I tried this:
String mimetype = new MimetypesFileTypeMap().getContentType( file );
but I always get the type application/octet-stream.
Does anybody know a good MIME type detection class in Java?
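
As far as I can tell, MimetypesFileTypeMap only maps file extensions through a
mime.types table, and the default table is small, so anything it does not know
ends up as application/octet-stream. One possible workaround is to look at the
first bytes of the file instead: PDF files start with "%PDF-", and the legacy
MS Office formats (.doc, .xls, .ppt) share the OLE2 container signature
D0 CF 11 E0 A1 B1 1A E1. The following is only a minimal sketch under those
assumptions. MimeSniffer and detect are made-up names, not part of any library,
and the extension is still used to tell the Office formats apart because the
signature alone cannot. The newer OOXML formats (.docx, .xlsx) are not covered.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper: guesses a MIME type from the leading bytes of a file.
final class MimeSniffer {
    // Every PDF file starts with the ASCII text "%PDF-".
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};
    // Signature of the OLE2 container used by legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    private MimeSniffer() {}

    // Reads at most the first 8 bytes of the file and classifies them.
    static String detect(Path file) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int len = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (len < head.length
                    && (r = in.read(head, len, head.length - len)) > 0) {
                len += r;
            }
        }
        return detect(head, len, file.getFileName().toString());
    }

    // Pure variant: 'len' is the number of valid bytes in 'head'.
    static String detect(byte[] head, int len, String fileName) {
        if (startsWith(head, len, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, len, OLE2_MAGIC)) {
            // The container is identical for all three formats,
            // so fall back to the extension to pick one.
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".doc")) return "application/msword";
            if (lower.endsWith(".xls")) return "application/vnd.ms-excel";
            if (lower.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
            return "application/octet-stream";
        }
        // Last resort: the JDK's built-in extension table (may return null).
        String guess = URLConnection.guessContentTypeFromName(fileName);
        return guess != null ? guess : "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, int len, byte[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

With something like this, the result of MimeSniffer.detect(path) could decide
whether a file goes to PDFBox or to POI before it is added to the Lucene index.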
Otis Gospodnetic-2 wrote:
>
> Hi Matthias,
>
> Several years ago when I did crawling/parsing/indexing of full-page
> content for Simpy.com I used Nutch in exactly that manner.
>
> For example (this is outdated code, but you'll get the idea):
>
> System.out.println("Urls to fetch: " + _urls.size());
>
> if (_urls.size() == 0)
> return;
>
> // clean up and prepare the FS
> prepareFS();
>
> // create the URL file
> String urlFile = createURLFile();
>
> // create the fetch list from the URL file
> createFetchList(urlFile);
>
> // start the fetcher
> _segmentDir = getLastSegmentDirectory(_rootDir);
> String[] params = new String[] { // THIS IS WHAT YOU ARE AFTER
>     "-local",
>     _segmentDir
> };
> org.apache.nutch.fetcher.Fetcher.main(params); // THIS IS WHAT YOU ARE AFTER
>
>
> If you look at bin/nutch script, you will see it really just calls Nutch's
> Java classes, so you just have to figure out what parameters those classes
> take and then call them as above, or even more directly using ctor and
> methods other than main.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Matthias W. <[email protected]>
>> To: [email protected]
>> Sent: Tuesday, January 13, 2009 7:17:50 AM
>> Subject: nutch crawling with java (not shellscript)
>>
>>
>> Hi,
>> is there a tutorial or can anyone explain if and how I can run the nutch
>> crawler via java and not with the shellscript?
>> Furthermore I don't need to crawl, because I've got a list of URLs (PDF,
>> Word, Excel, ... Documents) which I have to index
>> -> In my case nutch only has to create the index from the urls list.
>>
>> Till now I've got a shellscript which calls "bin/nutch crawl ..."
>>
>> But if it is possible, I want to use java code instead of the "bin/nutch"
>> crawlscript.
>>
>> Are there Java classes and methods to do this?
>>
>> To illustrate, this is how I imagine starting the crawl (or rather the
>> indexing process):
>> "java Crawl"
>> That way I could set the crawl options in the Java code and not in a
>> shellscript.
>>
>> Is this possible?
>>
>> Thanks!
>> Matthias
>> --
>> View this message in context:
>> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>