Re: nutch crawling with java (not shellscript)

Matthias W. Wed, 14 Jan 2009 03:59:13 -0800

Ok thanks!

But I have decided against using the Nutch crawler.


Building the index directly with Lucene is the better way for me, because I
do not need to crawl.
(I'm also searching with Lucene.)

I now use PDFBox to parse PDF documents and Apache POI to parse MS Office
documents.

There's one small problem remaining: MIME type checking.
I tried this:
String mimetype = new MimetypesFileTypeMap().getContentType( file );
but I always get the type application/octet-stream

Does anybody know a good MIME type class in Java?
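
A note on the problem above: as far as I understand, MimetypesFileTypeMap
only looks up the file extension in the mime.types tables it finds and falls
back to application/octet-stream for extensions it does not know, which would
explain the result for PDF and Office files. One possible workaround is a
small extension lookup of your own. The following is only a minimal sketch;
the class name SimpleMimeTypes, the method guess and the extension table are
assumptions for illustration, not part of any library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps a file name to a MIME type by its extension.
public class SimpleMimeTypes {

    // Fallback for unknown or missing extensions.
    private static final String DEFAULT_TYPE = "application/octet-stream";

    // Illustrative table only; extend it with the types you need to index.
    private static final Map<String, String> TYPES = new HashMap<String, String>();
    static {
        TYPES.put("pdf", "application/pdf");
        TYPES.put("doc", "application/msword");
        TYPES.put("xls", "application/vnd.ms-excel");
        TYPES.put("ppt", "application/vnd.ms-powerpoint");
    }

    // Returns the MIME type for the given file name, or the fallback type.
    public static String guess(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return DEFAULT_TYPE; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String type = TYPES.get(ext);
        return type != null ? type : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(guess("report.PDF"));
        System.out.println(guess("table.xls"));
        System.out.println(guess("unknown.bin"));
    }
}
```

This only looks at the file name, so it cannot detect a wrongly named file;
for that, the first bytes of the file would have to be inspected.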



Otis Gospodnetic-2 wrote:
> 
> Hi Matthias,
> 
> Several years ago when I did crawling/parsing/indexing of full-page
> content for Simpy.com I used Nutch in exactly that manner.
> 
> For example (this is outdated code, but you'll get the idea):
> 
>         System.out.println("Urls to fetch: " + _urls.size());
> 
>         if (_urls.size() == 0)
>             return;
> 
>         // clean up and prepare the FS
>         prepareFS();
> 
>         // create the URL file
>         String urlFile = createURLFile();
> 
>         // create the fetch list from the URL file
>         createFetchList(urlFile);
> 
>         // start the fetcher
>         _segmentDir = getLastSegmentDirectory(_rootDir);
>         // THIS IS WHAT YOU ARE AFTER
>         String[] params = new String[] {
>             "-local",
>             _segmentDir
>         };
>         // THIS IS WHAT YOU ARE AFTER
>         org.apache.nutch.fetcher.Fetcher.main(params);
> 
> 
> If you look at the bin/nutch script, you will see it really just calls
> Nutch's Java classes, so you just have to figure out what parameters those
> classes take and then call them as above, or even more directly using
> constructors and methods other than main.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: Matthias W. <[email protected]>
>> To: [email protected]
>> Sent: Tuesday, January 13, 2009 7:17:50 AM
>> Subject: nutch crawling with java (not shellscript)
>> 
>> 
>> Hi,
>> is there a tutorial or can anyone explain if and how I can run the nutch
>> crawler via java and not with the shellscript?
>> Furthermore I don't need to crawl, because I've got a list of URLs (PDF,
>> Word, Excel, ... Documents) which I have to index
>> -> In my case nutch only has to create the index from the urls list.
>> 
>> So far I have a shell script which calls "bin/nutch crawl ..."
>> 
>> But if it is possible, I want to use Java code instead of the "bin/nutch"
>> crawl script.
>> 
>> Are there Java classes and methods to do this?
>> 
>> To illustrate, this is how I imagine starting the crawl or the indexing
>> process:
>>     "java Crawl"
>> That way I would be able to set the crawl options in Java code and not in
>> a shell script.
>> 
>> Is this possible?
>> 
>> Thanks!
>> Matthias
>> -- 
>> View this message in context: 
>> http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21434602.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/nutch-crawling-with-java-%28not-shellscript%29-tp21434602p21454646.html
Sent from the Nutch - User mailing list archive at Nabble.com.
