Re: alternatives to PDFBox (was: IOException when parsing PDF files)

2010-01-07 Thread Godmar Back
Thanks Andrzej, I'll check out parse-ext. To reply to your views on PDFBox, etc. - I understand text extraction is a hard problem to solve, because it essentially involves reversing a type-setting procedure, although PDFBox's failure in this case is actually not related to the inherent difficulties

Re: alternatives to PDFBox (was: IOException when parsing PDF files)

2010-01-07 Thread Andrzej Bialecki
On 2010-01-07 07:16, Godmar Back wrote: > ps: upon closer examination, it seems that PDFBox is not very mature software; I was able to fix its parser to go past this first error I encountered, then discovered that it's not implementing many essential PDF operators. that's surprising, I've been

alternatives to PDFBox (was: IOException when parsing PDF files)

2010-01-06 Thread Godmar Back
ps: upon closer examination, it seems that PDFBox is not very mature software; I was able to fix its parser to go past this first error I encountered, then discovered that it's not implementing many essential PDF operators. As a result, the extracted text is pretty bad and one cannot expect good re

Re: Alternatives

2006-07-07 Thread karl wettin
On Wed, 2006-07-05 at 20:32 -0700, Stefan Groschupf wrote: > Crawler & Co. are command line tools. > The servletcontainer is only used to deliver search results but you > can use the servlet that just provides XML. Ah, excellent. Thanks for letting me avoid reading the manual ;) > > It would be

Re: [Nutch-general] Alternatives

2006-07-06 Thread Jason Calabrese
You can start up a crawler by just creating a job. You can basically just copy/tweak some code from the main method in org.apache.nutch.crawl.Crawl. In my application we are working at a lower level and first create the crawl_generate dir, then start a fetch, then we parse the fetched results, a

Re: Alternatives

2006-07-05 Thread Stefan Groschupf
Hi, It would be nice to use the features of Nutch instead of my own hacky stuff. How bound is Nutch to the J2EE-container? Would it be a big job to make it run on an alternative GUI? Or is the container used for more than GUI? I.e. do all services (crawler, etc.) run within the container? Do

Alternatives

2006-07-05 Thread karl wettin
I have never looked at how Nutch works, nor have I used it. My questions might just be RTFM-related. Lately people have asked me to help them out with simple domain-specific web indexing services. The requirements are, as usual when I'm involved, to run on very limited resources. What I did is to co

Good Alternatives to Nutch?

2005-11-07 Thread wmelo
There is (or there was, depending on your view) aspseek (www.aspseek.org). This internet search engine project is now stalled, but it works out of the box with RedHat 9 and even, using the rpm and legacy software that is part of the distribution, with Fedora Core 3. Until, at

Re: Good Alternatives to Nutch?

2005-11-06 Thread Stefan Groschupf
-----Original Message----- From: Stefan Groschupf [mailto:[EMAIL PROTECTED]] Sent: Sunday, November 06, 2005 8:03 AM To: nutch-user@lucene.apache.org Subject: Re: Good Alternatives to Nutch? NO! :-) There is no serious open source competitor to nutch, but there are some commercial products. e.g. googl

RE: Good Alternatives to Nutch?

2005-11-06 Thread Paul Harrison
Groschupf [mailto:[EMAIL PROTECTED] Sent: Sunday, November 06, 2005 8:03 AM To: nutch-user@lucene.apache.org Subject: Re: Good Alternatives to Nutch? NO! :-) There is no serious open source competitor to nutch, but there are some commercial products. e.g. google Appliance or gigablast. In any cas

Re: Good Alternatives to Nutch?

2005-11-06 Thread Stefan Groschupf
solution wouldn't improve speed. Stefan On 06.11.2005 at 09:38, Victor Lee wrote: Hi, I am looking for a non-Java solution for a search engine. Are there any good non-Java alternatives to Nutch? Many thanks.

Good Alternatives to Nutch?

2005-11-06 Thread Victor Lee
Hi, I am looking for a non-Java solution for a search engine. Are there any good non-Java alternatives to Nutch? Many thanks.