Thanks Andrzej, I'll check out parse-ext.
To reply to your views on PDFBox, etc. - I understand text extraction is a
hard problem to solve, because it essentially involves reversing a
type-setting procedure, although PDFBox's failure in this case is actually
not related to the inherent difficulties
On 2010-01-07 07:16, Godmar Back wrote:
> ps: upon closer examination, it seems that PDFBox is not very mature
> software; I was able to fix its parser to go past this first error I
> encountered, then discovered that it's not implementing many essential PDF
> operators.
? that's surprising, I've been
ps: upon closer examination, it seems that PDFBox is not very mature
software; I was able to fix its parser to go past this first error I
encountered, then discovered that it's not implementing many essential PDF
operators. As a result, the extracted text is pretty bad and one cannot
expect good re
On Wed, 2006-07-05 at 20:32 -0700, Stefan Groschupf wrote:
> Crawler & Co. are command line tools.
> The servletcontainer is only used to deliver search results but you
> can use the servlet that just provides XML.
Ah, excellent. Thanks for letting me avoid reading the manual ;)
> > It would be
You can start up a crawler by just creating a job. You can basically just
copy/tweak some code from the main method in org.apache.nutch.crawl.Crawl.
In my application we are working at a lower level and first create the
crawl_generate dir, then start a fetch, then we parse the fetched results,
a
Hi,
It would be nice to use the features of Nutch instead of my own hacky
stuff. How bound is Nutch to the J2EE-container? Would it be a big job
to make it run on an alternative GUI? Or is the container used for
more than GUI? I.e. do all services (crawler, etc.) run within the
container? Do
I have never looked at how Nutch works, nor have I used it. My questions
might just be RTFM-related.
Lately people have asked me to help them out with simple domain-specific
web-indexing services. The requirements are, as usual when I'm involved,
to run on very limited resources. What I did is to co
There is (or there was, depending on your point of view)
ASPseek (www.aspseek.org). This
Internet search engine project is now stalled, but it works out of
the box with Red Hat 9 and even, using the RPM (and legacy software that is
part of the distribution), with Fedora Core 3. Until, at
-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Sunday, November 06, 2005 8:03 AM
To: nutch-user@lucene.apache.org
Subject: Re: Good Alternatives to Nutch?
NO! :-) There is no serious open source competitor to Nutch, but
there are some commercial products,
e.g. Google Appliance or Gigablast.
In any cas
solution wouldn't improve speed.
Stefan
On 06.11.2005 at 09:38, Victor Lee wrote:
Hi,
I am looking for a non-java solution for search
engine. Are there any good non-java alternatives to
Nutch?
Many thanks.
Hi,
I am looking for a non-java solution for search
engine. Are there any good non-java alternatives to
Nutch?
Many thanks.