+1 on Nutch!
On Fri, Jan 21, 2011 at 4:11 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi,
Please take a look at Apache Nutch. It can crawl through a file system over
FTP.
After crawling, it can use Tika to extract the content from your PDF files and
other formats. Finally you can then send
On Mon, Jan 24, 2011 at 8:15 AM, Adam Estrada estrada.a...@gmail.com wrote:
+1 on Nutch!
[...]
Would it be possible for Markus and you to clarify
what the advantages of Nutch are in crawling a
well-defined filesystem hierarchy? A simple shell script
that POSTs to Solr works fine for this, so
I'd be happy to comment:
A simple shell script doesn't provide URL filtering and control of how you
crawl those documents on the local file system. Nutch has several levels of URL
filtering based on regex, MIME type, and others. Also, if there are any
outlinks in those local files that point
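To make the filtering point concrete, here is a minimal sketch of first-match regex URL filtering in Python. It is only an illustration of the idea, not Nutch's code or configuration; the rule list, the host `ftp.example.org`, and the function names are all hypothetical.

```python
import re

# Hypothetical rules: '+' accepts, '-' rejects, and the first rule
# whose pattern matches the URL decides the outcome.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    ("+", re.compile(r"^ftp://ftp\.example\.org/docs/")),
]


def accept(url, rules=RULES):
    """Return True if the first matching rule is '+'; reject if no rule matches."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


def filter_urls(urls, rules=RULES):
    """Keep only the URLs that pass the rule list, preserving order."""
    return [u for u in urls if accept(u, rules)]
```

A plain shell loop over files would need this kind of logic added by hand, which is the sort of control being described here.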
On Mon, Jan 24, 2011 at 11:07 AM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
I'd be happy to comment:
A simple shell script doesn't provide URL filtering and control of how you
crawl those documents on the local file system. Nutch has several levels of
URL filtering based
Hi Gora,
Thanks for the answer. I want to index all the PDF and HTML documents
lying within a tree hierarchy on an FTP server.
In addition, can I add an attribute, location, whose value is the FTP
file location?
If you can give me a sample configuration, that would be great.
/
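As a rough illustration of the `location` attribute asked about above, the sketch below builds a JSON update payload in which each document carries its FTP URL. It assumes the Solr schema defines `id`, `location`, and `text` fields and that the server accepts JSON updates at whatever `update_url` you pass in; the field names, helper names, and URL are assumptions, not a tested configuration.

```python
import json
import urllib.request


def make_doc(ftp_url, text):
    """Build one document; 'location' holds the FTP file location (assumed field)."""
    return {"id": ftp_url, "location": ftp_url, "text": text}


def to_update_payload(docs):
    """Serialize documents as a JSON array, encoded as UTF-8 bytes."""
    return json.dumps(list(docs)).encode("utf-8")


def post_to_solr(update_url, docs):
    """POST the payload to a Solr update URL (hypothetical; needs a running server)."""
    req = urllib.request.Request(
        update_url,
        data=to_update_payload(docs),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The extracted text would come from whatever extraction step you use (Tika is the one suggested in this thread); here it is just passed in as a string.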
On Fri, Jan 21, 2011 at 1:31 PM, pankaj bhatt panbh...@gmail.com wrote:
Hi Gora,
Thanks for the answer. I want to index all the PDF and HTML documents
lying within a tree hierarchy on an FTP server.
In addition, can I add an attribute, location, whose value is the FTP
file location?
Hi Gora,
Thanks. However, I think it would be a cumbersome process to do all
this manually.
Isn't there any plugin or extractor that does this automatically?
Has anyone in the group done this previously?
/ Pankaj Bhatt.
On Fri, Jan 21, 2011 at 1:41 PM, Gora Mohanty
On Fri, Jan 21, 2011 at 1:47 PM, pankaj bhatt panbh...@gmail.com wrote:
Hi Gora,
Thanks. However, I think it would be a cumbersome process to do all
this manually.
Isn't there any plugin or extractor that does this automatically?
Has anyone in the group done this previously?
Hi,
Please take a look at Apache Nutch. It can crawl through a file system over FTP.
After crawling, it can use Tika to extract the content from your PDF files and
other formats. Finally, you can send the data to your Solr server for indexing.
http://nutch.apache.org/
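For a sense of what the crawl step involves, here is a small hand-rolled sketch using Python's standard `ftplib`. It is not how Nutch does it; it only walks a directory tree over FTP and collects the URLs of PDF and HTML files. It assumes the server supports the MLSD command, and the host name and helper names are hypothetical.

```python
from ftplib import FTP
import posixpath

# File extensions to keep (assumption: only PDF and HTML are of interest).
WANTED = (".pdf", ".html", ".htm")


def walk_ftp(ftp, root):
    """Yield paths of wanted files under root, depth-first, using MLSD listings."""
    for name, facts in ftp.mlsd(root):
        if name in (".", ".."):
            continue
        path = posixpath.join(root, name)
        if facts.get("type") == "dir":
            yield from walk_ftp(ftp, path)
        elif facts.get("type") == "file" and name.lower().endswith(WANTED):
            yield path


def ftp_urls(host, ftp, root="/"):
    """Return ftp:// URLs for every wanted file found under root."""
    return [f"ftp://{host}{path}" for path in walk_ftp(ftp, root)]
```

With a real server this would be used roughly as `ftp = FTP("ftp.example.org"); ftp.login(); ftp_urls("ftp.example.org", ftp)`. Retries, politeness, and incremental recrawls are left out, and those are the parts a crawler such as Nutch handles for you.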
Hi All,
Is there any way in Solr, or any plug-in, through which the folders and
documents in an FTP location can be indexed?
/ Pankaj Bhatt.
On Fri, Jan 21, 2011 at 12:21 PM, pankaj bhatt panbh...@gmail.com wrote:
Hi All,
Is there any way in Solr, or any plug-in, through which the folders and
documents in an FTP location can be indexed?
[...]
What format are these documents in? Which parts of the documents
do you want to index?
In