Hello,
I am hoping to crawl about 3000 domains using the Nutch crawler +
PrefixURLFilter; however, I have no need to actually index the HTML.
Ideally, I would just like each domain's raw html pages saved into separate
directories. We already have a parser that converts the HTML into indexes
for our
I've been using nutch for a little while now, and the new release is great. I'm
hoping someone can help me with what I'm trying to do.
One of the sites I crawl is basically an archive for a mailing list. So there's
lots of data that never changes, and then there are new pages every day. I'm
not
To use these plugins you have to edit your conf/nutch-site.xml configuration
file, including something like this:
plugin.includes
nutch-extensionpoints|protocol-http|language-identifier|urlfilter-regex|parse-(text|html|pdf|msword)|index-(basic|more)|query-(basic|site|url|more)
Regular expression
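
As a rough illustration of how a plugin.includes value like the one above selects plugins, the sketch below tests a few plugin ids against it. It uses Python's re module as a stand-in for the Java regex engine, and it assumes the pattern must match the whole plugin id; that matching rule is an assumption, not something stated in this thread. The helper name is_included is hypothetical, and parse-msexcel is used only as an example of an id the value above does not list.

```python
import re

# The plugin.includes value quoted in the message above.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-http|language-identifier|urlfilter-regex|"
    "parse-(text|html|pdf|msword)|index-(basic|more)|query-(basic|site|url|more)"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if plugin_id matches the pattern in full.

    Assumption: the whole plugin id has to match, not just a prefix.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    for plugin in ("index-more", "query-more", "parse-pdf", "parse-msexcel"):
        status = "included" if is_included(plugin) else "not included"
        print(plugin, "->", status)
```

Under that assumption, index-more and query-more are selected by the value above, while any id not listed in it would have to be added to the expression first.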
Perhaps someone else can chime in on this. I am not sure of exactly
what you are asking. The indexing is based on Lucene. So, if you need
to understand how the indexing works you will need to look into the
Lucene documentation. If you are only looking to add custom fields
and such to the index
I'm not sure I understand everything; I'm still a novice.
How can I use index-more and query-more?
Would you mind helping me?
Thanks
E
You can use the plugins index-more and query-more to create a field on your
index indicating the file type of the document. So, in your search you can
use "type:
Thanks for your help, but I think there is a misunderstanding. I was talking
about creating a new index class in Java based on specific parameters that I
will define.
Do you know if there is any web page which can give me more information on how
to implement this index in Java?
E
On the nutch wiki there is this tutorial:
http://wiki.apache.org/nutch/NutchHadoopTutorial
There is also (it is for version 0.8, but can still work with 0.9):
http://lucene.apache.org/nutch/tutorial8.html
On 4/24/07, ekoje ekoje <[EMAIL PROTECTED]> wrote:
Hi Guys,
I would like to create a n
You can use the plugins index-more and query-more to create a field on your
index indicating the file type of the document. So, in your search you can
use "type:pdf" or "type:msword" to filter these files. I used nutch 0.7.2 to
make it work...
Regards,
Lourival Júnior
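
A minimal sketch of the filtering idea described above: append a "type:" clause to the user's keywords before sending the query to the search front end. The function name typed_queries is hypothetical, and only the two type values mentioned in the message ("pdf" and "msword") are used. Whether several type clauses can be combined in a single query is not stated in this thread, so the sketch simply builds one query string per type.

```python
def typed_queries(keywords, types=("pdf", "msword")):
    """Build one query string per document type.

    keywords: the raw text typed by the user.
    types: values for the "type:" field; the defaults are the two
           values quoted in the message above.
    """
    # Collapse repeated whitespace so the resulting query is tidy.
    terms = " ".join(keywords.split())
    if not terms:
        raise ValueError("keywords must not be empty")
    return [f"{terms} type:{doc_type}" for doc_type in types]


if __name__ == "__main__":
    for query in typed_queries("nutch  crawler"):
        print(query)
```

The results of the individual queries would then have to be merged by the calling page, which is left out here.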
On 4/24/07, ekoje ekoje <[E
Hi Guys,
I would like to create a new custom index.
Do you know if there is any tutorial, document or web page which can help me?
Thanks,
E
Hi Guys,
I would like to add a new button on my webpage to make an advanced search
using the keywords.
Once the user clicks on it, it will search for keywords only in the
different PDF/Word or Excel documents indexed.
Do you know how I can filter/limit my search to PDF/Word/Excel documents?
T
I have a 3MB xls, with 26 sheets. Half have a matrix of approx 1100xP and the
others have approx 1000xE.
Using the v0.9 ExcelExtractor, I left it extracting text on a reasonably
powerful machine @ 100% CPU (Java 1.6). Just over 4 hours later it was still
going!!
I finally gave up waiting a