Using nutch just for the crawler/fetcher

2007-04-24 Thread John Kleven
Hello, I am hoping to crawl about 3000 domains using the nutch crawler + PrefixURLFilter; however, I have no need to actually index the html. Ideally, I would just like each domain's raw html pages saved into separate directories. We already have a parser that converts the HTML into indexes for our

Nutch 0.9 recrawl

2007-04-24 Thread Annona Keene
I've been using nutch for a little while now, and the new release is great. I'm hoping someone can help me with what I'm trying to do. One of the sites I crawl is basically an archive for a mailing list. So there's lots of data that never changes, and then there are new pages every day. I'm not

Re: Query pdf, etc..

2007-04-24 Thread Lourival Júnior
To use these plugins you have to edit your conf/nutch-site.xml configuration file, including something like this: plugin.includes nutch-extensionpoints|protocol-http|language-identifier|urlfilter-regex|parse-(text|html|pdf|msword)|index-(basic|more)|query-(basic|site|url|more) Regular expression
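
As a rough illustration of how the plugin.includes value quoted above selects plugins, the sketch below tests a few plugin ids against that expression. It assumes the whole plugin id has to match the expression (modelled here with Python's re.fullmatch); the helper name is_included and the list of candidate ids are made up for this example and are not part of Nutch.

```python
import re

# Value copied from the message above; in Nutch it is set as the
# plugin.includes property in conf/nutch-site.xml.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-http|language-identifier|urlfilter-regex|"
    "parse-(text|html|pdf|msword)|index-(basic|more)|query-(basic|site|url|more)"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Hypothetical helper: True if the full plugin id matches the pattern.

    Assumption: the entire id must match, not just a prefix of it.
    """
    return re.fullmatch(pattern, plugin_id) is not None


# Example ids chosen for illustration only.
candidates = ["parse-pdf", "parse-msword", "index-more", "query-more", "parse-msexcel"]
enabled = [p for p in candidates if is_included(p)]
print(enabled)
```

Under that assumption, an id such as parse-msexcel is not selected by this value, so covering Excel documents would mean extending the parse-(...) group accordingly.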

Re: Index

2007-04-24 Thread Briggs
Perhaps someone else can chime in on this. I am not sure of exactly what you are asking. The indexing is based on Lucene. So, if you need to understand how the indexing works you will need to look into the Lucene documentation. If you are only looking to add custom fields and such to the index

Re: Query pdf, etc..

2007-04-24 Thread ekoje ekoje
I'm not sure I understand everything. I'm still a novice. How can I use index-more and query-more? Would you mind helping me? Thanks E You can use the plugins index-more and query-more to create a field on your index indicating the file type of the document. So, in your search you can use "type:

Re: Index

2007-04-24 Thread ekoje ekoje
Thanks for your help, but I think there is a misunderstanding. I was talking about creating a new index class in Java based on specific parameters that I will define. Do you know if there is any web page which can give me more information on how to implement this index in Java? E On the nutch wi

Re: Index

2007-04-24 Thread Briggs
On the nutch wiki there is this tutorial: http://wiki.apache.org/nutch/NutchHadoopTutorial There is also (it is for version 0.8, but can still work with 0.9): http://lucene.apache.org/nutch/tutorial8.html On 4/24/07, ekoje ekoje <[EMAIL PROTECTED]> wrote: Hi Guys, I would like to create a n

Re: Query pdf, etc..

2007-04-24 Thread Lourival Júnior
You can use the plugins index-more and query-more to create a field on your index indicating the file type of the document. So, in your search you can use "type:pdf" or "type:msword" to filter these files. I used nutch 0.7.2 to make it work... Regards, Lourival Júnior On 4/24/07, ekoje ekoje <[E

Index

2007-04-24 Thread ekoje ekoje
Hi Guys, I would like to create a new custom index. Do you know if there is any tutorial, document or web page which can help me ? Thanks, E

Query pdf, etc..

2007-04-24 Thread ekoje ekoje
Hi Guys, I would like to add a new button on my webpage to make an advanced search using the keywords. Once the user clicks on it, it will search for keywords only in the different PDF/WORD or Excel documents indexed. Do you know how I can filter/limit my search on PDF/WORD/EXCEL documents ? T

ExcelExtractor performance

2007-04-24 Thread Antony Bowesman
I have a 3MB xls, with 26 sheets. Half have a matrix of approx 1100xP and the others have approx 1000xE. Using the v0.9 ExcelExtractor, I left it extracting text on a reasonably powerful machine @ 100% CPU (Java 1.6). Just over 4 hours later it was still going!! I finally gave up waiting a