Re: document markup to control indexing

2005-12-28 Thread Jeff Breidenbach
I'm so new to Nutch that I wasn't sure yet how to tie the feature into a configuration file, but here's the first pass hardcoded version that seems to do ok. At least on the perfectly clean data that I've been feeding it. Probably blows up if someone forgets their tag. I'd definitely like to see

Re: document markup to control indexing

2005-12-28 Thread Jack Tang
Hi, I am sorry, it should be the getTextHelper() method. Say I want to index the content in this block: This is not Ads. The code may look like this: boolean contentStart; boolean contentEnd; if (node.getNodeType() == Node.COMMENT_NODE) { // you can move the value to your configuration file.
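
The comment-node check in the preview above can be sketched as a small standalone program. This is only an illustration under stated assumptions, not Nutch's DOMContentUtils code: the class name MarkedTextExtractor and the marker strings "index-start" and "index-end" are hypothetical (the real markers are not visible in the preview), and it uses the JDK's XML parser on well-formed markup rather than Nutch's HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects only the text that sits between two marker comments.
final class MarkedTextExtractor {
    private final String startMarker; // assumed marker text, e.g. "index-start"
    private final String endMarker;   // assumed marker text, e.g. "index-end"
    private final StringBuilder out = new StringBuilder();
    private boolean inside = false;   // true while we are between the two markers

    private MarkedTextExtractor(String startMarker, String endMarker) {
        this.startMarker = startMarker;
        this.endMarker = endMarker;
    }

    // Parses well-formed markup and returns the text found between the marker comments.
    static String extract(String markup, String startMarker, String endMarker) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            MarkedTextExtractor extractor = new MarkedTextExtractor(startMarker, endMarker);
            extractor.walk(doc.getDocumentElement());
            return extractor.out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    // Depth-first walk in document order; comment nodes toggle the "inside" flag.
    private void walk(Node node) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            String comment = node.getNodeValue().trim();
            if (comment.equals(startMarker)) {
                inside = true;
            } else if (comment.equals(endMarker)) {
                inside = false;
            }
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (inside && !text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i));
        }
    }
}
```

For a page such as `<body><p>Buy now</p><!-- index-start --><p>This is not Ads</p><!-- index-end --></body>`, calling `MarkedTextExtractor.extract(page, "index-start", "index-end")` returns `This is not Ads`. A missing end marker simply leaves the flag set until the end of the document, so real input would need a policy for that case.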

Re: Clustering Index job

2005-12-28 Thread Byron Miller
Check the list for my earlier discussions. There are tweaks you can do to enhance the performance if you have available memory resources. How large are the segments that you are indexing? What file system do you use? What OS/JVM are you building your index on? -byron --- "R.Mayoran" <[EMAIL PR

Re: Setting Search over NDFS

2005-12-28 Thread Byron Miller
I would recommend that you search the list for some great discussions on NDFS. Doug has a nice writeup of his vision of using a map reduce job to push the indexes to your query servers so they're updated as the webdb is and managed that way. NDFS just wasn't designed for the I/O of a query. You wa

Re: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Andrzej Bialecki
Teruhiko Kurosaka wrote: Andrzej, Thank you for explanation. No, in this case, if "web" and "services" were added to common-grams.utf8, the result would look like: web|web-services, services|services-is, cool where | marks tokens indexed at the same position in the index. I guess

RE: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Teruhiko Kurosaka
Andrzej, Thank you for explanation. > No, in this case, if "web" and "services" were added to > common-grams.utf8, the result would look like: > > web|web-services, services|services-is, cool > > where | marks tokens indexed at the same position in the index. I guess you meant common-terms.utf

Re: document markup to control indexing

2005-12-28 Thread cilquirm . 20552126
Hi, I asked this question a while back and didn't get a response, so I rolled my own parse solution using jericho-html and applying it to the HTMLParseFilter extension point. I just took a look at the getText() method of the DOMContentUtils class and I don't see any way to add your own custo

Re: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Andrzej Bialecki
Teruhiko Kurosaka wrote: I thought n-grams are used for language identification only but I see they are used in another area. In the source code of CommonGrams and the API doc: http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/CommonGrams.html I see (tokens representing) n-gram

Why does Nutch use n-grams in analysis?

2005-12-28 Thread Teruhiko Kurosaka
I thought n-grams are used for language identification only but I see they are used in another area. In the source code of CommonGrams and the API doc: http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/CommonGrams.html I see (tokens representing) n-grams are "inserted" to the toke

Re: Setting Search over NDFS

2005-12-28 Thread Gal Nitzan
So I just use the ndfs command to download the relevant files from NDFS, put them on the search server, and from there follow the sample on your documentation project? Thanks for all the help. P.S. Do you have a clear view for the solution to the "slowness in search over NDFS"? if so I wou

Re: Setting Search over NDFS

2005-12-28 Thread Stefan Groschupf
There will be a solution soon, if I find some more time. Until then, for smaller installations you need a shell script that downloads the index and segment to the box that runs the search server. You can also move the index from NDFS to local instead of copying it. Check "bin/nutch ndfs" for docum

Re: Setting Search over NDFS

2005-12-28 Thread Gal Nitzan
Whoa, that was fast... So all in all you would need two sets of the same data? Did I understand there is an effort to improve the "poor performance" issue? And if we are at it, would you care to explain how to download the index to local and what happens if the data is growing over the boundar

Re: Setting Search over NDFS

2005-12-28 Thread Stefan Groschupf
Download the index to a local file system. On 28.12.2005 at 14:25, Gal Nitzan wrote: Hi, If using search over NDFS is too slow, then what is the alternative when all your data is in NDFS? Thanks, Gal

Setting Search over NDFS

2005-12-28 Thread Gal Nitzan
Hi, If using search over NDFS is too slow, then what is the alternative when all your data is in NDFS? Thanks, Gal

Re: Clustering Index job

2005-12-28 Thread Stefan Groschupf
Yes, you need to use map reduce on several boxes. Anyway, 100 million files will also work on a powerful box. There are some configuration values in the nutch-default.xml that can improve indexing speed. On 28.12.2005 at 09:56, R.Mayoran wrote: Hi, I need to index about 100 million files. Is it

Re: Is any one able to successfully run Distributed Crawl?

2005-12-28 Thread Nutch Newbie
Hi, I have had no problem doing distributed crawl. On 12/28/05, Pushpesh Kr. Rajwanshi <[EMAIL PROTECTED]> wrote: > Hi NN, > > Thanks for replying. Actually, I wanted to know if distributed crawling in > Nutch is working fine and to what success? Like I am successful in setting > up distributed

Re: Is any one able to successfully run Distributed Crawl?

2005-12-28 Thread Pushpesh Kr. Rajwanshi
Hi NN, Thanks for replying. Actually, I wanted to know if distributed crawling in Nutch is working fine and to what success? Like I am successful in setting up distributed crawl for 2 machines (1 master and 1 slave) but when I try with more than two machines there seems to be a problem, especially while i

Clustering Index job

2005-12-28 Thread R.Mayoran
Hi, I need to index about 100 million files. Is it possible to cluster this job? Are there any suggestions to increase the speed of indexing? Thank you in advance. Mayu.

Re: Trouble setting NDFS on multiple machines

2005-12-28 Thread Nutch Newbie
I had a very similar problem with JDK 1.5. Also, when I worked with only one data node the problem didn't occur. Thanks On 12/28/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote: > Interesting! > That is not a feature, that is a bug; maybe you can open a minor bug > report. > Thanks. > Stefan > On 28.12