Re: implement thai language indexing and search

2006-11-28 Thread Jérôme Charron
I used an existing ThaiAnalyzer which was in the Lucene package. I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled, and placed all class files in a jar, analysis-th.jar (do I need to bundle the ngp file in the jar as well?) 1. You don't have to refactor the Lucene analyzer.

Indexing and Re-crawling site

2006-11-28 Thread Armel T. Nene
Hi guys, I have a few questions regarding the way Nutch indexes and the best way a recrawl can be implemented. 1. Why does Nutch have to create a new index every time it indexes, when it could just merge it with the existing index? I tried to change the value in the IndexMerger cla

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12454045 ] Sami Siren commented on NUTCH-339: -- I am running with 300 threads, and in parsing mode the thread dump shows: 191 threads waiting on condition at java.lang.Thread.slee

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453989 ] Andrzej Bialecki commented on NUTCH-339: - Ah, we are getting somewhere ... fetchQueues.totalSize=0 means that all input entries from the queues have been p

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453975 ] Sami Siren commented on NUTCH-339: -- perhaps that exception is just a consequence of something else, like this: 2006-11-27 07:35:09,434 INFO fetcher.Fetcher2 - -a

Re: updating index without refetching

2006-11-28 Thread DS jha
The new field's data is also stored as metadata: the value is assigned during the parse process, and the index step then reads the metadata field value and adds it to the index. It looks like I will have to run parse and index again. Thanks much. On 11/28/06, Gal Nitzan <[EMAIL PROTECTED]> wrote: Hi, You

[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-28 Thread Chris A. Mattmann (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453934 ] Chris A. Mattmann commented on NUTCH-407: - I'm not entirely sure what the right answer to this is. One thing that I do know is that a colleague at my own wor

[jira] Commented: (NUTCH-407) Make Nutch crawling parent directories for file protocol configurable

2006-11-28 Thread Alan Tanaman (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-407?page=comments#action_12453932 ] Alan Tanaman commented on NUTCH-407: In our team we feel that this patch would have been beneficial in practical terms. In the context of the enterprise intell

RE: updating index without refetching

2006-11-28 Thread Gal Nitzan
Hi, You do not mention whether the new field's data is stored as metadata. Is the value created during parse, or is it added only during the index phase? If your new field is created during the parse process, then you could delete only the parse folders and run the parse process, i.e. (del

updating index without refetching

2006-11-28 Thread DS jha
Hi All, Is it possible to update the index without refetching everything? I have changed the logic of one of my plugins (which also sets a custom field in the index), and I would like this field to get updated without refetching everything. Is it doable? Thanks,

[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-11-28 Thread Sean Dean (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ] Sean Dean commented on NUTCH-233: - Could I suggest that this change, from ".*(/.+?)/.*?\1/.*?\1/" to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" be committed to at least trunk fo
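
To make the two expressions quoted in the comment above easier to compare, here is a minimal sketch that runs both against a few made-up URLs using java.util.regex. The class name, method names, and example URLs are hypothetical and not part of Nutch, and the use of find() (rather than a full-string match) is an assumption about how such a filter rule would be applied.

```java
import java.util.regex.Pattern;

// Illustrative only: class, method names, and URLs are hypothetical, not Nutch code.
class RepeatedSegmentDemo {
    // Expression quoted in the comment as the one to be replaced.
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    // Expression quoted in the comment as the suggested replacement.
    static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: the rule fires when the expression is found anywhere in the URL.
    static boolean oldMatches(String url) { return OLD.matcher(url).find(); }
    static boolean newMatches(String url) { return NEW.matcher(url).find(); }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a/b/a/c/a/d/",   // "/a" repeats with one segment in between
            "http://example.com/a/b/c/d/",       // no repeated segment
            "http://example.com/a/b/c/a/x/a/"    // "/a" repeats, but with two segments in between once
        };
        for (String url : urls) {
            System.out.println(url + " old=" + oldMatches(url) + " new=" + newMatches(url));
        }
    }
}
```

In this sketch the suggested expression replaces the reluctant wildcards with [^/]+, so the captured group and each gap are limited to a single path segment. The two expressions are therefore not equivalent: the third URL is matched by the original expression but not by the suggested one.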

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-11-28 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453820 ] Andrzej Bialecki commented on NUTCH-339: - This looks weird, if anything it rather seems caused by a bug in Hadoop - are you able to run "readseg -dump" on