Re: Multi language indexing

2007-05-08 Thread bhecht
Hi Doron, Thank you very much for your time and for the detailed explanations. This is exactly what I meant and I am happy to see I understood correctly. I am now using the Snowball which seems to work very well. Thanks again and good day, Barak Hecht. -- View this message in context: htt

Is it necessary to optimize?

2007-05-08 Thread Stadler Hans-Christian
If mergeFactor is set to 2 and no optimize() is ever done on the index, what is the impact on 1) the number of opened files during indexing 2) the number of opened files during searching 3) the search speed 4) the indexing speed ?? HC ---

Re: Is it necessary to optimize?

2007-05-08 Thread Aleksander M. Stensby
I would say that over time the number of files will grow, and continue growing if you never perform an optimize(). After some very helpful mails from Erick I settled on a mergeFactor of 30, and since I do the indexing in large batches, I perform an optimize() only at the end of the indexi

Re: Keyphrase Extraction

2007-05-08 Thread Bill Janssen
Dawid Weiss wrote: > You could also try splitting the document into paragraphs and use Carrot2's > Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters. > Labelling routine in Lingo should extract 'key' phrases; this analysis is > heavily frequency-based, but... you know, y

Re: Questions regarding Lucene query syntax

2007-05-08 Thread Daniel Einspanjer
On 5/7/07, Doron Cohen <[EMAIL PROTECTED]> wrote: With a query parser set to allowLeadingWildcard, this should do: ( +item -price:* ) ( +item +price:[0100 TO 0150] ) or, to avoid too-many-clauses risk: ( +item -price:[MIN TO MAX]) ( +item +price:[0100 TO 0150] ) where MIN and MAX cover (at least)
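The range [0100 TO 0150] in the quoted query compares terms as strings, so it only behaves numerically if every price is indexed with the same zero-padded width. A minimal sketch of that padding, assuming non-negative integer prices and a width of four digits; the class and the `pad` helper are hypothetical names, not part of Lucene:

```java
import java.util.Locale;

final class PricePadding {
    // Hypothetical helper: left-pads a non-negative integer with zeros to a fixed width.
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Padded terms sort in numeric order when compared as strings.
        System.out.println(pad(99, 4).compareTo(pad(100, 4)) < 0);   // true
        // Unpadded terms do not: "99" sorts after "100".
        System.out.println("99".compareTo("100") < 0);               // false
    }
}
```

The same padding has to be applied both when the price field is indexed and when the range bounds are written into the query.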

RE: Keyphrase Extraction

2007-05-08 Thread Vishal Shah
Hi Arsen, I've seen another commercial one from a company called Connexor (www.connexor.com) . It has a decent part-of-speech tagger that could be used in keyphrase extraction with some heuristics on top of it. -vishal. -Original Message- From: Mark Miller [mailto:[EMAIL PROTECTED] S

Re: Is it necessary to optimize?

2007-05-08 Thread Grant Ingersoll
The contrib/benchmark addition can help you characterize many of these scenarios, especially if you write a DocMaker and QueryMaker for your collection. On May 8, 2007, at 5:30 AM, Stadler Hans-Christian wrote: If mergeFactor is set to 2 and no optimize() is ever done on the index, what i

Doubt in FuzzyQuery

2007-05-08 Thread sccarrera
Last week I sent a message with a doubt about a FuzzyQuery. I am working with Lucene 2.1.0. I would like to retrieve files containing the strings "société américaine" and "sociétés américaines" from a fuzzy query for the string "société américain". I created a method "getDocuments" and I

Re: Keyphrase Extraction

2007-05-08 Thread José Ramón Pérez Agüera
here you have a very good tool for Keyphrase Extraction. It is GNU and easy to integrate in Lucene. http://www.paynter.info/academia/Kea.php best jose On 5/8/07, Bill Janssen <[EMAIL PROTECTED]> wrote: Dawid Weiss wrote: > You could also try splitting the document into paragraphs and use Carr

Automatic analyzer resolving based on Locale

2007-05-08 Thread Geoffrey De Smet
I have a use case, in which I need to select the Analyzer based on a Locale. For example: "nl" => DutchAnalyzer "nl_BE" => DutchAnalyzer "fr" => FrenchAnalyzer "foobar" => StandardAnalyzer (fallback) I was wondering if lucene has any sort of "AutomaticAnalyzerResolver" class that could do this f
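The mapping described above can be hand-rolled as a small lookup with fallback. This is only a sketch: `LocaleResolver` is a hypothetical class, not something Lucene ships, and the demo uses plain strings as stand-ins where a real application would register factories for DutchAnalyzer, FrenchAnalyzer and StandardAnalyzer. It tries "language_COUNTRY" first, then the bare language, then the fallback.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

final class LocaleResolver<T> {
    private final Map<String, Supplier<T>> byLocale = new HashMap<>();
    private final Supplier<T> fallback;

    LocaleResolver(Supplier<T> fallback) {
        this.fallback = fallback;
    }

    // Registers a factory under a key such as "nl" or "nl_BE".
    LocaleResolver<T> register(String key, Supplier<T> factory) {
        byLocale.put(key, factory);
        return this;
    }

    // Most specific match wins: language_COUNTRY, then language, then the fallback.
    T resolve(Locale locale) {
        String language = locale.getLanguage();
        String country = locale.getCountry();
        Supplier<T> factory = null;
        if (!country.isEmpty()) {
            factory = byLocale.get(language + "_" + country);
        }
        if (factory == null) {
            factory = byLocale.get(language);
        }
        if (factory == null) {
            factory = fallback;
        }
        return factory.get();
    }

    public static void main(String[] args) {
        // Strings stand in for analyzer instances in this sketch.
        LocaleResolver<String> resolver = new LocaleResolver<String>(() -> "StandardAnalyzer")
                .register("nl", () -> "DutchAnalyzer")
                .register("fr", () -> "FrenchAnalyzer");
        System.out.println(resolver.resolve(Locale.forLanguageTag("nl-BE"))); // DutchAnalyzer
        System.out.println(resolver.resolve(Locale.FRENCH));                  // FrenchAnalyzer
        System.out.println(resolver.resolve(Locale.GERMAN));                  // StandardAnalyzer
    }
}
```

Using suppliers rather than shared instances keeps the sketch neutral on whether analyzers are reused; whichever choice is made, the same analyzer must be selected at index time and at search time for a given field.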

Re: Automatic analyzer resolving based on Locale

2007-05-08 Thread Erick Erickson
There is nothing canned that I know of. I'm also not sure how this would be used. If you're using a single index, how are you going to index, then search using these analyzers? Or is there some other magic going on? Consider your document with a field "text". If you index into this field with dif

Re: Is it necessary to optimize?

2007-05-08 Thread Otis Gospodnetic
Hi, - Original Message From: Stadler Hans-Christian <[EMAIL PROTECTED]> If mergeFactor is set to 2 and no optimize() is ever done on the index, what is the impact on 1) the number opened files during indexing OG: it will grow a little, but frequently go down as Lucene merges segments
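The "grows a little, but frequently goes down" behaviour can be illustrated with a deliberately simplified model. It assumes every flush creates one new segment and that whenever mergeFactor segments of the same level exist they are merged into one segment of the next level. It ignores document counts, compound files and the number of files per segment, so it shows only the shape of the curve, not real file counts. `SegmentModel` is a hypothetical name.

```java
final class SegmentModel {
    // Number of segments left after the given number of flushes under the simplified model.
    static int segmentsAfter(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        int[] perLevel = new int[32]; // segment count at each merge level
        for (int i = 0; i < flushes; i++) {
            perLevel[0]++; // a flush adds one level-0 segment
            // Cascade: mergeFactor segments on one level collapse into one on the next.
            for (int level = 0; level < perLevel.length - 1 && perLevel[level] == mergeFactor; level++) {
                perLevel[level] = 0;
                perLevel[level + 1]++;
            }
        }
        int total = 0;
        for (int count : perLevel) {
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int flushes = 1; flushes <= 8; flushes++) {
            System.out.println(flushes + " flushes, mergeFactor 2: "
                    + segmentsAfter(flushes, 2) + " segments");
        }
        System.out.println("29 flushes, mergeFactor 30: " + segmentsAfter(29, 30) + " segments");
        System.out.println("30 flushes, mergeFactor 30: " + segmentsAfter(30, 30) + " segments");
    }
}
```

In this model a mergeFactor of 2 keeps the segment count very low (3 after seven flushes, 1 after eight) at the cost of merging very often, while a mergeFactor of 30 lets up to 29 segments accumulate on a level before a merge happens.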

Re: Automatic analyzer resolving based on Locale

2007-05-08 Thread Chris Hostetter
: There is nothing canned that I know of. I'm also not sure how this : would be used. If you're using a single index, how are you going : to index, then search using these analyzers? Or is there some : other magic going on? i suspect the use case is "shipped" software product, where you want to h

search problem/odd results

2007-05-08 Thread John Powers
I don't understand why I'm getting the results I'm getting. If I search for "pandock*" I get 6 results Np-pandock Np-pandock-L Np-pandock-1 Np-pandock-2 Np-pandock Np-pandock-L1 If I search for np-pandock I get Np-pandock Np-pandock-L If I search for pandock I get Np-pandock

Re: search problem/odd results

2007-05-08 Thread Erick Erickson
First question: What analyzers are you using at index and search time? Second question: Have you tried using query.toString() to see how the query parses? If so, you should post the results. Third question: Have you used Luke to examine your index, to see what's actually in there (which may surp

Re: Keyphrase Extraction

2007-05-08 Thread Bob Carpenter
Mark Miller wrote: The only commercial options that I have seen do not have a web presence (that I know of or can find) and I don't recall the company names (only peripherally involved). Are we talking about Yahoo's buzz index and Amazon's SIPs or CAPs? I actually think the most interesting a

RE: search problem/odd results

2007-05-08 Thread John Powers
Here are the queries In my code I have: System.out.println("luceneQuery: " + luceneQuery); query = MultiFieldQueryParser.parse(luceneQuery.toString(), IndexerExternal.CARTABLE, IndexerExternal.getAnalyzer()); System.out.printl

Re: search problem/odd results

2007-05-08 Thread Daniel Naber
On Tuesday 08 May 2007 23:42, John Powers wrote: > I've had problems with luke in the past not being able to read the > files. Just make sure you specify the directory, not the files when opening an index with Luke. Also use the latest version (0.7). Regards Daniel -- http://www.danielnaber

Periodic Indexing DESIGN QUESTION

2007-05-08 Thread Ram Peters
I am indexing documents periodically every hour. I have a scenario. For example, when you are indexing every hour and a large document set is present, it takes >1 hr to index the documents. Now you are already behind on indexing for the next hour. How do you design something that is robust? thanks.

Re: Periodic Indexing DESIGN QUESTION

2007-05-08 Thread Erick Erickson
Don't do it that way ? Is this an actual or theoretical scenario? And do you reasonably expect it to become actual? Otherwise, why bother? And you've got other problems here. If you're indexing that much data, you'll soon outgrow your disk. Unless you're replacing most of the documents. But assu

Re: Periodic Indexing DESIGN QUESTION

2007-05-08 Thread Chris Hostetter
: For example, when you are indexing every hour and large document set : is present, it takes >1 hr to index the documents. Now you are : already behind indexing for the next hour. How do you design : something that is robust? fundamentally, this question is really about issues in a producer/con
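Assuming the truncated reply above is pointing at the producer/consumer pattern, one way to decouple document arrival from indexing is a bounded queue: whatever collects documents puts them on the queue, and a single indexing thread drains it continuously instead of running on an hourly schedule. A minimal sketch using only java.util.concurrent; `IndexingPipeline` is a hypothetical name, and the "indexing" step just records documents in a list where a real consumer would call into Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class IndexingPipeline {
    // Identity-compared sentinel that tells the consumer no more documents will arrive.
    private static final String END = new String("END");

    // Feeds documents through a bounded queue to one consumer thread and
    // returns them in the order the consumer processed them.
    static List<String> run(List<String> incoming, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<String> indexed = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (String doc = queue.take(); doc != END; doc = queue.take()) {
                    indexed.add(doc); // stand-in for adding the document to the index
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            for (String doc : incoming) {
                queue.put(doc); // blocks while the queue is full (back-pressure on the producer)
            }
            queue.put(END);
            consumer.join(); // join also makes the consumer's writes to the list visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while feeding the indexer", e);
        }
        return indexed;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("doc-1", "doc-2", "doc-3"), 2));
    }
}
```

The queue capacity bounds how far the producer can run ahead. If documents arrive faster than they can be indexed for long periods, the producer blocks, which makes the backlog visible instead of letting hourly runs silently overlap.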

Lock obtain timed out while searching

2007-05-08 Thread Laxmilal Menaria
Hello everyone, I am getting the following exception while searching: Lock obtain timed out: java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]:\WINDOWS\TEMP\lucene-22e0ad3c019e26a6e2991b0e6ed97e1c-commit.lock I have implemented MultiSearcher only, No other methods are updating/addi

Scoring results?!

2007-05-08 Thread supereric
How can I get the tag word score in Lucene? Suppose that you have searched for a tag word and 3 hit documents are found. 1 - How can one find the number of occurrences in each document so that the results can be sorted? Also I want to have some other policies for ranking the results. What should I do t