Re: Inquiring part-of-speech (POS) tagging indexing and searching

2011-05-03 Thread Grijesh
As you have seen in the example code for PartOfSpeechTaggingFilter at http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/analysis/package-summary.html, you can use a custom analyzer to inject metadata tokens into the index at the same position as the source tokens. For example, given

Re: questions about the index

2011-05-03 Thread Bernd Fehling
Hi Mike, thanks for the info. As far as I know a write.lock is created by an IndexWriter. So I have to dig into why an IndexWriter is created just on starting Solr with an optimized index. The problem is that this happens only with a huge index. Also, old parts of the index are not cleaned up. May

Re: Lucene spending alot of time in BooleanScorer2

2011-05-03 Thread Paul Taylor
On 02/05/2011 23:36, Paul Taylor wrote: Hi, I am nearing completion on a new version of a Lucene search component for the http://www.musicbrainz.org music database and am having a problem with performance. There are a number of indexes, each built from data in a database; there is one index for

Re: questions about the index

2011-05-03 Thread Bernd Fehling
Well, it is not only with a huge index. It happens only if ReplicationHandler is in use on a master. If ReplicationHandler is configured to replicateAfter startup, it first sends a commit via IndexWriter to have a stable index. The leftover of this operation is the write.lock. So removing

AW: fuzzy prefix search

2011-05-03 Thread Clemens Wyss
Sorry for coming back to my issue. Can anybody explain why my simple unit test below fails? Any hint/help appreciated. Directory directory = new RAMDirectory(); IndexWriter indexWriter = new IndexWriter( directory, new StandardAnalyzer( Version.LUCENE_31 ), IndexWriter.MaxFieldLength.UNLIMITED

Re: fuzzy prefix search

2011-05-03 Thread Ian Lea
Mer != mer. The latter will be what is indexed because StandardAnalyzer calls LowerCaseFilter. -- Ian. On Tue, May 3, 2011 at 9:56 AM, Clemens Wyss clemens...@mysign.ch wrote: Sorry for coming back to my issue. Can anybody explain why my simple unit test below fails? Any hint/help

AW: fuzzy prefix search

2011-05-03 Thread Clemens Wyss
Unfortunately lowercasing doesn't help. Also, doesn't the FuzzyQuery ignore casing? -Original Message- From: Ian Lea [mailto:ian@gmail.com] Sent: Tuesday, 3 May 2011 11:06 To: java-user@lucene.apache.org Subject: Re: fuzzy prefix search Mer != mer. The latter

Re: fuzzy prefix search

2011-05-03 Thread Ian Lea
I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong. What would be the edit distance between mer and merlot? Would it be less than 1.5, which I reckon would be the value of length(term)*0.5 as detailed in the javadocs? Seems unlikely, but I don't really know anything about the

AW: fuzzy prefix search

2011-05-03 Thread Clemens Wyss
PrefixQuery I'd like the combination of prefix and fuzzy ;-) because people could also type menlo or märl and in any of these cases I'd like to get a hit on Merlot (for suggesting Merlot) -Original Message- From: Ian Lea [mailto:ian@gmail.com] Sent: Tuesday, 3 May

Speed up payload loading?

2011-05-03 Thread Chris Bamford
Hi, I have been experimenting with using an int payload as a unique identifier, one per Document. I have successfully loaded them in using the TermPositions API with something like: public static void loadPayloadIntArray(IndexReader reader, Term term, int[] intArray, int from, int to)

AW: fuzzy prefix search

2011-05-03 Thread Biedermann,S.,Fa. Post Direkt
Have you tried Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.499f); ? Sven -Original Message- From: Clemens Wyss [mailto:clemens...@mysign.ch] Sent: Tuesday, 3 May 2011 10:57 To: java-user@lucene.apache.org Subject: AW: fuzzy prefix search Sorry for coming back to my

Re: fuzzy prefix search

2011-05-03 Thread Ian Lea
Then why not do that? Add a PrefixQuery and a FuzzyQuery to a BooleanQuery and use that. -- Ian. On Tue, May 3, 2011 at 10:25 AM, Clemens Wyss clemens...@mysign.ch wrote: PrefixQuery I'd like the combination of prefix and fuzzy ;-) because people could also type menlo or märl and in any of

Re: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-03 Thread Michael McCandless
I feel like we are back to Basic ;) If you keep running line 40 over and over on the same memory index, do you see a slowdown? Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 1:19 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, I think this describes what's going on:

AW: fuzzy prefix search

2011-05-03 Thread Biedermann,S.,Fa. Post Direkt
I had a look into the 3.0 implementation. The calculation of the similarity is 1 - (edit distance / min(string 1 length, string 2 length)), as opposed to the Levenshtein in spellchecker: 1 - (edit distance / max(string 1 length, string 2 length)). So, the similarity is 1 - ( 3 /
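
For readers following the arithmetic, here is a minimal plain-Java sketch of the two normalisations described in this post, applied to the mer / merlot example from earlier in the thread. It is not Lucene's code: the class and method names are made up for illustration, the edit distance is a textbook Levenshtein, and both inputs are assumed to be non-empty.

```java
// Sketch only: the two normalisations quoted in the post, in plain Java.
// All names here are hypothetical; this is not Lucene's FuzzyQuery implementation.
final class FuzzySimilaritySketch {

    /** Textbook Levenshtein distance: insert, delete and substitute each cost 1. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 1 - (edit distance / min(length a, length b)); assumes non-empty inputs. */
    static float similarityByMin(String a, String b) {
        return 1f - (float) editDistance(a, b) / Math.min(a.length(), b.length());
    }

    /** 1 - (edit distance / max(length a, length b)); assumes non-empty inputs. */
    static float similarityByMax(String a, String b) {
        return 1f - (float) editDistance(a, b) / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        System.out.println(editDistance("mer", "merlot"));    // 3
        System.out.println(similarityByMin("mer", "merlot")); // 0.0
        System.out.println(similarityByMax("mer", "merlot")); // 0.5
    }
}
```

Under the min-based formula a three-letter term that needs three edits scores 0, so it cannot pass any positive minimum similarity, whereas the max-based formula gives 0.5 for the same pair.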

AW: fuzzy prefix search

2011-05-03 Thread Clemens Wyss
Is this calculation intended or a bug? -Original Message- From: Biedermann,S.,Fa. Post Direkt [mailto:s.biederm...@postdirekt.de] Sent: Tuesday, 3 May 2011 12:00 To: java-user@lucene.apache.org Subject: AW: fuzzy prefix search I had a look into the 3.0 implementation

Re: The MoreLikeThisHandler could include highlighting ?

2011-05-03 Thread Koji Sekiguchi
(11/03/01 21:16), Amel Fraisse wrote: Hello, Could the MoreLikeThisHandler include highlighting? Is it correct to define a MoreLikeThisHandler like this: ? requestHandler name=/mlt class=org.apache.solr.handler.MoreLikeThisHandler lst name=defaults bool

Re: Speed up payload loading?

2011-05-03 Thread Michael McCandless
On Tue, May 3, 2011 at 5:35 AM, Chris Bamford chris.bamf...@talktalk.net wrote: Hi, I have been experimenting with using an int payload as a unique identifier, one per Document. I have successfully loaded them in using the TermPositions API with something like: public static void

AW: fuzzy prefix search

2011-05-03 Thread Biedermann,S.,Fa. Post Direkt
I don't know. But changing it now would cause trouble in many applications... For our applications we reimplemented fuzzy query so that we can pass along an org.apache.lucene.search.spell.StringDistance instance that holds the similarity algorithm of choice. -- Sven -Original

Re: AW: fuzzy prefix search

2011-05-03 Thread Otis Gospodnetic
Hi, I didn't read this thread closely, but just in case: * Is this something you can handle with synonyms? * If this is for English and you are trying to handle typos, there is a list of common English misspellings out there that you could use for this perhaps. * Have you considered n-gramming

Re: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-03 Thread Tomislav Poljak
Hi, 2011/5/3 Michael McCandless luc...@mikemccandless.com: I feel like we are back to Basic ;) If you keep running line 40 over and over on the same memory index, do you see a slowdown? Yes. I've tested running the same query list (~3.5k queries) on the same MemoryIndex instance and after a

RE: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-03 Thread Uwe Schindler
Hi, 2011/5/3 Michael McCandless luc...@mikemccandless.com: I feel like we are back to Basic ;) If you keep running line 40 over and over on the same memory index, do you see a slowdown? Yes. I've tested running the same query list (~3.5k queries) on the same MemoryIndex instance and

Anyway to not bother scoring less good matches ?

2011-05-03 Thread Paul Taylor
I'm receiving a number of searches with many ORs so that the total number of matches is huge ( 1 million) although only the first 20 results are required. Analysis shows most time is spent scoring the results. Now it seems to me if you are sending a query with 10 OR components, documents that

How to fix the number of searched terms for a field

2011-05-03 Thread harsh srivastava
Hi All, I want to know any inbuilt method in lucene that can help me to fix the number of searched terms for a given field e.g. Suppose I have given content:(text1 text2 text3 text4 text5) to search and want to limit it to 3 words only i.e. content:(text1 text2 text3) Please help. Thanks,

Re: How to fix the number of searched terms for a field

2011-05-03 Thread Erick Erickson
Why do you want to do this? I'm wondering if this is an XY problem... See: http://people.apache.org/~hossman/#xyproblem Best Erick On Tue, May 3, 2011 at 7:55 AM, harsh srivastava harshc...@gmail.com wrote: Hi All, I want to know any inbuilt method in lucene that can help me to fix the

Re: ComplexPhraseQueryParser with multiple fields

2011-05-03 Thread Chris Salem
That seems to work. Thank you! Sincerely, Chris Salem

Problem modifying Similarity class to work with lucene 3.1.0

2011-05-03 Thread Paul Taylor
How can I convert this Similarity method to use 3.1 (currently using 3.0.3)? I understand I have to replace lengthNorm() with computeNorm(), but fieldName is not a provided parameter in computeNorm() and FieldInvertState does not contain the field name either. I need the field because I

Re: Problem modifying Similarity class to work with lucene 3.1.0

2011-05-03 Thread Robert Muir
On Tue, May 3, 2011 at 9:57 AM, Paul Taylor paul_t...@fastmail.fm wrote: How can I convert this Similarity method to use 3.1 (currently using 3.0.3)? I understand I have to replace lengthNorm() with computeNorm(), but fieldName is not a provided parameter in computeNorm() and

Re: Problem modifying Similarity class to work with lucene 3.1.0

2011-05-03 Thread Paul Taylor
On 03/05/2011 15:06, Robert Muir wrote: On Tue, May 3, 2011 at 9:57 AM, Paul Taylor paul_t...@fastmail.fm wrote: How can I convert this Similarity method to use 3.1 (currently using 3.0.3)? I understand I have to replace lengthNorm() with computeNorm(), but fieldName is not a provided

Re: Problem modifying Similarity class to work with lucene 3.1.0

2011-05-03 Thread Robert Muir
On Tue, May 3, 2011 at 10:29 AM, Paul Taylor paul_t...@fastmail.fm wrote: I assume this would be the correct way to fix the code for 3.1.0 Yes, that's correct. public float computeNorm(String field, FieldInvertState state) { //This will match both artist and label aliases and is

AW: AW: fuzzy prefix search

2011-05-03 Thread Clemens Wyss
What does a simple Analyzer look like that just n-grams the docs/fields? class SimpleNGramAnalyzer extends Analyzer { @Override public TokenStream tokenStream ( String fieldName, Reader reader ) { EdgeNGramTokenFilter... ??? } } -Original Message- From: Otis Gospodnetic

Re: AW: AW: fuzzy prefix search

2011-05-03 Thread Otis Gospodnetic
Clemens, Something a la: public TokenStream tokenStream (String fieldName, Reader r) { return new EdgeNGramTokenFilter(new KeywordTokenizer(r), EdgeNGramTokenFilter.Side.FRONT, 1, 4); } Check out page 265 of Lucene in Action 2. Otis Sematext :: http://sematext.com/ :: Solr - Lucene -
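
To make the effect of the front-edge n-gram settings (minimum 1, maximum 4) concrete, the following plain-Java sketch lists the prefixes that such a filter would be expected to emit for a single token. It is an illustration only: the class and method names are invented, no Lucene classes are involved, and minGram is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: front-edge n-grams of one token, mimicking the idea behind
// EdgeNGramTokenFilter with Side.FRONT. Names are hypothetical, not Lucene API.
final class EdgeNGramSketch {

    /** Returns the prefixes of token with lengths minGram..maxGram (assumes minGram >= 1). */
    static List<String> frontEdgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // Never ask for a prefix longer than the token itself.
        int longest = Math.min(maxGram, token.length());
        for (int n = minGram; n <= longest; n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(frontEdgeNGrams("merlot", 1, 4)); // [m, me, mer, merl]
    }
}
```

With prefixes like these in the index, a short typed prefix such as "mer" can match an indexed gram directly, which is the motivation for n-gramming in a suggestion scenario.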

AW: AW: AW: fuzzy prefix search

2011-05-03 Thread Clemens Wyss
But doesn't the KeywordTokenizer extract single words out of the stream? I would like to create n-grams on the stream (field content) as it is... -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Tuesday, 3 May 2011 21:31 To:

Re: AW: AW: AW: fuzzy prefix search

2011-05-03 Thread Otis Gospodnetic
Clemens - that's just an example. Stick another tokenizer in there, like WhitespaceTokenizer, for example. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Clemens Wyss