Re: Removing terms in the Index

2010-04-14 Thread Shai Erera
I'm still not sure I understand ... If the first document includes "Lucene in Action. Lucene" (two sentences, the 2nd one with Lucene only) and the second "Lucene for Dummies", then what exactly do you want to get for the queries "\"Lucene in Action\"" and "\"Lucene\""? If I understand correctly,

Re: trying to resolve error: after flush: fdx size mismatch

2010-04-14 Thread Michael McCandless
Not good! Can you describe how your threads work with Lucene? Is this just a local filesystem (disk) under Vista? Mike On Wed, Apr 14, 2010 at 7:41 AM, jm wrote: > Hi, > > I am trying to chase an issue in our code and it is being quite > difficult. We have seen two instances (see below) where

Re: Utility program to extract a segment

2010-04-14 Thread Michael McCandless
I don't think there's an existing tool, but it shouldn't be too hard to create. Create a new SegmentInfos(), then call its .read(oldDir) to read all segments. Look up the SegmentInfo(s) you want to copy and call their .files() methods to see which files to copy. Copy them. Remove all other segm
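A rough sketch of such a tool along the lines described above, written against the 2.9/3.0 API (SegmentInfos.read, SegmentInfo.files()); the class name and the plain-java file copy are illustrative, and writing the new segments_N file for the target index is left out, since that step is cut off in the reply:

```java
import java.io.*;
import java.util.*;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.FSDirectory;

// Hypothetical "extract segments" sketch: copy the files of the wanted segments only.
public class ExtractSegments {
  public static void main(String[] args) throws IOException {
    Set<String> wanted = new HashSet<String>(Arrays.asList(args[0].split(",")));  // e.g. "_ab,_g9"
    File oldIndex = new File(args[1]);
    File newIndex = new File(args[2]);
    newIndex.mkdirs();

    SegmentInfos infos = new SegmentInfos();
    infos.read(FSDirectory.open(oldIndex));        // read all segments of the old index

    for (int i = 0; i < infos.size(); i++) {
      SegmentInfo si = infos.info(i);
      if (!wanted.contains(si.name)) continue;
      for (String file : si.files()) {             // files belonging to this segment
        copy(new File(oldIndex, file), new File(newIndex, file));
      }
    }
    // Still missing: build a SegmentInfos holding only the wanted SegmentInfo(s)
    // and write it to newIndex as segments_N, as the reply goes on to describe.
  }

  private static void copy(File src, File dst) throws IOException {
    InputStream in = new FileInputStream(src);
    OutputStream out = new FileOutputStream(dst);
    byte[] buf = new byte[64 * 1024];
    for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
    in.close();
    out.close();
  }
}
```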

Utility program to extract a segment

2010-04-14 Thread Lance Norskog
Is there a program available that makes a new index with one or more segments from an existing index? (The immediate use case for this is doing forensics on corrupted indexes.) The user interface would be: extract -segments _ab,_g9 oldindex newindex This would copy the files for segments _ab and

Re: Indexing lists of IDs

2010-04-14 Thread Kristjan Siimson
Thanks, the problem was with the tokenizer, which didn't index any numbers, so I tried writing my own, and it works perfectly! :) Sincerely, Kristjan Siimson On Wed, Apr 14, 2010 at 2:12 PM, Uwe Schindler wrote: > You can add the terms with Field.Index.NOT_ANALYZED multiple times to the > same fiel

RE: NumericField indexing performance

2010-04-14 Thread Uwe Schindler
One addition: If you are indexing millions of numeric fields, you should also try to reuse NumericField and Document instances (as described in JavaDocs). NumericField creates internally a NumericTokenStream and lots of small objects (attributes), so GC cost may be high. This is just another ide
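A minimal sketch of that reuse pattern against the 2.9/3.0 API; the field names and the parallel-array input are made up for illustration:

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;

public class ReuseExample {
  // Create the Document and its fields once, then only change values per record.
  public static void indexAll(IndexWriter writer, String[] ids, long[] times) throws IOException {
    Document doc = new Document();
    Field id = new Field("id", "", Field.Store.YES, Field.Index.NOT_ANALYZED);
    NumericField time = new NumericField("time", Field.Store.NO, true);
    doc.add(id);
    doc.add(time);

    for (int i = 0; i < ids.length; i++) {
      id.setValue(ids[i]);            // change the values in place...
      time.setLongValue(times[i]);
      writer.addDocument(doc);        // ...and re-add the same Document instance
    }
  }
}
```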

RE: NumericField indexing performance

2010-04-14 Thread Uwe Schindler
Hi Tomislav, indexing with NumericField takes longer (at least for the default precision step of 4, which splits a 32-bit integer into 8 subterms of 4 bits each). So you produce 8 times more terms during indexing that must be handled by the indexer. If you have lots of docum
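For reference, a small sketch (2.9/3.0 API, field name assumed) of setting the precision step explicitly; a larger step means fewer subterms per value, so indexing is cheaper at the cost of slower NumericRangeQuery execution, and the same step must be used at query time:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class PrecisionStepExample {
  // Default step is 4: a 32-bit int is indexed as 8 subterms; step 8 cuts that to 4.
  static final int STEP = 8;

  static Document makeDoc(int day) {
    Document doc = new Document();
    doc.add(new NumericField("time", STEP, Field.Store.NO, true).setIntValue(day));
    return doc;
  }

  static Query lastSevenDays(int today) {
    // The query must use the same precision step as the index.
    return NumericRangeQuery.newIntRange("time", STEP, today - 7, today, true, true);
  }
}
```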

RE: Indexing lists of IDs

2010-04-14 Thread Uwe Schindler
You can add the terms with Field.Index.NOT_ANALYZED multiple times to the same field. If you use an analyzer like WhitespaceAnalyzer and you analyze your terms, you must also pass the term through the analyzer when building a TermQuery. This may explain why you don't get those IDs. But fo
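A minimal sketch of that approach (2.9/3.0 API; the field and category names are made up): each ID is added as its own untokenized value, and because those values bypass the analyzer, a plain TermQuery with the raw ID matches:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CategoryIdExample {

  // Index: add the same field once per category ID, untokenized.
  static Document makeProduct(String name, String... categoryIds) {
    Document doc = new Document();
    doc.add(new Field("name", name, Field.Store.YES, Field.Index.ANALYZED));
    for (String id : categoryIds) {
      doc.add(new Field("categoryId", id, Field.Store.NO, Field.Index.NOT_ANALYZED));
    }
    return doc;
  }

  // Query: the name must match AND the product must be in category 145.
  static Query bottlesInCategory145() {
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("name", "bottle")), Occur.MUST);     // analyzed field: use the analyzed form
    q.add(new TermQuery(new Term("categoryId", "145")), Occur.MUST);  // NOT_ANALYZED: use the raw value
    return q;
  }
}
```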

Re: Problem with search

2010-04-14 Thread Sirish Vadala
Hmmm... Seems like a lot of work to be done. I will try these options and update. Thanks a lot. Best.

Re: IndexWriter and memory usage

2010-04-14 Thread Michael McCandless
Run this: svn co https://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9 lucene.29x Then apply the patch and run "ant jar-core"; that should create lucene-core-2.9.2-dev.jar. Mike On Wed, Apr 14, 2010 at 1:28 PM, Woolf, Ross wrote: > How do I get to the 2.9.x branch?

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2010-04-14 Thread Michael McCandless
From your PyLucene thread it looks like this may be a known mem leak in PyLucene 2.4 (fixed in 2.9)? Mike On Wed, Apr 14, 2010 at 11:13 AM, Herbert Roitblat wrote: > Thanks, Michael. > > I have not had a chance to try your whittled example yet. Another problem > captured my attention. > > What

Re: Indexing lists of IDs

2010-04-14 Thread Rene Hackl-Sommer
Hi Kristjan, which Tokenizer and Filters are you using for the ID field? Rene On 14.04.2010 21:15, Kristjan Siimson wrote: Hello, I have a document for which I'd like to index an array of IDs. For example, there is a product that belongs to categories with IDs 12, 15, 16, 145, 148. I'd li

Indexing lists of IDs

2010-04-14 Thread Kristjan Siimson
Hello, I have a document for which I'd like to index an array of IDs. For example, there is a product that belongs to categories with IDs 12, 15, 16, 145, 148. I'd like to index these categories, and then be able to use them in queries, so that I can search for a product whose name is "Bottle" a

NumericField indexing performance

2010-04-14 Thread Tomislav Poljak
Hi, is it normal for indexing time to increase up to 10 times after introducing NumericField instead of Field (for two fields)? I've changed two date fields from String representation (Field) to NumericField; now it is: doc.add(new NumericField("time").setIntValue(date.getTime()/24/3600)) and a

RE: IndexWriter and memory usage

2010-04-14 Thread Woolf, Ross
How do I get to the 2.9.x branch? Every link I take from the Lucene site takes me to the trunk, which I assume is the 3.x version. I've tried to look around svn but can't find anything labeled 2.9.x. Is there a daily build of 2.9.x, or do I need to build it myself? I would like to try out the

Re: Removing terms in the Index

2010-04-14 Thread Railan Xisto
Actually doc1, the one with the terms to be searched, has two entries: "Lucene in Action" and "Lucene". When I pass "Lucene in Action" I want it to show the result, and to remove the word so it is not found when I pass only the term "Lucene". In short, the term "Lucene" should not find the phrase "Lucene in Action", sinc

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2010-04-14 Thread Herbert Roitblat
Thanks, Michael. I have not had a chance to try your whittled example yet. Another problem captured my attention. What I have done is use a single reader over and over. It does not seem to make any difference. I don't close it at all now. It sped up my process a bit (12 docs/second rathe

RE: PrefixQuery and special characters

2010-04-14 Thread Steven A Rowe
Hi Franz, The likely problem is that you're using an index-time analyzer that strips out the parentheses. StandardAnalyzer, for example, does this; WhitespaceAnalyzer does not. Remember that hits are the result of matches between index-analyzed terms and query-analyzed terms. Except in the c
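A small sketch (3.0 analysis API) that prints what each analyzer would actually index for the "(testvalue)" value from the question below, which is why the prefix "test" can match it under StandardAnalyzer:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShowTokens {
  static void print(Analyzer a, String text) throws Exception {
    TokenStream ts = a.tokenStream("category", new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.print("[" + term.term() + "] ");
    }
    ts.close();
    System.out.println();
  }

  public static void main(String[] args) throws Exception {
    print(new StandardAnalyzer(Version.LUCENE_30), "(testvalue)"); // prints [testvalue]   -- parentheses stripped
    print(new WhitespaceAnalyzer(), "(testvalue)");                // prints [(testvalue)] -- kept verbatim
  }
}
```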

PrefixQuery and special characters

2010-04-14 Thread Franz Roth
Hi all, say I have an index with one field named "category". There are two documents, one with the value "(testvalue)" and one with the value "test value". Now someone searches for "test". My search engine uses org.apache.lucene.search.PrefixQuery and finds 2 documents. Maybe he expected only one hit;

trying to resolve error: after flush: fdx size mismatch

2010-04-14 Thread jm
Hi, I am trying to chase an issue in our code and it is being quite difficult. We have seen two instances (see below) where we get the same error. I have been trying to reproduce it but it has been impossible so far. I have several threads, some might be creating indices and adding documents, others c

Re: IndexWriter and memory usage

2010-04-14 Thread Michael McCandless
It looks like the mailing list software stripped your image attachments... Alas these fixes are only committed on 3.1. But I just posted the patch on LUCENE-2387 for 2.9.x -- it's a tiny fix. I think the other issue was part of LUCENE-2074 (though this issue included many other changes) -- Uwe c

Re: Problem with search

2010-04-14 Thread Shai Erera
I don't know if that proposal is the most efficient one, but you can try it. In general, what you're looking for is a GROUP BY Bill-Id feature and then select the most recent one, right? Only you don't need all the Versions of the same Bill, and therefore you can hold the most recent Version-Id onl