Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Michael McCandless
Daniel Noll wrote: On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote: OK, I think very likely this is the issue: when IndexWriter hits an exception while processing a document, the portion of the document already indexed is left in the index, and then its docID is marked for deletio

Weird results with appendable fields

2008-03-13 Thread Gustavo Corral
Hi list, I'm new to Lucene and I'm trying to index a set of XML documents (document-centric) with the same structure. All these documents have a header, a front, and a body (where there's a lot of text). The problem is that in the header I have two fields, author and title, but one document can hav

Minimum records to create IndexStore

2008-03-13 Thread Sebastin
Hi All, What is the minimum number of records needed to create an index store? When I try to create an index store with 5 records, it creates only a segments file. -- View this message in context: http://www.nabble.com/Minimum-records-to-create-IndexStore-tp16024349p16024349.html Sent from the Lucene - Ja

sorting a doc field takes more time

2008-03-13 Thread sandyg
Hi, Thanks for spending time on the problem. Sorting the results of a Lucene search takes more time, and it does not seem that useful. Can anyone help? Below is my code: sort = new Sort(new SortField(field)); hits = searcher.search(query,sort); Once

Solid State Drives vs. RAMDirectory

2008-03-13 Thread Toke Eskildsen
Time for another dose of inspiration for investigating Solid State Drives. And no, I don't get percentages from the chip manufacturers :-) This time I'll argue that there's little gain in using a RAMDirectory over SSDs, when performing searches. At least for our setting. We've taken our producti

Re: indexing api wrt Analyzer

2008-03-13 Thread Grant Ingersoll
On IndexWriter, you can pass in the Analyzer when you add a Document, thus your application can identify the language, choose the analyzer for the given doc, and then add the document See public void addDocument(Document doc, Analyzer analyzer) On Mar 12, 2008, at 8:40 PM, John Wang wrote:

Re: sorting a doc field takes more time

2008-03-13 Thread Grant Ingersoll
What's in "field"? What are your docs? More info is needed to help... -Grant On Mar 13, 2008, at 6:50 AM, sandyg wrote: Hi, Thanks for spending time on the problem. Sorting the results of a Lucene search takes more time, and it does not seem that useful. Can anyone help? Below i

Re: cannot delete cfs files on windows

2008-03-13 Thread Grant Ingersoll
Not sure why you can't close, but it's a bit suspicious that you are opening the IndexReader every time you do a search. Can you explain a little more about your process? When are you indexing, how often, etc.? -Grant On Mar 12, 2008, at 11:50 AM, Ioannis Cherouvim wrote: Hello I can in

Re: Solid State Drives vs. RAMDirectory

2008-03-13 Thread Srikant Jakilinki
Hi Toke, Thanks for the write-up. Speaking for the community, the graphs (as earlier) would be great. There is no benchmarks page on the Wiki. There is one on the main site to which you can add your stuff - http://lucene.apache.org/java/2_1_0/benchmarks.html Maybe one should create one on th

[aside] Re: Solid State Drives vs. RAMDirectory

2008-03-13 Thread Grant Ingersoll
Slight aside below? On Mar 13, 2008, at 7:58 AM, Srikant Jakilinki wrote: Remember, this is all searches with an optimized index. This is on the corpus from the Danish State and University Library and should be seen as nothing else than inspiration. Is this corpus publicly available? If

Re: [aside] Re: Solid State Drives vs. RAMDirectory

2008-03-13 Thread Toke Eskildsen
On Thu, 2008-03-13 at 08:37 -0400, Grant Ingersoll wrote: > Is this corpus publicly available? If so, please share. I'm always > on the hunt for free data! I'm sorry. It's the bibliographic records from the State and University Library of Denmark and we're not allowed to share them.

Re: Solid State Drives vs. RAMDirectory

2008-03-13 Thread eks dev
>>Upping the amount of RAM does not help us when the index is replaced before we pass the 50.000 queries. Have you seen https://issues.apache.org/jira/browse/LUCENE-1035? It would be interesting to see if this one changes the HD numbers. You have plenty of free memory in this setup...

Index Merging Space Requirements

2008-03-13 Thread Mark Miller
If I use LogByteSizeMergePolicy#setMaxMergeMB, can I clamp down on the space needed for optimize/merge? My thought is, if a segment is maxed out, it will never need to be copied for a merge, right? So you could significantly reduce merge/optimize space requirements (now at like 2x-4x if readers c

Re: Index Merging Space Requirements

2008-03-13 Thread Michael McCandless
Yes this should reduce transient (while merging) disk usage. However, optimize disregards this parameter, so it will still use the same disk space. However, if you call optimize(N) then that should use less space since it does not merge all the way down to 1 segment. Note that the limit

Re: Unique Fields

2008-03-13 Thread Ion Badita
My notion of unique is more like synonym. For instance, "Brain cancer", "Cancer of the brain", and "Brain neoplasm" are the same, so I need to tokenize the title, remove the stop words, etc. I have a problem with the indexing... with a new title, I first have to search the index; if the title is not found, write
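The title normalization described in this message (tokenize, drop stop words, treat synonyms as equal) can be sketched roughly as below. This is only a minimal illustration in plain Java, not Lucene code and not the poster's implementation: the class name `TitleKey`, the stop-word list, the synonym map, and the idea of sorting tokens to ignore word order are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TitleKey {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "the", "a", "an"));

    // Hypothetical synonym map: each term maps to one canonical form.
    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("neoplasm", "cancer");
    }

    // Builds a canonical key: lowercase, split on non-alphanumerics,
    // drop stop words, map synonyms, then sort so word order is ignored.
    static String canonicalKey(String title) {
        TreeSet<String> tokens = new TreeSet<>();
        for (String raw : title.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(SYNONYMS.getOrDefault(raw, raw));
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        for (String t : new String[] {
                "Brain cancer", "Cancer of the brain", "Brain neoplasm"}) {
            System.out.println(t + " -> " + canonicalKey(t));
        }
    }
}
```

Under these assumptions all three example titles collapse to the same key, so such a key could be stored in an untokenized field and looked up before writing a new document.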

Re: cannot delete cfs files on windows

2008-03-13 Thread Ioannis Cherouvim
Hello, I index once every 24h. If a single search takes place within those 24 hours, the next indexing will generate a new cfs file, because the old one cannot be deleted. Yes, I've read in the API that it's best not to open and close an IndexReader for every search, but right now I'm not con

Re: Index Merging Space Requirements

2008-03-13 Thread Michael McCandless
Well ... yes and no? Yes, the Log*MergePolicy will still at certain times merge the index all the way down to one segment. If mergeFactor is 10 then this will happen every "power of 10" flushed segments. Ie, after 10 flushes a merge will merge them down to 1 segment, then after 100 flush
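The "power of 10" behaviour mentioned here can be illustrated with a toy model. The sketch below is plain Java and does not use or reproduce Lucene's actual Log*MergePolicy; it assumes every flush produces one level-0 segment and that whenever the newest mergeFactor segments share a level they are merged into a single segment one level up. The class and method names (`MergeSim`, `levelsAfter`, `segmentsAfter`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSim {
    // Simplified model (an assumption, not Lucene's real policy):
    // returns the level of each segment, oldest first, after `flushes` flushes.
    static List<Integer> levelsAfter(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Integer> levels = new ArrayList<>();
        for (int f = 0; f < flushes; f++) {
            levels.add(0); // each flush writes a new level-0 segment
            // Cascade: merge while the newest mergeFactor segments share a level.
            while (levels.size() >= mergeFactor) {
                int last = levels.get(levels.size() - 1);
                boolean sameLevel = true;
                for (int i = levels.size() - mergeFactor; i < levels.size(); i++) {
                    if (levels.get(i) != last) {
                        sameLevel = false;
                        break;
                    }
                }
                if (!sameLevel) {
                    break;
                }
                for (int i = 0; i < mergeFactor; i++) {
                    levels.remove(levels.size() - 1); // drop the merged inputs
                }
                levels.add(last + 1); // add the merged segment one level up
            }
        }
        return levels;
    }

    static int segmentsAfter(int flushes, int mergeFactor) {
        return levelsAfter(flushes, mergeFactor).size();
    }

    public static void main(String[] args) {
        for (int n : new int[] {9, 10, 11, 99, 100, 1000}) {
            System.out.println(n + " flushes -> "
                    + segmentsAfter(n, 10) + " segment(s)");
        }
    }
}
```

In this model the index collapses to a single segment only when the flush count is an exact power of mergeFactor; in between, many segments coexist, which is why transient disk usage during merges varies over time.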

Re: indexing api wrt Analyzer

2008-03-13 Thread John Wang
Yes, but usually it's a good idea to add documents in batch and not having to reinstantiate the writer for every document and then closing it. It would be nice if one can specify to the writer which analyzer to use. PerfieldAnalyzer wouldn't work because different analyzers may apply on the same

Re: Index Merging Space Requirements

2008-03-13 Thread Mark Miller
Thanks a lot Mike... one more question: I remember reading that a regular addDocument call could basically trigger an optimize on a given call. Is this true? Maybe not true anymore? It doesn't sound right to me, but I do remember reading about it. This was pre background merging when it was men

Re: indexing api wrt Analyzer

2008-03-13 Thread Grant Ingersoll
On Mar 13, 2008, at 11:03 AM, John Wang wrote: Yes, but usually it's a good idea to add documents in batch and not having to reinstantiate the writer for every document and then closing it. Why does what I suggested require instantiating a new writer for every document? It uses the anal

Re: indexing api wrt Analyzer

2008-03-13 Thread Grant Ingersoll
On Mar 13, 2008, at 11:03 AM, John Wang wrote: Yes, but usually it's a good idea to add documents in batch and not having to reinstantiate the writer for every document and then closing it. It would be nice if one can specify to the writer which analyzer to use. PerfieldAnalyzer wouldn't

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Doron Cohen
Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit(). I suspect this can be related to the problem you see though I am not sure. Could you try with the patch there? Thanks, Doron On Thu, Mar 13, 2008 at 10:46 AM, Michael McCandless < [EMAIL PROTECTED]> wrote: > > Daniel Noll wrote: > >

Re: indexing api wrt Analyzer

2008-03-13 Thread John Wang
Hi Grant: For our corpus, we don't rely on idf in scoring calculation that much, so I don't see that being a problem that much. About performance, instantiating 1 indexWriter for a batch of say 1000 docs, e.g. iterate over 1000 docs and do addDocument; comparing with instantiating and clo

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Doron Cohen
On Thu, Mar 13, 2008 at 9:30 PM, Doron Cohen <[EMAIL PROTECTED]> wrote: > Hi Daniel, LUCENE-1228 fixes a problem in IndexWriter.commit(). > I suspect this can be related to the problem you see though I am not sure. > Could you try with the patch there? > Thanks, > Doron Daniel, I was wrong about

Re: Document ID shuffling under 2.3.x (on merge?)

2008-03-13 Thread Michael Busch
Daniel Noll wrote: For interest's sake I also timed fetching the document with no FieldSelector, that takes around 410ms for the same documents. So there is still a big benefit in using the field selector, it just isn't anywhere near enough to get it close to the time it takes to retrieve th

Re: indexing api wrt Analyzer

2008-03-13 Thread Grant Ingersoll
There is an addDocument method that takes an Analyzer and overrides the one used at construction of the IndexWriter. See http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/index/IndexWriter.html#addDocument(org.apache.lucene.document.Document,%20org.apache.lucene.analysis.Analyzer)

Build Lucene maven artifacts

2008-03-13 Thread Patrick Turcotte
Hi, I've looked around (mailing lists, jira) and I can't seem to find information about how to generate maven artifacts, especially for contrib. I mean, I can get lucene from the maven repo, and I know I have to build the contrib for myself. But I kind of hoped I would be able to deploy contrib l

how to list term score inside some document?

2008-03-13 Thread Rao WeiXiong
Dear all, Is it possible to list all term scores inside a document by some simple method? Right now I use each term as a query to search the whole index to get the score, which seems very cumbersome. Is there a simpler approach? Cheers! weixiong

Language identification ??

2008-03-13 Thread Raghu Ram
Hi all, I guess this question is a bit off track. Are there any language identification modules inside Lucene? If not, can somebody please suggest a good one? Thank you.

Re: Build Lucene maven artifacts

2008-03-13 Thread Michael Busch
Hi Patrick, I noticed that we do not package the *.pom.template files in the source release files. That's why it is not possible to build the maven artifacts using official releases. I'll open a JIRA issue and make sure that we will ship 2.3.2 with the template files. In the meantime, you ca

Re: indexing api wrt Analyzer

2008-03-13 Thread John Wang
Excellent! Exactly what I was looking for! Thanks Grant! -John On Thu, Mar 13, 2008 at 5:39 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > There is an addDocument method that takes an Analyzer and overrides > the one used at construction of the IndexWriter. See > > http://lucene.apache.org/j