Term vector Lucene 4.2

2013-04-02 Thread andi rexha
Hi, I have a problem while trying to extract term vector attributes (e.g. the positions of the terms). What I have done was: Terms termVector = indexReader.getTermVector(docId, fieldName); TermsEnum reuse = null; TermsEnum iterator = termVector.iterator(reuse); PositionIncr

Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
Hi Andi, Here is how you could retrieve positions from your document: Terms termVector = indexReader.getTermVector(docId, fieldName); TermsEnum reuse = null; TermsEnum iterator = termVector.iterator(reuse); BytesRef ref = null; DocsAndPositionsEnum docsAndPositions = null;
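Adrien's preview is cut off; a minimal sketch of the full pattern for Lucene 4.2 (the field must have been indexed with term vectors and positions enabled; `docId` and `fieldName` are placeholders) might look like:

```java
import java.io.IOException;

import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermVectorPositions {
    // Prints every term of a document's term vector together with its positions.
    // Assumes the field was indexed with term vectors + positions; otherwise
    // getTermVector() or docsAndPositions() returns null.
    static void printPositions(IndexReader reader, int docId, String fieldName)
            throws IOException {
        Terms termVector = reader.getTermVector(docId, fieldName);
        if (termVector == null) return; // no term vector stored for this field

        TermsEnum iterator = termVector.iterator(null);
        DocsAndPositionsEnum positions = null;
        BytesRef term;
        while ((term = iterator.next()) != null) {
            positions = iterator.docsAndPositions(null, positions);
            if (positions == null) continue; // positions were not indexed
            positions.nextDoc();             // a term vector holds a single doc
            final int freq = positions.freq();
            for (int i = 0; i < freq; i++) {
                System.out.println(term.utf8ToString() + " @ " + positions.nextPosition());
            }
        }
    }
}
```

This is a sketch against the 4.x API only; the `reuse` arguments (`null` here) let Lucene recycle enumerator instances across calls.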

RE: Term vector Lucene 4.2

2013-04-02 Thread andi rexha
Hi Adrien, Thank you very much for the reply. I have two other small questions about this: 1) Is "final int freq = docsAndPositions.freq();" the same as "iterator.totalTermFreq()"? In my tests it returns the same result, and from the documentation it seems that the result should be the same.

How to use concurrency efficiently

2013-04-02 Thread Igor Shalyminov
Hello! I have a ~20GB index and am trying to make search over it concurrent. The index has 16 segments, and I run SpanQuery.getSpans() on each segment concurrently. I see only a really small performance improvement from searching concurrently. I suppose the reason is that the sizes of the segments are very non-

Re: Term vector Lucene 4.2

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 12:45 PM, andi rexha wrote: > Hi Adrien, > Thank you very much for the reply. > > I have two other small question about this: > 1) Is "final int freq = docsAndPositions.freq();" the same with > "iterator.totalTermFreq()" ? In my tests it returns the same result and from >

Segment readers in Lucene 4.2

2013-04-02 Thread andi rexha
Hi, I have a question about the Index Readers in Lucene. As far as I understand from the documentation, with Lucene 4 we can create an IndexReader from DirectoryReader.open(directory); From the code of DirectoryReader, I have seen that it uses the SegmentReader to create the reader.

RE: Segment readers in Lucene 4.2

2013-04-02 Thread Uwe Schindler
Hi, this is all not exposed publicly in the code because it is also subject to change! With Lucene 4.x, you can assume: directoryReader.leaves().get(i) corresponds to segmentInfos.info(i) WARNING: But this is only true if: - the reader is instanceof DirectoryReader - the segmentInfos were opened on the e
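The supported way to reach the per-segment readers that Uwe alludes to is `leaves()` rather than SegmentInfos; a minimal Lucene 4.x sketch (class and method names from the 4.x API) might be:

```java
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;

public class SegmentWalk {
    // Iterates the per-segment (atomic) readers of a composite reader.
    // In Lucene 4.x each leaf wraps one SegmentReader internally, but
    // client code should rely only on the AtomicReader API, not on the
    // concrete class, since that mapping is an implementation detail.
    static void listSegments(DirectoryReader reader) {
        for (AtomicReaderContext leaf : reader.leaves()) {
            System.out.println("segment docBase=" + leaf.docBase
                + " maxDoc=" + leaf.reader().maxDoc());
        }
    }
}
```

`docBase` is the offset you add to a leaf-local document id to get the top-level id.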

Re: How to use concurrency efficiently

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 2:29 PM, Igor Shalyminov wrote: > Hello! Hi Igor, > I have a ~20GB index and try to make a concurrent search over it. > The index has 16 segments, I run SpanQuery.getSpans() on each segment > concurrently. > I see really small performance improvement of searching concurre

Re: Indexing Term Frequency Vectors

2013-04-02 Thread Sharon W Tam
Thanks for your help, Adrien. But unfortunately, my term frequencies will be partial counts, so they won't be integers. And finding a common denominator and scaling the rest of the frequencies accordingly will affect the relative lengths of the documents, which will affect the Lucene scoring becaus

RE: Segment readers in Lucene 4.2

2013-04-02 Thread andi rexha
Hi, Thanks for the reply ;) > > this is all not exposed publicly in the code because it is also subject to change! > > With Lucene 4.x, you can assume: > directoryReader.leaves().get(i) corresponds to segmentInfos.info(i) > > WARNING: But this is only true if: > - the reader is instanceof DirectoryR

Re: How to use concurrency efficiently

2013-04-02 Thread Igor Shalyminov
Yes, the number of documents is not too large (about 90,000), but the queries are very hard. Although they're just boolean, a typical query can produce a result with tens of millions of hits. Run single-threaded, such a query takes ~20 seconds, which is too slow. Therefore, multithreading is vital f

Re: How to use concurrency efficiently

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 4:39 PM, Igor Shalyminov wrote: > Yes, the number of documents is not too large (about 90 000), but the queries > are very hard. Although they're just boolean, a typical query can produce a > result with tens of millions of hits. How can there be tens of millions of hits

Re: Indexing Term Frequency Vectors

2013-04-02 Thread Adrien Grand
On Tue, Apr 2, 2013 at 4:10 PM, Sharon W Tam wrote: > Are there any other ideas? Since scoring seems to be what you are interested in, you could have a look at payloads: they can store arbitrary data and can be used to score matches. -- Adrien

Re: Scoring function in LMDirichletSimilarity Class

2013-04-02 Thread Zeynep P.
Hi, I have the same question related to the LMJelinekMercerSimilarity class.

protected float score(BasicStats stats, float freq, float docLen) {
  return stats.getTotalBoost() *
      (float) Math.log(1 +
          ((1 - lambda) * freq / docLen)
              / (lambda * ((LMStats) stats).getCollectionProbability()));
}
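For reference, the code above implements Jelinek-Mercer smoothing: the smoothed term probability mixes the document model with the collection model,

\[
p(w \mid d) \;=\; (1-\lambda)\,\frac{\mathrm{tf}(w,d)}{|d|} \;+\; \lambda\, p(w \mid C),
\]

and the per-term score contribution computed by `score()` is (up to the boost factor)

\[
\log\!\left(1 \;+\; \frac{(1-\lambda)\,\mathrm{tf}(w,d)/|d|}{\lambda\, p(w \mid C)}\right),
\]

where $\mathrm{tf}(w,d)$ is `freq`, $|d|$ is `docLen`, and $p(w \mid C)$ is `getCollectionProbability()`. Dropping the constant $\log \lambda\,p(w\mid C)$ term this way keeps scores non-negative for matching terms.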

Re: How to use concurrency efficiently

2013-04-02 Thread Igor Shalyminov
These are not document hits but text hits (to be more specific, spans). For the search result it is necessary to have the precise number of document and text hits and a relatively small number of matched text snippets. I've tried several approaches to optimize the search algorithm, but they didn't

When should I commit IndexWriter and TaxonomyWriter if I use NRT readers?

2013-04-02 Thread crocket
Since I use NRT readers for the index and the taxonomy index, I don't have to commit to see the changes. Now, I don't know if the indexes are ever committed. If they aren't committed automatically, I'd have to do it on a regular basis. What should I do about committing?

Re: When should I commit IndexWriter and TaxonomyWriter if I use NRT readers?

2013-04-02 Thread Apostolis Xekoukoulotakis
Maybe consider the data saved only after you have committed it, and acknowledge new data in batches after a commit? 2013/4/3 crocket > Since I use NRT readers for Index and TaxonomyIndex, I don't have to commit > to see the changes. > > Now, I don't know if indexes are ever committed. > > If they
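The batching idea above is often implemented as a background task that commits on a fixed interval, since NRT readers give visibility but only commits give durability. A self-contained sketch (the `commit` Runnable stands in for the real work; with Lucene facets you would commit the TaxonomyWriter before the IndexWriter so every facet ordinal referenced by committed documents exists):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicCommitter implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    // Runs `commit` repeatedly, waiting `intervalMillis` between the end of
    // one run and the start of the next (fixed delay, so a slow commit never
    // causes overlapping commits).
    PeriodicCommitter(Runnable commit, long intervalMillis) {
        scheduler.scheduleWithFixedDelay(
            commit, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Stops scheduling further commits; a final explicit commit on shutdown
    // is still the caller's responsibility.
    @Override public void close() {
        scheduler.shutdown();
    }
}
```

The interval trades durability (how much acknowledged-but-uncommitted work a crash can lose) against commit overhead; seconds to minutes is a common range.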

RE: How to use concurrency efficiently

2013-04-02 Thread Uwe Schindler
If you are using MMapDirectory (the default on 64-bit platforms), then the files are already in the filesystem cache and directly accessible, like RAM, to the IndexReader. No need to cache separately. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Or

Re: How to use concurrency efficiently

2013-04-02 Thread Paul
Hi, I've experimented a bit with MultiFieldQueryParser (http://lucene.apache.org/core/4_2_0/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html) But it seems to search for each of a query's terms in each field specified in the constructor. So, as the doc says, if you q