Re: Multiple merge-runs from same set of segments

2021-05-30 Thread Ravikumar Govindarajan
that! > > > > Patrick > > > > Ravikumar Govindarajan wrote on Mon, May 24, 2021 > > at 11:49 AM: > > > > > Thanks Patrick for the help! > > > > > > May I know what lucene version you're using? > > > > > > > > > > We are using

Re: Multiple merge-runs from same set of segments

2021-05-24 Thread Ravikumar Govindarajan
e current default directory implementation is > MMapDirectory, which delegates the caching to the system and should have > already optimized this situation. Here's a great blog explaining the > MMapDirectory in lucene: > https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64b

Re: Multiple merge-runs from same set of segments

2021-05-24 Thread Ravikumar Govindarajan
, including mocking LiveDocs to get > the right documents into the right segments. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Sat, May 22, 2021 at 3:50 PM Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > >> Hello

Multiple merge-runs from same set of segments

2021-05-22 Thread Ravikumar Govindarajan
Hello, We have a use-case for index-rewrite on a "frozen index" where no new documents are added. It goes like this.. 1. Get all segments for the index (base-segment-list) 2. Create a new segment from base-segment-list with unique set of docs (LiveDocs) 3. Repeat step 2, for a fixed c

Re: Multi-IDF for a single term possible?

2019-12-04 Thread Ravikumar Govindarajan
rite a similarity wrapper that will read the needed information from > a hash map. > > Regards > Ameer > > > > On Wed, 4 Dec 2019 at 00:55, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > > > > > > > it is enough to giv

Re: Multi-IDF for a single term possible?

2019-12-03 Thread Ravikumar Govindarajan
> > it is enough to give each its own field. > I kind of over-simplified the problem at hand. Apologies. DOC_TYPE is just one aspect of the problem. The other is that it is actually a shared index with multiple users (100-3000 users per index). There are many hundreds of such shared

Multi-IDF for a single term possible?

2019-12-03 Thread Ravikumar Govindarajan
Hello, We are using TF-IDF for scoring (Yet to migrate to BM25). Different entities (DOC_TYPES) are crunched & stored together in a single index. When it comes to IDF, I find that there is a single value computed across documents & stored as part of TermStats, whereas our documents are not homoge

Re: block min-max values for Sort Field with Top-N query..

2019-07-02 Thread Ravikumar Govindarajan
t than > FeatureField but would allow sorting in either ascending or descending > order. > > > > On Tue, Jul 2, 2019 at 3:01 PM Ravikumar Govindarajan > wrote: > > > > Our Sort Fields utilize DocValues.. > > > > Lets say I collect min-max ords of a Sort F

block min-max values for Sort Field with Top-N query..

2019-07-02 Thread Ravikumar Govindarajan
Our Sort Fields utilize DocValues. Let's say I collect min-max ords of a Sort Field for a block of documents (128, 256, etc.) at index-time via a Codec & store them as part of DocValues at the Segment level. During query time, could we take advantage of these stats when a Top-N query with Sort Field is r
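The idea in this message can be sketched in plain Java, without the Lucene Codec or DocValues APIs. This is only an illustration under stated assumptions: every document matches the query, the sort is ascending on a single long value per document, and the per-block minimum is precomputed (a descending sort would use the per-block maximum the same way). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch, not a Lucene API: per-block minima let a top-N
// ascending sort skip whole blocks that cannot improve the current result.
final class BlockMinMaxTopN {
    private BlockMinMaxTopN() {}

    /** Per-block minimum of the sort values; stands in for index-time stats. */
    static long[] blockMins(long[] values, int blockSize) {
        int blocks = (values.length + blockSize - 1) / blockSize;
        long[] mins = new long[blocks];
        Arrays.fill(mins, Long.MAX_VALUE);
        for (int i = 0; i < values.length; i++) {
            mins[i / blockSize] = Math.min(mins[i / blockSize], values[i]);
        }
        return mins;
    }

    /**
     * Returns the n smallest values in ascending order.
     * skippedBlocks[0] is incremented for every block that was never read.
     */
    static long[] topN(long[] values, long[] mins, int blockSize, int n, int[] skippedBlocks) {
        if (n <= 0) {
            return new long[0];
        }
        // Max-heap: the head is the worst value currently in the top N.
        PriorityQueue<Long> worstFirst = new PriorityQueue<>(Comparator.reverseOrder());
        for (int b = 0; b < mins.length; b++) {
            // If the queue is full and even the block's best value cannot
            // beat the current worst hit, the whole block is skipped.
            if (worstFirst.size() == n && mins[b] >= worstFirst.peek()) {
                skippedBlocks[0]++;
                continue;
            }
            int end = Math.min(values.length, (b + 1) * blockSize);
            for (int i = b * blockSize; i < end; i++) {
                if (worstFirst.size() < n) {
                    worstFirst.add(values[i]);
                } else if (values[i] < worstFirst.peek()) {
                    worstFirst.poll();
                    worstFirst.add(values[i]);
                }
            }
        }
        return worstFirst.stream().mapToLong(Long::longValue).sorted().toArray();
    }
}
```

The skip only pays off once the queue is full, so the benefit depends on how early competitive values appear in doc-id order.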

Re: Storing external transaction log-ids in lucene...

2017-08-10 Thread Ravikumar Govindarajan
You can know exactly which ops made it into your > commit and which didn't. > > TrackingIndexWriter is replaced by the sequence numbers. > > Mike McCandless > > http://blog.mikemccandless.com > > On Thu, Aug 10, 2017 at 9:37 AM, Ravikumar Govindarajan < > ravikum

Re: Storing external transaction log-ids in lucene...

2017-08-10 Thread Ravikumar Govindarajan
> On Thu, Aug 10, 2017 at 6:57 AM, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > > > Every mutation (Add/Update/Delete) has a transaction-id (incremental > long) > > assigned by our Messaging Queue (Kafka) > > > > To index these mutati

Storing external transaction log-ids in lucene...

2017-08-10 Thread Ravikumar Govindarajan
Every mutation (Add/Update/Delete) has a transaction-id (incremental long) assigned by our Messaging Queue (Kafka) To index these mutations, an indexer thread pulls data from the queue, adds & commits to IndexWriter, then updates the latest transaction-id in an external system (ZooKeeper). During

Re: will lucene traverse all segments to search a 'primary key'term or will it stop as soon as it get one?

2017-04-24 Thread Ravikumar Govindarajan
> > Let’s say I have a user info index and user id is the ‘primary key’. So > when I do a userid term search, will lucene traverse all segments to search > a 'primary key'term or will it stop as soon as it get one? Lucene in general will search all segments for primary key. But in case you want a

Re: Can ByteBufferIndexInput use buffering?

2016-10-20 Thread Ravikumar Govindarajan
hing like this: > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html > > Kind regards, > Uwe > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message---

Can ByteBufferIndexInput use buffering?

2016-10-20 Thread Ravikumar Govindarajan
When we use NIOFSDirectory, lucene internally uses buffering via BufferedIndexInput (1KB etc.) while reading from the file. However, for MMapDirectory (ByteBufferIndexInput) there is no such buffering & data is read from the mapped bytes directly. Will it be too much of a performance drag if

Re: Segment Corruption - ForUtil.readBlock AIOBE

2016-08-09 Thread Ravikumar Govindarajan
8, 2016 at 1:41 PM, Robert Muir wrote: > Can you run checkindex and include the output? > > On Mon, Aug 8, 2016 at 2:36 AM, Ravikumar Govindarajan > wrote: > > For some of the segments we received the following exception during merge > > as well as search. They look to be

Segment Corruption - ForUtil.readBlock AIOBE

2016-08-07 Thread Ravikumar Govindarajan
For some of the segments we received the following exception during merge as well as search. They look to be corrupt [Lucene 4.6.1 & Sun JDK 1.7.0_55] Is this a known bug? Any help is much appreciated The offending line of code is in ForUtil.readBlock() method... *final int encodedSize = encode

IndexWriterConfig.readerPooling option...

2016-06-16 Thread Ravikumar Govindarajan
Came across a JIRA filed for pooling IndexReaders https://issues.apache.org/jira/browse/LUCENE-2297 For every commit/delete/update cycle IndexWriter opens a bunch of SegmentReaders, does the job & closes it. Does the JIRA aim to re-use the SegmentReaders for all commit-cycles till they are fina

Re: Merge deletes files for ongoing search...

2016-05-09 Thread Ravikumar Govindarajan
385) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1374) at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:89) at org.apache.blur.store.hdfs.HdfsIndexInput.readInternal(HdfsIndexInput.java:62) On Tue, May 10, 2016 at 11:32 AM, Ravikumar Govindarajan < ravikumar.govindara...@gmail.com> wrote:

Merge deletes files for ongoing search...

2016-05-09 Thread Ravikumar Govindarajan
Sometimes during an ongoing search we receive an IndexReaderClosedException & found that it happens when a merge is completed. We are on an older version of lucene (4.6.1). IndexFileDeleter (KeepOnlyLastCommitDeletionPolicy) deletes the file after the merge completes but we have an open IndexSear

DocsEnum.FLAG_NONE for search query...

2015-07-31 Thread Ravikumar Govindarajan
On lucene-4.6.1, is there a way to specify during search that only docs need to be iterated/searched and frequencies can be skipped? I saw that DocsEnum.FLAG_NONE is meant for this, but could not find out how to pass it via a search query. My assumption is that skipping frequencies could speed up search

Re: Delete Parents without any children...

2015-07-10 Thread Ravikumar Govindarajan
> and it would be interesting to see how it would affect multi-term > queries compared to lz4 blocks. > > [1] https://en.wikipedia.org/wiki/Byte_pair_encoding > > On Fri, Jul 3, 2015 at 12:09 PM, Ravikumar Govindarajan > wrote: > > An unrelated question… > > >

Re: Delete Parents without any children...

2015-07-03 Thread Ravikumar Govindarajan
ehave well? Currently we don't have plans of providing queries like Fuzzy/Re-spell etc.. and thought could benefit from it On Thu, Jul 2, 2015 at 6:02 PM, Ravikumar Govindarajan < ravikumar.govindara...@gmail.com> wrote: > Thanks Adrien… > > Works like a charm!!! > >

Re: Delete Parents without any children...

2015-07-02 Thread Ravikumar Govindarajan
Thanks Adrien… Works like a charm!!! On Wed, Jul 1, 2015 at 10:22 PM, Adrien Grand wrote: > Hi Ravikumar, > > You need to run a BooleanQuery with two clauses: > - a must clause that matches all parent documents > - a must_not clause that matches all parents that have children > > Building thi

Delete Parents without any children...

2015-07-01 Thread Ravikumar Govindarajan
We have organised our segments in parent-child blocks and wish to periodically delete parent-documents that don't have any children to reclaim space via IndexWriter.deleteDocuments(Query)… Is it possible to draft a Query that identifies such parents? Any help is much appreciated… -- Ravi
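A minimal sketch of the block-layout observation behind this question, in plain Java rather than the Lucene API. It assumes the usual block-join convention that each parent document is indexed immediately after its children, so a parent is childless exactly when the document before it is also a parent (or when it is the first document in the segment). The class and method names are hypothetical, and `java.util.BitSet` stands in for the parent filter bits.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class ChildlessParents {
    private ChildlessParents() {}

    /**
     * Returns the doc ids of parents that have no children.
     * Assumes children are stored immediately before their parent,
     * and that parentBits has one bit set per parent document.
     */
    static List<Integer> find(BitSet parentBits) {
        List<Integer> childless = new ArrayList<>();
        for (int doc = parentBits.nextSetBit(0); doc >= 0; doc = parentBits.nextSetBit(doc + 1)) {
            // No room for children if this is doc 0 or the previous doc is a parent.
            if (doc == 0 || parentBits.get(doc - 1)) {
                childless.add(doc);
            }
        }
        return childless;
    }
}
```

The reply in this thread expresses the same condition as a BooleanQuery with a must clause matching all parents and a must_not clause matching parents that have children.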

Re: SortingAtomicReader alternate to Tim-Sort...

2015-04-30 Thread Ravikumar Govindarajan
Apr 28, 2015 at 6:03 PM, Adrien Grand wrote: > On Tue, Apr 21, 2015 at 10:00 AM, Ravikumar Govindarajan > wrote: > > Thanks for the comments… > > > > My only > >> concern about using the FixedBitSet is that it would make sorting each > >> postin

Re: SortingAtomicReader alternate to Tim-Sort...

2015-04-24 Thread Ravikumar Govindarajan
Thanks. Glad that it has been pro-actively identified and fixed -- Ravi On Thu, Apr 23, 2015 at 10:34 AM, Robert Muir wrote: > On Tue, Apr 21, 2015 at 4:00 AM, Ravikumar Govindarajan > wrote: > > > b) CompressingStoredFieldsReader did not store the last decoded 32KB > chunk

Re: SortingAtomicReader alternate to Tim-Sort...

2015-04-21 Thread Ravikumar Govindarajan
> assume you are still on 4.x)? > > I'm curious if you already performed any kind of benchmarking of this > approach? > > > On Tue, Apr 14, 2015 at 2:07 PM, Ravikumar Govindarajan > wrote: > > We were experimenting with SortingMergePolicy and came across an >

SortingAtomicReader alternate to Tim-Sort...

2015-04-14 Thread Ravikumar Govindarajan
We were experimenting with SortingMergePolicy and came across an alternate solution to TimSort of postings-list using FBS & GrowableWriter. I have attached relevant code-snippet. It would be nice if someone can clarify whether it is a good idea to implement... public class SortingAtomicReader { …

Re: URL/Email tokenizer

2015-02-17 Thread Ravikumar Govindarajan
wrote: > Sounds like a job for > org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper. > > > -- > Ian. > > > On Tue, Feb 17, 2015 at 8:51 AM, Ravikumar Govindarajan > wrote: > > We have a requirement in that E-mail addresses need to be added in a >

URL/Email tokenizer

2015-02-17 Thread Ravikumar Govindarajan
We have a requirement in that E-mail addresses need to be added in a tokenized form to one field while untokenized form is added to another field Ex: "I have mailed a...@xyz.com" . It should tokenize as below body = {"I", "have", "mailed", "abc", "xyz", "com"}; I also have a body-addr field. To

Re: Slow doc/pos file merges...

2014-12-09 Thread Ravikumar Govindarajan
s... We switched it to write using ForUtil even if block-size<128 and perf was much better and predictable. Are there any particular reasons for taking the VInt approach? Any help on this issue is appreciated -- Ravi On Tue, Nov 18, 2014 at 12:49 PM, Ravikumar Govindarajan < ravikumar.govindar

Slow doc/pos file merges...

2014-11-17 Thread Ravikumar Govindarajan
Hi, I am finding that lucene is slowing down a lot when bigger and bigger doc/pos files are merged... While it's normally the case, the worrying part is all my data is in RAM. Version is 4.6.1. Some sample statistics taken after instrumenting the SortingAtomicReader code, as we use a SortingMergePo

ToChildBlockJoinQuery possible issue?

2014-08-31 Thread Ravikumar Govindarajan
Sometimes TCBJQ returns parent-doc itself as a child-doc. I traced it down to the following code... public int advance(int childTarget) throws IOException { ... final int firstChild = parentBits.prevSetBit(parentDoc-1); //System.out.println(" firstChild=" + firstChild); childTarget = Math

Re: Block-Join and number of child-documents

2014-08-05 Thread Ravikumar Govindarajan
FYI, there is SirenDB on top of lucene that addresses such concerns... It supports multi-level parent-child relationships and provides nice querying capabilities... -- Ravi On Thu, Jul 31, 2014 at 12:59 PM, Ravikumar Govindarajan < ravikumar.govindara...@gmail.com> wrote: > We are pl

Block-Join and number of child-documents

2014-07-31 Thread Ravikumar Govindarajan
We are planning to use block-indexing and ToChildBlockJoin queries... Each parent-doc can contain anywhere between 1-2000 children-docs and is highly variable. A sample user-stats is as follows 1. No.of. parent-docs = 500K 2. Children -per parent = 50 3. Total-docs = 25 Million 4. Size occupied

Re: Incremental Field Updates

2014-07-08 Thread Ravikumar Govindarajan
> > > > On Thu, Jul 3, 2014 at 3:22 AM, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com > wrote: > > > In case of sorting, updatable DocValues may be what you are looking for. > > > > But updatable fields for searching is a different beast. > > >

Re: Incremental Field Updates

2014-07-03 Thread Ravikumar Govindarajan
In case of sorting, updatable DocValues may be what you are looking for. But updatable fields for searching is a different beast. A sample approach is documented at http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/ The general problems with upd

Re: EarlyTerminatingSortingCollector help needed..

2014-06-23 Thread Ravikumar Govindarajan
d values for a few documents, > - doc values when loading a few field values for many documents. Thanks for this clarification. Shall surely move towards doc-values... -- Ravi On Mon, Jun 23, 2014 at 5:36 PM, Adrien Grand wrote: > On Sun, Jun 22, 2014 at 6:44 PM, Ravikumar Govindarajan >

Re: EarlyTerminatingSortingCollector help needed..

2014-06-22 Thread Ravikumar Govindarajan
m seek. [ http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1 ] If so, then what could make DocValues still a winner? -- Ravi On Sat, Jun 21, 2014 at 6:41 PM, Adrien Grand wrote: > Hi Ravikumar, > > On Fri, Jun 20, 2014 at 12:14 PM, Ravikumar Govindarajan >

EarlyTerminatingSortingCollector help needed..

2014-06-20 Thread Ravikumar Govindarajan
I was planning to use ETSC in conjunction with SortingMergePolicy and got stuck. In ETSC, we have @Override public void collect(int doc) throws IOException { in.collect(doc); if (++numCollected >= numDocsToCollect) { throw new CollectionTerminatedException(); } } I und
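The collect() logic quoted in this message can be illustrated with a small plain-Java model. It is a sketch only: the types below are hypothetical stand-ins, not Lucene's Collector or CollectionTerminatedException, and it assumes the matching doc ids of a segment arrive already in index sort order, which is what SortingMergePolicy is meant to provide for merged segments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Lucene's collector types; not the real API.
final class EarlyTerminationSketch {
    private EarlyTerminationSketch() {}

    /** Signals that the current segment needs no further collection. */
    static final class CollectionTerminated extends RuntimeException {
        CollectionTerminated() {
            super("collection terminated early");
        }
    }

    /** Collects doc ids and asks the caller to stop once enough are gathered. */
    static final class FirstNCollector {
        private final int numDocsToCollect;
        private final List<Integer> hits = new ArrayList<>();

        FirstNCollector(int numDocsToCollect) {
            this.numDocsToCollect = numDocsToCollect;
        }

        void collect(int doc) {
            hits.add(doc);
            if (hits.size() >= numDocsToCollect) {
                throw new CollectionTerminated();
            }
        }

        List<Integer> hits() {
            return hits;
        }
    }

    /** Feeds matches (already in sort order) to the collector for one segment. */
    static List<Integer> searchSegment(int[] sortedMatches, int topN) {
        FirstNCollector collector = new FirstNCollector(topN);
        try {
            for (int doc : sortedMatches) {
                collector.collect(doc);
            }
        } catch (CollectionTerminated e) {
            // Expected: later docs sort after the ones already collected.
        }
        return collector.hits();
    }
}
```

The exception is a control-flow signal for one segment only; a caller searching several segments would catch it per segment and move on to the next one.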

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
t to map them to 4 GLOBALLY SORTED documents. > If > > you make a local decision based on these 4 documents, you will end up w/ > a > > completely messed up segment. > > > > I think the global DocMap is really required. Forget about that that > other > > co

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
ap, like Lucene code does ... > > If I miss your point, I'd appreciate if you can point me to a code example, > preferably in Lucene source, which demonstrates the problem. > > Shai > > > On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan < > ravikumar.govin

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
.8, and now > you don't really need to implement a Sorter, but rather pass a SortField, > if that works for you. > > Shai > > > On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > > > Shai, &

Re: SortingMergePolicy for already sorted segments

2014-06-16 Thread Ravikumar Govindarajan
. > > Shai > > > On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > > > I am planning to use SortingMergePolicy where all the merge-participating > > segments are already sorted... I understand that I need to d

SortingMergePolicy for already sorted segments

2014-06-16 Thread Ravikumar Govindarajan
I am planning to use SortingMergePolicy where all the merge-participating segments are already sorted... I understand that I need to define a DocMap with old-new doc-id mappings. Is it possible to optimize the eager loading of DocMap and make it kind of lazy load on-demand? Ex: Pass List to the c

Re: fadvise/madvise during segment-merges....

2014-05-27 Thread Ravikumar Govindarajan
McCandless < luc...@mikemccandless.com> wrote: > On Wed, May 21, 2014 at 10:50 AM, Ravikumar Govindarajan > wrote: > >> > >> But does that mean SEQUENTIAL will evict the > >> page once we're done reading it? > > > > > > Yes, looks like it do

Re: fadvise/madvise during segment-merges....

2014-05-21 Thread Ravikumar Govindarajan
.. There are also sneaky ways to > invoke some of these OS-level APIs without using JNI This is cool stuff... Saves an amazing amount of effort for most of the things... -- Ravi On Wed, May 21, 2014 at 7:13 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Wed, May

Re: fadvise/madvise during segment-merges....

2014-05-21 Thread Ravikumar Govindarajan
ecause then the OS knows to read > ahead and aggressively free the page once we are done using it. > > There is also O_DIRECT (e.g., using NativeUnixDirectory) for direct IO > to bypass the buffer cache entirely. > > > Mike McCandless > > http://blog.mikemccandless.com >

fadvise/madvise during segment-merges....

2014-05-21 Thread Ravikumar Govindarajan
Is it a good idea to use FADVISE_DONTNEED/MADVISE_DONTNEED flags during segment merge reads? Buffer-Cache contains critical data belonging to searches. A segment-merge has the potential to disturb the cache no? -- Ravi

Few questions on updatable DocValues

2014-03-14 Thread Ravikumar Govindarajan
Hi, I have a few questions related to the updatable DocValues API... It would be great if I can get help. 1. Is it possible to provide updateNumericDocValue(Term term, Map), in case I wish to update multiple fields and their doc-values? 2. Instead of a "Term" based update, is it possible to extend it to

StoredFieldsWriter finishDocument() method should be abstract?

2014-03-13 Thread Ravikumar Govindarajan
I was just trying to implement a StoredFieldsWriter[4.6.1] and found that finishDocument() method has an empty impl. Any reason for not declaring it abstract? We could easily miss over-riding it -- Ravi

Re: Single segment merge in lucene possible?

2014-02-24 Thread Ravikumar Govindarajan
eIndexMergePolicy.html You just have to (anonymously) subclass > > > UpgradeIndexMergePolicy and return true from "protected boolean > > shouldUpgradeSegment(SegmentCommitInfo si)" only for the segment to > > be merged. By default this returns true for segments t

Single segment merge in lucene possible?

2014-02-21 Thread Ravikumar Govindarajan
Hi, Is it possible to merge a single segment all by itself, may be just accounting for deletes alone? This is needed so as to solve certain data-locality issues we face in a custom implementation of Directory API. -- Ravi

Re: Actual min and max-value of NumericField during codec flush

2014-02-19 Thread Ravikumar Govindarajan
Thanks Mike for your time and help On Monday, February 17, 2014, Michael McCandless wrote: > On Mon, Feb 17, 2014 at 8:33 AM, Ravikumar Govindarajan > > wrote: > >> > >> Well, this will change your scores? MultiReader will sum up all term > >> statistics

Re: Actual min and max-value of NumericField during codec flush

2014-02-17 Thread Ravikumar Govindarajan
> > Well, this will change your scores? MultiReader will sum up all term > statistics across all SegmentReaders "up front", and then scoring per > segment will use those top-level weights. Our app needs to do only matching and sorting. In-fact, it would be fully OK to by-pass scoring. But I feel

Re: Actual min and max-value of NumericField during codec flush

2014-02-13 Thread Ravikumar Govindarajan
balanced and your indexing > performance will degrade because of unbalanced amount of IO that happens > during the merge. > > Shai > > > On Thu, Feb 13, 2014 at 7:25 AM, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > > > @Mike, > >

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
jacent > segments and SortingMP ensures the merged segment is also sorted. > > Shai > > > On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > > > Yes exactly as you have described. > > > > Ex: Consider S

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
then searched > in "reverse segment order"? > > I think you should be able to do this w/ SortingMergePolicy? And then > use a custom collector that stops after you've gone back enough in > time for a given search. > > Mike McCandless > > http://blog.mikemc

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
why you need to encouraging merging of the more > recent (by your "time" field) segments... > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan > wrote: > > Mike, > >

Re: Actual min and max-value of NumericField during codec flush

2014-02-07 Thread Ravikumar Govindarajan
er's infoStream > and do a long running test to convince yourself the merging is being > sane. > > Mike > > Mike McCandless > > http://blog.mikemccandless.com > > > On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan > wrote: > > Thanks Mike, > >

Re: Actual min and max-value of NumericField during codec flush

2014-02-06 Thread Ravikumar Govindarajan
t has improved, so that > you can e.g. pull your own TermsEnum and iterate the terms yourself. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan > wrote: > > I use a Codec to flush data. All methods delegat

Actual min and max-value of NumericField during codec flush

2014-02-06 Thread Ravikumar Govindarajan
I use a Codec to flush data. All methods delegate to actual Lucene42Codec, except for intercepting one single-field. This field is indexed as an IntField [Numeric-Trie...], with precisionStep=4. The purpose of the Codec is as follows 1. Note the first BytesRef for this field 2. During finish() ca

Re: IndexWriter flush/commit exception

2013-12-18 Thread Ravikumar Govindarajan
..@mikemccandless.com> wrote: > On Wed, Dec 18, 2013 at 3:15 AM, Ravikumar Govindarajan > wrote: > > Thanks Mike for a great explanation on Flush IOException > > You're welcome! > > > I was thinking on the perspective of a HDFSDirectory. In addition to the >

Re: IndexWriter flush/commit exception

2013-12-18 Thread Ravikumar Govindarajan
d for handling momentary IOExceptions -- Ravi On Tue, Dec 17, 2013 at 9:14 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Mon, Dec 16, 2013 at 7:33 AM, Ravikumar Govindarajan > wrote: > > I am trying to model a transaction-log for lucene, which creates a > >

IndexWriter flush/commit exception

2013-12-16 Thread Ravikumar Govindarajan
I am trying to model a transaction-log for lucene, which creates a transaction-log per-commit Things work fine during normal operations, but I cannot fathom the effect during a. IOException during Index-Commit Will the index be restored to previous commit-point? Can I blindly re-try operations f

SortingMergePolicy for index-block-joins

2013-12-03 Thread Ravikumar Govindarajan
I am trying to find an optimal way of merging two already sorted segments and need some help here... My use-case is this: Segment1 [Segment already sorted by Field "F1"] Fields: F1, F2, F3 List Fields: C1,

Re: FST Builder pruning

2013-11-18 Thread Ravikumar Govindarajan
signing terms to blocks, but to build the trie terms > index it builds a separate FST, by adding in each block's prefix (it > doesn't use the FST's builder pruning to create the trie). > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri,

Re: FST Builder pruning

2013-11-15 Thread Ravikumar Govindarajan
his to build a prefix trie instead of the full FST. > > Creating a custom tail freezer is very expert: it lets you implement > arbitrary logic on which nodes are pruned or not. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, Nov 15, 2013 at 12:16

FST Builder pruning

2013-11-15 Thread Ravikumar Govindarajan
I was trying to understand some logic in the Builder class of FST. The method freezeTail() looks quite hairy. I gather that there is some logic for pruning a node or compiling it. What exactly is pruning a node? An example of it will be really helpful -- Ravi

Re: IndexReader close listeners and NRT

2013-11-10 Thread Ravikumar Govindarajan
Thanks Mike. Explicit type-cast to SegmentReader will do the trick for the moment. -- Ravi On Fri, Nov 8, 2013 at 6:17 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Fri, Nov 8, 2013 at 12:22 AM, Ravikumar Govindarajan > wrote: > >> So, in your code, "

Re: IndexReader close listeners and NRT

2013-11-07 Thread Ravikumar Govindarajan
> wrote: > > On Thu, Nov 7, 2013 at 12:18 PM, Ravikumar Govindarajan > wrote: > > Thanks Mike. > > > > If you look at my impl, I am using the getCoreCacheKey() only, but keyed > > on a ReaderClosedListener and purging it onClose(). When NRT does reopens, > >

Re: IndexReader close listeners and NRT

2013-11-07 Thread Ravikumar Govindarajan
k at liveDocs "live" and fold them in, instead of > regenerating the whole cache entry. > > > Mike McCandless > > http://blog.mikemccandless.com > > > On Thu, Nov 7, 2013 at 8:04 AM, Ravikumar Govindarajan > > wrote: > > Thanks Mike. Can you hel

Re: IndexReader close listeners and NRT

2013-11-07 Thread Ravikumar Govindarajan
(returned by IndexReader.leaves()), to play well with NRT. > > Typically you'd do so in a context that already sees each leaf, like a > custom Filter or a Collector. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Thu, Nov 7, 2013 at 1:33 AM, Ravikumar

IndexReader close listeners and NRT

2013-11-06 Thread Ravikumar Govindarajan
I am trying to cache a BitSet by attaching to IndexReader.addCloseListener, using the getCoreCacheKey() But, I find that getCoreCacheKey() returns the IndexReader object itself as the key. Whenever the IndexReader re-opens via NRT because of deletes, will it mean that my cache will be purged, bec

addIndexes and MultiReader search consistency

2013-10-25 Thread Ravikumar Govindarajan
Hi, Currently we merge 2 indexes using iw.addIndexes(idxReaders), where the same call will be made in batches of 10 readers Our requirement is to make this addIndex call consistent. That is, during this merge-time, searches using a MultiReader should not return duplicate documents[docs currently

Grouping field using pruned terms?

2013-07-24 Thread Ravikumar Govindarajan
TermFirstPassGroupingCollector loads all terms for a given group-by field, through FieldCache. Is it possible to instruct the class to group only pruned terms of a field, based on a user-supplied query [RangeQuery, TermQuery etc...] This way, only pruned terms are grouped and all others are ignor

Common Index with personalized fields

2013-05-27 Thread Ravikumar Govindarajan
We have a system where N number of users are tied to a particular lucene index. Sort of "Shared Index". But each of the N users can have their own personalized fields. Ex: Every E-mail to lucene mailing-list is a document Every user part of this mailing-list has his own set of labels for tha

Re: TermsEnum.docFreq() returns 0

2013-05-14 Thread Ravikumar Govindarajan
tests -- Ravi On Tue, May 14, 2013 at 3:31 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Tue, May 14, 2013 at 3:03 AM, Ravikumar Govindarajan > wrote: > > We ran the checkIndex and a simple test case. It passes. Actually, I had > > assumed problem wit

Re: TermsEnum.docFreq() returns 0

2013-05-14 Thread Ravikumar Govindarajan
CheckIndex on the index produced by the code below, > how many terms/freqs/positions does it report? > > Mike McCandless > > http://blog.mikemccandless.com > > > On Mon, May 13, 2013 at 9:25 AM, Ravikumar Govindarajan > wrote: > >

Re: TermsEnum.docFreq() returns 0

2013-05-13 Thread Ravikumar Govindarajan
9 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > It should not be 0, as long as TermsEnum.next() does not return null > ... can you make a small test case? Thanks. > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, May 10, 2013 at 8:26 AM, Ra

Re: TermsEnum.docFreq() returns 0

2013-05-10 Thread Ravikumar Govindarajan
Fri, May 10, 2013 at 5:54 PM, Ravikumar Govindarajan < ravikumar.govindara...@gmail.com> wrote: > We have the following code > > SegmentInfos segments = new SegmentInfos(); > segments.read(luceneDir); > for(SegmentInfoPerCommit sipc: segments) > { > String name = sipc

TermsEnum.docFreq() returns 0

2013-05-10 Thread Ravikumar Govindarajan
We have the following code SegmentInfos segments = new SegmentInfos(); segments.read(luceneDir); for(SegmentInfoPerCommit sipc: segments) { String name = sipc.info.name; SegmentReader reader = new SegmentReader(sipc, 1, new IOContext()); Terms terms = reader.terms("content"); TermsEnum tEnum = t

StackedUpdates [Lucene-4258] using stored fields possible?

2013-04-25 Thread Ravikumar Govindarajan
The stacked updates issue as in the link mentioned https://issues.apache.org/jira/browse/LUCENE-4258 handles FieldUpdates only for "new incoming values". In our case, all fields that are updated are, by default StoredFields. Currently StackedTermsEnum looks too costly on computing Term Stats. Is

Re: SloppyPhraseScorer behavior

2013-04-19 Thread Ravikumar Govindarajan
Thanks Robert for the quick response. Saved my day!!! -- Ravi On Fri, Apr 19, 2013 at 10:45 PM, Robert Muir wrote: > Its a bug: its already fixed for 4.3 (coming soon): > > https://issues.apache.org/jira/browse/LUCENE-4888 > > On Fri, Apr 19, 2013 at 1:09 PM, Ravikum

SloppyPhraseScorer behavior

2013-04-19 Thread Ravikumar Govindarajan
When writing a custom codec, I encountered an issue in SloppyPhraseScorer. I am using lucene-4.2 GA. public int nextDoc() { return advance(max.doc); } This in turn calls my DocsAndPositionsEnum.advance(int target). Initially this seems to call with advance(-1). It's kind of unsettling to see an i

Segment file clean-up and codecs

2013-03-22 Thread Ravikumar Govindarajan
Most of us writing custom codecs use the segment name as a handle and push data to a different storage. Would it be possible to get a hook in the codec APIs for when obsolete segment files are cleaned up after merges? Currently, this is always implemented as a hack -- Ravi

Re: Overall doc-count in TermStats, during flush...

2013-03-20 Thread Ravikumar Govindarajan
't use the segments doc count. > > hope that helps > > simon > > On Wed, Mar 20, 2013 at 1:12 PM, Ravikumar Govindarajan > wrote: > > This is an internal code I came across in lucene today and unable to > > decipher it. > > > > FreqProxTermsWriterPer

Re: Grouping on multiple shards possible in lucene?

2012-11-21 Thread Ravikumar Govindarajan
ion sorting, then it should be easy > to reverse the doc orders in each segment, using something like > IndexSorter. > > Shai > > On Wed, Nov 21, 2012 at 8:03 AM, Ravikumar Govindarajan < > ravikumar.govindara...@gmail.com> wrote: > > > Hi Shai, > > > > I wou

Re: Grouping on multiple shards possible in lucene?

2012-11-20 Thread Ravikumar Govindarajan
waste a > > lot of storage > > > > The default merge policy will merge adjacent segments no? Is it going to > > disturb the ordering? > > > > -- > > Ravi > > > > On Tue, Nov 20, 2012 at 5:19 PM, Michael McCandless < > > luc...@mikemccandless.com> w

Re: Grouping on multiple shards possible in lucene?

2012-11-20 Thread Ravikumar Govindarajan
could waste a lot of storage The default merge policy will merge adjacent segments no? Is it going to disturb the ordering? -- Ravi On Tue, Nov 20, 2012 at 5:19 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Tue, Nov 20, 2012 at 1:49 AM, Ravikumar Govindarajan > wr

Re: Grouping on multiple shards possible in lucene?

2012-11-19 Thread Ravikumar Govindarajan
> requests out to other shards, gather the results, call the merge, etc. > > Mike McCandless > > http://blog.mikemccandless.com > > On Fri, Nov 16, 2012 at 9:43 AM, Ravikumar Govindarajan > wrote: > > The formatter has wrecked the table... Reposting it > > > >

Re: App supplied docID in lucene possible?

2012-11-07 Thread Ravikumar Govindarajan
discussions that have happened previously on sort-docID-before-flush/sparse-doc-handling? -- Ravi On Tue, Nov 6, 2012 at 4:53 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Tue, Nov 6, 2012 at 1:04 AM, Ravikumar Govindarajan > wrote: > > Looks far more complex than

Re: App supplied docID in lucene possible?

2012-11-05 Thread Ravikumar Govindarajan
, Nov 5, 2012 at 4:37 AM, Ravikumar Govindarajan > wrote: > > Thanks Mike, > > > > Joins could be slower than docID based approach, no? > > Yes: slower at search time but faster at update time (generally not a > good tradeoff... but it seems like in your case slow updates are

Re: App supplied docID in lucene possible?

2012-11-05 Thread Ravikumar Govindarajan
> > http://blog.mikemccandless.com > > On Thu, Oct 25, 2012 at 6:10 AM, Ravikumar Govindarajan > wrote: > > We have the need to re-index some fields in our application frequently. > > > > Our typical document consists of > > > > a) Many single-valued {l

Re: App supplied docID in lucene possible?

2012-11-02 Thread Ravikumar Govindarajan
cument rather than the low-level Lucene document id. > > -- Jack Krupansky > > -Original Message- From: Ravikumar Govindarajan > Sent: Thursday, October 25, 2012 6:10 AM > To: java-user@lucene.apache.org > Subject: App supplied docID in lucene possible? > > > We have th