Re: Anticipating a benchmark for direct posting format

Benson Margulies Mon, 07 Apr 2014 15:54:07 -0700

Interested / easily amused parties are welcome to observe the
proceedings in https://github.com/apache/lucene-solr/pull/44. It's a
PR _only_ to offer visibility! So far, I've got a 'delegating codec'
that interposes the direct posting idea atop any other codec. Next
comes the filtering.
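
For anyone following along without the PR open, here is a minimal sketch of
the delegation-plus-filtering idea in plain Java with no Lucene dependency.
Every name in it (PostingsSource, DirectDelegatingSource, directFields) is
hypothetical and not taken from the PR or from Lucene. It also caches lazily
for brevity, whereas a real direct format would more likely load at open time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for whatever a codec exposes for reading; not a Lucene type.
interface PostingsSource {
    int[] docIds(String field, String term);
}

// Wraps any PostingsSource and keeps a decoded copy in memory for selected fields.
final class DirectDelegatingSource implements PostingsSource {
    private final PostingsSource delegate;
    private final Set<String> directFields;
    private final Map<String, int[]> cache = new HashMap<>();

    DirectDelegatingSource(PostingsSource delegate, Set<String> directFields) {
        this.delegate = delegate;
        this.directFields = directFields;
    }

    @Override
    public int[] docIds(String field, String term) {
        if (!directFields.contains(field)) {
            // Fields outside the filter pass straight through to the wrapped source.
            return delegate.docIds(field, term);
        }
        // NUL separator keeps ("a", "bc") distinct from ("ab", "c").
        String key = field + '\u0000' + term;
        return cache.computeIfAbsent(key, k -> delegate.docIds(field, term));
    }
}
```

The point of the shape is that the wrapped source alone decides the on-disk
format, while the wrapper alone decides what is held in memory.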


I'm not sure that I ever concisely reported the situation that got me
started on this: a profile in which _time in the codec_ dominated my
application. So RAMDirectory was useless, since it removes no codec CPU
time, but D-P-F did the job, since it does.
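
A toy illustration of that distinction, as a sketch only: the delta/vInt
encoding below is a generic example of compressed postings, not a description
of any particular Lucene format, and the class and method names are made up.
Holding the encoded bytes in RAM still costs a decode on every pass, while
holding the decoded ints costs only an array walk.

```java
import java.io.ByteArrayOutputStream;

final class VIntDemo {
    // Encode sorted doc ids as deltas, 7 bits per byte, high bit meaning "more follows".
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocIds) {
            int delta = doc - prev;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            prev = doc;
        }
        return out.toByteArray();
    }

    // Bytes already in memory (the RAMDirectory case): every pass decodes again.
    static long sumByDecoding(byte[] encoded) {
        long sum = 0;
        int doc = 0;
        int pos = 0;
        while (pos < encoded.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = encoded[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            doc += delta;
            sum += doc;
        }
        return sum;
    }

    // Decoded ints held in memory (the direct case): a plain array walk.
    static long sumDirect(int[] docIds) {
        long sum = 0;
        for (int doc : docIds) {
            sum += doc;
        }
        return sum;
    }
}
```

Both methods return the same answer for the same postings; only the per-pass
work differs, which is the part a profile would attribute to the codec.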


On Mon, Apr 7, 2014 at 5:34 PM, Benson Margulies <bimargul...@gmail.com> wrote:
> On Mon, Apr 7, 2014 at 5:32 PM, Alan Woodward <a...@flax.co.uk> wrote:
>> Does FilterDirectoryReader do what you want?
>> https://lucene.apache.org/core/4_7_1/core/org/apache/lucene/index/FilterDirectoryReader.html
>
> Yes, indeed, precisely what the doctor ordered.
>
>>
>> Alan Woodward
>> www.flax.co.uk
>>
>>
>> On 7 Apr 2014, at 22:19, Benson Margulies wrote:
>>
>> Typically, an app gets a directory reader, which is a composite
>> reader. To get a filter down there into the leaves of the composite
>> reader, does anyone have a suggestion about where to enter the
>> modularity?
>>
>> I sort of want to insert myself at
>> org.apache.lucene.index.StandardDirectoryReader#open(org.apache.lucene.store.Directory,
>> org.apache.lucene.index.IndexCommit) wrapping the segment readers, or
>> I could make a sort of filtering composite reader that wraps each of
>> the segment readers in a filter.
>>
>>
>> On Mon, Apr 7, 2014 at 1:02 PM, Shai Erera <ser...@gmail.com> wrote:
>>
>> Given that DPF delegates indexing to another PF anyway (currently Lucene41),
>> I think this might be the case. We would need to test of course. The key
>> point is that this FilterAtomicReader will be able to serve anything as
>> direct, even DV, so it might eliminate DVF too. We need to experiment and
>> benchmark!
>>
>> Shai
>>
>> On Apr 7, 2014 7:32 PM, "david.w.smi...@gmail.com"
>> <david.w.smi...@gmail.com> wrote:
>>
>> Aaaah, nice idea to simply use FilterAtomicReader, of course!  So this
>> would ultimately be a new IndexReaderFactory that creates
>> FilterAtomicReaders for a subset of the fields you want to do this on.
>> Cool!  With that, I don’t think there would be a need for
>> DirectPostingsFormat as a postings format, would there be?
>>
>> ~ David
>>
>> On Mon, Apr 7, 2014 at 10:58 AM, Shai Erera <ser...@gmail.com> wrote:
>>
>> The only problem is how the Codec makes a dynamic decision on whether to
>> use the wrapped Codec for reading vs pre-load data into in-memory
>> structures, because Codecs are loaded through reflection by the SPI loading
>> mechanism.
>>
>> There is also a TODO in DirectPF to allow wrapping arbitrary PFs, just
>> mentioning in case you want to tackle DPF.
>>
>> I think that if we allowed passing something like a CodecLookupService,
>> with an SPILookupService default impl, you could easily pass that to
>> DirectoryReader which will use your runtime logic to load the right PF (e.g.
>> DPF) instead of the one the index was created with.
>>
>> But it sounds like the core problem is that when we load a Codec/PF/DVF
>> for reading, we cannot pass it any arguments, and so we must make an
>> index-time decision about how we're going to read the data later on. If we
>> could somehow support that, I think that will help you to achieve what you
>> want too.
>>
>> E.g. currently it's an all-or-nothing decision, but if we could pass a
>> parameter like "50% available heap", the Codec/PF/DVF could cache the
>> frequently accessed postings instead of loading all of them into memory.
>> But, that can also be achieved at the IndexReader level, through a custom
>> FilterAtomicReader. And if you could reuse DPF's structures (like
>> DirectTermsEnum, DirectFields...), it should be easier to do this. So
>> perhaps we can think about a DirectAtomicReader which does that? I believe
>> it can share some code w/ DPF, as long as we don't make these APIs public,
>> or make them @super.experimental and @super.expert.
>>
>> Just throwing some ideas...
>>
>> Shai
>>
>> On Mon, Apr 7, 2014 at 5:35 PM, david.w.smi...@gmail.com
>> <david.w.smi...@gmail.com> wrote:
>>
>> Benson, I like your idea.
>>
>> I think your idea can be achieved as a codec, one that wraps another
>> codec that establishes the on-disk format.  By default the wrapped codec can
>> be Lucene’s default codec.  I think, if implemented, this would be a change
>> to DPF instead of an additional DPF-variant codec.
>>
>> ~ David
>>
>> On Mon, Apr 7, 2014 at 9:22 AM, Benson Margulies <bimargul...@gmail.com>
>> wrote:
>>
>> On Mon, Apr 7, 2014 at 9:14 AM, Robert Muir <rcm...@gmail.com> wrote:
>>
>> On Thu, Apr 3, 2014 at 12:27 PM, Benson Margulies
>> <bimargul...@gmail.com> wrote:
>>
>> My takeaway from the prior conversation was that various people didn't
>> entirely believe that I'd seen a dramatic improvement in query performance
>> using D-P-F, and so would not smile upon a patch intended to liberate
>> D-P-F from codecs. It could be that the effect I saw has to do with
>> the fact that our system depends on hitting and scoring 50% of the
>> documents in an index with a lot of documents.
>>
>> I don't understand the word "liberate" here. Why is it such a problem
>> that this is a codec?
>>
>> I don't want to have to declare my intentions at the time I create
>> the index. I don't want to have to use D-P-F for all readers all the
>> time. Because I want to be able to decide to open up an index with an
>> arbitrary on-disk format and get the in-memory cache behavior of
>> D-P-F. Thus 'liberate' -- split the question of 'keep a copy in
>> memory' from the choice of the on-disk format.
>>
>> I do not think we should give it any more status than that; it wastes
>> too much RAM.
>>
>> It didn't seem like 'waste' when it solved a big practical problem for
>> us. We had an application that was too slow, and had plenty of RAM
>> available, and we were able to trade space for time by applying D-P-F.
>>
>> Maybe I'm going about this backwards; if I can come up with a small,
>> inconspicuous proposed change that does what I want, there won't be
>> any disagreement.
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>
>> For additional commands, e-mail: dev-h...@lucene.apache.org
