Re: Anticipating a benchmark for direct posting format

Benson Margulies Mon, 07 Apr 2014 10:48:12 -0700

Eventually, I'll care about how to set this up in Solr. For now I
think I'll see if I can figure out the luceneutil benchmark.




On Mon, Apr 7, 2014 at 1:02 PM, Shai Erera <ser...@gmail.com> wrote:
> Given that DPF delegates indexing to another PF anyway (currently Lucene41),
> I think this might be the case. We would need to test of course. The key
> point is that this FilterAtomicReader will be able to serve anything as
> direct, even DV, so it might eliminate DVF too. We need to experiment and
> benchmark!
>
> Shai
>
> On Apr 7, 2014 7:32 PM, "david.w.smi...@gmail.com"
> <david.w.smi...@gmail.com> wrote:
>>
>> Aaaah, nice idea to simply use FilterAtomicReader — of course!  So this
>> would ultimately be a new IndexReaderFactory that creates
>> FilterAtomicReaders for a subset of the fields you want to do this on.
>> Cool!  With that, I don’t think there would be a need for
>> DirectPostingsFormat as a postings format, would there be?
>>
>> ~ David
>>
>>
>> On Mon, Apr 7, 2014 at 10:58 AM, Shai Erera <ser...@gmail.com> wrote:
>>>
>>> The only problem is how the Codec makes a dynamic decision on whether to
>>> use the wrapped Codec for reading vs pre-load data into in-memory
>>> structures, because Codecs are loaded through reflection by the SPI loading
>>> mechanism.
>>>
>>> There is also a TODO in DirectPF to allow wrapping arbitrary PFs, just
>>> mentioning in case you want to tackle DPF.
>>>
>>> I think that if we allowed passing something like a CodecLookupService,
>>> with an SPILookupService default impl, you could easily pass that to
>>> DirectoryReader which will use your runtime logic to load the right PF (e.g.
>>> DPF) instead of the one the index was created with.
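
One way to picture the lookup-service idea is the sketch below. It is only an illustration under assumptions: FormatLookupService, RegistryLookupService, OverridingLookupService and PostingsFormatStub are hypothetical names invented here, not Lucene APIs, and a plain map stands in for the SPI loader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical stand-in for a postings format; NOT the Lucene class.
interface PostingsFormatStub {
    String name();
}

// Hypothetical read-time hook, modeled on the "CodecLookupService" idea.
interface FormatLookupService {
    PostingsFormatStub lookup(String nameRecordedInIndex);
}

// Default behaviour: hand back the format the index was written with.
// A map stands in for the SPI loader here (assumption for illustration).
final class RegistryLookupService implements FormatLookupService {
    private final Map<String, PostingsFormatStub> registry = new HashMap<>();

    void register(PostingsFormatStub format) {
        registry.put(format.name(), format);
    }

    @Override
    public PostingsFormatStub lookup(String name) {
        PostingsFormatStub format = registry.get(name);
        if (format == null) {
            throw new IllegalArgumentException("unknown format: " + name);
        }
        return format;
    }
}

// Runtime override: wrap the on-disk format in an in-memory variant
// whenever the caller-supplied policy asks for it.
final class OverridingLookupService implements FormatLookupService {
    private final FormatLookupService delegate;
    private final Predicate<String> loadInMemory;

    OverridingLookupService(FormatLookupService delegate, Predicate<String> loadInMemory) {
        this.delegate = delegate;
        this.loadInMemory = loadInMemory;
    }

    @Override
    public PostingsFormatStub lookup(String name) {
        PostingsFormatStub onDisk = delegate.lookup(name);
        if (!loadInMemory.test(name)) {
            return onDisk;
        }
        // The wrapper only changes the name in this sketch; a real one
        // would preload the wrapped format's data into heap structures.
        return () -> "Direct(" + onDisk.name() + ")";
    }
}

final class LookupDemo {
    public static void main(String[] args) {
        RegistryLookupService spi = new RegistryLookupService();
        spi.register(() -> "Lucene41");
        FormatLookupService svc = new OverridingLookupService(spi, n -> n.equals("Lucene41"));
        System.out.println(svc.lookup("Lucene41").name());
    }
}
```

The point of the design is that the decision moves from index time to open time: the index still records only the on-disk format, and the policy passed to the lookup service decides per reader whether to wrap it.
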
>>>
>>> But it sounds like the core problem is that when we load a Codec/PF/DVF
>>> for reading, we cannot pass it any arguments, and so we must make an
>>> index-time decision about how we're going to read the data later on. If we
>>> could somehow support that, I think that will help you to achieve what you
>>> want too.
>>>
>>> E.g. currently it's an all-or-nothing decision, but if we could pass a
>>> parameter like "50% available heap", the Codec/PF/DVF could cache the
>>> frequently accessed postings instead of loading all of them into memory.
>>> But, that can also be achieved at the IndexReader level, through a custom
>>> FilterAtomicReader. And if you could reuse DPF's structures (like
>>> DirectTermsEnum, DirectFields...), it should be easier to do this. So
>>> perhaps we can think about a DirectAtomicReader which does that? I believe
>>> it can share some code w/ DPF, as long as we don't make these APIs public,
>>> or make them @super.experimental and @super.expert.
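
A rough sketch of the "cache the frequently accessed postings under a heap budget" idea, assuming a simple LRU policy. BudgetedPostingsCache is a hypothetical class invented for this illustration, postings are modeled as bare int[] doc-id arrays, and the size estimate ignores object headers. A reader-level version would presumably have to hook in at the Fields/Terms level rather than wrap a plain function.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical: keeps decoded postings per term in memory up to a byte
// budget and falls back to the (slower) on-disk loader on a miss.
final class BudgetedPostingsCache {
    private final Function<String, int[]> onDiskLoader;
    private final long budgetBytes;
    private long usedBytes = 0;
    // accessOrder=true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, int[]> cache =
        new LinkedHashMap<>(16, 0.75f, true);

    BudgetedPostingsCache(Function<String, int[]> onDiskLoader, long budgetBytes) {
        this.onDiskLoader = onDiskLoader;
        this.budgetBytes = budgetBytes;
    }

    int[] postings(String term) {
        int[] hit = cache.get(term); // get() also marks the entry as recently used
        if (hit != null) {
            return hit;
        }
        int[] loaded = onDiskLoader.apply(term);
        long size = sizeOf(loaded);
        if (size > budgetBytes) {
            return loaded; // too large to cache at all; serve it uncached
        }
        // Evict least recently used entries until the new one fits.
        Iterator<Map.Entry<String, int[]>> it = cache.entrySet().iterator();
        while (usedBytes + size > budgetBytes && it.hasNext()) {
            usedBytes -= sizeOf(it.next().getValue());
            it.remove();
        }
        cache.put(term, loaded);
        usedBytes += size;
        return loaded;
    }

    boolean isCached(String term) {
        return cache.containsKey(term); // containsKey does not affect LRU order
    }

    // Crude estimate: 4 bytes per doc id, array header ignored.
    private static long sizeOf(int[] docs) {
        return 4L * docs.length;
    }
}
```

With a budget equal to the full postings size this degenerates to the all-in-memory behavior; with a smaller budget only the hot terms stay on heap, which is the middle ground between the two extremes discussed above.
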
>>>
>>> Just throwing some ideas...
>>>
>>> Shai
>>>
>>>
>>> On Mon, Apr 7, 2014 at 5:35 PM, david.w.smi...@gmail.com
>>> <david.w.smi...@gmail.com> wrote:
>>>>
>>>> Benson, I like your idea.
>>>>
>>>> I think your idea can be achieved as a codec, one that wraps another
>>>> codec that establishes the on-disk format.  By default the wrapped codec can
>>>> be Lucene’s default codec.  I think, if implemented, this would be a change
>>>> to DPF instead of an additional DPF-variant codec.
>>>>
>>>> ~ David
>>>>
>>>>
>>>> On Mon, Apr 7, 2014 at 9:22 AM, Benson Margulies <bimargul...@gmail.com>
>>>> wrote:
>>>>>
>>>>> On Mon, Apr 7, 2014 at 9:14 AM, Robert Muir <rcm...@gmail.com> wrote:
>>>>> > On Thu, Apr 3, 2014 at 12:27 PM, Benson Margulies
>>>>> > <bimargul...@gmail.com> wrote:
>>>>> >
>>>>> >>
>>>>> >> My takeaway from the prior conversation was that various people didn't
>>>>> >> entirely believe that I'd seen a dramatic improvement in query
>>>>> >> performance using D-P-F, and so would not smile upon a patch intended
>>>>> >> to liberate D-P-F from codecs. It could be that the effect I saw has
>>>>> >> to do with the fact that our system depends on hitting and scoring 50%
>>>>> >> of the documents in an index with a lot of documents.
>>>>> >>
>>>>> >
>>>>> > I don't understand the word "liberate" here. Why is it such a problem
>>>>> > that this is a codec?
>>>>>
>>>>> I don't want to have to declare my intentions at the time I create
>>>>> the index, and I don't want to have to use D-P-F for all readers all the
>>>>> time. I want to be able to open an index with an arbitrary on-disk
>>>>> format and get the in-memory cache behavior of D-P-F. Thus 'liberate':
>>>>> split the question of 'keep a copy in memory' from the choice of the
>>>>> on-disk format.
>>>>>
>>>>>
>>>>> >
>>>>> > I do not think we should give it any more status than that; it wastes
>>>>> > too much RAM.
>>>>>
>>>>> It didn't seem like 'waste' when it solved a big practical problem for us. We
>>>>> had an application that was too slow, and had plenty of RAM available,
>>>>> and we were able to trade space for time by applying D-P-F.
>>>>>
>>>>> Maybe I'm going about this backwards; if I can come up with a small,
>>>>> inconspicuous proposed change that does what I want, there won't be
>>>>> any disagreement.
>>>>>
>>>>>
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>>>>> >
>>>>>
>>>>
>>>
>>
>

