Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Michael McCandless Tue, 27 Nov 2012 16:12:08 -0800

Flexible indexing is the ability to make your own codec, which
controls the reading and writing of all index parts (postings, stored
fields, term vectors, deleted docs, etc.).


So for example if you want to store some postings as a bit set instead
of the block format that's the default coming up in 4.1, that's easy
to do.

But what is less easy (as I described below) is changing what is
actually stored in the postings, eg adding a new per-position
attribute.

The original goal was to allow arbitrary attributes beyond the known
docs/freqs/positions/offsets that Lucene supports today, so that you
could easily make new application-dependent per-term, per-doc,
per-position things, pull them from the analyzer, save them to the
index, and access them from an IndexReader / query, but while some
APIs do expose this, it's not very well explored yet (eg, you'd have
to make a custom indexing chain to get the attributes "through"
IndexWriter down to your codec).  It would be great to make progress
making this easier, so ideas are very welcome :)

Mike McCandless

http://blog.mikemccandless.com

On Tue, Nov 27, 2012 at 3:37 PM, Wu, Stephen T., Ph.D.
<wu.step...@mayo.edu> wrote:
> Following up on a previous question...
> What is "flexible indexing" in Lucene 4.0?  We assumed it was the ability to
> easily make new postings formats/codecs -- but a response below says that
> would be "tricky"?
>
> stephen
>
>
> On 11/27/12 11:48 AM, "David Causse" <dcau...@spotter.com> wrote:
>
>> Hi,
>>
>> We use payloads but we can't use the whole lucene API.
>> For example we use it to do some relation query for example :
>>
>> @quote(@speaker(obama) @discourse(health))
>>
>> Search for all documents that contains a quote by Obama talking about
>> health.
>> We encode linguistic informations (standoff annotations) inside payloads
>> and use custom search API to query the index.
>> I didn't found a convenable way to attach my code to lucene
>> Query/Scorer/Weight API. Like SpanQuery you have to rewrite the whole
>> Query stack.
>> In short if you want to go with Payloads that do more than boosting a
>> term there's chances that you'll need to rewrite a big part of the query
>> stack.
>>
>>
>> Le 27/11/2012 16:59, Wu, Stephen T., Ph.D. a écrit :
>>> I think we're looking at doing something related.  I haven't explored the
>>> Enums or know how to make a postings codec... But what is "flexible
>>> indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
>>>
>>> We're trying to incorporate attributes onto terms/spans in indexes.  We'd
>>> also like to try out some interesting ways to score things that go beyond
>>> just tokens.
>>>
>>> We were considering using Attributes instead of Payloads, because it seems
>>> like using Payloads ties you to a particular kind of scoring -- just a
>>> weight on a token.  Can Payloads be used for more general scoring functions?
>>> E.g., considering a span of text alongside multiple Payloads?
>>>
>>> Does it make sense to move outside of Payloads here?
>>>
>>> Thanks!
>>>
>>> stephen
>>>
>>>
>>>
>>>
>>> On 11/19/12 8:14 AM, "Michael McCandless" <luc...@mikemccandless.com> wrote:
>>>
>>>> A new postings format would be tricky because you have new attributes
>>>> you want to index.
>>>>
>>>> The DocsAndPositionsEnum does have an attributes source, but this is
>>>> not well explored, and there are known problems (they can't be easily
>>>> merged in the composite reader case).
>>>>
>>>> So that's why I suggested packing your information into a payload ...
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy <wuqiu....@qq.com> wrote:
>>>>> thx, mike.
>>>>> about the 3th question, "encode them all into the payload" is better than
>>>>> "a new postings format with the codec" ??
>>>>> I mean replace the orginal posting item (position, startOffset, endOffset,
>>>>> payload) with my own inverted item such as
>>>>> class TestPostingItem
>>>>> {
>>>>>          int termId;
>>>>>          long startOffset;
>>>>>          long endOffset;
>>>>>          float score;
>>>>>          int segId;
>>>>>          long timeStamp;
>>>>> }
>>>>> ?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsA
>>>>> nd
>>>>> PositionsEnum-for-tp4020933p4020968.html
>>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

Reply via email to