What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-27 Thread Wu, Stephen T., Ph.D.
Following up on a previous question...
What is "flexible indexing" in Lucene 4.0?  We assumed it was the ability to
easily make new postings formats/codecs -- but a response below says that
would be "tricky"?

stephen


On 11/27/12 11:48 AM, "David Causse"  wrote:

> Hi,
> 
> We use payloads, but we can't use the whole Lucene API.
> For example, we use them for relation queries such as:
> 
> @quote(@speaker(obama) @discourse(health))
> 
> This searches for all documents that contain a quote by Obama talking about
> health.
> We encode linguistic information (standoff annotations) inside payloads
> and use a custom search API to query the index.
> I didn't find a convenient way to attach my code to Lucene's
> Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole
> Query stack.
> In short, if you want to use payloads that do more than boost a
> term, chances are that you'll need to rewrite a big part of the query
> stack.
> 
> 
> On 27/11/2012 16:59, Wu, Stephen T., Ph.D. wrote:
>> I think we're looking at doing something related.  I haven't explored the
>> Enums or know how to make a postings codec... But what is "flexible
>> indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
>> 
>> We're trying to incorporate attributes onto terms/spans in indexes.  We'd
>> also like to try out some interesting ways to score things that go beyond
>> just tokens.
>> 
>> We were considering using Attributes instead of Payloads, because it seems
>> like using Payloads ties you to a particular kind of scoring -- just a
>> weight on a token.  Can Payloads be used for more general scoring functions?
>> E.g., considering a span of text alongside multiple Payloads?
>> 
>> Does it make sense to move outside of Payloads here?
>> 
>> Thanks!
>> 
>> stephen
>> 
>> 
>> 
>> 
>> On 11/19/12 8:14 AM, "Michael McCandless"  wrote:
>> 
>>> A new postings format would be tricky because you have new attributes
>>> you want to index.
>>> 
>>> The DocsAndPositionsEnum does have an attributes source, but this is
>>> not well explored, and there are known problems (they can't be easily
>>> merged in the composite reader case).
>>> 
>>> So that's why I suggested packing your information into a payload ...
>>> 
>>> Mike McCandless
>>> 
>>> http://blog.mikemccandless.com
>>> 
>>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy  wrote:
>>>> Thanks, Mike.
>>>> About the 3rd question: is "encode them all into the payload" better than
>>>> "a new postings format with the codec"?
>>>> I mean replacing the original posting item (position, startOffset, endOffset,
>>>> payload) with my own inverted item such as
>>>> class TestPostingItem
>>>> {
>>>>  int termId;
>>>>  long startOffset;
>>>>  long endOffset;
>>>>  float score;
>>>>  int segId;
>>>>  long timeStamp;
>>>> }
>>>> ?
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>> 
>>>> -
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>> 
> 





Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-27 Thread Michael McCandless
Flexible indexing is the ability to make your own codec, which
controls the reading and writing of all index parts (postings, stored
fields, term vectors, deleted docs, etc.).

So for example if you want to store some postings as a bit set instead
of the block format that's the default coming up in 4.1, that's easy
to do.
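
To make the "same postings, different physical form" idea concrete, here is a minimal sketch in plain Java. It does not use Lucene's codec API at all: the class and method names (`PostingsShapes`, `toBitSet`, `toDeltas`, and so on) are hypothetical, and delta encoding is only a stand-in for a list-style layout, not a description of the actual block format.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
final class PostingsShapes {

    /** Bit-set form: bit i is set when document i contains the term. */
    static BitSet toBitSet(int[] sortedDocIds) {
        BitSet bits = new BitSet();
        for (int docId : sortedDocIds) {
            bits.set(docId);
        }
        return bits;
    }

    /** List form: store the gap between consecutive (sorted) doc IDs. */
    static int[] toDeltas(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return deltas;
    }

    /** Reads the doc IDs back out of the bit-set form, in increasing order. */
    static List<Integer> fromBitSet(BitSet bits) {
        List<Integer> docIds = new ArrayList<>();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            docIds.add(i);
        }
        return docIds;
    }

    /** Reads the doc IDs back out of the delta form by summing the gaps. */
    static List<Integer> fromDeltas(int[] deltas) {
        List<Integer> docIds = new ArrayList<>();
        int current = 0;
        for (int delta : deltas) {
            current += delta;
            docIds.add(current);
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 42};
        // Both physical forms decode to the same logical list of doc IDs.
        System.out.println(fromBitSet(toBitSet(docIds)));
        System.out.println(fromDeltas(toDeltas(docIds)));
    }
}
```

Both forms carry exactly the same information (which documents contain the term); only the bytes on disk would differ, which is the part a codec controls.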

But what is less easy (as I described below) is changing what is
actually stored in the postings, e.g. adding a new per-position
attribute.

The original goal was to allow arbitrary attributes beyond the known
docs/freqs/positions/offsets that Lucene supports today, so that you
could easily make new application-dependent per-term, per-doc, or
per-position things, pull them from the analyzer, save them to the
index, and access them from an IndexReader / query.  While some
APIs do expose this, it's not very well explored yet (e.g., you'd have
to make a custom indexing chain to get the attributes "through"
IndexWriter down to your codec).  It would be great to make progress
making this easier, so ideas are very welcome :)
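
Since payloads remain the practical route for extra per-position data, the core of that approach is serializing your own fields into a byte array. Below is a minimal sketch using only `java.nio.ByteBuffer`. The `PackedPosting` class and its fixed 16-byte layout are hypothetical; the fields loosely follow the `TestPostingItem` example quoted earlier in the thread. Attaching the bytes to a token as its payload (from a custom TokenFilter) is not shown.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-width layout for extra per-position data; not a Lucene class. */
final class PackedPosting {
    /** 4 bytes (float) + 4 bytes (int) + 8 bytes (long). */
    static final int BYTES = Float.BYTES + Integer.BYTES + Long.BYTES;

    final float score;
    final int segId;
    final long timeStamp;

    PackedPosting(float score, int segId, long timeStamp) {
        this.score = score;
        this.segId = segId;
        this.timeStamp = timeStamp;
    }

    /** Serializes the fields in a fixed order (big-endian, ByteBuffer's default). */
    byte[] toPayload() {
        return ByteBuffer.allocate(BYTES)
                .putFloat(score)
                .putInt(segId)
                .putLong(timeStamp)
                .array();
    }

    /** Reads the fields back in the same order they were written. */
    static PackedPosting fromPayload(byte[] payload) {
        if (payload.length != BYTES) {
            throw new IllegalArgumentException("expected " + BYTES + " bytes, got " + payload.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        float score = buf.getFloat();
        int segId = buf.getInt();
        long timeStamp = buf.getLong();
        return new PackedPosting(score, segId, timeStamp);
    }

    public static void main(String[] args) {
        byte[] payload = new PackedPosting(1.5f, 7, 1354000000000L).toPayload();
        PackedPosting decoded = fromPayload(payload);
        System.out.println(payload.length + " bytes, segId=" + decoded.segId);
    }
}
```

A fixed-width layout keeps decoding trivial at search time; a variable-length encoding would save space at the cost of more careful parsing.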

Mike McCandless

http://blog.mikemccandless.com


Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Johannes.Lichtenberger


Regarding my question/thread: is it also possible to change the backend
system? I'd like to use Lucene for a versioned DBMS, so I would need
the ability to serialize/deserialize the bytes in my backend, where
keys/values are stored in pages (for instance in an upcoming B+-tree, or
in simple "unordered" pages via a record-ID/record mapping). But as no
one has suggested anything so far, and I also asked a year or so ago,
after implementing the B+-tree I will probably have to implement my own
data structure and parser/tokenizer/stemmer... :-(


kind regards,
Johannes





Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Jack Krupansky
"I will probably have to implement my own datastructure and 
parser/tokenizer/stemmer"


Why? I mean, I think the point of the Lucene architecture is that the codec 
level is completely independent of the analysis level.


The end result of analysis is a value to be stored from the application 
perspective, a "logical value" so to speak, but NOT the bit sequence, the 
"physical value" so to speak, that the codec will actually store.


So, go ahead and have your own codec that does whatever it wants with 
values, but the input for storage and query should be the output of a 
standard Lucene analyzer.


-- Jack Krupansky




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Wu, Stephen T., Ph.D.
Is there any (preliminary) code checked in somewhere that I can look at,
that would help me understand the practical issues that would need to be
addressed?

If I understand you correctly, it's a little different from what's happening
in your blog posts:
http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
Those posts deal with making your own codec, but not about changing what's
stored in the postings?  I guess I misunderstood "postings format" before.

stephen


Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Michael McCandless
On Fri, Nov 30, 2012 at 12:25 PM, Wu, Stephen T., Ph.D.
 wrote:
> Is there any (preliminary) code checked in somewhere that I can look at,
> that would help me understand the practical issues that would need to be
> addressed?
>
> If I understand you correctly, it's a little different from what's happening
> in your blog posts:
> http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
> http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
> Those posts deal with making your own codec, but not about changing what's
> stored in the postings?  I guess I misunderstood "postings format" before.

I don't know of any examples of adding an entirely new attribute to
the postings, except via payloads.

All the examples we have are of Codecs/PostingsFormats/etc. storing
all the usual attributes (term & its stats (docFreq/totalTermFreq),
doc, freq, position, offsets, payload) in "interesting" ways.

Maybe we can make this more concrete: what new attribute are you
needing to record in the postings and access at search time?

Mike McCandless

http://blog.mikemccandless.com




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Wu, Stephen T., Ph.D.
>> Is there any (preliminary) code checked in somewhere that I can look at,
>> that would help me understand the practical issues that would need to be
>> addressed?
> 
> Maybe we can make this more concrete: what new attribute are you
> needing to record in the postings and access at search time?

For example: 
 - part of speech of a token.
 - syntactic parse subtree (over a span).
 - semantically normalized phrase (to canonical text or ontological code).
 - semantic group (of a span).
 - coreference link.

stephen





Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Glen Newton
+10

These are the kind of things you can do in GATE[1] using annotations[2].
A VERY useful feature.

-Glen

[1]http://gate.ac.uk
[2]http://gate.ac.uk/wiki/jape-repository/annotations.html




-- 
-
http://zzzoot.blogspot.com/
-




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread wgggfiy
Thanks very much!
LingPipe and GATE are very useful, and new to me,
but would it be too large an undertaking to implement a custom posting item such as
class TestPostingItem
{
int termId;
long startOffset;
long endOffset;
float score;
int segId;
long timeStamp;
} ?



-
--
Email: wuqiu.m...@qq.com
--
--
View this message in context: 
http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4026571.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread lukai
Do we have any plan to decouple the indexing process?

Lucene was designed for search, but judging by the questions people ask in
this thread, the needs sometimes go beyond search functionality. For example,
we might want to customize our scoring function based on payloads. Sometimes
I don't need to store TF/IDF information: we can pre-calculate features and
store them in the system, yet I still have to store the extra TF/IDF
information. And sometimes I think we want to load the whole postings into
memory to speed things up. In those cases, we really want to customize the
functionality/process of the inverted index. The main problem is that the
implementation is highly coupled with the indexing chain, so it's not easy to
write a new one. Do we have a plan to make the indexing chain easier to
change?

Flexible indexing chain logic, flexible codec formats.

Thanks,





Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Michael McCandless
On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
 wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
>  - part of speech of a token.
>  - syntactic parse subtree (over a span).
>  - semantically normalized phrase (to canonical text or ontological code).
>  - semantic group (of a span).
>  - coreference link.

So for example part-of-speech is a per-Token-position attribute.

Today the easiest way to handle this is to encode these attributes
into a Payload, which is straightforward (make a custom TokenFilter
that creates the payload).

At search time you would then use e.g. PayloadTermQuery to decode the
Payload and do something with it to alter how the query is being
scored.
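
As a rough sketch of the encode/decode halves of this approach, the plain-Java example below stores a part-of-speech tag in a one-byte payload and maps it back to a score boost. Everything here is an assumption made for illustration: the `PosPayload` class, the `Pos` enum, and the boost values are hypothetical, and the Lucene wiring (a custom TokenFilter at index time, a payload-aware query at search time) is not shown.

```java
/** Hypothetical part-of-speech payload codec; not a Lucene class. */
final class PosPayload {

    /** A deliberately tiny tag set, just for the example. */
    enum Pos { NOUN, VERB, ADJECTIVE, OTHER }

    /** Index time: one byte per token position is enough for a small tag set. */
    static byte[] encode(Pos pos) {
        return new byte[] { (byte) pos.ordinal() };
    }

    /** Search time: map the byte back to a tag, falling back to OTHER on bad input. */
    static Pos decode(byte[] payload) {
        if (payload == null || payload.length != 1) {
            return Pos.OTHER;
        }
        int index = payload[0] & 0xFF;
        Pos[] values = Pos.values();
        return index < values.length ? values[index] : Pos.OTHER;
    }

    /** Arbitrary example boosts; a real application would choose its own. */
    static float boostFor(Pos pos) {
        switch (pos) {
            case NOUN:
                return 2.0f;
            case VERB:
                return 1.5f;
            default:
                return 1.0f;
        }
    }

    public static void main(String[] args) {
        byte[] payload = encode(Pos.VERB);
        System.out.println(decode(payload) + " boost=" + boostFor(decode(payload)));
    }
}
```

Storing the enum ordinal is compact but brittle: reordering the enum silently changes the meaning of existing indexes, so a real implementation would probably assign explicit, stable codes.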

For the span-like attributes (eg a syntactic parse, semantically
normalized phrase) I think you'd need to do something like
SynonymFilter in your analysis, i.e. insert new tokens at the position
where the span started.  Unfortunately, Lucene doesn't properly index
spans (it records the start position but not the end position), so
that limits what kind of matching you can do at search time.
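
One conceivable way around the missing end position (an assumption for illustration, not something proposed in this thread) is to store the span's length in the payload of the token inserted at the span's start, then reconstruct the end at search time. The `SpanPayload` class below is hypothetical plain Java and says nothing about how a query would consume the result.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper that carries a span's length in a payload; not a Lucene class. */
final class SpanPayload {

    /** Encodes the number of token positions the span covers (at least 1). */
    static byte[] encodeLength(int positionLength) {
        if (positionLength < 1) {
            throw new IllegalArgumentException("a span must cover at least one position");
        }
        return ByteBuffer.allocate(Integer.BYTES).putInt(positionLength).array();
    }

    /** Recovers the exclusive end position from the indexed start position and the payload. */
    static int endPosition(int startPosition, byte[] payload) {
        return startPosition + ByteBuffer.wrap(payload).getInt();
    }

    /** True when tokenPosition lies inside the half-open span [start, end). */
    static boolean covers(int startPosition, byte[] payload, int tokenPosition) {
        return tokenPosition >= startPosition
                && tokenPosition < endPosition(startPosition, payload);
    }

    public static void main(String[] args) {
        // An annotation token indexed at position 5 that spans 3 token positions.
        byte[] payload = encodeLength(3);
        System.out.println("end=" + endPosition(5, payload));
    }
}
```

The catch is the one raised elsewhere in this thread: the stock query classes would not interpret such a payload, so matching on the reconstructed span would still need custom search-side code.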

Mike McCandless

http://blog.mikemccandless.com




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Michael McCandless
On Wed, Dec 12, 2012 at 9:08 PM, lukai  wrote:
> Do we have any plan to decouple the indexing process?
>
> Lucene was designed for search, but judging by the questions people ask in
> this thread, the needs sometimes go beyond search functionality. For example,
> we might want to customize our scoring function based on payloads. Sometimes
> I don't need to store TF/IDF information: we can pre-calculate features and
> store them in the system, yet I still have to store the extra TF/IDF
> information. And sometimes I think we want to load the whole postings into
> memory to speed things up. In those cases, we really want to customize the
> functionality/process of the inverted index.

Much of this can already be done with Lucene.  Eg, plug in your own
Similarity to get custom scoring (and we already have a bunch of
standard models ... TF/IDF (default), BM25, DFR, language models,
etc.).  Use MemoryPostingsFormat to pull everything into RAM.
Customize other parts of the index using your own Codec.
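
To show what "plug in your own scoring" means in the abstract, here is a small plain-Java sketch in which two scoring models sit behind one interface. It is not Lucene's Similarity API: the `ScoringSketch` class, the `TermScorer` interface, and its parameter list are hypothetical, and the formulas are common textbook-style variants of TF/IDF and BM25 rather than a claim about what any Lucene version computes exactly.

```java
/** Hypothetical pluggable-scoring illustration; not Lucene's Similarity API. */
final class ScoringSketch {

    /** Scores one term in one document from a handful of statistics. */
    interface TermScorer {
        double score(double freq, long docFreq, long numDocs, double docLen, double avgDocLen);
    }

    /** TF/IDF variant: sqrt(freq) * (1 + ln(numDocs / (docFreq + 1))). */
    static final TermScorer TF_IDF = (freq, docFreq, numDocs, docLen, avgDocLen) ->
            Math.sqrt(freq) * (1.0 + Math.log((double) numDocs / (docFreq + 1)));

    /** BM25 variant with the usual k1 (term saturation) and b (length normalization) knobs. */
    static TermScorer bm25(double k1, double b) {
        return (freq, docFreq, numDocs, docLen, avgDocLen) -> {
            double idf = Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
            double lengthNorm = k1 * (1.0 - b + b * docLen / avgDocLen);
            return idf * freq * (k1 + 1.0) / (freq + lengthNorm);
        };
    }

    public static void main(String[] args) {
        // Same statistics, two interchangeable models.
        TermScorer[] models = { TF_IDF, bm25(1.2, 0.75) };
        for (TermScorer model : models) {
            System.out.println(model.score(3, 10, 1000, 120, 100));
        }
    }
}
```

The design point is only that the caller never needs to know which model it holds, which is what makes swapping scoring functions cheap compared with rewriting the indexing chain.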

> The main problem is that the
> implementation is highly coupled with the indexing chain, so it's not easy to
> write a new one. Do we have a plan to make the indexing chain easier to
> change?
>
> Flexible indexing chain logic, flexible codec formats.

The indexing chain, which is inside IndexWriter and processes each
document into temporary RAM structures and then writes a new segment
via the Codec API, can in fact be changed, but it's extremely expert
and the APIs are not documented (you must read the source code to work
through it).

That said, customizing the chain is rarely really necessary ...
typically existing pluggability (payloads, Sims, custom codec) can
solve most problems.

Mike McCandless

http://blog.mikemccandless.com




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Carsten Schnober
On 13.12.2012 12:27, Michael McCandless wrote:

>> For example:
>>  - part of speech of a token.
>>  - syntactic parse subtree (over a span).
>>  - semantically normalized phrase (to canonical text or ontological code).
>>  - semantic group (of a span).
>>  - coreference link.
> 
> So for example part-of-speech is a per-Token-position attribute.
> 
> Today the easiest way to handle this is to encode these attributes
> into a Payload, which is straightforward (make a custom TokenFilter
> that creates the payload).
> 
> At search time you would then use e.g. PayloadTermQuery to decode the
> Payload and do something with it to alter how the query is being
> scored.

This is a relatively easy example, but how would you deal with, e.g.,
annotations that cover multiple tokens (as in spans), such as chunks,
or relations between tokens (and token spans), as in the coreference
link example given by Stephen above?
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
> Unfortunately, Lucene doesn't properly index
> spans (it records the start position but not the end position), so
> that limits what kind of matching you can do at search time.

If this could be fixed (i.e. indexing the _end_ of a span) I think all
the things that I want to do, and the things that can now be done in
GATE very easily, would be possible using Mike's suggested method.


-Glen




-- 
-
http://zzzoot.blogspot.com/
-




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Wu, Stephen T., Ph.D.
That would be really nice. Full standoff annotations open a lot of doors.

If we had them, though, I'm not sure exactly which of Mike's methods you'd
use?  I thought payloads were completely token-based and could not be
attached to spans regardless.  And the SynonymFilter is really to mimic the
behavior of multiple tokens/span... (though maybe you could add the other
tokens in as "synonyms" and then skip the tokens you added...?).
Mike, is all this stuff possible if we can just index the ends of spans?

stephen


On 12/13/12 9:09 AM, "Glen Newton"  wrote:

>> Unfortunately, Lucene doesn't properly index
>> spans (it records the start position but not the end position), so
>> that limits what kind of matching you can do at search time.
> 
> If this could be fixed (i.e. indexing the _end_ of a span) I think all
> the things that I want to do, and the things that can now be done in
> GATE very easily, would be possible using Mike's suggested method.
> 
> 
> -Glen
> 
> On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
>  wrote:
>> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
>>  wrote:
>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>> that would help me understand the practical issues that would need to be
>>>>> addressed?
>>>>
>>>> Maybe we can make this more concrete: what new attribute are you
>>>> needing to record in the postings and access at search time?
>>> 
>>> For example:
>>>  - part of speech of a token.
>>>  - syntactic parse subtree (over a span).
>>>  - semantically normalized phrase (to canonical text or ontological code).
>>>  - semantic group (of a span).
>>>  - coreference link.
>> 
>> So for example part-of-speech is a per-Token-position attribute.
>> 
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>> 
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
>> 
>> For the span-like attributes (eg a syntactic parse, semantically
>> normalized phrase) I think you'd need to do something like
>> SynonymFilter in your analysis, i.e. insert new tokens at the position
>> where the span started.  Unfortunately, Lucene doesn't properly index
>> spans (it records the start position but not the end position), so
>> that limits what kind of matching you can do at search time.
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> 
> 
> 





Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Lance Norskog

Parts-of-speech is available now, in the indexer.

LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does 
parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an 
Apache project for natural-language processing.


Some parts are in Solr that could be in Lucene.

https://issues.apache.org/jira/browse/lucene-2899

On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
>   - part of speech of a token.
>   - syntactic parse subtree (over a span).
>   - semantically normalized phrase (to canonical text or ontological code).
>   - semantic group (of a span).
>   - coreference link.
>
> stephen







Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
It is not clear this is exactly what is needed/being discussed.

From the issue:
"We are also planning a Tokenizer/TokenFilter that can put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position."

This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.

-Glen

On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog  wrote:
> Parts-of-speech is available now, in the indexer.
>
> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
> project for natural-language processing.
>
> Some parts are in Solr that could be in Lucene.
>
> https://issues.apache.org/jira/browse/lucene-2899
>
>
> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>> that would help me understand the practical issues that would need to be
>>>> addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you
>>> needing to record in the postings and access at search time?
>>
>> For example:
>>   - part of speech of a token.
>>   - syntactic parse subtree (over a span).
>>   - semantically normalized phrase (to canonical text or ontological
>> code).
>>   - semantic group (of a span).
>>   - coreference link.
>>
>> stephen
>>
>>
>>
>



-- 
-
http://zzzoot.blogspot.com/
-




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Lance Norskog
I should not have added that note. The OpenNLP patch gives a concrete
example of adding an annotation to text.


On 12/13/2012 01:54 PM, Glen Newton wrote:

> It is not clear this is exactly what is needed/being discussed.
>
> From the issue:
> "We are also planning a Tokenizer/TokenFilter that can put parts of
> speech as either payloads (PartOfSpeechAttribute?) on a token or at
> the same position."
>
> This adds it to a token, not a span. 'same position' does not suggest
> it also records the end position.
>
> -Glen
>
> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog  wrote:
>>
>> Parts-of-speech is available now, in the indexer.
>>
>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
>> project for natural-language processing.
>>
>> Some parts are in Solr that could be in Lucene.
>>
>> https://issues.apache.org/jira/browse/lucene-2899
>>
>>
>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>
>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>> that would help me understand the practical issues that would need to be
>>>>> addressed?
>>>>
>>>> Maybe we can make this more concrete: what new attribute are you
>>>> needing to record in the postings and access at search time?
>>>
>>> For example:
>>>    - part of speech of a token.
>>>    - syntactic parse subtree (over a span).
>>>    - semantically normalized phrase (to canonical text or ontological code).
>>>    - semantic group (of a span).
>>>    - coreference link.
>>>
>>> stephen












Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
Cool! Sounds great!  :-)

Any pointers to a (Lucene) example that attaches a payload to a
start..end span that is more than one token?

thanks,
-Glen

On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog  wrote:
> I should not have added that note. The Opennlp patch gives a concrete
> example of adding an annotation to text.
>
>
> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>
>> It is not clear this is exactly what is needed/being discussed.
>>
>>  From the issue:
>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>> the same position."
>>
>> This adds it to a token, not a span. 'same position' does not suggest
>> it also records the end position.
>>
>> -Glen
>>
>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog  wrote:
>>>
>>> Parts-of-speech is available now, in the indexer.
>>>
>>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>> Apache
>>> project for natural-language processing.
>>>
>>> Some parts are in Solr that could be in Lucene.
>>>
>>> https://issues.apache.org/jira/browse/lucene-2899
>>>
>>>
>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>
>>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>>> that would help me understand the practical issues that would need to be
>>>>>> addressed?
>>>>>
>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>> needing to record in the postings and access at search time?
>>>>
>>>> For example:
>>>>   - part of speech of a token.
>>>>   - syntactic parse subtree (over a span).
>>>>   - semantically normalized phrase (to canonical text or ontological code).
>>>>   - semantic group (of a span).
>>>>   - coreference link.
>>>>
>>>> stephen



>>
>>
>
>
>



-- 
-
http://zzzoot.blogspot.com/
-




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread SUJIT PAL
Hi Glen,

I don't believe you can attach a single payload to multiple tokens. What I did
for a similar requirement was to combine the tokens into a single "_"-delimited
token and attach the payload to it. For example:

The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs 
down.

Now assume "Big Bad Wolf" and "Three Little Pigs" are spans to which I would
like to attach payloads. I run the tokens through a custom tokenizer that
produces:

The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the 
Three_Little_Pigs$payload2 down.

In my case this makes sense, i.e. I can treat the span as a single unit. Not sure
about your use case.
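
A minimal sketch of the merging step described above, in plain Python rather
than as a Lucene Tokenizer. The function name merge_spans and the
(start, end, payload) annotation format are hypothetical, chosen only for
illustration; spans are assumed to be non-empty and non-overlapping.

```python
from typing import List, Optional, Tuple

# (start, end_exclusive, payload) over token indexes; a hypothetical format.
Span = Tuple[int, int, str]


def merge_spans(tokens: List[str], spans: List[Span]) -> List[Tuple[str, Optional[str]]]:
    """Collapse each annotated span into one "_"-joined token carrying its payload.

    Tokens outside any span are emitted unchanged with no payload.
    """
    by_start = {}
    for start, end, payload in spans:
        if end <= start:
            raise ValueError("spans must cover at least one token")
        by_start[start] = (end, payload)

    out: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(tokens):
        if i in by_start:
            end, payload = by_start[i]
            out.append(("_".join(tokens[i:end]), payload))
            i = end  # skip the tokens that were folded into the merged token
        else:
            out.append((tokens[i], None))
            i += 1
    return out


tokens = "The Big Bad Wolf huffed and puffed".split()
merged = merge_spans(tokens, [(1, 4, "payload1")])
```

The trade-off is the one mentioned above: the span becomes a single unit, so
the individual words inside it are no longer separately searchable unless
they are also indexed some other way.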

HTH
Sujit

On Dec 13, 2012, at 2:08 PM, Glen Newton wrote:

> Cool! Sounds great!  :-)
> 
> Any pointers to a (Lucene) example that attaches a payload to a
> start..end span that is more than one token?
> 
> thanks,
> -Glen
> 
> On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog  wrote:
>> I should not have added that note. The Opennlp patch gives a concrete
>> example of adding an annotation to text.
>> 
>> 
>> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>> 
>>> It is not clear this is exactly what is needed/being discussed.
>>> 
>>> From the issue:
>>> "We are also planning a Tokenizer/TokenFilter that can put parts of
>>> speech as either payloads (PartOfSpeechAttribute?) on a token or at
>>> the same position."
>>> 
>>> This adds it to a token, not a span. 'same position' does not suggest
>>> it also records the end position.
>>> 
>>> -Glen
>>> 
>>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog  wrote:
 
>>>> Parts-of-speech is available now, in the indexer.
>>>>
>>>> LUCENE-2899 adds OpenNLP to the Lucene&Solr codebase. It does
>>>> parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
>>>> Apache project for natural-language processing.
>>>>
>>>> Some parts are in Solr that could be in Lucene.
>>>>
>>>> https://issues.apache.org/jira/browse/lucene-2899
>>>>
>>>>
>>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>>
>>>>>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>>>>>> that would help me understand the practical issues that would need to be
>>>>>>> addressed?
>>>>>>
>>>>>> Maybe we can make this more concrete: what new attribute are you
>>>>>> needing to record in the postings and access at search time?
>>>>>
>>>>> For example:
>>>>>   - part of speech of a token.
>>>>>   - syntactic parse subtree (over a span).
>>>>>   - semantically normalized phrase (to canonical text or ontological code).
>>>>>   - semantic group (of a span).
>>>>>   - coreference link.
>>>>>
>>>>> stephen
> 
> 
> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> -
> http://zzzoot.blogspot.com/
> -
> 
> 





Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Michael McCandless
On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
 wrote:
> On 13.12.2012 12:27, Michael McCandless wrote:
>
>>> For example:
>>>  - part of speech of a token.
>>>  - syntactic parse subtree (over a span).
>>>  - semantically normalized phrase (to canonical text or ontological code).
>>>  - semantic group (of a span).
>>>  - coreference link.
>>
>> So for example part-of-speech is a per-Token-position attribute.
>>
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>>
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
>
> This is a relatively easy example, but how would you deal with e.g.
> annotations that include multiple tokens (as in spans), such as chunks,
> or relations between tokens (and token spans), as in the coreference
> links example given by Stephen above?

I think you'd do something like what SynonymFilter does for
multi-token synonyms.

E.g. a synonym "wireless network" -> "wifi" would insert a new token
("wifi"), overlapped on "wireless".

Lucene doesn't store the end of the span, but if this is really important for
your use case, you could add a payload to that wifi token that would
encode the number of positions that the inserted token spans (2 in
this case), and then the information would be present in the index.

You'd still need to do something custom at read/search time to decode
this end position and do something interesting with it ...
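
As a rough illustration of that idea, here is a toy model in plain Python (not
Lucene code). The Posting tuple, inject_span_token and span_end are
hypothetical names, and the 2-byte big-endian length encoding is just one
arbitrary choice for the payload bytes.

```python
from typing import List, NamedTuple, Optional, Sequence


class Posting(NamedTuple):
    term: str
    position: int             # token position within the document
    payload: Optional[bytes]  # opaque per-position bytes, None if absent


def inject_span_token(tokens: Sequence[str], phrase: Sequence[str], label: str) -> List[Posting]:
    """Emit the original tokens, plus `label` overlapped on the first token of
    each match of `phrase`, with the match length stored in the payload."""
    n = len(phrase)
    out: List[Posting] = []
    for pos, term in enumerate(tokens):
        out.append(Posting(term, pos, None))
        if n and list(tokens[pos:pos + n]) == list(phrase):
            # Same position as the span start; payload = number of positions spanned.
            out.append(Posting(label, pos, n.to_bytes(2, "big")))
    return out


def span_end(posting: Posting) -> int:
    """Decode the exclusive end position; tokens without a payload span 1 position."""
    length = int.from_bytes(posting.payload, "big") if posting.payload else 1
    return posting.position + length


postings = inject_span_token("wireless network is down".split(),
                             ["wireless", "network"], "wifi")
wifi = next(p for p in postings if p.term == "wifi")
```

Here span_end(wifi) is position 0 plus the decoded length 2, i.e. the position
of "is". Nothing in the stock query classes would consult that value, which is
why custom search-time code is still needed.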

Mike McCandless

http://blog.mikemccandless.com




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Michael McCandless
On Thu, Dec 13, 2012 at 10:09 AM, Glen Newton  wrote:
>> Unfortunately, Lucene doesn't properly index
>> spans (it records the start position but not the end position), so
>> that limits what kind of matching you can do at search time.
>
> If this could be fixed (i.e. indexing the _end_ of a span) I think all
> the things that I want to do, and the things that can now be done in
> GATE very easily, would be possible using Mike's suggested method.

What would you use the end of the span for?

For example, do you need to do the equivalent of an end-of-span-aware
PhraseQuery?

I.e., so that if the document is "wireless network is down", and I apply
the synonym "wireless network" -> "wifi" at indexing time, then the
end-span-aware PhraseQuery would match "wifi is down" (unlike today).

If you stuff the end of the span into the payload you'd have to create
a custom variant of PhraseQuery to properly match based on the end
span.
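
To make the difference concrete, here is a small self-contained Python sketch
of exact phrase matching with and without knowledge of the span end. It is a
toy in-memory model, not a PhraseQuery variant; build_index, phrase_matches
and the (start, length) index layout are assumptions made for illustration,
and synonym phrases are assumed to be non-empty.

```python
from typing import Dict, List, Sequence, Set, Tuple

# term -> list of (start_position, length_in_positions). A toy stand-in for
# positions plus a span-length payload, not a Lucene data structure.
Index = Dict[str, List[Tuple[int, int]]]


def build_index(tokens: Sequence[str], synonyms: Dict[Tuple[str, ...], str]) -> Index:
    """Index each token with length 1 and each synonym with its span length."""
    index: Index = {}
    for pos, term in enumerate(tokens):
        index.setdefault(term, []).append((pos, 1))
        for phrase, label in synonyms.items():
            if tuple(tokens[pos:pos + len(phrase)]) == phrase:
                index.setdefault(label, []).append((pos, len(phrase)))
    return index


def phrase_matches(index: Index, phrase: Sequence[str], span_aware: bool = True) -> bool:
    """True if the terms of `phrase` occur back to back in the document.

    With span_aware=False every term is assumed to cover exactly one position,
    which models an index that records only start positions.
    """
    if not phrase:
        return False
    frontier: Set[int] = set()  # positions where the next term must start
    for i, term in enumerate(phrase):
        next_frontier: Set[int] = set()
        for start, length in index.get(term, []):
            if i == 0 or start in frontier:
                next_frontier.add(start + (length if span_aware else 1))
        if not next_frontier:
            return False
        frontier = next_frontier
    return True


index = build_index("wireless network is down".split(),
                    {("wireless", "network"): "wifi"})
aware = phrase_matches(index, ["wifi", "is", "down"], span_aware=True)
unaware = phrase_matches(index, ["wifi", "is", "down"], span_aware=False)
```

In this model the span-aware match succeeds because "is" starts where the
"wifi" span ends, while the start-only match fails because "wifi" is assumed
to end at position 1, where "network" sits.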

Mike McCandless

http://blog.mikemccandless.com




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Carsten Schnober
On 18.12.2012 12:36, Michael McCandless wrote:
> On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
>  wrote:


>> This is a relatively easy example, but how would you deal with e.g.
>> annotations that include multiple tokens (as in spans), such as chunks,
>> or relations between tokens (and token spans), as in the coreference
>> links example given by Stephen above?
> 
> I think you'd do something like what SynonymFilter does for
> multi-token synonyms.
> 
> E.g. a synonym "wireless network" -> "wifi" would insert a new token
> ("wifi"), overlapped on "wireless".
> 
> Lucene doesn't store the end of the span, but if this is really important for
> your use case, you could add a payload to that wifi token that would
> encode the number of positions that the inserted token spans (2 in
> this case), and then the information would be present in the index.
> 
> You'd still need to do something custom at read/search time to decode
> this end position and do something interesting with it ...

Thanks for the pointer!
I'm still wondering whether there is an optimal way to encode
(labelled) relations between tokens or even spans; the latter part would
probably lead back to the synonym-like solution.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-20 Thread Wu, Stephen T., Ph.D.
> If you stuff the end of the span into the payload you'd have to create
> a custom variant of PhraseQuery to properly match based on the end
> span.

How different is this from the functionality already available through
SpanQuery?

stephen





Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-21 Thread Michael McCandless
On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D.
 wrote:
>> If you stuff the end of the span into the payload you'd have to create
>> a custom variant of PhraseQuery to properly match based on the end
>> span.
>
> How different is this from the functionality already available through
> SpanQuery?

Good question!

I think the difference would be index-time (payload encoding span-end
+ new Query) vs search time (SpanQuery)?

Ie, with the former (index-time) you'd have a TokenFilter spotting the
spans and encoding them into the index, and with the latter all
spotting happens at search time?

So net/net I guess (?) the results would be the same, but performance
should be faster if you do it index-time?

Mike McCandless

http://blog.mikemccandless.com
