Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-21 Thread Michael McCandless
On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D.
wu.step...@mayo.edu wrote:
 If you stuff the end of the span into the payload you'd have to create
 a custom variant of PhraseQuery to properly match based on the end
 span.

 How different is this from the functionality already available through
 SpanQuery?

Good question!

I think the difference would be index-time (payload encoding span-end
+ new Query) vs search time (SpanQuery)?

Ie, with the former (index-time) you'd have a TokenFilter spotting the
spans and encoding them into the index, and with the latter all
spotting happens at search time?

So net/net I guess (?) the results would be the same, but performance
should be faster if you do it index-time?

Mike McCandless

http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-20 Thread Wu, Stephen T., Ph.D.
 If you stuff the end of the span into the payload you'd have to create
 a custom variant of PhraseQuery to properly match based on the end
 span.

How different is this from the functionality already available through
SpanQuery?

stephen





Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Michael McCandless
On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
schno...@ids-mannheim.de wrote:
 On 13.12.2012 at 12:27, Michael McCandless wrote:

 For example:
  - part of speech of a token.
  - syntactic parse subtree (over a span).
  - semantically normalized phrase (to canonical text or ontological code).
  - semantic group (of a span).
  - coreference link.

 So for example part-of-speech is a per-Token-position attribute.

 Today the easiest way to handle this is to encode these attributes
 into a Payload, which is straightforward (make a custom TokenFilter
 that creates the payload).

 At search time you would then use e.g. PayloadTermQuery to decode the
 Payload and do something with it to alter how the query is being
 scored.

 This is a relatively easy example, but how would you deal with e.g.
 annotations that include multiple tokens (as in spans), such as chunks,
 or relations between tokens (and token spans), as in the coreference
 links example given by Stephen above?

I think you'd do something like what SynonymFilter does for
multi-token synonyms.

Eg a synonym for "wireless network" -> "wifi" would insert a new token
("wifi"), overlapped on "wireless".

Lucene doesn't store the end span, but if this is really important for
your use case, you could add a payload to that wifi token that would
encode the number of positions that the inserted token spans (2 in
this case), and then the information would be present in the index.

You'd still need to do something custom at read/search time to decode
this end position and do something interesting with it ...
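
A minimal sketch of the payload idea above, in plain Java rather than
against Lucene's API. It assumes the span length is stored as a 4-byte
big-endian integer; the class and method names (SpanLengthPayload, encode,
decode, endPosition) are hypothetical, and the TokenFilter that would
attach these bytes to the inserted token is left out.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: packs a span length (in token positions) into payload bytes. */
final class SpanLengthPayload {
    private SpanLengthPayload() {}

    /** Encodes the number of positions an inserted token covers as 4 big-endian bytes. */
    static byte[] encode(int positionLength) {
        if (positionLength < 1) {
            throw new IllegalArgumentException("a span must cover at least one position");
        }
        return ByteBuffer.allocate(Integer.BYTES).putInt(positionLength).array();
    }

    /** Decodes the span length written by encode(). */
    static int decode(byte[] payload) {
        if (payload == null || payload.length != Integer.BYTES) {
            throw new IllegalArgumentException("expected a 4-byte payload");
        }
        return ByteBuffer.wrap(payload).getInt();
    }

    /** End position (exclusive) of a span that starts at startPosition. */
    static int endPosition(int startPosition, byte[] payload) {
        return startPosition + decode(payload);
    }

    public static void main(String[] args) {
        // "wifi" is inserted at position 0 and covers "wireless network": 2 positions.
        byte[] payload = encode(2);
        System.out.println(endPosition(0, payload)); // prints 2
    }
}
```

A fixed 4-byte encoding keeps decoding trivial; a variable-length encoding
would be smaller for typical span lengths, at the cost of a slightly more
involved decoder.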

Mike McCandless

http://blog.mikemccandless.com




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Michael McCandless
On Thu, Dec 13, 2012 at 10:09 AM, Glen Newton glen.new...@gmail.com wrote:
Unfortunately, Lucene doesn't properly index
 spans (it records the start position but not the end position), so
 that limits what kind of matching you can do at search time.

 If this could be fixed (i.e. indexing the _end_ of a span) I think all
 the things that I want to do, and the things that can now be done in
 GATE very easily, would be possible using Mike's suggested method.

What would you use the end of the span for?

For example, do you need to do the equivalent of an end-of-span-aware
PhraseQuery?

Ie, so that if the document is "wireless network is down", and I apply
the synonym "wireless network" -> "wifi" at indexing time, then the
end-span-aware PhraseQuery would match "wifi is down" (unlike today).

If you stuff the end of the span into the payload you'd have to create
a custom variant of PhraseQuery to properly match based on the end
span.
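
A toy illustration of what such an end-span-aware phrase match could look
like, in plain Java and independent of Lucene's actual PhraseQuery. It
assumes each indexed token carries a start position and a span length
(e.g. decoded from a payload); EndAwarePhraseMatcher, Token and sampleDoc
are hypothetical names made up for this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Toy phrase matcher over tokens that know how many positions they span. */
final class EndAwarePhraseMatcher {
    private EndAwarePhraseMatcher() {}

    /** One indexed token: its term, start position, and number of positions covered. */
    static final class Token {
        final String term;
        final int start;
        final int length;

        Token(String term, int start, int length) {
            this.term = term;
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Returns true if the phrase terms form a chain in which each term starts
     * where the previous one ends. With useSpanLength == false every token is
     * treated as one position wide, i.e. only start positions are known.
     */
    static boolean matches(List<Token> doc, List<String> phrase, boolean useSpanLength) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (Token t : doc) {
            if (t.term.equals(phrase.get(0))
                    && matchesFrom(doc, phrase, 1, end(t, useSpanLength), useSpanLength)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesFrom(List<Token> doc, List<String> phrase, int index,
                                       int expectedStart, boolean useSpanLength) {
        if (index == phrase.size()) {
            return true; // every phrase term has been placed
        }
        for (Token t : doc) {
            if (t.start == expectedStart && t.term.equals(phrase.get(index))
                    && matchesFrom(doc, phrase, index + 1, end(t, useSpanLength), useSpanLength)) {
                return true;
            }
        }
        return false;
    }

    private static int end(Token t, boolean useSpanLength) {
        return t.start + (useSpanLength ? t.length : 1);
    }

    /** "wireless network is down" with the synonym "wifi" overlapped on "wireless". */
    static List<Token> sampleDoc() {
        return Arrays.asList(
                new Token("wireless", 0, 1),
                new Token("wifi", 0, 2), // inserted synonym spanning two positions
                new Token("network", 1, 1),
                new Token("is", 2, 1),
                new Token("down", 3, 1));
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("wifi", "is", "down");
        System.out.println(matches(sampleDoc(), phrase, true));  // prints true
        System.out.println(matches(sampleDoc(), phrase, false)); // prints false
    }
}
```

With start positions only, "is" would have to sit at position 1, where the
document has "network", so the phrase fails; with the span length it is
expected at position 2 and the phrase matches.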

Mike McCandless

http://blog.mikemccandless.com




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-18 Thread Carsten Schnober
On 18.12.2012 at 12:36, Michael McCandless wrote:
 On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober
 schno...@ids-mannheim.de wrote:


 This is a relatively easy example, but how would you deal with e.g.
 annotations that include multiple tokens (as in spans), such as chunks,
 or relations between tokens (and token spans), as in the coreference
 links example given by Stephen above?
 
 I think you'd do something like what SynonymFilter does for
 multi-token synonyms.
 
 Eg a synonym for "wireless network" -> "wifi" would insert a new token
 ("wifi"), overlapped on "wireless".
 
 Lucene doesn't store the end span, but if this is really important for
 your use case, you could add a payload to that wifi token that would
 encode the number of positions that the inserted token spans (2 in
 this case), and then the information would be present in the index.
 
 You'd still need to do something custom at read/search time to decode
 this end position and do something interesting with it ...

Thanks for the pointer!
I'm still wondering whether there is an optimal way to encode
(labelled) relations between tokens or even spans; the latter part would
probably lead back to the synonym-like solution.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Michael McCandless
On Wed, Dec 12, 2012 at 9:08 PM, lukai lukai1...@gmail.com wrote:
 Do we have any plan to decouple the index process?

 Lucene was designed for search, but judging by the questions people ask in
 this thread, it is sometimes used beyond search functionality. For example,
 we might want to customize our scoring function based on payloads. Sometimes
 I don't need TF/IDF information, because we can pre-calculate features and
 store them in the system, but I am still required to store the extra TF/IDF
 information. And sometimes, I think, we want to load the whole postings into
 memory to speed up performance. In those cases, we really want to customize
 the functionality/process of the inverted index.

Much of this can already be done with Lucene.  Eg, plug in your own
Similarity to get custom scoring (and we already have a bunch of
standard models ... TF/IDF (default), BM25, DFR, language models,
etc.).  Use MemoryPostingsFormat to pull everything into RAM.
Customize other parts of the index using your own Codec.

 The main problem is that the
 implementation is tightly coupled to the index chain. It's not easy to
 rewrite a new one. Is there a plan to make the index chain
 easier to change?

 Flexible index chain logic, flexible codecs format.

The indexing chain, which is inside IndexWriter and processes each
document into temporary RAM structures and then writes a new segment
via the Codec API, can in fact be changed, but it's extremely expert
and the APIs are not documented (you must read the source code to work
through it).

That said, customizing the chain is rarely really necessary ...
typically existing pluggability (payloads, Sims, custom codec) can
solve most problems.

Mike McCandless

http://blog.mikemccandless.com




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Carsten Schnober
On 13.12.2012 at 12:27, Michael McCandless wrote:

 For example:
  - part of speech of a token.
  - syntactic parse subtree (over a span).
  - semantically normalized phrase (to canonical text or ontological code).
  - semantic group (of a span).
  - coreference link.
 
 So for example part-of-speech is a per-Token-position attribute.
 
 Today the easiest way to handle this is to encode these attributes
 into a Payload, which is straightforward (make a custom TokenFilter
 that creates the payload).
 
 At search time you would then use e.g. PayloadTermQuery to decode the
 Payload and do something with it to alter how the query is being
 scored.

This is a relatively easy example, but how would you deal with e.g.
annotations that include multiple tokens (as in spans), such as chunks,
or relations between tokens (and token spans), as in the coreference
links example given by Stephen above?
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
Unfortunately, Lucene doesn't properly index
spans (it records the start position but not the end position), so
that limits what kind of matching you can do at search time.

If this could be fixed (i.e. indexing the _end_ of a span) I think all
the things that I want to do, and the things that can now be done in
GATE very easily, would be possible using Mike's suggested method.


-Glen

On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
 wu.step...@mayo.edu wrote:
 Is there any (preliminary) code checked in somewhere that I can look at,
 that would help me understand the practical issues that would need to be
 addressed?

 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?

 For example:
  - part of speech of a token.
  - syntactic parse subtree (over a span).
  - semantically normalized phrase (to canonical text or ontological code).
  - semantic group (of a span).
  - coreference link.

 So for example part-of-speech is a per-Token-position attribute.

 Today the easiest way to handle this is to encode these attributes
 into a Payload, which is straightforward (make a custom TokenFilter
 that creates the payload).

 At search time you would then use e.g. PayloadTermQuery to decode the
 Payload and do something with it to alter how the query is being
 scored.

 For the span-like attributes (eg a syntactic parse, semantically
 normalized phrase) I think you'd need to do something like
 SynonymFilter in your analysis, i.e. insert new tokens at the position
 where the span started.  Unfortunately, Lucene doesn't properly index
 spans (it records the start position but not the end position), so
 that limits what kind of matching you can do at search time.

 Mike McCandless

 http://blog.mikemccandless.com





-- 
-
http://zzzoot.blogspot.com/
-




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Wu, Stephen T., Ph.D.
That would be really nice. Full standoff annotations open a lot of doors.

If we had them, though, I'm not sure exactly which of Mike's methods you'd
use?  I thought payloads were completely token-based and could not be
attached to spans regardless.  And the SynonymFilter is really to mimic the
behavior of multiple tokens/span... (though maybe you could add the other
tokens in as synonyms and then skip the tokens you added...?).
Mike, is all this stuff possible if we can just index the ends of spans?

stephen


On 12/13/12 9:09 AM, Glen Newton glen.new...@gmail.com wrote:

 Unfortunately, Lucene doesn't properly index
 spans (it records the start position but not the end position), so
 that limits what kind of matching you can do at search time.
 
 If this could be fixed (i.e. indexing the _end_ of a span) I think all
 the things that I want to do, and the things that can now be done in
 GATE very easily, would be possible using Mike's suggested method.
 
 
 -Glen
 
 On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless
 luc...@mikemccandless.com wrote:
 On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
 wu.step...@mayo.edu wrote:
 Is there any (preliminary) code checked in somewhere that I can look at,
 that would help me understand the practical issues that would need to be
 addressed?
 
 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?
 
 For example:
  - part of speech of a token.
  - syntactic parse subtree (over a span).
  - semantically normalized phrase (to canonical text or ontological code).
  - semantic group (of a span).
  - coreference link.
 
 So for example part-of-speech is a per-Token-position attribute.
 
 Today the easiest way to handle this is to encode these attributes
 into a Payload, which is straightforward (make a custom TokenFilter
 that creates the payload).
 
 At search time you would then use e.g. PayloadTermQuery to decode the
 Payload and do something with it to alter how the query is being
 scored.
 
 For the span-like attributes (eg a syntactic parse, semantically
 normalized phrase) I think you'd need to do something like
 SynonymFilter in your analysis, i.e. insert new tokens at the position
 where the span started.  Unfortunately, Lucene doesn't properly index
 spans (it records the start position but not the end position), so
 that limits what kind of matching you can do at search time.
 
 Mike McCandless
 
 http://blog.mikemccandless.com
 



Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Lance Norskog

Parts-of-speech is available now, in the indexer.

LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does
parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an 
Apache project for natural-language processing.


Some parts are in Solr that could be in Lucene.

https://issues.apache.org/jira/browse/lucene-2899

On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

Is there any (preliminary) code checked in somewhere that I can look at,
that would help me understand the practical issues that would need to be
addressed?

Maybe we can make this more concrete: what new attribute are you
needing to record in the postings and access at search time?

For example:
  - part of speech of a token.
  - syntactic parse subtree (over a span).
  - semantically normalized phrase (to canonical text or ontological code).
  - semantic group (of a span).
  - coreference link.

stephen







Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
It is not clear this is exactly what is needed/being discussed.

From the issue:
We are also planning a Tokenizer/TokenFilter that can put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position.

This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.

-Glen

On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:
 Parts-of-speech is available now, in the indexer.

 LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does
 parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
 project for natural-language processing.

 Some parts are in Solr that could be in Lucene.

 https://issues.apache.org/jira/browse/lucene-2899


 On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

 Is there any (preliminary) code checked in somewhere that I can look at,
 that would help me understand the practical issues that would need to be
 addressed?

 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?

 For example:
   - part of speech of a token.
   - syntactic parse subtree (over a span).
   - semantically normalized phrase (to canonical text or ontological
 code).
   - semantic group (of a span).
   - coreference link.

 stephen







-- 
-
http://zzzoot.blogspot.com/
-




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Lance Norskog
I should not have added that note. The OpenNLP patch gives a concrete
example of adding an annotation to text.


On 12/13/2012 01:54 PM, Glen Newton wrote:

It is not clear this is exactly what is needed/being discussed.

 From the issue:
We are also planning a Tokenizer/TokenFilter that can put parts of
speech as either payloads (PartOfSpeechAttribute?) on a token or at
the same position.

This adds it to a token, not a span. 'same position' does not suggest
it also records the end position.

-Glen

On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:

Parts-of-speech is available now, in the indexer.

LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does
parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an Apache
project for natural-language processing.

Some parts are in Solr that could be in Lucene.

https://issues.apache.org/jira/browse/lucene-2899


On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

Is there any (preliminary) code checked in somewhere that I can look at,
that would help me understand the practical issues that would need to be
addressed?

Maybe we can make this more concrete: what new attribute are you
needing to record in the postings and access at search time?

For example:
   - part of speech of a token.
   - syntactic parse subtree (over a span).
   - semantically normalized phrase (to canonical text or ontological
code).
   - semantic group (of a span).
   - coreference link.

stephen





Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread Glen Newton
Cool! Sounds great!  :-)

Any pointers to a (Lucene) example that attaches a payload to a
start..end span that is more than one token?

thanks,
-Glen

On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog goks...@gmail.com wrote:
 I should not have added that note. The OpenNLP patch gives a concrete
 example of adding an annotation to text.


 On 12/13/2012 01:54 PM, Glen Newton wrote:

 It is not clear this is exactly what is needed/being discussed.

  From the issue:
 We are also planning a Tokenizer/TokenFilter that can put parts of
 speech as either payloads (PartOfSpeechAttribute?) on a token or at
 the same position.

 This adds it to a token, not a span. 'same position' does not suggest
 it also records the end position.

 -Glen

 On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:

 Parts-of-speech is available now, in the indexer.

 LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does
 parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
 Apache
 project for natural-language processing.

 Some parts are in Solr that could be in Lucene.

 https://issues.apache.org/jira/browse/lucene-2899


 On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:

 Is there any (preliminary) code checked in somewhere that I can look
 at,
 that would help me understand the practical issues that would need to
 be
 addressed?

 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?

 For example:
- part of speech of a token.
- syntactic parse subtree (over a span).
- semantically normalized phrase (to canonical text or ontological
 code).
- semantic group (of a span).
- coreference link.

 stephen






-- 
-
http://zzzoot.blogspot.com/
-




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread SUJIT PAL
Hi Glen,

I don't believe you can attach a single payload to multiple tokens. What I did
for a similar requirement was to combine the tokens into a single
underscore-delimited token and attach the payload to it. For example:

The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs 
down.

Now assume "Big Bad Wolf" and "Three Little Pigs" are spans to which I would
like to attach payloads. I run the tokens through a custom tokenizer that
produces:

The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the 
Three_Little_Pigs$payload2 down.

In my case this makes sense, ie I can treat the span as a single unit. Not sure 
about your use case.
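
A rough sketch of that joining step in plain Java, not the actual custom
tokenizer. It assumes the spans are already known as non-overlapping
(start, end) token ranges sorted by start, and it reuses the "$payload"
notation from the example above purely for display; SpanJoiner and Span
are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Collapses multi-token spans into single underscore-delimited tokens. */
final class SpanJoiner {
    private SpanJoiner() {}

    /** A token range [start, end) plus the payload to attach to the joined token. */
    static final class Span {
        final int start;
        final int end;
        final String payload;

        Span(int start, int end, String payload) {
            if (start < 0 || end <= start) {
                throw new IllegalArgumentException("span must cover at least one token");
            }
            this.start = start;
            this.end = end;
            this.payload = payload;
        }
    }

    /** Spans must be sorted by start and must not overlap. */
    static List<String> join(List<String> tokens, List<Span> spans) {
        List<String> out = new ArrayList<>();
        int next = 0; // index of the next span to apply
        int i = 0;
        while (i < tokens.size()) {
            if (next < spans.size() && spans.get(next).start == i) {
                Span span = spans.get(next++);
                // Join the covered tokens and mark the payload with '$' for display.
                out.add(String.join("_", tokens.subList(span.start, span.end)) + "$" + span.payload);
                i = span.end;
            } else {
                out.add(tokens.get(i++));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs down"
                        .split(" "));
        List<Span> spans = Arrays.asList(new Span(1, 4, "payload1"), new Span(13, 16, "payload2"));
        System.out.println(String.join(" ", join(tokens, spans)));
    }
}
```

The trade-off is the one noted above: the individual words inside a joined
span are no longer separate tokens, so this only fits when the span can be
treated as a single unit.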

HTH
Sujit

On Dec 13, 2012, at 2:08 PM, Glen Newton wrote:

 Cool! Sounds great!  :-)
 
 Any pointers to a (Lucene) example that attaches a payload to a
 start..end span that is more than one token?
 
 thanks,
 -Glen
 
 On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog goks...@gmail.com wrote:
 I should not have added that note. The OpenNLP patch gives a concrete
 example of adding an annotation to text.
 
 
 On 12/13/2012 01:54 PM, Glen Newton wrote:
 
 It is not clear this is exactly what is needed/being discussed.
 
 From the issue:
 We are also planning a Tokenizer/TokenFilter that can put parts of
 speech as either payloads (PartOfSpeechAttribute?) on a token or at
 the same position.
 
 This adds it to a token, not a span. 'same position' does not suggest
 it also records the end position.
 
 -Glen
 
 On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:
 
 Parts-of-speech is available now, in the indexer.
 
 LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does
 parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
 Apache
 project for natural-language processing.
 
 Some parts are in Solr that could be in Lucene.
 
 https://issues.apache.org/jira/browse/lucene-2899
 
 
 On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
 
 Is there any (preliminary) code checked in somewhere that I can look
 at,
 that would help me understand the practical issues that would need to
 be
 addressed?
 
 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?
 
 For example:
   - part of speech of a token.
   - syntactic parse subtree (over a span).
   - semantically normalized phrase (to canonical text or ontological
 code).
   - semantic group (of a span).
   - coreference link.
 
 stephen
 
 
 
 
 
 
 -- 
 -
 http://zzzoot.blogspot.com/
 -
 



Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Wu, Stephen T., Ph.D.
 Is there any (preliminary) code checked in somewhere that I can look at,
 that would help me understand the practical issues that would need to be
 addressed?
 
 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?

For example: 
 - part of speech of a token.
 - syntactic parse subtree (over a span).
 - semantically normalized phrase (to canonical text or ontological code).
 - semantic group (of a span).
 - coreference link.

stephen





Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread Glen Newton
+10

These are the kind of things you can do in GATE[1] using annotations[2].
A VERY useful feature.

-Glen

[1]http://gate.ac.uk
[2]http://gate.ac.uk/wiki/jape-repository/annotations.html

On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D.
wu.step...@mayo.edu wrote:
 Is there any (preliminary) code checked in somewhere that I can look at,
 that would help me understand the practical issues that would need to be
 addressed?

 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?

 For example:
  - part of speech of a token.
  - syntactic parse subtree (over a span).
  - semantically normalized phrase (to canonical text or ontological code).
  - semantic group (of a span).
  - coreference link.

 stephen






-- 
-
http://zzzoot.blogspot.com/
-




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-12 Thread lukai
Do we have any plan to decouple the index process?

Lucene was designed for search, but judging by the questions people ask in
this thread, it is sometimes used beyond search functionality. For example,
we might want to customize our scoring function based on payloads. Sometimes
I don't need TF/IDF information, because we can pre-calculate features and
store them in the system, but I am still required to store the extra TF/IDF
information. And sometimes, I think, we want to load the whole postings into
memory to speed up performance. In those cases, we really want to customize
the functionality/process of the inverted index. The main problem is that the
implementation is tightly coupled to the index chain. It's not easy to
rewrite a new one. Is there a plan to make the index chain easier to
change?

Flexible index chain logic, flexible codecs format.

Thanks,



On Fri, Nov 30, 2012 at 10:02 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Fri, Nov 30, 2012 at 12:25 PM, Wu, Stephen T., Ph.D.
 wu.step...@mayo.edu wrote:
  Is there any (preliminary) code checked in somewhere that I can look at,
  that would help me understand the practical issues that would need to be
  addressed?
 
  If I understand you correctly, it's a little different from what's happening
  in your blog posts:

  http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html

  http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
  Those posts deal with making your own codec, but not about changing what's
  stored in the postings?  I guess I misunderstood "postings format" before.

 I don't know of any examples of adding an entirely new attribute to
 the postings, except via payloads.

 All the examples we have are of Codecs/PostingsFormats/etc. storing
 all the usual attributes (term & its stats (docFreq/totalTermFreq),
 doc, freq, position, offsets, payload) in interesting ways.

 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?

 Mike McCandless

 http://blog.mikemccandless.com





Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Johannes.Lichtenberger

On 11/28/2012 01:11 AM, Michael McCandless wrote:

Flexible indexing is the ability to make your own codec, which
controls the reading and writing of all index parts (postings, stored
fields, term vectors, deleted docs, etc.).

So for example if you want to store some postings as a bit set instead
of the block format that's the default coming up in 4.1, that's easy
to do.

But what is less easy (as I described below) is changing what is
actually stored in the postings, eg adding a new per-position
attribute.

The original goal was to allow arbitrary attributes beyond the known
docs/freqs/positions/offsets that Lucene supports today, so that you
could easily make new application-dependent per-term, per-doc,
per-position things, pull them from the analyzer, save them to the
index, and access them from an IndexReader / query, but while some
APIs do expose this, it's not very well explored yet (eg, you'd have
to make a custom indexing chain to get the attributes through
IndexWriter down to your codec).  It would be great to make progress
making this easier, so ideas are very welcome :)


Regarding my question/thread, is it also possible to change the backend
system? I'd like to use Lucene for a versioned DBMS, so I would need
the ability to serialize/deserialize the bytes in my backend, where
keys/values are stored in pages (for instance in an upcoming B+-tree, or
in simple unordered pages via a record-ID/record mapping). But since no
one has suggested anything so far, and I also asked about a year ago,
after implementing the B+-tree I will probably have to implement my own
data structure and parser/tokenizer/stemmer... :-(


kind regards,
Johannes





Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Jack Krupansky
I will probably have to implement my own data structure and
parser/tokenizer/stemmer


Why? I mean, I think the point of the Lucene architecture is that the codec 
level is completely independent of the analysis level.


The end result of analysis is a value to be stored from the application 
perspective, a logical value so to speak, but NOT the bit sequence, the 
physical value so to speak, that the codec will actually store.


So, go ahead and have your own codec that does whatever it wants with 
values, but the input for storage and query should be the output of a 
standard Lucene analyzer.
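
To make the logical/physical distinction concrete, here is a minimal
sketch in plain Java. It uses no Lucene classes, and every name in it is
made up for illustration. The same logical postings list (a sorted array
of doc IDs) is written in two different physical forms, a bit set and a
delta/variable-length-int byte stream, and both decode back to the same
logical values. Lucene's real postings formats are more involved; this
only shows why the analysis side does not need to know which encoding
the codec picks.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Illustration only: hypothetical names, no Lucene dependency.
class PostingsEncodingSketch {

    // Physical form 1: one bit per document ID.
    static BitSet toBitSet(int[] sortedDocIds) {
        BitSet bits = new BitSet();
        for (int id : sortedDocIds) {
            bits.set(id);
        }
        return bits;
    }

    static int[] fromBitSet(BitSet bits) {
        return bits.stream().toArray(); // set bits, in ascending order
    }

    // Physical form 2: gaps between consecutive IDs, each gap written as a
    // variable-length int (7 data bits per byte, high bit = "more follows").
    static byte[] toDeltaVInts(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedDocIds) {
            int gap = id - previous;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
            previous = id;
        }
        return out.toByteArray();
    }

    static int[] fromDeltaVInts(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int previous = 0;
        int i = 0;
        while (i < bytes.length) {
            int gap = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[i++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;
            ids.add(previous);
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 200, 1000};
        // Both physical forms round-trip to the same logical list.
        System.out.println(Arrays.equals(docIds, fromBitSet(toBitSet(docIds))));
        System.out.println(Arrays.equals(docIds, fromDeltaVInts(toDeltaVInts(docIds))));
    }
}
```

Which form is smaller depends on how dense the doc IDs are, and that is
the kind of trade-off a codec can make without the analyzer noticing.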


-- Jack Krupansky




Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Wu, Stephen T., Ph.D.
Is there any (preliminary) code checked in somewhere that I can look at,
that would help me understand the practical issues that would need to be
addressed?

If I understand you correctly, it's a little different from what's happening
in your blog posts:
http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
Those posts deal with making your own codec, but not with changing what's
stored in the postings?  I guess I misunderstood postings format before.

stephen


Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-30 Thread Michael McCandless
On Fri, Nov 30, 2012 at 12:25 PM, Wu, Stephen T., Ph.D.
wu.step...@mayo.edu wrote:
 Is there any (preliminary) code checked in somewhere that I can look at,
 that would help me understand the practical issues that would need to be
 addressed?

 If I understand you correctly, it's a little different from what's happening
 in your blog posts:
 http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
 http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
 Those posts deal with making your own codec, but not with changing what's
 stored in the postings?  I guess I misunderstood postings format before.

I don't know of any examples of adding an entirely new attribute to
the postings, except via payloads.

All the examples we have are of Codecs/PostingsFormats/etc. storing
all the usual attributes (term and its stats (docFreq/totalTermFreq),
doc, freq, position, offsets, payload) in interesting ways.

Maybe we can make this more concrete: what new attribute are you
needing to record in the postings and access at search time?

Mike McCandless

http://blog.mikemccandless.com




What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-27 Thread Wu, Stephen T., Ph.D.
Following up on a previous question...
What is flexible indexing in Lucene 4.0?  We assumed it was the ability to
easily make new postings formats/codecs -- but a response below says that
would be tricky?

stephen


On 11/27/12 11:48 AM, David Causse dcau...@spotter.com wrote:

 Hi,
 
 We use payloads, but we can't use the whole Lucene API.
 For example, we use them for relation queries such as:
 
 @quote(@speaker(obama) @discourse(health))
 
 This searches for all documents that contain a quote by Obama talking
 about health.
 We encode linguistic information (standoff annotations) inside payloads
 and use a custom search API to query the index.
 I didn't find a convenient way to attach my code to the Lucene
 Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole
 Query stack.
 In short, if you want to use payloads for more than boosting a term,
 chances are that you'll need to rewrite a big part of the query stack.
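
For readers unfamiliar with standoff annotations, the following is a
rough sketch of one possible reading of the query above: a "quote"
annotation that contains both a speaker=obama and a discourse=health
annotation, where each annotation covers a range of token positions. It
is plain Java with hypothetical names throughout, it is not David's
implementation, and it ignores how the annotations would be read back
from payloads at search time.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical names, no Lucene dependency.
class RelationQuerySketch {

    // A standoff annotation over the token positions [start, end).
    static final class Annotation {
        final String type;
        final String value;
        final int start;
        final int end;

        Annotation(String type, String value, int start, int end) {
            this.type = type;
            this.value = value;
            this.start = start;
            this.end = end;
        }

        boolean contains(Annotation other) {
            return start <= other.start && other.end <= end;
        }
    }

    // True if 'outer' contains some other annotation with this type and value.
    static boolean hasInside(Annotation outer, List<Annotation> all,
                             String type, String value) {
        for (Annotation a : all) {
            if (a != outer && a.type.equals(type) && a.value.equals(value)
                    && outer.contains(a)) {
                return true;
            }
        }
        return false;
    }

    // Containment reading of @quote(@speaker(s) @discourse(d)).
    static boolean matches(List<Annotation> doc, String speaker, String discourse) {
        for (Annotation q : doc) {
            if (q.type.equals("quote")
                    && hasInside(q, doc, "speaker", speaker)
                    && hasInside(q, doc, "discourse", discourse)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Annotation> doc = Arrays.asList(
                new Annotation("quote", "", 10, 30),
                new Annotation("speaker", "obama", 10, 11),
                new Annotation("discourse", "health", 15, 20));
        System.out.println(matches(doc, "obama", "health"));
    }
}
```

The matching itself is simple; the hard part described above is getting
logic like this to run inside Lucene's Query/Weight/Scorer machinery.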
 
 
 Le 27/11/2012 16:59, Wu, Stephen T., Ph.D. a écrit :
 I think we're looking at doing something related.  I haven't explored the
 Enums or know how to make a postings codec... But what is flexible
 indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
 
 We're trying to incorporate attributes onto terms/spans in indexes.  We'd
 also like to try out some interesting ways to score things that go beyond
 just tokens.
 
 We were considering using Attributes instead of Payloads, because it seems
 like using Payloads ties you to a particular kind of scoring -- just a
 weight on a token.  Can Payloads be used for more general scoring functions?
 E.g., considering a span of text alongside multiple Payloads?
 
 Does it make sense to move outside of Payloads here?
 
 Thanks!
 
 stephen
 
 
 
 
 On 11/19/12 8:14 AM, Michael McCandless luc...@mikemccandless.com wrote:
 
 A new postings format would be tricky because you have new attributes
 you want to index.
 
 The DocsAndPositionsEnum does have an attributes source, but this is
 not well explored, and there are known problems (they can't be easily
 merged in the composite reader case).
 
 So that's why I suggested packing your information into a payload ...
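
As a rough idea of what "packing your information into a payload" could
look like, here is a minimal sketch that packs the extra fields from the
TestPostingItem class quoted below (score, segId, timeStamp) into a
fixed 16-byte array and reads them back. Position and offsets are left
out because Lucene already records those. The class name, method names
and byte layout are assumptions made for illustration; in Lucene 4.0 the
resulting bytes would presumably be attached per position through
PayloadAttribute in a TokenFilter, which is not shown here.

```java
import java.nio.ByteBuffer;

// Illustration only: hypothetical layout, no Lucene dependency.
class PayloadPackingSketch {

    // float score (4 bytes) + int segId (4 bytes) + long timeStamp (8 bytes)
    static final int PAYLOAD_BYTES = Float.BYTES + Integer.BYTES + Long.BYTES;

    // Encode the three fields in a fixed order, big-endian (ByteBuffer default).
    static byte[] encode(float score, int segId, long timeStamp) {
        return ByteBuffer.allocate(PAYLOAD_BYTES)
                .putFloat(score)
                .putInt(segId)
                .putLong(timeStamp)
                .array();
    }

    // Decode each field from its fixed byte offset.
    static float score(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat(0);
    }

    static int segId(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt(Float.BYTES);
    }

    static long timeStamp(byte[] payload) {
        return ByteBuffer.wrap(payload).getLong(Float.BYTES + Integer.BYTES);
    }

    public static void main(String[] args) {
        byte[] payload = encode(0.5f, 42, 1353288780000L);
        System.out.println(payload.length + " bytes: score=" + score(payload)
                + " segId=" + segId(payload) + " timeStamp=" + timeStamp(payload));
    }
}
```

A fixed layout keeps decoding cheap at search time; a variable-length
encoding would save space at the cost of a slightly more complex reader.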
 
 Mike McCandless
 
 http://blog.mikemccandless.com
 
 On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy wuqiu@qq.com wrote:
 thx, Mike.
 About the 3rd question: is encoding them all into the payload better than
 a new postings format with the codec?
 I mean replacing the original posting item (position, startOffset, endOffset,
 payload) with my own inverted item such as
 class TestPostingItem
 {
  int termId;
  long startOffset;
  long endOffset;
  float score;
  int segId;
  long timeStamp;
 }
 ?
 
 
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 



Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-27 Thread Michael McCandless
Flexible indexing is the ability to make your own codec, which
controls the reading and writing of all index parts (postings, stored
fields, term vectors, deleted docs, etc.).

So for example if you want to store some postings as a bit set instead
of the block format that's the default coming up in 4.1, that's easy
to do.

But what is less easy (as I described below) is changing what is
actually stored in the postings, eg adding a new per-position
attribute.

The original goal was to allow arbitrary attributes beyond the known
docs/freqs/positions/offsets that Lucene supports today, so that you
could easily make new application-dependent per-term, per-doc,
per-position things, pull them from the analyzer, save them to the
index, and access them from an IndexReader / query, but while some
APIs do expose this, it's not very well explored yet (eg, you'd have
to make a custom indexing chain to get the attributes through
IndexWriter down to your codec).  It would be great to make progress
making this easier, so ideas are very welcome :)

Mike McCandless

http://blog.mikemccandless.com
