Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
On Thu, Dec 20, 2012 at 3:54 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote:
>> If you stuff the end of the span into the payload you'd have to create a custom variant of PhraseQuery to properly match based on the end span.
>
> How different is this from the functionality already available through SpanQuery?

Good question! I think the difference would be index time (a payload encoding the span end, plus a new Query) versus search time (SpanQuery). That is, with the former (index time) you'd have a TokenFilter spotting the spans and encoding them into the index, and with the latter all the spotting happens at search time.

So net/net I guess (?) the results would be the same, but performance should be faster if you do it at index time.

Mike McCandless
http://blog.mikemccandless.com
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
> If you stuff the end of the span into the payload you'd have to create a custom variant of PhraseQuery to properly match based on the end span.

How different is this from the functionality already available through SpanQuery?

stephen
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober schno...@ids-mannheim.de wrote:
> On 13.12.2012 at 12:27, Michael McCandless wrote:
>>> For example:
>>> - part of speech of a token.
>>> - syntactic parse subtree (over a span).
>>> - semantically normalized phrase (to canonical text or ontological code).
>>> - semantic group (of a span).
>>> - coreference link.
>>
>> So for example part-of-speech is a per-Token-position attribute. Today the easiest way to handle this is to encode these attributes into a Payload, which is straightforward (make a custom TokenFilter that creates the payload). At search time you would then use e.g. PayloadTermQuery to decode the Payload and do something with it to alter how the query is being scored.
>
> This is a relatively easy example, but how would you deal with e.g. annotations that include multiple tokens (as in spans), such as chunks, or relations between tokens (and token spans), as in the coreference links example given by Stephen above?

I think you'd do something like what SynonymFilter does for multi-token synonyms.

E.g. a synonym rule "wireless network -> wifi" would insert a new token ("wifi"), overlapped on "wireless".

Lucene doesn't store the end of the span, but if this is really important for your use case, you could add a payload to that "wifi" token that encodes the number of positions the inserted token spans (2 in this case), and then the information would be present in the index.

You'd still need to do something custom at read/search time to decode this end position and do something interesting with it ...

Mike McCandless
http://blog.mikemccandless.com
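To make the "encode the number of positions into a payload" idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name SpanLengthPayload and the choice of a variable-length integer encoding are assumptions made for illustration; nothing in the thread prescribes a byte layout. In a real analysis chain the resulting bytes would presumably be set as the token's payload inside a custom TokenFilter, and that wiring is deliberately left out.

```java
import java.util.Arrays;

/** Hypothetical helper: packs a span length (in token positions) into payload bytes. */
final class SpanLengthPayload {
    private SpanLengthPayload() {}

    /** Encodes a non-negative span length as a variable-length int, 7 bits per byte. */
    static byte[] encode(int spanLength) {
        if (spanLength < 0) {
            throw new IllegalArgumentException("spanLength must be >= 0");
        }
        byte[] buf = new byte[5];            // an int never needs more than 5 bytes
        int n = 0;
        int v = spanLength;
        while ((v & ~0x7F) != 0) {           // more than 7 bits left: set the continuation bit
            buf[n++] = (byte) ((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        buf[n++] = (byte) v;                 // last byte has the high bit clear
        return Arrays.copyOf(buf, n);
    }

    /** Decodes bytes produced by encode back into the span length. */
    static int decode(byte[] payload) {
        int value = 0;
        int shift = 0;
        for (byte b : payload) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;                // high bit clear marks the final byte
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated payload");
    }
}
```

For the "wifi" token in the example, encode(2) yields a single byte, so the cost per inserted token stays small for typical span lengths.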
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
On Thu, Dec 13, 2012 at 10:09 AM, Glen Newton glen.new...@gmail.com wrote:
>> Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time.
>
> If this could be fixed (i.e. indexing the _end_ of a span) I think all the things that I want to do, and the things that can now be done in GATE very easily, would be possible using Mike's suggested method.

What would you use the end of the span for?

For example, do you need to do the equivalent of an end-of-span-aware PhraseQuery? That is, if the document is "wireless network is down", and I apply the synonym "wireless network -> wifi" at indexing time, then the end-span-aware PhraseQuery would match "wifi is down" (unlike today).

If you stuff the end of the span into the payload you'd have to create a custom variant of PhraseQuery to properly match based on the end span.

Mike McCandless
http://blog.mikemccandless.com
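The "wifi is down" case can be sketched without Lucene to show why the end of the span matters. The code below is a toy in-memory matcher, not a PhraseQuery variant: the class EndSpanPhraseMatcher, the Tok type and the useSpanLength flag are all hypothetical names introduced for illustration. Each token carries a start position and the number of positions it covers; the next phrase term is expected either at position + length (end-span-aware) or at position + 1 (start position only).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy phrase matcher over tokens that may span several positions (illustration only). */
final class EndSpanPhraseMatcher {
    private EndSpanPhraseMatcher() {}

    /** One indexed token: term text, start position, and number of positions covered. */
    static final class Tok {
        final String term;
        final int position;
        final int length;

        Tok(String term, int position, int length) {
            this.term = term;
            this.position = position;
            this.length = length;
        }
    }

    /** "wireless network is down" with the synonym "wifi" overlapped on "wireless", spanning 2 positions. */
    static List<Tok> exampleDoc() {
        List<Tok> doc = new ArrayList<>();
        doc.add(new Tok("wireless", 0, 1));
        doc.add(new Tok("wifi", 0, 2));
        doc.add(new Tok("network", 1, 1));
        doc.add(new Tok("is", 2, 1));
        doc.add(new Tok("down", 3, 1));
        return doc;
    }

    /** Returns true if the phrase terms occur consecutively somewhere in the document. */
    static boolean matches(List<Tok> doc, List<String> phrase, boolean useSpanLength) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (Tok first : doc) {
            if (first.term.equals(phrase.get(0))
                    && matchesFrom(doc, phrase, 1, nextPosition(first, useSpanLength), useSpanLength)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesFrom(List<Tok> doc, List<String> phrase, int index,
                                       int expectedPosition, boolean useSpanLength) {
        if (index == phrase.size()) {
            return true;                      // every phrase term was placed
        }
        for (Tok t : doc) {
            if (t.position == expectedPosition && t.term.equals(phrase.get(index))
                    && matchesFrom(doc, phrase, index + 1, nextPosition(t, useSpanLength), useSpanLength)) {
                return true;
            }
        }
        return false;
    }

    /** Where the following phrase term must start, with or without knowledge of the span end. */
    private static int nextPosition(Tok t, boolean useSpanLength) {
        return t.position + (useSpanLength ? t.length : 1);
    }
}
```

With useSpanLength set to false, "wifi" is treated as ending at position 1, where "network" sits, so "wifi is down" does not match; with it set to true the phrase matches, which is the behaviour Mike describes as requiring a custom query.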
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
On 18.12.2012 at 12:36, Michael McCandless wrote:
> On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober schno...@ids-mannheim.de wrote:
>> This is a relatively easy example, but how would you deal with e.g. annotations that include multiple tokens (as in spans), such as chunks, or relations between tokens (and token spans), as in the coreference links example given by Stephen above?
>
> I think you'd do something like what SynonymFilter does for multi-token synonyms.
>
> E.g. a synonym rule "wireless network -> wifi" would insert a new token ("wifi"), overlapped on "wireless".
>
> Lucene doesn't store the end of the span, but if this is really important for your use case, you could add a payload to that "wifi" token that encodes the number of positions the inserted token spans (2 in this case), and then the information would be present in the index.
>
> You'd still need to do something custom at read/search time to decode this end position and do something interesting with it ...

Thanks for the pointer! I'm still puzzled whether there is an optimal way to encode (labelled) relations between tokens or even spans; the latter part would probably lead back to the synonym-like solution.

Best,
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
On Wed, Dec 12, 2012 at 9:08 PM, lukai lukai1...@gmail.com wrote:
> Do we have any plan to decouple the indexing process? Lucene was designed for search, but judging by the questions people ask in this thread, it sometimes needs to go beyond search functionality.
>
> For example, we might want to customize our scoring function based on payloads. Sometimes I don't need to store TF/IDF information: we can pre-calculate features and store them in the system, but I still have to store the extra TF/IDF information. And sometimes I think we want to load the whole postings into memory to speed things up. In that case, we really want to customize the functionality/process of the inverted index.

Much of this can already be done with Lucene. E.g., plug in your own Similarity to get custom scoring (and we already have a bunch of standard models ... TF/IDF (default), BM25, DFR, language models, etc.). Use MemoryPostingsFormat to pull everything into RAM. Customize other parts of the index using your own Codec.

> The main problem is that the implementation is highly coupled with the index chain. It's not easy to rewrite a new one. Do we have a plan to make the index chain easier to change? Flexible index chain logic, flexible codec formats.

The indexing chain, which is inside IndexWriter and processes each document into temporary RAM structures and then writes a new segment via the Codec API, can in fact be changed, but it's extremely expert and the APIs are not documented (you must read the source code to work through it).

That said, customizing the chain is rarely really necessary ... typically the existing pluggability (payloads, Similarity implementations, custom codecs) can solve most problems.

Mike McCandless
http://blog.mikemccandless.com
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
On 13.12.2012 at 12:27, Michael McCandless wrote:
>> For example:
>> - part of speech of a token.
>> - syntactic parse subtree (over a span).
>> - semantically normalized phrase (to canonical text or ontological code).
>> - semantic group (of a span).
>> - coreference link.
>
> So for example part-of-speech is a per-Token-position attribute. Today the easiest way to handle this is to encode these attributes into a Payload, which is straightforward (make a custom TokenFilter that creates the payload). At search time you would then use e.g. PayloadTermQuery to decode the Payload and do something with it to alter how the query is being scored.

This is a relatively easy example, but how would you deal with e.g. annotations that include multiple tokens (as in spans), such as chunks, or relations between tokens (and token spans), as in the coreference links example given by Stephen above?

Best,
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
> Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time.

If this could be fixed (i.e. indexing the _end_ of a span) I think all the things that I want to do, and the things that can now be done in GATE very easily, would be possible using Mike's suggested method.

-Glen

On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless luc...@mikemccandless.com wrote:
> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote:
>>>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>>
>> For example:
>> - part of speech of a token.
>> - syntactic parse subtree (over a span).
>> - semantically normalized phrase (to canonical text or ontological code).
>> - semantic group (of a span).
>> - coreference link.
>
> So for example part-of-speech is a per-Token-position attribute. Today the easiest way to handle this is to encode these attributes into a Payload, which is straightforward (make a custom TokenFilter that creates the payload). At search time you would then use e.g. PayloadTermQuery to decode the Payload and do something with it to alter how the query is being scored.
>
> For the span-like attributes (e.g. a syntactic parse, semantically normalized phrase) I think you'd need to do something like SynonymFilter in your analysis, i.e. insert new tokens at the position where the span started. Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time.
>
> Mike McCandless
> http://blog.mikemccandless.com

--
http://zzzoot.blogspot.com/
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
That would be really nice. Full standoff annotations open a lot of doors.

If we had them, though, I'm not sure exactly which of Mike's methods you'd use? I thought payloads were completely token-based and could not be attached to spans regardless. And the SynonymFilter is really to mimic the behavior of multiple tokens/span... (though maybe you could add the other tokens in as synonyms and then skip the tokens you added...?).

Mike, is all this stuff possible if we can just index the ends of spans?

stephen

On 12/13/12 9:09 AM, Glen Newton glen.new...@gmail.com wrote:
>> Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time.
>
> If this could be fixed (i.e. indexing the _end_ of a span) I think all the things that I want to do, and the things that can now be done in GATE very easily, would be possible using Mike's suggested method.
>
> -Glen
>
> On Thu, Dec 13, 2012 at 6:27 AM, Michael McCandless luc...@mikemccandless.com wrote:
>> On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote:
>>>>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>>>
>>>> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>>>
>>> For example:
>>> - part of speech of a token.
>>> - syntactic parse subtree (over a span).
>>> - semantically normalized phrase (to canonical text or ontological code).
>>> - semantic group (of a span).
>>> - coreference link.
>>
>> So for example part-of-speech is a per-Token-position attribute. Today the easiest way to handle this is to encode these attributes into a Payload, which is straightforward (make a custom TokenFilter that creates the payload). At search time you would then use e.g. PayloadTermQuery to decode the Payload and do something with it to alter how the query is being scored.
>>
>> For the span-like attributes (e.g. a syntactic parse, semantically normalized phrase) I think you'd need to do something like SynonymFilter in your analysis, i.e. insert new tokens at the position where the span started. Unfortunately, Lucene doesn't properly index spans (it records the start position but not the end position), so that limits what kind of matching you can do at search time.
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Part-of-speech tagging is available now, in the indexer. LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does part-of-speech tagging, chunking and Named Entity Recognition. OpenNLP is an Apache project for natural-language processing. Some parts are in Solr that could be in Lucene.

https://issues.apache.org/jira/browse/lucene-2899

On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>
> For example:
> - part of speech of a token.
> - syntactic parse subtree (over a span).
> - semantically normalized phrase (to canonical text or ontological code).
> - semantic group (of a span).
> - coreference link.
>
> stephen
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
It is not clear this is exactly what is needed/being discussed.

From the issue: "We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position."

This adds it to a token, not a span. 'same position' does not suggest it also records the end position.

-Glen

On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:
> Part-of-speech tagging is available now, in the indexer. LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does part-of-speech tagging, chunking and Named Entity Recognition. OpenNLP is an Apache project for natural-language processing. Some parts are in Solr that could be in Lucene.
>
> https://issues.apache.org/jira/browse/lucene-2899
>
> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>>
>>> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>>
>> For example:
>> - part of speech of a token.
>> - syntactic parse subtree (over a span).
>> - semantically normalized phrase (to canonical text or ontological code).
>> - semantic group (of a span).
>> - coreference link.
>>
>> stephen

--
http://zzzoot.blogspot.com/
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
I should not have added that note. The OpenNLP patch gives a concrete example of adding an annotation to text.

On 12/13/2012 01:54 PM, Glen Newton wrote:
> It is not clear this is exactly what is needed/being discussed.
>
> From the issue: "We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position."
>
> This adds it to a token, not a span. 'same position' does not suggest it also records the end position.
>
> -Glen
>
> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:
>> Part-of-speech tagging is available now, in the indexer. LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does part-of-speech tagging, chunking and Named Entity Recognition. OpenNLP is an Apache project for natural-language processing. Some parts are in Solr that could be in Lucene.
>>
>> https://issues.apache.org/jira/browse/lucene-2899
>>
>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>>>
>>>> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>>>
>>> For example:
>>> - part of speech of a token.
>>> - syntactic parse subtree (over a span).
>>> - semantically normalized phrase (to canonical text or ontological code).
>>> - semantic group (of a span).
>>> - coreference link.
>>>
>>> stephen
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Cool! Sounds great! :-)

Any pointers to a (Lucene) example that attaches a payload to a start..end span that is more than one token?

thanks,
-Glen

On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog goks...@gmail.com wrote:
> I should not have added that note. The OpenNLP patch gives a concrete example of adding an annotation to text.
>
> On 12/13/2012 01:54 PM, Glen Newton wrote:
>> It is not clear this is exactly what is needed/being discussed.
>>
>> From the issue: "We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position."
>>
>> This adds it to a token, not a span. 'same position' does not suggest it also records the end position.
>>
>> -Glen
>>
>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:
>>> Part-of-speech tagging is available now, in the indexer. LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does part-of-speech tagging, chunking and Named Entity Recognition. OpenNLP is an Apache project for natural-language processing. Some parts are in Solr that could be in Lucene.
>>>
>>> https://issues.apache.org/jira/browse/lucene-2899
>>>
>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>>>>
>>>>> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>>>>
>>>> For example:
>>>> - part of speech of a token.
>>>> - syntactic parse subtree (over a span).
>>>> - semantically normalized phrase (to canonical text or ontological code).
>>>> - semantic group (of a span).
>>>> - coreference link.
>>>>
>>>> stephen

--
http://zzzoot.blogspot.com/
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Hi Glen,

I don't believe you can attach a single payload to multiple tokens. What I did for a similar requirement was to combine the tokens into a single underscore-delimited token and attach the payload to it. For example:

The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs down.

Now assume "Big Bad Wolf" and "Three Little Pigs" are spans to which I would like to attach payloads. I run the tokens through a custom tokenizer that produces:

The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the Three_Little_Pigs$payload2 down.

In my case this makes sense, i.e. I can treat the span as a single unit. Not sure about your use case.

HTH
Sujit

On Dec 13, 2012, at 2:08 PM, Glen Newton wrote:
> Cool! Sounds great! :-)
>
> Any pointers to a (Lucene) example that attaches a payload to a start..end span that is more than one token?
>
> thanks,
> -Glen
>
> On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog goks...@gmail.com wrote:
>> I should not have added that note. The OpenNLP patch gives a concrete example of adding an annotation to text.
>>
>> On 12/13/2012 01:54 PM, Glen Newton wrote:
>>> It is not clear this is exactly what is needed/being discussed.
>>>
>>> From the issue: "We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position."
>>>
>>> This adds it to a token, not a span. 'same position' does not suggest it also records the end position.
>>>
>>> -Glen
>>>
>>> On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:
>>>> Part-of-speech tagging is available now, in the indexer. LUCENE-2899 adds OpenNLP to the Lucene/Solr codebase. It does part-of-speech tagging, chunking and Named Entity Recognition. OpenNLP is an Apache project for natural-language processing. Some parts are in Solr that could be in Lucene.
>>>>
>>>> https://issues.apache.org/jira/browse/lucene-2899
>>>>
>>>> On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
>>>>>>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>>>>>
>>>>>> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>>>>>
>>>>> For example:
>>>>> - part of speech of a token.
>>>>> - syntactic parse subtree (over a span).
>>>>> - semantically normalized phrase (to canonical text or ontological code).
>>>>> - semantic group (of a span).
>>>>> - coreference link.
>>>>>
>>>>> stephen
>
> --
> http://zzzoot.blogspot.com/
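Sujit's workaround of collapsing a multi-token span into one underscore-delimited token can be sketched as follows. This is plain Java over a list of strings, not a Lucene Tokenizer; the class SpanJoiner, the Span type, and the convention that spans are sorted, non-overlapping and end-exclusive are assumptions for illustration. The "$payload" suffix simply mirrors the notation in the message above, whereas a real implementation would presumably store the payload separately from the term text.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: merges each annotated span into a single "_"-joined token. */
final class SpanJoiner {
    private SpanJoiner() {}

    /** A span over token indexes [start, end) with a label standing in for its payload. */
    static final class Span {
        final int start;
        final int end;
        final String payload;

        Span(int start, int end, String payload) {
            this.start = start;
            this.end = end;
            this.payload = payload;
        }
    }

    /**
     * Copies tokens outside spans unchanged and replaces each span with one joined token.
     * Spans must be sorted by start and must not overlap.
     */
    static List<String> join(List<String> tokens, List<Span> spans) {
        List<String> out = new ArrayList<>();
        int i = 0;
        for (Span s : spans) {
            if (s.start < i || s.start >= s.end || s.end > tokens.size()) {
                throw new IllegalArgumentException("spans must be sorted, non-empty and in range");
            }
            while (i < s.start) {
                out.add(tokens.get(i++));     // tokens before the span pass through
            }
            out.add(String.join("_", tokens.subList(s.start, s.end)) + "$" + s.payload);
            i = s.end;                        // skip the tokens consumed by the span
        }
        while (i < tokens.size()) {
            out.add(tokens.get(i++));         // trailing tokens after the last span
        }
        return out;
    }
}
```

The trade-off of this design is that the individual words inside a span are no longer separately searchable unless they are also indexed on their own, which is why it suits cases where the span really is a single unit.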
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>
> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?

For example:
- part of speech of a token.
- syntactic parse subtree (over a span).
- semantically normalized phrase (to canonical text or ontological code).
- semantic group (of a span).
- coreference link.

stephen
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
+10

These are the kind of things you can do in GATE[1] using annotations[2]. A VERY useful feature.

-Glen

[1] http://gate.ac.uk
[2] http://gate.ac.uk/wiki/jape-repository/annotations.html

On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>
> For example:
> - part of speech of a token.
> - syntactic parse subtree (over a span).
> - semantically normalized phrase (to canonical text or ontological code).
> - semantic group (of a span).
> - coreference link.
>
> stephen

--
http://zzzoot.blogspot.com/
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Do we have any plan to decouple the indexing process? Lucene was designed for search, but judging by the questions people ask in this thread, it sometimes needs to go beyond search functionality.

For example, we might want to customize our scoring function based on payloads. Sometimes I don't need to store TF/IDF information: we can pre-calculate features and store them in the system, but I still have to store the extra TF/IDF information. And sometimes I think we want to load the whole postings into memory to speed things up. In that case, we really want to customize the functionality/process of the inverted index.

The main problem is that the implementation is highly coupled with the index chain. It's not easy to rewrite a new one. Do we have a plan to make the index chain easier to change? Flexible index chain logic, flexible codec formats.

Thanks,

On Fri, Nov 30, 2012 at 10:02 AM, Michael McCandless luc...@mikemccandless.com wrote:
> On Fri, Nov 30, 2012 at 12:25 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote:
>> Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?
>>
>> If I understand you correctly, it's a little different from what's happening in your blog posts:
>> http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
>> http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html
>>
>> Those posts deal with making your own codec, but not about changing what's stored in the postings? I guess I misunderstood postings format before.
>
> I don't know of any examples of adding an entirely new attribute to the postings, except via payloads. All the examples we have are of Codecs/PostingsFormats/etc. storing all the usual attributes (term and its stats (docFreq/totalTermFreq), doc, freq, position, offsets, payload) in interesting ways.
>
> Maybe we can make this more concrete: what new attribute are you needing to record in the postings and access at search time?
>
> Mike McCandless
> http://blog.mikemccandless.com
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
On 11/28/2012 01:11 AM, Michael McCandless wrote:
> Flexible indexing is the ability to make your own codec, which controls the reading and writing of all index parts (postings, stored fields, term vectors, deleted docs, etc.).
>
> So for example if you want to store some postings as a bit set instead of the block format that's the default coming up in 4.1, that's easy to do.
>
> But what is less easy (as I described below) is changing what is actually stored in the postings, e.g. adding a new per-position attribute.
>
> The original goal was to allow arbitrary attributes beyond the known docs/freqs/positions/offsets that Lucene supports today, so that you could easily make new application-dependent per-term, per-doc, per-position things, pull them from the analyzer, save them to the index, and access them from an IndexReader / query, but while some APIs do expose this, it's not very well explored yet (e.g., you'd have to make a custom indexing chain to get the attributes through IndexWriter down to your codec).
>
> It would be great to make progress making this easier, so ideas are very welcome :)

Regarding my question/thread, is it also possible to change the backend system? I'd like to use Lucene for a versioned DBMS, so I would need the ability to serialize/deserialize the bytes in my backend, where keys/values are stored in pages (for instance in an upcoming B+-tree, or in simple unordered pages via a record-ID/record mapping). But as no one has suggested anything so far, and I also asked about a year ago, after implementing the B+-tree I will probably have to implement my own data structure and parser/tokenizer/stemmer... :-(

kind regards,
Johannes
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
"I will probably have to implement my own datastructure and parser/tokenizer/stemmer"

Why? I mean, I think the point of the Lucene architecture is that the codec level is completely independent of the analysis level. The end result of analysis is a value to be stored from the application perspective (a logical value, so to speak), but NOT the bit sequence (the physical value, so to speak) that the codec will actually store.

So, go ahead and have your own codec that does whatever it wants with values, but the input for storage and query should be the output of a standard Lucene analyzer.

-- Jack Krupansky

-----Original Message-----
From: Johannes.Lichtenberger
Sent: Friday, November 30, 2012 10:15 AM
To: java-user@lucene.apache.org
Cc: Michael McCandless
Subject: Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
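To make the logical/physical distinction in the reply above concrete, here is a minimal sketch in plain Java. It does not use any Lucene API: the class `PostingsEncodingSketch` and its methods are hypothetical names invented for illustration, and the two encodings (a bit set and a delta-plus-variable-length-byte list) are generic stand-ins, not Lucene's actual postings formats. The point is only that the same logical value (a sorted list of doc IDs) can be stored in two different physical forms and read back unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical illustration only; not a Lucene codec. */
final class PostingsEncodingSketch {

    /** Physical form 1: one bit per document ID. */
    static BitSet toBitSet(int[] sortedDocIds) {
        BitSet bits = new BitSet();
        for (int doc : sortedDocIds) {
            bits.set(doc);
        }
        return bits;
    }

    /** Reads the doc IDs back from the bit set, in ascending order. */
    static int[] fromBitSet(BitSet bits) {
        return bits.stream().toArray();
    }

    /** Physical form 2: gaps between doc IDs, 7 payload bits per byte. */
    static byte[] toDeltaVInts(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int doc : sortedDocIds) {
            int gap = doc - previous;
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80); // high bit set: more bytes follow
                gap >>>= 7;
            }
            out.write(gap);
            previous = doc;
        }
        return out.toByteArray();
    }

    /** Reads the doc IDs back by summing the decoded gaps. */
    static int[] fromDeltaVInts(byte[] bytes) {
        int[] docs = new int[bytes.length]; // upper bound: at least one byte per doc
        int count = 0;
        int pos = 0;
        int previous = 0;
        while (pos < bytes.length) {
            int gap = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;
            docs[count++] = previous;
        }
        return Arrays.copyOf(docs, count);
    }

    public static void main(String[] args) {
        int[] docs = {3, 7, 300};
        System.out.println(Arrays.toString(fromBitSet(toBitSet(docs))));
        System.out.println(Arrays.toString(fromDeltaVInts(toDeltaVInts(docs))));
    }
}
```

Both round trips return the same doc IDs, which is the sense in which the analysis output (the logical value) does not need to know how the storage layer lays out its bytes.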
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?

If I understand you correctly, it's a little different from what's happening in your blog posts:

http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html

Those posts deal with making your own codec, but not with changing what's stored in the postings? I guess I misunderstood postings format before.

stephen

Flexible indexing is the ability to make your own codec, which controls the reading and writing of all index parts (postings, stored fields, term vectors, deleted docs, etc.). So for example, if you want to store some postings as a bit set instead of the block format that's the default coming up in 4.1, that's easy to do.

But what is less easy (as I described below) is changing what is actually stored in the postings, e.g. adding a new per-position attribute. The original goal was to allow arbitrary attributes beyond the known docs/freqs/positions/offsets that Lucene supports today, so that you could easily make new application-dependent per-term, per-doc, per-position things, pull them from the analyzer, save them to the index, and access them from an IndexReader / query. While some APIs do expose this, it's not very well explored yet (e.g. you'd have to make a custom indexing chain to get the attributes through IndexWriter down to your codec).

It would be great to make progress making this easier, so ideas are very welcome :)

Mike McCandless
http://blog.mikemccandless.com

On Tue, Nov 27, 2012 at 3:37 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote:

Following up on a previous question... What is flexible indexing in Lucene 4.0? We assumed it was the ability to easily make new postings formats/codecs -- but a response below says that would be tricky?
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
On Fri, Nov 30, 2012 at 12:25 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote:

Is there any (preliminary) code checked in somewhere that I can look at, that would help me understand the practical issues that would need to be addressed?

If I understand you correctly, it's a little different from what's happening in your blog posts:

http://blog.mikemccandless.com/2012/07/building-new-lucene-postings-format.html
http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html

Those posts deal with making your own codec, but not with changing what's stored in the postings? I guess I misunderstood postings format before.

I don't know of any examples of adding an entirely new attribute to the postings, except via payloads. All the examples we have are of Codecs/PostingsFormats/etc. storing all the usual attributes (term and its stats (docFreq/totalTermFreq), doc, freq, position, offsets, payload) in interesting ways.

Maybe we can make this more concrete: what new attribute do you need to record in the postings and access at search time?

Mike McCandless
http://blog.mikemccandless.com
What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Following up on a previous question... What is flexible indexing in Lucene 4.0? We assumed it was the ability to easily make new postings formats/codecs -- but a response below says that would be tricky?

stephen

On 11/27/12 11:48 AM, David Causse dcau...@spotter.com wrote:

Hi,

We use payloads, but we can't use the whole Lucene API. For example, we use them to run relation queries such as:

@quote(@speaker(obama) @discourse(health))

This searches for all documents that contain a quote by Obama talking about health. We encode linguistic information (standoff annotations) inside payloads and use a custom search API to query the index. I didn't find a convenient way to attach my code to the Lucene Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole Query stack.

In short, if you want to use Payloads for more than boosting a term, there's a good chance you'll need to rewrite a big part of the query stack.

On 27/11/2012 16:59, Wu, Stephen T., Ph.D. wrote:

I think we're looking at doing something related. I haven't explored the Enums and don't know how to make a postings codec... But what is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

We're trying to incorporate attributes onto terms/spans in indexes. We'd also like to try out some interesting ways to score things that go beyond just tokens. We were considering using Attributes instead of Payloads, because it seems like using Payloads ties you to a particular kind of scoring -- just a weight on a token. Can Payloads be used for more general scoring functions? E.g., considering a span of text alongside multiple Payloads? Does it make sense to move outside of Payloads here?

Thanks!
stephen

On 11/19/12 8:14 AM, Michael McCandless luc...@mikemccandless.com wrote:

A new postings format would be tricky because you have new attributes you want to index. The DocsAndPositionsEnum does have an attributes source, but this is not well explored, and there are known problems (they can't be easily merged in the composite reader case). So that's why I suggested packing your information into a payload ...

Mike McCandless
http://blog.mikemccandless.com

On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy wuqiu@qq.com wrote:

Thanks, Mike. About the third question: is encoding them all into the payload better than a new postings format with the codec? I mean replacing the original posting item (position, startOffset, endOffset, payload) with my own inverted item, such as

    class TestPostingItem {
        int termId;
        long startOffset;
        long endOffset;
        float score;
        int segId;
        long timeStamp;
    }

--
View this message in context: http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
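As a rough illustration of the "pack your information into a payload" suggestion quoted above, the sketch below serializes the extra fields of the TestPostingItem example into a fixed-size byte array and reads them back. It is plain Java with no Lucene dependency. The class name `PostingItemPayload`, the field layout, and the choice to pack only `score`, `segId` and `timeStamp` (leaving the term and offsets to the regular postings) are assumptions made for this example, not something the thread prescribes. In a real index the resulting bytes would still have to be attached to each token by a custom TokenFilter and decoded by custom search-time code.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: packs extra per-position fields into payload bytes. */
final class PostingItemPayload {

    // Assumed layout: float score (4 bytes), int segId (4 bytes), long timeStamp (8 bytes).
    static final int SIZE = Float.BYTES + Integer.BYTES + Long.BYTES;

    /** Index-time side: turn the extra fields into a byte[] payload. */
    static byte[] encode(float score, int segId, long timeStamp) {
        return ByteBuffer.allocate(SIZE)
                .putFloat(score)
                .putInt(segId)
                .putLong(timeStamp)
                .array();
    }

    /** Search-time side: read each field back from its fixed offset. */
    static float score(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat(0);
    }

    static int segId(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt(Float.BYTES);
    }

    static long timeStamp(byte[] payload) {
        return ByteBuffer.wrap(payload).getLong(Float.BYTES + Integer.BYTES);
    }

    public static void main(String[] args) {
        byte[] payload = encode(0.75f, 3, 1353288780000L);
        System.out.println(payload.length + " bytes, score=" + score(payload)
                + ", segId=" + segId(payload)
                + ", timeStamp=" + timeStamp(payload));
    }
}
```

A fixed layout keeps decoding trivial, at the cost of 16 bytes per position; a variable-length encoding would be smaller but needs more decoding logic at search time.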
Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?
Flexible indexing is the ability to make your own codec, which controls the reading and writing of all index parts (postings, stored fields, term vectors, deleted docs, etc.). So for example, if you want to store some postings as a bit set instead of the block format that's the default coming up in 4.1, that's easy to do.

But what is less easy (as I described below) is changing what is actually stored in the postings, e.g. adding a new per-position attribute. The original goal was to allow arbitrary attributes beyond the known docs/freqs/positions/offsets that Lucene supports today, so that you could easily make new application-dependent per-term, per-doc, per-position things, pull them from the analyzer, save them to the index, and access them from an IndexReader / query. While some APIs do expose this, it's not very well explored yet (e.g. you'd have to make a custom indexing chain to get the attributes through IndexWriter down to your codec).

It would be great to make progress making this easier, so ideas are very welcome :)

Mike McCandless
http://blog.mikemccandless.com

On Tue, Nov 27, 2012 at 3:37 PM, Wu, Stephen T., Ph.D. wu.step...@mayo.edu wrote:

Following up on a previous question... What is flexible indexing in Lucene 4.0? We assumed it was the ability to easily make new postings formats/codecs -- but a response below says that would be tricky?

stephen