Re: Using POS payloads for chunking
Ah, good to know! I'm actually using lower-level calls, building the TokenStream by hand from UIMA annotations rather than going through an analyzer, but I'll keep that in mind for future projects. Thanks!

On Thu, Jun 15, 2017 at 12:10 PM Erick Erickson wrote: [...]
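For readers following the lower-level route José describes, here is a sketch of the per-token data a hand-built stream has to supply, i.e. what Lucene's CharTermAttribute, OffsetAttribute, PositionIncrementAttribute, and PayloadAttribute would carry. This is a self-contained, pure-JDK illustration, not Lucene API code; the `Annotation` record is a hypothetical stand-in for a UIMA token annotation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a UIMA token annotation: a character span,
// its covered text, and a POS tag.
record Annotation(int begin, int end, String text, String pos) {}

// The per-token data a hand-built TokenStream must expose through Lucene's
// attributes: term text, offsets, position increment, and a payload.
record Token(String term, int startOffset, int endOffset, int posIncr, byte[] payload) {}

class AnnotationTokenizer {
    // Map a begin-sorted list of annotations to tokens. Every annotation
    // advances the position by one; the POS tag rides along as a UTF-8 payload.
    static List<Token> tokens(List<Annotation> annotations) {
        List<Token> out = new ArrayList<>();
        for (Annotation a : annotations) {
            out.add(new Token(a.text(), a.begin(), a.end(), 1,
                              a.pos().getBytes(java.nio.charset.StandardCharsets.UTF_8)));
        }
        return out;
    }
}
```

In a real TokenStream subclass, `incrementToken()` would walk this same sequence, copying each record into the reused attribute instances instead of allocating a list.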
Re: Using POS payloads for chunking
José:

Do note that, while the byte array itself isn't limited, prior to LUCENE-7705 most of the tokenizers you would use limited the incoming token to 256 characters at most. This is not at all a _Lucene_ limitation at a low level; rather, if you're indexing data with a delimited payload (say abc|your_payload_here), the tokenizer would chop it off when the whole thing reached 256 chars.

Hmmm, still confusing. Say the input to the analysis chain was

  abc|512_bytes_of_payload_data

The tokenizer would give you

  abc|first_252_bytes

But if you're using lower-level Lucene calls directly, that limit doesn't apply.

Best,
Erick

On Thu, Jun 15, 2017 at 8:21 AM, José Tomás Atria wrote: [...]
- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Using POS payloads for chunking
Hi Markus, thanks for your response!

Now I feel stupid: that is clearly a much simpler approach, and it has the added benefit that it would not require me to meddle with the scoring process, which I'm still a bit terrified of. Thanks for the tip.

I guess the question is still valid, though: how would one take payloads into account when scoring entire spans? Does this make sense at all? Any links to a more-or-less straightforward example?

On the length of payloads: I understood that you have other restrictions, but payloads take a BytesRef as value, so you can encode arbitrary data in them as long as you encode and decode properly. E.g. you could encode the long array that backs a fixed bitset as a BytesRef and pass that, though I'm not sure it would be efficient unless you have at least 64 flags.

thanks!
jta

On Wed, Jun 14, 2017 at 4:45 PM Markus Jelsma wrote: [...]
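José's suggestion of encoding the long[] that backs a fixed bitset can be made concrete. Below is a minimal round-trip sketch using only the JDK; in Lucene the resulting byte[] would be wrapped in a BytesRef and set on the PayloadAttribute. The class name and layout (big-endian, word order preserved) are illustrative choices, not anything the thread specifies.

```java
import java.nio.ByteBuffer;

class BitsetPayloadCodec {
    // Pack the long[] words of a fixed bitset into a big-endian byte array,
    // suitable for wrapping in a Lucene BytesRef payload.
    static byte[] encode(long[] words) {
        ByteBuffer buf = ByteBuffer.allocate(words.length * Long.BYTES);
        for (long w : words) buf.putLong(w);
        return buf.array();
    }

    // Inverse: recover the long[] words from the payload bytes.
    static long[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] words = new long[bytes.length / Long.BYTES];
        for (int i = 0; i < words.length; i++) words[i] = buf.getLong();
        return words;
    }
}
```

As José notes, this only pays off when you actually need many flags; a single long (or, as Markus does, a single byte) is cheaper per posting.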
RE: Using POS payloads for chunking
Hello Tommaso,

These don't propagate to search, right? But they can be used in the analyzer chain! That would be a better solution than using delimiters on words. The only problem is that TypeTokenFilter works on tokens, i.e. after the tokenizer. The bonus of a CharFilter is that it sees the whole text, so OpenNLP can digest it all at once; the downside is that a CharFilter cannot set a TypeAttribute, because there are no tokens yet.

If we tried that option, we would have to build a TokenFilter that understands the whole text at once, because that is what OpenNLP needs, not single tokens. This is difficult, so we chose the combination of a CharFilter plus a TokenFilter. It is not ideal, but I find it very hard to digest whole text in a TokenFilter; see Shingle and CommonGrams, which are very complicated filters.

How would you overcome this problem? For NLP you need all the text at once, which a CharFilter provides, but that won't allow you to set a TypeAttribute. Perhaps I am missing something completely, and am stupid, probably :)

Thanks,
Markus

-Original message-
> From: Tommaso Teofili
> Sent: Wednesday 14th June 2017 23:49
> To: java-user@lucene.apache.org
> Subject: Re: Using POS payloads for chunking
> [...]
Re: Using POS payloads for chunking
I think it'd be interesting to also investigate using TypeAttribute [1] together with TypeTokenFilter [2].

Regards,
Tommaso

[1] : https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html
[2] : https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html

On Wed, Jun 14, 2017 at 11:33 PM Markus Jelsma wrote: [...]
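The behaviour Tommaso points at can be illustrated without the Lucene classes: TypeTokenFilter drops tokens whose TypeAttribute is (or is not) in a configured set, and, like any FilteringTokenFilter, it folds the dropped tokens' position increments into the next surviving token so that phrase and span positions stay correct. A self-contained sketch of that bookkeeping follows; the `Typed` record and the tag set are illustrative stand-ins, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Stand-in for a token carrying Lucene's term text, TypeAttribute value,
// and PositionIncrementAttribute value.
record Typed(String term, String type, int posIncr) {}

class TypeFilter {
    // Keep only tokens whose type is in `keep`, accumulating the position
    // increments of dropped tokens so downstream positions stay correct --
    // the same contract Lucene's FilteringTokenFilter enforces.
    static List<Typed> filter(List<Typed> in, Set<String> keep) {
        List<Typed> out = new ArrayList<>();
        int skipped = 0;
        for (Typed t : in) {
            if (keep.contains(t.type())) {
                out.add(new Typed(t.term(), t.type(), t.posIncr() + skipped));
                skipped = 0;
            } else {
                skipped += t.posIncr();
            }
        }
        return out;
    }
}
```

The caveat Markus raises still applies: something earlier in the chain must have set the type, and a CharFilter cannot do that because it runs before tokens exist.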
RE: Using POS payloads for chunking
Hello Erick, no worries, I recognize you two.

I will take a look at your references tomorrow. Although I am still fine with eight bits, I cannot spare more than one. If Lucene allows us to pass longer bitsets via the BytesRef, that would be awesome and easy to encode.

Thanks!
Markus

-Original message-
> From: Erick Erickson
> Sent: Wednesday 14th June 2017 23:29
> To: java-user
> Subject: Re: Using POS payloads for chunking
> [...]
Re: Using POS payloads for chunking
Markus:

I don't believe that payloads are limited in size at all. LUCENE-7705 was done in part because there _was_ a hard-coded 256 limit in some of the tokenizers. The payload (at least in recent versions) is just some bytes after the token, and (with LUCENE-7705) can be arbitrarily long.

Of course, if you put anything other than a number in there, you have to provide your own decoders and the like to make sense of your payload...

Best,
Erick (Erickson, not Hatcher)

On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma wrote: [...]
RE: Using POS payloads for chunking
Hello Erik, Using Solr, or actually more parts are Lucene, we have a CharFilter adding treebank tags to whitespace delimited word using a delimiter, further on we get these tokens with the delimiter and the POS-tag. It won't work with some Tokenizers and put it before WDF, it'll split as you know. That TokenFilter is configured with a tab delimited mapping config containing \t, and there the bitset is encoded as payload. Our edismax extension rewrites queries to payload supported equivalents, this is quite trivial, except for all those API changes in Lucene you have to put up with. Finally a BM25 extension that has, amongst others, a mapping of bitset to score. Nouns get a bonus, prepositions and other useless pieces get a punishment etc. Payloads are really great things to use! We also use it to distinguish between compounds and their subwords, o.a. we supply Dutch and German speaking countries. And stemmed words and non-stemmed words. Although the latter also benefit from IDF statistics, payloads just help to control boosting more precisely regardless of your corpus. I still need to take a look at your recent payload QParsers for Solr and see how different, probably better, they are compared to our older implementations. Although we don't use PayloadTermQParser equivalent for regular search, we do use it for scoring recommendations via delimited multi valued fields. Payloads are versatile! The downside of payloads is that they are limited to 8 bits. Although we can easily fit our reduced treebank in there, we also use single bits to signal for compound/subword, and stemmed/unstemmed and some others. Hope this helps. Regards, Markus -Original message- > From:Erik Hatcher > Sent: Wednesday 14th June 2017 23:03 > To: java-user@lucene.apache.org > Subject: Re: Using POS payloads for chunking > > Markus - how are you encoding payloads as bitsets and use them for scoring? > Curious to see how folks are leveraging them. 
> > Erik
> > [...]

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
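The bit-level scheme Markus describes (a reduced treebank plus a few signal bits packed into a single payload byte, mapped to boosts at score time) can be sketched in plain Java. This is an illustrative reconstruction, not his actual code: the tag constants and boost values are invented for the example, and in a real analysis chain the resulting byte would be attached to each token via Lucene's PayloadAttribute.

```java
// Illustrative sketch: packing a reduced POS tag set plus signal bits
// into one payload byte, and mapping the byte to a score multiplier.
public class PosPayload {
    // Hypothetical bit assignments for a reduced treebank.
    public static final int NOUN = 1;        // bit 0
    public static final int VERB = 1 << 1;   // bit 1
    public static final int ADJ  = 1 << 2;   // bit 2
    public static final int PREP = 1 << 3;   // bit 3
    // Signal bits, like the compound/subword and stemmed/unstemmed flags.
    public static final int SUBWORD = 1 << 6;
    public static final int STEMMED = 1 << 7;

    // OR the flags together into a single byte.
    public static byte encode(int... flags) {
        int b = 0;
        for (int f : flags) b |= f;
        return (byte) b;
    }

    // Mask with 0xFF so the sign bit of the byte doesn't interfere.
    public static boolean has(byte payload, int flag) {
        return (payload & 0xFF & flag) != 0;
    }

    // A score mapping in the spirit of the BM25 extension described above:
    // nouns get a bonus, prepositions a penalty, everything else is neutral.
    public static float boost(byte payload) {
        if (has(payload, NOUN)) return 1.5f;
        if (has(payload, PREP)) return 0.5f;
        return 1.0f;
    }
}
```

A custom similarity would call something like `boost()` with the payload byte read at each matching position and fold the result into the term score.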
Re: Using POS payloads for chunking
Markus - how are you encoding payloads as bitsets and using them for scoring? Curious to see how folks are leveraging them.

Erik

> On Jun 14, 2017, at 4:45 PM, Markus Jelsma wrote:
> [...]
RE: Using POS payloads for chunking
Hello,

We use POS-tagging too, and encode the tags as payload bitsets for scoring, which is, as far as I know, the only possibility with payloads.

So, instead of encoding them as payloads, why not index your treebank's POS-tags as tokens at the same position, like synonyms? If you do that, you can use span and phrase queries to find chunks of multiple POS-tags.

This is the first approach I can think of. Treating them as regular tokens enables you to use regular search for them.

Regards,
Markus

-Original message-
> From: José Tomás Atria
> Sent: Wednesday 14th June 2017 22:29
> To: java-user@lucene.apache.org
> Subject: Using POS payloads for chunking
>
> [...]
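The synonym-style suggestion above can be modeled without any Lucene dependency: each POS tag is emitted as an extra token with a position increment of zero, so it occupies the same position as its word, and ordinary phrase or span queries over the tags then find chunks. This is only a conceptual sketch (the Token record and method names are made up for illustration); a real implementation would be a Lucene TokenFilter manipulating PositionIncrementAttribute.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model of stacking POS tags on their words, synonym-style:
// a position increment of 1 advances to the next position, while 0
// stacks the token on the previous one.
public class PosAsSynonyms {
    public record Token(String term, int posIncrement) {}

    // Input: parallel arrays of words and their POS tags.
    public static List<Token> tokens(String[] words, String[] tags) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Token(words[i], 1));  // the word advances position
            out.add(new Token(tags[i], 0));   // the tag stacks on top of it
        }
        return out;
    }
}
```

With "quick fox" tagged JJ NN, positions hold {quick, JJ} and {fox, NN}, so the phrase query "JJ NN" matches regardless of the underlying words - which is exactly what makes chunk search possible.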
Using POS payloads for chunking
Hello!

I'm not particularly familiar with Lucene's search API (I've been using the library mostly as a dumb index rather than a search engine), but I am almost certain that, using its payload capabilities, it would be trivial to implement a regular chunker to look for patterns in sequences of payloads.

(Trying not to be too pedantic: a regular chunker looks for 'chunks' based on part-of-speech tags. E.g., noun phrases can be searched for with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and zero or more adjectives preceding a bunch of nouns, etc.)

Assuming my index has POS tags encoded as payloads for each position, how would one search for such patterns, irrespective of terms? I started studying the spans search API, as this seemed like the natural place to start, but I quickly got lost.

Any tips would be extremely appreciated. (Or references to this kind of thing; I'm sure someone must have tried something similar before...)

thanks!
~jta
--

sent from a phone. please excuse terseness and tpyos.

enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
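The "regular chunker" idea in the question can be made concrete without Lucene at all: once each position's POS tag is known, a chunk pattern like (DT)?(JJ)*(NN|NP)+ is literally a regular expression over the tag sequence. The toy sketch below (all names invented for illustration) runs the pattern over space-joined tags; a Lucene version would express the same pattern with span queries over tag tokens indexed at each position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy regular chunker: an optional determiner, any number of
// adjectives, then one or more nouns, matched as a regex over the
// tag sequence. Tags are space-terminated so boundaries are unambiguous.
public class RegularChunker {
    private static final Pattern NOUN_PHRASE =
        Pattern.compile("(DT )?(JJ )*(NN |NP )+");

    // Returns the first noun-phrase chunk found, or null if none.
    public static String firstNounPhrase(String[] tags) {
        StringBuilder sb = new StringBuilder();
        for (String t : tags) sb.append(t).append(' ');
        Matcher m = NOUN_PHRASE.matcher(sb);
        return m.find() ? m.group().trim() : null;
    }
}
```

For the tag sequence DT JJ NN VB this finds the chunk "DT JJ NN" and stops before the verb, mirroring what a span-based query over same-position tag tokens would match.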