Re: Using POS payloads for chunking

2017-06-15 Thread José Tomás Atria
Ah, good to know!

I'm actually using lower-level calls, as I'm building the TokenStream by
hand from UIMA annotations and not using any analyzer, but I'll keep that
in mind for future projects. Thanks!
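
For the record, the hand-built stream looks roughly like the sketch
below. The Tok carrier class is a hypothetical stand-in for my UIMA
annotation types, and the field has to be indexed with
DOCS_AND_FREQS_AND_POSITIONS for the payloads to survive:

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

public final class PosPayloadTokenStream extends TokenStream {

  /** Hypothetical carrier for one (term, POS tag) pair. */
  public static final class Tok {
    final String term, tag;
    public Tok(String term, String tag) { this.term = term; this.tag = tag; }
  }

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
  private final List<Tok> toks;
  private Iterator<Tok> it;

  public PosPayloadTokenStream(List<Tok> toks) { this.toks = toks; }

  @Override
  public void reset() throws IOException {
    super.reset();
    it = toks.iterator();
  }

  @Override
  public boolean incrementToken() {
    if (it == null || !it.hasNext()) return false;
    clearAttributes();
    Tok t = it.next();
    termAtt.setEmpty().append(t.term);       // the word itself
    payAtt.setPayload(new BytesRef(t.tag));  // its POS tag as the payload
    return true;                             // position increment defaults to 1
  }
}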

On Thu, Jun 15, 2017 at 12:10 PM Erick Erickson 
wrote:

> José:
>
> Do note that, while the byte array itself isn't limited, prior to
> LUCENE-7705 most of the tokenizers you would use limited the incoming
> token to 256 chars at most. This is not a low-level _Lucene_
> limitation; rather, if you're indexing data with a delimited payload
> (say abc|your_payload_here), the tokenizer would chop the token off
> once the whole thing reached 256 chars.
>
> Hmmm, still confusing. Say the input to the analysis chain was
>
> abc|512_bytes_of_payload_data
>
> The tokenizer would give you
>
> abc|first_252_bytes
>
> But if you're using lower-level Lucene calls directly, that limit
> doesn't apply.
>
> Best,
> Erick
>
> On Thu, Jun 15, 2017 at 8:21 AM, José Tomás Atria 
> wrote:
> > Hi Markus, thanks for your response!
> >
> > Now I feel stupid; that is clearly a much simpler approach, and it has
> > the added benefit that it would not require me to meddle in the scoring
> > process, which I'm still a bit terrified of. Thanks for the tip.
> >
> > I guess the question is still valid though? i.e. how would one take
> > into account payloads for scoring entire spans? Does this make sense
> > at all? Any links to a more-or-less straightforward example?
> >
> > On the length of payloads: I understood that you have other
> > restrictions, but payloads take a BytesRef as their value, so you can
> > encode arbitrary data in them as long as you encode and decode
> > properly. E.g. you could encode the long array that backs a
> > FixedBitSet as a BytesRef and pass that, though I'm not sure it would
> > be efficient unless you have at least 64 flags.
> >
> > thanks!
> > jta
> >
> >
> >
> > On Wed, Jun 14, 2017 at 4:45 PM Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> >> Hello,
> >>
> >> We use POS-tagging too, and encode the tags as payload bitsets for
> >> scoring, which is, as far as I know, the only possibility with
> >> payloads.
> >>
> >> So, instead of encoding them as payloads, why not index your treebank
> >> POS-tags as tokens at the same position, like synonyms? If you do
> >> that, you can use span and phrase queries to find chunks of multiple
> >> POS-tags.
> >>
> >> This would be the first approach I can think of. Treating them as
> >> regular tokens enables you to use regular search for them.
> >>
> >> Regards,
> >> Markus
> >>
> >>
> >>
> >> -Original message-
> >> > From:José Tomás Atria 
> >> > Sent: Wednesday 14th June 2017 22:29
> >> > To: java-user@lucene.apache.org
> >> > Subject: Using POS payloads for chunking
> >> >
> >> > Hello!
> >> >
> >> > I'm not particularly familiar with Lucene's search API (as I've been
> >> > using the library mostly as a dumb index rather than a search
> >> > engine), but I am almost certain that, using its payload
> >> > capabilities, it would be trivial to implement a regular chunker to
> >> > look for patterns in sequences of payloads.
> >> >
> >> > (Trying not to be too pedantic: a regular chunker looks for 'chunks'
> >> > based on part-of-speech tags, e.g. noun phrases can be searched for
> >> > with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional
> >> > determiner and zero or more adjectives preceding a bunch of nouns,
> >> > etc.)
> >> >
> >> > Assuming my index has POS tags encoded as payloads for each position,
> >> > how would one search for such patterns, irrespective of terms? I
> >> > started studying the spans search API, as this seemed like the
> >> > natural place to start, but I quickly got lost.
> >> >
> >> > Any tips would be extremely appreciated. (Or references to this kind
> >> > of thing; I'm sure someone must have tried something similar
> >> > before...)
> >> >
> >> > thanks!
> >> > ~jta
> >> > --
> >> >
> >> > sent from a phone. please excuse terseness and tpyos.
> >> >
> >> > enviado desde un teléfono. por favor disculpe la parquedad y los
> erroers.
> >> >
> >>
> >>
> >> --
> >
> > sent from a phone. please excuse terseness and tpyos.
> >
> > enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
>
>
> --

sent from a phone. please excuse terseness and tpyos.

enviado desde un teléfono. por favor disculpe la parquedad y los erroers.


Re: Using POS payloads for chunking

2017-06-15 Thread Erick Erickson
José:

Do note that, while the byte array itself isn't limited, prior to
LUCENE-7705 most of the tokenizers you would use limited the incoming
token to 256 chars at most. This is not a low-level _Lucene_ limitation;
rather, if you're indexing data with a delimited payload (say
abc|your_payload_here), the tokenizer would chop the token off once the
whole thing reached 256 chars.

Hmmm, still confusing. Say the input to the analysis chain was

abc|512_bytes_of_payload_data

The tokenizer would give you

abc|first_252_bytes

But if you're using lower-level Lucene calls directly, that limit doesn't apply.
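
In code, that delimited-payload chain looks something like the sketch
below, using stock lucene-analyzers-common classes (the input format
"the|DT quick|JJ fox|NN" is illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IdentityEncoder;

public class PosPayloadAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // '|' splits "fox|NN" into the term "fox" with payload bytes "NN";
    // IdentityEncoder stores the payload bytes as-is
    return new TokenStreamComponents(source,
        new DelimitedPayloadTokenFilter(source, '|', new IdentityEncoder()));
  }
}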

Best,
Erick

On Thu, Jun 15, 2017 at 8:21 AM, José Tomás Atria  wrote:
> Hi Markus, thanks for your response!
>
> Now I feel stupid; that is clearly a much simpler approach, and it has
> the added benefit that it would not require me to meddle in the scoring
> process, which I'm still a bit terrified of. Thanks for the tip.
>
> I guess the question is still valid though? i.e. how would one take into
> account payloads for scoring entire spans? Does this make sense at all? Any
> links to a more-or-less straightforward example?
>
> On the length of payloads: I understood that you have other restrictions,
> but payloads take a BytesRef as their value, so you can encode arbitrary
> data in them as long as you encode and decode properly. E.g. you could
> encode the long array that backs a FixedBitSet as a BytesRef and pass
> that, though I'm not sure it would be efficient unless you have at least
> 64 flags.
>
> thanks!
> jta
>
>
>
> On Wed, Jun 14, 2017 at 4:45 PM Markus Jelsma 
> wrote:
>
>> Hello,
>>
>> We use POS-tagging too, and encode the tags as payload bitsets for
>> scoring, which is, as far as I know, the only possibility with payloads.
>>
>> So, instead of encoding them as payloads, why not index your treebank
>> POS-tags as tokens at the same position, like synonyms? If you do that,
>> you can use span and phrase queries to find chunks of multiple POS-tags.
>>
>> This would be the first approach I can think of. Treating them as
>> regular tokens enables you to use regular search for them.
>>
>> Regards,
>> Markus
>>
>>
>>
>> -Original message-
>> > From:José Tomás Atria 
>> > Sent: Wednesday 14th June 2017 22:29
>> > To: java-user@lucene.apache.org
>> > Subject: Using POS payloads for chunking
>> >
>> > Hello!
>> >
>> > I'm not particularly familiar with Lucene's search API (as I've been
>> > using the library mostly as a dumb index rather than a search engine),
>> > but I am almost certain that, using its payload capabilities, it would
>> > be trivial to implement a regular chunker to look for patterns in
>> > sequences of payloads.
>> >
>> > (Trying not to be too pedantic: a regular chunker looks for 'chunks'
>> > based on part-of-speech tags, e.g. noun phrases can be searched for
>> > with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional
>> > determiner and zero or more adjectives preceding a bunch of nouns,
>> > etc.)
>> >
>> > Assuming my index has POS tags encoded as payloads for each position,
>> > how would one search for such patterns, irrespective of terms? I
>> > started studying the spans search API, as this seemed like the natural
>> > place to start, but I quickly got lost.
>> >
>> > Any tips would be extremely appreciated. (Or references to this kind
>> > of thing; I'm sure someone must have tried something similar
>> > before...)
>> >
>> > thanks!
>> > ~jta
>> > --
>> >
>> > sent from a phone. please excuse terseness and tpyos.
>> >
>> > enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
>> >
>>
>>
>> --
>
> sent from a phone. please excuse terseness and tpyos.
>
> enviado desde un teléfono. por favor disculpe la parquedad y los erroers.




Re: Using POS payloads for chunking

2017-06-15 Thread José Tomás Atria
Hi Markus, thanks for your response!

Now I feel stupid; that is clearly a much simpler approach, and it has the
added benefit that it would not require me to meddle in the scoring
process, which I'm still a bit terrified of. Thanks for the tip.

I guess the question is still valid though? i.e. how would one take into
account payloads for scoring entire spans? Does this make sense at all? Any
links to a more-or-less straightforward example?

On the length of payloads: I understood that you have other restrictions,
but payloads take a BytesRef as their value, so you can encode arbitrary
data in them as long as you encode and decode properly. E.g. you could
encode the long array that backs a FixedBitSet as a BytesRef and pass
that, though I'm not sure it would be efficient unless you have at least
64 flags.
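
Something like this minimal sketch is what I have in mind, assuming the
writer and the reader agree on the bitset width:

import java.nio.ByteBuffer;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.FixedBitSet;

final class BitSetPayloads {

  static BytesRef encode(FixedBitSet bits) {
    long[] words = bits.getBits();  // the backing long array
    ByteBuffer buf = ByteBuffer.allocate(words.length * Long.BYTES);
    for (long w : words) buf.putLong(w);
    return new BytesRef(buf.array());
  }

  static FixedBitSet decode(BytesRef payload, int numBits) {
    ByteBuffer buf = ByteBuffer.wrap(payload.bytes, payload.offset, payload.length);
    long[] words = new long[payload.length / Long.BYTES];
    for (int i = 0; i < words.length; i++) words[i] = buf.getLong();
    return new FixedBitSet(words, numBits);  // wraps the array, no copy
  }
}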

thanks!
jta



On Wed, Jun 14, 2017 at 4:45 PM Markus Jelsma 
wrote:

> Hello,
>
> We use POS-tagging too, and encode the tags as payload bitsets for
> scoring, which is, as far as I know, the only possibility with payloads.
>
> So, instead of encoding them as payloads, why not index your treebank
> POS-tags as tokens at the same position, like synonyms? If you do that,
> you can use span and phrase queries to find chunks of multiple POS-tags.
>
> This would be the first approach I can think of. Treating them as
> regular tokens enables you to use regular search for them.
>
> Regards,
> Markus
>
>
>
> -Original message-
> > From:José Tomás Atria 
> > Sent: Wednesday 14th June 2017 22:29
> > To: java-user@lucene.apache.org
> > Subject: Using POS payloads for chunking
> >
> > Hello!
> >
> > I'm not particularly familiar with Lucene's search API (as I've been
> > using the library mostly as a dumb index rather than a search engine),
> > but I am almost certain that, using its payload capabilities, it would
> > be trivial to implement a regular chunker to look for patterns in
> > sequences of payloads.
> >
> > (Trying not to be too pedantic: a regular chunker looks for 'chunks'
> > based on part-of-speech tags, e.g. noun phrases can be searched for
> > with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional
> > determiner and zero or more adjectives preceding a bunch of nouns,
> > etc.)
> >
> > Assuming my index has POS tags encoded as payloads for each position,
> > how would one search for such patterns, irrespective of terms? I
> > started studying the spans search API, as this seemed like the natural
> > place to start, but I quickly got lost.
> >
> > Any tips would be extremely appreciated. (Or references to this kind
> > of thing; I'm sure someone must have tried something similar
> > before...)
> >
> > thanks!
> > ~jta
> > --
> >
> > sent from a phone. please excuse terseness and tpyos.
> >
> > enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
> >
>
>
> --

sent from a phone. please excuse terseness and tpyos.

enviado desde un teléfono. por favor disculpe la parquedad y los erroers.


RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello Tommaso,

These don't propagate to search, right? But they can be used in the
analyzer chain! This would be a better solution than using delimiters on
words. The only problem is that TypeTokenFilter only works on tokens,
after the tokenizer. The bonus of a CharFilter is that it sees the whole
text, so OpenNLP can digest it all at once. The downside is that a
CharFilter cannot set TypeAttribute, because there are no tokens yet.

If we tried that option, we would have to build a TokenFilter that
understands the whole text at once, because that is what OpenNLP needs,
not single tokens. This is difficult, so we chose the option of a
CharFilter plus a TokenFilter. This is not ideal, but I find it very hard
to digest whole text in a TokenFilter. See ShingleFilter and
CommonGramsFilter; these are very complicated filters.

How would you overcome this problem? For NLP you need all the text at
once, which a CharFilter provides. But that won't allow you to set
TypeAttribute. Perhaps I am missing something completely and am being
stupid, probably :)
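
For concreteness, the buffering TokenFilter we would need has roughly
the shape below. This is a rough sketch only: the Tagger interface is a
hypothetical stand-in for an OpenNLP POSTaggerME wrapper, and sentence
boundaries are glossed over:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;

public final class BufferedPosFilter extends TokenFilter {

  /** Hypothetical stand-in for an OpenNLP tagger wrapper. */
  public interface Tagger { String[] tag(String[] terms); }

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
  private final Tagger tagger;
  private List<AttributeSource.State> states;  // buffered token states
  private String[] tags;
  private int pos;

  public BufferedPosFilter(TokenStream in, Tagger tagger) {
    super(in);
    this.tagger = tagger;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (states == null) {
      // First call: drain the whole input so the tagger sees all terms at once.
      states = new ArrayList<>();
      List<String> terms = new ArrayList<>();
      while (input.incrementToken()) {
        terms.add(termAtt.toString());
        states.add(captureState());
      }
      tags = tagger.tag(terms.toArray(new String[0]));
      pos = 0;
    }
    if (pos >= states.size()) return false;
    restoreState(states.get(pos));  // replay the buffered token unchanged...
    typeAtt.setType(tags[pos++]);   // ...except for its POS tag as the type
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    states = null;
  }
}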

Thanks,
Markus
 
-Original message-
> From:Tommaso Teofili 
> Sent: Wednesday 14th June 2017 23:49
> To: java-user@lucene.apache.org
> Subject: Re: Using POS payloads for chunking
> 
> I think it'd be interesting to also investigate using TypeAttribute [1]
> together with TypeTokenFilter [2].
> 
> Regards,
> Tommaso
> 
> [1] :
> https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html
> [2] :
> https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html
> 
On Wed, Jun 14, 2017 at 11:33 PM Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> > Hello Erick, no worries, I recognize you two.
> >
> > I will take a look at your references tomorrow. Although I am still
> > fine with eight bits, I cannot spare more than one. If Lucene allows us
> > to pass longer bitsets in the BytesRef, it would be awesome and easy to
> > encode.
> >
> > Thanks!
> > Markus
> >
> > -Original message-
> > > From:Erick Erickson 
> > > Sent: Wednesday 14th June 2017 23:29
> > > To: java-user 
> > > Subject: Re: Using POS payloads for chunking
> > >
> > > Markus:
> > >
> > > I don't believe that payloads are limited in size at all. LUCENE-7705
> > > was done in part because there _was_ a hard-coded 256 limit for some
> > > of the tokenizers. The payload (at least in recent versions) is just
> > > some bytes, and (with LUCENE-7705) can be arbitrarily long.
> > >
> > > Of course, if you put anything other than a number in there, you have
> > > to provide your own decoders and the like to make sense of your
> > > payload...
> > >
> > > Best,
> > > Erick (Erickson, not Hatcher)
> > >
> > > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma
> > >  wrote:
> > > > Hello Erik,
> > > >
> > > > We use Solr, though most of the moving parts are actually Lucene.
> > > > We have a CharFilter that adds treebank tags to whitespace-delimited
> > > > words using a delimiter; further on, a TokenFilter receives these
> > > > tokens with the delimiter and the POS-tag. It won't work with some
> > > > Tokenizers, and if you put it before WDF it'll split, as you know.
> > > > That TokenFilter is configured with a tab-delimited mapping config
> > > > containing \t, and there the bitset is encoded as the payload.
> > > >
> > > > Our edismax extension rewrites queries to their payload-supported
> > > > equivalents; this is quite trivial, except for all those API changes
> > > > in Lucene you have to put up with. Finally, a BM25 extension has,
> > > > amongst others, a mapping of bitset to score. Nouns get a bonus,
> > > > prepositions and other useless pieces get a punishment, etc.
> > > >
> > > > Payloads are really great things to use! We also use them to
> > > > distinguish between compounds and their subwords (among others, we
> > > > supply Dutch- and German-speaking countries), and stemmed words from
> > > > non-stemmed words. Although the latter also benefit from IDF
> > > > statistics, payloads just help to control boosting more precisely,
> > > > regardless of your corpus.
> > > >
> > > > I still need to take a look at your recent payload QParsers for Solr
> > > > and see how different, probably better, they are compared to our
> > > > older implementations. Although we don't use a PayloadTermQParser
> > > > equivalent for regular search, we do use it for scoring
> > > > recommendations via delimited multi-valued fields. Payloads are
> > > > versatile!
> > > >
> > > > The downside of payloads is that they are limited to 8 bits.
> > > > Although we can easily fit our reduced treebank in there, we also
> > > > use single bits to signal compound/subword, stemmed/unstemmed and
> > > > some others.

Re: Using POS payloads for chunking

2017-06-14 Thread Tommaso Teofili
I think it'd be interesting to also investigate using TypeAttribute [1]
together with TypeTokenFilter [2].

Regards,
Tommaso

[1] :
https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/analysis/tokenattributes/TypeAttribute.html
[2] :
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/core/TypeTokenFilter.html
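
A tiny sketch of [2] in whitelist mode, assuming upstream tokens
already carry Penn-Treebank-style tags in their TypeAttribute (the tag
names and helper are illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;

final class TypeFilters {
  /** Keep only tokens whose type is one of the given POS tags. */
  static TokenStream keepTypes(TokenStream posTagged, String... tags) {
    Set<String> keep = new HashSet<>(Arrays.asList(tags));
    return new TypeTokenFilter(posTagged, keep, true);  // true = whitelist
  }
}

e.g. keepTypes(stream, "NN", "NNS", "NNP") drops everything but nouns.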

On Wed, Jun 14, 2017 at 11:33 PM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello Erick, no worries, I recognize you two.
>
> I will take a look at your references tomorrow. Although I am still fine
> with eight bits, I cannot spare more than one. If Lucene allows us to
> pass longer bitsets in the BytesRef, it would be awesome and easy to
> encode.
>
> Thanks!
> Markus
>
> -Original message-
> > From:Erick Erickson 
> > Sent: Wednesday 14th June 2017 23:29
> > To: java-user 
> > Subject: Re: Using POS payloads for chunking
> >
> > Markus:
> >
> > I don't believe that payloads are limited in size at all. LUCENE-7705
> > was done in part because there _was_ a hard-coded 256 limit for some
> > of the tokenizers. The payload (at least in recent versions) is just
> > some bytes, and (with LUCENE-7705) can be arbitrarily long.
> >
> > Of course, if you put anything other than a number in there, you have
> > to provide your own decoders and the like to make sense of your
> > payload...
> >
> > Best,
> > Erick (Erickson, not Hatcher)
> >
> > On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma
> >  wrote:
> > > Hello Erik,
> > >
> > > We use Solr, though most of the moving parts are actually Lucene. We
> > > have a CharFilter that adds treebank tags to whitespace-delimited
> > > words using a delimiter; further on, a TokenFilter receives these
> > > tokens with the delimiter and the POS-tag. It won't work with some
> > > Tokenizers, and if you put it before WDF it'll split, as you know.
> > > That TokenFilter is configured with a tab-delimited mapping config
> > > containing \t, and there the bitset is encoded as the payload.
> > >
> > > Our edismax extension rewrites queries to their payload-supported
> > > equivalents; this is quite trivial, except for all those API changes
> > > in Lucene you have to put up with. Finally, a BM25 extension has,
> > > amongst others, a mapping of bitset to score. Nouns get a bonus,
> > > prepositions and other useless pieces get a punishment, etc.
> > >
> > > Payloads are really great things to use! We also use them to
> > > distinguish between compounds and their subwords (among others, we
> > > supply Dutch- and German-speaking countries), and stemmed words from
> > > non-stemmed words. Although the latter also benefit from IDF
> > > statistics, payloads just help to control boosting more precisely,
> > > regardless of your corpus.
> > >
> > > I still need to take a look at your recent payload QParsers for Solr
> > > and see how different, probably better, they are compared to our
> > > older implementations. Although we don't use a PayloadTermQParser
> > > equivalent for regular search, we do use it for scoring
> > > recommendations via delimited multi-valued fields. Payloads are
> > > versatile!
> > >
> > > The downside of payloads is that they are limited to 8 bits. Although
> > > we can easily fit our reduced treebank in there, we also use single
> > > bits to signal compound/subword, stemmed/unstemmed and some others.
> > >
> > > Hope this helps.
> > >
> > > Regards,
> > > Markus
> > >
> > > -Original message-
> > >> From:Erik Hatcher 
> > >> Sent: Wednesday 14th June 2017 23:03
> > >> To: java-user@lucene.apache.org
> > >> Subject: Re: Using POS payloads for chunking
> > >>
> > >> Markus - how are you encoding payloads as bitsets and using them for
> > >> scoring? Curious to see how folks are leveraging them.
> > >>
> > >>   Erik
> > >>
> > >> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > We use POS-tagging too, and encode the tags as payload bitsets for
> > >> > scoring, which is, as far as I know, the only possibility with
> > >> > payloads.
> > >> >
> > >> > So, instead of encoding them as payloads, why not index your
> > >> > treebank POS-tags as tokens at the same position, like synonyms? If
> > >> > you do that, you can use span and phrase queries to find chunks of
> > >> > multiple POS-tags.
> > >> >
> > >> > This would be the first approach I can think of. Treating them as
> > >> > regular tokens enables you to use regular search for them.

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello Erick, no worries, I recognize you two.

I will take a look at your references tomorrow. Although I am still fine
with eight bits, I cannot spare more than one. If Lucene allows us to pass
longer bitsets in the BytesRef, it would be awesome and easy to encode.

Thanks!
Markus
 
-Original message-
> From:Erick Erickson 
> Sent: Wednesday 14th June 2017 23:29
> To: java-user 
> Subject: Re: Using POS payloads for chunking
> 
> Markus:
> 
> I don't believe that payloads are limited in size at all. LUCENE-7705
> was done in part because there _was_ a hard-coded 256 limit for some of
> the tokenizers. The payload (at least in recent versions) is just some
> bytes, and (with LUCENE-7705) can be arbitrarily long.
> 
> Of course, if you put anything other than a number in there, you have to
> provide your own decoders and the like to make sense of your
> payload...
> 
> Best,
> Erick (Erickson, not Hatcher)
> 
> On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma
>  wrote:
> > Hello Erik,
> >
> > We use Solr, though most of the moving parts are actually Lucene. We
> > have a CharFilter that adds treebank tags to whitespace-delimited words
> > using a delimiter; further on, a TokenFilter receives these tokens with
> > the delimiter and the POS-tag. It won't work with some Tokenizers, and
> > if you put it before WDF it'll split, as you know. That TokenFilter is
> > configured with a tab-delimited mapping config containing \t, and there
> > the bitset is encoded as the payload.
> >
> > Our edismax extension rewrites queries to their payload-supported
> > equivalents; this is quite trivial, except for all those API changes in
> > Lucene you have to put up with. Finally, a BM25 extension has, amongst
> > others, a mapping of bitset to score. Nouns get a bonus, prepositions
> > and other useless pieces get a punishment, etc.
> >
> > Payloads are really great things to use! We also use them to
> > distinguish between compounds and their subwords (among others, we
> > supply Dutch- and German-speaking countries), and stemmed words from
> > non-stemmed words. Although the latter also benefit from IDF
> > statistics, payloads just help to control boosting more precisely,
> > regardless of your corpus.
> >
> > I still need to take a look at your recent payload QParsers for Solr
> > and see how different, probably better, they are compared to our older
> > implementations. Although we don't use a PayloadTermQParser equivalent
> > for regular search, we do use it for scoring recommendations via
> > delimited multi-valued fields. Payloads are versatile!
> >
> > The downside of payloads is that they are limited to 8 bits. Although
> > we can easily fit our reduced treebank in there, we also use single
> > bits to signal compound/subword, stemmed/unstemmed and some others.
> >
> > Hope this helps.
> >
> > Regards,
> > Markus
> >
> > -Original message-
> >> From:Erik Hatcher 
> >> Sent: Wednesday 14th June 2017 23:03
> >> To: java-user@lucene.apache.org
> >> Subject: Re: Using POS payloads for chunking
> >>
> >> Markus - how are you encoding payloads as bitsets and using them for
> >> scoring? Curious to see how folks are leveraging them.
> >>
> >>   Erik
> >>
> >> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma  
> >> > wrote:
> >> >
> >> > Hello,
> >> >
> >> > We use POS-tagging too, and encode the tags as payload bitsets for
> >> > scoring, which is, as far as I know, the only possibility with payloads.
> >> >
> >> > So, instead of encoding them as payloads, why not index your treebank
> >> > POS-tags as tokens at the same position, like synonyms? If you do that,
> >> > you can use span and phrase queries to find chunks of multiple POS-tags.
> >> >
> >> > This would be the first approach I can think of. Treating them as
> >> > regular tokens enables you to use regular search for them.
> >> >
> >> > Regards,
> >> > Markus
> >> >
> >> >
> >> >
> >> > -Original message-
> >> >> From:José Tomás Atria 
> >> >> Sent: Wednesday 14th June 2017 22:29
> >> >> To: java-user@lucene.apache.org
> >> >> Subject: Using POS payloads for chunking
> >> >>
> >> >> Hello!
> >> >>
> >> >> I'm not particularly familiar with Lucene's search API (as I've
> >> >> been using the library mostly as a dumb index rather than a search
> >> >> engine), but I am almost certain that, using its payload
> >> >> capabilities, it would be trivial to implement a regular chunker to
> >> >> look for patterns in sequences of payloads.

Re: Using POS payloads for chunking

2017-06-14 Thread Erick Erickson
Markus:

I don't believe that payloads are limited in size at all. LUCENE-7705
was done in part because there _was_ a hard-coded 256 limit for some of
the tokenizers. The payload (at least in recent versions) is just some
bytes, and (with LUCENE-7705) can be arbitrarily long.

Of course, if you put anything other than a number in there, you have to
provide your own decoders and the like to make sense of your
payload...
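
The decoding side is plain postings traversal, roughly like the sketch
below (the field and term are illustrative, and error handling is
elided):

import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

final class PayloadDumper {
  static void dumpPayloads(LeafReader reader, String field, BytesRef term)
      throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) return;
    TermsEnum te = terms.iterator();
    if (!te.seekExact(term)) return;
    PostingsEnum pe = te.postings(null, PostingsEnum.PAYLOADS);
    while (pe.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      for (int i = 0; i < pe.freq(); i++) {
        pe.nextPosition();
        BytesRef payload = pe.getPayload();  // null if none at this position
        // decode payload.bytes[payload.offset .. offset+length) with
        // whatever scheme you used at index time
      }
    }
  }
}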

Best,
Erick (Erickson, not Hatcher)

On Wed, Jun 14, 2017 at 2:22 PM, Markus Jelsma
 wrote:
> Hello Erik,
>
> We use Solr, though most of the moving parts are actually Lucene. We have
> a CharFilter that adds treebank tags to whitespace-delimited words using
> a delimiter; further on, a TokenFilter receives these tokens with the
> delimiter and the POS-tag. It won't work with some Tokenizers, and if you
> put it before WDF it'll split, as you know. That TokenFilter is configured
> with a tab-delimited mapping config containing \t, and there the bitset is
> encoded as the payload.
>
> Our edismax extension rewrites queries to their payload-supported
> equivalents; this is quite trivial, except for all those API changes in
> Lucene you have to put up with. Finally, a BM25 extension has, amongst
> others, a mapping of bitset to score. Nouns get a bonus, prepositions and
> other useless pieces get a punishment, etc.
>
> Payloads are really great things to use! We also use them to distinguish
> between compounds and their subwords (among others, we supply Dutch- and
> German-speaking countries), and stemmed words from non-stemmed words.
> Although the latter also benefit from IDF statistics, payloads just help
> to control boosting more precisely, regardless of your corpus.
>
> I still need to take a look at your recent payload QParsers for Solr and
> see how different, probably better, they are compared to our older
> implementations. Although we don't use a PayloadTermQParser equivalent
> for regular search, we do use it for scoring recommendations via
> delimited multi-valued fields. Payloads are versatile!
>
> The downside of payloads is that they are limited to 8 bits. Although we
> can easily fit our reduced treebank in there, we also use single bits to
> signal compound/subword, stemmed/unstemmed and some others.
>
> Hope this helps.
>
> Regards,
> Markus
>
> -Original message-
>> From:Erik Hatcher 
>> Sent: Wednesday 14th June 2017 23:03
>> To: java-user@lucene.apache.org
>> Subject: Re: Using POS payloads for chunking
>>
>> Markus - how are you encoding payloads as bitsets and using them for
>> scoring? Curious to see how folks are leveraging them.
>>
>>   Erik
>>
>> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma  
>> > wrote:
>> >
>> > Hello,
>> >
>> > We use POS-tagging too, and encode the tags as payload bitsets for
>> > scoring, which is, as far as I know, the only possibility with payloads.
>> >
>> > So, instead of encoding them as payloads, why not index your treebank
>> > POS-tags as tokens at the same position, like synonyms? If you do that,
>> > you can use span and phrase queries to find chunks of multiple POS-tags.
>> >
>> > This would be the first approach I can think of. Treating them as
>> > regular tokens enables you to use regular search for them.
>> >
>> > Regards,
>> > Markus
>> >
>> >
>> >
>> > -Original message-
>> >> From:José Tomás Atria 
>> >> Sent: Wednesday 14th June 2017 22:29
>> >> To: java-user@lucene.apache.org
>> >> Subject: Using POS payloads for chunking
>> >>
>> >> Hello!
>> >>
>> >> I'm not particularly familiar with Lucene's search API (as I've been
>> >> using the library mostly as a dumb index rather than a search engine),
>> >> but I am almost certain that, using its payload capabilities, it would
>> >> be trivial to implement a regular chunker to look for patterns in
>> >> sequences of payloads.
>> >>
>> >> (Trying not to be too pedantic: a regular chunker looks for 'chunks'
>> >> based on part-of-speech tags, e.g. noun phrases can be searched for
>> >> with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional
>> >> determiner and zero or more adjectives preceding a bunch of nouns,
>> >> etc.)
>> >>
>> >> Assuming my index has POS tags encoded as payloads for each position,
>> >> how would one search for such patterns, irrespective of terms? I
>> >> started studying the spans search API, as this seemed like the natural
>> >> place to start, but I quickly got lost.

RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello Erik,

We use Solr, though most of the moving parts are actually Lucene. We have
a CharFilter that adds treebank tags to whitespace-delimited words using a
delimiter; further on, a TokenFilter receives these tokens with the
delimiter and the POS-tag. It won't work with some Tokenizers, and if you
put it before WDF it'll split, as you know. That TokenFilter is configured
with a tab-delimited mapping config containing \t, and there the bitset is
encoded as the payload.

Our edismax extension rewrites queries to their payload-supported
equivalents; this is quite trivial, except for all those API changes in
Lucene you have to put up with. Finally, a BM25 extension has, amongst
others, a mapping of bitset to score. Nouns get a bonus, prepositions and
other useless pieces get a punishment, etc.

Payloads are really great things to use! We also use them to distinguish
between compounds and their subwords (among others, we supply Dutch- and
German-speaking countries), and stemmed words from non-stemmed words.
Although the latter also benefit from IDF statistics, payloads just help
to control boosting more precisely, regardless of your corpus.

I still need to take a look at your recent payload QParsers for Solr and
see how different, probably better, they are compared to our older
implementations. Although we don't use a PayloadTermQParser equivalent for
regular search, we do use it for scoring recommendations via delimited
multi-valued fields. Payloads are versatile!

The downside of payloads is that they are limited to 8 bits. Although we
can easily fit our reduced treebank in there, we also use single bits to
signal compound/subword, stemmed/unstemmed and some others.
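
The general shape of such a one-byte encoding is something like the
sketch below; the bit layout is invented for illustration, not our
production layout:

final class PosByte {
  // low 5 bits: reduced POS tag id (up to 32 tags)
  // bit 5: subword-of-compound, bit 6: stemmed
  static final int SUBWORD = 1 << 5;
  static final int STEMMED = 1 << 6;

  static byte encode(int tagId, boolean subword, boolean stemmed) {
    int b = tagId & 0x1F;
    if (subword) b |= SUBWORD;
    if (stemmed) b |= STEMMED;
    return (byte) b;
  }

  static int tagId(byte b)       { return b & 0x1F; }
  static boolean subword(byte b) { return (b & SUBWORD) != 0; }
  static boolean stemmed(byte b) { return (b & STEMMED) != 0; }
}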

Hope this helps.

Regards,
Markus

-Original message-
> From:Erik Hatcher 
> Sent: Wednesday 14th June 2017 23:03
> To: java-user@lucene.apache.org
> Subject: Re: Using POS payloads for chunking
> 
> Markus - how are you encoding payloads as bitsets and using them for
> scoring? Curious to see how folks are leveraging them.
> 
>   Erik
> 
> > On Jun 14, 2017, at 4:45 PM, Markus Jelsma  
> > wrote:
> > 
> > Hello,
> > 
> > We use POS-tagging too, and encode the tags as payload bitsets for
> > scoring, which is, as far as I know, the only possibility with payloads.
> > 
> > So, instead of encoding them as payloads, why not index your treebank
> > POS-tags as tokens at the same position, like synonyms? If you do that,
> > you can use span and phrase queries to find chunks of multiple POS-tags.
> > 
> > This would be the first approach I can think of. Treating them as
> > regular tokens enables you to use regular search for them.
> > 
> > Regards,
> > Markus
> > 
> > 
> > 
> > -Original message-
> >> From:José Tomás Atria 
> >> Sent: Wednesday 14th June 2017 22:29
> >> To: java-user@lucene.apache.org
> >> Subject: Using POS payloads for chunking
> >> 
> >> Hello!
> >> 
> >> I'm not particularly familiar with Lucene's search API (as I've been
> >> using the library mostly as a dumb index rather than a search engine),
> >> but I am almost certain that, using its payload capabilities, it would
> >> be trivial to implement a regular chunker to look for patterns in
> >> sequences of payloads.
> >> 
> >> (Trying not to be too pedantic: a regular chunker looks for 'chunks'
> >> based on part-of-speech tags, e.g. noun phrases can be searched for
> >> with patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional
> >> determiner and zero or more adjectives preceding a bunch of nouns,
> >> etc.)
> >> 
> >> Assuming my index has POS tags encoded as payloads for each position,
> >> how would one search for such patterns, irrespective of terms? I
> >> started studying the spans search API, as this seemed like the natural
> >> place to start, but I quickly got lost.
> >> 
> >> Any tips would be extremely appreciated. (Or references to this kind
> >> of thing; I'm sure someone must have tried something similar
> >> before...)
> >> 
> >> thanks!
> >> ~jta
> >> -- 
> >> 
> >> sent from a phone. please excuse terseness and tpyos.
> >> 
> >> enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
> >> 
> > 
> > 
> 
> 
> 
> 




Re: Using POS payloads for chunking

2017-06-14 Thread Erik Hatcher
Markus - how are you encoding payloads as bitsets and using them for
scoring? Curious to see how folks are leveraging them.

Erik

> On Jun 14, 2017, at 4:45 PM, Markus Jelsma  wrote:
> 
> Hello,
> 
> We use POS-tagging too, and encode the tags as payload bitsets for
> scoring, which is, as far as I know, the only possibility with payloads.
> 
> So, instead of encoding them as payloads, why not index your treebank
> POS-tags as tokens at the same position, like synonyms? If you do that,
> you can use span and phrase queries to find chunks of multiple POS-tags.
> 
> This would be the first approach I can think of. Treating them as
> regular tokens enables you to use regular search for them.
> 
> Regards,
> Markus
> 
> 
> 
> -Original message-
>> From:José Tomás Atria 
>> Sent: Wednesday 14th June 2017 22:29
>> To: java-user@lucene.apache.org
>> Subject: Using POS payloads for chunking
>> 
>> Hello!
>> 
>> I'm not particularly familiar with Lucene's search API (as I've been
>> using the library mostly as a dumb index rather than a search engine),
>> but I am almost certain that, using its payload capabilities, it would
>> be trivial to implement a regular chunker to look for patterns in
>> sequences of payloads.
>> 
>> (Trying not to be too pedantic: a regular chunker looks for 'chunks'
>> based on part-of-speech tags, e.g. noun phrases can be searched for with
>> patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and
>> zero or more adjectives preceding a bunch of nouns, etc.)
>> 
>> Assuming my index has POS tags encoded as payloads for each position,
>> how would one search for such patterns, irrespective of terms? I started
>> studying the spans search API, as this seemed like the natural place to
>> start, but I quickly got lost.
>> 
>> Any tips would be extremely appreciated. (Or references to this kind of
>> thing; I'm sure someone must have tried something similar before...)
>> 
>> thanks!
>> ~jta
>> -- 
>> 
>> sent from a phone. please excuse terseness and tpyos.
>> 
>> enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
>> 
> 
> 





RE: Using POS payloads for chunking

2017-06-14 Thread Markus Jelsma
Hello,

We use POS-tagging too, and encode the tags as payload bitsets for
scoring, which is, as far as I know, the only possibility with payloads.

So, instead of encoding them as payloads, why not index your treebank
POS-tags as tokens at the same position, like synonyms? If you do that,
you can use span and phrase queries to find chunks of multiple POS-tags.

This would be the first approach I can think of. Treating them as regular
tokens enables you to use regular search for them.
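
A sketch of the token-stacking idea is below. It assumes the incoming
tokens carry their tag in TypeAttribute; in practice you would prefix
the tags (e.g. pos_NN) so they cannot collide with real words:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.AttributeSource;

public final class PosAsTokenFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  private String pendingTag;                // tag still to be emitted
  private AttributeSource.State wordState;  // the word token's attributes

  public PosAsTokenFilter(TokenStream in) { super(in); }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingTag != null) {
      restoreState(wordState);                // same offsets as the word
      termAtt.setEmpty().append(pendingTag);  // term text becomes the tag
      posIncrAtt.setPositionIncrement(0);     // stacked at the same position
      pendingTag = null;
      return true;
    }
    if (!input.incrementToken()) return false;
    pendingTag = typeAtt.type();              // e.g. "NN", set upstream
    wordState = captureState();
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingTag = null;
  }
}

With that in place, an adjective-noun chunk is just an ordered
SpanNearQuery or a PhraseQuery over the tag terms.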

Regards,
Markus

 
 
-Original message-
> From:José Tomás Atria 
> Sent: Wednesday 14th June 2017 22:29
> To: java-user@lucene.apache.org
> Subject: Using POS payloads for chunking
> 
> Hello!
> 
> I'm not particularly familiar with Lucene's search API (as I've been
> using the library mostly as a dumb index rather than a search engine),
> but I am almost certain that, using its payload capabilities, it would be
> trivial to implement a regular chunker to look for patterns in sequences
> of payloads.
> 
> (Trying not to be too pedantic: a regular chunker looks for 'chunks'
> based on part-of-speech tags, e.g. noun phrases can be searched for with
> patterns like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and
> zero or more adjectives preceding a bunch of nouns, etc.)
> 
> Assuming my index has POS tags encoded as payloads for each position, how
> would one search for such patterns, irrespective of terms? I started
> studying the spans search API, as this seemed like the natural place to
> start, but I quickly got lost.
> 
> Any tips would be extremely appreciated. (Or references to this kind of
> thing; I'm sure someone must have tried something similar before...)
> 
> thanks!
> ~jta
> -- 
> 
> sent from a phone. please excuse terseness and tpyos.
> 
> enviado desde un teléfono. por favor disculpe la parquedad y los erroers.
> 




Using POS payloads for chunking

2017-06-14 Thread José Tomás Atria
Hello!

I'm not particularly familiar with Lucene's search API (as I've been using
the library mostly as a dumb index rather than a search engine), but I am
almost certain that, using its payload capabilities, it would be trivial to
implement a regular chunker to look for patterns in sequences of payloads.

(Trying not to be too pedantic: a regular chunker looks for 'chunks' based
on part-of-speech tags, e.g. noun phrases can be searched for with patterns
like "(DT)?(JJ)*(NN|NP)+", that is, an optional determiner and zero or more
adjectives preceding a bunch of nouns, etc.)

Assuming my index has POS tags encoded as payloads for each position, how
would one search for such patterns, irrespective of terms? I started
studying the spans search API, as this seemed like the natural place to
start, but I quickly got lost.

Any tips would be extremely appreciated. (Or references to this kind of
thing; I'm sure someone must have tried something similar before...)
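
To make the question concrete: if the tags were searchable as terms at
the word positions (rather than as payloads), I imagine a fixed pattern
like "JJ NN" would reduce to an ordered SpanNearQuery, as in the sketch
below. The field name is illustrative, and the optional/repeated parts
of the pattern ((DT)?, (JJ)*) would still need expansion into multiple
queries:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

final class ChunkQueries {
  /** Matches a JJ token immediately followed by an NN token. */
  static SpanQuery adjNounChunk(String field) {
    return new SpanNearQuery(
        new SpanQuery[] {
          new SpanTermQuery(new Term(field, "JJ")),
          new SpanTermQuery(new Term(field, "NN"))
        },
        0,      // slop 0: adjacent positions only
        true);  // in order
  }
}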

thanks!
~jta
-- 

sent from a phone. please excuse terseness and tpyos.

enviado desde un teléfono. por favor disculpe la parquedad y los erroers.