Right, I had assumed (obviously this is my problem) that I'd be able to
specify payloads for a field regardless of the field type.  Looking at
TrieField, that is certainly non-trivial.  After a bit of digging, it appears
that if I wanted to do something here I'd need to build a new TrieField,
override createField, and provide a Field whose token stream behaves like
NumericTokenStream but also carries the payloads.  Like you said, that sounds
"interesting" to say the least...
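To make the "non-trivial" part concrete: a trie-encoded numeric field indexes one term per precision step, so the payload would have to be attached to every generated term, not just one.  A rough stdlib-only sketch of that shape (this only illustrates the idea; the term format and method names here are made up, not real Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class TriePayloadSketch {
    // Roughly mimics trie encoding: one term per precision step, each a
    // coarser view of the value (low bits shifted away), tagged with its shift.
    static List<String> trieTerms(int value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add("shift=" + shift + ":" + (value >> shift));
        }
        return terms;
    }

    // Every emitted term would need the same payload attached -- this is the
    // extra work a custom createField/token stream would have to do.
    static List<String[]> withPayload(int value, int precisionStep, String payload) {
        List<String[]> out = new ArrayList<>();
        for (String term : trieTerms(value, precisionStep)) {
            out.add(new String[] { term, payload });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] tp : withPayload(42, 8, "Foo")) {
            System.out.println(tp[0] + " -> payload " + tp[1]);
        }
    }
}
```

With precisionStep=8 an int produces four terms, and the payload has to ride along on all of them.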

Were payloads not really intended to be used for these types of fields from
a Lucene perspective?


On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Well, you're going down a path that hasn't been trodden before ;).
>
> If you can treat your primitive types as text types you might get
> some traction, but that makes a lot of operations like numeric
> comparison difficult.
>
> Hmmm, another idea from left field: for single-valued types,
> what about a sidecar field that holds the auth token? And even
> for a multiValued field, two parallel fields are guaranteed to
> maintain order, so perhaps you could do something there. Yes,
> I'm waving my hands a LOT here.....
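The sidecar idea relies on Solr keeping parallel multiValued fields in insertion order, so position i in the auth field annotates position i in the value field.  A stdlib-only sketch of the lookup side (the field names and the `_auth` suffix are made up for illustration):

```java
import java.util.List;
import java.util.Map;

public class SidecarLookup {
    // A document with a multiValued field and a parallel "sidecar" auth field;
    // index i in the sidecar annotates index i in the main field.
    static String authFor(Map<String, List<String>> doc, String field, int i) {
        return doc.get(field + "_auth").get(i);
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = Map.of(
            "tag", List.of("alpha", "bravo"),
            "tag_auth", List.of("U", "S"));
        System.out.println(authFor(doc, "tag", 1)); // prints S
    }
}
```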
>
> I suspect that trying to have a custom type that incorporates
> payloads for, say, trie fields will be "interesting" to say the least.
> Numeric types are packed to save storage etc., so it'll be
> an adventure...
>
> Best,
> Erick
>
> On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> > We were originally using this approach, i.e. running things through
> > KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.  Again,
> > this works fine for text, though I had wanted to use the StandardTokenizer
> > in the chain.  Is there an equivalent filter that does what the
> > StandardTokenizer does?
> >
> > All of this said, it doesn't address the issue of the primitive field
> > types, which at this point is the bigger issue.  Given this use case,
> > should there be another way to provide payloads?
> >
> > My current thinking is that I will need to provide custom implementations
> > for all of the field types I would like to support payloads on, which will
> > essentially be copies of the standard versions with some extra "sugar" to
> > read/write the payloads.  (I don't see a way to wrap/delegate these at
> > this point because AttributeSource declares the attribute-retrieval
> > methods as final, so I can't simply wrap another tokenizer and return my
> > added attributes plus the wrapped attributes.)  I know my use case is a
> > bit strange, but I had not expected to need to do this given that
> > Lucene/Solr supports payloads on these field types; they just aren't
> > exposed.
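The behavior being asked for throughout this thread (one payload applied to every token a field produces) is simple to state even though it's awkward to wire up.  A stdlib-only simulation of what such a filter's per-token loop would amount to (the Token pair stands in for Lucene's term + payload attributes; none of these names are real Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class ConstantPayloadSim {
    // Pair of token text and payload bytes, standing in for one token's
    // CharTermAttribute + PayloadAttribute.
    record Token(String term, byte[] payload) {}

    // What a "constant payload" filter would do: pass every token through
    // unchanged while stamping the same payload bytes onto each one.
    static List<Token> stampAll(List<String> terms, byte[] payload) {
        List<Token> out = new ArrayList<>();
        for (String t : terms) {
            out.add(new Token(t, payload));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> toks = stampAll(List.of("this", "is", "a", "test"),
                                    "Foo".getBytes());
        for (Token t : toks) {
            System.out.println(t.term() + " -> " + new String(t.payload()));
        }
    }
}
```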
> >
> > As always I appreciate any ideas if I'm barking up the wrong tree here.
> >
> > On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> >> Well, if I remember correctly (I have no testing facility at hand),
> >> WordDelimiterFilter maintains payloads on emitted sub-terms.  So if you
> >> use a KeywordTokenizer, input 'some text^PAYLOAD', and have a
> >> DelimitedPayloadFilter, the entire string gets a payload.  You can then
> >> split that string up again into individual tokens.  It is possible to
> >> abuse WordDelimiterFilter for this because it has a types parameter that
> >> you can use to split on whitespace, if its input is not trimmed.
> >> Otherwise you can use any other character instead of a space in your
> >> input.
> >>
> >> This is a crazy idea, but it might work.
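The proposed chain can be dry-run outside Solr: give the whole keyword token a payload, then split it and let every sub-term inherit that payload.  A stdlib-only sketch (this simulates the effect of the analysis chain; it is not the actual filters):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAfterPayloadSim {
    record Token(String term, String payload) {}

    // Step 1 (DelimitedPayloadFilter on the single keyword token): split the
    // payload off the end of the whole field value at the delimiter.
    // Step 2 (WordDelimiterFilter-style split): break the remaining text into
    // sub-terms, each inheriting the whole-token payload.
    static List<Token> analyze(String input, char delimiter) {
        int at = input.lastIndexOf(delimiter);
        String text = input.substring(0, at);
        String payload = input.substring(at + 1);
        List<Token> out = new ArrayList<>();
        for (String term : text.split("\\s+")) {
            out.add(new Token(term, payload));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : analyze("some text^PAYLOAD", '^')) {
            System.out.println(t.term() + " -> " + t.payload());
        }
    }
}
```

The open question from earlier in the thread remains: this whitespace split is much cruder than what StandardTokenizer does.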
> >>
> >> -----Original message-----
> >> > From:Jamie Johnson <jej2...@gmail.com>
> >> > Sent: Tuesday 25th August 2015 19:37
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
> >> >
> >> > To be clear, we are using payloads as a way to attach authorizations
> >> > to individual tokens within Solr.  The payloads are normal Solr
> >> > payloads, though we are not using floats; we are using the identity
> >> > payload encoder
> >> > (org.apache.lucene.analysis.payloads.IdentityEncoder), which allows
> >> > for storing a byte[] of our choosing in the payload field.
> >> >
> >> > This works great for text, but now that I'm indexing more than just
> >> > text, I need a way to specify the payload on the other field types.
> >> > Does that make more sense?
> >> >
> >> > On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson <erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> > > This really sounds like an XY problem, or when you say
> >> > > "payload" you don't mean the Solr payload.
> >> > >
> >> > > Solr payloads are a float value that you can attach to
> >> > > individual terms to influence the scoring.  Attaching the
> >> > > _same_ payload to all terms in a field is much the same
> >> > > thing as boosting on any matches in the field at query time,
> >> > > or boosting on the field at index time (the latter assuming
> >> > > that different docs would have different boosts).
> >> > >
> >> > > So can you back up a bit and tell us what you're trying to
> >> > > accomplish?  Maybe we can be sure we're both talking about
> >> > > the same thing ;)
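The float-payload convention described here is just four bytes stored on each term.  A stdlib round-trip of that encoding (the same idea as Lucene's PayloadHelper, shown for illustration only):

```java
import java.nio.ByteBuffer;

public class FloatPayload {
    // Encode a scoring float into the 4 payload bytes stored on a term.
    static byte[] encode(float f) {
        return ByteBuffer.allocate(4).putFloat(f).array();
    }

    // Decode the 4 payload bytes back into the scoring float.
    static float decode(byte[] b) {
        return ByteBuffer.wrap(b).getFloat();
    }

    public static void main(String[] args) {
        byte[] p = encode(2.5f);
        System.out.println(decode(p)); // prints 2.5
    }
}
```

The thread's use case swaps this float for arbitrary bytes via IdentityEncoder, which is why the scoring-centric framing doesn't quite fit.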
> >> > >
> >> > > Best,
> >> > > Erick
> >> > >
> >> > > On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson <jej2...@gmail.com>
> >> > > wrote:
> >> > > > I would like to specify a particular payload for all tokens
> >> > > > emitted from a tokenizer, but don't see a clear way to do this.
> >> > > > Ideally I could specify that something like the
> >> > > > DelimitedPayloadTokenFilter be run on the entire field and then
> >> > > > standard analysis be done on the rest of the field, so in the case
> >> > > > that I had the following text
> >> > > >
> >> > > > this is a test\Foo
> >> > > >
> >> > > > I would like to create tokens "this", "is", "a", and "test", each
> >> > > > with a payload of Foo.  From what I'm seeing, though, only "test"
> >> > > > gets the payload.  Is there any way to accomplish this or will I
> >> > > > need to implement a custom tokenizer?
> >> > >
> >> >
> >>
>