Thanks again Erick, I created https://issues.apache.org/jira/browse/SOLR-7975, though I didn't attach a patch because my current implementation is not generally useful right now; it meets my use case but likely would not meet others'. I will look into generalizing this to allow something custom to be plugged in.

On Aug 26, 2015 2:46 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> Sure, I think it's fine to raise a JIRA, especially if you can include
> a patch, even a preliminary one to solicit feedback... which I'll
> leave to people who are more familiar with that code...
>
> I'm not sure how generally useful this would be, and if it comes
> at a cost to normal searching there's sure to be lively discussion.
>
> Best,
> Erick
>
> On Tue, Aug 25, 2015 at 7:50 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> > Looks like I have something basic working for Trie fields. I am doing
> > exactly what I said in my previous email, so good news there. I think
> > this is a big step, as there are only a few field types left that I need
> > to support: date (which should be similar to Trie) and spatial fields,
> > which at a glance looked like they provide a way to supply the token
> > stream through an extension. I definitely need to look more, though.
> >
> > All of that said, is this really the right way to get payloads into
> > these types of fields? Should a JIRA feature request be added for this?
> > On Aug 25, 2015 8:13 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
> >
> >> Right, I had assumed (obviously here is my problem) that I'd be able to
> >> specify payloads for the field regardless of the field type. Looking at
> >> TrieField, that is certainly non-trivial. After a bit of digging, it
> >> appears that if I wanted to do something here I'd need to build a new
> >> TrieField, override createField, and provide a Field that would return
> >> something like NumericTokenStream but also provide the payloads. Like
> >> you said, it sounds "interesting" to say the least...
> >>
> >> Were payloads not really intended to be used with these types of fields
> >> from a Lucene perspective?
> >>
> >> On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>> Well, you're going down a path that hasn't been trodden before ;).
> >>>
> >>> If you can treat your primitive types as text types you might get
> >>> some traction, but that makes a lot of operations like numeric
> >>> comparison difficult.
> >>>
> >>> Hmmm, another idea from left field: for single-valued types, what
> >>> about a sidecar field that holds the auth token? Even for a
> >>> multiValued field, two parallel fields are guaranteed to maintain
> >>> order, so perhaps you could do something there. Yes, I'm waving my
> >>> hands a LOT here...
> >>>
> >>> I suspect that trying to have a custom type that incorporates
> >>> payloads for, say, trie fields will be "interesting" to say the
> >>> least. Numeric types are packed to save storage etc., so it'll be
> >>> an adventure...
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>> > We were originally using this approach, i.e. running things through
> >>> > KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.
> >>> > Again, this works fine for text, though I had wanted to use the
> >>> > StandardTokenizer in the chain. Is there an equivalent filter that
> >>> > does what the StandardTokenizer does?
> >>> >
> >>> > All of this said, it doesn't address the issue of the primitive
> >>> > field types, which at this point is the bigger issue. Given this use
> >>> > case, should there be another way to provide payloads?
> >>> >
> >>> > My current thinking is that I will need to provide custom
> >>> > implementations for all of the field types I would like to support
> >>> > payloads on, which will essentially be copies of the standard
> >>> > versions with some extra "sugar" to read/write the payloads (I don't
> >>> > see a way to wrap/delegate these at this point because
> >>> > AttributeSource declares the attribute-retrieval methods as final,
> >>> > so I can't simply wrap another tokenizer and return my added
> >>> > attributes plus the wrapped attributes). I know my use case is a bit
> >>> > strange, but I had not expected to need to do this given that
> >>> > Lucene/Solr supports payloads on these field types; they just aren't
> >>> > exposed.
> >>> >
> >>> > As always, I appreciate any ideas if I'm barking up the wrong tree
> >>> > here.
> >>> >
> >>> > On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>> >
> >>> >> Well, if I remember correctly (I have no testing facility at hand),
> >>> >> WordDelimiterFilter maintains payloads on emitted sub-terms. So if
> >>> >> you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a
> >>> >> DelimitedPayloadFilter, the entire string gets a payload. You can
> >>> >> then split that string up again into individual tokens. It is
> >>> >> possible to abuse WordDelimiterFilter for this because it has a
> >>> >> types parameter that you can use to split on whitespace if its
> >>> >> input is not trimmed. Otherwise you can use any other character
> >>> >> instead of a space in your input.
> >>> >>
> >>> >> This is a crazy idea, but it might work.
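In schema.xml terms, the chain Markus describes might be sketched as follows. This is an untested illustration, not a configuration from the thread: the field type name is made up, '^' follows Markus's example input, and whether WordDelimiterFilter actually preserves payloads on the sub-terms is exactly the point he says needs verifying.

```xml
<!-- Illustrative field type (name and delimiter are examples only).
     KeywordTokenizer emits the whole field value as a single token;
     DelimitedPayloadTokenFilter strips everything after '^' and attaches
     it to that token as an identity-encoded (raw byte[]) payload;
     WordDelimiterFilter then splits the token on non-alphanumeric
     characters, including the embedded spaces, with the sub-terms
     expected to carry the original token's payload. -->
<fieldType name="payloaded_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="^" encoder="identity"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
  </analyzer>
</fieldType>
```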
> >>> >>
> >>> >> -----Original message-----
> >>> >> > From: Jamie Johnson <jej2...@gmail.com>
> >>> >> > Sent: Tuesday 25th August 2015 19:37
> >>> >> > To: solr-user@lucene.apache.org
> >>> >> > Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
> >>> >> >
> >>> >> > To be clear, we are using payloads as a way to attach
> >>> >> > authorizations to individual tokens within Solr. The payloads are
> >>> >> > normal Solr payloads, though we are not using floats; we are
> >>> >> > using the identity payload encoder
> >>> >> > (org.apache.lucene.analysis.payloads.IdentityEncoder), which
> >>> >> > allows storing a byte[] of our choosing in the payload field.
> >>> >> >
> >>> >> > This works great for text, but now that I'm indexing more than
> >>> >> > just text I need a way to specify the payload on the other field
> >>> >> > types. Does that make more sense?
> >>> >> >
> >>> >> > On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >>> >> >
> >>> >> > > This really sounds like an XY problem. Or, when you use
> >>> >> > > "payload", it's not the Solr payload.
> >>> >> > >
> >>> >> > > Solr payloads are a float value that you can attach to
> >>> >> > > individual terms to influence the scoring. Attaching the
> >>> >> > > _same_ payload to all terms in a field is much the same thing
> >>> >> > > as boosting on any matches in the field at query time, or
> >>> >> > > boosting on the field at index time (the latter assuming that
> >>> >> > > different docs would have different boosts).
> >>> >> > >
> >>> >> > > So can you back up a bit and tell us what you're trying to
> >>> >> > > accomplish? Maybe then we can be sure we're both talking about
> >>> >> > > the same thing ;)
> >>> >> > >
> >>> >> > > Best,
> >>> >> > > Erick
> >>> >> > >
> >>> >> > > On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>> >> > > > I would like to specify a particular payload for all tokens
> >>> >> > > > emitted from a tokenizer, but I don't see a clear way to do
> >>> >> > > > this. Ideally I could specify that something like the
> >>> >> > > > DelimitedPayloadTokenFilter be run on the entire field and
> >>> >> > > > then standard analysis be done on the rest of the field, so
> >>> >> > > > in the case that I had the following text
> >>> >> > > >
> >>> >> > > > this is a test\Foo
> >>> >> > > >
> >>> >> > > > I would like to create the tokens "this", "is", "a", "test",
> >>> >> > > > each with a payload of Foo. From what I'm seeing, though,
> >>> >> > > > only "test" gets the payload. Is there any way to accomplish
> >>> >> > > > this, or will I need to implement a custom tokenizer?
> >>> >> > >
> >>> >> >
> >>> >>
> >>>
> >>
> >>
>
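For reference, the behavior described in the original question falls out of DelimitedPayloadTokenFilter acting on each token independently: the tokenizer has already split the field value before the filter runs, so only the token still carrying the delimiter receives a payload. A sketch of a chain that behaves this way, with a whitespace tokenizer (the actual chain in use is not shown in the thread, and the type name is made up):

```xml
<!-- Illustrative chain only. With input "this is a test\Foo", the
     tokenizer emits "this", "is", "a", "test\Foo"; the payload filter
     then strips "\Foo" and attaches the payload to "test" alone,
     leaving the other tokens without payloads. -->
<fieldType name="per_token_payloads" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="\" encoder="identity"/>
  </analyzer>
</fieldType>
```

Getting one payload onto every token is what the KeywordTokenizer-first approach discussed earlier in the thread works around: attach the payload while the value is still a single token, then split.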