Thanks again Erick, I created https://issues.apache.org/jira/browse/SOLR-7975, though I didn't attach a patch because my current implementation is not generally useful right now; it meets my use case but likely would not meet others'. I will look into generalizing this to allow something custom to be plugged in.

On Aug 26, 2015 2:46 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:
> Sure, I think it's fine to raise a JIRA, especially if you can include
> a patch, even a preliminary one to solicit feedback... which I'll
> leave to people who are more familiar with that code...
>
> I'm not sure how generally useful this would be, and if it comes
> at a cost to normal searching there's sure to be lively discussion.
>
> Best,
> Erick
>
> On Tue, Aug 25, 2015 at 7:50 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> > Looks like I have something basic working for Trie fields. I am doing
> > exactly what I said in my previous email, so good news there. I think
> > this is a big step, as there are only a few field types left that I need
> > to support: date (which should be similar to Trie) and spatial fields,
> > which at a glance looked like they provide a way to supply the token
> > stream through an extension. I definitely need to look more, though.
> >
> > All of that said, is this really the right way to get payloads into
> > these types of fields? Should a JIRA feature request be added for this?
> > On Aug 25, 2015 8:13 PM, "Jamie Johnson" <jej2...@gmail.com> wrote:
> >
> >> Right, I had assumed (obviously here is my problem) that I'd be able to
> >> specify payloads for the field regardless of the field type. Looking at
> >> TrieField, that is certainly non-trivial. After a bit of digging, it
> >> appears that if I wanted to do something here I'd need to build a new
> >> TrieField, override createField, and provide a Field that would return
> >> something like NumericTokenStream but also provide the payloads. Like
> >> you said, it sounds "interesting" to say the least...
> >>
> >> Were payloads not really intended to be used with these types of fields
> >> from a Lucene perspective?
> >>
> >> On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>> Well, you're going down a path that hasn't been trodden before ;).
> >>>
> >>> If you can treat your primitive types as text types you might get
> >>> some traction, but that makes a lot of operations like numeric
> >>> comparison difficult.
> >>>
> >>> Hmmm, another idea from left field: for single-valued types, what
> >>> about a sidecar field that holds the auth token? Even for a
> >>> multiValued field, two parallel fields are guaranteed to maintain
> >>> order, so perhaps you could do something there. Yes, I'm waving my
> >>> hands a LOT here...
> >>>
> >>> I suspect that trying to have a custom type that incorporates
> >>> payloads for, say, trie fields will be "interesting" to say the
> >>> least. Numeric types are packed to save storage etc., so it'll be
> >>> an adventure...
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>> > We were originally using this approach, i.e. running things through
> >>> > KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.
> >>> > Again, this works fine for text, though I had wanted to use the
> >>> > StandardTokenizer in the chain. Is there an equivalent filter that
> >>> > does what the StandardTokenizer does?
> >>> >
> >>> > All of this said, it doesn't address the issue of the primitive
> >>> > field types, which at this point is the bigger issue. Given this use
> >>> > case, should there be another way to provide payloads?
> >>> >
> >>> > My current thinking is that I will need to provide custom
> >>> > implementations for all of the field types I would like to support
> >>> > payloads on, which will essentially be copies of the standard
> >>> > versions with some extra "sugar" to read/write the payloads (I don't
> >>> > see a way to wrap/delegate these at this point because
> >>> > AttributeSource declares the attribute-retrieval methods as final,
> >>> > so I can't simply wrap another tokenizer and return my added
> >>> > attributes plus the wrapped attributes). I know my use case is a bit
> >>> > strange, but I had not expected to need to do this given that
> >>> > Lucene/Solr supports payloads on these field types; they just aren't
> >>> > exposed.
> >>> >
> >>> > As always, I appreciate any ideas if I'm barking up the wrong tree
> >>> > here.
> >>> >
> >>> > On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>> >
> >>> >> Well, if I remember correctly (I have no testing facility at hand),
> >>> >> WordDelimiterFilter maintains payloads on emitted sub-terms. So if
> >>> >> you use a KeywordTokenizer, input 'some text^PAYLOAD', and have a
> >>> >> DelimitedPayloadFilter, the entire string gets a payload. You can
> >>> >> then split that string up again into individual tokens. It is
> >>> >> possible to abuse WordDelimiterFilter for this because it has a
> >>> >> types parameter that you can use to split on whitespace if its
> >>> >> input is not trimmed. Otherwise you can use any other character
> >>> >> instead of a space in your input.
> >>> >>
> >>> >> This is a crazy idea, but it might work.
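In schema.xml terms, the chain Markus describes might be sketched as follows. This is an untested illustration, not a configuration from the thread: the field type name is made up, '^' follows Markus's example input, and whether WordDelimiterFilter actually preserves payloads on the sub-terms is exactly the point he says needs verifying.

```xml
<!-- Illustrative field type (name and delimiter are examples only).
     KeywordTokenizer emits the whole field value as a single token;
     DelimitedPayloadTokenFilter strips everything after '^' and attaches
     it to that token as an identity-encoded (raw byte[]) payload;
     WordDelimiterFilter then splits the token on non-alphanumeric
     characters, including the embedded spaces, with the sub-terms
     expected to carry the original token's payload. -->
<fieldType name="payloaded_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="^" encoder="identity"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
  </analyzer>
</fieldType>
```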
> >>> >>
> >>> >> -----Original message-----
> >>> >> > From: Jamie Johnson <jej2...@gmail.com>
> >>> >> > Sent: Tuesday 25th August 2015 19:37
> >>> >> > To: solr-user@lucene.apache.org
> >>> >> > Subject: Re: Tokenizers and DelimitedPayloadTokenFilterFactory
> >>> >> >
> >>> >> > To be clear, we are using payloads as a way to attach
> >>> >> > authorizations to individual tokens within Solr. The payloads are
> >>> >> > normal Solr payloads, though we are not using floats; we are
> >>> >> > using the identity payload encoder
> >>> >> > (org.apache.lucene.analysis.payloads.IdentityEncoder), which
> >>> >> > allows storing a byte[] of our choosing in the payload field.
> >>> >> >
> >>> >> > This works great for text, but now that I'm indexing more than
> >>> >> > just text I need a way to specify the payload on the other field
> >>> >> > types. Does that make more sense?
> >>> >> >
> >>> >> > On Tue, Aug 25, 2015 at 12:52 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >>> >> >
> >>> >> > > This really sounds like an XY problem. Or, when you use
> >>> >> > > "payload", it's not the Solr payload.
> >>> >> > >
> >>> >> > > Solr payloads are a float value that you can attach to
> >>> >> > > individual terms to influence the scoring. Attaching the
> >>> >> > > _same_ payload to all terms in a field is much the same thing
> >>> >> > > as boosting on any matches in the field at query time, or
> >>> >> > > boosting on the field at index time (the latter assuming that
> >>> >> > > different docs would have different boosts).
> >>> >> > >
> >>> >> > > So can you back up a bit and tell us what you're trying to
> >>> >> > > accomplish? Maybe then we can be sure we're both talking about
> >>> >> > > the same thing ;)
> >>> >> > >
> >>> >> > > Best,
> >>> >> > > Erick
> >>> >> > >
> >>> >> > > On Tue, Aug 25, 2015 at 9:09 AM, Jamie Johnson <jej2...@gmail.com> wrote:
> >>> >> > > > I would like to specify a particular payload for all tokens
> >>> >> > > > emitted from a tokenizer, but I don't see a clear way to do
> >>> >> > > > this. Ideally I could specify that something like the
> >>> >> > > > DelimitedPayloadTokenFilter be run on the entire field and
> >>> >> > > > then standard analysis be done on the rest of the field, so
> >>> >> > > > in the case that I had the following text
> >>> >> > > >
> >>> >> > > > this is a test\Foo
> >>> >> > > >
> >>> >> > > > I would like to create the tokens "this", "is", "a", "test",
> >>> >> > > > each with a payload of Foo. From what I'm seeing, though,
> >>> >> > > > only "test" gets the payload. Is there any way to accomplish
> >>> >> > > > this, or will I need to implement a custom tokenizer?
> >>> >> > >
> >>> >> >
> >>> >>
> >>>
> >>
> >>
>
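For reference, the behavior described in the original question falls out of DelimitedPayloadTokenFilter acting on each token independently: the tokenizer has already split the field value before the filter runs, so only the token still carrying the delimiter receives a payload. A sketch of a chain that behaves this way, with a whitespace tokenizer (the actual chain in use is not shown in the thread, and the type name is made up):

```xml
<!-- Illustrative chain only. With input "this is a test\Foo", the
     tokenizer emits "this", "is", "a", "test\Foo"; the payload filter
     then strips "\Foo" and attaches the payload to "test" alone,
     leaving the other tokens without payloads. -->
<fieldType name="per_token_payloads" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="\" encoder="identity"/>
  </analyzer>
</fieldType>
```

Getting one payload onto every token is what the KeywordTokenizer-first approach discussed earlier in the thread works around: attach the payload while the value is still a single token, then split.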