Re: TokenStream and Token APIs

Grant Ingersoll Mon, 20 Oct 2008 06:48:31 -0700


On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:

Grant Ingersoll wrote:
On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:
Grant Ingersoll wrote:
Bear with me, b/c I'm not sure I'm following, but looking at https://issues.apache.org/jira/browse/LUCENE-1422, I see at least 5 different implemented Attributes.
So, let's say I add a 5 more attributes and now have a total of10 attributes. Are you saying that I then would have,potentially, 10 different variables that all point to the tokenas in the code snippet above where the casting takes place? Orwould I just create a single "Super" attribute that folds in allof my new attributes, plus any other existing ones? Or, maybe,what I would do is create the 5 new attributes and then 1 newattribute that extends all 10, thus allowing me to use themindividually, but saving me from having to do a whole ton ofcasting in my Consumer.
Potentially one consumer doing 10 things, but not likely right? Imean, things will stay logical as they are now, and rather than asuper consumer doing everything, we will still have a chain ofconsumers each doing its own piece. So more likely, maybesomething comes along every so often (another 5, over *much* time,say) and each time we add a Consumer that uses one or twoTokenStream types. And then its just an implementation detail onwhether you make a composite TokenStream - if you have added 10new attributes and see it fit to make one consumer use them all,sure, make a composite, super type, but in my mind, the way itsdone in the example code is clearer/cleaner for a handful ofTokenStream types. And even if you do make the composite,supertype, its likely to just be a sugar wrapper anyway - theimplementation for say, payload and positions, should probably bemaintained in their own classes anyway.
Well, there are 5 different attributes already, all of which arecommonly used. Seems weird to have to cast the same var 5different ways. Definitely agree that one would likely deal withthis by wrapping, but then you end up either needing to extend yourwrapper or add new wrappers...
Well yes, there are 5 attributes, but n neither of the coretokenstreams and -filters that I changed in my patch did I have touse more than two or three of those. Currently the only attributesthat are really used are PositionIncrementAttribute andPayloadAttribute. And the OffsetAttribute when TermVectors areturned on.
Even in the indexing chain currently we don't have a single consumerthat needs all attributes. The FreqProxWriter needs positions andpayloads, the TermVectorsWriter needs positions and offsets.

I have an application that uses all the attributes of a Token, or atleast, almost all of them. There are many uses for Lucene's analysiscode that have nothing to do with indexing, Consumers or even Lucene.

Also, you don't have to cast the same variable multiple times. Inthe current patch you would call e. g.token.getAttribute(PayloadAttribute.class) and keep a reference toit in the consumer or filter.
IMO even calling getAttribute() 5 times or so and storing thereferences wouldn't be so bad. And if you really don't like it youcould make a wrapper as you said. You also mentioned thedisadvantages of the wrapper, e. g. that you would have to extend itto add new attributes. But then, isn't that the same disadvantagethe current Token API has?

True. I didn't say the idea was bad, in fact I mostly like it, I wasjust saying I'd like to explore how it would work in practice and themain thing that struck me was all the casting or all the references.Since it's likely that you only deal with a Token one at a time,you're right, it's probably not a big deal other than the code looksfunny, IMO.

You could even use the new API in exact the same way as the old one.Just create a subclass of Token that has all members you need anddon't add any attributes.
So I think the new API adds more flexibility, and still offers touse it in the same way as the old one. I however think therecommended best practice should be to use the new attributes, forreusability of consumers that only need certain attributes.

Perhaps it would be useful for Lucene to offer exactly one subclass ofToken that we guarantee will always have all known Attributes (i.e.the ones Lucene provides) available to it for casting purposes.

However, please let me know if you have any concrete recommendationsabout changing the API in LUCENE-1422.


I thought those concerns were pretty concrete...  :-)

There might be better ones than the APIs I came up with.


I think the APIs in the 2nd patch look pretty reasonable.


-Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: TokenStream and Token APIs

Reply via email to