Grant Ingersoll wrote:
On Oct 19, 2008, at 7:08 PM, Michael Busch wrote:
Grant Ingersoll wrote:
On Oct 19, 2008, at 12:56 AM, Mark Miller wrote:
Grant Ingersoll wrote:
Bear with me, b/c I'm not sure I'm following, but looking at
https://issues.apache.org/jira/browse/LUCENE-1422, I see at least
5 different implemented Attributes.
So, let's say I add a 5 more attributes and now have a total of 10
attributes. Are you saying that I then would have, potentially, 10
different variables that all point to the token as in the code
snippet above where the casting takes place? Or would I just
create a single "Super" attribute that folds in all of my new
attributes, plus any other existing ones? Or, maybe, what I would
do is create the 5 new attributes and then 1 new attribute that
extends all 10, thus allowing me to use them individually, but
saving me from having to do a whole ton of casting in my Consumer.
Potentially one consumer doing 10 things, but not likely, right? I
mean, things will stay logical as they are now, and rather than a
super consumer doing everything, we will still have a chain of
consumers each doing its own piece. So more likely, maybe something
comes along every so often (another 5, over *much* time, say) and
each time we add a Consumer that uses one or two TokenStream types.
And then it's just an implementation detail whether you make a
composite TokenStream - if you have added 10 new attributes and see
fit to make one consumer use them all, sure, make a composite,
super type, but in my mind, the way it's done in the example code is
clearer/cleaner for a handful of TokenStream types. And even if you
do make the composite, super type, it's likely to just be a sugar
wrapper anyway - the implementation for, say, payload and positions,
should probably be maintained in their own classes anyway.
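For illustration, the composite idea above might be sketched in plain Java roughly like this. The interface and class names here are hypothetical stand-ins, not the actual types from the patch; the point is only that a "super" attribute type can extend the individual attribute interfaces, so a consumer holds a single reference while the implementations stay separate.

```java
// Hypothetical attribute interfaces, named after the ones discussed in
// this thread; illustrative only, not the actual patch's API.
interface PositionIncrementAttribute {
    int getPositionIncrement();
}

interface PayloadAttribute {
    byte[] getPayload();
}

// The "composite, super type": just a sugar wrapper that folds the two
// attributes into one type, so a consumer needs only one reference.
interface PositionAndPayloadAttribute
        extends PositionIncrementAttribute, PayloadAttribute {
}

// One possible implementation; the position and payload logic could
// equally be delegated to separate classes, as suggested above.
class SimplePositionAndPayload implements PositionAndPayloadAttribute {
    private int positionIncrement = 1;
    private byte[] payload = new byte[0];

    public int getPositionIncrement() { return positionIncrement; }
    public byte[] getPayload() { return payload; }

    void setPositionIncrement(int inc) { positionIncrement = inc; }
    void setPayload(byte[] p) { payload = p; }
}
```

A consumer can then treat the same object as either a PositionIncrementAttribute or a PayloadAttribute without any further casting.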
Well, there are 5 different attributes already, all of which are
commonly used. Seems weird to have to cast the same var 5 different
ways. Definitely agree that one would likely deal with this by
wrapping, but then you end up either needing to extend your wrapper
or add new wrappers...
Well yes, there are 5 attributes, but in none of the core
tokenstreams and -filters that I changed in my patch did I have to
use more than two or three of them. Currently the only attributes
that are really used are PositionIncrementAttribute and
PayloadAttribute. And the OffsetAttribute when TermVectors are turned
on.
Even in the indexing chain we currently don't have a single consumer
that needs all attributes. The FreqProxWriter needs positions and
payloads, the TermVectorsWriter needs positions and offsets.
I have an application that uses all the attributes of a Token, or at
least, almost all of them. There are many uses for Lucene's analysis
code that have nothing to do with indexing, Consumers or even Lucene.
Also, you don't have to cast the same variable multiple times. In the
current patch you would call e. g.
token.getAttribute(PayloadAttribute.class) and keep a reference to it
in the consumer or filter.
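A self-contained sketch of that lookup-and-cache pattern might look like the following. The getAttribute(PayloadAttribute.class) call is the one described above, but the class bodies here are simplified stand-ins, not the actual patch code: the token stores attribute instances keyed by class, and a consumer fetches each attribute once and keeps the reference, with no repeated casting of the token itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-ins for the classes discussed above.
interface Attribute {}

class PayloadAttribute implements Attribute {
    byte[] payload = new byte[0];
}

class PositionIncrementAttribute implements Attribute {
    int positionIncrement = 1;
}

class Token {
    private final Map<Class<? extends Attribute>, Attribute> attributes =
            new HashMap<>();

    // Returns the attribute instance for the given class, creating it on
    // first request so that all callers share the same reference.
    @SuppressWarnings("unchecked")
    <T extends Attribute> T getAttribute(Class<T> clazz) {
        return (T) attributes.computeIfAbsent(clazz, c -> {
            try {
                return c.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(e);
            }
        });
    }
}

public class AttributeDemo {
    public static void main(String[] args) {
        Token token = new Token();
        // A consumer fetches the attribute once and keeps the reference.
        PayloadAttribute payload = token.getAttribute(PayloadAttribute.class);
        payload.payload = new byte[] {42};
        // A second lookup returns the very same instance.
        System.out.println(
                payload == token.getAttribute(PayloadAttribute.class));
    }
}
```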
IMO even calling getAttribute() 5 times or so and storing the
references wouldn't be so bad. And if you really don't like it you
could make a wrapper as you said. You also mentioned the
disadvantages of the wrapper, e. g. that you would have to extend it
to add new attributes. But then, isn't that the same disadvantage the
current Token API has?
True. I didn't say the idea was bad, in fact I mostly like it, I was
just saying I'd like to explore how it would work in practice and the
main thing that struck me was all the casting or all the references.
Since it's likely that you only deal with a Token one at a time,
you're right, it's probably not a big deal other than the code looks
funny, IMO.
You could even use the new API in exactly the same way as the old one.
Just create a subclass of Token that has all members you need and
don't add any attributes.
So I think the new API adds more flexibility, and still lets you use
it in the same way as the old one. However, I think the recommended
best practice should be to use the new attributes, for reusability of
consumers that only need certain attributes.
Perhaps it would be useful for Lucene to offer exactly one subclass of
Token that we guarantee will always have all known Attributes (i.e.
the ones Lucene provides) available to it for casting purposes.
Yeah we could do that. In fact, I did exactly this when I started
working on this patch. I created a class called PlainToken, which had
all the termBuffer and attributes logic, and changed Token to extend it.
Then the new getToken() method would return an instance of PlainToken.
My main concern with this approach is that it would make the code in the
indexer more complicated, because it always has to check whether it has a
Token or a PlainToken; if it's a Token then it can use the get*()
methods directly, while for a PlainToken it has to check for the
*Attributes. So that's a bit messy (it's in fact exactly like that in the
current patch for backwards-compatibility, but we could clean this up in
3.0). So for code simplicity I'm slightly in favor of not creating a
class that implements a default set of functionality without Attributes.
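The branching concern above can be sketched like this. The names and shapes are a hypothetical reconstruction from the description in this thread, not the actual patch code; the point is only that any code taking a PlainToken has to test the concrete type before it knows which access path to use.

```java
// Hypothetical reconstruction of the hierarchy described above.
class PlainToken {
    // in the real experiment, termBuffer and attribute logic lived here
}

// Token extends PlainToken and additionally exposes the old-style
// direct getters, which is why the indexer must branch on the type.
class Token extends PlainToken {
    private int positionIncrement = 1;

    int getPositionIncrement() { return positionIncrement; }
    void setPositionIncrement(int inc) { positionIncrement = inc; }
}

class Indexer {
    // The messy part: two code paths depending on the runtime type.
    static int positionIncrementOf(PlainToken t) {
        if (t instanceof Token) {
            // old-style direct getter
            return ((Token) t).getPositionIncrement();
        }
        // otherwise we would have to look up a PositionIncrementAttribute;
        // this sketch just falls back to the default of 1
        return 1;
    }
}
```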
-Michael