Re: Document aware analyzers was Re: deprecating Versions

Grant Ingersoll Wed, 01 Dec 2010 12:45:23 -0800

On Dec 1, 2010, at 2:40 PM, Robert Muir wrote:

> On Wed, Dec 1, 2010 at 2:25 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>> 
>> Nah, I just meant analysis would often benefit from having knowledge of the 
>> document as a whole instead of just the individual field.
>> 
> 
> and analysis would suffer from this too, because right now these
> things are independent and we have a fast simple reusable model.
> I'd prefer to keep the TokenStream analysis api... but as we have
> discussed on the list, it would be nice to minimize the interface
> between analysis components and indexer/queryparser so you can use an
> *alternative* API... we are working in this direction already.


I think the existing TokenStream API still works, at least in my mind.  

> 
>>> 
>>> Maybe if you give a concrete example then I would have a better
>>> understanding of the problem you think this might solve.
>> 
>> Let me see if I can put some flesh on the bones.  I'm assuming the raw 
>> document has already been parsed and that we are still basically dealing 
>> with strings and that we have a document which contains one or more fields.
>> 
>> If we step back and look at our analysis process, there are some things that 
>> are easy and some things that are hard that maybe shouldn't be, because even 
>> though we talk as if we are indexing and searching documents, we are really 
>> indexing and searching fields, and everything is field-centric.  That works 
>> fine for the easy analysis things like tokenization, stemming, lowercasing, 
>> etc. when all the content is in one language.  It doesn't work well when you 
>> have multiple languages in a single document or if you want to do things 
>> like Tee/Sink or even something as simple as Solr's copy field semantics.
> 
> Well, I have trouble with a few of your examples: "want to use
> Tee/Sink" doesn't work for me... it's a description of an XY problem to
> me... I've never needed to use it, and it's rarely discussed on the
> user list...

Shrugs.  In my experiments, it can really speed things up when analyzing the 
same content, but with different outcomes, or at least it did back before the 
new API.  My bigger point is that things like that and the PerFieldAnalyzerWrapper 
are symptoms of treating documents as second-class citizens.
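
A minimal sketch of the tee/sink idea, in plain Python rather than Lucene's 
actual API: the text is tokenized once, and every token is offered to several 
sinks that each keep what they want. The names tokenize, tee, and the sink 
labels are hypothetical and exist only for illustration.

```python
import re
from typing import Callable, Dict, Iterable, List


def tokenize(text: str) -> Iterable[str]:
    """Hypothetical tokenizer: yields runs of word characters, lazily."""
    for match in re.finditer(r"\w+", text):
        yield match.group(0)


def tee(tokens: Iterable[str],
        sinks: Dict[str, Callable[[str], bool]]) -> Dict[str, List[str]]:
    """Consume the token stream once, offering each token to every sink.

    Each sink is a predicate; a token is stored under the sink's name
    when the predicate accepts it.
    """
    collected: Dict[str, List[str]] = {name: [] for name in sinks}
    for token in tokens:
        for name, accept in sinks.items():
            if accept(token):
                collected[name].append(token)
    return collected


# One pass over the text, two different outcomes (assumed example sinks).
result = tee(
    tokenize("Grant wrote to Robert about Lucene analysis"),
    {
        "all": lambda t: True,
        "capitalized": lambda t: t[0].isupper(),
    },
)
```

Here result["all"] holds every token and result["capitalized"] holds only the 
capitalized ones, yet the text was scanned a single time.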

> 
> As far as working with a lot of languages, I understand this issue
> much better... but I've never had much desire for this, especially
> given the fact that "Query is a document too"... I'm personally not a
> fan of language detection,
> and I don't think it belongs in our analysis API: like encoding
> detection and other similar heuristics, it's part of document parsing
> to me!

I didn't say it did; I just said it is an example of the types of things where 
we pretend that we are document-centric, but we are actually field-centric.

> 
> As I said before, I think our TokenStream analysis API is already
> quite complicated and I don't think we should make it more complicated
> for these reasons (especially since these examples are quite vague and
> I'm still not sure you cannot solve them more easily in another way).

I never said you couldn't solve them in other ways, but I always find those 
ways kludgy.  For instance, how many times, in a complex environment, must one 
tokenize the same text over and over again just to get it in the index?

> 
> If you want to use a more complicated analysis API that doesn't work
> like TokenStreams but instead incorporates things that are document
> parsing or whatever, I guess you should be able to do that. I'm not
> sure Lucene should provide such an API, but we shouldn't force you to
> use the TokenStreams API either.

You keep going back to document parsing, even though I have never mentioned it. 
All I am proposing/_wanting to discuss_ is the notion that analysis might 
benefit from a more document-centric view.  You're presupposing I 
want to change TokenStreams, etc., when all I want to do is take a step 
back and discuss the bigger picture of how a user actually does analysis in the 
real world and whether we can make it easier for them.  I don't even have an 
implementation in mind yet.

Take, for instance, the typical copy field scenario, where one has two fields 
containing the same content analyzed in slightly different ways.  In many 
cases, most of the work is exactly the same (tokenize, lowercase, stopword, 
stem or not), and yet we have to pass the string around twice and do almost all 
of the same work twice, all so that we can change one little thing on the token.
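
To make the copy field case concrete, here is a rough sketch in plain Python 
(not Lucene code, and not a proposed implementation). It assumes two 
hypothetical fields, "body" and "body_stemmed", that differ only in a final 
stemming step; the stopword list and the toy stemmer are stand-ins. The shared 
steps run once per document and only the last step is applied per field.

```python
import re
from typing import Dict, List

# Assumed stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to"}

# Counter so we can see how often the text is actually tokenized.
calls = {"tokenize": 0}


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: splits on runs of word characters."""
    calls["tokenize"] += 1
    return re.findall(r"\w+", text)


def shared_steps(text: str) -> List[str]:
    """Work common to both fields: tokenize, lowercase, drop stopwords."""
    lowered = [token.lower() for token in tokenize(text)]
    return [token for token in lowered if token not in STOPWORDS]


def toy_stem(token: str) -> str:
    """Toy stemmer that only strips a trailing 's'; a real one does far more."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze_document(text: str) -> Dict[str, List[str]]:
    """Run the shared steps once, then apply the per-field difference."""
    base = shared_steps(text)
    return {
        "body": base,
        "body_stemmed": [toy_stem(token) for token in base],
    }


fields = analyze_document("The Analyzers and the Tokens")
```

Both fields come out of a single tokenize call (calls["tokenize"] is 1 here), 
whereas analyzing each field independently would tokenize the same string 
twice.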
 

-Grant