Re: Small Vocabulary

Carsten Schnober Tue, 07 Aug 2012 02:14:09 -0700

On 07.08.2012 10:20, Danil ŢORIN wrote:

Hi Danil,


> If you do intersections (not joins), maybe it makes sense to put
> everything into one index?

Just a note on that: my application performs intersections and joins
(unions) on the results, depending on the query. So the index structure
has to be ready for both, but intersections are clearly more complicated.

> Just transform your input like "brown fox" into "ADJ:brown|<your
> payload> NOUN:fox|<other payload>"

I understand this to mean that "ADJ" and "NOUN" are interpreted as the
actual tokens, with "brown" and "fox" stored as payloads (followed by
<other payload>), right?
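
To make the format under discussion concrete, here is a minimal sketch
in plain Java (no Lucene dependency) that builds and parses strings of
the form "TAG:surface|payload". The class and method names are
hypothetical, and the sketch assumes that tags contain no ':' and
surface forms contain no '|'. Which part ends up as the indexed term
and which as the payload is left open here; that would depend on the
analyzer that consumes the string.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Lucene.
class AnnotatedTokenEncoder {

    /** One annotated token: a tag such as "ADJ", the surface form, and an opaque payload. */
    static final class Annotated {
        final String tag;
        final String surface;
        final String payload;

        Annotated(String tag, String surface, String payload) {
            this.tag = tag;
            this.surface = surface;
            this.payload = payload;
        }
    }

    /** Joins tokens as "TAG:surface|payload", separated by single spaces. */
    static String encode(List<Annotated> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Annotated t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.tag).append(':').append(t.surface).append('|').append(t.payload);
        }
        return sb.toString();
    }

    /** Splits one "TAG:surface|payload" token at the first ':' and the first '|'. */
    static Annotated decode(String token) {
        int colon = token.indexOf(':');
        int bar = token.indexOf('|');
        if (colon < 0 || bar < colon) {
            throw new IllegalArgumentException("expected TAG:surface|payload, got: " + token);
        }
        return new Annotated(
                token.substring(0, colon),
                token.substring(colon + 1, bar),
                token.substring(bar + 1));
    }

    public static void main(String[] args) {
        List<Annotated> input = Arrays.asList(
                new Annotated("ADJ", "brown", "p1"),
                new Annotated("NOUN", "fox", "p2"));
        System.out.println(encode(input));
    }
}
```

The payloads "p1" and "p2" are placeholders for whatever per-token data
the application stores.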

This is a very neat approach, and I have vaguely considered it. One
problem is that I am aiming for a very high level of flexibility: it
must be possible to add further annotations at any point, and different
tokenizations apply. However, I will reconsider your suggestion,
possibly using one of the multiple tokenizations as the default in this
sense.

> Of course I'm not aware of all the details, so my solution might not
> be applicable to your project.
> Maybe you could share more details, so this doesn't turn into an "XY problem".
>
> Keep in mind: always optimize your index for the query use case,
> instead of blindly processing the input data.

Thanks for that reminder. This becomes quite difficult in my scenario,
though, since we want to allow flexible changes to the index types,
which represent different annotations, tokenization logic, etc.

Best,
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
