Re: Small Vocabulary

Danil ŢORIN Tue, 07 Aug 2012 02:33:03 -0700

I mean "ADJ:brown" as the token and only the <payload> as the payload,
since you probably use the payload only for scoring/postprocessing,
not for the actual matching.
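
To make the split concrete, here is a minimal, library-free sketch in plain Java. It assumes the input is whitespace-separated and that "|" separates the term from the payload, as in the "ADJ:brown|<payload>" example. The class and method names (PayloadSplit, TermWithPayload, parse) are made up for illustration and are not Lucene API. In Lucene itself, DelimitedPayloadTokenFilter should cover this kind of term/payload splitting, but check it against your version.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PayloadSplit {

    /** Hypothetical holder: the indexed term plus its opaque payload bytes. */
    static final class TermWithPayload {
        final String term;
        final byte[] payload;

        TermWithPayload(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /**
     * Splits input such as "ADJ:brown|p1 NOUN:fox|p2" into terms and payloads.
     * Everything before the first '|' is the term that would be matched;
     * everything after it is kept only as payload (assumed UTF-8 here).
     */
    static List<TermWithPayload> parse(String input) {
        List<TermWithPayload> out = new ArrayList<>();
        for (String chunk : input.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty input yields a single empty chunk
            }
            int bar = chunk.indexOf('|');
            String term = bar < 0 ? chunk : chunk.substring(0, bar);
            String payload = bar < 0 ? "" : chunk.substring(bar + 1);
            out.add(new TermWithPayload(term, payload.getBytes(StandardCharsets.UTF_8)));
        }
        return out;
    }

    public static void main(String[] args) {
        for (TermWithPayload t : parse("ADJ:brown|p1 NOUN:fox|p2")) {
            System.out.println(t.term + " -> "
                    + new String(t.payload, StandardCharsets.UTF_8));
        }
    }
}
```

The point is only that "ADJ:brown" as a whole is what gets matched, while the payload rides along for later scoring.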


You can even write a filter that will emit both tokens "ADJ" and
"ADJ:brown" at the same position (so you'll be able to do phrase queries),
and still maintain join capability.
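
A rough sketch of what such a filter would produce, again in plain Java without Lucene on the classpath. The names (StackedTags, Tok, expand) are hypothetical. The position increment is modelled as a plain int: 1 moves to the next position, 0 stacks the token on the previous one. In a real Lucene TokenFilter you would presumably capture the state and set PositionIncrementAttribute to 0 for the extra token; that part is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StackedTags {

    /** Hypothetical token model: term text plus a position increment. */
    static final class Tok {
        final String term;
        final int posInc; // 1 = next position, 0 = same position as the previous token

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /**
     * For each "TAG:word" term, emits the full term at a new position and
     * then the bare tag stacked at that same position. Terms without a tag
     * prefix are passed through unchanged.
     */
    static List<Tok> expand(List<String> terms) {
        List<Tok> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Tok(term, 1));
            int colon = term.indexOf(':');
            if (colon > 0) {
                out.add(new Tok(term.substring(0, colon), 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int position = -1;
        for (Tok t : expand(Arrays.asList("ADJ:brown", "NOUN:fox"))) {
            position += t.posInc;
            System.out.println(position + "\t" + t.term);
        }
    }
}
```

Because "ADJ" and "NOUN" sit at consecutive positions, a phrase query on the bare tags would line up the same way as one on "ADJ:brown" and "NOUN:fox".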


On Tue, Aug 7, 2012 at 12:13 PM, Carsten Schnober
<schno...@ids-mannheim.de> wrote:
> On 07.08.2012 10:20, Danil ŢORIN wrote:
>
> Hi Danil,
>
>> If you do intersection (not join), maybe it makes sense to put
>> everything into one index?
>
> Just a note on that: my application performs intersections and joins
> (unions) on the results, depending on the query. So the index structure
> has to be ready for both, but intersections are clearly more complicated.
>
>> Just transform your input like "brown fox" into "ADJ:brown|<your
>> payload> NOUN:fox|<other payload>"
>
> I understand that this denotes "ADJ" and "NOUN" to be interpreted as the
> actual token and "brown" and "fox" as payloads (followed by <other
> payload>), right?
>
> This is a very neat approach and I have vaguely considered that. One
> problem is that I aim for a very high level of flexibility, meaning that
> additional annotations have to be addable at any point and different
> tokenizations apply. However, I will re-consider your suggestion,
> possibly applying one of multiple tokenizations as a default in this sense.
>
>> Of course I'm not aware of all the details, so my solution might not
>> be applicable to your project.
>> Maybe you could share more details, so this won't turn into an "XY problem".
>>
>> Keep in mind: always optimize your index for the query use case,
>> instead of blindly processing the input data.
>
> Thanks for that reminder; this becomes quite difficult in my scenario
> though since we want to allow for flexible changes in the index types,
> representing different annotations, tokenization logics etc.
> Best,
> Carsten
>
>
> --
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP                 | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
