To avoid wildcard queries, you can write a TokenFilter that creates both
tokens "ADJ" and "ADJ:brown" at the same position, so you can use your index
for both lookups without wildcards.
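Something like this minimal sketch could work (the class name is made up; it
assumes the incoming tokens already have the "POS:word" shape and uses the
standard attribute API):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class PosTagExpandingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private String pendingTag;               // bare tag still to be emitted
  private AttributeSource.State savedState;

  public PosTagExpandingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingTag != null) {
      // Emit the bare tag ("ADJ") at the same position as "ADJ:brown".
      restoreState(savedState);
      termAtt.setEmpty().append(pendingTag);
      posIncrAtt.setPositionIncrement(0);
      pendingTag = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAtt.toString();
    int colon = term.indexOf(':');
    if (colon > 0) {
      // Remember the tag part of "ADJ:brown" for the next call.
      pendingTag = term.substring(0, colon);
      savedState = captureState();
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingTag = null;
    savedState = null;
  }
}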
On Tue, Aug 7, 2012 at 12:31 PM, Carsten Schnober wrote:
> Hi Danil,
>
>>> Just transform your input like "brown fox" into "ADJ:brown|<payload> NOUN:fox|<payload>"
I mean "ADJ:brown" as a token and only the as payload, since
you probably only use it for some scoring/postprocessing not the
actual matching.
You can even write a filter that will emit both tokens "ADJ" and
"ADJ:brown" at the same position (so you'll be able to do phrase queries),
and still maintain the payloads.
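For example, with both variants indexed at the same position, a phrase query
can mix a bare tag with a full tag:word token (a sketch, assuming a field
named "text" and the 3.x/4.x-era PhraseQuery API):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Matches any word tagged ADJ immediately followed by the specific
// token "fox" tagged as NOUN, e.g. "brown fox".
PhraseQuery query = new PhraseQuery();
query.add(new Term("text", "ADJ"));      // position 0: any adjective
query.add(new Term("text", "NOUN:fox")); // position 1: "fox" as a noun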
Hi Danil,
>> Just transform your input like "brown fox" into "ADJ:brown|<payload> NOUN:fox|<payload>"
>
> I understand that this denotes "ADJ" and "NOUN" to be interpreted as the
> actual tokens and "brown" and "fox" as payloads (followed by <payload>), right?
Sorry for replying to myself, but I've realized …
On 07.08.2012 10:20, Danil ŢORIN wrote:
Hi Danil,
> If you do intersection (not join), maybe it makes sense to put
> everything into 1 index?
Just a note on that: my application performs intersections and joins
(unions) on the results, depending on the query. So the index structure
has to be r…
If you do intersection (not join), maybe it makes sense to put
everything into 1 index?
Just transform your input like "brown fox" into "ADJ:brown|<payload> NOUN:fox|<payload>".
Write a custom tokenizer and some filters, and that's it.
Of course I'm not aware of all the details, so my solution might not
be applicable.
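Roughly like this, for example (an untested sketch against the Lucene
4.x-era API; DelimitedPayloadTokenFilter splits each token at the '|'
delimiter and stores the remainder as the payload, with IdentityEncoder
keeping the payload bytes as-is):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IdentityEncoder;
import org.apache.lucene.util.Version;

// Turns "ADJ:brown|<payload> NOUN:fox|<payload>" into the tokens
// "ADJ:brown" and "NOUN:fox", each carrying the bytes after '|' as payload.
Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
    TokenStream sink = new DelimitedPayloadTokenFilter(source, '|', new IdentityEncoder());
    return new TokenStreamComponents(source, sink);
  }
};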
On 06.08.2012 20:29, Mike Sokolov wrote:
Hi Mike,
> There was some interesting work done on optimizing queries including
> very common words (stop words) that I think overlaps with your problem.
> See this blog post
> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-wo
…tried to deal with that data
cleverly, because the statistical properties of such pseudo-texts are very
distinct from natural language texts and make me wonder whether Lucene's
inverted indexes are suitable. Especially the small vocabulary size (<50
distinct tokens, depending on the tagging system) is problematic, I suppose.
…in that index. However, I am still wondering
about the theoretical implications: having a small vocabulary with many
tokens in an inverted index would yield a rather long list of
occurrences for some/many/all (depending on the actual distribution) of
the search terms.
Thanks for your pointer to the blog post.
> …make me wonder whether Lucene's
> inverted indexes are suitable. Especially the small vocabulary size (<50
> distinct tokens, depending on the tagging system) is problematic, I suppose.
>
> First trials for which I have implemented an analyzer that just outputs
> Lucene tokens suc…