Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
To avoid wildcard queries, you can write a TokenFilter that will create both tokens "ADJ" and "ADJ:brown" in the same position, so you can use your index for both lookups without doing wildcards. On Tue, Aug 7, 2012 at 12:31 PM, Carsten Schnober wrote: > Hi Danil, > >>> Just transform your input like
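A minimal sketch of the idea above, in Python rather than actual Lucene Java code: it models a TokenFilter that emits both the bare tag ("ADJ") and the tag:word composite ("ADJ:brown") stacked at the same position. In Lucene this would be done by setting the `PositionIncrementAttribute` of the second token to 0; the function name and tuple representation here are illustrative, not Lucene API.

```python
def dual_token_stream(tagged_words):
    """Model of the suggested TokenFilter.

    tagged_words: iterable of (tag, word) pairs, e.g. [("ADJ", "brown")].
    Yields (token, position_increment) pairs, the way a Lucene
    TokenStream exposes terms plus their position increments.
    """
    for tag, word in tagged_words:
        # The bare tag advances the position by 1, as a normal token would.
        yield (tag, 1)
        # The composite token gets increment 0, stacking it at the same
        # position so both "ADJ" and "ADJ:brown" match there.
        yield (f"{tag}:{word}", 0)

tokens = list(dual_token_stream([("ADJ", "brown"), ("NOUN", "fox")]))
```

Because "ADJ" and "ADJ:brown" share position 0 (and "NOUN"/"NOUN:fox" share position 1), phrase queries and exact term lookups both work without any wildcard expansion.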

Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
I mean "ADJ:brown" as a token and only the <payload> as payload, since you probably only use it for some scoring/postprocessing, not the actual matching. You can even write a filter that will emit both tokens "ADJ" and "ADJ:brown" at the same position (so you'll be able to do phrase queries), and still maintain

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
Hi Danil, >> Just transform your input like "brown fox" into "ADJ:brown|<payload> >> NOUN:fox|<payload>" > > I understand that this denotes "ADJ" and "NOUN" to be interpreted as the > actual tokens and "brown" and "fox" as payloads (followed by <payload>), right? Sorry for replying to myself, but I've reali

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
On 07.08.2012 10:20, Danil ŢORIN wrote: Hi Danil, > If you do intersection (not join), maybe it makes sense to put > everything into one index? Just a note on that: my application performs intersections and joins (unions) on the results, depending on the query. So the index structure has to be r

Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
If you do intersection (not join), maybe it makes sense to put everything into one index? Just transform your input like "brown fox" into "ADJ:brown|<payload> NOUN:fox|<payload>". Write a custom tokenizer and some filters, and that's it. Of course I'm not aware of all the details, so my solution might not be applicable
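A sketch of the custom tokenizer this message suggests, again as a Python model rather than a Lucene `Tokenizer` subclass: input is assumed to already be rewritten into whitespace-separated "TAG:word|payload" units, and each unit is split into the indexed term ("TAG:word") and its payload, which Lucene would carry via a `PayloadAttribute`. The "|" separator and the `lemma=...` payload contents are assumptions for illustration, based on the format quoted in this thread.

```python
def tokenize(text):
    """Split 'ADJ:brown|p1 NOUN:fox|p2' into (term, payload) pairs.

    A unit without a '|' yields a payload of None, modeling a token
    indexed with no payload attached.
    """
    out = []
    for unit in text.split():
        term, _, payload = unit.partition("|")
        out.append((term, payload or None))
    return out

pairs = tokenize("ADJ:brown|lemma=brown NOUN:fox|lemma=fox")
```

With this scheme the composite "ADJ:brown" is the actual indexed term, so exact matches need no wildcards, while the payload remains available for scoring or postprocessing.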

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
On 06.08.2012 20:29, Mike Sokolov wrote: Hi Mike, > There was some interesting work done on optimizing queries including > very common words (stop words) that I think overlaps with your problem. > See this blog post > http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-wo

Re: Small Vocabulary

2012-08-06 Thread Mike Sokolov
ied to deal with that data cleverly because the statistical properties of such pseudo-texts are very distinct from natural language texts and make me wonder whether Lucene's inverted indexes are suitable. Especially the small vocabulary size (<50 distinct tokens, depending on the tagging system) i

Re: Small Vocabulary

2012-08-02 Thread Carsten Schnober
n that index. However, I am still wondering about the theoretical implications: having a small vocabulary with many tokens in an inverted index would yield a rather long list of occurrences for some/many/all (depending on the actual distribution) of the search terms. Thanks for your pointer to t
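The concern in the message above can be made concrete with a toy inverted index: with only a few dozen distinct tags, every postings list grows to roughly total_tokens / vocabulary_size entries, so every term behaves like a stop word. The tag names and corpus size below are invented for illustration.

```python
import random
from collections import defaultdict

random.seed(0)
tags = [f"TAG{i}" for i in range(50)]           # ~50 distinct tokens
corpus = [random.choice(tags) for _ in range(100_000)]

# Build a minimal inverted index: term -> list of positions.
postings = defaultdict(list)
for pos, tok in enumerate(corpus):
    postings[tok].append(pos)

avg = sum(len(p) for p in postings.values()) / len(postings)
# avg is about 100_000 / 50 = 2000 positions per term: every term's
# postings list is long, which is exactly the stop-word-like behavior
# discussed in this thread.
```

This is why the thread converges on enriching the terms (e.g. "ADJ:brown" instead of "ADJ"): a larger effective vocabulary shortens the individual postings lists.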

Re: Small Vocabulary

2012-07-31 Thread Ian Lea
make me wonder whether Lucene's > inverted indexes are suitable. Especially the small vocabulary size (<50 > distinct tokens, depending on the tagging system) is problematic, I suppose. > > First trials for which I have implemented an analyzer that just outputs > Lucene tokens suc

Small Vocabulary

2012-07-30 Thread Carsten Schnober
do-texts are very distinct from natural language texts and make me wonder whether Lucene's inverted indexes are suitable. Especially the small vocabulary size (<50 distinct tokens, depending on the tagging system) is problematic, I suppose. First trials for which I have implemented an analyzer t