Re: Small Vocabulary
Lucene 4.0 allows you to use custom codecs, and there may be one that would
be better suited to this sort of data, or you could write one.

In your tests, is it the searching that is slow, or are you reading lots of
data for lots of docs? The latter is always likely to be slow. General
performance advice as in
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed may be relevant.
SSDs and loads of RAM never hurt.

--
Ian.

On Mon, Jul 30, 2012 at 2:07 PM, Carsten Schnober wrote:
> Dear list,
> I'm considering using Lucene for indexing sequences of part-of-speech
> (POS) tags instead of words; for those who don't know, POS tags are
> linguistically motivated labels that are assigned to tokens (words) to
> describe their morpho-syntactic function. Instead of sequences of words,
> I would like to index sequences of tags, for instance "ART ADV ADJA NN".
> The aim is to be able to search (efficiently) for occurrences of "ADJA".
>
> The question is whether Lucene can deal with such data cleverly, because
> the statistical properties of these pseudo-texts are very distinct from
> those of natural-language texts, which makes me wonder whether Lucene's
> inverted indexes are suitable. Especially the small vocabulary size (<50
> distinct tokens, depending on the tagging system) is problematic, I
> suppose.
>
> First trials, for which I have implemented an analyzer that just outputs
> Lucene tokens such as "ART", "ADV", "ADJA", etc., yield results that are
> not exactly perfect regarding search performance, in a test corpus with
> a few million tokens. The number of tokens in production mode is
> expected to be much larger, so I wonder whether this approach is
> promising at all.
> Does Lucene (4.0?) provide optimization techniques for extremely small
> vocabulary sizes?
>
> Thank you very much,
> Carsten Schnober
>
> --
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
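Ian's "reading lots of data for lots of docs" point is the crux here: with
fewer than 50 distinct tags, looking a term up is trivially cheap, but every
postings list covers a sizeable fraction of the whole corpus. A plain-Java
toy model (not Lucene code; the class and method names are purely
illustrative) makes the numbers concrete:

```java
import java.util.*;

// Toy model, not Lucene: an inverted index over POS tags, showing why a
// tiny vocabulary produces very long postings lists.
public class PosPostings {
    // term -> sorted list of token positions where it occurs
    static Map<String, List<Integer>> index(List<String> tags) {
        Map<String, List<Integer>> idx = new TreeMap<>();
        for (int pos = 0; pos < tags.size(); pos++) {
            idx.computeIfAbsent(tags.get(pos), t -> new ArrayList<>()).add(pos);
        }
        return idx;
    }

    public static void main(String[] args) {
        // 1M tokens drawn from a 4-tag vocabulary
        String[] vocab = {"ART", "ADV", "ADJA", "NN"};
        List<String> corpus = new ArrayList<>();
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            corpus.add(vocab[rnd.nextInt(vocab.length)]);
        }
        Map<String, List<Integer>> idx = index(corpus);
        // The term dictionary is tiny, but each postings list holds
        // roughly a quarter of all positions in the corpus.
        System.out.println("terms=" + idx.size()
                + " avgPostings=" + corpus.size() / idx.size());
        // prints: terms=4 avgPostings=250000
    }
}
```

The term dictionary stays tiny no matter how large the corpus grows; it is
traversing the quarter-million-entry postings lists that dominates query
time, which matches the slow searches reported in the original post.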
Re: Small Vocabulary
Am 31.07.2012 12:10, schrieb Ian Lea:

Hi Ian,

> Lucene 4.0 allows you to use custom codecs and there may be one that
> would be better for this sort of data, or you could write one.
>
> In your tests is it the searching that is slow or are you reading lots
> of data for lots of docs? The latter is always likely to be slow.
> General performance advice as in
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed may be
> relevant. SSDs and loads of RAM never hurt.

You are quite right: there are many results from many docs for the slower
searches performed on that index. However, I am still wondering about the
theoretical implications: a small vocabulary with many tokens in an
inverted index yields rather long postings lists for some/many/all
(depending on the actual distribution) of the search terms.

Thanks for your pointer to the codecs in Lucene 4; I suppose that this will
be the actual point of attack for this scenario. It may be a silly
question, but one that might be of interest to the whole community ;-) :
can someone point me to in-depth documentation of the Lucene 4 codecs,
ideally covering both the theoretical background and the implementation?
There are numerous helpful blog entries, presentations, etc. available on
the net, but if there is some central resource, I have not been able to
find it.

Thanks!
Best regards,
Carsten
Re: Small Vocabulary
There was some interesting work done on optimizing queries involving very
common words (stop words) that I think overlaps with your problem. See this
blog post
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
from the HathiTrust. The upshot, in a nutshell, was that queries including
terms with very large postings lists (i.e. very frequent terms) were slow,
and the approach they took to dealing with this was to index n-grams (i.e.
pairs and triplets of adjacent tokens). However, I'm not sure this would
help much if your queries will typically include only a single token.

-Mike

On 07/30/2012 09:07 AM, Carsten Schnober wrote:
> The question is whether Lucene can deal with such data cleverly, because
> the statistical properties of these pseudo-texts are very distinct from
> those of natural-language texts, which makes me wonder whether Lucene's
> inverted indexes are suitable. Especially the small vocabulary size (<50
> distinct tokens, depending on the tagging system) is problematic, I
> suppose.
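The n-gram approach Mike describes can be sketched in plain Java (a toy
model of the idea, not Lucene's actual shingle/n-gram filters; names are
illustrative): indexing adjacent tag pairs turns the phrase query
"ADJA NN" into a single lookup on a much rarer term, instead of an
intersection of two huge single-tag postings lists.

```java
import java.util.*;

// Toy sketch of bigram indexing: each adjacent pair of tags becomes one
// index term, recorded at the position of its first tag.
public class BigramIndex {
    static Map<String, List<Integer>> indexBigrams(List<String> tags) {
        Map<String, List<Integer>> idx = new HashMap<>();
        for (int i = 0; i + 1 < tags.size(); i++) {
            String bigram = tags.get(i) + " " + tags.get(i + 1);
            idx.computeIfAbsent(bigram, k -> new ArrayList<>()).add(i);
        }
        return idx;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("ART", "ADJA", "NN", "ADJA", "NN");
        Map<String, List<Integer>> idx = indexBigrams(tags);
        // The phrase "ADJA NN" is answered from one postings list:
        System.out.println(idx.get("ADJA NN"));  // prints: [1, 3]
    }
}
```

As noted above, this only pays off for multi-token queries; a lone "ADJA"
still hits the original long postings list.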
Re: Small Vocabulary
Am 06.08.2012 20:29, schrieb Mike Sokolov:

Hi Mike,

> The upshot in a nutshell was that queries including terms with very
> large postings lists (ie high occurrences) were slow, and the approach
> they took to dealing with this was to index n-grams (ie pairs and
> triplets of adjacent tokens). However I'm not sure this would help much
> if your queries will typically include only a single token.

This is very interesting for our use case indeed. However, you are right
that indexing n-grams is not (per se) a solution to my problem, because I'm
working on an application that uses multiple indexes. A query for one
isolated frequent term will presumably be rare, or at least rare enough to
tolerate slow response times, but the results will typically be intersected
with results from other indexes.

To illustrate this more practically: the index I described, with relatively
few distinct and partly extremely frequent tokens, indexes part-of-speech
(POS) tags with positional information stored in the payload. A parallel
index indexes the actual text; a typical query may look for a certain POS
tag in one index and a word X at the same position, with a matching
payload, in the other index. So both indexes need to be queried completely
before the intersection can be performed.

Best,
Carsten
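The intersection step Carsten describes boils down to a merge walk over two
sorted position lists, one from each index. A minimal plain-Java sketch
(illustrative only, not Lucene's conjunction scoring, and ignoring the
per-document grouping a real index would add):

```java
import java.util.*;

// Merge-intersect two sorted position lists: one from the POS-tag index,
// one from the text index. Positions present in both are the matches.
public class PositionalIntersect {
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;  // advance whichever list is behind
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> posNoun = Arrays.asList(2, 5, 9, 14);  // positions tagged NN
        List<Integer> wordFox = Arrays.asList(5, 14, 20);    // positions of "fox"
        System.out.println(intersect(posNoun, wordFox));     // prints: [5, 14]
    }
}
```

The walk is linear in the combined list lengths, which is why a postings
list covering a large share of the corpus makes even this simple
intersection expensive.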
Re: Small Vocabulary
If you do intersection (not join), maybe it makes sense to put everything
into one index? Just transform your input like "brown fox" into
"ADJ:brown|<payload> NOUN:fox|<payload>". Write a custom tokenizer and some
filters, and that's it.

Of course I'm not aware of all the details, so my solution might not be
applicable to your project. Maybe you could share more details, so this
won't turn into an "XY problem".

Keep in mind: always optimize your index for the query use case, instead
of blindly processing the input data.

On Tue, Aug 7, 2012 at 10:29 AM, Carsten Schnober wrote:
> To illustrate this more practically: the index I described, with
> relatively few distinct and partly extremely frequent tokens, indexes
> part-of-speech (POS) tags with positional information stored in the
> payload. A parallel index indexes the actual text; a typical query may
> look for a certain POS tag in one index and a word X at the same
> position, with a matching payload, in the other index. So both indexes
> need to be queried completely before the intersection can be performed.
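Danil's transform can be sketched as a plain function (a toy model with
made-up names; in Lucene this would live in a custom tokenizer or
TokenFilter in the analysis chain, with the remaining per-token data
carried in the payload):

```java
import java.util.*;

// Fold each POS tag into its token ("ADJ:brown"), so tag and word share
// a single index. Parallel word/tag lists stand in for an analysis chain.
public class CombinedTokens {
    static List<String> combine(List<String> words, List<String> tags) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            out.add(tags.get(i) + ":" + words.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> combined = combine(
                Arrays.asList("brown", "fox"),
                Arrays.asList("ADJ", "NOUN"));
        System.out.println(combined);  // prints: [ADJ:brown, NOUN:fox]
    }
}
```

With this scheme a query for "ADJ:brown" answers "word X tagged Y" from a
single postings list, with no cross-index intersection at all.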
Re: Small Vocabulary
Am 07.08.2012 10:20, schrieb Danil ŢORIN:

Hi Danil,

> If you do intersection (not join), maybe it makes sense to put
> everything into one index?

Just a note on that: my application performs intersections and joins
(unions) on the results, depending on the query. So the index structure has
to be ready for both, but intersections are clearly more complicated.

> Just transform your input like "brown fox" into
> "ADJ:brown|<payload> NOUN:fox|<payload>"

I understand that this denotes "ADJ" and "NOUN" to be interpreted as the
actual tokens and "brown" and "fox" as payloads (each followed by
<payload>), right?

This is a very neat approach, and I have vaguely considered it. One problem
is that I aim for a very high level of flexibility, meaning that additional
annotations have to be addable at any point and different tokenizations
apply. However, I will reconsider your suggestion, possibly applying one of
multiple tokenizations as a default in this sense.

> Of course I'm not aware of all the details, so my solution might not
> be applicable to your project.
> Maybe you could share more details, so this won't turn into an "XY
> problem".
>
> Keep in mind: always optimize your index for the query use case,
> instead of blindly processing the input data.

Thanks for that reminder; this becomes quite difficult in my scenario,
though, since we want to allow for flexible changes in the index types,
representing different annotations, tokenization logics, etc.

Best,
Carsten
Re: Small Vocabulary
Hi Danil,

>> Just transform your input like "brown fox" into
>> "ADJ:brown|<payload> NOUN:fox|<payload>"
>
> I understand that this denotes "ADJ" and "NOUN" to be interpreted as
> the actual tokens and "brown" and "fox" as payloads (each followed by
> <payload>), right?

Sorry for replying to myself, but I've only now realised that you probably
meant to replace the full token string ("brown") with "ADJ:brown" and use
the payload for something else, right? Regarding incoming queries, this
method makes it necessary to perform a wildcard query (e.g. "NOUN:*") when
I am not interested in the actual text ("brown") -- which may happen more
or less frequently -- am I right? However, this might be an acceptable
trade-off...

Best regards,
Carsten
Re: Small Vocabulary
I mean "ADJ:brown" as a token and only the <payload> as payload, since you
probably only use it for some scoring/postprocessing, not the actual
matching.

You can even write a filter that will emit both tokens "ADJ" and
"ADJ:brown" at the same position (so you'll be able to do phrase queries),
and still maintain the join capability.

On Tue, Aug 7, 2012 at 12:13 PM, Carsten Schnober wrote:
> This is a very neat approach, and I have vaguely considered it. One
> problem is that I aim for a very high level of flexibility, meaning that
> additional annotations have to be addable at any point and different
> tokenizations apply. However, I will reconsider your suggestion,
> possibly applying one of multiple tokenizations as a default in this
> sense.
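The both-tokens-at-the-same-position idea can be modeled as a map from
token to positions, with two tokens emitted per input position (a toy
sketch with illustrative names; a real Lucene TokenFilter would emit the
second token with a position increment of 0):

```java
import java.util.*;

// Emit both the bare tag ("NOUN") and the combined token ("NOUN:fox") at
// each position, so tag-only queries need no wildcard and word-specific
// queries still work.
public class DualTokenIndex {
    static Map<String, List<Integer>> index(List<String> words, List<String> tags) {
        Map<String, List<Integer>> idx = new HashMap<>();
        for (int pos = 0; pos < words.size(); pos++) {
            String[] tokens = {tags.get(pos), tags.get(pos) + ":" + words.get(pos)};
            for (String tok : tokens) {
                idx.computeIfAbsent(tok, k -> new ArrayList<>()).add(pos);
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = index(
                Arrays.asList("brown", "fox"),
                Arrays.asList("ADJ", "NOUN"));
        System.out.println(idx.get("NOUN"));      // prints: [1]
        System.out.println(idx.get("NOUN:fox"));  // prints: [1]
    }
}
```

The price is roughly doubling the number of indexed tokens, traded against
avoiding wildcard expansion at query time.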
Re: Small Vocabulary
To avoid wildcard queries, you can write a TokenFilter that will create
both tokens "ADJ" and "ADJ:brown" at the same position, so you can use your
index for both lookups without wildcards.

On Tue, Aug 7, 2012 at 12:31 PM, Carsten Schnober wrote:
> Sorry for replying to myself, but I've only now realised that you
> probably meant to replace the full token string ("brown") with
> "ADJ:brown" and use the payload for something else, right? Regarding
> incoming queries, this method makes it necessary to perform a wildcard
> query (e.g. "NOUN:*") when I am not interested in the actual text
> ("brown") -- which may happen more or less frequently -- am I right?
> However, this might be an acceptable trade-off...