Hi folks, I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This corpus contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams.)
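
To make the input concrete, here is a minimal sketch of how one row could be parsed. It assumes each row has the form "w1 w2 ... wn<TAB>count"; the class and method names are hypothetical and are not part of Lucene or of the corpus distribution.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class NGramRow {
    final List<String> tokens; // the words of the n-gram, in order
    final long count;          // observed frequency count

    NGramRow(List<String> tokens, long count) {
        this.tokens = tokens;
        this.count = count;
    }

    // Assumed row layout: space-separated words, a tab, then the count.
    static NGramRow parse(String row) {
        int tab = row.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab separator: " + row);
        }
        String[] words = row.substring(0, tab).split(" ");
        long count = Long.parseLong(row.substring(tab + 1).trim());
        return new NGramRow(Arrays.asList(words), count);
    }
}
```

Calling NGramRow.parse on a row gives the word list and the count as separate values, which is roughly what ends up in each Document.
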
I'm loading each n-gram (each row is one n-gram) as an individual Document. This way I'll be able to search for each n-gram separately, but I'm ending up with huge indexes, which makes them very hard to load and read.

Is there a better way to load and read n-grams in a Lucene index? Maybe using a lower-level API?

More info about the Google Web 1T 5-gram corpus at: <http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>

Thanks,
Rafael