Re: WildcardTermEnum skipping terms containing numbers?!
> why reindex?

Well, since I had different experiences with the different analyzers I tried, I thought that this problem must originate from either the indexing or a Lucene bug.

> As stated at the end of my mail, I'd expect that to skip the first term in the enum.

Yes, this must have been the problem for me, since I took this sentence from the manual as my starting point:

    Returns the current Term in the enumeration. Initially invalid, valid after next() called for the first time.

So it seems that it was a bug in the docs, not in the API itself.

> Is that what you miss, or do you lose more than one term?

It seemed to me that it was skipping more than that, but I'd better not claim this: I didn't know that the term is valid even before the first next(), so I could have been misled by my own chaotic experiments. Since my code has been completely restructured since then, I no longer have all the surrounding code needed for further testing.

Anyway, we've found a docs bug thanks to you, and my code is cleaner and better the other way. Thanks!

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: WildcardTermEnum skipping terms containing numbers?!
Sanyi writes:
> > If there's a bug, it should be tracked down, not worked around...
>
> Sure, but I'm working with 20 million records and it takes about 25 hours to re-index, so I'm looking for ways that don't require reindexing.

Why reindex?

> My code was:
>
>     WildcardTermEnum wcenum = new WildcardTermEnum(reader, term);
>     while (wcenum.next()) {
>         terms.add(new WeightedTerm(termgroup, wcenum.term().text()));
>         //System.out.println(wcenum.term().text());
>     }
>
> And it skipped lots of things it shouldn't have skipped.

As stated at the end of my mail, I'd expect that to skip the first term in the enum. Is that what you are missing, or do you lose more than one term?

Morus
Re: WildcardTermEnum skipping terms containing numbers?!
Sanyi writes:
> Enumerating the terms using WildcardTermEnum and an IndexReader seems to be too buggy to use.

If there's a bug, it should be tracked down, not worked around...

But it looks ok to me:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.*;
    import org.apache.lucene.document.*;
    import org.apache.lucene.store.*;
    import org.apache.lucene.search.*;

    public class LuceneTest {
        public static void main(String[] args) throws Exception {
            // Build a one-document in-memory index.
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("foo", "blabla etc.. etc... c0la c0ca caca ccca", true, true, true));
            writer.addDocument(doc);
            writer.close();

            // Enumerate matching terms; term() is read before the first next().
            IndexReader reader = IndexReader.open(dir);
            WildcardTermEnum wcenum = new WildcardTermEnum(reader, new Term("foo", "c??a"));
            do {
                System.out.println(wcenum.term().text());
            } while (wcenum.next());

            // Cross-check against the rewritten wildcard query.
            WildcardQuery wq = new WildcardQuery(new Term("foo", "c??a"));
            Query q = wq.rewrite(reader);
            System.out.println(q.toString());
            reader.close();
        }
    }

gives

    c0ca
    c0la
    caca
    ccca
    foo:c0ca foo:c0la foo:caca foo:ccca

The only bug I see is in the docs, which claim that term() is invalid before the first call to next(); that does not seem to be the case. So if you use

    while (wcenum.next()) { ... }

you will lose the first term, whatever it is.

Looking at the sources, I find that this behaviour is shared by FuzzyTermEnum. Both implementations of the abstract FilteredTermEnum class call setEnum at the end of the constructor, which prepares the first result.

Morus
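The effect described above, an enumeration whose constructor already prepares the first result, can be reproduced without Lucene. The following is only a minimal sketch: PrimedEnum and both collect methods are hypothetical names made up for illustration, and the class is a stand-in for the behaviour discussed in this thread, not Lucene's implementation. The null check for an empty enumeration is likewise an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a "primed" term enumeration; NOT Lucene code.
class PrimedEnumDemo {

    /** Enumerator that is already positioned on its first term after construction. */
    static final class PrimedEnum {
        private final List<String> terms;
        private int pos = 0; // the cursor starts ON the first term

        PrimedEnum(List<String> terms) {
            this.terms = terms;
        }

        /** Current term, or null once the enumeration is exhausted (sketch assumption). */
        String term() {
            return pos < terms.size() ? terms.get(pos) : null;
        }

        /** Advances to the next term; returns false when there is none. */
        boolean next() {
            pos++;
            return pos < terms.size();
        }
    }

    /** Calls next() before reading: the term prepared by the constructor is lost. */
    static List<String> collectWithWhile(List<String> terms) {
        PrimedEnum e = new PrimedEnum(terms);
        List<String> out = new ArrayList<>();
        while (e.next()) {
            out.add(e.term());
        }
        return out;
    }

    /** Reads term() first, then advances: every term is seen. */
    static List<String> collectWithDoWhile(List<String> terms) {
        PrimedEnum e = new PrimedEnum(terms);
        List<String> out = new ArrayList<>();
        if (e.term() == null) {
            return out; // nothing to enumerate
        }
        do {
            out.add(e.term());
        } while (e.next());
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("c0ca", "c0la", "caca", "ccca");
        System.out.println(collectWithWhile(terms));   // [c0la, caca, ccca]
        System.out.println(collectWithDoWhile(terms)); // [c0ca, c0la, caca, ccca]
    }
}
```

With the terms sorted as in the index, the dropped term is the lexicographically smallest match, which would explain why a term starting with a digit after the first letter (c0ca) went missing first.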
Re: WildcardTermEnum skipping terms containing numbers?!
test
Re: WildcardTermEnum skipping terms containing numbers?!
Enumerating the terms using WildcardTermEnum and an IndexReader seems to be too buggy to use. I'm now reimplementing my code using WildcardTermEnum.wildcardEquals, which seems to work better so far.

--- Sanyi [EMAIL PROTECTED] wrote:
> Hi!
>
> I have the following problem with 1.4.2:
> I'm searching for c?ca (using StandardAnalyzer) and one of the hits looks something like this:
>
>     blabla c0ca c0la etc.. etc...
>
> (those big o-s are zero characters)
>
> Now, I'm enumerating the terms using WildcardTermEnum and all I get is:
>
>     caca ccca ceca cica coca crca csca cuca cyca
>
> It doesn't know about c0ca at all.
> Is there any way to overcome this problem?
>
> Thanks,
> Sanyi
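For reference, wildcard matching in the usual sense ('?' for exactly one character, '*' for any run of characters) treats digits like any other character, so c?ca should match c0ca. The following is a minimal, self-contained sketch of such a matcher. WildcardSketch and matches are hypothetical names for illustration; this is not the code behind WildcardTermEnum.wildcardEquals.

```java
// Hypothetical wildcard matcher for illustration; NOT Lucene's implementation.
class WildcardSketch {

    /** Returns true if text matches pattern, where '?' is one char and '*' is any run. */
    static boolean matches(String pattern, String text) {
        return matches(pattern, 0, text, 0);
    }

    private static boolean matches(String p, int pi, String t, int ti) {
        while (pi < p.length()) {
            char c = p.charAt(pi);
            if (c == '*') {
                // Try every possible length (including zero) for the run consumed by '*'.
                for (int k = ti; k <= t.length(); k++) {
                    if (matches(p, pi + 1, t, k)) {
                        return true;
                    }
                }
                return false;
            }
            if (ti >= t.length()) {
                return false; // pattern still needs a character, text is exhausted
            }
            if (c != '?' && c != t.charAt(ti)) {
                return false; // literal mismatch
            }
            pi++;
            ti++;
        }
        // Pattern consumed: match only if the text is consumed as well.
        return ti == t.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("c?ca", "c0ca")); // true
        System.out.println(matches("c?ca", "c0la")); // false
        System.out.println(matches("c??a", "c0la")); // true
    }
}
```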