Hello,
I'm using Lucene to index and search through a collection of Chinese
documents. However, I'm noticing an odd behavior in query parsing/searching.
Given the two queries below:
(Ci refers to Chinese character i)
Input1: C1C2,C3C4,C5C6,C7,C8C9C10
Input2: C1C2 C3C4 C5C6 C7 C8C9C10
Input1 returns absolutely nothing, while Input2 (replacing the commas
with spaces) works as expected. I'm a bit confused why this would be
happening - it seems that QueryParser uses the Analyzer passed to it to
tokenize the input query string, so if the Analyzer ignores the
punctuations, it seems that Input1 and Input2 should return identical
results. Is there some pre-Analyzer filtering or whatever that
QueryParser does? I've tried this with the StandardAnalyzer,
SmartChineseAnalyzer, and an analyzer that I implemented which
explicitly skips over punctuations and whitespaces in tokenizing the
query string, but to no avail.
-------sample code-------------
Analyzer analyzer = new LingPipeAnalyzer();
Searcher searcher = new IndexSearcher(directory);
QueryParser qParser = new MultiFieldQueryParser(Version.LUCENE_30,
SEARCH_FIELDS, analyzer);
Query query = qParser.parse(queryLine[1]);
ScoreDoc[] results = searcher.search(query, TOP_N).scoreDocs;
-----------------------------------
I'm probably just doing something dumb, but any help would be greatly
appreciated!
Thanks,
Wei Ho
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org