Hi all,

I am quite new to the Lucene world and recently started using its Python
wrapper (PyLucene) in my project.
So far, I have been using the token-based querying method, which works fine.

But now I want to modify the querying approach as follows:

   - Given a query string,
   - extract all of its terms (n-grams with n >= 2), and
   - for each term, search the index and return the (top k) documents that
     contain that specific term.

Let's say that the input text is: "search for this !"
And I want to search for each of these sub-strings, separately:
"search for", "for this", "this !", "search for this", "for this !"

I tried using shingles (ShingleAnalyzerWrapper), but I couldn't get the
desired output; you can find my code at the end.
I added line 14 to the code to avoid searching for unigrams, which are of no
interest to me.
It is worth mentioning that for creating both the Indexer and Searcher I
used the following settings:
analyzer = ShingleAnalyzerWrapper(WhitespaceAnalyzer(), 2, 4, ' ', True, False, None)

The point is that if I remove line 14, it does return some documents which
contain the words of the given n-gram, but those words do not necessarily
form one single phrase. For example, if the n-gram is *"search for this"*, it
might return a document like *"please search for that and this"*, which means
it just looked for the unigrams, not the whole string.

Any idea about this issue?

Thanks,
Amin


1    def query(self, queryString):
2        vec = {}
3        idx = 0
4        ret_documents = {}
5        analyzer = ShingleAnalyzerWrapper(WhitespaceAnalyzer(), 2, 4, ' ', True, False, None)
6        ts = self.analyzer.tokenStream("source", StringReader(queryString))
7        termAtt = ts.addAttribute(CharTermAttribute.class_)
8        ts.reset()
9
10       all_grams = []
11
12       while ts.incrementToken():
13           ngram = termAtt.toString()
14           if len(ngram.split()) > 1:
15               all_grams.append(ngram)
16       ts.close()
17
18       for ngram in all_grams:
19           query = BooleanQuery.Builder()
20           query.add(TermQuery(Term("source", ngram)), BooleanClause.Occur.MUST)
21           scoreDocs = self.searcher.search(query.build(), self.max_retriever).scoreDocs
22
23           for scoreDoc in scoreDocs:
24               doc = self.searcher.doc(scoreDoc.doc)
25               ret_documents[idx] = [doc.get("id"), scoreDoc.score, doc.get("source")]
26               idx += 1
27
28       print "RET_DOCS: ", ret_documents
