Thanks, Murat.
It was very useful - I also tried overriding IndexWriter and
DocumentsWriter instead, but that didn't work well; DocumentsWriter can't
be overridden. So I didn't find a better way to make the changes.
What I need is a different value for each term in each document. So, just
as you set the boost at the document level, I would like to set the boost
for individual terms within different documents.
To that end, I made some changes to the code you sent (I coloured the
changes in red):
Below you can find an example of its use.
**********
private class PayloadAnalyzer extends Analyzer
{
private PayloadTokenStream payToken = null;
private int score;
*private Map<String, Integer> scoresMap = new HashMap<String, Integer>();*
public synchronized void setScore(int s)
{
score = s;
}
* public synchronized void setMapScores(Map<String, Integer> scoresMap)
{
this.scoresMap = scoresMap;
}*
public final TokenStream tokenStream(String field, Reader reader)
{
payToken = new PayloadTokenStream(new WhitespaceTokenizer(reader)); //was: new LowerCaseTokenizer(reader)
payToken.setScore(score);
payToken.setMapScores(scoresMap);
return payToken;
}
}
private class PayloadTokenStream extends TokenStream
{
private Tokenizer tok = null;
private int score;
*private Map<String, Integer> scoresMap = new HashMap<String, Integer>();*
public PayloadTokenStream(Tokenizer tokenizer)
{
tok = tokenizer;
}
public void setScore(int s)
{
score = s;
}
* public synchronized void setMapScores(Map<String, Integer> scoresMap)
{
this.scoresMap = scoresMap;
}*
public Token next(Token t) throws IOException
{
t = tok.next(t);
if(t != null)
{
//t.setTermBuffer("can change");
//Look up this term's score and store it as a single payload byte
* String word = String.copyValueOf(t.termBuffer(), 0, t.termLength());
Integer termScore = scoresMap.get(word);
if(termScore != null)
{
// note: only scores in the range -128..127 fit in one byte
t.setPayload(new Payload(new byte[] { termScore.byteValue() }));
}*
}
return t;
}
public void reset(Reader input) throws IOException
{
tok.reset(input);
}
public void close() throws IOException
{
tok.close();
}
}
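The score-to-byte conversion in next() is worth keeping in mind: a single payload byte can only hold scores in -128..127, and larger values would silently wrap. Here is a minimal plain-Java sketch (no Lucene needed; the class name PayloadByteCodec is just for illustration) of the encode/decode round trip with a range guard:

```java
// Sketch of the payload encoding used above: one signed byte per term score.
public class PayloadByteCodec {

    // Pack an int score into a single-byte payload.
    // Scores outside the signed-byte range would wrap, so reject them.
    static byte[] encode(int score) {
        if (score < Byte.MIN_VALUE || score > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("score does not fit in one byte: " + score);
        }
        return new byte[] { (byte) score };
    }

    // Recover the score from the payload bytes at search time.
    static int decode(byte[] payload) {
        return payload[0];
    }

    public static void main(String[] args) {
        byte[] payload = encode(3);
        System.out.println(decode(payload)); // prints 3
    }
}
```

If you ever need scores beyond one byte, the payload can of course hold more bytes; the single-byte form just keeps the index small.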
**********************************
*Example for the use of payloads:*
PayloadAnalyzer panalyzer = new PayloadAnalyzer();
File index = new File("TestSearchIndex");
IndexWriter iwriter = new IndexWriter(index, panalyzer);
Document d = new Document();
d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
Field.Index.TOKENIZED, Field.TermVector.YES));
d.add(new Field("id", "1^3", Field.Store.YES, Field.Index.UN_TOKENIZED,
Field.TermVector.NO));
Map<String, Integer> mapScores = new HashMap<String, Integer>();
mapScores.put("word1", 3);
mapScores.put("word2", 1);
mapScores.put("word3", 1);
panalyzer.setMapScores(mapScores);
iwriter.addDocument(d, panalyzer);
d = new Document();
d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
Field.Index.TOKENIZED, Field.TermVector.YES));
d.add(new Field("id", "2^3", Field.Store.YES, Field.Index.UN_TOKENIZED,
Field.TermVector.NO));
//We set the per-term scores for the document that will be analyzed.
/*I was worried about this part - document-dependent score which may be utilized*/
mapScores = new HashMap<String, Integer>();
mapScores.put("word1", 1);
mapScores.put("word2", 3);
mapScores.put("word3", 1);
panalyzer.setMapScores(mapScores);
iwriter.addDocument(d, panalyzer);
/*-----------------*/
// iwriter.commit();
iwriter.optimize();
iwriter.close();
BooleanQuery bq = new BooleanQuery();
BoostingTermQuery tq = new BoostingTermQuery(new Term("text", "word1"));
tq.setBoost((float) 1.0);
bq.add(tq, BooleanClause.Occur.MUST);
tq = new BoostingTermQuery(new Term("text", "word2"));
tq.setBoost((float) 3);
bq.add(tq, BooleanClause.Occur.SHOULD);
tq = new BoostingTermQuery(new Term("text", "word3"));
tq.setBoost((float) 1);
bq.add(tq, BooleanClause.Occur.SHOULD);
IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
searcher1.setSimilarity(new WordsSimilarity());
TopDocs topDocs = searcher1.search(bq, null, 3);
Hits hits1 = searcher1.search(bq);
for(int j = 0; j < hits1.length(); j++)
{
Explanation explanation = searcher1.explain(bq, j);
System.out.println("**** " + hits1.score(j) + " " +
hits1.doc(j).getField("id").stringValue() + " *****");
System.out.println(explanation.toString());
explanation.getValue();
System.out.println("********************************************************");
System.out.println("Score " + topDocs.scoreDocs[j].score + " doc " +
searcher1.doc(topDocs.scoreDocs[j].doc).getField("id").stringValue());
System.out.println("********************************************************");
}
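WordsSimilarity isn't shown above; presumably it extends Lucene's Similarity and overrides scorePayload so the stored byte feeds back into the score at query time. Since that override needs the Lucene classes, here is only the payload-to-boost mapping itself as a runnable plain-Java sketch (the class name PayloadBoost and the neutral-boost fallback are my assumptions):

```java
public class PayloadBoost {
    // In a custom Similarity (e.g. the WordsSimilarity used above),
    // scorePayload would receive the payload bytes for a term occurrence
    // and return a float boost. This is just that mapping, free of Lucene types.
    static float scoreFromPayload(byte[] payload, int offset, int length) {
        if (payload == null || length == 0) {
            return 1.0f; // no payload stored: use a neutral boost
        }
        return (float) payload[offset]; // the single byte written at index time
    }

    public static void main(String[] args) {
        System.out.println(scoreFromPayload(new byte[] { 3 }, 0, 1)); // 3.0
        System.out.println(scoreFromPayload(null, 0, 0));             // 1.0
    }
}
```

Whatever WordsSimilarity actually does, the key point is that the byte written by the analyzer and the byte read in scorePayload must use the same encoding.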
If you run the same query with different boosts, you will get a different
ordering of the documents.
Does it look OK?
Thanks again!
Liat
2009/4/25 Murat Yakici <[email protected]>
>
>
> Here is what I am doing, not so magical... There are two classes, an
> analyzer and a TokenStream into which I can inject my document-dependent
> data to be stored as a payload.
>
>
> private PayloadAnalyzer panalyzer = new PayloadAnalyzer();
>
> private class PayloadAnalyzer extends Analyzer {
>
> private PayloadTokenStream payToken = null;
> private int score;
>
> public synchronized void setScore(int s) {
> score=s;
> }
>
> public final TokenStream tokenStream(String field, Reader reader) {
> payToken = new PayloadTokenStream(new LowerCaseTokenizer(reader));
> payToken.setScore(score);
> return payToken;
> }
> }
>
> private class PayloadTokenStream extends TokenStream {
>
> private Tokenizer tok = null;
> private int score;
>
> public PayloadTokenStream(Tokenizer tokenizer) {
> tok = tokenizer;
> }
>
> public void setScore(int s) {
> score = s;
> }
>
> public Token next(Token t) throws IOException {
> t = tok.next(t);
> if (t != null) {
> //t.setTermBuffer("can change");
> //Do something with the data
> byte[] bytes = ("score:"+ score).getBytes();
> t.setPayload(new Payload(bytes));
> }
> return t;
> }
>
> public void reset(Reader input) throws IOException {
> tok.reset(input);
> }
>
> public void close() throws IOException {
> tok.close();
> }
> }
>
>
> public void doIndex() {
> try {
> File index = new File("./TestPayloadIndex");
> IndexWriter iwriter = new IndexWriter(index,
> panalyzer,
> IndexWriter.MaxFieldLength.UNLIMITED);
>
> Document d = new Document();
> d.add(new Field("content",
> "Everyone, someone, myTerm, yourTerm", Field.Store.YES,
> Field.Index.ANALYZED, Field.TermVector.YES));
> //We set the score for the term of the document that will be analyzed.
> /*I was worried about this part - document-dependent score which may be utilized*/
> panalyzer.setScore(5);
> iwriter.addDocument(d, panalyzer);
> /*-----------------*/
> ...
> iwriter.commit();
> iwriter.optimize();
> iwriter.close();
>
> //Now read the index
> IndexReader ireader = IndexReader.open(index);
> TermPositions tpos = ireader.termPositions(
> new Term("content", "myterm")); //Note LowerCaseTokenizer
> while (tpos.next()) {
> int pos;
> for(int i=0;i<tpos.freq();i++){
> pos=tpos.nextPosition();
> if (tpos.isPayloadAvailable()) {
> byte[] data = new byte[tpos.getPayloadLength()];
> tpos.getPayload(data, 0);
> //Utilise payloads;
> }
> }
> }
>
> tpos.close();
> } catch (CorruptIndexException ex) {
> //
> } catch (LockObtainFailedException ex) {
> //
> } catch (IOException ex) {
> //
> }
> }
>
> I wish it was designed better... Please let me know if you guys have a
> better idea.
>
> Cheers,
> Murat
>
> > Dear Murat,
> >
> > I saw your question and wondered how you implemented these changes.
> > The requirements below are the same ones I am trying to code now.
> > Did you modify the source code itself, or did you only use Lucene's jar
> > and override code?
> >
> > I would very much appreciate it if you could give me a short explanation
> > of how it was done.
> >
> > Thanks a lot,
> > Liat
> >
> > 2009/4/21 Murat Yakici <[email protected]>
> >
> >> Hi,
> >> I started playing with the experimental payload functionality. I have
> >> written an analyzer which adds a payload (some sort of a score/boost)
> >> for each term occurrence. The payload/score for each term depends on
> >> the document that the term comes from (I guess this is the typical use
> >> case). So, say, term t1 may have a payload of 5 in doc1 and 34 in doc5.
> >> The parameter for calculating the payload changes after each
> >> indexWriter.addDocument(..) call in a while loop. I am assuming that
> >> the indexWriter.addDocument(..) methods are thread safe. Can I confirm
> >> this?
> >>
> >> Cheers,
> >>
> >> --
> >> Murat Yakici
> >> Department of Computer & Information Sciences
> >> University of Strathclyde
> >> Glasgow, UK
> >> -------------------------------------------
> >> The University of Strathclyde is a charitable body, registered in
> >> Scotland,
> >> with registration number SC015263.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
> >>
> >
>