Hi, Rupinder,
Our algorithm is a little different from what PubMed does. We have scoring
for each mesh term, which will affect the search result.
What do you think the difference would be for these two:
document.addField(Field.Keyword("mesh", "xxxx"));
and
document.addField( new Field( "mesh", "xxxx", Field.Store.YES ,
Field.Index.TOKENIZED );
Thank you,
Xin
----- Original Message -----
From: "Rupinder Singh Mazara" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Friday, August 25, 2006 11:27 AM
Subject: Re: controlled vocabulary
Hi Xin
then perhaps you can change it to Field.Index.TOKENIZED, but i was not
aware that pubmed boosts mesh terms, they broadly classify terms as major
and minor, if you plan to use this simple system of classification
consider adding the major terms twice to the document ?
Zhao, Xin wrote:
Hi, Rupinder,
My understanding is Field.Index.NO_NORMS disables index-time boosting
and field length normalization at the same time. But I do need index-time
boosting to store the scoring of each mesh term. Have I missed anything?
Thank you very much for your help,
Xin
----- Original Message ----- From: "Rupinder Singh Mazara"
<[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Friday, August 25, 2006 10:49 AM
Subject: Re: controlled vocabulary
hi Xin
this is take a look at this you can add multiple fields with the name
mesh
for ( i=0; i< meshList.size() ; i++ ){
meshTerm = meshList.get(i)
document.addField( new Field( "mesh", meshTerm.semanticWebConceptId,
Field.Store.YES , Field.Index.NO_NORMS );
}
when querying this index, create a analyzer that infers the text string
and generates id's that correspond to the mesh term in the semantic web
Zhao, Xin wrote:
Hi,
Thank you for your reply. I had thought about the first two solutions
before. If we apply one doc for each MeSH term, it would be 26 docs for
each item digested(we actually need the top 25 MeSH terms generated),
would it be any problem if there are too many documents? If we apply
field name like "mesh_1", "mesh_2"..., when it comes to search, we will
have to generate a loop for each single one of the query terms( there
will be more than 20-30 terms on average, since we are using sematic
web to implement concept search), do you think it would affect the
performance in a very bad way?
Regards,
Xin
----- Original Message ----- From: "Dedian Guo" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>; "Zhao, Xin" <[EMAIL PROTECTED]>
Sent: Thursday, August 24, 2006 4:22 PM
Subject: Re: controlled library
in my solution, you can apply one doc for each mesh term, or apply
different
keyword such as "mesh_1"...."mesh_10" for your top 10 terms...or u can
group
your mesh terms as one string then add into a field, which requires a
simple
string parser for the group string when you wanna read the terms...
not sure if that works or answers your question...
On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:
Hi,
I have a design question. Here is what we try to do for indexing:
We designed an indexing tool to generate standard MeSH terms from
medical
citations, and then use Lucene to save the terms and citations for
future
search. The information we need to save are:
a) the exact mesh terms (top 10)
b) the score for each term
so the codings are like
-----------------------------------
for the top 10 MeSH Terms
myField=Field.Keyword("mesh", mesh.toLowerCase());
myField.setBoost(score);
doc.add(myFiled);
end for
------------------------------------
as you could see we generate all the terms under named field "mesh".
If I
understand correctly, all the fields under the same name would
eventually save into one field, with all the scores be normalized
into
filed boost. In this case, we wouldn't be able to save separate
score, so
the information is lost. Am I correct? Is there anyway we could
change it? I
understand Lucene is for keyword search, and what we try to do is
Controlled
Vocabulary search, Any other tool we could use?
Thank you,
Xin
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]