Re: controlled vocabulary

Dedian Guo Fri, 25 Aug 2006 14:38:24 -0700

Hi, Xin, in my understanding , the document in Lucene is a term of
collection of fields, while a field is pair of keyword and value, tough it
can be indexed or stored or both. That is plain structure. if you wanna
index a deep tree structure such as complex objects and keep those
relationship inside, i guess we need do some tricky of that. so in my
mentioned solution, i will do something on the keyword of a document(here, a
document represent a object...) . the score problem you mentioned in your
question is similar, i mean, score is actually an attribute of mesh object,
so u wanna index the information which has a tree-like structure (i met the
similar problem when i indexing xml-based pages. esp. for those have lots of
deep element nodes, deep index needed for deep searching).


correct me if i was wrong or there are some better solutions...

On 8/25/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:


now. i have a second thought about one meah term per document. the scoring
formula(hits too) is based on document, right? does it mean that we
shouldn't have more than one document for each object indexed?
for example, i try to index a publication,  for some of the information,
like title, abstract i would like to store and index them using default
similarity, while the other information i would like to use customized
similarity. i probably should use a different indexing directory and
writer
instead of two documents in the same index, right?
thank you for helping me. you could see that i am in the early learning
stage now.
xin



----- Original Message -----
From: "Zhao, Xin" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, August 25, 2006 10:21 AM
Subject: Re: controlled vocabulary


> Hi,
> Thank you for your reply. I had thought about the first two solutions
> before. If we apply one doc for each MeSH term, it would be 26 docs for
> each item digested(we actually need the top 25 MeSH terms generated),
> would it be any problem if there are too many documents? If we apply
field
> name like "mesh_1", "mesh_2"..., when it comes to search, we will have
to
> generate a loop for each single one of the query terms( there will be
more
> than 20-30 terms on average, since we are using sematic web to implement
> concept search), do you think it would affect the performance in a very
> bad way?
> Regards,
> Xin
>
>
> ----- Original Message -----
> From: "Dedian Guo" <[EMAIL PROTECTED]>
> To: <[email protected]>; "Zhao, Xin" <[EMAIL PROTECTED]>
> Sent: Thursday, August 24, 2006 4:22 PM
> Subject: Re: controlled library
>
>
>> in my solution, you can apply one doc for each mesh term, or apply
>> different
>> keyword such as "mesh_1"...."mesh_10" for your top 10 terms...or u can
>> group
>> your mesh terms as one string then add into a field, which requires a
>> simple
>> string parser for the group string when you wanna read the terms...
>>
>> not sure if that works or answers your question...
>>
>> On 8/24/06, Zhao, Xin <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>> I have a design question. Here is what we try to do for indexing:
>>> We designed an indexing tool to generate standard MeSH terms from
>>> medical
>>> citations, and then use Lucene to save the terms and citations for
>>> future
>>> search. The information we need to save are:
>>> a) the exact mesh terms (top 10)
>>> b) the score for each term
>>> so the codings are like
>>> -----------------------------------
>>> for the top 10 MeSH Terms
>>> myField=Field.Keyword("mesh", mesh.toLowerCase());
>>> myField.setBoost(score);
>>> doc.add(myFiled);
>>> end for
>>> ------------------------------------
>>> as you could see we generate all the terms under named field "mesh".
If
>>> I
>>> understand correctly, all the fields under the same name would
>>> eventually  save into one field, with all the scores be normalized
into
>>> filed boost. In this case, we wouldn't be able to save separate score,
>>> so
>>> the information is lost. Am I correct? Is there anyway we could change
>>> it? I
>>> understand Lucene is for keyword search, and what we try to do is
>>> Controlled
>>> Vocabulary search, Any other tool we could use?
>>>
>>> Thank you,
>>> Xin
>>>
>>>
>>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: controlled vocabulary

Reply via email to