Vector space implementation

Andy Thu, 09 Apr 2009 01:51:20 -0700

Hello all,

I'm new to Lucene and am trying to implement a vector space model with it. I 
need a file (or an in-memory structure) holding the TF/IDF weight of each term 
in each document. (In effect, that is a matrix in which each document is 
represented as a vector whose elements are the TF weights of its terms ...)
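
For readers following this thread, the matrix described above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use Lucene at all, the class and method names (TfIdfSketch, termCounts, vocabulary, tfIdfMatrix) are hypothetical, the documents are assumed to be tokenized already, and the weight is the textbook tf * ln(N / df), which is not necessarily the formula Lucene's own scoring applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper class, not part of Lucene.
final class TfIdfSketch {

    /** Counts how often each term occurs in one tokenized document. */
    static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the sorted vocabulary; its order defines the matrix columns. */
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> vocab = new TreeSet<>();
        for (List<String> doc : docs) {
            vocab.addAll(doc);
        }
        return new ArrayList<>(vocab);
    }

    /** Builds matrix[d][t] = tf(t, d) * ln(N / df(t)); one row per document. */
    static double[][] tfIdfMatrix(List<List<String>> docs, List<String> vocab) {
        int n = docs.size();
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            Map<String, Integer> c = termCounts(doc);
            counts.add(c);
            // Each distinct term in a document adds one to its document frequency.
            for (String term : c.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        double[][] matrix = new double[n][vocab.size()];
        for (int d = 0; d < n; d++) {
            for (int t = 0; t < vocab.size(); t++) {
                String term = vocab.get(t);
                int tf = counts.get(d).getOrDefault(term, 0);
                if (tf > 0) {
                    matrix[d][t] = tf * Math.log((double) n / docFreq.get(term));
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "vector", "lucene"),
                Arrays.asList("vector", "space"));
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        for (double[] row : tfIdfMatrix(docs, vocab)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With a real index, the raw term counts and document frequencies would presumably come from Lucene itself (for example from stored term vectors) rather than from in-memory token lists; that part is left out here because it depends on the Lucene version in use.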


Please help me with this.
Contact me at andykan1...@yahoo.com if you need any further information.
Many thanks

--- Begin Message ---

--- On Thu, 4/9/09, John Byrne <john.by...@propylon.com> wrote:

From: John Byrne <john.by...@propylon.com>
Subject: Re: query c++
To: java-user@lucene.apache.org
Date: Thursday, April 9, 2009, 12:57 PM

Hi,

This came up before, a while ago: 
http://www.nabble.com/searching-for-C%2B%2B-to18093942.html#a18093942

I don't think there is an easier way than modifying the standard 
analyzer. As I suggested in that earlier thread, I would make the 
analyzer recognize token patterns that consist of words with prefixed or 
postfixed symbols.[1] Then you will receive tokens like "c++" or 
"~/.file" in your token filter. You can then choose to pass them as 
single tokens, or split them down further into two or more tokens.
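
The token pattern described above can be illustrated outside Lucene with a plain regular expression. This is a minimal sketch, not the StandardAnalyzer grammar: the class name SymbolTokenizerSketch, the prefix symbol set (~ / .) and the postfix symbol set (+ #) are assumptions chosen only to cover the "c++", "c#" and "~/.file" examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone tokenizer, not a Lucene class.
final class SymbolTokenizerSketch {

    // Optional prefix symbols, then letters/digits, then optional postfix symbols.
    // Both symbol sets are assumptions made for this illustration.
    private static final Pattern TOKEN =
            Pattern.compile("[~/.]*[\\p{L}\\p{N}]+[+#]*");

    /** Splits text into lower-cased tokens, keeping attached symbols. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Searching for C++ and c#, in ~/.file"));
    }
}
```

A downstream filter could then decide, token by token, whether to keep "c++" whole or to split it further, which is the choice described above.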

-John

[1] If you decide to try matching words with symbols in the middle, be 
aware that the StandardAnalyzer already handles some examples of this, 
such as e-mail addresses, so you may make something redundant.

??? wrote:
> In more detail: I implemented an FTP search engine for campus students. I
> have to handle many different kinds of words, including Chinese words, so I
> can't use only WhitespaceAnalyzer. My analyzer currently looks like this:
>
>     StandardTokenizer tokenStream =
>         new StandardTokenizer(reader, replaceInvalidAcronym);
>     tokenStream.setMaxTokenLength(maxTokenLength);
>     TokenStream result = new StandardFilter(tokenStream);
>     result = new LowerCaseFilter(result);
>     result = new StopFilter(result, stopSet);
>     result = new SnowballFilter(result,STEMMER);
>
> I modified StandardTokenizer to split words like "season09" (as in a search
> for "friends season 09") into "season" and "09".
> The word "c++" is analyzed as "c".
>
> I know I can modify StandardTokenizer to achieve my goal, but is there
> a neater way?
>
> 2009/4/9 hyj <hongyin...@163.com>
>
>   
>> Hello Weiwei Wang,
>>
>>        WhitespaceAnalyzer can work.
>>
>> ======= 2009-04-09 15:15:14 you wrote: =======
>>
>>     
>>> I want my Lucene index to be searchable for words like c++ and c#. How can I
>>> modify my analyzer to achieve this goal?
>>>
>>> --
>>> Weiwei Wang
>>> Department of Computer Science
>>> Gulou Campus of Nanjing University
>>> Nanjing, P.R.China, 210093
>>>
>>> Mobile: 86-13913310569
>>> MSN: ww.wang...@gmail.com
>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>>>       
>> = = = = = = = = = = = = = = = = = = = =
>>
>>
>> Regards,
>>
>>
>> hyj
>> hongyin...@163.com
>> 2009-04-09
>>
>>
>>     
>
>
>   
> ------------------------------------------------------------------------
>
>
> No virus found in this incoming message.
> Checked by AVG - www.avg.com 
> Version: 8.0.238 / Virus Database: 270.11.48/2048 - Release Date: 04/08/09 
> 19:02:00
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

--- End Message ---
