Small Vocabulary

2012-07-30 Thread Carsten Schnober
Dear list,
I'm considering using Lucene to index sequences of part-of-speech
(POS) tags instead of words; for those who don't know, POS tags are
linguistically motivated labels that are assigned to tokens (words) to
describe their morpho-syntactic function. Instead of sequences of words,
I would like to index sequences of tags, for instance "ART ADV ADJA NN".
The aim is to be able to search (efficiently) for occurrences of "ADJA".

The question is whether Lucene can handle such data well, because the
statistical properties of these pseudo-texts are very different from
those of natural-language texts, which makes me wonder whether Lucene's
inverted indexes are suitable. The small vocabulary size in particular
(<50 distinct tokens, depending on the tagging system) is problematic,
I suppose.

In first trials, I implemented an analyzer that simply outputs Lucene
tokens such as "ART", "ADV", "ADJA", etc. On a test corpus of a few
million tokens, search performance was less than ideal. The number of
tokens in production is expected to be much larger, so I wonder whether
this approach is promising at all.
Does Lucene (4.0?) provide optimization techniques for extremely small
vocabulary sizes?

Thank you very much,
Carsten Schnober


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
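
To illustrate the concern raised in the message above, here is a minimal
sketch of a positional inverted index over tag sequences in plain Python.
It is not Lucene code and does not reflect how Lucene stores postings; the
names build_index and find_tag and the three sample sequences are
hypothetical. It only shows why a tiny vocabulary is awkward: with very few
distinct tags, each posting list holds a large share of all token
positions, so a single-tag lookup has to return (and walk) a long list.

```python
from collections import defaultdict


def build_index(docs):
    """Map each tag to its postings: a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, tag in enumerate(text.split()):
            index[tag].append((doc_id, position))
    return index


def find_tag(index, tag):
    """Return every (doc_id, position) at which the tag occurs."""
    return index.get(tag, [])


# Hypothetical sample data: each "document" is a whitespace-separated tag sequence.
docs = ["ART ADV ADJA NN", "ART ADJA NN VVFIN", "ART NN"]
index = build_index(docs)

hits = find_tag(index, "ADJA")

# Share of all token positions covered by the posting list of one tag.
total_tokens = sum(len(postings) for postings in index.values())
art_share = len(index["ART"]) / total_tokens

print(hits)
print(f"distinct tags: {len(index)}, tokens: {total_tokens}, ART share: {art_share:.2f}")
```

In a natural-language index most terms are rare and their posting lists are
short; here every list grows roughly in proportion to the corpus size.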

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Document Similarity

2012-07-30 Thread in.abdul
Hi Elshaimaa,
  I could not understand what you need. Could you please explain your
use case?

  If your use case is "I need to use Lucene to find the most similar
documents from the generated index",
then try the MoreLikeThis[1] component.

Once your use case is clear, people can suggest good approaches.



[1] http://wiki.apache.org/solr/MoreLikeThis




Thanks and Regards,
S SYED ABDUL KATHER



On Mon, Jul 30, 2012 at 7:30 PM, Elshaimaa Ali [via Lucene] <
ml-node+s472066n3998082...@n3.nabble.com> wrote:

>
> Hi All,
> I created a Lucene index for over 3 million documents, and I used term
> vectors when creating the index. Now, for an external document, I need to
> use Lucene to find the most similar documents from the generated index.
> How can I process the document to generate a term vector for it, and what
> search technique can I use to map the document to one of the documents in
> the index?
> Regards,
> Shaimaa




View this message in context: 
http://lucene.472066.n3.nabble.com/Document-Similarity-tp3998082p3998095.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

RE: Document Similarity

2012-07-30 Thread Elshaimaa Ali

Thank you so much for the prompt reply.
I need to retrieve from the index a document that is similar to an HTML
document, and I need to use cosine similarity or latent semantic analysis,
which means that I need to generate a term vector for the HTML document.
The link you sent me doesn't contain any code.
Any help will be greatly appreciated.
Regards,
Shaimaa
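
For readers following the thread, the request above can be sketched in
plain Python: strip the markup from the HTML document, build a
term-frequency vector from the remaining text, and rank the indexed
documents by cosine similarity. This is only a sketch under stated
assumptions, not Lucene's API and not what MoreLikeThis does internally.
The two "indexed" documents, the helper names, and the use of raw term
counts (no tf-idf weighting, no stemming or stop words) are all
assumptions made for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and drop the tags.

    Note: text inside <script> or <style> would also be collected.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def term_vector(text):
    """Raw term-frequency vector: lowercase alphanumeric tokens and their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def html_term_vector(html):
    """Term vector of the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return term_vector(" ".join(parser.parts))


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical stand-in for term vectors read back from an index.
indexed = {
    "doc1": term_vector("lucene builds an inverted index for search"),
    "doc2": term_vector("cooking pasta with tomato sauce"),
}

query = html_term_vector(
    "<html><body><h1>Lucene</h1><p>inverted index search</p></body></html>"
)
ranked = sorted(indexed, key=lambda d: cosine(query, indexed[d]), reverse=True)
print(ranked)
```

Comparing the query against every stored vector is linear in the number of
documents, so for an index of 3 million documents one would normally first
narrow the candidates with an ordinary term query and only then rescore.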


Re: Document Similarity

2012-07-30 Thread in.abdul
I understand your need now. You could use k-means clustering in Mahout,
which may help with your use case. It would be better to post this question
on the Mahout user list, where you will get more ideas. I had a similar use
case myself, which I implemented as a proof of concept (POC), but my
suggestion is still to post the question there.

Syed abdul kather
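
As a rough picture of the k-means suggestion above, the following is a
minimal sketch of the basic algorithm (Lloyd's iteration) in plain Python.
It does not use Mahout and says nothing about Mahout's API. The function
name, the two-dimensional sample points (standing in for document term
vectors), the hand-picked starting centroids, and the use of Euclidean
distance are all assumptions made for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        centroids = [
            # Keep the old centroid if no point was assigned to it.
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical 2-dimensional "document vectors" forming two obvious groups.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
centroids, clusters = kmeans(points, centroids=[points[0], points[2]])
print(clusters)
```

Clustering groups the indexed documents with each other; to place an
external document, one would compare its vector against the resulting
centroids.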



