Re: Lexicon access questions

2006-06-03 Thread eks dev
Thanks Chuck,

Let me try to explain with an example or two.

Use case one:

Documents:
D1 == John Doe
D2 == sky scraper
D3 == blue sky LTD

Imagine the name John is ultra frequent (= low IDF weight)
and sky is very rare (= very high weight).

So Query:
Q: sky john

will give the order:
D2, D3, D1

Also imagine that I know (from external knowledge) that John is a personal
name, and that its importance in the Similarity calculation should be
corrected by some boost because of this.

So what I do today is look the token up in a dictionary Map where I attach a
boost to it (reformulating the query to sky john^250).
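
To make the effect concrete, here is a minimal sketch in plain Python (not
Lucene code). It uses a simplified TF-IDF score, not Lucene's actual scoring.
The collection size, the document frequencies and the BOOSTS map are invented
for illustration, and idf, score and rank are hypothetical helper names.

```python
import math

# Toy corpus from the example above.
DOCS = {
    "D1": "John Doe",
    "D2": "sky scraper",
    "D3": "blue sky LTD",
}

# Assumed statistics: "john" is ultra frequent, "sky" is rare.
# These numbers are made up for illustration only.
NUM_DOCS = 1_000_000
DOC_FREQ = {"john": 200_000, "sky": 50}

# Hypothetical external dictionary mapping tokens to query-time boosts.
BOOSTS = {"john": 250.0}


def idf(term):
    """Rare terms get a high weight, frequent terms a low one."""
    return 1.0 + math.log(NUM_DOCS / (DOC_FREQ.get(term, 1) + 1))


def score(query_terms, text, boosts=None):
    """Simplified TF-IDF score with a length norm and optional term boosts."""
    tokens = text.lower().split()
    length_norm = 1.0 / math.sqrt(len(tokens))
    total = 0.0
    for term in query_terms:
        tf = math.sqrt(tokens.count(term))
        boost = (boosts or {}).get(term, 1.0)
        total += tf * idf(term) * boost * length_norm
    return total


def rank(query, boosts=None):
    """Return document ids ordered by descending score."""
    terms = query.lower().split()
    return sorted(DOCS, key=lambda d: score(terms, DOCS[d], boosts), reverse=True)
```

With these assumed numbers, rank("sky john") gives D2, D3, D1, and
rank("sky john", BOOSTS) moves D1 to the top, which is what the reformulated
query sky john^250 is meant to achieve.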

What I was proposing is to be able to attach this boost (in practice an IDF
correction for some tokens) to the tokens during indexing. With this, I could
spare one lookup in a memory-hungry dictionary and the reformulation of the
query.
This example is just an introduction to the idea. It is over-simplified and
can be solved by indexing the same token many times at the same position.
With such a mechanism in place, things like SweetSpotSimilarity could be done
as an optional offline task (adjusting the IDF curve).

The second problem, storing semantic TAGS per token, looks definitely doable
with your proposal, but I am having problems comprehending all the tricky
details (performance impact and expressive power), as I have never tried those
parts of Lucene.
The question: since we access the Term from the Lexicon anyhow for searching
purposes (postings offset, freq), would it not be faster to attach this TAG
info to the Term?
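
A minimal sketch of what I mean, in plain Python rather than Lucene code. The
LexiconEntry fields, the example values and the lookup helper are all
hypothetical; the only point is that one access to the term entry returns the
posting information together with the tag and the boost.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LexiconEntry:
    """Everything known about one term, fetched with a single lookup."""
    doc_freq: int               # number of documents containing the term
    postings_offset: int        # where the postings start (made-up values)
    tag: Optional[str] = None   # semantic tag, e.g. "FIRST_NAME"
    boost: float = 1.0          # per-term IDF correction


# Hypothetical in-memory lexicon; the numbers are invented.
LEXICON: Dict[str, LexiconEntry] = {
    "john": LexiconEntry(doc_freq=200_000, postings_offset=0,
                         tag="FIRST_NAME", boost=250.0),
    "sky": LexiconEntry(doc_freq=50, postings_offset=4096),
}


def lookup(term: str) -> Optional[LexiconEntry]:
    """One access path: posting info, tag and boost come back together."""
    return LEXICON.get(term.lower())
```

Here lookup("John") returns the posting offset, the tag FIRST_NAME and the
boost in one go, instead of one lookup in the lexicon and a second one in a
separate dictionary.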
 

The third issue I only briefly mentioned. The use case where the Lexicon can
be loaded completely into memory (not an unusual case these days) gives us
some room to play with FuzzyQuery and make it really useful in terms of speed.
I guess there could also be other implementations that work on disk as well.

We currently deal with a collection of ca. 50 million (short) documents, and
all terms fit nicely in memory in a TernarySearchTree that allows us to issue
Term lookups of the form "give me all Terms that are at most N edits away";
then we run our hand-tuned Needleman-Wunsch (with different costs for
substitutions, as in hitec vs hitek...). I would say it is a nice feature for
people with reasonably sized collections. A better way of doing it would be to
have the possibility for our implementation of the Dictionary to implement a
Lucene interface Lexicon, which would provide Lucene with the postings offset
or whatever else Lucene needs when you search for a Term.
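
A rough sketch of the fuzzy lookup described above, in plain Python. It is not
our production code: a linear scan over a term list stands in for the
TernarySearchTree, and the substitution costs in CHEAP_SUBS are invented. The
names weighted_distance and fuzzy_terms are hypothetical.

```python
def weighted_distance(a, b, sub_cost=None, gap_cost=1.0):
    """Global alignment cost between a and b (Needleman-Wunsch style DP).

    sub_cost maps character pairs to a substitution cost; pairs that are
    not listed cost 1.0, identical characters cost 0.0.
    """
    sub_cost = sub_cost or {}
    prev = [j * gap_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                s = 0.0
            else:
                s = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            curr.append(min(prev[j - 1] + s,          # substitute or match
                            prev[j] + gap_cost,       # delete from a
                            curr[j - 1] + gap_cost))  # insert into a
        prev = curr
    return prev[-1]


# Assumed costs: c and k are easily confused, so swapping them is cheap.
CHEAP_SUBS = {("c", "k"): 0.2}


def fuzzy_terms(query, terms, max_cost, sub_cost=None):
    """Return (cost, term) pairs within max_cost, cheapest first."""
    scored = [(weighted_distance(query, t, sub_cost), t) for t in terms]
    return sorted((c, t) for c, t in scored if c <= max_cost)
```

With these costs, hitec vs hitek scores 0.2 instead of 1.0, so it ranks ahead
of candidates that need a full insertion or an unrelated substitution.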


Lucene today is great; this is just a "could we do better?", not a "can
someone scratch my itch?".










----- Original Message -----
From: Chuck Williams [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, 1 June, 2006 7:05:27 PM
Subject: Re: Lexicon access questions

This approach comes to mind. You could model your semantic tags as
tokens and index them at the same positions as the words or phrases to
which they apply. This is particularly easy if you can integrate your
taggers with your Analyzer. You would probably want to create one or
more new Query subclasses to facilitate certain types of matching,
making it easy to associate terms/phrases with different tags (e.g.,
OverlappingQuery). This approach would support generation of queries
that are tag-dependent, but would not directly help using tags in a
ranking algorithm for tag-independent queries. As an off-hand thought,
you might be able to extend the idea to support this by naming your tags
something like TERM_TAG where TERM is the term they apply to (best if
the character used for '_' cannot occur in any term). Then something
like a TaggedTermQuery could easily find the tags relevant to a term in
the query and iterate their docs/positions in parallel with those of the
term (roughly equivalent to OverlappingQuery(term, PrefixQuery(term_*))).

Top-of-mind thoughts,

Chuck



Lexicon access questions

2006-06-01 Thread eks dev

We have faced the following use case:

In order to optimize performance and, more importantly, the quality of search
results, we are forced to attach more attributes to particular words (Terms).
Generic attributes like TF and IDF are useful for modelling our similarity
only up to a point.

Examples:
1. Whether a Term is a first or last name (e.g. we have a comprehensive list
of such words). This enables us to build smarter (faster and better) queries
in case someone has multiple first names, it influences ranking...
2. The agreement weight and disagreement weight of some words are modelled
differently.
3. Semantic classes of words influence ranking (whether something is a verb
or a noun changes the search strategy and ranking radically).

On top of that, we can afford to load all terms into memory, in order to allow
fast string distance calculations and some limited pattern matching using some
strange tries.

Today, we solve these things by implementing totally redundant data structures
that keep some kind of map Term -> ValuesObject, which duplicates the Lucene
Lexicon storage. Instead of one access that gets everything, we access terms
twice, using two different access paths: once via our dictionary and a second
time implicitly via the Query or so... So we introduce performance/memory
penalties. (Please do not forget that we need to access a copy of the analyzed
document in order to attach additional info to Terms.)

I guess we are not the only ones to face such a case, as an increase in
precision beyond TF/IDF can only be achieved by introducing some domain
semantics where available. For this, attaching domain-specific info to a Term
would be a perfect solution. Also, enabling flexible implementations of
Lexicon access could give us some flexibility (e.g. the implementation in mg4j
goes in that direction).

Could somebody imagine a 2.x version of Lucene having some interface with a
clear contract that we could implement, which would enable us to plug in our
own implementation for accessing the lexicon?

Or even better, some hints on how I can do it today :)




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lexicon access questions

2006-06-01 Thread Chuck Williams
This approach comes to mind. You could model your semantic tags as
tokens and index them at the same positions as the words or phrases to
which they apply. This is particularly easy if you can integrate your
taggers with your Analyzer. You would probably want to create one or
more new Query subclasses to facilitate certain types of matching,
making it easy to associate terms/phrases with different tags (e.g.,
OverlappingQuery). This approach would support generation of queries
that are tag-dependent, but would not directly help using tags in a
ranking algorithm for tag-independent queries. As an off-hand thought,
you might be able to extend the idea to support this by naming your tags
something like TERM_TAG where TERM is the term they apply to (best if
the character used for '_' cannot occur in any term). Then something
like a TaggedTermQuery could easily find the tags relevant to a term in
the query and iterate their docs/positions in parallel with those of the
term (roughly equivalent to OverlappingQuery(term, PrefixQuery(term_*))).
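
As a rough illustration of the idea, here is a sketch in plain Python over a
toy positional index. It is not Lucene code: build_index, tags_for and
tagged_hits are hypothetical names, the tagger is a stand-in for a real one,
and OverlappingQuery and TaggedTermQuery above are suggestions, not existing
classes.

```python
from collections import defaultdict

SEP = "_"  # assumed never to occur inside a real term


def build_index(docs, tagger):
    """Build term -> doc -> positions; tagger maps a token to tag names."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
            for tag in tagger(token):
                # The tag token shares the position of the word it annotates.
                index[token + SEP + tag][doc_id].append(pos)
    return index


def tags_for(index, term):
    """Prefix scan over the lexicon, i.e. the PrefixQuery(term_*) part."""
    prefix = term + SEP
    return sorted(t[len(prefix):] for t in index if t.startswith(prefix))


def tagged_hits(index, term, tag):
    """(doc, position) pairs where the term and its tag token overlap."""
    term_postings = index.get(term, {})
    tag_postings = index.get(term + SEP + tag, {})
    hits = []
    for doc_id in sorted(term_postings):
        overlap = set(term_postings[doc_id]) & set(tag_postings.get(doc_id, []))
        hits.extend((doc_id, pos) for pos in sorted(overlap))
    return hits


# Toy data: a tagger that only knows one first name.
DOCS = {"D1": "John Doe", "D3": "blue sky LTD"}
INDEX = build_index(DOCS, lambda tok: ["FIRSTNAME"] if tok == "john" else [])
```

Here tags_for(INDEX, "john") finds FIRSTNAME and
tagged_hits(INDEX, "john", "FIRSTNAME") returns the single hit in D1 at
position 0. With a context-dependent tagger, only some occurrences of a term
would carry a given tag, and the position overlap would do real work.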

Top-of-mind thoughts,

Chuck

