Improve IDF and relevance by separately indexing different entity types sharing 
a common schema
-----------------------------------------------------------------------------------------------

                 Key: SOLR-1599
                 URL: https://issues.apache.org/jira/browse/SOLR-1599
             Project: Solr
          Issue Type: New Feature
          Components: Schema and Analysis
            Reporter: Graham Poulter


In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the 
documents in an index.  This introduces relevance problems when using a single 
schema to store multiple entity types, for example to support "search for 
tracks" and "search for artists".   The ranking for search on the _name_ field 
of _track_ entities will be (much?) more accurate if the IDF for the name field 
does not include counts from _artist_ entities.  The effect on ranking would be 
most pronounced for query terms that have a low document frequency for _track_ 
entities but a high frequency for _artist_ entities.
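
As a rough illustration of the effect, the sketch below computes the textbook 
IDF_t = log(N/DF_t) for one term, first over a mixed index and then over the 
_track_ documents only.  The toy corpus, the term "prince" and the helper name 
`idf` are made up for the example, and Lucene's actual similarity uses a 
smoothed variant of this formula, so only the direction of the difference 
matters here.

```python
import math

# Hypothetical toy corpus: (entitytype, tokens of the name field).
DOCS = [
    ("track", ["love", "song"]),
    ("track", ["blue", "moon"]),
    ("track", ["night", "drive"]),
    ("track", ["prince", "of", "darkness"]),
    ("artist", ["prince"]),
    ("artist", ["prince", "royce"]),
    ("artist", ["prince", "buster"]),
    ("artist", ["fresh", "prince"]),
]


def idf(term, docs):
    """Textbook IDF_t = log(N / DF_t) over the given documents."""
    n = len(docs)
    df = sum(1 for _, tokens in docs if term in tokens)
    return math.log(n / df)


tracks = [doc for doc in DOCS if doc[0] == "track"]

idf_mixed = idf("prince", DOCS)     # N = 8, DF = 5 (4 artists + 1 track)
idf_tracks = idf("prince", tracks)  # N = 4, DF = 1

print("mixed index:", round(idf_mixed, 3))
print("tracks only:", round(idf_tracks, 3))
```

In this toy data the term is rare among tracks but common among artists, so 
the mixed index gives it an IDF of about 0.47 while the tracks-only index 
gives about 1.39.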

The current work-around to make the IDF entity-specific is to use a separate 
Solr core for each entity type sharing the schema, which means repeating the 
process of copying solrconfig.xml and schema.xml to all the cores.  This 
becomes more complicated with replication, and even more complicated with 
index distribution, because you must now maintain a core for _artists_ and a 
core for _tracks_ on each node.

David Smiley, author of _Solr 1.4 Enterprise Search Server_, has filed 
SOLR-1158, where he suggests calculating _numDocs_ after the application of 
filters.  However, _numDocs_ is just the total number of documents (N): the 
document frequency (DF) for a query term of a _track_ search would also need to 
exclude _artist_ entities from the DF_t total to get IDF_t = log(N/DF_t).  
The difficulty is that DF_t needs to be calculated at index time, when Solr 
has no idea what filters will be applied.
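
To make the mismatch concrete, the following sketch compares three ways of 
combining N and DF_t for a _track_ search.  All four counts are invented for 
the illustration, and the formula is the plain log(N/DF_t) from above rather 
than Lucene's exact implementation.

```python
import math

# Hypothetical counts for one query term in a mixed index.
N_ALL = 1000    # all documents in the index
N_TRACK = 100   # documents remaining after filtering to track entities
DF_ALL = 200    # documents containing the term, any entity type
DF_TRACK = 2    # track documents containing the term


def idf(n, df):
    """Textbook IDF = log(N / DF)."""
    return math.log(n / df)


idf_global = idf(N_ALL, DF_ALL)          # both counts span the whole index
idf_filtered_n = idf(N_TRACK, DF_ALL)    # filtered N, unfiltered DF
idf_per_entity = idf(N_TRACK, DF_TRACK)  # both counts restricted to tracks

for label, value in [
    ("global N, global DF", idf_global),
    ("track N, global DF", idf_filtered_n),
    ("track N, track DF", idf_per_entity),
]:
    print(f"{label}: {value:.3f}")
```

With these numbers, filtering only N leaves a DF that is larger than N, so the 
plain formula goes negative; restricting both counts to _track_ documents 
gives the term the high IDF that its rarity among tracks would suggest.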

I suggest adding a metadata field _entitytype_, to be specified when 
submitting a batch of documents, with a configured list of allowed values: in 
the example the document could specify either entitytype="track" or 
entitytype="artist" (defaulting to _track_).  The document frequency would 
then be calculated for each entity type during indexing, so for the term "foo" 
there would be two DFs stored: the DF of "foo" for entitytype="artist" and the 
DF of "foo" for entitytype="track".   This might be implemented by 
instantiating a separate Lucene index for each configured entity type.  
Filtering on entitytype="artist" would then be implemented by searching only 
the _artist_ index.  
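
The bookkeeping this implies could look roughly like the sketch below, which 
keeps one document count and one DF table per entity type.  It is only a model 
of the idea, not Solr or Lucene code: the class `PerEntityStats`, its methods 
and the constants `ALLOWED_TYPES` and `DEFAULT_TYPE` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

ALLOWED_TYPES = ("track", "artist")  # hypothetical configured list
DEFAULT_TYPE = "track"               # used when no entitytype is given


class PerEntityStats:
    """Keeps N and DF_t separately for each entity type."""

    def __init__(self):
        self.num_docs = Counter()             # entitytype -> N
        self.doc_freq = defaultdict(Counter)  # entitytype -> term -> DF

    def add(self, tokens, entitytype=DEFAULT_TYPE):
        if entitytype not in ALLOWED_TYPES:
            raise ValueError(f"unknown entitytype: {entitytype!r}")
        self.num_docs[entitytype] += 1
        # Count each term once per document, however often it occurs.
        for term in set(tokens):
            self.doc_freq[entitytype][term] += 1

    def idf(self, term, entitytype):
        df = self.doc_freq[entitytype][term]
        if df == 0:
            return None  # term never seen for this entity type
        return math.log(self.num_docs[entitytype] / df)


stats = PerEntityStats()
stats.add(["foo", "fighters"], "artist")
stats.add(["foo"], "artist")
stats.add(["foo", "bar"])  # defaults to "track"
stats.add(["baz"])
stats.add(["qux"], "track")

print(stats.doc_freq["artist"]["foo"], stats.doc_freq["track"]["foo"])
print(stats.idf("foo", "artist"), stats.idf("foo", "track"))
```

Here "foo" ends up with two stored DFs, 2 for _artist_ and 1 for _track_, and 
a search restricted to one entity type would read only that type's counts.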

With this solution (entity type metadata field implemented with separate Lucene 
indexes) a single Solr core can support many different entity types that share 
a common schema but use partially overlapping subsets of fields, instead of 
having to configure, maintain, replicate and distribute a separate core for 
every entity type.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
