[ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Graham Poulter updated SOLR-1599:
---------------------------------

    Description:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in an index. This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists". The ranking for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities. The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities, or vice versa.

The current work-around to make the IDF entity-specific is to use a separate Solr core for each entity type sharing the schema, repeating the process of copying solrconfig.xml and schema.xml to all the cores. Maintaining a core for _artists_ and a core for _tracks_ on each node becomes more complicated with replication, and more so with sharding.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters. He recognises however that the document frequency (DF_t) for each query term in a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t). DF_t must be calculated at index time, when Solr does not know what filters will be applied.

I suggest having a metadata field _entitytype_ specified on submitting a batch of documents. The schema would specify a list of allowed entity types and a default entity type. For example, a document could specify either entitytype="track" or entitytype="artist". Each entity type has an independent set of document frequencies, so the term "foo" will have a DF for entitytype="artist" and a different DF for entitytype="track".
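As a rough illustration of why separate document frequencies matter, here is a toy sketch. It is not Solr or Lucene code: the corpus, the helper names and the plain IDF_t=log(N/DF_t) formula are assumptions made for this example only, and Lucene's actual similarity formula differs in detail.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of (entitytype, name) pairs, invented for this sketch.
DOCS = [
    ("artist", "foo fighters"),
    ("artist", "foo band"),
    ("artist", "foo trio"),
    ("track", "foo"),
    ("track", "bar"),
    ("track", "baz"),
    ("track", "qux"),
]


def idf(num_docs, doc_freq):
    """Simplified IDF_t = log(N / DF_t), as written in the description."""
    return math.log(num_docs / doc_freq)


def document_frequencies(docs):
    """Count, for each term, the number of documents whose name contains it."""
    df = Counter()
    for _entitytype, name in docs:
        df.update(set(name.split()))
    return df


def per_type_idf(docs, term):
    """Compute the IDF of `term` separately within each entity type."""
    by_type = defaultdict(list)
    for doc in docs:
        by_type[doc[0]].append(doc)
    result = {}
    for entitytype, group in by_type.items():
        df = document_frequencies(group)
        if df[term]:  # skip types in which the term never occurs
            result[entitytype] = idf(len(group), df[term])
    return result


# One shared index: N and DF_t are counted over artists and tracks together.
global_idf = idf(len(DOCS), document_frequencies(DOCS)["foo"])
# One set of statistics per entity type, as proposed.
type_idf = per_type_idf(DOCS, "foo")

print("global IDF for 'foo':", round(global_idf, 3))
for entitytype, value in sorted(type_idf.items()):
    print(entitytype, "IDF for 'foo':", round(value, 3))
```

In this toy data "foo" occurs in every artist name but in only one of four track names, so its IDF is zero among artists and higher among tracks than the value computed over the combined index.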
This might be implemented by instantiating a separate Lucene index for each configured entity type. Filtering on entitytype="artist" would be implemented by searching only the _artist_ index, analogous to searching only on the _artist_ core in the multi-core workaround.

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of configuring, replicating and sharding a Solr core for every entity type.

  was:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the documents in an index. This introduces relevance problems when using a single schema to store multiple entity types, for example to support "search for tracks" and "search for artists". The ranking for search on the _name_ field of _track_ entities will be (much?) more accurate if the IDF for the name field does not include counts from _artist_ entities. The effect on ranking would be most pronounced for query terms that have a low document frequency for _track_ entities but a high frequency for _artist_ entities, or vice versa.

The current work-around to make the IDF entity-specific is to use a separate Solr core for each entity type sharing the schema, repeating the process of copying solrconfig.xml and schema.xml to all the cores. Maintaining a core for _artists_ and a core for _tracks_ on each node becomes more complicated with replication, and even more complicated with sharding.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating _numDocs_ after the application of filters. He recognises however that the document frequency (DF_t) for each query term in a _track_ search would also need to exclude _artist_ entities from the DF_t total to get the correct IDF_t=log(N/DF_t).
DF_t must be calculated at index time, when Solr does not know what filters will be applied.

I suggest having a metadata field _entitytype_ specified on submitting a batch of documents. The schema would specify a list of allowed entity types and a default entity type. For example, a document could specify either entitytype="track" or entitytype="artist". Each entity type has an independent set of document frequencies, so the term "foo" will have a DF for entitytype="artist" and a different DF for entitytype="track". This might be implemented by instantiating a separate Lucene index for each configured entity type. Filtering on entitytype="artist" would be implemented by searching only the _artist_ index, analogous to searching only on the _artist_ core in the multi-core workaround.

With this solution (entity type metadata field implemented with separate Lucene indices) a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of configuring, replicating and sharding a Solr core for every entity type.

> Improve IDF and relevance by separately indexing different entity types
> sharing a common schema
> -----------------------------------------------------------------------------------------------
>
> Key: SOLR-1599
> URL: https://issues.apache.org/jira/browse/SOLR-1599
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Reporter: Graham Poulter
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the
> documents in an index. This introduces relevance problems when using a
> single schema to store multiple entity types, for example to support "search
> for tracks" and "search for artists". The ranking for search on the _name_
> field of _track_ entities will be (much?) more accurate if the IDF for the
> name field does not include counts from _artist_ entities.
> The effect on ranking would be most pronounced for query terms that have a
> low document frequency for _track_ entities but a high frequency for
> _artist_ entities, or vice versa.
> The current work-around to make the IDF entity-specific is to use a separate
> Solr core for each entity type sharing the schema, repeating the process of
> copying solrconfig.xml and schema.xml to all the cores. Maintaining a core
> for _artists_ and a core for _tracks_ on each node becomes more complicated
> with replication, and more so with sharding.
> David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed
> SOLR-1158, where he suggests calculating _numDocs_ after the application of
> filters. He recognises however that the document frequency (DF_t) for each
> query term in a _track_ search would also need to exclude _artist_ entities
> from the DF_t total to get the correct IDF_t=log(N/DF_t). DF_t must be
> calculated at index time, when Solr does not know what filters will be
> applied.
> I suggest having a metadata field _entitytype_ specified on submitting a
> batch of documents. The schema would specify a list of allowed entity
> types and a default entity type. For example, a document could specify either
> entitytype="track" or entitytype="artist". Each entity type has an
> independent set of document frequencies, so the term "foo" will have a DF for
> entitytype="artist" and a different DF for entitytype="track". This might
> be implemented by instantiating a separate Lucene index for each configured
> entity type. Filtering on entitytype="artist" would be implemented by
> searching only the _artist_ index, analogous to searching only on the
> _artist_ core in the multi-core workaround.
> With this solution (entity type metadata field implemented with separate
> Lucene indices) a single Solr core can support many different entity types
> that share a common schema but use partially overlapping subsets of fields,
> instead of configuring, replicating and sharding a Solr core for every entity
> type.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.