[ https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782388#action_12782388 ]
Graham Poulter commented on SOLR-1599:
--------------------------------------

This is what can happen when indexing multiple entity types in the same core, for instance indexing artists and tracks and using a filter to "search for artists". You then search for artists, with two dismax terms _A_ or _B_ on the _name_ field. Term _A_ is rare among artist names, so it should have a low docFreq and a relatively high weight compared to term _B_. However, term _A_ happens to be common in track names, so its docFreq is higher, making the IDF weight for _A_ lower than it should be relative to term _B_. The filtered-out track documents are invisibly modifying the weight of query terms in a query for artists, which would not happen with separate indices (and thus separate docFreqs).

> Improve IDF and relevance by separately indexing different entity types
> sharing a common schema
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1599
>                 URL: https://issues.apache.org/jira/browse/SOLR-1599
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Graham Poulter
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> In Solr 1.4, the IDF (Inverse Document Frequency) is calculated over all of
> the documents in an index. This introduces relevance problems when using a
> single schema to store multiple entity types, for example to support "search
> for tracks" and "search for artists". The ranking for a search on the _name_
> field of _track_ entities will be (much?) more accurate if the IDF for the
> name field does not include counts from _artist_ entities. The effect on
> ranking would be most pronounced for query terms that have a low document
> frequency for _track_ entities but a high frequency for _artist_ entities, or
> vice versa.
>
> The current work-around to make the IDF entity-specific is to use a separate
> Solr core for each entity type sharing the schema, and to repeat the process
> of copying solrconfig.xml and schema.xml to all the cores. This would be more
> complicated with replication, and more so with sharding, since a core for
> _artists_ and a core for _tracks_ must be maintained on each node.
>
> David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed
> SOLR-1158, where he suggests calculating _numDocs_ after the application of
> filters. He recognises, however, that the document frequency (DF_t) for each
> query term in a _track_ search would also need to exclude _artist_ entities
> from the DF_t total to get the correct IDF_t=log(N/DF_t). DF_t must be
> calculated at index time, when Solr does not know what filters will be
> applied.
>
> I suggest having a metadata field _entitytype_ specified on submitting a
> batch of documents. The schema would specify a list of allowed entity types
> and a default entity type. For example, a document could specify either
> entitytype="track" or entitytype="artist". Each entity type has an
> independent set of document frequencies, so the term "foo" will have one DF
> for entitytype="artist" and a different DF for entitytype="track". This might
> be implemented by instantiating a separate Lucene index for each configured
> entity type. Filtering on entitytype="artist" would be implemented by
> searching only the _artist_ index, analogous to searching only the _artist_
> core in the multi-core workaround.
>
> With this solution (entity type metadata field implemented with separate
> Lucene indices) a single Solr core can support many different entity types
> that share a common schema but use partially overlapping subsets of fields,
> instead of configuring, replicating and sharding a Solr core for every entity
> type.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
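
The IDF skew described in the comment and the issue above can be illustrated with a small sketch. It is not Solr or Lucene code: it uses the simplified IDF_t=log(N/DF_t) from the issue text rather than Lucene's actual similarity formula, and the sample documents, the helper names (`idf`, `doc_freqs`, `idf_by_entity_type`) and the whitespace tokenisation are all assumptions made for illustration. Here "blue" plays the role of term _A_ and "stone" the role of term _B_.

```python
import math
from collections import Counter, defaultdict


def idf(num_docs, doc_freq):
    """Simplified IDF_t = log(N / DF_t), as written in the issue text."""
    return math.log(num_docs / doc_freq)


def doc_freqs(docs):
    """For each term, count how many documents contain it in the name field."""
    df = Counter()
    for doc in docs:
        df.update(set(doc["name"].lower().split()))
    return df


def idf_by_entity_type(docs):
    """Compute IDF separately per entitytype, as if each type had its own index."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["entitytype"]].append(doc)
    return {
        etype: {term: idf(len(group), n) for term, n in doc_freqs(group).items()}
        for etype, group in groups.items()
    }


# Hypothetical sample data: "blue" is rare among artists but common among tracks.
docs = [
    {"entitytype": "artist", "name": "Blue Stone"},
    {"entitytype": "artist", "name": "Stone Roses"},
    {"entitytype": "artist", "name": "Green Day"},
    {"entitytype": "artist", "name": "Red Hot"},
    {"entitytype": "track", "name": "Blue Moon"},
    {"entitytype": "track", "name": "Blue Sky"},
    {"entitytype": "track", "name": "Blue Monday"},
    {"entitytype": "track", "name": "Blue Bayou"},
]

# One shared index: document frequencies are counted across artists AND tracks.
shared = {term: idf(len(docs), n) for term, n in doc_freqs(docs).items()}

# Separate statistics per entity type, as the issue proposes.
per_type = idf_by_entity_type(docs)

# In the shared index "blue" is weighted below "stone"; among artists alone
# the ordering is reversed, which is the effect described in the comment.
print("shared:", shared["blue"], shared["stone"])
print("artist:", per_type["artist"]["blue"], per_type["artist"]["stone"])
```

Filtering the shared index down to artists at query time would not change the first pair of weights, because the document frequencies were already fixed at index time.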