[jira] [Commented] (SOLR-4787) Join Contrib

Joel Bernstein (JIRA) Thu, 06 Jun 2013 08:39:24 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677149#comment-13677149
 ]


Joel Bernstein commented on SOLR-4787:
--------------------------------------

Kranti,

The vjoin has two performance hotspots:

1) The creation of the HashMap for the hashjoin. My testing shows that it can 
support a couple hundred thousand keys before performance becomes an issue. The 
pjoin is much more scalable in this area and can support millions of keys. The 
reason is that the pjoin only needs a sorted array to perform binary 
searches against. Java's Arrays.sort() can sort a random array of integers much 
faster than the LongToInt hashmap impl used by vjoin can be populated.
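
As a rough illustration of the sorted-array approach (a minimal sketch, not 
the code in the patch; the class name SortedKeySet and its methods are 
hypothetical), the join keys are sorted once and then probed with a binary 
search:

```java
import java.util.Arrays;

// Hypothetical helper, not taken from the SOLR-4787 patch: it only shows
// the idea of sorting the "from" keys once and probing them by binary search.
final class SortedKeySet {
    private final int[] keys;

    SortedKeySet(int[] fromKeys) {
        // Copy so the caller's array is left untouched, then sort once.
        keys = fromKeys.clone();
        Arrays.sort(keys);
    }

    // True if toKey is among the "from" keys; O(log n) per probe.
    boolean contains(int toKey) {
        return Arrays.binarySearch(keys, toKey) >= 0;
    }

    int size() {
        return keys.length;
    }
}
```

The only up-front cost here is the single sort; no per-key hash entries are 
allocated.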

For personalized relevance, though, this should be plenty, because each user 
will have a custom set of data to join to. You don't want or need to have a 
relevance record for each document. If a document is not available in the 
relevance core, vjoin will return a neutral boost of 1.


2) The hash key lookup each time the vjoin is called. This will be called for 
each document that is scored in the result set. This should scale to support 
result sets into the millions. I tested with 4,000,000 results and had 
excellent performance.
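
A minimal sketch of that per-document lookup, including the neutral boost of 
1 for documents with no relevance record. This assumes a plain 
java.util.HashMap as a stand-in for the LongToInt hashmap impl; the class 
BoostLookup and its method names are hypothetical and are not the patch's 
API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the vjoin lookup structure. The real patch
// uses its own LongToInt hashmap; a java.util.HashMap is assumed here.
final class BoostLookup {
    static final float NEUTRAL_BOOST = 1.0f;

    private final Map<Long, Float> boostByKey = new HashMap<>();

    // Called while building the map from the join core's results.
    void put(long fromKey, float fromVal) {
        boostByKey.put(fromKey, fromVal);
    }

    // Called once per scored document in the main query.
    float boostFor(long toKey) {
        Float value = boostByKey.get(toKey);
        return value == null ? NEUTRAL_BOOST : value;
    }
}
```

Building the map is the one-time cost described in point 1; boostFor is the 
part that runs for every scored document.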




                
> Join Contrib
> ------------
>
>                 Key: SOLR-4787
>                 URL: https://issues.apache.org/jira/browse/SOLR-4787
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 4.2.1
>            Reporter: Joel Bernstein
>            Priority: Minor
>             Fix For: 4.2.1
>
>         Attachments: SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4787.patch
>
>
> This contrib provides a place where different join implementations can be 
> contributed to Solr. This contrib currently includes 2 join implementations. 
> The initial patch was generated from the Solr 4.3 tag. Because of changes in 
> the FieldCache API this patch will only build with Solr 4.2 or above.
> *PostFilterJoinQParserPlugin aka "pjoin"*
> The pjoin provides a join implementation that filters results in one core 
> based on the results of a search in another core. This is similar in 
> functionality to the JoinQParserPlugin but the implementation differs in a 
> couple of important ways.
> The first way is that the pjoin is designed to work with integer join keys 
> only. So, in order to use pjoin, integer join keys must be included in both 
> the "to" and "from" cores.
> The second difference is that the pjoin builds memory structures that are 
> used to quickly connect the join keys. It also uses a custom SolrCache named 
> "join" to hold intermediate DocSets which are needed to build the join memory 
> structures. So, the pjoin will need more memory than the JoinQParserPlugin to 
> perform the join.
> The main advantage of the pjoin is that it can scale to join millions of keys 
> between cores.
> Because it's a PostFilter, it only needs to join records that match the main 
> query.
> The syntax of the pjoin is the same as the JoinQParserPlugin except that the 
> plugin is referenced by the string "pjoin" rather than "join".
> fq={!pjoin fromCore=collection2 from=id_i to=id_i}user:customer1
> The example filter query above will search the fromCore (collection2) for 
> "user:customer1". This query will generate a list of values from the "from" 
> field that will be used to filter the main query. Only records from the main 
> query whose "to" field value is present in the "from" list will be included 
> in the results.
> The solrconfig.xml in the main query core must contain the reference to the 
> pjoin.
> <queryParser name="pjoin" 
> class="org.apache.solr.joins.PostFilterJoinQParserPlugin"/>
> And the join contrib jars must be registered in the solrconfig.xml.
> <lib dir="../../../dist/" regex="solr-joins-\d.*\.jar" />
> The solrconfig.xml in the fromCore must have the "join" SolrCache configured.
>  <cache name="join"
>               class="solr.LRUCache"
>               size="4096"
>               initialSize="1024"
>               />
> *ValueSourceJoinParserPlugin aka vjoin*
> The second implementation is the ValueSourceJoinParserPlugin aka "vjoin". 
> This implements a ValueSource function query that can return a value from a 
> second core based on join keys and a limiting query. The limiting query can be 
> used to select a specific subset of data from the join core. This allows 
> customer specific relevance data to be stored in a separate core and then 
> joined in the main query.
> The vjoin is called using the "vjoin" function query. For example:
> bf=vjoin(fromCore, fromKey, fromVal, toKey, query)
> This example shows "vjoin" being called by the edismax boost function 
> parameter. This example will return the "fromVal" from the "fromCore". The 
> "fromKey" and "toKey" are used to link the records from the main query to the 
> records in the "fromCore". The "query" is used to select a specific set of 
> records to join with in fromCore.
> Currently the fromKey and toKey must be longs but this will change in future 
> versions. Like the pjoin, the "join" SolrCache is used to hold the join 
> memory structures.
> To configure the vjoin you must register the ValueSource plugin in the 
> solrconfig.xml as follows:
> <valueSourceParser name="vjoin" 
> class="org.apache.solr.joins.ValueSourceJoinParserPlugin" />

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
