[jira] [Updated] (SOLR-4787) Join Contrib

Joel Bernstein (JIRA) Fri, 23 Aug 2013 05:02:25 -0700

     [ https://issues.apache.org/jira/browse/SOLR-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joel Bernstein updated SOLR-4787:
---------------------------------

    Description: 
This contrib provides a place where different join implementations can be 
contributed to Solr. This contrib currently includes 3 join implementations. 
The initial patch was generated from the Solr 4.3 tag. Because of changes in 
the FieldCache API this patch will only build with Solr 4.2 or above.

*HashSetJoinQParserPlugin aka "hjoin"*

The hjoin provides a join implementation that filters results in one core based 
on the results of a search in another core. This is similar in functionality to 
the JoinQParserPlugin but the implementation differs in a couple of important 
ways.

The first way is that the hjoin is designed to work with int and long join keys 
only. So, in order to use hjoin, int or long join keys must be included in both 
the to and from core.

The second difference is that the hjoin builds memory structures that are used 
to quickly connect the join keys. So, the hjoin will need more memory than the 
JoinQParserPlugin to perform the join.

The main advantage of the hjoin is that it can scale to join millions of keys 
between cores and provide sub-second response time. The hjoin should work well 
with up to two million results from the fromIndex and tens of millions of 
results from the main query.

The hjoin supports the following features:

1) Both Lucene query and PostFilter implementations. A *"cost"* greater than 99 
will turn on the PostFilter. The PostFilter will typically outperform the 
Lucene query when the main query results have been narrowed down.

2) With the Lucene query implementation there is an option to build the filter 
with threads. This can greatly improve the performance of the query if the main 
query index is very large. The "threads" parameter turns on threading. For 
example, *threads=6* will use 6 threads to build the filter. This will set up a 
fixed thread pool with six threads to handle all hjoin requests. Once the 
thread pool is created, the hjoin will always use it to build the filter. 
Threading does not come into play with the PostFilter.

3) The *size* local parameter can be used to set the initial size of the 
hashset used to perform the join. If this is set above the number of results 
from the fromIndex, then you can avoid hashset resizing, which improves 
performance.

4) Nested filter queries. The local parameter "fq" can be used to nest a filter 
query within the join. The nested fq will filter the results of the join query. 
It can point to another join to support nested joins.

5) Full caching support for the Lucene query implementation. The filterCache 
and queryResultCache should work properly even with deep nesting of joins. Only 
the queryResultCache comes into play with the PostFilter implementation because 
PostFilters are not cacheable in the filterCache.
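
As a rough illustration of feature 2, the sketch below splits a main index into 
ranges and checks each range against the join keys on a shared, fixed-size 
thread pool. It is written in Python rather than against the contrib's Java 
code, and every name in it (get_pool, build_filter, the list-of-keys "index") 
is hypothetical. It only mirrors the idea described above and is not the 
actual hjoin implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the shared pool: created once with a fixed
# number of threads and reused by every later call.
_POOL = None


def get_pool(threads):
    """Create the pool on first use; later calls reuse the existing pool."""
    global _POOL
    if _POOL is None:
        _POOL = ThreadPoolExecutor(max_workers=threads)
    return _POOL


def build_filter(to_keys, from_keys, threads=6):
    """Return one boolean per main-index document: True if its "to" key
    appears among the "from" keys."""
    from_set = set(from_keys)
    pool = get_pool(threads)
    n = len(to_keys)
    chunk = max(1, (n + threads - 1) // threads)
    ranges = [(start, min(start + chunk, n)) for start in range(0, n, chunk)]

    def scan(bounds):
        start, end = bounds
        return [to_keys[i] in from_set for i in range(start, end)]

    matches = []
    # pool.map yields results in submission order, so the chunks line up
    # with the original document order.
    for part in pool.map(scan, ranges):
        matches.extend(part)
    return matches
```

In Python this layout does not speed up CPU-bound work, so the sketch is only 
meant to show the shape of the approach: one pool, created once, shared by all 
requests, with the index scanned in independent ranges.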

The syntax of the hjoin is similar to the JoinQParserPlugin except that the 
plugin is referenced by the string "hjoin" rather than "join".

fq=\{!hjoin fromIndex=collection2 from=id_i to=id_i threads=6 
fq=$qq\}user:customer1&qq=group:5

The example filter query above will search the fromIndex (collection2) for 
"user:customer1", applying the local fq parameter to filter the results. The 
Lucene filter query will be built using 6 threads. This query will generate a 
list of values from the "from" field that will be used to filter the main 
query. Only records from the main query whose "to" field value is present in 
the "from" list will be included in the results.
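
The join semantics described above can be sketched in a few lines of Python. 
This is only a model of the behaviour as read from the description, using 
in-memory lists of dicts in place of cores; hjoin_filter and the sample 
documents are made up for illustration.

```python
def hjoin_filter(main_docs, from_docs, from_field, to_field, query, nested_fq=None):
    """Keep main_docs whose to_field value appears among the from_field
    values of the from_docs that match query (and nested_fq, if given)."""
    keys = set()
    for doc in from_docs:
        if query(doc) and (nested_fq is None or nested_fq(doc)):
            keys.add(doc[from_field])
    return [doc for doc in main_docs if doc.get(to_field) in keys]


# Hypothetical contents of the from index (collection2) and the main index.
collection2 = [
    {"id_i": 1, "user": "customer1", "group": 5},
    {"id_i": 2, "user": "customer1", "group": 7},
    {"id_i": 3, "user": "customer2", "group": 5},
]
main = [
    {"id_i": 1, "title": "a"},
    {"id_i": 2, "title": "b"},
    {"id_i": 4, "title": "c"},
]

# Mirrors: {!hjoin fromIndex=collection2 from=id_i to=id_i fq=$qq}user:customer1
# with qq=group:5
result = hjoin_filter(
    main, collection2, "id_i", "id_i",
    query=lambda d: d["user"] == "customer1",
    nested_fq=lambda d: d["group"] == 5,
)
```

Only the main document with id_i 1 survives: id_i 2 is dropped by the nested 
filter and id_i 4 has no counterpart in collection2.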

The solrconfig.xml in the main query core must contain the reference to the 
hjoin.

<queryParser name="hjoin" 
class="org.apache.solr.joins.HashSetJoinQParserPlugin"/>

And the join contrib jars must be registered in the solrconfig.xml.

 <lib dir="../../../contrib/joins/lib" regex=".*\.jar" />
 <lib dir="../../../dist/" regex="solr-joins-\d.*\.jar" />

*BitSetJoinQParserPlugin aka "bjoin"*

The bjoin behaves exactly like the hjoin but uses a BitSet instead of a HashSet 
to perform the underlying join. Because of this the bjoin is much faster and 
can provide sub-second response times on tens of millions of records from the 
fromIndex and hundreds of millions of records from the main query.

But there are limitations to how the bjoin can be used. The bjoin treats the 
join keys as addresses in a BitSet and uses the Lucene OpenBitSet 
implementation, which performs very well but is not sparse. The BitSet memory 
is therefore dictated by the largest join key value rather than by the number 
of keys: a maximum join key of 200,000,000 needs 200,000,000 bits, or about 
25 MB of memory. For this reason the BitSet join does not support long join 
keys. To keep memory usage down, the join keys should also be packed at the 
low end, for example from 1 to 50,000,000.
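
The memory figure can be checked with a small calculation. The Python sketch 
below assumes a dense bit set backed by 64-bit words, with one bit per possible 
key value from 0 up to the largest key; bitset_bytes and DenseBitSet are 
illustrative names and not part of Lucene or of this contrib.

```python
def bitset_bytes(max_key):
    """Approximate backing-array size in bytes for a dense bit set that can
    address keys 0..max_key, assuming 64-bit words."""
    bits = max_key + 1
    words = (bits + 63) // 64
    return words * 8


class DenseBitSet:
    """Toy dense bit set: the key itself is the bit address."""

    def __init__(self, max_key):
        self.bits = bytearray(bitset_bytes(max_key))

    def set(self, key):
        self.bits[key >> 3] |= 1 << (key & 7)

    def get(self, key):
        idx = key >> 3
        return idx < len(self.bits) and bool(self.bits[idx] & (1 << (key & 7)))


# Memory depends on the largest key, not on how many keys are set.
large = bitset_bytes(200_000_000)   # roughly 25 MB
packed = bitset_bytes(50_000_000)   # roughly 6.25 MB
```

Under these assumptions a single key of 200,000,000 costs as much memory as 
200 million keys packed into the same range, which is why packing keys at the 
low end matters.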

Below is a sample bjoin:

fq=\{!bjoin fromIndex=collection2 from=id_i to=id_i threads=6 
fq=$qq\}user:customer1&qq=group:5

To register the bjoin the solrconfig.xml in the main query core must contain 
the reference to the bjoin.

<queryParser name="bjoin" 
class="org.apache.solr.joins.BitSetJoinQParserPlugin"/>
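
When sent over HTTP, the sample bjoin filter query has to be URL-encoded. The 
Python sketch below assembles it with the standard library. It assumes that 
the backslashes before the braces in the sample are JIRA markup escapes and 
are not part of the query, and that the main query is q=*:*; the join_fq 
helper is hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def join_fq(parser, from_index, from_field, to_field, query, **local_params):
    """Build a local-params filter query string such as
    {!bjoin fromIndex=... from=... to=... threads=6 fq=$qq}user:customer1"""
    parts = [f"fromIndex={from_index}", f"from={from_field}", f"to={to_field}"]
    parts += [f"{name}={value}" for name, value in local_params.items()]
    return "{!" + parser + " " + " ".join(parts) + "}" + query


params = [
    ("q", "*:*"),  # assumed main query, not taken from the description
    ("fq", join_fq("bjoin", "collection2", "id_i", "id_i", "user:customer1",
                   threads=6, fq="$qq")),
    ("qq", "group:5"),  # dereferenced by fq=$qq inside the local params
]

# urlencode percent-encodes the braces, "!", "$" and ":" and turns spaces into "+".
query_string = urlencode(params)
```

The same helper would produce an hjoin request by passing "hjoin" as the parser 
name.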

*ValueSourceJoinParserPlugin aka vjoin*

The third implementation is the ValueSourceJoinParserPlugin aka "vjoin". This 
implements a ValueSource function query that can return a value from a second 
core based on join keys and a limiting query. The limiting query can be used to 
select a specific subset of data from the join core. This allows 
customer-specific relevance data to be stored in a separate core and then 
joined in the main query.

The vjoin is called using the "vjoin" function query. For example:

bf=vjoin(fromCore, fromKey, fromVal, toKey, query)

This example shows "vjoin" being called by the edismax boost function 
parameter. This example will return the "fromVal" from the "fromCore". The 
"fromKey" and "toKey" are used to link the records from the main query to the 
records in the "fromCore". The "query" is used to select a specific set of 
records to join with in fromCore.
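
The lookup described above can be modelled as a keyed map built from the 
fromCore. The Python sketch below is an illustration only: the vjoin function, 
the field names and the default of 0.0 for unmatched records are assumptions 
and do not describe the plugin's actual behaviour.

```python
def vjoin(from_core, from_key, from_val, to_key, query, default=0.0):
    """Return a function mapping a main-query document to the from_val of
    the from_core record with the same key, limited to records that match
    query. Unmatched documents get default (an assumption of this sketch)."""
    lookup = {doc[from_key]: doc[from_val] for doc in from_core if query(doc)}

    def value_for(main_doc):
        return lookup.get(main_doc.get(to_key), default)

    return value_for


# Hypothetical customer-specific relevance data held in a separate core.
from_core = [
    {"key_l": 10, "boost_f": 2.5, "customer": "customer1"},
    {"key_l": 11, "boost_f": 0.5, "customer": "customer1"},
    {"key_l": 10, "boost_f": 9.0, "customer": "customer2"},
]

# The limiting query selects only customer1's records before joining.
boost = vjoin(from_core, "key_l", "boost_f", "key_l",
              query=lambda d: d["customer"] == "customer1")
```

Here boost({"key_l": 10}) yields customer1's value for key 10, and the 
customer2 record with the same key is ignored because it falls outside the 
limiting query.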

Currently the fromKey and toKey must be longs but this will change in future 
versions. Like the pjoin, the "join" SolrCache is used to hold the join memory 
structures.

To configure the vjoin you must register the ValueSource plugin in the 
solrconfig.xml as follows:

<valueSourceParser name="vjoin" 
class="org.apache.solr.joins.ValueSourceJoinParserPlugin" />







  was:
This contrib provides a place where different join implementations can be 
contributed to Solr. This contrib currently includes 2 join implementations. 
The initial patch was generated from the Solr 4.3 tag. Because of changes in 
the FieldCache API this patch will only build with Solr 4.2 or above.

*PostFilterJoinQParserPlugin aka "pjoin"*

The pjoin provides a join implementation that filters results in one core based 
on the results of a search in another core. This is similar in functionality to 
the JoinQParserPlugin but the implementation differs in a couple of important 
ways.

The first way is that the pjoin is designed to work with integer join keys 
only. So, in order to use pjoin, integer join keys must be included in both the 
to and from core.

The second difference is that the pjoin builds memory structures that are used 
to quickly connect the join keys. It also uses a custom SolrCache named "join" 
to hold intermediate DocSets which are needed to build the join memory 
structures. So, the pjoin will need more memory than the JoinQParserPlugin to 
perform the join.

The main advantage of the pjoin is that it can scale to join millions of keys 
between cores.

Because it's a PostFilter, it only needs to join records that match the main 
query.

The syntax of the pjoin is the same as the JoinQParserPlugin except that the 
plugin is referenced by the string "pjoin" rather than "join".

fq=\{!pjoin fromCore=collection2 from=id_i to=id_i\}user:customer1

The example filter query above will search the fromCore (collection2) for 
"user:customer1". This query will generate a list of values from the "from" 
field that will be used to filter the main query. Only records from the main 
query, where the "to" field is present in the "from" list will be included in 
the results.

The solrconfig.xml in the main query core must contain the reference to the 
pjoin.

<queryParser name="pjoin" 
class="org.apache.solr.joins.PostFilterJoinQParserPlugin"/>

And the join contrib jars must be registered in the solrconfig.xml.

<lib dir="../../../dist/" regex="solr-joins-\d.*\.jar" />

The solrconfig.xml in the fromCore must have the "join" SolrCache configured.

 <cache name="join"
              class="solr.LRUCache"
              size="4096"
              initialSize="1024"
              />


*ValueSourceJoinParserPlugin aka vjoin*

The second implementation is the ValueSourceJoinParserPlugin aka "vjoin". This 
implements a ValueSource function query that can return a value from a second 
core based on join keys and limiting query. The limiting query can be used to 
select a specific subset of data from the join core. This allows customer 
specific relevance data to be stored in a separate core and then joined in the 
main query.

The vjoin is called using the "vjoin" function query. For example:

bf=vjoin(fromCore, fromKey, fromVal, toKey, query)

This example shows "vjoin" being called by the edismax boost function 
parameter. This example will return the "fromVal" from the "fromCore". The 
"fromKey" and "toKey" are used to link the records from the main query to the 
records in the "fromCore". The "query" is used to select a specific set of 
records to join with in fromCore.

Currently the fromKey and toKey must be longs but this will change in future 
versions. Like the pjoin, the "join" SolrCache is used to hold the join memory 
structures.

To configure the vjoin you must register the ValueSource plugin in the 
solrconfig.xml as follows:

<valueSourceParser name="vjoin" 
class="org.apache.solr.joins.ValueSourceJoinParserPlugin" />







    
> Join Contrib
> ------------
>
>                 Key: SOLR-4787
>                 URL: https://issues.apache.org/jira/browse/SOLR-4787
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 4.2.1
>            Reporter: Joel Bernstein
>            Priority: Minor
>             Fix For: 4.5, 5.0
>
>         Attachments: SOLR-4787-deadlock-fix.patch, SOLR-4787.patch, 
> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787-pjoin-long-keys.patch
>
