Re: [jira] [Updated] (SOLR-4787) Join Contrib

Kranti Parisa Mon, 27 Jan 2014 08:18:34 -0800

Thanks Joel. I shall look into that.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa




On Mon, Jan 27, 2014 at 10:19 AM, Joel Bernstein <joels...@gmail.com> wrote:

> Kranti,
>
> The memory leak in the bjoin involved the multi-value field joins,
> specifically how the new UninvertedIntField cache was used in the bjoin. In
> a quick review of the hjoin I'm not seeing the same issue, but it would be
> good to confirm through testing.
>
> Joel
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Mon, Jan 27, 2014 at 10:06 AM, Kranti Parisa 
> <kranti.par...@gmail.com> wrote:
>
>> Is this also applicable to the hjoin?
>>
>>
>> Thanks,
>> Kranti K. Parisa
>> http://www.linkedin.com/in/krantiparisa
>>
>>
>>
>> On Mon, Jan 27, 2014 at 7:27 AM, Joel Bernstein (JIRA) 
>> <j...@apache.org> wrote:
>>
>>>
>>>      [
>>> https://issues.apache.org/jira/browse/SOLR-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>>>
>>> Joel Bernstein updated SOLR-4787:
>>> ---------------------------------
>>>
>>>     Attachment: SOLR-4787.patch
>>>
>>> Resolved a memory leak when the bjoin is used with cache autowarming.
>>>
>>> > Join Contrib
>>> > ------------
>>> >
>>> >                 Key: SOLR-4787
>>> >                 URL: https://issues.apache.org/jira/browse/SOLR-4787
>>> >             Project: Solr
>>> >          Issue Type: New Feature
>>> >          Components: search
>>> >    Affects Versions: 4.2.1
>>> >            Reporter: Joel Bernstein
>>> >            Priority: Minor
>>> >             Fix For: 4.7
>>> >
>>> >         Attachments: SOLR-4787-deadlock-fix.patch,
>>> SOLR-4787-pjoin-long-keys.patch, SOLR-4787.patch, SOLR-4787.patch,
>>> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch,
>>> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch,
>>> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch,
>>> SOLR-4797-hjoin-multivaluekeys-trunk.patch
>>> >
>>> >
>>> > This contrib provides a place where different join implementations can
>>> be contributed to Solr. This contrib currently includes 3 join
>>> implementations. The initial patch was generated from the Solr 4.3 tag.
>>> Because of changes in the FieldCache API this patch will only build with
>>> Solr 4.2 or above.
>>> > *HashSetJoinQParserPlugin aka hjoin*
>>> > The hjoin provides a join implementation that filters results in one
>>> core based on the results of a search in another core. This is similar in
>>> functionality to the JoinQParserPlugin but the implementation differs in a
>>> couple of important ways.
>>> > The first way is that the hjoin is designed to work with int and long
>>> join keys only. So, in order to use hjoin, int or long join keys must be
>>> included in both the to and from core.
>>> > The second difference is that the hjoin builds memory structures that
>>> are used to quickly connect the join keys. So, the hjoin will need more
>>> memory than the JoinQParserPlugin to perform the join.
>>> > The main advantage of the hjoin is that it can scale to join millions
>>> of keys between cores and provide sub-second response time. The hjoin
>>> should work well with up to two million results from the fromIndex and tens
>>> of millions of results from the main query.
>>> > The hjoin supports the following features:
>>> > 1) Both Lucene query and PostFilter implementations. A *"cost"* > 99
>>> will turn on the PostFilter. The PostFilter will typically outperform the
>>> Lucene query when the main query results have been narrowed down.
>>> > 2) With the Lucene query implementation there is an option to build
>>> the filter with threads. This can greatly improve the performance of the
>>> query if the main query index is very large. The "threads" parameter turns
>>> on threading. For example, *threads=6* will use 6 threads to build the
>>> filter. This will set up a fixed thread pool with six threads to handle all
>>> hjoin requests. Once the thread pool is created the hjoin will always use it
>>> to build the filter. Threading does not come into play with the PostFilter.
>>> > 3) The *size* local parameter can be used to set the initial size of
>>> the hashset used to perform the join. If this is set above the number of
>>> results from the fromIndex then you can avoid hashset resizing, which
>>> improves performance.
>>> > 4) Nested filter queries. The local parameter "fq" can be used to nest
>>> a filter query within the join. The nested fq will filter the results of
>>> the join query. This can point to another join to support nested joins.
>>> > 5) Full caching support for the Lucene query implementation. The
>>> filterCache and queryResultCache should work properly even with deep
>>> nesting of joins. Only the queryResultCache comes into play with the
>>> PostFilter implementation because PostFilters are not cacheable in the
>>> filterCache.
>>> > The syntax of the hjoin is similar to the JoinQParserPlugin except
>>> that the plugin is referenced by the string "hjoin" rather than "join".
>>> > fq={!hjoin fromIndex=collection2 from=id_i to=id_i threads=6
>>> fq=$qq}user:customer1&qq=group:5
>>> > The example filter query above will search the fromIndex (collection2)
>>> for "user:customer1" applying the local fq parameter to filter the results.
>>> The Lucene filter query will be built using 6 threads. This query will
>>> generate a list of values from the "from" field that will be used to filter
>>> the main query. Only records from the main query, where the "to" field is
>>> present in the "from" list will be included in the results.
>>> > The solrconfig.xml in the main query core must contain the reference
>>> to the hjoin.
>>> > <queryParser name="hjoin"
>>> class="org.apache.solr.joins.HashSetJoinQParserPlugin"/>
>>> > And the join contrib lib jars must be registered in the solrconfig.xml.
>>> >  <lib dir="../../../contrib/joins/lib" regex=".*\.jar" />
>>> > After issuing the "ant dist" command from inside the solr directory
>>> the joins contrib jar will appear in the solr/dist directory. Place the
>>> solr-joins-4.*-.jar  in the WEB-INF/lib directory of the solr
>>> web application. This will ensure that the top level Solr classloader loads
>>> these classes rather than the core's classloader.
>>> > *BitSetJoinQParserPlugin aka bjoin*
>>> > The bjoin behaves exactly like the hjoin but uses a BitSet instead of
>>> a HashSet to perform the underlying join. Because of this the bjoin is much
>>> faster and can provide sub-second response times on result sets of tens of
>>> millions of records from the fromIndex and hundreds of millions of records
>>> from the main query.
>>> > But there are limitations to how the bjoin can be used. The bjoin
>>> treats the join keys as addresses in a BitSet and uses the Lucene
>>> OpenBitSet implementation which performs very well but is not sparse. So
>>> the BitSet memory is dictated by the size of the join keys. For example a
>>> bitset with a max join key of 200,000,000 will need 25 MB of memory. For
>>> this reason the BitSet join does not support long join keys. In order to
>>> keep memory usage down the join keys should also be packed at the low end,
>>> for example from 1 to 50,000,000.
>>> > Below is a sample bjoin:
>>> > fq={!bjoin fromIndex=collection2 from=id_i to=id_i threads=6
>>> fq=$qq}user:customer1&qq=group:5
>>> > To register the bjoin the solrconfig.xml in the main query core must
>>> contain the reference to the bjoin.
>>> > <queryParser name="bjoin"
>>> class="org.apache.solr.joins.BitSetJoinQParserPlugin"/>
>>> > *ValueSourceJoinParserPlugin aka vjoin*
>>> > The second implementation is the ValueSourceJoinParserPlugin aka
>>> "vjoin". This implements a ValueSource function query that can return a
>>> value from a second core based on join keys and limiting query. The
>>> limiting query can be used to select a specific subset of data from the
>>> join core. This allows customer specific relevance data to be stored in a
>>> separate core and then joined in the main query.
>>> > The vjoin is called using the "vjoin" function query. For example:
>>> > bf=vjoin(fromCore, fromKey, fromVal, toKey, query)
>>> > This example shows "vjoin" being called by the edismax boost function
>>> parameter. This example will return the "fromVal" from the "fromCore". The
>>> "fromKey" and "toKey" are used to link the records from the main query to
>>> the records in the "fromCore". The "query" is used to select a specific set
>>> of records to join with in fromCore.
>>> > Currently the fromKey and toKey must be longs but this will change in
>>> future versions. Like the pjoin, the "join" SolrCache is used to hold the
>>> join memory structures.
>>> > To configure the vjoin you must register the ValueSource plugin in the
>>> solrconfig.xml as follows:
>>> > <valueSourceParser name="vjoin"
>>> class="org.apache.solr.joins.ValueSourceJoinParserPlugin" />
>>>
>>>
>>>
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v6.1.5#6160)
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>
>>
>
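
As a rough check on the bjoin memory figure quoted above (a bitset with a max join key of 200,000,000 needing about 25 MB), here is a minimal sketch of the arithmetic. It assumes a non-sparse bitset that allocates one bit per possible key value, stored in 64-bit words. The function name bitset_bytes is made up for illustration and is not part of the contrib.

```python
def bitset_bytes(max_join_key: int) -> int:
    """Approximate bytes for a non-sparse bitset addressing keys 0..max_join_key.

    Assumption: one bit per possible key value, rounded up to whole 64-bit words.
    """
    bits = max_join_key + 1      # one bit per possible key value, including 0
    words = (bits + 63) // 64    # round up to whole 64-bit words
    return words * 8             # 8 bytes per word


if __name__ == "__main__":
    # 50,000,000 and 200,000,000 are the key ranges mentioned in the description;
    # 2,147,483,647 is the largest signed 32-bit int key.
    for max_key in (50_000_000, 200_000_000, 2_147_483_647):
        mb = bitset_bytes(max_key) / 1_000_000
        print(f"max join key {max_key:>13,}: about {mb:.1f} MB")
```

The size depends on the largest key value rather than on how many keys actually match, which is presumably why the description recommends packing the join keys at the low end.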
