[
https://issues.apache.org/jira/browse/SOLR-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399933#comment-13399933
]
Andy Laird edited comment on SOLR-2592 at 6/23/12 1:13 PM:
-----------------------------------------------------------
I've been using some code very similar to Michael's latest patch for a few
weeks now and am liking it less and less for our use case. As I described
above, we are using this patch to ensure that all docs with the same value for
a specific field end up on the same shard -- this is so that field collapse
counting works for distributed searches; otherwise the returned counts are
only an upper bound. For us the counts returned during field collapse have to
be exact, since we use them to drive paging (we can't have a user going to page
29 and finding nothing there).
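For context, the grouped request we issue looks roughly like the SolrJ sketch below (Solr 4.x-era API; the field name xyz is ours and the helper class is just for illustration). The group.ngroups count it returns is only exact when every doc sharing an xyz value lives on one shard, which is exactly why we need this co-location:
{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Illustrative sketch only: the grouped query whose counts drive our paging.
public class CollapseCountQuery {
    public static QueryResponse queryPage(SolrServer server, int page, int pageSize) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.set("group", true);
        q.set("group.field", "xyz");    // the field we collapse on
        q.set("group.ngroups", true);   // total group count, used to drive paging
        q.setStart(page * pageSize);
        q.setRows(pageSize);
        return server.query(q);
    }
}
{code}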
The problems we've encountered have entirely to do with our need to update the
value of the field we're doing a field-collapse on. Our approach --
conceptually similar to the CompositeIdShardKeyParserFactory in Michael's
latest patch -- involved creating a new schema field, indexId, that was a
combination of what used to be our uniqueKey plus the field that we collapse on:
*Original schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false"
required="true" />
<field name="xyz" type="string" indexed="true" stored="true"
multiValued="false" required="true" />
...
<uniqueKey>id</uniqueKey>
{code}
*Modified schema*
{code:xml}
<field name="id" type="string" indexed="true" stored="true" multiValued="false"
required="true" />
<field name="xyz" type="string" indexed="true" stored="true"
multiValued="false" required="true" />
<field name="indexId" type="string" indexed="true" stored="true"
multiValued="false" required="true" />
...
<uniqueKey>indexId</uniqueKey>
{code}
During indexing we populate the extra "indexId" field with a value of the form
"id:xyz". Our custom ShardKeyParser extracts the xyz portion of the uniqueKey
and uses it as the hash value for shard selection. Everything works great in terms of
field collapse, counts, etc.
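In essence the parser does the following (a simplified sketch; the real class plugs into the ShardKeyParser extension point from the patch, and the method signature below is only a stand-in for illustration):
{code:java}
// Simplified sketch of our parser. The actual class implements the
// ShardKeyParser extension point from the patch; the signature here is a
// stand-in for illustration, not the patch's real interface.
public class XyzShardKeyParser {

    /**
     * Given a composite uniqueKey of the form "id:xyz", return the hash used
     * for shard selection. Hashing only the xyz suffix guarantees that every
     * document with the same xyz value lands on the same shard.
     */
    public int shardHash(String uniqueKey) {
        int sep = uniqueKey.indexOf(':');
        String xyz = (sep >= 0) ? uniqueKey.substring(sep + 1) : uniqueKey;
        return xyz.hashCode();
    }
}
{code}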
The problems begin when we need to change the value of the xyz field. Suppose
that our document starts out with these values for the three fields above:
{quote}
id=123
xyz=456
indexId=123:456
{quote}
We then want to change xyz to the value 789, say. In other words, we want to
end up...
{quote}
id=123
xyz=789
indexId=123:789
{quote}
...so that the doc lives on the same shard along with other docs that have
xyz=789.
Before any of this we would simply pass in a new document and all would be good
since we weren't changing the uniqueKey. Now, however, we need to delete the
old document (the one with the old uniqueKey) or we'll end up with duplicates. We
don't know whether a given update changes the value of xyz, and we don't know
what the old value of xyz was (without doing an additional lookup), so we must
include an extra delete along with every change:
*Before*
{code:xml}
<add>
<doc>
<field name="id">123</field>
<field name="xyz">789</field>
</doc>
</add>
{code}
*Now*
{code:xml}
<delete>
<query>id:123 AND NOT xyz:789</query>
</delete>
<add>
<doc>
<field name="id">123</field>
<field name="xyz">789</field>
<field name="indexId">123:789</field> <-- old value was 123:456
<doc>
</add>
{code}
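In client code the update path ends up looking roughly like the SolrJ sketch below (a hypothetical helper using the 4.x SolrServer API); note that the delete-by-query goes out on every update, whether or not xyz actually changed:
{code:java}
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

// Rough sketch of the paired delete + add we now send for every update,
// because we can't tell up front whether xyz (and therefore indexId) changed.
public class ShardAwareUpdater {
    private final SolrServer server;

    public ShardAwareUpdater(SolrServer server) {
        this.server = server;
    }

    public void update(String id, String xyz) throws Exception {
        // Remove any older copy of the doc that was indexed under a different
        // indexId (i.e. with a different xyz value), wherever it lives.
        server.deleteByQuery("id:" + id + " AND NOT xyz:" + xyz);

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("xyz", xyz);
        doc.addField("indexId", id + ":" + xyz);  // composite uniqueKey
        server.add(doc);
    }
}
{code}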
So in addition to the "unsavory coupling" between id and xyz, there is a
significant performance hit with this approach (as we're doing all of this in the
context of NRT). The fundamental issue, of course, is that we only have the
uniqueKey value (id) and score for the first phase of distributed search -- we
really need the other field that we are using for shard ownership, too.
One idea is to have another standard schema field similar to uniqueKey that is
used for the purposes of shard distribution:
{code:xml}
<uniqueKey>id</uniqueKey>
<shardKey>xyz</shardKey>
{code}
Then, as standard procedure, the first phase of distributed search would ask
for uniqueKey, shardKey and score. Perhaps the ShardKeyParser gets both
uniqueKey and shardKey data for more flexibility. Besides solving our issue
with field collapse counts, this would also enable date-based sharding: set
the shardKey to a date field and do the appropriate slicing in the
ShardKeyParser.
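As a rough illustration of the date-based case (entirely hypothetical, since neither <shardKey> nor the parser interface sketched here exists yet), the parser could slice the date into month-sized buckets so that all docs from the same month hash to the same shard:
{code:java}
// Hypothetical sketch only: <shardKey> and this interface are a proposal, not
// existing Solr API. The idea is that both indexing and the first phase of
// distributed search hand the parser the shardKey value, and the parser maps
// it to a hash bucket -- here, one bucket per calendar month.
public class MonthlyShardKeyParser {

    /** Assumes the shardKey value is an ISO date like "2012-06-23T00:00:00Z". */
    public int shardHash(String shardKeyValue) {
        String yearMonth = shardKeyValue.substring(0, 7);  // e.g. "2012-06"
        return yearMonth.hashCode();
    }
}
{code}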
> Pluggable shard lookup mechanism for SolrCloud
> ----------------------------------------------
>
> Key: SOLR-2592
> URL: https://issues.apache.org/jira/browse/SOLR-2592
> Project: Solr
> Issue Type: New Feature
> Components: SolrCloud
> Affects Versions: 4.0
> Reporter: Noble Paul
> Attachments: SOLR-2592.patch, dbq_fix.patch,
> pluggable_sharding.patch, pluggable_sharding_V2.patch
>
>
> If the data in a cloud can be partitioned on some criteria (say range, hash,
> attribute value, etc.), it will be easy to narrow the search down to a smaller
> subset of shards and, in effect, achieve more efficient search.