[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-05-05 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528907#comment-14528907
 ] 

Aleksey Yeschenko commented on CASSANDRA-8717:
--

Committed to 2.1 as {{4c7c5be798e2a7d1e72d086bc5011242ea0173dc}} and to trunk. 
Thanks [~adelapena] for the patch, and [~beobal] for review.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 2.1.6

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch, 
 0002-Add-support-for-top-k-queries-in-2i.patch, 
 0003-Add-support-for-top-k-queries-in-2i.patch, 
 0004-Add-support-for-top-k-queries-in-2i.patch, 8717-v5.txt


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-04-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14521771#comment-14521771
 ] 

Andrés de la Peña commented on CASSANDRA-8717:
--

I have uploaded a new version of the patch with the changes that need to be 
addressed before including this in 2.1.

As you suggested, I have added a boolean argument named {{trace}} to 
{{SIS#highestSelectivityPredicate}} to indicate whether or not the tracing 
event should be emitted. It is set to true by {{SIS#search}}, and false by 
{{SIS#highestSelectivityIndex}}.

To avoid duplication in {{SecondaryIndexManager}}, now the search method calls 
to {{getHighestSelectivityIndexSearcher}}.

I have modified {{SIS#postReconciliationProcessing}} JavaDoc trying to make it 
clear that it happens on the coordinator node.

I hope you find it OK.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.x

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch, 
 0002-Add-support-for-top-k-queries-in-2i.patch, 
 0003-Add-support-for-top-k-queries-in-2i.patch, 
 0004-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-04-28 Thread Sam Tunnicliffe (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517067#comment-14517067
 ] 

Sam Tunnicliffe commented on CASSANDRA-8717:


Generally, I think this is ok for 2.1, modulo a couple of points:

A thing that concerns me is that quite a number of {{AbstractRangeCommand}} 
instances are created during execution of an index query, particularly on the 
coordinator. The primary {{RangeSliceCommand}} is created in 
{{SelectStatement#getPageableCommand}}, then a {{PagedRangeCommand}} is 
instantiated for each page of results in 
{{RangeSliceQueryPager#queryNextPage}}. Then, as each of those is pushed down 
to {{StorageProxy}} the fan out creates another instance per sub range. The 
fact that each of these instances performs the selectivity calculation is a bit 
of a pain and somewhat wasteful, though it shouldn't be overly expensive  the 
alternative would involve more change than I'd be comfortable with putting into 
2.1.

What I do think should be addressed before including this in 2.1 is that quite 
a few more tracing events are now emitted as with tracing enabled, each time 
{{SIS#highestSelectivityPredicate}} is called it we log an event. Perhaps you 
could add a boolean argument to {{highestSelectivityPredicate}} to indicate 
whether or not the tracing event should be emitted. I think that only the call 
from the {{SIS#search}} implementations would actually want to do that, so 
{{SIS#highestSelectivityIndex}} could just pass false. 

This would actually produce fewer trace events than we do currently, only 1 per 
command per replica, but I think that would actually be more correct/useful. 
FTR, I know that the whole process of looking up  selecting a searcher needs 
seriously reworking ( I'm hoping to get chance to do that very soon).

The patch introduces quite a bit of duplication in {{SecondaryIndexManager}} 
between the new {{getHighestSelectivityIndexSearcher}} method and {{search}}. 
It would be straightforward to refactor the latter to call the former.

The wording of the javadoc for {{SIS#postReconciliationProcessing}} could be a 
little clearer. In its current form, it could be taken to imply that a number 
of index queries may be executed by an index searcher instance on a single node 
 their results combined. I think it would be better to make it clear that 
reconcilliation happens on the coordinator and not on the replica(s) where the 
search happens.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch, 
 0002-Add-support-for-top-k-queries-in-2i.patch, 
 0003-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian 

[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-04-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496232#comment-14496232
 ] 

Andrés de la Peña commented on CASSANDRA-8717:
--

I'm sorry, I forgot the brackets placement. I have uploaded a new version 
complying with the code style. I have also done some minor changes in the 
proposed new {{SecondaryIndexManager}}'s method (for getting the index 
searcher) in order to make it more general. Thanks.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch, 
 0002-Add-support-for-top-k-queries-in-2i.patch, 
 0003-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-04-13 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493297#comment-14493297
 ] 

Aleksey Yeschenko commented on CASSANDRA-8717:
--

I'll review shortly (we have a conference this week, so expect early next week 
most likely).

In the meantime, can you format the patch to match the project's code style - 
https://wiki.apache.org/cassandra/CodeStyle ?

Thanks

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch, 
 0002-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-04-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14487645#comment-14487645
 ] 

Andrés de la Peña commented on CASSANDRA-8717:
--

It would be nice to have these changes in 2.1. I'm uploading a new version of 
the patch.

[~slebresne], I have renamed, as you suggested, the {{sort}} method to 
{{postReconciliationProcessing}}. Also I have renamed the method 
{{requiresFullScan}} to {{requiresScanningAllRanges}}, which seems clearer.

You are totally right about the computation of the concurrency factor. I have 
created a {{rowsToBeFetched}} variable representing the number of rows to be 
fetched. This is {{command.limit()}} in the regular case and {{command.limit() 
* ranges.size()}} when the command requieres scanning all the token ranges. In 
addition, if we know that the command needs to do a full scan then we can set 
the concurrency factor to {{ranges.size()}} in order to  query all the ranges 
in parallel. Thus, recalculating the concurrency factor is avoided in this 
particular case of full ranges scan.

Please let me know what you think about the new patch.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch, 
 0002-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-04-06 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482159#comment-14482159
 ] 

Aleksey Yeschenko commented on CASSANDRA-8717:
--

Once the issues raised by Sylvain are resolved, we can put in into 2.1 as an 
interim measure.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-04-06 Thread Evan Volgas (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482154#comment-14482154
 ] 

Evan Volgas commented on CASSANDRA-8717:


The ability to drop in a custom indexer would be a game changer. I agree with 
[~slebresne]'s point about sort being the wrong name for this method. I also 
really like the idea of hijacking the 2i in the short run and revisiting it 
later to consider a more generalized version of transforming the result set.  

As far as 2.1 vs 3.0... this is such a big win that I'd hate to see it pushed 
too far back into later. Is it maybe still possible to try and get this into 
the 2.1 line? 

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-03-27 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383613#comment-14383613
 ] 

Aleksey Yeschenko commented on CASSANDRA-8717:
--

On second thought, this looks reasonable enough for at least 3.0 inclusion - 
especially if this eventually allows you guys to get rid of that C* fork.

Still, I want to hear from [~slebresne] and [~beobal], the latter planning to 
do some C* API refactoring for a while, before proceeding.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-03-27 Thread Robbie Strickland (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383640#comment-14383640
 ] 

Robbie Strickland commented on CASSANDRA-8717:
--

FWIW, I spoke with several other teams at Spark Summit last week that would
really like this patch for the same reason.



 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-03-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383698#comment-14383698
 ] 

Andrés de la Peña commented on CASSANDRA-8717:
--

On our side, we would very much like to abandon the fork and distribute our 
index as a plugin once you guys agree that the proposed changes regarding top-K 
queries are a go.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-03-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383704#comment-14383704
 ] 

Sylvain Lebresne commented on CASSANDRA-8717:
-

I don't have a problem with this in theory, at least in 3.0 (I tend to agree 
with Aleksey on that part), though I could argue that what you fundamentally 
ask is not specific to indexing. What you want is a way to transform the 
result of internal queries. It's rather close to aggregation except that 
instead of transforming multiple rows into a single, you want to transform some 
rows into other rows (sorting them being just one particular use case of that). 
 The fact that the results you want to transform is the result of your custom 
index is kind of incidental. So I do feel that implementing this as the more 
general concept of results transformation would be cleaner (and more generic).  
However, doing so is probably a little bit more involved so I'm happy to 
hijack the 2ndary index API for that in the short term and leave 
generalization to later, provided we agree that we may generalize that better 
and thus slightly break those new APIs.

Now on the patch, I do think {{requiresFullScan}} somewhat break the 
{{concurrencyFactor}} computation in {{getRangeSlice}} as {{remainingRows}} can 
become negative. This is not a huge deal in the sense that the code ensure the 
{{concurrentFactor}} is never smaller than 1, but it still is kind of wrong in 
principle. In fact, that method is really about modifying the query limit 
internally (up until the combine method has been applied), and that's imo the 
proper way to expose it.

Another nit is that we should rename the {{sort}} method in something more 
generic (as said above, sorting is somewhat of a special case and no reason to 
imply a limitation to that). It could be renamed {{combine}} or, imo a bit 
better, something like {{postReconciliationProcessing}}.


 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-03-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383812#comment-14383812
 ] 

Andrés de la Peña commented on CASSANDRA-8717:
--

I agree with your idea about doing this for the short term and leave 
generalization for later. We can deal with future API changes without problems. 
What we would need at 2i level is some way to specify that we need to scan all 
the nodes and the aforementioned method so as to combine the partial results. 
Indeed, sort is not the most fortunate name for this method...

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-03-22 Thread Ernesto Funes (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375265#comment-14375265
 ] 

Ernesto Funes commented on CASSANDRA-8717:
--

Any news about this issue?

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-02-19 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328160#comment-14328160
 ] 

Aleksey Yeschenko commented on CASSANDRA-8717:
--

[~rstrickland] [~adelapena] I'll have another look at the patch, soon-ish.

CC [~slebresne]

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-02-02 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301431#comment-14301431
 ] 

Aleksey Yeschenko commented on CASSANDRA-8717:
--

It's unlikely to get into the 2.1 line, sorry.

Maybe 3.0, it at all, and should be done on top of CASSANDRA-8099, if at all.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 2.1.3

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-02-02 Thread Robbie Strickland (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301452#comment-14301452
 ] 

Robbie Strickland commented on CASSANDRA-8717:
--

[~iamaleksey] Have you looked at the patch?  There's barely anything to it, and 
yet it opens up the door for guys like Stratio to plug in more advanced index 
implementations without breaking anything (i.e. no need for their fork, which 
is a good thing).  Plus who knows when 3.0 will go mainstream?  I think you 
should reconsider, or at least get some other input.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-02-02 Thread Robbie Strickland (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301395#comment-14301395
 ] 

Robbie Strickland commented on CASSANDRA-8717:
--

Prior to this patch being submitted, I went through this same exercise and 
patched 2.1 mainline with these changes.  I couldn't see where it broke 
anything, and it allows users to drop in Stratio's (or their own) custom index 
implementation.  This is a big win!

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 2.1.3

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-02-02 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301634#comment-14301634
 ] 

Aleksey Yeschenko commented on CASSANDRA-8717:
--

See also - CASSANDRA-7017. @jhaliday Might have something to add, too.

It is a small patch, but it does touch the internals, subtly changing behavior 
that may or may nor be taken into account by the rest of C* codebase. My 
autopilot reaction is to say 'no' to any potentially breaking changes when it 
comes to minor C* releases.

The instabilities we had with the 2.1 line so far (hopefully in the past) make 
me be even more careful and more aggressive about pushing stuff to 'Later'.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8717) Top-k queries with custom secondary indexes

2015-02-02 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301643#comment-14301643
 ] 

Aleksey Yeschenko commented on CASSANDRA-8717:
--

That's [~jhalliday], sorry.

 Top-k queries with custom secondary indexes
 ---

 Key: CASSANDRA-8717
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8717
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Andrés de la Peña
Assignee: Andrés de la Peña
Priority: Minor
  Labels: 2i, secondary_index, sort, sorting, top-k
 Fix For: 3.0

 Attachments: 0001-Add-support-for-top-k-queries-in-2i.patch


 As presented in [Cassandra Summit Europe 
 2014|https://www.youtube.com/watch?v=Hg5s-hXy_-M], secondary indexes can be 
 modified to support general top-k queries with minimum changes in Cassandra 
 codebase. This way, custom 2i implementations could provide relevance search, 
 sorting by columns, etc.
 Top-k queries retrieve the k best results for a certain query. That implies 
 querying the k best rows in each token range and then sort them in order to 
 obtain the k globally best rows. 
 For doing that, we propose two additional methods in class 
 SecondaryIndexSearcher:
 {code:java}
 public boolean requiresFullScan(ListIndexExpression clause)
 {
 return false;
 }
 public ListRow sort(ListIndexExpression clause, ListRow rows)
 {
 return rows;
 }
 {code}
 The first one indicates if a query performed in the index requires querying 
 all the nodes in the ring. It is necessary in top-k queries because we do not 
 know which node are the best results. The second method specifies how to sort 
 all the partial node results according to the query. 
 Then we add two similar methods to the class AbstractRangeCommand:
 {code:java}
 this.searcher = 
 Keyspace.open(keyspace).getColumnFamilyStore(columnFamily).indexManager.searcher(rowFilter);
 public boolean requiresFullScan() {
 return searcher == null ? false : searcher.requiresFullScan(rowFilter);
 }
 public ListRow combine(ListRow rows)
 {
 return searcher == null ? trim(rows) : trim(searcher.sort(rowFilter, 
 rows));
 }
 {code}
 Finnally, we modify StorageProxy#getRangeSlice to use the previous method, as 
 shown in the attached patch.
 We think that the proposed approach provides very useful functionality with 
 minimum impact in current codebase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)