[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2013-10-02 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784290#comment-13784290
 ] 

Jonathan Ellis commented on CASSANDRA-1337:
---

Pushed some super minor cleanup to 
https://github.com/jbellis/cassandra/commits/1337.

Question left in my mind is, do we want to shoot for exactly enough concurrent 
requests, on average?  Would imply that half the time we need to do an extra 
round.  ISTM we probably want to give ourselves a margin of error.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Tyler Hobbs
Priority: Minor
 Fix For: 2.1

 Attachments: 0001-Concurrent-range-and-2ary-index-subqueries.patch, 
 1137-bugfix.patch, 1337.patch, 1337-v4.patch, 
 ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
  CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2013-10-02 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784296#comment-13784296
 ] 

Jonathan Ellis commented on CASSANDRA-1337:
---

And yeah, dtests is probably the only sane place to test.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Tyler Hobbs
Priority: Minor
 Fix For: 2.1

 Attachments: 0001-Concurrent-range-and-2ary-index-subqueries.patch, 
 1137-bugfix.patch, 1337.patch, 1337-v4.patch, 
 ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
  CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2013-10-02 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784311#comment-13784311
 ] 

Tyler Hobbs commented on CASSANDRA-1337:


bq. Pushed some super minor cleanup to 
https://github.com/jbellis/cassandra/commits/1337.

In your OCD commit, replacing the haveSufficientRows/break behavior with a 
return means that we won't wait on the repair futures, which I believe is 
incorrect.

bq. Question left in my mind is, do we want to shoot for exactly enough 
concurrent requests, on average? Would imply that half the time we need to do 
an extra round. ISTM we probably want to give ourselves a margin of error.

True.  Perhaps we should decrease our estimate of rows per range by, say, 10%, 
and use Math.ceil() instead of Math.round() for the concurrency factor 
calculation.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Tyler Hobbs
Priority: Minor
 Fix For: 2.1

 Attachments: 0001-Concurrent-range-and-2ary-index-subqueries.patch, 
 1137-bugfix.patch, 1337.patch, 1337-v4.patch, 
 ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
  CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-11-08 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493273#comment-13493273
 ] 

Sylvain Lebresne commented on CASSANDRA-1337:
-

Sorry I haven't been able to got to that sooner. This will thus need rebasing. 
Also, I'm pretty sure the is something specific to do to make that work for 
get_paged_slice. I wouldn't be against just deactivating this parallelization 
for get_paged_slice to start with.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2.0 rc1

 Attachments: 1137-bugfix.patch, 1337.patch, 1337-v4.patch, 
 ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
  CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-09-17 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456908#comment-13456908
 ] 

David Alves commented on CASSANDRA-1337:


cool, that means things are probably kosher at least semantically, right? I 
mean the bug present in v1 is no longer present and all other related tests 
pass.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2.1

 Attachments: 1137-bugfix.patch, 1337.patch, 1337-v4.patch, 
 ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
  CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-09-11 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13452862#comment-13452862
 ] 

Sylvain Lebresne commented on CASSANDRA-1337:
-

I haven't had the time yet to look at the new patches, but there may be 
something wrong with the revert because the test at 
https://github.com/riptano/cassandra-dtest/blob/master/cql_tests.py#L1736 is 
still broken on trunk, though now it simply doesn't complete. Haven't looked 
yet but the test passes alright in 1.1.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2.1

 Attachments: 1137-bugfix.patch, 1337.patch, 1337-v4.patch, 
 ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
  CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-08-16 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13436286#comment-13436286
 ] 

Jonathan Ellis commented on CASSANDRA-1337:
---

Reverted the three commits here in 0bfea6f678034c54d64c0c613f758de02d266415 and 
bumped to 1.2.1 since David may not have time to get back to this before 1.2.0 
freeze.  

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2.1

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 1137-bugfix.patch, CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-07-26 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13422986#comment-13422986
 ] 

Sylvain Lebresne commented on CASSANDRA-1337:
-

The patch almost move the check for do we have enough rows for the query 
inside the don't use the local path for that range, which is broken (as in 
uselessly inefficient), because it means if we do have enough rows locally, 
we will still check another range.

But probably more importantly, I'm confused on how this patch works.

First, it estimated how much the query of *a range* will likely yield based on 
the *total* number of keys on the node. We should at least divide this by the 
replication factor to have a proper estimate of the number-of-keys-per-range.

Second, it seems to correctly parallelize old-style range_slice (i.e. 
range_slice for thrift without any IndexExpression), but I think it doesn't 
correctly handle neither secondary indexes (which was the main goal of the 
patch I believe), not any kind of range slices when CQL3 is involved:
* For secondary index queries, the number or rows returned doesn't depend at 
all on the number of row keys the node holds (and thus if follows that 
estimating the number of parallel queries to do based on that parameter is 
broken), it depends on how many columns the row for the most selective index 
contains. So for the concurrencyFactor we should 1) figure out which index will 
be used and 2) probably use the estimated mean columns count for that index.
* For CQL3 queries, they use the maxIsColumns parameters (for both traditional 
*and* 2ndary index queries) and so the number of rows returned shouldn't be 
directly compared to maxResults (i.e. if the first row we find has enough 
columns to satisfy maxResults, we're done). In that case, it unfortunately 
become more complicated to predict how much results a query might yield in 
general, because this depends on the column filter. I.e. if the filter is a 
name filter, or an identity slice filter (as in IdentitySliceFilter), we can 
try an estimate (in the latter case, something like maxResults / (estimatedKeys 
* meanColumnsCountPerKey)), but for other kind of slice filters, I don't think 
we can do much estimate. That being said, it might still be worth using the 
estimate in those two cases, because at least for 2ndary index query, the 
column filter will likely be very often an identity slice. And in that 
identity slice case, for 2ndary index and CQL3, the estimation should be 
something like maxResults / (meanColumnCountPerKey(index used) * 
meanColumnCountPerKey(parent cf)).

I'll note however that in both case, the meanColumnsCount is not necessary the 
perfect estimate to use, as it pretty much imply that half the time we will 
query one more range than is necessary. Instead we could either use the 
maxColumnsCount (if we really want to be conservative) or add some fudge factor 
to the mean. Fudge factor that may maybe be based on the difference between the 
min, mean and max columns count estimates.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399465#comment-13399465
 ] 

Hudson commented on CASSANDRA-1337:
---

Integrated in Cassandra #1554 (See 
[https://builds.apache.org/job/Cassandra/1554/])
Fix typo in CASSANDRA-1337 (Revision 
f17fbac1934721c4a90c86745f8481b2d8a8b223)

 Result = ABORTED
vijay2win : 
Files : 
* src/java/org/apache/cassandra/service/StorageProxy.java


 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-21 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398222#comment-13398222
 ] 

Vijay commented on CASSANDRA-1337:
--

Agree, and +1 for the rebased version of the patch. 
David, let me know if you have anything in addition to the attached.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-21 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13399093#comment-13399093
 ] 

David Alves commented on CASSANDRA-1337:


Thanks for the review Vijay.

I have nothing to add...

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-20 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13398211#comment-13398211
 ] 

Jonathan Ellis commented on CASSANDRA-1337:
---

Feels more like a dtest than a unit test to me.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-12 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293902#comment-13293902
 ] 

David Alves commented on CASSANDRA-1337:


Any suggestions wrt to unit testing this?

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-10 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292558#comment-13292558
 ] 

Vijay commented on CASSANDRA-1337:
--

{quote}
Can't we reach a state where we have handlers to which we haven't called get() 
(because they have not exceeded concurrecy factor?).
{quote}
I dont quite follow the question, are you talking about nodes have not 
responded on time? get() method is actually waiting for the nodes to respond 
with data.
If the above is true, yes we can get to that point and at that point we will 
might need to timeout the query.

{quote}
On another matter what would be the best strategy to test this both for 
correctness and speed?
{quote}
You might want to try Stress tool with different cardinality for the index on a 
multi node cluster.

{code}
-C CARDINALITY, --cardinality=CARDINALITY
Number of unique values stored in columns, default:50
{code}

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-10 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292562#comment-13292562
 ] 

David Alves commented on CASSANDRA-1337:


Thanks for the suggestion Vijay.

my question referred to a particular instruction in the patch (carried over 
from the original patch) where we block and wait for the handler's results only 
after we have more handlers than concurrency factor. 

My question was: wouldn't it be possible to reach a point where we have no more 
ranges (and will create no more handlers) but still have some for which we 
haven't blocked to read the data and these last few are less than concurrency 
factor therefore never passing the if's condition (if (scanHandlers.size() = 
concurrencyFactor)).

With regard to testing I guess stress is ok to test speed but how (where?) 
would I add the unit/system tests?


 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-10 Thread Vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292570#comment-13292570
 ] 

Vijay commented on CASSANDRA-1337:
--

Clarified offline!

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-10 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292571#comment-13292571
 ] 

David Alves commented on CASSANDRA-1337:


yes indeed, thanks for your help Vijay.

StorageProxyTest passes, I'll now run stress to try and find whether this makes 
a difference.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-06-09 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292395#comment-13292395
 ] 

David Alves commented on CASSANDRA-1337:


couple of questions:

- If i got the logic right the point is that we keep creating async handlers 
until we exceed concurrency factor, at which time we block and wait for the 
replies, clear the handlers and start again. Can't we reach a state where we 
have handlers to which we haven't called get() (because they have not exceeded 
concurrecy factor?).
- On another matter what would be the best strategy to test this both for 
correctness and speed?



 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2011-01-12 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980910#action_12980910
 ] 

T Jake Luciani commented on CASSANDRA-1337:
---

This assumes a RP, should I check for that? otherwise the local CF stats are 
misleading.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 0.7.1

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2011-01-12 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980918#action_12980918
 ] 

Jonathan Ellis commented on CASSANDRA-1337:
---

I don't follow, partitioner shouldn't matter (since we always fetch indexed 
rows by partitioner order).

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 0.7.1

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2011-01-12 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980924#action_12980924
 ] 

T Jake Luciani commented on CASSANDRA-1337:
---

I mean't a unbalanced ring.  

Let me make sure I understand the point of this ticket first.  You are 
suggesting we look at the local CF stats to determine if this index is sparse.  
If it is then query from multiple nodes in parallel.

My point is if the ring isn't balanced then the SSTables aren't uniform so the 
local nodes CF stats might be misleading.  



 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 0.7.1

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2011-01-12 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980928#action_12980928
 ] 

Jonathan Ellis commented on CASSANDRA-1337:
---

That's true, but if your ring is badly unbalanced you have worse problems, so I 
think it's a reasonable assumption.

 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 0.7.1

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics fom CASSANDRA-1155 to query multiple nodes in 
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.