[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784290#comment-13784290 ]

Jonathan Ellis commented on CASSANDRA-1337:

Pushed some super minor cleanup to https://github.com/jbellis/cassandra/commits/1337.

Question left in my mind is, do we want to shoot for exactly enough concurrent requests, on average? That would imply that half the time we need to do an extra round. ISTM we probably want to give ourselves a margin of error.

parallelize fetching rows for low-cardinality indexes

Key: CASSANDRA-1337
URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
Project: Cassandra
Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: Tyler Hobbs
Priority: Minor
Fix For: 2.1
Attachments: 0001-Concurrent-range-and-2ary-index-subqueries.patch, 1137-bugfix.patch, 1337.patch, 1337-v4.patch, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, CASSANDRA-1337.patch
Original Estimate: 8h
Remaining Estimate: 8h

Currently, we read the indexed rows from the first node (in partitioner order); if that does not have enough matching rows, we read the rows from the next, and so forth. We should use the statistics from CASSANDRA-1155 to query multiple nodes in parallel, such that we have a high chance of getting enough rows without having to do another round of queries (but, if our estimate is incorrect, we do need to loop and do more rounds until we have enough data or we have fetched from each node).

--
This message was sent by Atlassian JIRA (v6.1#6144)
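The behaviour requested in the issue description (query several ranges per round, in partitioner order, and loop until there are enough rows or every range has been read) can be sketched roughly as follows. This is a Python illustration only, not the Java code from any of the attached patches: `scan_in_rounds` and its parameters are hypothetical names, and the list of per-range row counts stands in for real replica responses.

```python
def scan_in_rounds(rows_per_range, wanted, concurrency):
    """Count the rounds needed to collect `wanted` rows.

    rows_per_range: matching rows each range would return, listed in
                    partitioner order (hypothetical stand-in for replies).
    concurrency:    how many ranges are queried in parallel per round.
    Returns (rows_collected, rounds_used).
    """
    concurrency = max(1, concurrency)
    collected = 0
    rounds = 0
    next_range = 0
    # Keep looping while we are short of rows and ranges remain.
    while collected < wanted and next_range < len(rows_per_range):
        batch = rows_per_range[next_range:next_range + concurrency]
        next_range += len(batch)
        rounds += 1
        collected += sum(batch)  # all replies of this round are merged
    return min(collected, wanted), rounds
```

With four ranges of 3 rows each and 10 rows wanted, a concurrency of 1 reproduces the old sequential behaviour (4 rounds), while a concurrency of 4 finishes in a single round.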
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784296#comment-13784296 ]

Jonathan Ellis commented on CASSANDRA-1337:

And yeah, dtests is probably the only sane place to test.
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784311#comment-13784311 ]

Tyler Hobbs commented on CASSANDRA-1337:

> Pushed some super minor cleanup to https://github.com/jbellis/cassandra/commits/1337.

In your OCD commit, replacing the haveSufficientRows/break behavior with a return means that we won't wait on the repair futures, which I believe is incorrect.

> Question left in my mind is, do we want to shoot for exactly enough concurrent requests, on average? Would imply that half the time we need to do an extra round. ISTM we probably want to give ourselves a margin of error.

True. Perhaps we should decrease our estimate of rows per range by, say, 10%, and use Math.ceil() instead of Math.round() for the concurrency factor calculation.
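The suggestion above (discount the rows-per-range estimate by about 10% and take the ceiling rather than rounding) can be illustrated with a small sketch. It is written in Python for brevity; in the Java patch the corresponding calls would be Math.ceil() and Math.round(). The function name, its parameters, the fallback for a missing estimate, and the clamping to the number of ranges are assumptions made for illustration, not code from the patch.

```python
import math


def concurrency_factor(limit, est_rows_per_range, num_ranges, margin=0.10):
    """Number of ranges to query in parallel in the first round.

    limit:              rows the query asks for.
    est_rows_per_range: estimated matching rows per range.
    margin:             fraction by which the estimate is discounted
                        (the "say, 10%" from the comment).
    """
    if est_rows_per_range <= 0:
        # Assumption: with no usable estimate, fall back to one range.
        return 1
    discounted = est_rows_per_range * (1.0 - margin)
    needed = math.ceil(limit / discounted)  # ceil, not round
    # Never query fewer than one range or more ranges than exist.
    return max(1, min(num_ranges, needed))
```

For a limit of 100 rows and an estimate of 50 rows per range, plain rounding without a margin would pick 2 ranges, which is exactly enough only if the estimate is spot on; with the 10% discount and the ceiling the sketch picks 3.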
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493273#comment-13493273 ]

Sylvain Lebresne commented on CASSANDRA-1337:

Sorry I haven't been able to get to that sooner. This will thus need rebasing. Also, I'm pretty sure there is something specific to do to make that work for get_paged_slice. I wouldn't be against just deactivating this parallelization for get_paged_slice to start with.

parallelize fetching rows for low-cardinality indexes

Key: CASSANDRA-1337
Assignee: David Alves
Fix For: 1.2.0 rc1
Attachments: 1137-bugfix.patch, 1337.patch, 1337-v4.patch, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, CASSANDRA-1337.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456908#comment-13456908 ]

David Alves commented on CASSANDRA-1337:

Cool, that means things are probably kosher at least semantically, right? I mean the bug present in v1 is no longer present and all other related tests pass.

parallelize fetching rows for low-cardinality indexes

Key: CASSANDRA-1337
Assignee: David Alves
Fix For: 1.2.1
Attachments: 1137-bugfix.patch, 1337.patch, 1337-v4.patch, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, CASSANDRA-1337.patch
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452862#comment-13452862 ]

Sylvain Lebresne commented on CASSANDRA-1337:

I haven't had the time yet to look at the new patches, but there may be something wrong with the revert, because the test at https://github.com/riptano/cassandra-dtest/blob/master/cql_tests.py#L1736 is still broken on trunk, though now it simply doesn't complete. I haven't looked into it yet, but the test passes alright in 1.1.
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436286#comment-13436286 ]

Jonathan Ellis commented on CASSANDRA-1337:

Reverted the three commits here in 0bfea6f678034c54d64c0c613f758de02d266415 and bumped to 1.2.1 since David may not have time to get back to this before 1.2.0 freeze.

parallelize fetching rows for low-cardinality indexes

Key: CASSANDRA-1337
Assignee: David Alves
Fix For: 1.2.1
Attachments: 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 1137-bugfix.patch, CASSANDRA-1337.patch

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422986#comment-13422986 ]

Sylvain Lebresne commented on CASSANDRA-1337:

The patch moves the check for "do we have enough rows for the query" inside the "don't use the local path for that range" branch, which is broken (as in uselessly inefficient), because it means that if we do have enough rows locally, we will still check another range.

But probably more importantly, I'm confused about how this patch works. First, it estimates how much the query of *a range* will likely yield based on the *total* number of keys on the node. We should at least divide this by the replication factor to have a proper estimate of the number of keys per range. Second, it seems to correctly parallelize old-style range_slice (i.e. range_slice for thrift without any IndexExpression), but I think it correctly handles neither secondary indexes (which was the main goal of the patch, I believe) nor any kind of range slice when CQL3 is involved:

* For secondary index queries, the number of rows returned doesn't depend at all on the number of row keys the node holds (and thus it follows that estimating the number of parallel queries to do based on that parameter is broken); it depends on how many columns the row for the most selective index contains. So for the concurrencyFactor we should 1) figure out which index will be used and 2) probably use the estimated mean columns count for that index.
* CQL3 queries use the maxIsColumns parameter (for both traditional *and* 2ndary index queries), and so the number of rows returned shouldn't be directly compared to maxResults (i.e. if the first row we find has enough columns to satisfy maxResults, we're done). In that case, it unfortunately becomes more complicated to predict how many results a query might yield in general, because this depends on the column filter. I.e. if the filter is a name filter, or an identity slice filter (as in IdentitySliceFilter), we can try an estimate (in the latter case, something like maxResults / (estimatedKeys * meanColumnsCountPerKey)), but for other kinds of slice filters, I don't think we can estimate much. That being said, it might still be worth using the estimate in those two cases, because at least for 2ndary index queries, the column filter will very often be an identity slice. And in that identity slice case, for 2ndary index and CQL3, the estimation should be something like maxResults / (meanColumnCountPerKey(index used) * meanColumnCountPerKey(parent cf)).

I'll note however that in both cases, the meanColumnsCount is not necessarily the perfect estimate to use, as it pretty much implies that half the time we will query one more range than is necessary. Instead we could either use the maxColumnsCount (if we really want to be conservative) or add some fudge factor to the mean. That fudge factor could maybe be based on the difference between the min, mean and max columns count estimates.

parallelize fetching rows for low-cardinality indexes

Key: CASSANDRA-1337
Assignee: David Alves
Fix For: 1.2
Attachments: 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, CASSANDRA-1337.patch
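The two estimates proposed in the review above can be written out as a small sketch. It is in Python and purely illustrative: the function names and parameters are hypothetical, the ceiling and the clamping to the number of ranges are added assumptions, and which statistic to pass in (mean, max, or a fudged mean) is left to the caller because the review leaves that choice open.

```python
import math


def keys_per_range(total_keys_on_node, replication_factor):
    """Rough keys-per-range estimate: a node's total key count
    divided by the replication factor, as the review suggests."""
    return total_keys_on_node / replication_factor


def ranges_needed(max_results, cols_per_key_index, cols_per_key_parent,
                  num_ranges):
    """Identity-slice case for a 2ndary index query with CQL3:
    maxResults / (columnCountPerKey(index) * columnCountPerKey(parent cf)).

    The per-key column counts are whatever estimate the caller chooses.
    """
    per_range = cols_per_key_index * cols_per_key_parent
    if per_range <= 0:
        # Assumption: with no usable estimate, query one range at a time.
        return 1
    needed = math.ceil(max_results / per_range)
    return max(1, min(num_ranges, needed))
```

For example, with maxResults = 100, 3 columns per index row and 4 columns per parent row, the formula gives 100 / 12, so the sketch would query 9 ranges in the first round.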
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399465#comment-13399465 ]

Hudson commented on CASSANDRA-1337:

Integrated in Cassandra #1554 (See [https://builds.apache.org/job/Cassandra/1554/])
Fix typo in CASSANDRA-1337 (Revision f17fbac1934721c4a90c86745f8481b2d8a8b223)

Result = ABORTED
vijay2win :
Files :
* src/java/org/apache/cassandra/service/StorageProxy.java
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398222#comment-13398222 ]

Vijay commented on CASSANDRA-1337:

Agree, and +1 for the rebased version of the patch. David, let me know if you have anything in addition to the attached.
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13399093#comment-13399093 ]

David Alves commented on CASSANDRA-1337:

Thanks for the review Vijay. I have nothing to add...
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398211#comment-13398211 ]

Jonathan Ellis commented on CASSANDRA-1337:

Feels more like a dtest than a unit test to me.
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293902#comment-13293902 ]

David Alves commented on CASSANDRA-1337:

Any suggestions with regard to unit testing this?
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292558#comment-13292558 ]

Vijay commented on CASSANDRA-1337:

> Can't we reach a state where we have handlers on which we haven't called get() (because they have not exceeded the concurrency factor)?

I don't quite follow the question. Are you talking about nodes that have not responded in time? The get() method is actually waiting for the nodes to respond with data. If the above is true, yes, we can get to that point, and at that point we might need to time out the query.

> On another matter, what would be the best strategy to test this both for correctness and speed?

You might want to try the Stress tool with different cardinality for the index on a multi-node cluster.

{code}
-C CARDINALITY, --cardinality=CARDINALITY
        Number of unique values stored in columns, default:50
{code}
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292562#comment-13292562 ]

David Alves commented on CASSANDRA-1337:

Thanks for the suggestion Vijay.

My question referred to a particular instruction in the patch (carried over from the original patch) where we block and wait for the handlers' results only once the number of handlers reaches the concurrency factor. My question was: wouldn't it be possible to reach a point where we have no more ranges (and will create no more handlers) but still have some handlers for which we haven't blocked to read the data, and these last few are fewer than the concurrency factor and therefore never pass the if's condition (if (scanHandlers.size() >= concurrencyFactor))?

With regard to testing, I guess stress is OK to test speed, but how (where?) would I add the unit/system tests?
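The concern raised above (trailing handlers that never reach the concurrency factor and so are never read) can be made concrete with a sketch. The thread settles the question offline, so this does not describe what the patch actually does; it only shows the batching pattern under discussion together with one possible guard, a final drain after the loop. It is in Python, and `scan_with_batches`, `send`, and the zero-argument callables standing in for handler.get() are all hypothetical.

```python
def scan_with_batches(ranges, concurrency, send, wanted):
    """Issue requests range by range, blocking once per full batch.

    send(range) returns a zero-argument callable that stands in for a
    handler's blocking get(); calling it yields that range's rows.
    """
    rows = []
    pending = []
    for r in ranges:
        pending.append(send(r))
        if len(pending) >= concurrency:
            # Batch is full: block on every handler, then start over.
            for handler in pending:
                rows.extend(handler())
            pending.clear()
            if len(rows) >= wanted:
                break
    # Guard: drain handlers left over when the ranges ran out before
    # the batch filled up. Without this their rows would be dropped.
    for handler in pending:
        rows.extend(handler())
    return rows[:wanted]
```

With five ranges and a concurrency factor of 2, the fifth handler is only ever read by the final drain; a loop without it would return the rows of the first four ranges only.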
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292570#comment-13292570 ]

Vijay commented on CASSANDRA-1337:

Clarified offline!
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292571#comment-13292571 ]

David Alves commented on CASSANDRA-1337:

Yes indeed, thanks for your help Vijay. StorageProxyTest passes; I'll now run stress to find out whether this makes a difference.
[jira] [Commented] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292395#comment-13292395 ]

David Alves commented on CASSANDRA-1337:

A couple of questions:

- If I got the logic right, the point is that we keep creating async handlers until we exceed the concurrency factor, at which time we block and wait for the replies, clear the handlers and start again. Can't we reach a state where we have handlers on which we haven't called get() (because they have not exceeded the concurrency factor)?
- On another matter, what would be the best strategy to test this both for correctness and speed?
[jira] Commented: (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980910#action_12980910 ] T Jake Luciani commented on CASSANDRA-1337: --- This assumes an RP; should I check for that? Otherwise the local CF stats are misleading. parallelize fetching rows for low-cardinality indexes - Key: CASSANDRA-1337 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337 Project: Cassandra Issue Type: Improvement Reporter: Jonathan Ellis Assignee: T Jake Luciani Priority: Minor Fix For: 0.7.1 Original Estimate: 8h Remaining Estimate: 8h currently, we read the indexed rows from the first node (in partitioner order); if that does not have enough matching rows, we read the rows from the next, and so forth. we should use the statistics from CASSANDRA-1155 to query multiple nodes in parallel, such that we have a high chance of getting enough rows w/o having to do another round of queries (but, if our estimate is incorrect, we do need to loop and do more rounds until we have enough data or we have fetched from each node). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980918#action_12980918 ] Jonathan Ellis commented on CASSANDRA-1337: --- I don't follow; the partitioner shouldn't matter (since we always fetch indexed rows in partitioner order).
[jira] Commented: (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980924#action_12980924 ] T Jake Luciani commented on CASSANDRA-1337: --- I meant an unbalanced ring. Let me make sure I understand the point of this ticket first. You are suggesting we look at the local CF stats to determine if this index is sparse. If it is, then query multiple nodes in parallel. My point is that if the ring isn't balanced then the SSTables aren't uniform, so the local node's CF stats might be misleading.
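To make the concern concrete, here is a minimal Python sketch of how a first-round concurrency level might be derived from a local estimate. It is an illustration under assumptions, not the ticket's implementation: `estimate_concurrency`, `rows_wanted`, `est_rows_per_range`, and `total_ranges` are hypothetical names, and the local estimate is simply taken as given rather than read from real CF stats.

```python
import math


def estimate_concurrency(rows_wanted, est_rows_per_range, total_ranges):
    """Guess how many ranges to query in parallel in the first round.

    `est_rows_per_range` is a hypothetical per-range estimate of matching
    rows, assumed to come from the local node's statistics.
    """
    if est_rows_per_range <= 0:
        # No matches expected per range: query every range at once.
        return total_ranges
    needed = math.ceil(rows_wanted / est_rows_per_range)
    # Always query at least one range, never more than exist.
    return max(1, min(total_ranges, needed))
```

If the local node suggests 30 matching rows per range and 100 rows are wanted, this picks 4 ranges. On an unbalanced ring where the other ranges hold only 5 matching rows each, those 4 ranges return 30 + 5 + 5 + 5 = 45 rows, so the estimate falls short and further rounds are needed.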
[jira] Commented: (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980928#action_12980928 ] Jonathan Ellis commented on CASSANDRA-1337: --- That's true, but if your ring is badly unbalanced you have worse problems, so I think it's a reasonable assumption.