Re: BatchScanner taking too much time to scan rows
Dylan, could you elaborate on the average query time you had?

Thanks
Vaibhav

On 14-May-2015 11:03 pm, Dylan Hutchison dhutc...@mit.edu wrote:

I think this is the same issue I found for ACCUMULO-3710 https://issues.apache.org/jira/browse/ACCUMULO-3710, only in my case the tserver ran out of memory. Accumulo doesn't handle large numbers of small, disjoint ranges well. I bet there's room for improvement on both the client and tablet server.

~Dylan
Re: BatchScanner taking too much time to scan rows
Sorry, just remembered that my setup was to scan an index table and gather rowIDs, then scan a main data table using the rowIDs as the BatchScan ranges. Effectively it is a join of part of the index table to a main data table. The scan rate I achieved is therefore double the value I cited previously: I showed about 76k entries/second. Still not the best but it is more within Accumulo standards.

On Thu, May 14, 2015 at 2:15 PM, Dylan Hutchison dhutc...@mit.edu wrote:

I didn't have an average query time-- the tablet server crashed. A quick solution is to batch the ranges into groups of 50k (or 500k, I forgot which one) and do many BatchScans-- not ideal. I think I achieved 33k entries/second retrieval on a single-node Accumulo. Accumulo is better for sequential lookup than random.
Re: BatchScanner taking too much time to scan rows
I didn't have an average query time-- the tablet server crashed. A quick solution is to batch the ranges into groups of 50k (or 500k, I forgot which one) and do many BatchScans-- not ideal. I think I achieved 33k entries/second retrieval on a single-node Accumulo. Accumulo is better for sequential lookup than random.

On Thu, May 14, 2015 at 1:57 PM, vaibhav thapliyal vaibhav.thapliyal...@gmail.com wrote:

Dylan could you elaborate on the average query time you had?

Thanks
Vaibhav
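The workaround Dylan describes in this message, splitting a large set of lookups into groups and running one BatchScan per group, could be sketched roughly as below. This is only an illustration under assumptions: the class name `RangeBatches`, the `partition` helper, and the sample row ids are hypothetical, and the Accumulo calls themselves are left out so the sketch runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeBatches {

    /** Splits ids into consecutive chunks of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical row ids standing in for the ids gathered from an index table.
        List<String> rowIds = new ArrayList<>();
        for (int i = 0; i < 120_000; i++) {
            rowIds.add(String.format("row%06d", i));
        }
        for (List<String> batch : partition(rowIds, 50_000)) {
            // In a real client, each batch would be converted to ranges and
            // given to its own BatchScanner via setRanges(); omitted here.
            System.out.println("batch of " + batch.size() + " row ids");
        }
    }
}
```

The group size of 50,000 is taken from the message; whether 50k or 500k is the better bound is left open there, so treat it as a tunable.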
Re: BatchScanner taking too much time to scan rows
I think this is the same issue I found for ACCUMULO-3710 https://issues.apache.org/jira/browse/ACCUMULO-3710, only in my case the tserver ran out of memory. Accumulo doesn't handle large numbers of small, disjoint ranges well. I bet there's room for improvement on both the client and tablet server.

~Dylan

On Wed, May 13, 2015 at 3:13 PM, Eric Newton eric.new...@gmail.com wrote:

Yes, hot-spotting does affect accumulo because you have fewer servers and caches handling your request. Let's say your data is spread out, in a normal distribution from 0..9. What if you have only 1 split? You would want it at 5, to divide the data in half, and you could host the halves on different servers. But if you split at 1, now 10% of your queries go to one tablet, and 90% go to the other.

-Eric
Re: BatchScanner taking too much time to scan rows
This use case is one of the things Accumulo was designed to handle well. It's the reason there is a BatchScanner. I've created: https://issues.apache.org/jira/browse/ACCUMULO-3813 so we can investigate and track down any problems or improvements. Feel free to add any other details to the JIRA ticket.

-Eric

On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz elahrvi...@ccri.com wrote:

It sounds like each of your ranges is an ID, e.g. a single row. I've found that scanning lots of non-sequential single-row ranges is pretty slow in Accumulo. Your best approach is probably to create an index table on whatever you are originally trying to query (assuming those 10,000 ids came from some other query).

Thanks,
Emilio
Re: BatchScanner taking too much time to scan rows
It sounds like each of your ranges is an ID, e.g. a single row. I've found that scanning lots of non-sequential single-row ranges is pretty slow in Accumulo. Your best approach is probably to create an index table on whatever you are originally trying to query (assuming those 10,000 ids came from some other query).

Thanks,
Emilio

On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:

The rf files per tablet vary between 2 and 5. The number of entries returned to me by the BatchScanner is 46. The approximate average data rate is 0.5 MB/s, as seen on the Accumulo monitor page. A simple scan on the table has an average data rate of about 7-8 MB/s. All the ids exist in the Accumulo table.

On 12 May 2015 at 23:39, Keith Turner ke...@deenlo.com wrote:

Do you know how much data is being brought back (e.g. 100 megabytes)? I am wondering what the data rate is in MB/s. Do you know how many files per tablet you have? Do most of the 10,000 ids you are querying for exist?

On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal vaibhav.thapliyal...@gmail.com wrote:

I have 194 tablets. Currently I am using 20 threads to create the BatchScanner inside the createBatchScanner method.

On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:

How many tablets do you have? The batch scanner does not parallelize operations within a tablet.

If you give the batch scanner more threads than there are tservers, it will make multiple parallel RPC calls to each tserver if the tserver has multiple tablets. Each RPC may include multiple tablets and ranges for each tablet.

If the batch scanner has fewer threads than tservers, it will make one RPC per tserver per thread. Each RPC call will include all tablets and associated ranges for that tserver.

Keith

On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal vaibhav.thapliyal...@gmail.com wrote:

Hi,

I am using BatchScanner to scan rows from an Accumulo table. The table has around 187m entries and I am using a 3 node cluster which has Accumulo 1.6.1.

I have passed 10,000 ids, which are stored as row ids in my table, as a list in the setRanges() method.

This whole process takes around 50 secs (from adding the ids in the list to scanning the whole table using the BatchScanner).

I tried switching on bloom filters but that didn't work. Also, if anyone could briefly explain how a BatchScanner works and how it does parallel scanning, it would help me understand what I am doing better.

Thanks
Vaibhav
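Keith's point that the batch scanner does not parallelize within a tablet is easier to picture once the lookups are grouped by the tablet that holds them. The toy model below is not Accumulo code; it only assumes that a tablet holds every row up to and including its end row (the split point), and the class name `TabletBinning`, the `<last>` marker, and the sample splits are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class TabletBinning {

    /**
     * Groups row ids by tablet. Each tablet is identified by its end row
     * (a split point); rows after the final split go to the "<last>" tablet.
     */
    public static Map<String, List<String>> binByTablet(TreeSet<String> splits,
                                                        List<String> rows) {
        Map<String, List<String>> bins = new TreeMap<>();
        for (String row : rows) {
            // Smallest split point >= row, or null if the row is past every split.
            String endRow = splits.ceiling(row);
            String tablet = (endRow == null) ? "<last>" : endRow;
            bins.computeIfAbsent(tablet, k -> new ArrayList<>()).add(row);
        }
        return bins;
    }

    public static void main(String[] args) {
        TreeSet<String> splits = new TreeSet<>(Arrays.asList("g", "p"));
        List<String> rows = Arrays.asList("a", "g", "h", "z");
        binByTablet(splits, rows).forEach(
            (tablet, ids) -> System.out.println(tablet + " -> " + ids));
    }
}
```

In this model, a bin that ends up holding most of the row ids corresponds to a tablet that, per Keith's description, would have to work through those ranges without intra-tablet parallelism.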
Re: BatchScanner taking too much time to scan rows
Yes, hot-spotting does affect Accumulo because you have fewer servers and caches handling your request. Let's say your data is spread out, in a normal distribution from 0..9. What if you have only 1 split? You would want it at 5, to divide the data in half, and you could host the halves on different servers. But if you split at 1, now 10% of your queries go to one tablet, and 90% go to the other.

-Eric

On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal vaibhav.thapliyal...@gmail.com wrote:

Thank you Eric. I will surely do the same. Should uneven distribution across the tablets affect querying in Accumulo? In this case, it does. Is this behaviour normal?

On 13-May-2015 10:58 pm, Eric Newton eric.new...@gmail.com wrote:

Yes, that's a great way to split the data evenly.

Also, since the data set is so small, turn on data caching for your table:

  shell> config -t mytable -s table.cache.block.enable=true

You may want to increase the size of your tserver JVM, and increase the size of the cache:

  shell> config -s tserver.cache.data.size=1G

This will help with repeated random look-ups.

-Eric
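Eric's split-point example in this message can be checked with a little arithmetic. The sketch below assumes, for simplicity, that keys and queries are spread evenly over the interval [0, 10), which is what the 10%/90% figures correspond to; the class name `SplitSkew` and its method are hypothetical.

```java
public class SplitSkew {

    /**
     * Fraction of evenly spread keys in [0, 10) that fall below the split
     * point, i.e. the share of queries served by the first of two tablets.
     */
    public static double fractionBelow(double split) {
        if (split < 0 || split > 10) {
            throw new IllegalArgumentException("split must be within [0, 10]");
        }
        return split / 10.0;
    }

    public static void main(String[] args) {
        for (double split : new double[] {5.0, 1.0}) {
            double below = fractionBelow(split);
            System.out.printf("split at %.0f: %.0f%% / %.0f%%%n",
                    split, below * 100, (1 - below) * 100);
        }
    }
}
```

Under that assumption a split at 5 gives two tablets an equal share of the queries, while a split at 1 leaves one tablet with nine times the load of the other.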
Re: BatchScanner taking too much time to scan rows
Yes, that's a great way to split the data evenly.

Also, since the data set is so small, turn on data caching for your table:

  shell> config -t mytable -s table.cache.block.enable=true

You may want to increase the size of your tserver JVM, and increase the size of the cache:

  shell> config -s tserver.cache.data.size=1G

This will help with repeated random look-ups.

-Eric

On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal vaibhav.thapliyal...@gmail.com wrote:

Thank you Eric. One thing I would like to know: does pre-splitting the data play a part in querying Accumulo? I ask because I managed to somewhat decrease the querying time.

I did the following: my table was around 1.47 GB, so I explicitly set the split parameter to 256 MB instead of the default 1 GB. So I had just 8 tablets. Now when I carried out the same query, it finished in 15s.

Is it because the split points are more evenly distributed? The previous table, on which the query took 50s, had entries unevenly distributed across the tablets.

Thanks
Vaibhav
Re: BatchScanner taking too much time to scan rows
Thank you Eric. One thing I would like to know: does pre-splitting the data play a part in querying Accumulo? I ask because I managed to decrease the query time somewhat. I did the following: my table was around 1.47 GB, so I explicitly set the split threshold to 256 MB instead of the default 1 GB, which left me with just 8 tablets. When I ran the same query again, it finished in 15 s. Is that because the split points are more evenly distributed? The previous table, on which the query took 50 s, had entries unevenly distributed across the tablets.

Thanks
Vaibhav

On 13-May-2015 7:43 pm, Eric Newton eric.new...@gmail.com wrote:

This use case is one of the things Accumulo was designed to handle well. It's the reason there is a BatchScanner. I've created https://issues.apache.org/jira/browse/ACCUMULO-3813 so we can investigate and track down any problems or improvements. Feel free to add any other details to the JIRA ticket.

-Eric

On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz elahrvi...@ccri.com wrote:

It sounds like each of your ranges is an ID, i.e. a single row. I've found that scanning lots of non-sequential single-row ranges is pretty slow in Accumulo. Your best approach is probably to create an index table on whatever you are originally trying to query (assuming those 10,000 ids came from some other query).

Thanks,
Emilio

On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:

The number of rf files per tablet varies between 2 and 5. The number of entries returned to me by the BatchScanner is 46. The approximate average data rate is 0.5 MB/s as seen on the Accumulo monitor page. A simple scan on the table has an average data rate of about 7-8 MB/s. All the ids exist in the Accumulo table.
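On the pre-splitting question, one way to get evenly loaded tablets is to pick split points at the quantiles of a sample of row ids. The sketch below is a plain-Python model, not Accumulo code: `even_split_points` and `rows_per_tablet` are hypothetical helpers, the row ids are invented, and it assumes a split point is the last row of the lower tablet. On a real table the resulting points would still have to be applied through the shell or the client API.

```python
from bisect import bisect_left
from collections import Counter


def even_split_points(sample_row_ids, num_tablets):
    """Pick up to num_tablets - 1 split points at the sample's quantiles."""
    if num_tablets < 2 or not sample_row_ids:
        return []
    rows = sorted(sample_row_ids)
    splits = []
    for i in range(1, num_tablets):
        split = rows[i * len(rows) // num_tablets]
        if not splits or split != splits[-1]:  # skip duplicate split points
            splits.append(split)
    return splits


def rows_per_tablet(row_ids, splits):
    """Count rows per tablet; a row equal to a split stays in the lower tablet."""
    counts = Counter(bisect_left(splits, row) for row in row_ids)
    return [counts.get(i, 0) for i in range(len(splits) + 1)]


if __name__ == "__main__":
    rows = [f"id{i:04d}" for i in range(1000)]  # made-up row ids
    even = even_split_points(rows, 4)
    print(even, rows_per_tablet(rows, even))
    # A single badly placed split sends most rows to one tablet.
    print(rows_per_tablet(rows, ["id0100"]))
```

With the even splits each of the four tablets holds about a quarter of the rows, while the single split at "id0100" leaves roughly 90% of them in one tablet.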
Re: BatchScanner taking too much time to scan rows
Thank you Eric, I will certainly do that. Should an uneven distribution across the tablets affect querying in Accumulo? In this case, it does. Is this behaviour normal?
Re: BatchScanner taking too much time to scan rows
Do you know how much data is being brought back (e.g. 100 megabytes)? I am wondering what the data rate is in MB/s. Do you know how many files per tablet you have? Do most of the 10,000 ids you are querying for exist?
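The data rate asked about here is just bytes returned divided by elapsed time. A minimal sketch follows; `data_rate_mb_per_s` is a hypothetical helper, and the 25 MiB figure is an assumed example rather than a number reported in this thread.

```python
def data_rate_mb_per_s(bytes_returned, elapsed_seconds):
    """Average transfer rate in MB/s (1 MB taken as 1024 * 1024 bytes)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return bytes_returned / (1024 * 1024) / elapsed_seconds


if __name__ == "__main__":
    # Assumed example: 25 MiB brought back by a query that ran for 50 seconds.
    print(data_rate_mb_per_s(25 * 1024 * 1024, 50))
```

Measuring the bytes on the client side (for instance by summing key and value sizes while iterating) gives a number that can be compared with what the monitor page shows.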
Re: BatchScanner taking too much time to scan rows
How many tablets do you have? The batch scanner does not parallelize operations within a tablet. If you give the batch scanner more threads than there are tservers, it will make multiple parallel RPC calls to each tserver if the tserver has multiple tablets. Each RPC may include multiple tablets and ranges for each tablet. If the batch scanner has fewer threads than tservers, it will make one RPC per tserver per thread. Each RPC call will include all tablets and associated ranges for that tserver.

Keith
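The description above can be turned into a small planning model. This is a simplified sketch of the behaviour as described in this message, not the actual client implementation: `plan_rpcs`, the split points and the tserver names are all hypothetical, and the even division of threads across tservers is an assumption made for the model.

```python
from bisect import bisect_left
from collections import defaultdict


def plan_rpcs(row_ids, split_points, tablet_servers, num_threads):
    """Return a list of (tserver, {tablet index: [row ids]}) RPC descriptions.

    split_points: sorted end rows; tablet i holds rows <= split_points[i],
    and the last tablet holds everything after the final split.
    tablet_servers: tablet_servers[i] names the tserver hosting tablet i.
    """
    if not row_ids:
        return []
    # 1. Bin every single-row range into the tablet that contains it.
    by_server = defaultdict(lambda: defaultdict(list))
    for row in row_ids:
        tablet = bisect_left(split_points, row)
        by_server[tablet_servers[tablet]][tablet].append(row)
    # 2. Spare threads allow several parallel RPCs per tserver.
    per_server = max(1, num_threads // len(by_server))
    # 3. Deal whole tablets out to RPCs; a tablet is never divided.
    rpcs = []
    for server in sorted(by_server):
        tablets = sorted(by_server[server].items())
        chunks = min(per_server, len(tablets))
        for c in range(chunks):
            rpcs.append((server, dict(tablets[c::chunks])))
    return rpcs


if __name__ == "__main__":
    splits = ["c", "f", "i"]            # four tablets
    servers = ["ts1", "ts1", "ts2", "ts2"]
    rows = ["a", "b", "d", "e", "g", "h", "j", "k"]
    for threads in (2, 4, 8):
        print(threads, "threads ->", len(plan_rpcs(rows, splits, servers, threads)), "RPCs")
```

In this model, going from 4 to 8 threads changes nothing because each tserver only hosts two tablets, which mirrors the point that work within a single tablet is not parallelized.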
BatchScanner taking too much time to scan rows
Hi,

I am using a BatchScanner to scan rows from an Accumulo table. The table has around 187 million entries, and I am using a 3-node cluster running Accumulo 1.6.1. I have passed 10,000 ids, which are stored as row ids in my table, as a list to the setRanges() method. The whole process takes around 50 seconds (from adding the ids to the list to scanning the whole table using the BatchScanner). I tried switching on bloom filters, but that didn't help. Also, if anyone could briefly explain how a BatchScanner works and how it does parallel scanning, it would help me better understand what I am doing.

Thanks
Vaibhav
Re: BatchScanner taking too much time to scan rows
On the monitor page, you should see how many threads are running in each tserver, if I remember correctly. There are also graphs to show response rates.
Re: BatchScanner taking too much time to scan rows
I also tried increasing the thread count to a larger number, about 500, but yes, I will try using the BatchScanner with 194 threads too. I will get back shortly with the information Keith asked for.

Thanks
Vaibhav
Re: BatchScanner taking too much time to scan rows
Try using 194 threads if your hardware can support them. The worst that'll happen is the client program crashes during testing. If that happens, cut the number of threads in half. And so on.
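The halving procedure suggested here can be sketched as a small loop. This is an illustration only: `find_workable_thread_count` and `fake_query` are hypothetical, the limit of 60 threads is invented, and a real client crash may have to be detected by a wrapper script rather than by catching an exception.

```python
def find_workable_thread_count(run_query, start_threads):
    """Try run_query(threads), halving the thread count after each failure.

    Returns the first thread count that succeeds, or None if even one
    thread fails.
    """
    threads = start_threads
    while threads >= 1:
        try:
            run_query(threads)
            return threads
        except (MemoryError, RuntimeError):
            threads //= 2  # cut the number of threads in half and retry
    return None


if __name__ == "__main__":
    def fake_query(threads):
        # Stand-in for the real query: pretend the client cannot cope
        # with more than 60 threads.
        if threads > 60:
            raise RuntimeError("too many threads")

    print(find_workable_thread_count(fake_query, 194))
```

Starting from 194, the loop tries 194, then 97, then 48, and stops at the first count that completes.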
Re: BatchScanner taking too much time to scan rows
I have 194 tablets. Currently I pass 20 threads to the createBatchScanner method when creating the BatchScanner.