Re: BatchScanner taking too much time to scan rows

2015-05-14 Thread vaibhav thapliyal
Dylan, could you elaborate on the average query time you had?
Thanks
Vaibhav

Re: BatchScanner taking too much time to scan rows

2015-05-14 Thread Dylan Hutchison
Sorry, just remembered that my setup was to scan an index table and gather
rowIDs, then scan a main data table using the rowIDs as the BatchScan
ranges.  Effectively it is a join of part of the index table to a main data
table.

The scan rate I achieved is therefore double the value I cited previously:
I showed about 76k entries/second.  Still not the best but it is more
within Accumulo standards.


Re: BatchScanner taking too much time to scan rows

2015-05-14 Thread Dylan Hutchison
I didn't have an average query time-- the tablet server crashed.  A quick
solution is to batch the ranges into groups of 50k (or 500k, I forgot which
one) and do many BatchScans-- not ideal.  I think I achieved 33k
entries/second retrieval on a single-node Accumulo.  Accumulo is better for
sequential lookup than random.
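
The workaround above (splitting a very large set of ranges into fixed-size groups and running one BatchScan per group) can be sketched roughly as follows. This is only an illustration in plain Python, not Accumulo client code: `chunked`, `scan_in_batches` and the `scan_chunk` callback are hypothetical names, and `scan_chunk` stands in for "create a BatchScanner, set these ranges, read every entry, close it".

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(row_ids: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most chunk_size row IDs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(row_ids)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def scan_in_batches(
    row_ids: Iterable[str],
    chunk_size: int,
    scan_chunk: Callable[[List[str]], List[str]],
) -> List[str]:
    """Call scan_chunk once per group of row IDs and concatenate the results.

    scan_chunk is a hypothetical stand-in for one complete BatchScan
    over the ranges built from that group of row IDs.
    """
    results: List[str] = []
    for chunk in chunked(row_ids, chunk_size):
        results.extend(scan_chunk(chunk))
    return results
```

With 120,000 row IDs and a group size of 50,000, this would issue three scans (50,000, 50,000 and 20,000 ranges), so no single scan has to carry the full range list. The group size itself is a tuning assumption; the message above mentions 50k or 500k without settling on one.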

Re: BatchScanner taking too much time to scan rows

2015-05-14 Thread Dylan Hutchison
I think this is the same issue I found for ACCUMULO-3710
(https://issues.apache.org/jira/browse/ACCUMULO-3710), only in my case the
tserver ran out of memory.  Accumulo doesn't handle large numbers of small,
disjoint ranges well.  I bet there's room for improvement on both the
client and tablet server.
~Dylan

Re: BatchScanner taking too much time to scan rows

2015-05-13 Thread Eric Newton
This use case is one of the things Accumulo was designed to handle well.
It's the reason there is a BatchScanner.

I've created:

https://issues.apache.org/jira/browse/ACCUMULO-3813

so we can investigate and track down any problems or improvements.

Feel free to add any other details to the JIRA ticket.

-Eric


On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz elahrvi...@ccri.com
wrote:

  It sounds like each of your ranges is an ID, e.g. a single row. I've
 found that scanning lots of non-sequential single-row ranges is pretty slow
 in accumulo. Your best approach is probably to create an index table on
 whatever you are originally trying to query (assuming those 10000 ids came
 from some other query).

 Thanks,

 Emilio


 On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:

  The rf files per tablet vary between 2 to 5 per tablet. The entries
 returned to me by the batchScanner is 46. The approx. average data rate
 is 0.5 MB/s as seen on the accumulo monitor page.

  A simple scan on the table has an average data rate of about 7-8 MB/s.

  All the ids exist in the accumulo table.

 On 12 May 2015 at 23:39, Keith Turner ke...@deenlo.com wrote:

 Do you know how much data is being brought back (e.g. 100 megabytes)? I
 am wondering what the data rate is in MB/s.  Do you know how many files per
 tablet you have?  Do most of the 10,000 ids you are querying for exist?

 On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 I have 194 tablets. Currently I am using 20 threads to create the
 batchscanner inside the createBatchScanner method.
  On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:

   How many tablets do you have?  The batch scanner does not
 parallelize operations within a tablet.

  If you give the batch scanner more threads than there are tservers,
 it will make multiple parallel rpc calls to each tserver if the tserver
 has multiple tablets.  Each rpc may include multiple tablets and ranges for
 each tablet.

  If the batch scanner has fewer threads than tservers, it will make one
 rpc per tserver per thread.  Each rpc call will include all tablets and
 associated ranges for that tserver.

  Keith
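
 The grouping described above (ranges are assigned to tablets, and tablets are grouped by the tserver hosting them, so one rpc can carry several tablets and their ranges) can be pictured with a small model. This is a simplified sketch in plain Python, not the actual client implementation: `bin_rows_by_server`, the split points and the tablet-to-server list are illustrative assumptions, and threads are not modelled. It assumes each split point is the inclusive end row of its tablet.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Sequence


def bin_rows_by_server(
    row_ids: Sequence[str],
    split_points: Sequence[str],
    tablet_to_server: Sequence[str],
) -> Dict[str, Dict[int, List[str]]]:
    """Group single-row lookups by tablet server, then by tablet index.

    Tablet i covers rows in (split_points[i-1], split_points[i]]; the last
    tablet covers everything after the final split point.
    """
    splits = sorted(split_points)
    if len(tablet_to_server) != len(splits) + 1:
        raise ValueError("need exactly one server entry per tablet")
    binned: Dict[str, Dict[int, List[str]]] = defaultdict(lambda: defaultdict(list))
    for row in row_ids:
        # bisect_left finds the first split point >= row, i.e. the tablet
        # whose inclusive end row is at or after this row.
        tablet = bisect_left(splits, row)
        binned[tablet_to_server[tablet]][tablet].append(row)
    return {server: dict(tablets) for server, tablets in binned.items()}
```

 For example, with split points "g" and "p" and tablets hosted on ts1, ts2, ts1, the rows a, g, h, z end up as two rows for tablet 0 and one row for tablet 2 on ts1, and one row for tablet 1 on ts2. A table with many single-row ranges spread over many tablets therefore produces many small per-tablet range lists.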



 On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 Hi,

  I am using BatchScanner to scan rows from an Accumulo table. The
 table has around 187m entries and I am using a 3 node cluster which has
 Accumulo 1.6.1.

  I have passed 10000 ids which are stored as row id in my table as a
 list in the setRanges() method.

  This whole process takes around 50 secs (from adding the ids in the
 list to scanning the whole table using the BatchScanner).

  I tried switching on bloom filters but that didn't work.

  Also, if anyone could briefly explain how a BatchScanner works and how
 it does parallel scanning, it would help me understand what I am doing
 better.

  Thanks
  Vaibhav









Re: BatchScanner taking too much time to scan rows

2015-05-13 Thread Emilio Lahr-Vivaz
It sounds like each of your ranges is an ID, e.g. a single row. I've 
found that scanning lots of non-sequential single-row ranges is pretty 
slow in accumulo. Your best approach is probably to create an index 
table on whatever you are originally trying to query (assuming those
10000 ids came from some other query).


Thanks,

Emilio


Re: BatchScanner taking too much time to scan rows

2015-05-13 Thread Eric Newton
Yes, hot-spotting does affect accumulo because you have fewer servers and
caches handling your request.

Let's say your data is spread out evenly, in a uniform distribution from 0..9.

What if you have only 1 split?  You would want it at 5, to divide the
data in half, and you could host the halves on different servers.  But if
you split at 1, now 10% of your queries go to one tablet, and 90% go to the
other.

-Eric
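
The arithmetic behind the split example above can be checked with a few lines of Python. This is a sketch under stated assumptions: keys are treated as continuous values spread evenly over the interval [0, 10) (one unit per leading digit 0 through 9), every key is equally likely to be queried, and `tablet_load_fractions` is an illustrative name rather than anything from Accumulo.

```python
from typing import List, Sequence


def tablet_load_fractions(
    splits: Sequence[float], lo: float = 0.0, hi: float = 10.0
) -> List[float]:
    """Return the fraction of evenly spread lookups that land in each tablet.

    Tablets are the intervals between consecutive split points inside
    [lo, hi); with n split points there are n + 1 tablets.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if any(not lo < s < hi for s in splits):
        raise ValueError("split points must lie strictly between lo and hi")
    bounds = [lo] + sorted(splits) + [hi]
    width = hi - lo
    return [(right - left) / width for left, right in zip(bounds, bounds[1:])]
```

Under these assumptions a single split at 5 gives two tablets with half of the lookups each, while a single split at 1 gives one tablet with a tenth of the lookups and one with nine tenths, which is the hot spot described above.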

On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal 
vaibhav.thapliyal...@gmail.com wrote:

 Thank you Eric. I will surely do the same. Should uneven distribution
 across the tablets affect querying in accumulo?  In this case, it does. Is
 this behaviour normal?

Re: BatchScanner taking too much time to scan rows

2015-05-13 Thread Eric Newton
Yes, that's a great way to split the data evenly.

Also, since the data set is so small, turn on data caching for your table:

shell> config -t mytable -s table.cache.block.enable=true

You may want to increase the size of your tserver JVM, and increase the
size of the cache:

shell> config -s tserver.cache.data.size=1G

This will help with repeated random look-ups.

-Eric

On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal 
vaibhav.thapliyal...@gmail.com wrote:

 Thank you Eric.

 One thing I would like to know. Does pre-splitting the data play a part in
 querying accumulo?

 Because I managed to somewhat decrease the querying time.
 I did the following steps:
 My table was around 1.47gb so I explicitly set the split parameter to 256mb
 instead of the default 1gb.

 So I had just 8 tablets. Now when I carried out the same query, it
 finished in 15s.

 Is it because the split points are more evenly distributed?

 The previous table on which the query took 50s had entries unevenly
 distributed across the tablets.
 Thanks
 Vaibhav
 On 13-May-2015 7:43 pm, Eric Newton eric.new...@gmail.com wrote:

 This use case is one of the things Accumulo was designed to handle well.
 It's the reason there is a BatchScanner.

 I've created:

 https://issues.apache.org/jira/browse/ACCUMULO-3813

 so we can investigate and track down any problems or improvements.

 Feel free to add any other details to the JIRA ticket.

 -Eric


 On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz elahrvi...@ccri.com
 wrote:

  It sounds like each of your ranges is an ID, e.g. a single row. I've
 found that scanning lots of non-sequential single-row ranges is pretty slow
 in accumulo. Your best approach is probably to create an index table on
 whatever you are originally trying to query (assuming those 1 ids came
 from some other query).

 Thanks,

 Emilio


 On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:

  The rf files per tablet vary between 2 to 5 per tablet. The entries
 returned to me by the batchScanner is 46. The approx. average data rate
 is 0.5 MB/s as seen on the accumulo monitor page.

  A simple scan on the table has an average data rate of about 7-8 MB/s.

  All the ids exist in the accumulo table.

 On 12 May 2015 at 23:39, Keith Turner ke...@deenlo.com wrote:

 Do you know how much data is being brought back (i.e. 100 megabytes)? I
 am wondering what the data rate is in MB/s.  Do you know how many files per
 tablet you have?  Do most of the 10,000 ids you are querying for exist?

 On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 I have 194 tablets. Currently I am using 20 threads to create the
 batchscanner inside the createBatchScanner method.
  On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:

   How many tablets do you have?  The batch scanner does not
 parallelize operations within a tablet.

  If you give the batch scanner more threads than there are tservers,
 it will make multilple parallel rpc calls to each tserver if the tserver
 has multiple tablets.  Each rpc may include multiple tablets and ranges 
 for
 each tablet.

  If the batch scanner has less threads than tservers, it will make
 one rpc per tserver per thread.  Each rpc call will include all tablets 
 and
 associated ranges for that tserver.

  Keith



 On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 Hi,

  I am using BatchScanner to scan rows from a accumulo table. The
 table has around 187m entries and I am using a 3 node cluster which has
 accumulo 1.6.1.

  I have passed 1 ids which are stored as row id in my table as
 a list in the setRanges() method.

  This whole process takes around 50 secs(from adding the ids in the
 list to scanning the whole table using the BatchScanner).

  I tried switching on bloom filters but that didn't work.

  Also if anyone could briefly explain how a BatchScanner works, how
 it does parallel scanning it would help me understand what I am doing
 better.

  Thanks
  Vaibhav










Re: BatchScanner taking too much time to scan rows

2015-05-13 Thread vaibhav thapliyal
Thank you Eric.

One thing I would like to know. Does pre-splitting the data play a part in
querying accumulo?

I ask because I managed to reduce the query time somewhat.
Here is what I did:
My table was around 1.47 GB, so I explicitly set the split threshold to 256 MB
instead of the default 1 GB.

So I had just 8 tablets. Now when I carried out the same query, it finished
in 15s.

Is it because the split points are more evenly distributed?

The previous table on which the query took 50s had entries unevenly
distributed across the tablets.
Thanks
Vaibhav
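
To illustrate why evenly placed split points matter for random lookups, here
is a minimal sketch in plain Java (no Accumulo dependency). It assumes integer
row ids and tablets that each cover (previous split, split], and it reports
the largest share of lookups that lands on a single tablet. The class and
method names are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

// Hypothetical model: how unevenly do lookups spread over tablets?
final class LookupSkewSketch {

    // Index of the tablet holding the id. Tablet i covers
    // (sortedSplits[i-1], sortedSplits[i]]; the last tablet is unbounded.
    // sortedSplits must be sorted ascending with no duplicates.
    static int tabletIndex(int id, int[] sortedSplits) {
        int pos = Arrays.binarySearch(sortedSplits, id);
        // Found: the id equals a split point and belongs to that tablet.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Largest fraction of the lookups that hits any single tablet.
    static double maxShare(int[] ids, int[] sortedSplits) {
        if (ids.length == 0) {
            return 0.0;
        }
        int[] counts = new int[sortedSplits.length + 1];
        for (int id : ids) {
            counts[tabletIndex(id, sortedSplits)]++;
        }
        int max = 0;
        for (int c : counts) {
            max = Math.max(max, c);
        }
        return (double) max / ids.length;
    }
}
```

With the ten ids 0..9, a single split at 4 gives a maximum share of 0.5,
while a single split at 0 gives 0.9, so one tablet (and the tserver hosting
it) ends up serving most of the lookups.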
On 13-May-2015 7:43 pm, Eric Newton eric.new...@gmail.com wrote:

 This use case is one of the things Accumulo was designed to handle well.
 It's the reason there is a BatchScanner.

 I've created:

 https://issues.apache.org/jira/browse/ACCUMULO-3813

 so we can investigate and track down any problems or improvements.

 Feel free to add any other details to the JIRA ticket.

 -Eric


 On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz elahrvi...@ccri.com
 wrote:

  It sounds like each of your ranges is an ID, e.g. a single row. I've
 found that scanning lots of non-sequential single-row ranges is pretty slow
 in accumulo. Your best approach is probably to create an index table on
 whatever you are originally trying to query (assuming those 1 ids came
 from some other query).

 Thanks,

 Emilio


 On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:

  The rf files per tablet vary between 2 to 5 per tablet. The entries
 returned to me by the batchScanner is 46. The approx. average data rate
 is 0.5 MB/s as seen on the accumulo monitor page.

  A simple scan on the table has an average data rate of about 7-8 MB/s.

  All the ids exist in the accumulo table.

 On 12 May 2015 at 23:39, Keith Turner ke...@deenlo.com wrote:

 Do you know how much data is being brought back (i.e. 100 megabytes)? I
 am wondering what the data rate is in MB/s.  Do you know how many files per
 tablet you have?  Do most of the 10,000 ids you are querying for exist?

 On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 I have 194 tablets. Currently I am using 20 threads to create the
 batchscanner inside the createBatchScanner method.
  On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:

   How many tablets do you have?  The batch scanner does not
 parallelize operations within a tablet.

  If you give the batch scanner more threads than there are tservers,
 it will make multilple parallel rpc calls to each tserver if the tserver
 has multiple tablets.  Each rpc may include multiple tablets and ranges 
 for
 each tablet.

  If the batch scanner has less threads than tservers, it will make one
 rpc per tserver per thread.  Each rpc call will include all tablets and
 associated ranges for that tserver.

  Keith



 On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 Hi,

  I am using BatchScanner to scan rows from a accumulo table. The
 table has around 187m entries and I am using a 3 node cluster which has
 accumulo 1.6.1.

  I have passed 1 ids which are stored as row id in my table as a
 list in the setRanges() method.

  This whole process takes around 50 secs(from adding the ids in the
 list to scanning the whole table using the BatchScanner).

  I tried switching on bloom filters but that didn't work.

  Also if anyone could briefly explain how a BatchScanner works, how
 it does parallel scanning it would help me understand what I am doing
 better.

  Thanks
  Vaibhav










Re: BatchScanner taking too much time to scan rows

2015-05-13 Thread vaibhav thapliyal
Thank you, Eric. I will certainly do that. Should an uneven distribution
across the tablets affect query performance in Accumulo?  In this case, it
does. Is this behaviour normal?
On 13-May-2015 10:58 pm, Eric Newton eric.new...@gmail.com wrote:

 Yes, that's a great way to split the data evenly.

 Also, since the data set is so small, turn on data caching for your table:

 shell config -t mytable -s table.cache.block.enable=true

 You may want to increase the size of your tserver JVM, and increase the
 size of the cache:

 shell config -s tserver.cache.data.size=1G

 This will help with repeated random look-ups.

 -Eric

 On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 Thank you Eric.

 One thing I would like to know. Does pre-splitting the data play a part
 in querying accumulo?

 Because I managed to somewhat decrease the querying time.
 I did the following steps:
 My table was around 1.47gb so I explicity set the split parameter to
 256mb instead of the default 1gb.

 So I had just 8 tablets. Now when I carried out the same query, it
 finished in 15s.

 Is it because of the split points are more evenly distributed?

 The previous table on which the query took 50s had entries unevenly
 distributed across the tablets.
 Thanks
 Vaibhav
 On 13-May-2015 7:43 pm, Eric Newton eric.new...@gmail.com wrote:

 This use case is one of the things Accumulo was designed to handle well.
 It's the reason there is a BatchScanner.

 I've created:

 https://issues.apache.org/jira/browse/ACCUMULO-3813

 so we can investigate and track down any problems or improvements.

 Feel free to add any other details to the JIRA ticket.

 -Eric


 On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz elahrvi...@ccri.com
  wrote:

  It sounds like each of your ranges is an ID, e.g. a single row. I've
 found that scanning lots of non-sequential single-row ranges is pretty slow
 in accumulo. Your best approach is probably to create an index table on
 whatever you are originally trying to query (assuming those 1 ids came
 from some other query).

 Thanks,

 Emilio


 On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:

  The rf files per tablet vary between 2 to 5 per tablet. The entries
 returned to me by the batchScanner is 46. The approx. average data rate
 is 0.5 MB/s as seen on the accumulo monitor page.

  A simple scan on the table has an average data rate of about 7-8 MB/s.

  All the ids exist in the accumulo table.

 On 12 May 2015 at 23:39, Keith Turner ke...@deenlo.com wrote:

 Do you know how much data is being brought back (i.e. 100 megabytes)?
 I am wondering what the data rate is in MB/s.  Do you know how many files
 per tablet you have?  Do most of the 10,000 ids you are querying for 
 exist?

 On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 I have 194 tablets. Currently I am using 20 threads to create the
 batchscanner inside the createBatchScanner method.
  On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:

   How many tablets do you have?  The batch scanner does not
 parallelize operations within a tablet.

  If you give the batch scanner more threads than there are
 tservers, it will make multilple parallel rpc calls to each tserver if 
 the
 tserver has multiple tablets.  Each rpc may include multiple tablets and
 ranges for each tablet.

  If the batch scanner has less threads than tservers, it will make
 one rpc per tserver per thread.  Each rpc call will include all tablets 
 and
 associated ranges for that tserver.

  Keith



 On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 Hi,

  I am using BatchScanner to scan rows from a accumulo table. The
 table has around 187m entries and I am using a 3 node cluster which has
 accumulo 1.6.1.

  I have passed 1 ids which are stored as row id in my table as
 a list in the setRanges() method.

  This whole process takes around 50 secs(from adding the ids in
 the list to scanning the whole table using the BatchScanner).

  I tried switching on bloom filters but that didn't work.

  Also if anyone could briefly explain how a BatchScanner works,
 how it does parallel scanning it would help me understand what I am 
 doing
 better.

  Thanks
  Vaibhav











Re: BatchScanner taking too much time to scan rows

2015-05-12 Thread Keith Turner
Do you know how much data is being brought back (e.g. 100 megabytes)? I am
wondering what the data rate is in MB/s.  Do you know how many files per
tablet you have?  Do most of the 10,000 ids you are querying for exist?

On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal 
vaibhav.thapliyal...@gmail.com wrote:

 I have 194 tablets. Currently I am using 20 threads to create the
 batchscanner inside the createBatchScanner method.
 On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:

 How many tablets do you have?  The batch scanner does not parallelize
 operations within a tablet.

 If you give the batch scanner more threads than there are tservers, it
 will make multilple parallel rpc calls to each tserver if the tserver has
 multiple tablets.  Each rpc may include multiple tablets and ranges for
 each tablet.

 If the batch scanner has less threads than tservers, it will make one rpc
 per tserver per thread.  Each rpc call will include all tablets and
 associated ranges for that tserver.

 Keith



 On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 Hi,

 I am using BatchScanner to scan rows from a accumulo table. The table
 has around 187m entries and I am using a 3 node cluster which has accumulo
 1.6.1.

 I have passed 1 ids which are stored as row id in my table as a list
 in the setRanges() method.

 This whole process takes around 50 secs(from adding the ids in the list
 to scanning the whole table using the BatchScanner).

 I tried switching on bloom filters but that didn't work.

 Also if anyone could briefly explain how a BatchScanner works, how it
 does parallel scanning it would help me understand what I am doing better.

 Thanks
 Vaibhav






Re: BatchScanner taking too much time to scan rows

2015-05-12 Thread Keith Turner
How many tablets do you have?  The batch scanner does not parallelize
operations within a tablet.

If you give the batch scanner more threads than there are tservers, it will
make multiple parallel rpc calls to each tserver if the tserver has
multiple tablets.  Each rpc may include multiple tablets and ranges for
each tablet.

If the batch scanner has fewer threads than tservers, it will make one rpc
per tserver per thread.  Each rpc call will include all tablets and
associated ranges for that tserver.

Keith
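
The grouping described above can be sketched in plain Java, without the
Accumulo client library. This is only a model under stated assumptions:
each tablet is identified by its inclusive end row (with "+inf" as a made-up
label for the last tablet), and the tablet-to-tserver assignment is passed in
as a map. All names here are hypothetical; the real client does this
internally and may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of binning single-row lookups by tserver and tablet.
final class BatchScanBinningSketch {

    // Label used here for the last tablet, which has no end row.
    static final String LAST_TABLET = "+inf";

    // A tablet holds the rows in (previous end row, end row], so the tablet
    // for a row is the one with the smallest end row >= the row.
    static String tabletFor(String row, TreeSet<String> splits) {
        String endRow = splits.ceiling(row);
        return endRow == null ? LAST_TABLET : endRow;
    }

    // Groups the rows as server -> tablet -> rows. Each server entry models
    // the tablets and ranges that could travel together in one rpc.
    static Map<String, Map<String, List<String>>> bin(
            List<String> rows,
            TreeSet<String> splits,
            Map<String, String> tabletToServer) {
        Map<String, Map<String, List<String>>> binned = new TreeMap<>();
        for (String row : rows) {
            String tablet = tabletFor(row, splits);
            String server = tabletToServer.get(tablet);
            if (server == null) {
                throw new IllegalArgumentException("no server for tablet " + tablet);
            }
            binned.computeIfAbsent(server, s -> new TreeMap<>())
                  .computeIfAbsent(tablet, t -> new ArrayList<>())
                  .add(row);
        }
        return binned;
    }
}
```

In this model, a server that hosts several of the touched tablets is the case
where extra query threads can help, because its tablets can be divided among
parallel rpc calls; a single tablet is never divided further.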



On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal 
vaibhav.thapliyal...@gmail.com wrote:

 Hi,

 I am using BatchScanner to scan rows from a accumulo table. The table has
 around 187m entries and I am using a 3 node cluster which has accumulo
 1.6.1.

 I have passed 1 ids which are stored as row id in my table as a list
 in the setRanges() method.

 This whole process takes around 50 secs(from adding the ids in the list to
 scanning the whole table using the BatchScanner).

 I tried switching on bloom filters but that didn't work.

 Also if anyone could briefly explain how a BatchScanner works, how it does
 parallel scanning it would help me understand what I am doing better.

 Thanks
 Vaibhav





BatchScanner taking too much time to scan rows

2015-05-12 Thread vaibhav thapliyal
Hi,

I am using a BatchScanner to scan rows from an Accumulo table. The table has
around 187 million entries, and I am using a 3-node cluster running Accumulo
1.6.1.

I pass 10,000 ids, which are stored as row ids in my table, as a list to
the setRanges() method.

The whole process takes around 50 seconds (from adding the ids to the list to
finishing the scan with the BatchScanner).

I tried switching on bloom filters but that didn't work.

Also, if anyone could briefly explain how a BatchScanner works and how it
scans in parallel, it would help me better understand what I am doing.

Thanks
Vaibhav


Re: BatchScanner taking too much time to scan rows

2015-05-12 Thread David Medinets
On the monitor page, you should see how many threads are running in
each tserver, if I remember correctly. There are also graphs to show
response rates.

On Tue, May 12, 2015 at 2:39 PM, vaibhav thapliyal
vaibhav.thapliyal...@gmail.com wrote:
 I also tried to increase threads to a bigger number about 500, but yes I
 will try using batchscanner with 194 threads too.  I will get back with the
 info that Keith has asked in some time.

 Thanks
 Vaibhav

 On 13-May-2015 12:04 am, David Medinets david.medin...@gmail.com wrote:

 Try using 194 threads if your hardware can support them. The worst
 that'll happen is the client program crashes during testing. If that
 happens, cut the number of threads in half. And so on.

 On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal
 vaibhav.thapliyal...@gmail.com wrote:
  I have 194 tablets. Currently I am using 20 threads to create the
  batchscanner inside the createBatchScanner method.
 
  On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:
 
  How many tablets do you have?  The batch scanner does not parallelize
  operations within a tablet.
 
  If you give the batch scanner more threads than there are tservers, it
  will make multilple parallel rpc calls to each tserver if the tserver
  has
  multiple tablets.  Each rpc may include multiple tablets and ranges for
  each
  tablet.
 
  If the batch scanner has less threads than tservers, it will make one
  rpc
  per tserver per thread.  Each rpc call will include all tablets and
  associated ranges for that tserver.
 
  Keith
 
 
 
  On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal
  vaibhav.thapliyal...@gmail.com wrote:
 
  Hi,
 
  I am using BatchScanner to scan rows from a accumulo table. The table
  has
  around 187m entries and I am using a 3 node cluster which has accumulo
  1.6.1.
 
  I have passed 1 ids which are stored as row id in my table as a
  list
  in the setRanges() method.
 
  This whole process takes around 50 secs(from adding the ids in the
  list
  to scanning the whole table using the BatchScanner).
 
  I tried switching on bloom filters but that didn't work.
 
  Also if anyone could briefly explain how a BatchScanner works, how it
  does parallel scanning it would help me understand what I am doing
  better.
 
  Thanks
  Vaibhav
 
 
 
 


Re: BatchScanner taking too much time to scan rows

2015-05-12 Thread vaibhav thapliyal
I also tried increasing the thread count to a larger number, about 500, but
yes, I will try the BatchScanner with 194 threads too.  I will get back
shortly with the information Keith asked for.

Thanks
Vaibhav
On 13-May-2015 12:04 am, David Medinets david.medin...@gmail.com wrote:

 Try using 194 threads if your hardware can support them. The worst
 that'll happen is the client program crashes during testing. If that
 happens, cut the number of threads in half. And so on.

 On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal
 vaibhav.thapliyal...@gmail.com wrote:
  I have 194 tablets. Currently I am using 20 threads to create the
  batchscanner inside the createBatchScanner method.
 
  On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:
 
  How many tablets do you have?  The batch scanner does not parallelize
  operations within a tablet.
 
  If you give the batch scanner more threads than there are tservers, it
  will make multilple parallel rpc calls to each tserver if the tserver
 has
  multiple tablets.  Each rpc may include multiple tablets and ranges for
 each
  tablet.
 
  If the batch scanner has less threads than tservers, it will make one
 rpc
  per tserver per thread.  Each rpc call will include all tablets and
  associated ranges for that tserver.
 
  Keith
 
 
 
  On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal
  vaibhav.thapliyal...@gmail.com wrote:
 
  Hi,
 
  I am using BatchScanner to scan rows from a accumulo table. The table
 has
  around 187m entries and I am using a 3 node cluster which has accumulo
  1.6.1.
 
  I have passed 1 ids which are stored as row id in my table as a
 list
  in the setRanges() method.
 
  This whole process takes around 50 secs(from adding the ids in the list
  to scanning the whole table using the BatchScanner).
 
  I tried switching on bloom filters but that didn't work.
 
  Also if anyone could briefly explain how a BatchScanner works, how it
  does parallel scanning it would help me understand what I am doing
 better.
 
  Thanks
  Vaibhav
 
 
 
 



Re: BatchScanner taking too much time to scan rows

2015-05-12 Thread David Medinets
Try using 194 threads if your hardware can support them. The worst
that'll happen is the client program crashes during testing. If that
happens, cut the number of threads in half. And so on.
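
The trial-and-error procedure above could be scripted roughly as follows.
This is a sketch in plain Java; `runsOk` is a hypothetical callback standing
in for "run the query with n threads and report whether the client
survived", and is not part of any Accumulo API.

```java
import java.util.function.IntPredicate;

// Hypothetical helper: start high and halve the thread count until a run works.
final class ThreadCountBackoffSketch {

    // Returns the first thread count in the sequence start, start/2, start/4, ...
    // for which runsOk reports success, or -1 if none of them works.
    static int firstWorkingThreadCount(int start, IntPredicate runsOk) {
        for (int n = start; n >= 1; n /= 2) {
            if (runsOk.test(n)) {
                return n;
            }
        }
        return -1;
    }
}
```

Starting from 194, the candidates tried are 194, 97, 48, 24, and so on down
to 1, stopping at the first count that works.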

On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal
vaibhav.thapliyal...@gmail.com wrote:
 I have 194 tablets. Currently I am using 20 threads to create the
 batchscanner inside the createBatchScanner method.

 On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:

 How many tablets do you have?  The batch scanner does not parallelize
 operations within a tablet.

 If you give the batch scanner more threads than there are tservers, it
 will make multilple parallel rpc calls to each tserver if the tserver has
 multiple tablets.  Each rpc may include multiple tablets and ranges for each
 tablet.

 If the batch scanner has less threads than tservers, it will make one rpc
 per tserver per thread.  Each rpc call will include all tablets and
 associated ranges for that tserver.

 Keith



 On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal
 vaibhav.thapliyal...@gmail.com wrote:

 Hi,

 I am using BatchScanner to scan rows from a accumulo table. The table has
 around 187m entries and I am using a 3 node cluster which has accumulo
 1.6.1.

 I have passed 1 ids which are stored as row id in my table as a list
 in the setRanges() method.

 This whole process takes around 50 secs(from adding the ids in the list
 to scanning the whole table using the BatchScanner).

 I tried switching on bloom filters but that didn't work.

 Also if anyone could briefly explain how a BatchScanner works, how it
 does parallel scanning it would help me understand what I am doing better.

 Thanks
 Vaibhav






Re: BatchScanner taking too much time to scan rows

2015-05-12 Thread vaibhav thapliyal
I have 194 tablets. Currently I pass 20 threads to the createBatchScanner
method when creating the BatchScanner.
On 12-May-2015 11:19 pm, Keith Turner ke...@deenlo.com wrote:

 How many tablets do you have?  The batch scanner does not parallelize
 operations within a tablet.

 If you give the batch scanner more threads than there are tservers, it
 will make multilple parallel rpc calls to each tserver if the tserver has
 multiple tablets.  Each rpc may include multiple tablets and ranges for
 each tablet.

 If the batch scanner has less threads than tservers, it will make one rpc
 per tserver per thread.  Each rpc call will include all tablets and
 associated ranges for that tserver.

 Keith



 On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal 
 vaibhav.thapliyal...@gmail.com wrote:

 Hi,

 I am using BatchScanner to scan rows from a accumulo table. The table has
 around 187m entries and I am using a 3 node cluster which has accumulo
 1.6.1.

 I have passed 1 ids which are stored as row id in my table as a list
 in the setRanges() method.

 This whole process takes around 50 secs(from adding the ids in the list
 to scanning the whole table using the BatchScanner).

 I tried switching on bloom filters but that didn't work.

 Also if anyone could briefly explain how a BatchScanner works, how it
 does parallel scanning it would help me understand what I am doing better.

 Thanks
 Vaibhav