Hi Vamsi,

I see from your counters that the number of map spill records is
double the number of map output records, which suggests that each map
output record is being spilled to disk more than once during the
map-side sort. Raising the mapreduce.task.io.sort.mb setting on the
job should cut down on those extra spills and improve the shuffle
throughput.
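
For example, something along these lines (a sketch only: the jar path,
table name, input path and the 512MB value are placeholders to adjust
for your setup, and this assumes the tool is launched through
ToolRunner so that generic -D options are picked up):

    # Keep your existing arguments and just add the -D option.
    hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -D mapreduce.task.io.sort.mb=512 \
        --table MY_TABLE \
        --input /data/my_table.csv

If the map-side spill counter ends up roughly equal to the map output
records after that, the sort buffer is large enough.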

However, like I said before, I think that the first thing to try is
increasing the number of regions.
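
As a rough sketch of what I mean (the table name is a placeholder and
the numbers are only an example): you could ask HBase to split the
existing regions, or, since this is a performance test that you will
be reloading anyway, recreate the table with a higher SALT_BUCKETS
value so that it starts out with more regions.

    # Sketch only: asks HBase to split every region of the table once,
    # roughly doubling the region count.
    echo "split 'MY_TABLE'" | hbase shell

With 13 nodes, ending up with something like 3 or 4 regions per node
would already let the bulk load run several reducers per node instead
of one.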

Indeed, increasing the number of regions can potentially increase
read parallelism in Phoenix, but Phoenix already breaks reads within a
region into sub-region chunks internally, so there probably won't be a
huge effect either way in terms of read performance.

Aggregate queries shouldn't be impacted much either way. The
sub-region parallelism that Phoenix applies to reads is still in place
regardless of the region count. In addition, aggregation is done per
region (or per sub-region split), and the partial results are then
combined to give the overall aggregate. Having five times as many
regions (for example) would increase the number of partial results
that need to be combined, but that is still very minor in comparison
to the total amount of work required to do the aggregation, so it also
shouldn't have a major effect either way.

- Gabriel

On Wed, Mar 16, 2016 at 7:15 PM, Vamsi Krishna <vamsi.attl...@gmail.com> wrote:
> Thanks Gabriel,
> Please find the job counters attached.
>
> Would increasing the splitting affect the reads?
> I assume a simple read would benefit from increased splitting, as it
> increases parallelism.
> But how would it impact aggregate queries?
>
> Vamsi Attluri
>
> On Wed, Mar 16, 2016 at 9:06 AM Gabriel Reid <gabriel.r...@gmail.com> wrote:
>>
>> Hi Vamsi,
>>
>> The first thing that I notice looking at the info that you've posted
>> is that you have 13 nodes and 13 salt buckets (which I assume also
>> means that you have 13 regions).
>>
>> A single region is the unit of parallelism that is used for reducers
>> in the CsvBulkLoadTool (or any HFile-writing MapReduce job in general), so
>> currently you're only getting an average of a single reduce process
>> per node on your cluster. Assuming that you have multiple cores in
>> each of those nodes, you will probably get a decent improvement in
>> performance by further splitting your destination table so that it has
>> multiple regions per node (thereby triggering multiple reduce tasks
>> per node).
>>
>> Would you also be able to post the full set of job counters that are
>> shown after the job completes? That would also be helpful in
>> pinpointing other settings that could possibly be tuned.
>>
>> - Gabriel
>>
>>
>> On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna <vamsi.attl...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase
>> > table.
>> >
>> > HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
>> > CSV file size: 97.6 GB
>> > No. of records: 1,439,000,238
>> > Cluster: 13 node
>> > Phoenix table salt-buckets: 13
>> > Phoenix table compression: snappy
>> > HBase table size after loading: 26.6 GB
>> >
>> > The job completed in 1hrs, 39mins, 43sec.
>> > Average Map Time:     5mins, 25sec
>> > Average Shuffle Time: 47mins, 46sec
>> > Average Merge Time:   12mins, 22sec
>> > Average Reduce Time:  32mins, 9sec
>> >
>> > I'm looking for an opportunity to tune this job.
>> > Could someone please help me with some pointers on how to tune this job?
>> > Please let me know if you need to know any cluster configuration
>> > parameters
>> > that I'm using.
>> >
>> > This is only a performance test. My PRODUCTION data file is 7x bigger.
>> >
>> > Thanks,
>> > Vamsi Attluri
>> >
>> > --
>> > Vamsi Attluri
>
> --
> Vamsi Attluri
