Thanks Gabriel. Will try that.

On Thu, Mar 17, 2016 at 3:33 AM Gabriel Reid <gabriel.r...@gmail.com> wrote:

> Hi Vamsi,
>
> I see from your counters that the number of map spill records is
> double the number of map output records, so I think that raising the
> mapreduce.task.io.sort.mb setting on the job should improve the
> shuffle throughput.
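>
> As a concrete example (the 512 MB value here is just a starting point to
> experiment with, not a recommendation tuned for your cluster), you can
> pass the setting directly to the bulk load job:
>
>     hadoop jar phoenix-<version>-client.jar \
>         org.apache.phoenix.mapreduce.CsvBulkLoadTool \
>         -Dmapreduce.task.io.sort.mb=512 \
>         --table YOUR_TABLE --input /path/to/data.csv
>
> Keep in mind that the sort buffer lives inside the map task heap, so you
> may also need to raise mapreduce.map.memory.mb / mapreduce.map.java.opts
> to go along with it.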
>
> However, like I said before, I think that the first thing to try is
> increasing the number of regions.
>
> To answer your question about reads: increasing the number of regions
> can potentially increase read parallelism, but Phoenix already performs
> parallel sub-region reads internally, so in practice there probably
> won't be a big difference either way in read performance.
>
> Aggregate queries shouldn't be impacted much either way. The increased
> parallelism that Phoenix gets from sub-region reads is still in place
> regardless. In addition, aggregate reads are done per region (or
> sub-region split), and then the aggregation results are combined to
> give the whole aggregate result. Having five times as many regions
> (for example) would increase the number of portions of the aggregation
> that need to be combined, but this should still be very minor in
> comparison to the total amount of work required to do aggregations, so
> it also shouldn't have a major effect either way.
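>
> To make that concrete, for an aggregate query like the following (the
> table and column names are just made up for illustration):
>
>     SELECT host, COUNT(*) FROM metrics GROUP BY host;
>
> each region (or sub-region chunk) computes its partial counts in
> parallel on the region servers, and the client then merges those
> partial results into the final answer, so the merge step is cheap
> compared to the scan itself.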
>
> - Gabriel
>
> On Wed, Mar 16, 2016 at 7:15 PM, Vamsi Krishna <vamsi.attl...@gmail.com>
> wrote:
> > Thanks Gabriel,
> > Please find the job counters attached.
> >
> > Would increasing the splitting affect the reads?
> > I assume a simple read would benefit from increased splitting, since it
> > increases parallelism.
> > But, how would it impact the aggregate queries?
> >
> > Vamsi Attluri
> >
> > On Wed, Mar 16, 2016 at 9:06 AM Gabriel Reid <gabriel.r...@gmail.com>
> > wrote:
> >>
> >> Hi Vamsi,
> >>
> >> The first thing that I notice looking at the info that you've posted
> >> is that you have 13 nodes and 13 salt buckets (which I assume also
> >> means that you have 13 regions).
> >>
> >> A single region is the unit of parallelism that is used for reducers
> >> in the CsvBulkLoadTool (or HFile-writing MapReduce job in general), so
> >> currently you're only getting an average of a single reduce process
> >> per node on your cluster. Assuming that you have multiple cores in
> >> each of those nodes, you will probably get a decent improvement in
> >> performance by further splitting your destination table so that it has
> >> multiple regions per node (thereby triggering multiple reduce tasks
> >> per node).
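> >>
> >> For example, recreating the table with more salt buckets would give you
> >> multiple regions (and therefore multiple reducers) per node. The schema
> >> below is hypothetical, and 39 is purely illustrative (3 buckets per node
> >> on 13 nodes), not a tuned value:
> >>
> >>     CREATE TABLE my_table (
> >>         id  VARCHAR NOT NULL PRIMARY KEY,
> >>         val VARCHAR
> >>     ) COMPRESSION='SNAPPY', SALT_BUCKETS = 39;
> >>
> >> Salt buckets can't be changed on an existing table, so this means
> >> recreating and reloading, or alternatively pre-splitting the table with
> >> explicit SPLIT ON points at creation time.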
> >>
> >> Could you also post the full set of job counters that are shown after
> >> the job completes? That would help pinpoint anything else that could
> >> possibly be tuned.
> >>
> >> - Gabriel
> >>
> >>
> >> On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna <vamsi.attl...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase
> >> > table.
> >> >
> >> > HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
> >> > CSV file size: 97.6 GB
> >> > No. of records: 1,439,000,238
> >> > Cluster: 13 node
> >> > Phoenix table salt-buckets: 13
> >> > Phoenix table compression: snappy
> >> > HBase table size after loading: 26.6 GB
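> >> >
> >> > For reference, the command I'm running looks roughly like this (table
> >> > name and paths are placeholders):
> >> >
> >> >     hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
> >> >         org.apache.phoenix.mapreduce.CsvBulkLoadTool \
> >> >         --table MY_TABLE \
> >> >         --input /data/my_file.csv \
> >> >         --zookeeper zk-host:2181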
> >> >
> >> > The job completed in 1 hr, 39 mins, 43 sec.
> >> > Average Map Time:     5 mins, 25 sec
> >> > Average Shuffle Time: 47 mins, 46 sec
> >> > Average Merge Time:   12 mins, 22 sec
> >> > Average Reduce Time:  32 mins, 9 sec
> >> >
> >> > I'm looking for an opportunity to tune this job.
> >> > Could someone please give me some pointers on how to tune it?
> >> > Please let me know if you need any of the cluster configuration
> >> > parameters that I'm using.
> >> >
> >> > This is only a performance test. My PRODUCTION data file is 7x bigger.
> >> >
> >> > Thanks,
> >> > Vamsi Attluri
> >> >
> >> > --
> >> > Vamsi Attluri
> >
> > --
> > Vamsi Attluri
>
-- 
Vamsi Attluri
