Thanks Gabriel. Will try that.

On Thu, Mar 17, 2016 at 3:33 AM Gabriel Reid <gabriel.r...@gmail.com> wrote:
> Hi Vamsi,
>
> I see from your counters that the number of map spill records is double
> the number of map output records, so I think that raising the
> mapreduce.task.io.sort.mb setting on the job should improve the shuffle
> throughput.
>
> However, like I said before, I think that the first thing to try is
> increasing the number of regions.
>
> Indeed, increasing the number of regions can potentially increase
> parallelism for reads by Phoenix, although Phoenix actually does
> sub-region reads internally as-is, so there probably won't be a huge
> effect either way in terms of read performance.
>
> Aggregate queries shouldn't be impacted much either way. The increased
> parallelism that Phoenix uses to do sub-region reads is still in place
> regardless. In addition, aggregate reads are done per region (or
> sub-region split), and the aggregation results are then combined to give
> the whole aggregate result. Having five times as many regions (for
> example) would increase the number of portions of the aggregation that
> need to be combined, but this should still be very minor in comparison
> to the total amount of work required to do aggregations, so it also
> shouldn't have a major effect either way.
>
> - Gabriel
>
> On Wed, Mar 16, 2016 at 7:15 PM, Vamsi Krishna <vamsi.attl...@gmail.com> wrote:
> > Thanks Gabriel,
> > Please find the job counters attached.
> >
> > Would increasing the splitting affect the reads?
> > I assume a simple read would benefit from increased splitting, as it
> > increases the parallelism.
> > But how would it impact aggregate queries?
> >
> > Vamsi Attluri
> >
> > On Wed, Mar 16, 2016 at 9:06 AM Gabriel Reid <gabriel.r...@gmail.com> wrote:
> >>
> >> Hi Vamsi,
> >>
> >> The first thing that I notice looking at the info you've posted is
> >> that you have 13 nodes and 13 salt buckets (which I assume also means
> >> that you have 13 regions).
> >>
> >> A single region is the unit of parallelism used for reducers in the
> >> CsvBulkLoadTool (or any HFile-writing MapReduce job in general), so
> >> currently you're only getting an average of a single reduce process
> >> per node on your cluster. Assuming that you have multiple cores in
> >> each of those nodes, you will probably get a decent improvement in
> >> performance by further splitting your destination table so that it
> >> has multiple regions per node (thereby triggering multiple reduce
> >> tasks per node).
> >>
> >> Would you also be able to post the full set of job counters that are
> >> shown after the job completes? That would also be helpful in
> >> pinpointing things that can (possibly) be tuned.
> >>
> >> - Gabriel
> >>
> >>
> >> On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna <vamsi.attl...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > I'm using the CsvBulkLoadTool to load a CSV data file into a
> >> > Phoenix/HBase table.
> >> >
> >> > HDP version: 2.3.2 (Phoenix version: 4.4.0, HBase version: 1.1.2)
> >> > CSV file size: 97.6 GB
> >> > No. of records: 1,439,000,238
> >> > Cluster: 13 nodes
> >> > Phoenix table salt buckets: 13
> >> > Phoenix table compression: snappy
> >> > HBase table size after loading: 26.6 GB
> >> >
> >> > The job completed in 1 hr, 39 min, 43 sec.
> >> > Average map time: 5 min, 25 sec
> >> > Average shuffle time: 47 min, 46 sec
> >> > Average merge time: 12 min, 22 sec
> >> > Average reduce time: 32 min, 9 sec
> >> >
> >> > I'm looking for an opportunity to tune this job.
> >> > Could someone please help me with some pointers on how to tune this job?
> >> > Please let me know if you need to know any cluster configuration
> >> > parameters that I'm using.
> >> >
> >> > This is only a performance test. My PRODUCTION data file is 7x bigger.
> >> >
> >> > Thanks,
> >> > Vamsi Attluri
> >> >
> >> > --
> >> > Vamsi Attluri
> >
> > --
> > Vamsi Attluri

--
Vamsi Attluri
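
A minimal sketch of the mapreduce.task.io.sort.mb override Gabriel suggests above, passed as a generic Hadoop option to the bulk load job. This is only illustrative: the jar path (typical HDP 2.3.2 layout), table name, input path, ZooKeeper quorum, and the 512 MB value are assumptions, not values from the thread.

    # Hedged sketch: pass the larger map-side sort buffer as a -D option to the
    # CsvBulkLoadTool job. 512 MB is only a starting point; the aim is to bring
    # the "Spilled Records" counter down toward "Map output records" on a re-run.
    hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -Dmapreduce.task.io.sort.mb=512 \
        --table MY_TABLE \
        --input /data/my_table/input.csv \
        --zookeeper zk1,zk2,zk3:2181

Note that the map task heap (mapreduce.map.java.opts) needs enough headroom to hold the larger sort buffer, so that setting may need to be raised alongside it.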
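
Gabriel's other suggestion, more regions so that each node runs several reduce tasks, maps to the table's salt bucket count in this setup. A possible sketch, assuming the table can be dropped and reloaded; the column list is made up and only the options clause is the point. 52 buckets on a 13-node cluster would pre-split the table into 52 regions, roughly 4 regions (and therefore 4 reducers) per node.

    -- Hypothetical DDL: the real schema isn't shown in the thread.
    -- SALT_BUCKETS pre-splits the table into that many regions; it cannot be
    -- changed on an existing table, so the table would be recreated and reloaded.
    CREATE TABLE IF NOT EXISTS MY_TABLE (
        RECORD_ID  VARCHAR NOT NULL PRIMARY KEY,
        COL1       VARCHAR,
        COL2       DECIMAL
    )
    SALT_BUCKETS = 52,
    COMPRESSION = 'SNAPPY';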