Hi Vamsi,

The first thing I notice in the info you've posted is that you have 13 nodes and 13 salt buckets (which I assume also means that you have 13 regions).
A single region is the unit of parallelism used for reducers in the CsvBulkLoadTool (or in any HFile-writing MapReduce job), so you're currently getting an average of only one reduce task per node on your cluster. Assuming that each of those nodes has multiple cores, you will probably get a decent improvement in performance by further splitting your destination table so that it has multiple regions per node (thereby triggering multiple reduce tasks per node).

Would you also be able to post the full set of job counters that are shown after the job has completed? They would help in pinpointing other things that could possibly be tuned.

- Gabriel

On Wed, Mar 16, 2016 at 1:28 PM, Vamsi Krishna <vamsi.attl...@gmail.com> wrote:
> Hi,
>
> I'm using CsvBulkLoadTool to load a csv data file into Phoenix/HBase table.
>
> HDP Version : 2.3.2 (Phoenix Version : 4.4.0, HBase Version: 1.1.2)
> CSV file size: 97.6 GB
> No. of records: 1,439,000,238
> Cluster: 13 node
> Phoenix table salt-buckets: 13
> Phoenix table compression: snappy
> HBase table size after loading: 26.6 GB
>
> The job completed in 1hrs, 39mins, 43sec.
> Average Map Time 5mins, 25sec
> Average Shuffle Time 47mins, 46sec
> Average Merge Time 12mins, 22sec
> Average Reduce Time 32mins, 9sec
>
> I'm looking for an opportunity to tune this job.
> Could someone please help me with some pointers on how to tune this job?
> Please let me know if you need to know any cluster configuration parameters
> that I'm using.
>
> This is only a performance test. My PRODUCTION data file is 7x bigger.
>
> Thanks,
> Vamsi Attluri
>
> --
> Vamsi Attluri
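
To make the arithmetic behind the region-splitting suggestion concrete, here is a minimal sketch in Python. It assumes one reduce task per region, as described in the reply, and uses a made-up figure of 8 cores per node, since the actual core count was not posted. The function names are hypothetical helpers for illustration only, not part of Phoenix or HBase.

```python
import math


def reducers_per_node(num_regions: int, num_nodes: int) -> float:
    """Average reduce tasks per node, assuming one reducer per region."""
    return num_regions / num_nodes


def regions_for_target(num_nodes: int, cores_per_node: int,
                       reducers_per_core: float = 1.0) -> int:
    """Region count needed to reach a chosen number of reducers per core."""
    return math.ceil(num_nodes * cores_per_node * reducers_per_core)


# Current setup from the thread: 13 nodes, 13 salt buckets (assumed 13 regions).
current = reducers_per_node(num_regions=13, num_nodes=13)

# Hypothetical target: 8 cores per node (an assumption, not a posted figure).
target = regions_for_target(num_nodes=13, cores_per_node=8)

print(f"current reducers per node: {current}")
print(f"regions for one reducer per core: {target}")
```

Whether that many regions is actually a good fit depends on the memory and container slots available on each node, so the result is better treated as a starting point for testing than as a recommendation.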