how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Vamsi Krishna
Hi, I'm using CsvBulkLoadTool to load a CSV data file into a Phoenix/HBase table.
HDP Version: 2.3.2 (Phoenix Version: 4.4.0, HBase Version: 1.1.2)
CSV file size: 97.6 GB
No. of records: 1,439,000,238
Cluster: 13 nodes
Phoenix table salt-buckets: 13
Phoenix table compression: snappy
HBase table si…
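
[Editor's note: for context, a bulk load like the one described above is typically launched along these lines. This is a hedged sketch: the table name, input path, ZooKeeper quorum, and phoenix-client jar location are placeholders (the jar path assumes an HDP layout), not details taken from this thread.]

    # Placeholder table, input path, and ZooKeeper quorum.
    # HADOOP_CLASSPATH pulls in the HBase dependencies the tool needs at runtime.
    HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf \
      hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --table MY_TABLE \
      --input /data/my_table.csv \
      --zookeeper zk1,zk2,zk3:2181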

Re: how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Gabriel Reid
Hi Vamsi,
The first thing that I notice looking at the info that you've posted is that you have 13 nodes and 13 salt buckets (which I assume also means that you have 13 regions). A single region is the unit of parallelism that is used for reducers in the CsvBulkLoadTool (or HFile-writing MapReduce …
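
[Editor's note: if the fix being hinted at is to spread the table over more regions, one way to do that for a salted table is to declare more salt buckets when (re)creating it, since each bucket becomes a region and hence a reducer for the HFile-writing job. A hedged sketch with placeholder table/column names and an arbitrary bucket count (52 = 4 per node on a 13-node cluster); the psql.py path assumes an HDP install.]

    # Placeholder DDL: more salt buckets => more regions => more reducers
    # for the HFile-writing job. 52 is illustrative, not a recommendation.
    echo "CREATE TABLE MY_TABLE (
        ID   VARCHAR NOT NULL PRIMARY KEY,
        COL1 VARCHAR
    ) SALT_BUCKETS = 52, COMPRESSION = 'SNAPPY';" > create_my_table.sql

    # Run the DDL through the Phoenix psql tool against the cluster's ZooKeeper quorum.
    /usr/hdp/current/phoenix-client/bin/psql.py zk1:2181 create_my_table.sql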

Re: how to tune phoenix CsvBulkLoadTool job

2016-03-19 Thread Gabriel Reid
Hi Vamsi,
I see from your counters that the number of map spill records is double the number of map output records, so I think that raising the mapreduce.task.io.sort.mb setting on the job should improve the shuffle throughput. However, like I said before, I think that the first thing to try is i…
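
[Editor's note: the suggestion above translates to passing the setting as a generic -D option when launching the tool. A hedged sketch; the 512 MB sort buffer, heap, and container sizes are illustrative values, and the jar path, table, and input are placeholders as before.]

    # The sort buffer lives inside the map task heap, so mapreduce.map.java.opts and
    # mapreduce.map.memory.mb have to leave room for it.
    # Generic -D options must come before the tool-specific options.
    HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf \
      hadoop jar /usr/hdp/current/phoenix-client/phoenix-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      -Dmapreduce.task.io.sort.mb=512 \
      -Dmapreduce.map.memory.mb=2048 \
      -Dmapreduce.map.java.opts=-Xmx1638m \
      --table MY_TABLE \
      --input /data/my_table.csv

[If the buffer is large enough, the spilled-records counter should drop back toward the map output record count instead of double it.]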

Re: how to tune phoenix CsvBulkLoadTool job

2016-03-21 Thread Vamsi Krishna
Thanks Gabriel. Will try that.

On Thu, Mar 17, 2016 at 3:33 AM Gabriel Reid wrote:
> Hi Vamsi,
>
> I see from your counters that the number of map spill records is
> double the number of map output records, so I think that raising the
> mapreduce.task.io.sort.mb setting on the job should improve …