Hi,

I need some help understanding how CsvBulkLoadTool works. I am trying to
load ~200 GB of data (100 files of 2 GB each) from HDFS into Phoenix on a
cluster with 1 master and 4 region servers. Each region server has 32 GB RAM
and 16 cores. Total HDFS disk space is 4 TB.

The table is salted with 16 buckets, so 4 regions per region server. There
are 400 columns and more than 30 local indexes.
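
For context, the DDL is roughly of the following shape (the column and index
names below are simplified placeholders, not the real schema):

    CREATE TABLE TABLE_SNAPSHOT (
        ID   VARCHAR NOT NULL PRIMARY KEY,
        COL1 VARCHAR,
        COL2 DECIMAL
        -- ... roughly 400 columns in total
    ) SALT_BUCKETS = 16;

    -- one of the 30+ local indexes
    CREATE LOCAL INDEX IDX_COL1 ON TABLE_SNAPSHOT (COL1);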

Here is the command I am using:
HADOOP_CLASSPATH=/usr/lib/hbase/hbase-protocol.jar:/usr/lib/hbase/conf \
hadoop jar /usr/lib/phoenix/phoenix-client.jar \
org.apache.phoenix.mapreduce.CsvBulkLoadTool \
-Dfs.permissions.umask-mode=000 \
--table TABLE_SNAPSHOT --input /user/table/*.csv

The job proceeds normally but gets stuck in the reduce phase at around 90%.
I also observed that it initially uses the full resources of the cluster but
drops to much lower usage near completion (about 10 percent of RAM and cores).

What exactly is happening behind the scenes? How can I tune it to run
faster? I am using HBase + HDFS deployed on YARN on AWS.

Any help is appreciated.

Thanks
Chaitanya
