Hi, I need some help understanding how CsvBulkLoadTool works. I am trying to load ~200 GB of data (100 files of 2 GB each) from HDFS into Phoenix with 1 master and 4 region servers. Each region server has 32 GB RAM and 16 cores. Total HDFS disk space is 4 TB.
The table is salted with 16 buckets, so 4 regions per region server. There are 400 columns and more than 30 local indexes. Here is the command I am using:

HADOOP_CLASSPATH=/usr/lib/hbase/hbase-protocol.jar:/usr/lib/hbase/conf hadoop jar /usr/lib/phoenix/phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table TABLE_SNAPSHOT --input /user/table/*.csv/

The job proceeds normally but gets stuck in the reduce phase at around 90%. I also observed that it initially used the full resources of the cluster but uses far fewer near completion (about 10% of the RAM and cores). What exactly is happening behind the scenes? How can I tune it to run faster?

I am using HBase + HDFS deployed on YARN on AWS. Any help is appreciated.

Thanks,
Chaitanya

--
View this message in context: http://apache-phoenix-user-list.1124778.n5.nabble.com/Large-CSV-bulk-load-stuck-tp3622.html
Sent from the Apache Phoenix User List mailing list archive at Nabble.com.
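For reference, here is the same invocation reflowed onto multiple lines, with two standard Hadoop reducer-memory properties added as a tuning sketch. The `-Dmapreduce.reduce.memory.mb` and `-Dmapreduce.reduce.java.opts` values below are illustrative assumptions I am experimenting with, not settings the job above actually ran with:

```shell
# Same CsvBulkLoadTool invocation as above; the two reducer-memory
# properties are illustrative additions (adjust to your YARN container
# limits), not part of the original command.
HADOOP_CLASSPATH=/usr/lib/hbase/hbase-protocol.jar:/usr/lib/hbase/conf \
hadoop jar /usr/lib/phoenix/phoenix-client.jar \
  org.apache.phoenix.mapreduce.CsvBulkLoadTool \
  -Dfs.permissions.umask-mode=000 \
  -Dmapreduce.reduce.memory.mb=8192 \
  -Dmapreduce.reduce.java.opts=-Xmx6g \
  --table TABLE_SNAPSHOT \
  --input /user/table/*.csv/
```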