[ https://issues.apache.org/jira/browse/HBASE-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725186#comment-13725186 ]
Jean-Marc Spaggiari commented on HBASE-8768:
--------------------------------------------

Awesome! Thanks a lot [~rajesh23]

> Improve bulk load performance by moving key value construction from map phase to reduce phase.
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8768
>                 URL: https://issues.apache.org/jira/browse/HBASE-8768
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce, Performance
>            Reporter: rajeshbabu
>            Assignee: rajeshbabu
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: HBASE-8768_v2.patch, HBASE-8768_v3.patch, HBASE-8768_v4.patch, HBase_Bulkload_Performance_Improvement.pdf
>
>
> The ImportTSV bulk load approach uses the MapReduce framework. The existing mapper and
> reducer classes used by ImportTSV are TsvImporterMapper.java and
> PutSortReducer.java. The ImportTSV tool parses the tab-separated (by default)
> values from the input files, and the mapper class generates a Put object for
> each row using the KeyValue pairs created from the parsed text.
> PutSortReducer then uses partitions based on the regions and sorts the Put
> objects for each region.
> Overheads we can see in the above approach:
> ==========================================
> 1) A KeyValue is constructed for each parsed value in the line, adding extra data
> such as row key, column family, and qualifier. This results in roughly 5x more data
> to be shuffled in the reduce phase.
> We can calculate the data size to be shuffled as below:
> {code}
> Data to be shuffled = nl*nt*(rl+cfl+cql+vall+tsl+30)
> {code}
> If we move KeyValue construction to the reduce phase, the data size to be shuffled
> becomes the following, which is much smaller than the above:
> {code}
> Data to be shuffled = nl*nt*vall
> {code}
> nl - number of lines in the raw file.
> nt - number of tabs or columns, including the row key.
> rl - row length, which will be different for each line.
> cfl - column family length, which will be different for each family.
> cql - qualifier length.
> tsl - timestamp length.
> vall - length of each parsed value.
> 30 - bytes for KeyValue size, number of families, etc.
> 2) On the mapper side we create Put objects by adding all the KeyValues
> constructed for each line, and in the reducer we again collect the KeyValues from
> each Put and sort them.
> Instead, we can create and sort the KeyValues directly in the reducer.
> Solution:
> ========
> We can improve bulk load performance by moving the KeyValue construction
> from the mapper to the reducer, so that the mapper just sends the raw text for each row to
> the reducer. The reducer then parses the records for the rows and creates and sorts the
> KeyValue pairs before writing to HFiles.
> Conclusion:
> ===========
> The above suggestions will improve map phase performance by avoiding KeyValue
> construction, and reduce phase performance by avoiding excess data to be
> shuffled.
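
As a rough worked example of the two shuffle-size formulas in the description, the sketch below plugs in some made-up average field lengths. The function names and every number here are illustrative assumptions, not values taken from the patch or the attached PDF. Since rl and cfl vary per line and per family, the formulas are treated as using average lengths.

```python
# Sketch of the two shuffle-size estimates from the issue description.
# All names and input values below are hypothetical, chosen for illustration.

def shuffle_bytes_map_side_kv(nl, nt, rl, cfl, cql, vall, tsl, overhead=30):
    """Data to be shuffled = nl*nt*(rl+cfl+cql+vall+tsl+30)."""
    return nl * nt * (rl + cfl + cql + vall + tsl + overhead)

def shuffle_bytes_raw_text(nl, nt, vall):
    """Data to be shuffled = nl*nt*vall."""
    return nl * nt * vall

# Assumed averages: 1M lines, 10 columns, 16-byte row key, 2-byte family,
# 8-byte qualifier, 16-byte value, 8-byte timestamp.
kv_bytes = shuffle_bytes_map_side_kv(
    nl=1_000_000, nt=10, rl=16, cfl=2, cql=8, vall=16, tsl=8
)
raw_bytes = shuffle_bytes_raw_text(nl=1_000_000, nt=10, vall=16)
ratio = kv_bytes / raw_bytes

print(kv_bytes, raw_bytes, ratio)
```

With these assumed lengths the map-side estimate comes to 800,000,000 bytes against 160,000,000 bytes for raw values, a ratio of 5, which is in line with the "around 5x" figure in the description. Shorter values or longer row keys would push the ratio higher.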