[ 
https://issues.apache.org/jira/browse/PHOENIX-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129306#comment-15129306
 ] 

Enis Soztutar commented on PHOENIX-1973:
----------------------------------------

This is pretty important. An internal test measuring the IO of the bulk load 
process concluded: 
{code}
Total space used by the MR bulk loader is 1.8GB to load 67MB of data, roughly 
27x the amount of original data is written to do the bulkload.
Final file size 171MB versus 67MB raw data
All column names except for primary key repeated, one per record
Much of the data from the reduce output is stripped out, fields seem to have 
6-10 bytes or so of overhead apart from the column name. Average record size 
drops from 503 bytes to 110 bytes.
Does suggest reduce output is re-written, adding yet another step to the 
process.
Final data size is 2.5x size of original data.
{code}
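
As a quick sanity check on the figures above, the ratios can be recomputed in a few lines. This is only a sketch: the variable names are made up for illustration, and it assumes decimal units (1GB = 1000MB), which the report does not state.

```python
# Recompute the ratios quoted in the test summary.
# Assumption: decimal units (1 GB = 1000 MB); the report does not say which.
MB_PER_GB = 1000

raw_mb = 67                        # raw input data
total_written_mb = 1.8 * MB_PER_GB # total space used by the MR bulk loader
final_mb = 171                     # final file size

write_amplification = total_written_mb / raw_mb  # roughly 27x
final_ratio = final_mb / raw_mb                  # roughly 2.5x
record_shrink = 110 / 503                        # average record size, after / before

print(f"write amplification: {write_amplification:.1f}x")
print(f"final size vs raw:   {final_ratio:.2f}x")
print(f"record size ratio:   {record_shrink:.2f}")
```

With binary units (1GB = 1024MB) the first ratio comes out slightly higher, so "roughly 27x" holds either way.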

There are definitely big gains to be had from not writing 27 times more data in the 
shuffle step. [~sergey.soldatov], do you have comparable numbers with the patch? 


> Improve CsvBulkLoadTool performance by moving keyvalue construction from map 
> phase to reduce phase
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1973
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1973
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Rajeshbabu Chintaguntla
>            Assignee: Rajeshbabu Chintaguntla
>             Fix For: 4.4.1
>
>         Attachments: PHOENIX-1973-1.patch
>
>
> It's similar to HBASE-8768. The only difference is that we need to write a custom mapper and 
> reducer in Phoenix. In the map phase we just need to get the row key from the primary key 
> columns and write the full text of the line as usual (to ensure sorting). In the 
> reducer we need to get the actual key values by running the upsert query.
> This greatly reduces the map output that is written to disk and the data that needs to be 
> transferred over the network.
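
The approach described in the issue can be sketched roughly as follows. This is plain Python standing in for the MapReduce job, not the code in the attached patch: the table layout, the function names, and the tuple-based "cells" are all hypothetical, and the real reducer would run the Phoenix upsert to produce actual KeyValues.

```python
# Sketch only: plain Python standing in for the MapReduce job.
# None of these names come from Phoenix or from PHOENIX-1973-1.patch.
import csv
from collections import defaultdict
from io import StringIO

COLUMNS = ["ID", "NAME", "CITY"]  # hypothetical table columns
PK_COLUMNS = ["ID"]               # hypothetical primary key


def parse(line):
    """Parse one CSV line into a column-name -> value dict."""
    fields = next(csv.reader(StringIO(line)))
    return dict(zip(COLUMNS, fields))


def map_line(line):
    """Map phase: emit (row key, raw line) rather than one cell per column."""
    row = parse(line)
    row_key = "\x00".join(row[c] for c in PK_COLUMNS)
    return row_key, line


def reduce_lines(row_key, lines):
    """Reduce phase: only here is each line expanded into per-column cells.

    The tuples are a stand-in for the KeyValues the upsert would produce.
    """
    cells = []
    for line in lines:
        row = parse(line)
        for col in COLUMNS:
            if col not in PK_COLUMNS:
                cells.append((row_key, col, row[col]))
    return cells


def run_job(lines):
    """Simulate map, shuffle (group and sort by row key), and reduce."""
    shuffle = defaultdict(list)
    for line in lines:
        key, value = map_line(line)
        shuffle[key].append(value)
    output = []
    for key in sorted(shuffle):
        output.extend(reduce_lines(key, shuffle[key]))
    return output


if __name__ == "__main__":
    for cell in run_job(["2,bob,Paris", "1,alice,Oslo"]):
        print(cell)
```

The point of the sketch is that the shuffle carries one compact record per input line, and the per-column expansion (with the repeated column names) happens only on the reduce side.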



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
