[
https://issues.apache.org/jira/browse/PHOENIX-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139689#comment-15139689
]
Enis Soztutar commented on PHOENIX-1973:
----------------------------------------
I think silently ignoring the case where findIndex() cannot find the column is not right:
{code}
+ for (KeyValue cell : lkv) {
+ int i = findIndex(tableIndex, cell);
+ if(i == -1)
+ continue;
{code}
findIndex() returns the index of the KV from a previously obtained list of
PColumns. In our offline discussions with Sergey, we did not think about
dynamic columns (my bad). If we allow CSV bulk load to also be able to put to
dynamic columns (which are not defined in the table definition), then this
approach won't work. It is probably worth a check. If we are not allowing that,
we should throw an exception instead of silently ignoring when this happens.
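For illustration, a minimal fail-fast sketch of this suggestion, using plain JDK types as hypothetical stand-ins for the Phoenix findIndex()/PColumn machinery:

```java
import java.util.Arrays;
import java.util.List;

public class FindIndexCheck {
    // Hypothetical stand-in for findIndex(): returns the position of a
    // column name in the table's column list, or -1 if it is absent.
    static int findIndex(List<String> columns, String name) {
        return columns.indexOf(name);
    }

    public static void main(String[] args) {
        List<String> tableColumns = Arrays.asList("PK", "V1", "V2");
        String cellColumn = "V3"; // not in the table definition
        String result;
        try {
            int i = findIndex(tableColumns, cellColumn);
            if (i == -1) {
                // Fail fast instead of silently skipping the cell.
                throw new IllegalStateException(
                        "No column found for cell " + cellColumn);
            }
            result = "index=" + i;
        } catch (IllegalStateException e) {
            result = "error: " + e.getMessage();
        }
        System.out.println(result);
    }
}
```

The point is only the control flow: an unknown column surfaces as an error rather than a dropped cell.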
I think we do not need this marker; Hadoop already knows the length of the
serialized record.
{code}
+ WritableUtils.writeVInt(outputStream, -1); // Marker that there are no more values
{code}
Instead, on the reading side, we can check input.available():
{code}
+ int index = WritableUtils.readVInt(input);
{code}
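A minimal sketch of the read-side loop shape, using plain DataInputStream/DataOutputStream in place of Hadoop's WritableUtils varints (the record format here is invented for the demo):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class AvailableLoop {
    public static void main(String[] args) throws IOException {
        // Write a few records with no end-of-stream marker.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int v : new int[] {3, 7, 11}) {
            out.writeInt(v);
        }
        out.close();

        // Reading side: loop while input.available() > 0 instead of
        // reading until a sentinel -1 marker appears in the stream.
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        int sum = 0;
        while (in.available() > 0) {
            sum += in.readInt();
        }
        System.out.println(sum);
    }
}
```

This works because the record's byte length is known, so the reader can detect the end of the buffer without an in-band marker.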
This should declare the full generic type for the returned List.
{code}
+ private List initColumnIndexes() throws SQLException {
{code}
Maybe we can use TreeMap<byte[], ...>(Bytes.BYTES_COMPARATOR) instead of
HashMap<ImmutableBytesWritable, ...>. Not that important though; it is just
shorter.
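A sketch of that alternative, with java.util.Arrays::compare standing in for HBase's Bytes.BYTES_COMPARATOR (keys and values are made up for the demo):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class ByteKeyMap {
    public static void main(String[] args) {
        // A TreeMap keyed on raw byte[] with a lexicographic comparator
        // (stand-in for Bytes.BYTES_COMPARATOR). This avoids wrapping
        // every key in ImmutableBytesWritable just to get content-based
        // equality, which a plain HashMap<byte[], ...> cannot provide.
        Map<byte[], Integer> columnIndexes = new TreeMap<>(Arrays::compare);
        columnIndexes.put("cf:col1".getBytes(StandardCharsets.UTF_8), 0);
        columnIndexes.put("cf:col2".getBytes(StandardCharsets.UTF_8), 1);

        // Lookup succeeds with a *different* array holding the same bytes,
        // because the comparator compares contents, not references.
        byte[] probe = "cf:col2".getBytes(StandardCharsets.UTF_8);
        System.out.println(columnIndexes.get(probe));
    }
}
```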
Also, can you re-use the initColumnIndexes() method between the Mapper and Reducer?
> Improve CsvBulkLoadTool performance by moving keyvalue construction from map phase to reduce phase
> --------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-1973
> URL: https://issues.apache.org/jira/browse/PHOENIX-1973
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Rajeshbabu Chintaguntla
> Assignee: Sergey Soldatov
> Fix For: 4.7.0
>
> Attachments: PHOENIX-1973-1.patch, PHOENIX-1973-2.patch
>
>
> It's similar to HBASE-8768. The only thing is that we need to write a custom
> mapper and reducer in Phoenix. In the map phase we just need to get the row key
> from the primary key columns and write the full text of the line as usual (to
> ensure sorting). In the reducer we need to get the actual key values by running
> the upsert query.
> This basically reduces the amount of map output written to disk and the data
> that needs to be transferred over the network.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)