[ https://issues.apache.org/jira/browse/HIVE-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114501#comment-14114501 ]
john commented on HIVE-7853:
----------------------------
Navis: what do you think about the failed test case results?
> Make OrcNewInputFormat return row number as a key
> -------------------------------------------------
>
> Key: HIVE-7853
> URL: https://issues.apache.org/jira/browse/HIVE-7853
> Project: Hive
> Issue Type: Bug
> Components: File Formats
> Affects Versions: 0.13.1
> Environment: all
> Reporter: john
> Assignee: Navis
> Labels: Orc
> Attachments: HIVE-7853.1.patch.txt
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Key is null in map when OrcNewInputFormat is used as Input Format Class
> When using OrcNewInputFormat as the input format class for my MapReduce job, I
> find that the key is always null in my map method, so I have no way to get the
> row number there. By comparison, with RCFileInputFormat (for RC files) the key
> passed to the map method is the row number, so I know which row I am processing.
> Is there any workaround for getting the row number in my map method? Of
> course, I can count rows myself, but that has two problems: #1
> I have to assume the rows arrive in order; #2 I will get duplicated
> (and wrong) row numbers if a big input file is divided into multiple file
> splits (each of which runs my map method separately, possibly on a different
> data node). At this point, I am looking for a better way to get the row
> number of each row processed in the map method.
> Here is what I have in my map logs:
> [2014-08-06 09:39:25 DEBUG com.xxxx.hadoop.orcfile.OrcFileMap]: Mapper Input Key: (null)
> [2014-08-06 09:39:25 DEBUG com.xxxx.hadoop.orcfile.OrcFileMap]: Mapper Input Value: {Q81510000, T99760000, 699760000, 81567560000, 9667981610000, 978989898980000, Laura, [email protected]}
> My map method is:
> protected void map(Object key, Writable value, Context context)
>         throws IOException, InterruptedException {
>     logger.debug("Mapper Input Key: " + key);
>     logger.debug("Mapper Input Value: " + value.toString());
>     .....
> }
> The fix should be: add the following statement to the nextKeyValue() method
> and pass the result all the way up to the map() method as its key:
> reader.getRowNumber();
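
To make the proposal above concrete, here is a minimal sketch of the idea. It is not the attached patch, and it does not use the real Hive or Hadoop classes: RowSource, ListRowSource, RowNumberKeyReader and RowNumberKeyDemo are hypothetical stand-ins, and a plain long is used where a real reader would presumably return a Writable such as LongWritable. The only behavior assumed of the underlying reader is that getRowNumber() reports the number of the row that the following next() call returns.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in for the ORC row reader; only the calls used below are modeled (assumption). */
interface RowSource {
    boolean hasNext();
    long getRowNumber(); // number of the row that the next call to next() returns
    String next();
}

/** Hypothetical in-memory source whose first row sits at an arbitrary offset, like a later split. */
class ListRowSource implements RowSource {
    private final long firstRow;
    private final List<String> rows;
    private int index = 0;

    ListRowSource(long firstRow, List<String> rows) {
        this.firstRow = firstRow;
        this.rows = rows;
    }

    public boolean hasNext() { return index < rows.size(); }
    public long getRowNumber() { return firstRow + index; }
    public String next() { return rows.get(index++); }
}

/** Hypothetical record reader that hands the row number to the mapper as the key. */
class RowNumberKeyReader {
    private final RowSource reader;
    private long key = -1L; // a real Hadoop reader would use a Writable here
    private String value;

    RowNumberKeyReader(RowSource reader) { this.reader = reader; }

    boolean nextKeyValue() {
        if (!reader.hasNext()) {
            return false;
        }
        key = reader.getRowNumber(); // read before next() advances the reader
        value = reader.next();
        return true;
    }

    long getCurrentKey() { return key; }
    String getCurrentValue() { return value; }
}

/** Small driver: simulates a split that begins at row 1000 of the file. */
class RowNumberKeyDemo {
    public static void main(String[] args) {
        RowNumberKeyReader r = new RowNumberKeyReader(
                new ListRowSource(1000L, Arrays.asList("row-a", "row-b")));
        while (r.nextKeyValue()) {
            System.out.println(r.getCurrentKey() + " -> " + r.getCurrentValue());
        }
    }
}
```

Because the key comes from the reader rather than from a counter kept in the mapper, a split that starts in the middle of the file still reports file-level row numbers, which addresses problem #2 in the description.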
--
This message was sent by Atlassian JIRA
(v6.2#6252)