[ 
https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732719#comment-13732719
 ] 

Edward Capriolo commented on HIVE-5020:
---------------------------------------

If I had to hazard a guess I would say that the original implementation was 
about supporting thrift structures. Possibly if thrift does not support this 
case that design was not carried over.

Personally I think we SHOULD support NULL key and NULL value in maps. The map 
need not be sorted.
                
> HCat reading null-key map entries causes NPE
> --------------------------------------------
>
>                 Key: HIVE-5020
>                 URL: https://issues.apache.org/jira/browse/HIVE-5020
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>
> Currently, if someone has a null key in a map, HCatInputFormat will terminate 
> with an NPE while trying to read it.
> {noformat}
> java.lang.NullPointerException
> at java.lang.String.compareTo(String.java:1167)
> at java.lang.String.compareTo(String.java:92)
> at java.util.TreeMap.put(TreeMap.java:545)
> at 
> org.apache.hcatalog.data.HCatRecordSerDe.serializeMap(HCatRecordSerDe.java:222)
> at 
> org.apache.hcatalog.data.HCatRecordSerDe.serializeField(HCatRecordSerDe.java:198)
> at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:53)
> at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:97)
> at 
> org.apache.hcatalog.mapreduce.HCatRecordReader.nextKeyValue(HCatRecordReader.java:203)
> {noformat}
> This is because we use a TreeMap to preserve order of elements in the map 
> when reading from the underlying storage/serde.
> This problem is easily fixed in a number of ways:
> a) Switch to HashMap, which allows null keys. That does not preserve order of 
> keys, which should not be important for map fields, but if we desire that, we 
> have a solution for that too - LinkedHashMap, which would both retain order 
> and allow us to insert null keys into the map.
> b) Ignore null keyed entries - check if the field we read is null, and if it 
> is, then ignore that item in the record altogether. This way, HCat is robust 
> in what it does - it does not terminate with an NPE, and it does not allow 
> null keys in maps that might be problematic to layers above us that are not 
> used to seeing nulls as keys in maps.
> Why do I bring up the second fix? I bring it up because of the way we 
> discovered this bug. When reading from an RCFile, we do not notice this bug. 
> If the same query that produced the RCFile instead produces an Orcfile, and 
> we try reading from it, we see this problem.
> RCFile seems to be quietly stripping any null key entries, whereas Orc 
> retains them. This is why we didn't notice this problem for a long while, and 
> suddenly, now, we are. Now, if we fix our code to allow nulls in map keys 
> through to layers above, we expose layers above to this change, which may 
> then cause them to break. (Technically, this is stretching the case because 
> we already break now if they care) More importantly, though, we have a case 
> now, where the same data will be exposed differently if it were stored as orc 
> or if it were stored as rcfile. And as a layer that is supposed to make 
> storage invisible to the end user, HCat should attempt to provide some 
> consistency in how data behaves to the end user.
> That said...
> There is another important concern at hand here: nulls in map keys might be 
> due to bad data(corruption or loading error), and by stripping them, we might 
> be silently hiding that from the user. This is an important point that does 
> steer me towards the former approach, of passing it on to layers above, and 
> standardize on an understanding that null keys in maps are acceptable data 
> that layers above us have to handle. After that, it could be taken on as a 
> further consistency fix, to fix RCFile so that it allows nulls in map keys.
> Having gone through this discussion of standardization, another important 
> question is whether or not there is actually a use-case for null keys in maps 
> in data. If there isn't, maybe we shouldn't allow writing that in the first 
> place, and both orc and rcfile must simply error out to the end user if they 
> try to write a  null map key? Well, it is true that it is possible that data 
> errors lead to null keys, but it's also possible that the user wants to store 
> a mapping for value transformations, and they might have a transformation for 
> null as well. In the case I encountered it, they were writing out an 
> intermediate table after having read from a sparse table using a custom input 
> format that generated an arbitrary number of columns, and were using the map 
> to store column name mappings that would eventually be written out to another 
> table. That seems a valid use, and we shouldn't prevent users from this sort 
> of usage.
> Another reason for not allowing null keys from a java perspective is locking 
> and concurrency concerns, where locking on a null is a pain, per 
> philosophical disagreements between Joshua Bloch and Doug Lea in the design 
> of HashMap and ConcurrentHashMap. However, given that HCatalog reads are 
> happening in a thread on a drone where there should be no parallel access of 
> that record, and more importantly, this should strictly be used in a 
> read-only kind of usage, we should not have to worry about that.
> Increasingly, my preference is to change to LinkedHashMaps to allow null 
> keys, and for consistency's sake, after this is tackled, to see if we should 
> be fixing RCFile to allow null keys(this might be trickier since RCFile has a 
> lot of other users that are probably currently working.)
> Another option is to change to LinkedHashMap, but also add a conf key to hcat 
> to allow the user to specify whether or not we want to strip nulls. That way, 
> a user can specify what behaviour they like. That's more cruft though, and I 
> don't want to go down that path unless there is a user that explicitly 
> wants/needs that.
> Anyone have any other thoughts on the matter?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to