[ 
https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736421#comment-13736421
 ] 

David Schorow commented on HIVE-5020:
-------------------------------------

It sounds like b is the simplest and safest solution and would effectively make 
ORC work consistently with RC files.  

In Data Base semantics, NULL is a special value, so for example, NULL does not 
match NULL.  Hence I don't know what it means to have a NULL key, or to try to 
do a lookup with a NULL key.  I think such a lookup should never return a 
value, regardless of what is in the map.  It may be okay to have NULL values in 
a map.
                
> HCat reading null-key map entries causes NPE
> --------------------------------------------
>
>                 Key: HIVE-5020
>                 URL: https://issues.apache.org/jira/browse/HIVE-5020
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>
> Currently, if someone has a null key in a map, HCatInputFormat will terminate 
> with an NPE while trying to read it.
> {noformat}
> java.lang.NullPointerException
> at java.lang.String.compareTo(String.java:1167)
> at java.lang.String.compareTo(String.java:92)
> at java.util.TreeMap.put(TreeMap.java:545)
> at 
> org.apache.hcatalog.data.HCatRecordSerDe.serializeMap(HCatRecordSerDe.java:222)
> at 
> org.apache.hcatalog.data.HCatRecordSerDe.serializeField(HCatRecordSerDe.java:198)
> at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:53)
> at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:97)
> at 
> org.apache.hcatalog.mapreduce.HCatRecordReader.nextKeyValue(HCatRecordReader.java:203)
> {noformat}
> This is because we use a TreeMap to preserve order of elements in the map 
> when reading from the underlying storage/serde.
> This problem is easily fixed in a number of ways:
> a) Switch to HashMap, which allows null keys. That does not preserve order of 
> keys, which should not be important for map fields, but if we desire that, we 
> have a solution for that too - LinkedHashMap, which would both retain order 
> and allow us to insert null keys into the map.
> b) Ignore null keyed entries - check if the field we read is null, and if it 
> is, then ignore that item in the record altogether. This way, HCat is robust 
> in what it does - it does not terminate with an NPE, and it does not allow 
> null keys in maps that might be problematic to layers above us that are not 
> used to seeing nulls as keys in maps.
> Why do I bring up the second fix? First, I bring it up because of the way we 
> discovered this bug. When reading from an RCFile, we do not notice this bug. 
> If the same query that produced the RCFile instead produces an Orcfile, and 
> we try reading from it, we see this problem.
> RCFile seems to be quietly stripping any null key entries, whereas Orc 
> retains them. This is why we didn't notice this problem for a long while, and 
> suddenly, now, we are. Now, if we fix our code to allow nulls in map keys 
> through to layers above, we expose layers above to this change, which may 
> then cause them to break. (Technically, this is stretching the case because 
> we already break now if they care) More importantly, though, we have a case 
> now, where the same data will be exposed differently if it were stored as orc 
> or if it were stored as rcfile. And as a layer that is supposed to make 
> storage invisible to the end user, HCat should attempt to provide some 
> consistency in how data behaves to the end user.
> Secondly, whether or not nulls should be supported as keys in Maps seems to 
> be almost a religious view. Some people see it from a perspective of a 
> "mapping", which lends itself to a "Sure, if we encounter a null, we map to 
> this other value" kind of a view, whereas other people view it from a "lookup 
> index" kind of view, which lends itself to a "null as a key makes no sense - 
> What kind of lookup do you expect to perform?" kind of view. Both views have 
> their points, and it makes sense to see if we need to support it.
> That said...
> There is another important concern at hand here: nulls in map keys might be 
> due to bad data(corruption or loading error), and by stripping them, we might 
> be silently hiding that from the user. So "silent stripping" is bad. This is 
> an important point that does steer me towards the former approach, of passing 
> it on to layers above, and standardize on an understanding that null keys in 
> maps are acceptable data that layers above us have to handle. After that, it 
> could be taken on as a further consistency fix, to fix RCFile so that it 
> allows nulls in map keys.
> Having gone through this discussion of standardization, another important 
> question is whether or not there is actually a use-case for null keys in maps 
> in data. If there isn't, maybe we shouldn't allow writing that in the first 
> place, and both orc and rcfile must simply error out to the end user if they 
> try to write a  null map key? Well, it is true that it is possible that data 
> errors lead to null keys, but it's also possible that the user wants to store 
> a mapping for value transformations, and they might have a transformation for 
> null as well. In the case I encountered it, they were writing out an 
> intermediate table after having read from a sparse table using a custom input 
> format that generated an arbitrary number of columns, and were using the map 
> to store column name mappings that would eventually be written out to another 
> table. That seems a valid use, and we shouldn't prevent users from this sort 
> of usage.
> Another reason for not allowing null keys from a java perspective is locking 
> and concurrency concerns, where locking on a null is a pain, per 
> disagreements between Joshua Bloch and Doug Lea in the design of HashMap and 
> ConcurrentHashMap. However, given that HCatalog reads are happening in a 
> thread on a drone where there should be no parallel access of that record, 
> and more importantly, this should strictly be used in a read-only kind of 
> usage, we should not have to worry about that.
> Increasingly, my preference is to change to LinkedHashMaps to allow null 
> keys, and for consistency's sake, after this is tackled, to see if we should 
> be fixing RCFile to allow null keys(this might be trickier since RCFile has a 
> lot of other users that are probably currently working.)
> Another option is to change to LinkedHashMap, but also add a conf key to hcat 
> to allow the user to specify whether or not we want to strip nulls. That way, 
> a user can specify what behaviour they like. That's more cruft though, and I 
> don't want to go down that path unless there is a user that explicitly 
> wants/needs that.
> Anyone have any other thoughts on the matter?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to