[ https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736421#comment-13736421 ]
David Schorow commented on HIVE-5020: ------------------------------------- It sounds like b is the simplest and safest solution and would effectively make ORC work consistently with RC files. In Data Base semantics, NULL is a special value, so for example, NULL does not match NULL. Hence I don't know what it means to have a NULL key, or to try to do a lookup with a NULL key. I think such a lookup should never return a value, regardless of what is in the map. It may be okay to have NULL values in a map. > HCat reading null-key map entries causes NPE > -------------------------------------------- > > Key: HIVE-5020 > URL: https://issues.apache.org/jira/browse/HIVE-5020 > Project: Hive > Issue Type: Bug > Components: HCatalog > Reporter: Sushanth Sowmyan > Assignee: Sushanth Sowmyan > > Currently, if someone has a null key in a map, HCatInputFormat will terminate > with an NPE while trying to read it. > {noformat} > java.lang.NullPointerException > at java.lang.String.compareTo(String.java:1167) > at java.lang.String.compareTo(String.java:92) > at java.util.TreeMap.put(TreeMap.java:545) > at > org.apache.hcatalog.data.HCatRecordSerDe.serializeMap(HCatRecordSerDe.java:222) > at > org.apache.hcatalog.data.HCatRecordSerDe.serializeField(HCatRecordSerDe.java:198) > at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:53) > at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:97) > at > org.apache.hcatalog.mapreduce.HCatRecordReader.nextKeyValue(HCatRecordReader.java:203) > {noformat} > This is because we use a TreeMap to preserve order of elements in the map > when reading from the underlying storage/serde. > This problem is easily fixed in a number of ways: > a) Switch to HashMap, which allows null keys. That does not preserve order of > keys, which should not be important for map fields, but if we desire that, we > have a solution for that too - LinkedHashMap, which would both retain order > and allow us to insert null keys into the map. > b) Ignore null keyed entries - check if the field we read is null, and if it > is, then ignore that item in the record altogether. This way, HCat is robust > in what it does - it does not terminate with an NPE, and it does not allow > null keys in maps that might be problematic to layers above us that are not > used to seeing nulls as keys in maps. > Why do I bring up the second fix? First, I bring it up because of the way we > discovered this bug. When reading from an RCFile, we do not notice this bug. > If the same query that produced the RCFile instead produces an Orcfile, and > we try reading from it, we see this problem. > RCFile seems to be quietly stripping any null key entries, whereas Orc > retains them. This is why we didn't notice this problem for a long while, and > suddenly, now, we are. Now, if we fix our code to allow nulls in map keys > through to layers above, we expose layers above to this change, which may > then cause them to break. (Technically, this is stretching the case because > we already break now if they care) More importantly, though, we have a case > now, where the same data will be exposed differently if it were stored as orc > or if it were stored as rcfile. And as a layer that is supposed to make > storage invisible to the end user, HCat should attempt to provide some > consistency in how data behaves to the end user. > Secondly, whether or not nulls should be supported as keys in Maps seems to > be almost a religious view. Some people see it from a perspective of a > "mapping", which lends itself to a "Sure, if we encounter a null, we map to > this other value" kind of a view, whereas other people view it from a "lookup > index" kind of view, which lends itself to a "null as a key makes no sense - > What kind of lookup do you expect to perform?" kind of view. Both views have > their points, and it makes sense to see if we need to support it. > That said... > There is another important concern at hand here: nulls in map keys might be > due to bad data(corruption or loading error), and by stripping them, we might > be silently hiding that from the user. So "silent stripping" is bad. This is > an important point that does steer me towards the former approach, of passing > it on to layers above, and standardize on an understanding that null keys in > maps are acceptable data that layers above us have to handle. After that, it > could be taken on as a further consistency fix, to fix RCFile so that it > allows nulls in map keys. > Having gone through this discussion of standardization, another important > question is whether or not there is actually a use-case for null keys in maps > in data. If there isn't, maybe we shouldn't allow writing that in the first > place, and both orc and rcfile must simply error out to the end user if they > try to write a null map key? Well, it is true that it is possible that data > errors lead to null keys, but it's also possible that the user wants to store > a mapping for value transformations, and they might have a transformation for > null as well. In the case I encountered it, they were writing out an > intermediate table after having read from a sparse table using a custom input > format that generated an arbitrary number of columns, and were using the map > to store column name mappings that would eventually be written out to another > table. That seems a valid use, and we shouldn't prevent users from this sort > of usage. > Another reason for not allowing null keys from a java perspective is locking > and concurrency concerns, where locking on a null is a pain, per > disagreements between Joshua Bloch and Doug Lea in the design of HashMap and > ConcurrentHashMap. However, given that HCatalog reads are happening in a > thread on a drone where there should be no parallel access of that record, > and more importantly, this should strictly be used in a read-only kind of > usage, we should not have to worry about that. > Increasingly, my preference is to change to LinkedHashMaps to allow null > keys, and for consistency's sake, after this is tackled, to see if we should > be fixing RCFile to allow null keys(this might be trickier since RCFile has a > lot of other users that are probably currently working.) > Another option is to change to LinkedHashMap, but also add a conf key to hcat > to allow the user to specify whether or not we want to strip nulls. That way, > a user can specify what behaviour they like. That's more cruft though, and I > don't want to go down that path unless there is a user that explicitly > wants/needs that. > Anyone have any other thoughts on the matter? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira