[ https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732719#comment-13732719 ]
Edward Capriolo commented on HIVE-5020: --------------------------------------- If I had to hazard a guess I would say that the original implementation was about supporting thrift structures. Possibly if thrift does not support this case that design was not carried over. Personally I think we SHOULD support NULL key and NULL value in maps. The map need not be sorted. > HCat reading null-key map entries causes NPE > -------------------------------------------- > > Key: HIVE-5020 > URL: https://issues.apache.org/jira/browse/HIVE-5020 > Project: Hive > Issue Type: Bug > Components: HCatalog > Reporter: Sushanth Sowmyan > Assignee: Sushanth Sowmyan > > Currently, if someone has a null key in a map, HCatInputFormat will terminate > with an NPE while trying to read it. > {noformat} > java.lang.NullPointerException > at java.lang.String.compareTo(String.java:1167) > at java.lang.String.compareTo(String.java:92) > at java.util.TreeMap.put(TreeMap.java:545) > at > org.apache.hcatalog.data.HCatRecordSerDe.serializeMap(HCatRecordSerDe.java:222) > at > org.apache.hcatalog.data.HCatRecordSerDe.serializeField(HCatRecordSerDe.java:198) > at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:53) > at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:97) > at > org.apache.hcatalog.mapreduce.HCatRecordReader.nextKeyValue(HCatRecordReader.java:203) > {noformat} > This is because we use a TreeMap to preserve order of elements in the map > when reading from the underlying storage/serde. > This problem is easily fixed in a number of ways: > a) Switch to HashMap, which allows null keys. That does not preserve order of > keys, which should not be important for map fields, but if we desire that, we > have a solution for that too - LinkedHashMap, which would both retain order > and allow us to insert null keys into the map. > b) Ignore null keyed entries - check if the field we read is null, and if it > is, then ignore that item in the record altogether. This way, HCat is robust > in what it does - it does not terminate with an NPE, and it does not allow > null keys in maps that might be problematic to layers above us that are not > used to seeing nulls as keys in maps. > Why do I bring up the second fix? I bring it up because of the way we > discovered this bug. When reading from an RCFile, we do not notice this bug. > If the same query that produced the RCFile instead produces an Orcfile, and > we try reading from it, we see this problem. > RCFile seems to be quietly stripping any null key entries, whereas Orc > retains them. This is why we didn't notice this problem for a long while, and > suddenly, now, we are. Now, if we fix our code to allow nulls in map keys > through to layers above, we expose layers above to this change, which may > then cause them to break. (Technically, this is stretching the case because > we already break now if they care) More importantly, though, we have a case > now, where the same data will be exposed differently if it were stored as orc > or if it were stored as rcfile. And as a layer that is supposed to make > storage invisible to the end user, HCat should attempt to provide some > consistency in how data behaves to the end user. > That said... > There is another important concern at hand here: nulls in map keys might be > due to bad data(corruption or loading error), and by stripping them, we might > be silently hiding that from the user. This is an important point that does > steer me towards the former approach, of passing it on to layers above, and > standardize on an understanding that null keys in maps are acceptable data > that layers above us have to handle. After that, it could be taken on as a > further consistency fix, to fix RCFile so that it allows nulls in map keys. > Having gone through this discussion of standardization, another important > question is whether or not there is actually a use-case for null keys in maps > in data. If there isn't, maybe we shouldn't allow writing that in the first > place, and both orc and rcfile must simply error out to the end user if they > try to write a null map key? Well, it is true that it is possible that data > errors lead to null keys, but it's also possible that the user wants to store > a mapping for value transformations, and they might have a transformation for > null as well. In the case I encountered it, they were writing out an > intermediate table after having read from a sparse table using a custom input > format that generated an arbitrary number of columns, and were using the map > to store column name mappings that would eventually be written out to another > table. That seems a valid use, and we shouldn't prevent users from this sort > of usage. > Another reason for not allowing null keys from a java perspective is locking > and concurrency concerns, where locking on a null is a pain, per > philosophical disagreements between Joshua Bloch and Doug Lea in the design > of HashMap and ConcurrentHashMap. However, given that HCatalog reads are > happening in a thread on a drone where there should be no parallel access of > that record, and more importantly, this should strictly be used in a > read-only kind of usage, we should not have to worry about that. > Increasingly, my preference is to change to LinkedHashMaps to allow null > keys, and for consistency's sake, after this is tackled, to see if we should > be fixing RCFile to allow null keys(this might be trickier since RCFile has a > lot of other users that are probably currently working.) > Another option is to change to LinkedHashMap, but also add a conf key to hcat > to allow the user to specify whether or not we want to strip nulls. That way, > a user can specify what behaviour they like. That's more cruft though, and I > don't want to go down that path unless there is a user that explicitly > wants/needs that. > Anyone have any other thoughts on the matter? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira