[ 
https://issues.apache.org/jira/browse/HIVE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sushanth Sowmyan updated HIVE-5020:
-----------------------------------

    Description: 
Currently, if someone has a null key in a map, HCatInputFormat will terminate 
with an NPE while trying to read it.

{noformat}
java.lang.NullPointerException
at java.lang.String.compareTo(String.java:1167)
at java.lang.String.compareTo(String.java:92)
at java.util.TreeMap.put(TreeMap.java:545)
at org.apache.hcatalog.data.HCatRecordSerDe.serializeMap(HCatRecordSerDe.java:222)
at org.apache.hcatalog.data.HCatRecordSerDe.serializeField(HCatRecordSerDe.java:198)
at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:53)
at org.apache.hcatalog.data.LazyHCatRecord.get(LazyHCatRecord.java:97)
at org.apache.hcatalog.mapreduce.HCatRecordReader.nextKeyValue(HCatRecordReader.java:203)
{noformat}

This is because we use a TreeMap to preserve the order of map entries when 
reading from the underlying storage/SerDe, and TreeMap does not accept null keys.
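As a minimal standalone sketch of the two behaviours (not HCatalog code - just the relevant java.util semantics):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        // With no Comparator, TreeMap orders keys via compareTo, so a
        // null key hits String.compareTo and throws, as in the trace above.
        Map<String, String> ordered = new TreeMap<>();
        ordered.put("existing", "v");
        try {
            ordered.put(null, "boom");
        } catch (NullPointerException e) {
            System.out.println("TreeMap rejected the null key");
        }

        // LinkedHashMap preserves insertion order and accepts a null key.
        Map<String, String> linked = new LinkedHashMap<>();
        linked.put("a", "1");
        linked.put(null, "2");
        System.out.println("LinkedHashMap holds " + linked.size() + " entries");
    }
}
```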

This problem is easily fixed in a number of ways:

a) Switch to HashMap, which allows null keys. That does not preserve the order 
of keys, which should not matter for map fields; but if we want order as well, 
there is a solution for that too - LinkedHashMap, which both retains insertion 
order and allows us to insert null keys into the map.

b) Ignore null-keyed entries - check whether the key we read is null, and if it 
is, drop that entry from the record altogether. This way HCat is robust in what 
it does: it does not terminate with an NPE, and it does not pass null map keys 
up to layers above us that are not used to seeing them.
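Option (b) could be sketched roughly as follows. This is a hypothetical helper for illustration, not the actual HCatRecordSerDe code, and `stripNullKeys` is an invented name:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class NullKeyStripper {
    // Hypothetical helper illustrating option (b): copy a map into the
    // TreeMap we use today, silently dropping any null-keyed entries so
    // that TreeMap.put never sees a null key.
    public static <V> Map<String, V> stripNullKeys(Map<String, V> in) {
        Map<String, V> out = new TreeMap<>();
        for (Map.Entry<String, V> e : in.entrySet()) {
            if (e.getKey() == null) {
                continue; // ignore the null-keyed entry altogether
            }
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> in = new HashMap<>();
        in.put("a", "1");
        in.put(null, "dropped");
        System.out.println(stripNullKeys(in)); // the null-keyed entry is gone
    }
}
```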

Why do I bring up the second fix? First, because of the way we discovered this 
bug. When reading from an RCFile, we do not notice it. If the same query that 
produced the RCFile instead produces an ORC file, and we try reading from that, 
we hit this problem.

RCFile seems to be quietly stripping any null-keyed entries, whereas ORC 
retains them. This is why we didn't notice the problem for a long while and 
suddenly now we do. If we fix our code to pass null map keys through to layers 
above, we expose those layers to a change in behaviour, which may then cause 
them to break. (Technically this is stretching the case, because we already 
break now if they care.) More importantly, though, we now have a case where the 
same data is exposed differently depending on whether it was stored as ORC or 
as RCFile. As a layer that is supposed to make storage invisible to the end 
user, HCat should attempt to provide consistent behaviour regardless of the 
underlying format.

Secondly, whether or not nulls should be supported as map keys seems to be 
almost a religious question. Some people see a map as a "mapping", which lends 
itself to a "sure, if we encounter a null, we map it to this other value" view, 
whereas others see it as a "lookup index", which lends itself to a "null as a 
key makes no sense - what kind of lookup would you perform?" view. Both views 
have their points, and it makes sense to decide whether we need to support it.

That said...

There is another important concern at hand here: null map keys might be due to 
bad data (corruption or a loading error), and by stripping them we might be 
silently hiding that from the user. So "silent stripping" is bad. This point 
steers me towards the former approach: pass null keys on to layers above, and 
standardize on the understanding that null keys in maps are acceptable data 
that those layers have to handle. After that, as a further consistency fix, we 
could change RCFile so that it, too, allows nulls in map keys.

Having gone through this discussion of standardization, another important 
question is whether there is actually a use case for null map keys in data. If 
there isn't, maybe we shouldn't allow writing them in the first place, and both 
ORC and RCFile should simply raise an error to the end user on an attempt to 
write a null map key. It is true that data errors can lead to null keys, but it 
is also possible that the user wants to store a mapping for value 
transformations and has a transformation for null as well. In the case where I 
encountered this, the user was writing out an intermediate table after reading 
from a sparse table via a custom input format that generated an arbitrary 
number of columns, and was using the map to store column-name mappings that 
would eventually be written out to another table. That seems a valid use, and 
we shouldn't prevent users from this sort of usage.

Another reason for disallowing null keys, from a Java perspective, is locking 
and concurrency: locking on a null is a pain, per the disagreements between 
Joshua Bloch and Doug Lea in the design of HashMap and ConcurrentHashMap. 
However, given that HCatalog reads happen in a single thread on a worker, where 
there should be no parallel access to the record, and, more importantly, that 
the map is used in a strictly read-only fashion, we should not have to worry 
about that here.

Increasingly, my preference is to switch to LinkedHashMap to allow null keys 
and, for consistency's sake, once that is tackled, to see whether we should 
also fix RCFile to allow null keys (this may be trickier, since RCFile has many 
other users who presumably depend on its current behaviour).

Another option is to switch to LinkedHashMap but also add a conf key to HCat 
that lets the user specify whether or not nulls should be stripped, so each 
user can pick the behaviour they want. That is more cruft, though, and I don't 
want to go down that path unless there is a user that explicitly wants/needs 
it.
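That variant might look roughly like the sketch below. The key name `hcat.strip.null.map.keys` and the `copyMap` helper are invented for illustration, and `Properties` stands in for a Hadoop `Configuration`:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class ConfigurableNullKeyHandling {
    // Invented conf key name, purely for illustration.
    public static final String STRIP_NULL_KEYS = "hcat.strip.null.map.keys";

    // Properties stands in for a Hadoop Configuration in this sketch.
    public static <V> Map<String, V> copyMap(Map<String, V> in, Properties conf) {
        boolean strip = Boolean.parseBoolean(
                conf.getProperty(STRIP_NULL_KEYS, "false"));
        // LinkedHashMap retains insertion order and tolerates a null key.
        Map<String, V> out = new LinkedHashMap<>();
        for (Map.Entry<String, V> e : in.entrySet()) {
            if (strip && e.getKey() == null) {
                continue; // the user opted in to stripping null keys
            }
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }
}
```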

Anyone have any other thoughts on the matter?

> HCat reading null-key map entries causes NPE
> --------------------------------------------
>
>                 Key: HIVE-5020
>                 URL: https://issues.apache.org/jira/browse/HIVE-5020
>             Project: Hive
>          Issue Type: Bug
>          Components: HCatalog
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
