[ 
https://issues.apache.org/jira/browse/HIVE-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207095#comment-14207095
 ] 

Sushanth Sowmyan commented on HIVE-8808:
----------------------------------------

From a strict M/R standpoint:

Traditional M/R guarantees that state is available to InputFormats only 
through the JobConf and through serialized InputSplits. Anything an 
InputFormat expects beyond that is not guaranteed to work. In practice, 
though, M/R does not standardize a setInput-equivalent call, so InputFormats 
wind up implementing their own mechanisms, and it is not unheard of for them 
to maintain state.
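
To make the hazard concrete, here is a minimal sketch of what goes wrong when a cached instance keeps state of its own. Everything in it is hypothetical: Format, StatelessFormat, StatefulFormat and the "input.table" key are simplified stand-ins for illustration, not the Hadoop or Hive classes, and a plain Map stands in for the JobConf.

```java
import java.util.HashMap;
import java.util.Map;

public class CachedFormatSketch {
    /** Hypothetical stand-in for an InputFormat; NOT the Hadoop interface. */
    interface Format {
        String describeInput(Map<String, String> conf);
    }

    /** Stateless: reads everything it needs from the conf on every call. */
    static class StatelessFormat implements Format {
        public String describeInput(Map<String, String> conf) {
            return conf.get("input.table");
        }
    }

    /** Stateful: remembers the first table it saw, like a format with its own setInput. */
    static class StatefulFormat implements Format {
        private String table; // instance state that survives when the instance is cached

        public String describeInput(Map<String, String> conf) {
            if (table == null) {
                table = conf.get("input.table");
            }
            return table;
        }
    }

    /** Reuses one instance for two different "queries", as an instance cache would. */
    static String[] runTwoQueries(Format cached) {
        Map<String, String> first = new HashMap<>();
        first.put("input.table", "t1");
        Map<String, String> second = new HashMap<>();
        second.put("input.table", "t2");
        return new String[] {cached.describeInput(first), cached.describeInput(second)};
    }

    public static void main(String[] args) {
        String[] ok = runTwoQueries(new StatelessFormat());
        String[] stale = runTwoQueries(new StatefulFormat());
        System.out.println(ok[0] + "," + ok[1]);       // prints t1,t2
        System.out.println(stale[0] + "," + stale[1]); // prints t1,t1 (second query sees stale state)
    }
}
```

The stateless variant is safe to reuse because each call is fully determined by the conf passed in; the stateful one silently answers the second query with the first query's table.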

In practice, though, we absolutely need a standard way to do this so that we 
can access InputFormats from Hive/HCatalog. HCatalog took the position that 
InputFormats, as currently defined, are not specified well enough to do all 
the setup needed while remaining effectively stateless. It therefore moved 
that responsibility up to StorageHandlers, HCat's primary storage abstraction 
(earlier versions of HCatalog used something called StorageDriver, which was 
replaced by Hive's StorageHandler as of HCat 0.3). While the InputFormat 
itself is considered stateless, a StorageHandler is considered stateful, and 
HCat does the following:

a) Instantiate the appropriate StorageHandler using ReflectionUtils.newInstance 
(which calls setConf if it is available, and it usually is).
b) Call configureInputJobProperties() on that StorageHandler to set it up for 
input. The handler fills in a map of key-value properties (jobProperties), and 
HCat ensures that this map is put into the JobConf/Job before any method is 
called on the relevant InputFormat.
c) Call .getInputFormatClass. That class eventually gets instantiated at run 
time with ReflectionUtils.newInstance(inputFormatClass, Job). This allows the 
InputFormat to set up the Job (which already has the above map of kvps 
inserted into it) any which way it wants, without itself being stateful.
d) Call .getSplits, again passing in the relevant JobConf as the state carrier 
on the way in, with the InputSplits themselves being a serializable state 
carrier on the way out.
e) Call .createRecordReader on that InputFormat. Again, the InputFormat itself 
can be (and is) stateless, but it gets passed an InputSplit (has state) and a 
TaskAttemptContext (has state, with the above jobProperties map).
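
The steps above can be sketched roughly as follows. This is only an illustration of the pattern, under heavy assumptions: Handler, Format, Split, readRecords and the "sketch.*" property keys are hypothetical simplifications, a plain Map stands in for the JobConf/TaskAttemptContext, and none of this is the actual HCatalog code path.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatelessLifecycleSketch {
    /** Hypothetical serializable state carrier from split planning to reading. */
    static final class Split implements Serializable {
        final String table;
        final int start;
        final int end;

        Split(String table, int start, int end) {
            this.table = table;
            this.start = start;
            this.end = end;
        }
    }

    /** Stateful side (stand-in for a StorageHandler): knows what is being read. */
    static final class Handler {
        private final String table;
        private final int rows;

        Handler(String table, int rows) {
            this.table = table;
            this.rows = rows;
        }

        /** Writes everything the format will need into jobProperties. */
        void configureInputJobProperties(Map<String, String> jobProperties) {
            jobProperties.put("sketch.table", table);
            jobProperties.put("sketch.rows", Integer.toString(rows));
        }
    }

    /** Stateless side (stand-in for an InputFormat): no fields at all. */
    static final class Format {
        List<Split> getSplits(Map<String, String> conf, int splitSize) {
            String table = conf.get("sketch.table");
            int rows = Integer.parseInt(conf.get("sketch.rows"));
            List<Split> splits = new ArrayList<>();
            for (int s = 0; s < rows; s += splitSize) {
                splits.add(new Split(table, s, Math.min(s + splitSize, rows)));
            }
            return splits;
        }

        /** State arrives through the split; conf plays the TaskAttemptContext role (unused here). */
        List<String> readRecords(Split split, Map<String, String> conf) {
            List<String> out = new ArrayList<>();
            for (int i = split.start; i < split.end; i++) {
                out.add(split.table + "#" + i);
            }
            return out;
        }
    }

    static List<String> run(String table, int rows, Format format) {
        Map<String, String> conf = new HashMap<>();
        new Handler(table, rows).configureInputJobProperties(conf); // steps (a) and (b)
        List<String> records = new ArrayList<>();
        for (Split split : format.getSplits(conf, 2)) {             // step (d)
            records.addAll(format.readRecords(split, conf));        // step (e)
        }
        return records;
    }

    public static void main(String[] args) {
        Format shared = new Format(); // one instance reused across jobs, as a cache would do
        System.out.println(run("t1", 3, shared)); // prints [t1#0, t1#1, t1#2]
        System.out.println(run("t2", 1, shared)); // prints [t2#0]
    }
}
```

Because the sketch's Format has no fields, reusing one instance across two different inputs is harmless; all per-job state travels in the properties map and in the splits.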

> HiveInputFormat caching cannot work with all input formats
> ----------------------------------------------------------
>
>                 Key: HIVE-8808
>                 URL: https://issues.apache.org/jira/browse/HIVE-8808
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Brock Noland
>
> In {{HiveInputFormat}} we implement instance caching (see 
> {{getInputFormatFromCache}}). In HS2, this assumes that InputFormats are 
> stateless but I don't think this assumption is true, especially with regards 
> to HBase.


