[ https://issues.apache.org/jira/browse/HIVE-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207095#comment-14207095 ]
Sushanth Sowmyan commented on HIVE-8808:
----------------------------------------

From a strict M/R standpoint:

Traditional M/R guarantees that state information is available to InputFormats through JobConf and through serialized InputSplits. Any expectation an InputFormat has beyond that is not guaranteed to work.

Practically, though, M/R does not standardize a setInput-equivalent call, so InputFormats wind up implementing their own methodologies. It is not unheard of for them to maintain state.

In practice, we also absolutely need some standardization in order to access InputFormats from Hive/HCatalog. HCatalog took the position that InputFormats, as currently defined, are not specified well enough to do all the setup they need while remaining effectively stateless, and so it pushed that responsibility up to StorageHandlers, HCat's primary storage abstraction. (Earlier versions of HCatalog used something called StorageDriver; as of HCat 0.3 it was replaced with Hive's StorageHandler.)

While the InputFormat itself is considered stateless, a StorageHandler is considered stateful, and HCat does the following:

a) Instantiate the appropriate StorageHandler using ReflectionUtils.newInstance (this will call setConf if it is available, which it usually is).

b) Call configureInputJobProperties() on that StorageHandler to set it up for input. The handler modifies a map of key-value properties (jobProperties), and HCat ensures that this map is put into the JobConf/Job before any methods are called on the relevant InputFormat.

c) Call .getInputFormatClass. That class eventually gets instantiated at run time with ReflectionUtils.newInstance(inputFormatClass, Job). This allows the InputFormat to set up the Job (which already had the above map of key-value pairs inserted into it) any which way it wants, without itself being stateful.

d) Call .getSplits, again passing in the relevant JobConf as the state carrier on the way in, with the InputSplits themselves being a serializable state carrier on the way out.

e) Call .createRecordReader on that InputFormat. Again, the InputFormat itself can be (and is) stateless, but it gets passed an InputSplit (which has state) and a TaskAttemptContext (which has state, including the above jobProperties map).

> HiveInputFormat caching cannot work with all input formats
> ----------------------------------------------------------
>
>                 Key: HIVE-8808
>                 URL: https://issues.apache.org/jira/browse/HIVE-8808
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Brock Noland
>
> In {{HiveInputFormat}} we implement instance caching (see
> {{getInputFormatFromCache}}). In HS2, this assumes that InputFormats are
> stateless but I don't think this assumption is true, especially with regards
> to HBase.
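
Steps a) through e) of the comment can be illustrated with a minimal, dependency-free sketch. Everything in it is a hypothetical stand-in, not the real Hadoop, Hive, or HCatalog API: `Conf` plays the role of JobConf, `Split` the role of InputSplit, `TableStorageHandler` the role of a StorageHandler (with a simplified `configureInputJobProperties` signature), and `TableInputFormat` the role of an InputFormat. The `sketch.input.*` property keys are invented for the example, and the reflection-based instantiation from steps a) and c) is replaced by plain constructors. The point it tries to show is that an input format with no fields can safely be cached and reused, because all state arrives through the configuration and the splits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins only; none of these are the real Hadoop, Hive, or HCatalog types.
class StatelessInputFormatSketch {

    /** Stand-in for JobConf: the state carrier handed to the input format. */
    static final class Conf {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    /** Stand-in for an InputSplit: serializable state that travels to the task. */
    static final class Split implements java.io.Serializable {
        private static final long serialVersionUID = 1L;
        final String table;
        final int index;
        Split(String table, int index) { this.table = table; this.index = index; }
    }

    /** Stand-in for a StorageHandler: stateful, and responsible for describing the input. */
    static final class TableStorageHandler {
        private final String table;
        private final int regions;
        TableStorageHandler(String table, int regions) {
            this.table = table;
            this.regions = regions;
        }
        // Simplified signature (assumption): only fills the job-properties map.
        void configureInputJobProperties(Map<String, String> jobProperties) {
            jobProperties.put("sketch.input.table", table);
            jobProperties.put("sketch.input.regions", Integer.toString(regions));
        }
    }

    /** Stand-in for an InputFormat: it has no fields, so one instance can serve any table. */
    static final class TableInputFormat {
        // Step d): everything needed to compute splits is read from the conf.
        List<Split> getSplits(Conf conf) {
            String table = conf.get("sketch.input.table");
            int regions = Integer.parseInt(conf.get("sketch.input.regions"));
            List<Split> splits = new ArrayList<>();
            for (int i = 0; i < regions; i++) {
                splits.add(new Split(table, i));
            }
            return splits;
        }
        // Step e): returns a description instead of a real reader, built only
        // from the split and the conf that are passed in.
        String createRecordReader(Split split, Conf conf) {
            return split.table + "[" + split.index + "/" + conf.get("sketch.input.regions") + "]";
        }
    }

    /** Runs steps b), d) and e) against a possibly shared input format instance. */
    static List<String> plan(TableInputFormat format, TableStorageHandler handler) {
        // Step b): the handler fills jobProperties, and the caller copies them into the conf.
        Map<String, String> jobProperties = new HashMap<>();
        handler.configureInputJobProperties(jobProperties);
        Conf conf = new Conf();
        jobProperties.forEach(conf::set);

        List<String> readers = new ArrayList<>();
        for (Split split : format.getSplits(conf)) {
            readers.add(format.createRecordReader(split, conf));
        }
        return readers;
    }

    public static void main(String[] args) {
        // One "cached" instance is reused for two different tables.
        TableInputFormat shared = new TableInputFormat();
        System.out.println(plan(shared, new TableStorageHandler("orders", 2)));
        System.out.println(plan(shared, new TableStorageHandler("users", 1)));
    }
}
```

If `TableInputFormat` instead kept the table name in a field set by some format-specific setup call, reusing the shared instance for the second table would risk leaking state from the first, which is the kind of failure the issue description is concerned about.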