[ https://issues.apache.org/jira/browse/HIVE-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207095#comment-14207095 ]
Sushanth Sowmyan commented on HIVE-8808:
----------------------------------------
From a strict M/R standpoint:
Traditional M/R guarantees that state information is available to InputFormats
through the JobConf and through serialized InputSplits. Anything an InputFormat
expects beyond that is not guaranteed to work. Practically, though, M/R does not
standardize a setInput-equivalent call, so InputFormats wind up implementing
their own mechanisms, and it is not unheard of for them to maintain state.
In practice, though, we absolutely need a standard way of doing this so that
InputFormats can be used from Hive/HCatalog. HCatalog took the position that
InputFormats, as currently defined, are not well-specified enough to do all the
setup they need while remaining effectively stateless, and so it moved that
responsibility upwards to StorageHandlers, HCat's primary storage abstraction
(earlier versions of HCatalog used something called StorageDriver, which was
replaced by Hive's StorageHandler as of HCat 0.3). While the InputFormat itself
is considered stateless, a StorageHandler is considered stateful, and HCat does
the following:
a) Instantiate the appropriate StorageHandler using ReflectionUtils.newInstance
(which calls setConf if the class supports it, and it usually does).
b) Call configureInputJobProperties() on that StorageHandler to set it up for
input. The StorageHandler fills in a map of key-value properties
(jobProperties), and HCat ensures that this map is put into the JobConf/Job
before any methods are called on the relevant InputFormat.
c) Call .getInputFormatClass. That class eventually gets instantiated at run
time with ReflectionUtils.newInstance(inputFormatClass, Job). This allows the
InputFormat to set itself up from the Job (which already has the above map of
kvps inserted into it) any which way it wants, without itself being stateful.
d) Call .getSplits on that InputFormat, again passing in the relevant JobConf
as the state carrier on the way in, with the resulting InputSplits themselves
being a serializable state carrier on the way out.
e) Call .createRecordReader on that InputFormat. Again, the InputFormat itself
can be (and is) stateless, but it gets passed an InputSplit (has state) and a
TaskAttemptContext (has state, including the above jobProperties map).
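To make the flow in (a) through (e) concrete, here is a minimal, self-contained
sketch. It deliberately does not use the real Hadoop or HCatalog classes: Conf,
Split, InputFormat, StorageHandler and the Demo* implementations are
hypothetical, simplified stand-ins, and the "demo.table" / "demo.rows" property
keys are made up for illustration. The only point it tries to show is that when
all state travels through the conf and the splits, one InputFormat instance can
safely be shared (cached) across different tables.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; these are NOT the org.apache.hadoop / HCatalog types.
class StatelessInputFlowSketch {

    /** Stand-in for JobConf: a plain string-to-string property bag. */
    static final class Conf {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    /** Stand-in for a serializable InputSplit: carries state to the reader. */
    static final class Split implements java.io.Serializable {
        final String table;
        final int start; // inclusive
        final int end;   // exclusive
        Split(String table, int start, int end) {
            this.table = table; this.start = start; this.end = end;
        }
    }

    /** Stateless by contract: everything arrives through Conf or Split. */
    interface InputFormat {
        List<Split> getSplits(Conf conf);
        List<String> readRecords(Split split, Conf conf);
    }

    /** The stateful side: decides which properties the job needs. */
    interface StorageHandler {
        void configureInputJobProperties(String table, int rows, Map<String, String> jobProperties);
        Class<? extends InputFormat> getInputFormatClass();
    }

    static final class DemoInputFormat implements InputFormat {
        // No instance fields: nothing survives between calls.
        public List<Split> getSplits(Conf conf) {
            String table = conf.get("demo.table");
            int rows = Integer.parseInt(conf.get("demo.rows"));
            List<Split> splits = new ArrayList<>();
            for (int start = 0; start < rows; start += 2) {
                splits.add(new Split(table, start, Math.min(start + 2, rows)));
            }
            return splits;
        }
        public List<String> readRecords(Split split, Conf conf) {
            List<String> out = new ArrayList<>();
            for (int i = split.start; i < split.end; i++) {
                out.add(split.table + ":" + i);
            }
            return out;
        }
    }

    static final class DemoStorageHandler implements StorageHandler {
        public void configureInputJobProperties(String table, int rows, Map<String, String> jobProperties) {
            jobProperties.put("demo.table", table);
            jobProperties.put("demo.rows", Integer.toString(rows));
        }
        public Class<? extends InputFormat> getInputFormatClass() { return DemoInputFormat.class; }
    }

    /** Steps (b), (d) and (e) against an already-instantiated InputFormat. */
    static List<String> readAll(InputFormat format, StorageHandler handler, String table, int rows) {
        Map<String, String> jobProperties = new HashMap<>();
        handler.configureInputJobProperties(table, rows, jobProperties);   // (b)
        Conf conf = new Conf();
        jobProperties.forEach(conf::set);          // copied in before the InputFormat is touched
        List<String> records = new ArrayList<>();
        for (Split split : format.getSplits(conf)) {                       // (d)
            records.addAll(format.readRecords(split, conf));               // (e)
        }
        return records;
    }

    /** Reads two tables through ONE shared InputFormat instance. */
    static List<List<String>> runDemo() {
        try {
            StorageHandler handler =
                DemoStorageHandler.class.getDeclaredConstructor().newInstance();       // (a)
            InputFormat shared =
                handler.getInputFormatClass().getDeclaredConstructor().newInstance();  // (c)
            List<List<String>> results = new ArrayList<>();
            results.add(readAll(shared, handler, "t1", 3));
            results.add(readAll(shared, handler, "t2", 1));
            return results;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [[t1:0, t1:1, t1:2], [t2:0]]
    }
}
```

If DemoInputFormat instead remembered the table name in a field (set by some
format-specific setInput-style call), the second readAll on the shared instance
could see stale state, which is the kind of situation the instance cache in
HiveInputFormat has to worry about.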
> HiveInputFormat caching cannot work with all input formats
> ----------------------------------------------------------
>
> Key: HIVE-8808
> URL: https://issues.apache.org/jira/browse/HIVE-8808
> Project: Hive
> Issue Type: Bug
> Reporter: Brock Noland
>
> In {{HiveInputFormat}} we implement instance caching (see
> {{getInputFormatFromCache}}). In HS2, this assumes that InputFormats are
> stateless, but I don't think this assumption holds, especially with regard
> to HBase.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)