[
https://issues.apache.org/jira/browse/HCATALOG-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Travis Crawford resolved HCATALOG-142.
--------------------------------------
Resolution: Duplicate
> Reducing JobConf size used by HCatInputFormat
> ---------------------------------------------
>
> Key: HCATALOG-142
> URL: https://issues.apache.org/jira/browse/HCATALOG-142
> Project: HCatalog
> Issue Type: Improvement
> Affects Versions: 0.2, 0.3
> Reporter: Sushanth Sowmyan
> Labels: inputformat, metastore
>
> Currently, the static setInput() call in HCat fetches information about all
> the partitions we want to read from and stores it in the JobConf. It is
> stored there because setInput() is a static call, while getSplits(), which
> needs that information, is invoked on an object instantiated by the MR
> framework. Member variables therefore cannot carry the information from one
> call to the other, so we pass it along through the JobConf.
> We could instead contact the metastore at getSplits() time. That defers the
> metastore call, but it breaks other things, such as being able to check up
> front whether the input can succeed, or checking the schema. A hybrid
> approach would address that as well: contact the metastore during setInput()
> to get the schema and check whether input is possible, without fetching the
> partition objects to set in the JobConf, and then contact the metastore again
> during getSplits() to populate the splits with information fetched from the
> partition objects.
> Issues with this approach still exist:
> a) Contacting the metastore more than once increases the number of times it
> is loaded (technically this only moves accesses around, so it should be okay;
> the accesses are just spread out a bit more).
> b) Checks such as whether the partition objects are valid, or whether the
> specified storage drivers exist and can be instantiated, now happen at
> getSplits() time. Programs then have a harder time handling errors, because
> the failure occurs after they submit a job rather than in a pre-run check
> (this should also be okay for most programs).
> Further discussion and thoughts on this issue are welcome. :)
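
The hybrid approach in the quoted description could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Metastore` interface, the `sketch.input.*` keys, and the `Map<String, String>` standing in for the JobConf are all hypothetical, and none of this is the actual HCatInputFormat or Hadoop API. The point is that setInput() validates the table and stores a few small values, while getSplits() contacts the metastore a second time for the partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are HCatalog or Hadoop classes.
class HybridInputSketch {

    /** Stand-in for a metastore client (assumed interface, for illustration only). */
    interface Metastore {
        String getSchema(String table);                            // table-level metadata
        List<String> listPartitions(String table, String filter);  // one entry per partition
    }

    static final String TABLE_KEY = "sketch.input.table";
    static final String FILTER_KEY = "sketch.input.filter";
    static final String SCHEMA_KEY = "sketch.input.schema";

    /** Static, pre-submit step: validate the input and record only what is needed to re-query. */
    static void setInput(Map<String, String> conf, Metastore ms, String table, String filter) {
        String schema = ms.getSchema(table);
        if (schema == null) {
            // Early failure: the caller learns about a bad table before submitting the job.
            throw new IllegalArgumentException("No such table: " + table);
        }
        conf.put(TABLE_KEY, table);
        conf.put(FILTER_KEY, filter);
        conf.put(SCHEMA_KEY, schema);
    }

    /** Later step: second metastore contact; partition info never enters the conf. */
    static List<String> getSplits(Map<String, String> conf, Metastore ms) {
        List<String> splits = new ArrayList<>();
        for (String partition : ms.listPartitions(conf.get(TABLE_KEY), conf.get(FILTER_KEY))) {
            splits.add("split:" + partition);
        }
        return splits;
    }

    /** In-memory metastore used only for this demonstration; it ignores the filter. */
    static Metastore fakeMetastore(int partitionCount) {
        return new Metastore() {
            public String getSchema(String table) {
                return "events".equals(table) ? "id:int,ts:string" : null;
            }
            public List<String> listPartitions(String table, String filter) {
                List<String> parts = new ArrayList<>();
                for (int i = 0; i < partitionCount; i++) {
                    parts.add(table + "/part=" + i);
                }
                return parts;
            }
        };
    }

    public static void main(String[] args) {
        Metastore ms = fakeMetastore(1000);
        Map<String, String> conf = new HashMap<>();
        setInput(conf, ms, "events", "part >= 0");
        System.out.println("conf entries: " + conf.size());
        System.out.println("splits: " + getSplits(conf, ms).size());
    }
}
```

In this sketch the conf holds the same three entries however many partitions the table has. The cost is the one named in issue (b): a partition-level problem would only surface inside getSplits().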
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira