[ 
https://issues.apache.org/jira/browse/HCATALOG-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford resolved HCATALOG-142.
--------------------------------------

    Resolution: Duplicate
    
> Reducing JobConf size used by HCatInputFormat
> ---------------------------------------------
>
>                 Key: HCATALOG-142
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-142
>             Project: HCatalog
>          Issue Type: Improvement
>    Affects Versions: 0.2, 0.3
>            Reporter: Sushanth Sowmyan
>              Labels: inputformat, metastore
>
> Currently, the .setInput() call in HCat fetches information about all the 
> partitions we want to read from and stores it in the JobConf. It is stored 
> there because setInput() is called statically, and that information is 
> required when the MR framework calls getSplits(). Since the first call 
> is static and the second is made on an object instantiated by the MR 
> framework (so nothing can be passed through member variables), we pass that 
> information along through the JobConf.
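
A minimal sketch of the hand-off described above, to make the size concern concrete. It is a toy model, not HCatalog code: a plain Map stands in for the JobConf, and the class name, the key name, and the comma-separated encoding are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of the static-setInput / instance-getSplits hand-off. Not HCatalog code. */
class ConfHandoffSketch {
    // Hypothetical key; the real key name is not given in this issue.
    static final String PARTITIONS_KEY = "sketch.input.partitions";

    /** Static call made by the client: everything getSplits() needs must go into the conf. */
    static void setInput(Map<String, String> conf, List<String> partitionLocations) {
        conf.put(PARTITIONS_KEY, String.join(",", partitionLocations));
    }

    /** Instance call made later by the framework on a fresh object with no shared fields. */
    List<String> getSplits(Map<String, String> conf) {
        String serialized = conf.getOrDefault(PARTITIONS_KEY, "");
        List<String> splits = new ArrayList<>();
        if (!serialized.isEmpty()) {
            for (String location : serialized.split(",")) {
                splits.add("split:" + location);
            }
        }
        return splits;
    }

    /** Rough conf size in characters, to show how it grows with the partition count. */
    static int confSize(Map<String, String> conf) {
        int size = 0;
        for (Map.Entry<String, String> e : conf.entrySet()) {
            size += e.getKey().length() + e.getValue().length();
        }
        return size;
    }
}
```

In this model the conf grows linearly with the number of partitions selected, which is the growth the issue title is concerned with.
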
> Now, we could move the metastore contact to getSplits() time, which means 
> we contact the metastore late, but that breaks other things, like being able 
> to check up front whether the input can/will succeed, or checking the 
> schema, etc. We could instead follow a hybrid approach to address 
> that too: contact the metastore during setInput() to get the schema and 
> check whether input is possible, without fetching the partition objects at 
> that time to set in the JobConf, and then contact the metastore again during 
> getSplits() to populate the splits with information fetched from the 
> partition objects.
> Issues with this approach still exist:
> a) Contacting the metastore multiple times increases the number of requests 
> it has to serve (technically, this only moves accesses around, so it should 
> be okay; they are just spread out a bit more).
> b) Things like testing whether the partition objects are valid, whether the 
> storage drivers specified exist/can be instantiated, etc., now happen at 
> getSplits() time, which means programs have a harder time with 
> error handling, since this happens after they submit a job rather than as a 
> pre-run check. (This should also be okay for most programs.)
> Further discussion/thoughts on this issue are welcome. :)
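
The hybrid approach from the description could look roughly like the sketch below. Everything here is an assumption for illustration: the Metastore interface is a hypothetical stand-in (not the Hive metastore client API), a Map stands in for the JobConf, the key names are invented, and getSplits() is static only to keep the sketch short. In a real InputFormat it would be an instance method that has to create its own metastore client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of the hybrid approach: validate early, fetch partitions late. */
class HybridInputSketch {
    /** Hypothetical stand-in for a metastore client. */
    interface Metastore {
        String getSchema(String table);            // null if the table does not exist
        List<String> listPartitions(String table); // partition locations
    }

    // Hypothetical keys, invented for this sketch.
    static final String TABLE_KEY = "sketch.input.table";
    static final String SCHEMA_KEY = "sketch.input.schema";

    /** Early call: first metastore contact; stores only small, fixed-size facts. */
    static void setInput(Map<String, String> conf, Metastore ms, String table) {
        String schema = ms.getSchema(table);
        if (schema == null) {
            // Pre-submit check: the caller can still handle this before the job runs.
            throw new IllegalArgumentException("No such table: " + table);
        }
        conf.put(TABLE_KEY, table);
        conf.put(SCHEMA_KEY, schema);
    }

    /** Late call: second metastore contact; partition problems surface only here. */
    static List<String> getSplits(Map<String, String> conf, Metastore ms) {
        List<String> splits = new ArrayList<>();
        for (String location : ms.listPartitions(conf.get(TABLE_KEY))) {
            splits.add("split:" + location);
        }
        return splits;
    }
}
```

The conf now holds two entries regardless of partition count, at the cost of the two issues listed above: a second metastore request, and partition-level failures moving to getSplits() time.
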

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
