[ https://issues.apache.org/jira/browse/HCATALOG-453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473339#comment-13473339 ]

Travis Crawford commented on HCATALOG-453:
------------------------------------------

Awesome, thanks for the patch [~pengfeng]! Several people have run into this 
issue on the mailing list and at the meetup, so it's great to see a fix.

In general I like the approach you took here: leaving pretty much everything 
as-is and simply compressing the potentially big list of partitions. I agree 
the existing tests cover the functionality, since anything reading data from 
HCat hits this code path, and there's not much point in testing the 
compression itself. I also tested on a fully-distributed cluster, and a job 
with many partitions that previously would not run now runs fine.
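For readers following along, the gist of the approach is roughly the sketch 
below. This is illustrative only, not the actual patch: the class and method 
names are hypothetical, and the use of {{java.util.zip}} plus Base64 is just 
one way to shrink a serialized object before storing it in the job 
configuration.

{code}
// Minimal sketch of the compress-before-storing idea (hypothetical helper,
// not the actual patch): Java-serialize the object, gzip it, and Base64-encode
// the bytes so a potentially huge partition list takes far less room in the
// job configuration.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.bind.DatatypeConverter;

public class CompressedConfSerializer {

  // Returns a string suitable for storing with Configuration.set().
  public static String serialize(Serializable obj) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes));
    out.writeObject(obj);
    out.close(); // finishes the gzip stream so all compressed bytes are flushed
    return DatatypeConverter.printBase64Binary(bytes.toByteArray());
  }

  // Reverses serialize(): decode, decompress, deserialize.
  public static Object deserialize(String encoded)
      throws IOException, ClassNotFoundException {
    byte[] raw = DatatypeConverter.parseBase64Binary(encoded);
    ObjectInputStream in = new ObjectInputStream(
        new GZIPInputStream(new ByteArrayInputStream(raw)));
    try {
      return in.readObject();
    } finally {
      in.close();
    }
  }
}
{code}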

My only suggestion is that we improve the docs around this, because it's part 
of our public API. In our [Input and Output 
Interfaces|http://incubator.apache.org/hcatalog/docs/r0.4.0/inputoutput.html] 
documentation we see:

{code}
HCatInputFormat.setInput(job, InputJobInfo.create(dbName,
                inputTableName, null));
{code}

Since we change the serialization, anyone who saved an {{InputJobInfo}} 
somewhere would see breakage. I don't think anyone would do that (it seems 
like a horribly broken usage) but it's technically valid, so we should put a 
note in. If you don't object I can do that at commit time. I'm actually not a 
fan of exposing {{InputJobInfo}} publicly for exactly this reason, but it 
currently works this way.
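To illustrate the kind of usage that would break (purely hypothetical 
application code, not anything in HCatalog itself):

{code}
// Hypothetical: an application Java-serializes an InputJobInfo to disk with a
// pre-patch build, then tries to read it back with a build containing this
// patch. Because the serialized form changed, the readObject() call below can
// fail or produce a broken object.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.apache.hcatalog.mapreduce.InputJobInfo;

public class SavedJobInfoExample {
  public static void main(String[] args) throws Exception {
    // Written with the pre-patch jars...
    InputJobInfo info = InputJobInfo.create("db", "table", null);
    ObjectOutputStream out =
        new ObjectOutputStream(new FileOutputStream("jobinfo.bin"));
    out.writeObject(info);
    out.close();

    // ...read back with post-patch jars: the serialized form no longer matches.
    ObjectInputStream in =
        new ObjectInputStream(new FileInputStream("jobinfo.bin"));
    InputJobInfo restored = (InputJobInfo) in.readObject();
    in.close();
    System.out.println(restored);
  }
}
{code}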
                
> HCatalog queries fail due to exceeding max jobconf size
> -------------------------------------------------------
>
>                 Key: HCATALOG-453
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-453
>             Project: HCatalog
>          Issue Type: Bug
>    Affects Versions: 0.2, 0.4, 0.5
>            Reporter: Feng Peng
>            Assignee: Feng Peng
>         Attachments: HCATALOG-453.patch
>
>
> The following script fails because it exceeds the max jobconf size:
> {noformat}
> raw_data = LOAD 'db.table' using org.apache.hcatalog.pig.HCatLoader();
> filtered_data = FILTER raw_data BY (part_dt>='20120528T000000Z') and (part_dt<='20120624T230000Z');
> dump filtered_data;
> {noformat}
> Stacktrace:
> {noformat}
> 2012-07-18 00:06:24,067 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6017: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: Exceeded max jobconf size: 10523042 limit: 5242880
>         at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3766)
>         at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
> Caused by: java.io.IOException: Exceeded max jobconf size: 10523042 limit: 5242880
>         at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:406)
>         at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3764)
>         ... 10 more
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
