[ 
https://issues.apache.org/jira/browse/HCATALOG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542596#comment-13542596
 ] 

Arup Malakar commented on HCATALOG-580:
---------------------------------------

I hadn't seen the _temporary/_logs directories on branch-0.4 with 
hadoop 23. All of the e2e tests succeeded when I ran them for branch-0.4. Does 
this patch need to be applied to branch-0.4 as well? If so, I can 
report back with the performance numbers I have for the 100GB input data 
scenario of HCATALOG-538. It used to take 30 minutes on a 20 node cluster with 
HCATALOG-538.
                
> Optimizations in HCAT-538 break e2e tests
> -----------------------------------------
>
>                 Key: HCATALOG-580
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-580
>             Project: HCatalog
>          Issue Type: Bug
>    Affects Versions: 0.5
>         Environment: RH 5.8 (on AWS)
> Hadoop 1.1.2.17 (build)
> HCat 0.5 (build)
>            Reporter: Sushanth Sowmyan
>            Assignee: Daniel Dai
>            Priority: Blocker
>             Fix For: 0.5
>
>         Attachments: HCATALOG-580-1.patch, HCATALOG-580-2.patch, 
> HCATALOG-580-3.patch
>
>
> The optimizations brought in by HCATALOG-538 break dynamic partitioning in 
> the e2e tests. They assume that if the first child of a directory is a 
> directory, the rest are directories too, and that if the first child is a 
> file, the rest are files. That assumption is incorrect.
> (Admittedly, one half of it, assuming that a directory whose first child is 
> a file is a leaf directory, is not necessarily a bad premise, although it is 
> still incorrect.)
> The problem is that the underlying FileOutputCommitter and OutputFormat 
> behaviour affects whether you get files or directories, and whether any 
> _temporary directories are left behind, for example.
> In the case I tested, the issue is that there is a _temporary directory in a 
> "leaf" directory, followed by part files. The optimization sees the 
> _temporary directory, finds nothing inside it, so doesn't mkdir any parent, 
> then decides that the rest are directories, then moves to the part file, and 
> tries to rename it directly without mkdir-ing its parent directory.
> The e2e test conf in question is Pig_Checkin_7
> {code}
>                 {
>                                  'num' => 7
>                                 ,'hcat_prep'=>q\drop table if exists 
> pig_checkin_7;
> create table pig_checkin_7 (name string, age int) partitioned by (ds string) 
> STORED AS TEXTFILE;\
>                                 ,'pig' => q\a = load 'studentparttab30k' 
> using org.apache.hcatalog.pig.HCatLoader();
> b = foreach a generate name, age, ds;
> store b into 'pig_checkin_7' using org.apache.hcatalog.pig.HCatStorer();\,
>                                 ,'result_table' => 'pig_checkin_7',
>                                 ,'sql'   => "select name, age, ds from 
> studentparttab30k;",
>                                 ,'floatpostprocess' => 1
>                                 ,'delimiter' => '       '
>                 }
> {code}
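
The failure mode described in the quoted issue can be reproduced on a local filesystem. The sketch below is only an illustration, not HCatalog's code: the function names, the `ds=1` partition directory, and the use of Python's `os` module in place of the Hadoop FileSystem API are all assumptions made for the example. It builds a "leaf" directory holding an empty _temporary directory followed by part files, shows that a first-child heuristic tries to rename a part file into a parent that was never created, and contrasts it with a variant that checks each child.

```python
import os
import tempfile


def move_assuming_uniform_children(src, dst):
    """Hypothetical first-child heuristic (illustrative, not HCatalog's code)."""
    if os.path.isfile(src):
        os.rename(src, dst)  # assumes dst's parent was already created
        return
    children = sorted(os.listdir(src))
    if not children:
        return  # empty directory: nothing moved, nothing created
    first = os.path.join(src, children[0])
    if os.path.isfile(first):
        # First child is a file, so treat src as a leaf and create dst once.
        os.makedirs(dst, exist_ok=True)
    # Otherwise assume every child is a directory that handles its own mkdir.
    for name in children:
        move_assuming_uniform_children(
            os.path.join(src, name), os.path.join(dst, name))


def move_checking_each_child(src, dst):
    """Variant that inspects every child and creates parents per file."""
    if os.path.isfile(src):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        os.rename(src, dst)
        return
    for name in sorted(os.listdir(src)):
        move_checking_each_child(
            os.path.join(src, name), os.path.join(dst, name))


def make_leaf_with_temporary(root):
    """Build root/ds=1/{_temporary/, part-00000, part-00001}."""
    leaf = os.path.join(root, "ds=1")
    os.makedirs(os.path.join(leaf, "_temporary"))
    for part in ("part-00000", "part-00001"):
        with open(os.path.join(leaf, part), "w") as f:
            f.write("data\n")
    return root


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = make_leaf_with_temporary(os.path.join(tmp, "src"))
        dst = os.path.join(tmp, "dst")
        try:
            move_assuming_uniform_children(src, dst)
            heuristic_failed = False
        except FileNotFoundError:
            # The rename targeted dst/ds=1/part-00000, but dst/ds=1 is missing.
            heuristic_failed = True
        move_checking_each_child(src, dst)
        moved = sorted(os.listdir(os.path.join(dst, "ds=1")))
    return heuristic_failed, moved


if __name__ == "__main__":
    print(demo())
```

The per-child variant trades the saved filesystem calls for correctness, so it gives up some of what HCATALOG-538 was trying to gain; the attached patches may well strike that balance differently.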

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
