Re: Performance with Insert overwrite into Hive Table.
Thanks Ted. This looks like the issue, since I am running it on EMR and the Hive version is 1.0.0.

Thanks,
Bijay

On Wed, May 4, 2016 at 10:29 AM, Ted Yu wrote:
> Looks like you were hitting HIVE-11940
Re: Performance with Insert overwrite into Hive Table.
Looks like you were hitting HIVE-11940

On Wed, May 4, 2016 at 10:02 AM, Bijay Kumar Pathak wrote:
> But the job is taking a huge amount of time more than 2 hours to copy data
> from .hivestaging directory in HDFS to final partition directory. What
> could be the potential problem over here?
Performance with Insert overwrite into Hive Table.
Hello,

I am writing a DataFrame of 60+ GB into a partitioned Hive table in Parquet format using hiveContext. The Spark insert overwrite job itself completes in a reasonable amount of time, around 20 minutes.

But the job then takes more than 2 hours to copy the data from the .hive-staging directory in HDFS to the final partition directory. What could be the potential problem here?

    hive_c.sql("""
        INSERT OVERWRITE TABLE {0} PARTITION (row_eff_end_dt='{1}', ccd_dt)
        SELECT * from temp_table
        """.format(table_name, eff_end_dt)
    )

The step below, taken from the log, is the one that runs for more than 2 hours:

    16/05/04 06:41:28 INFO Hive: Replacing src:hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-1/ccd_dt=2012-09-02/part-00306, dest: hdfs://internal:8020/user/hadoop/so_core_us/row_eff_end_dt=-12-31/ccd_dt=2012-09-02/part-00306, Status:true
    16/05/04 06:41:28 INFO Hive: New loading path = hdfs://internal:8020/user/hadoop/so_core_us/.hive-staging_hive_2016-05-04_04-39-13_992_6600245407573569189-1/-ext-1/ccd_dt=2012-09-02 with partSpec {row_eff_end_dt=-12-31, ccd_dt=2012-09-02}

Thanks,
Bijay
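To confirm that the two hours really go into moving files one at a time, a rough sketch like the following could tally the "Replacing" lines in the driver log and measure the time between the first and the last one. The function name summarize_moves and the shortened sample paths are made up for illustration, and the regular expression assumes the log layout shown in the excerpt above (a yy/MM/dd HH:mm:ss prefix followed by "INFO Hive: Replacing src:..., dest: ..., Status:...").

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the log excerpt in the message above:
#   yy/MM/dd HH:mm:ss INFO Hive: Replacing src:<path>, dest: <path>, Status:<bool>
REPLACING = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO Hive: Replacing\s+"
    r"src:\s*(\S+?),\s*dest:\s*(\S+?),"
)


def summarize_moves(lines):
    """Return (files_moved, elapsed_seconds, files_per_destination_dir)."""
    stamps = []
    per_dir = Counter()
    for line in lines:
        match = REPLACING.match(line)
        if not match:
            continue  # ignore every other log line
        stamps.append(datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S"))
        dest = match.group(3)
        # Drop the file name so that moves are grouped by partition directory.
        per_dir[dest.rsplit("/", 1)[0]] += 1
    elapsed = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return len(stamps), elapsed, per_dir


# Hypothetical, shortened log lines in the same layout as the real excerpt.
sample = [
    "16/05/04 06:41:28 INFO Hive: Replacing src:hdfs://nn/t/.stage/ccd_dt=2012-09-02/part-00306, "
    "dest: hdfs://nn/t/ccd_dt=2012-09-02/part-00306, Status:true",
    "16/05/04 06:41:31 INFO Hive: Replacing src:hdfs://nn/t/.stage/ccd_dt=2012-09-02/part-00307, "
    "dest: hdfs://nn/t/ccd_dt=2012-09-02/part-00307, Status:true",
    "16/05/04 06:41:31 INFO Hive: New loading path = hdfs://nn/t/.stage/ccd_dt=2012-09-02",
]
files, seconds, per_dir = summarize_moves(sample)
print(files, seconds, dict(per_dir))
```

On a real log, passing the open file object to summarize_moves works the same way. A large file count spread evenly over the whole slow window would suggest that the time is spent in per-file moves rather than in the Spark write itself.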