[ 
https://issues.apache.org/jira/browse/IMPALA-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell updated IMPALA-9777:
----------------------------------
    Description: 
Currently, dataload for the Impala development environment uses Hive to 
populate tpcds.store_sales. We use several insert statements that select from 
tpcds.store_sales_unpartitioned, which is loaded from text files. The inserts 
have this form:
{noformat}
insert overwrite table {table_name} partition(ss_sold_date_sk)
select ss_sold_time_sk,
  ss_item_sk,
  ss_customer_sk,
  ss_cdemo_sk,
  ss_hdemo_sk,
  ss_addr_sk,
  ss_store_sk,
  ss_promo_sk,
  ss_ticket_number,
  ss_quantity,
  ss_wholesale_cost,
  ss_list_price,
  ss_sales_price,
  ss_ext_discount_amt,
  ss_ext_sales_price,
  ss_ext_wholesale_cost,
  ss_ext_list_price,
  ss_ext_tax,
  ss_coupon_amt,
  ss_net_paid,
  ss_net_paid_inc_tax,
  ss_net_profit,
  ss_sold_date_sk
from store_sales_unpartitioned
WHERE ss_sold_date_sk < 2451272
distribute by ss_sold_date_sk;{noformat}
Since this inserts into a partitioned table, it creates a file per 
partition. Each statement manipulates hundreds of partitions. With the 
current settings, the Hive implementation of this insert opens several 
hundred files simultaneously (by my measurement, ~450). HDFS reserves a 
whole block for each open file (even though the resulting files are not 
large), and if there isn't enough disk space for all of the reservations, 
these inserts can fail. This is a common problem in development 
environments, and it is currently causing failures in the erasure coding 
tests.
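
For a rough sense of scale, here is a back-of-the-envelope calculation 
assuming the HDFS defaults of a 128 MB block size and 3x replication (the 
actual minicluster settings may differ):
{noformat}
450 open files * 128 MB reserved per block * 3 replicas ~= 169 GB reserved{noformat}
Even though the final files are small, the reservation is what has to fit 
on disk while the insert is running.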

Impala uses clustered inserts, where the input is sorted and files are 
written one at a time (per backend). This limits the number of 
simultaneously open files, eliminating the corresponding disk space 
reservation. Switching to Impala for populating tpcds.store_sales would 
reduce the disk space requirement for an Impala development environment. 
Alternatively, Hive likely has equivalent functionality for doing an 
initial sort so that only one partition needs to be written at a time.
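
As a sketch of that Hive-side alternative (untested against our dataload 
scripts): stock Hive has a hive.optimize.sort.dynamic.partition setting 
that sorts rows by the dynamic partition column before writing, so each 
writer holds only one open file at a time. Roughly:
{noformat}
set hive.optimize.sort.dynamic.partition=true;

insert overwrite table store_sales partition(ss_sold_date_sk)
select ss_sold_time_sk, ..., ss_sold_date_sk  -- same column list as above
from store_sales_unpartitioned
where ss_sold_date_sk < 2451272;{noformat}
If the sort handles the partition grouping, the trailing "distribute by 
ss_sold_date_sk" should no longer be needed to limit open files.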

This only applies to the text version of store_sales, which is created from 
store_sales_unpartitioned. All other formats are created from the text version 
of store_sales. Since the text store_sales is already partitioned in the same 
way as the destination store_sales, Hive can be more efficient, processing a 
small number of partitions at a time.

  was:
Currently, dataload for the Impala development environment uses Hive to 
populate tpcds.store_sales. We use several insert statements that select from 
tpcds.store_sales_unpartitioned, which is loaded from text files. The inserts 
have this form:
{noformat}
insert overwrite table {table_name} partition(ss_sold_date_sk)
select ss_sold_time_sk,
  ss_item_sk,
  ss_customer_sk,
  ss_cdemo_sk,
  ss_hdemo_sk,
  ss_addr_sk,
  ss_store_sk,
  ss_promo_sk,
  ss_ticket_number,
  ss_quantity,
  ss_wholesale_cost,
  ss_list_price,
  ss_sales_price,
  ss_ext_discount_amt,
  ss_ext_sales_price,
  ss_ext_wholesale_cost,
  ss_ext_list_price,
  ss_ext_tax,
  ss_coupon_amt,
  ss_net_paid,
  ss_net_paid_inc_tax,
  ss_net_profit,
  ss_sold_date_sk
from store_sales_unpartitioned
WHERE ss_sold_date_sk < 2451272
distribute by ss_sold_date_sk;{noformat}
Since this inserts into a partitioned table, it creates a file per 
partition. Each statement manipulates hundreds of partitions. The Hive 
implementation of this insert opens several hundred files simultaneously 
(by my measurement, ~450). HDFS reserves a whole block for each open file 
(even though the resulting files are not large), and if there isn't enough 
disk space for all of the reservations, these inserts can fail. This is a 
common problem in development environments.

Impala uses clustered inserts, where the input is sorted and files are 
written one at a time (per backend). This limits the number of 
simultaneously open files, eliminating the corresponding disk space 
reservation. Switching to Impala for populating tpcds.store_sales would 
reduce the disk space requirement for an Impala development environment.

This only applies to the text version of store_sales, which is created from 
store_sales_unpartitioned. All other formats are created from the text version 
of store_sales. Since the text store_sales is already partitioned in the same 
way as the destination store_sales, Hive can be more efficient, processing a 
small number of partitions at a time.


> Reduce the diskspace requirements of loading the text version of 
> tpcds.store_sales
> ----------------------------------------------------------------------------------
>
>                 Key: IMPALA-9777
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9777
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Infrastructure
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Major
>         Attachments: namenodeparse.py
>
>
> Currently, dataload for the Impala development environment uses Hive to 
> populate tpcds.store_sales. We use several insert statements that select from 
> tpcds.store_sales_unpartitioned, which is loaded from text files. The 
> inserts have this form:
> {noformat}
> insert overwrite table {table_name} partition(ss_sold_date_sk)
> select ss_sold_time_sk,
>   ss_item_sk,
>   ss_customer_sk,
>   ss_cdemo_sk,
>   ss_hdemo_sk,
>   ss_addr_sk,
>   ss_store_sk,
>   ss_promo_sk,
>   ss_ticket_number,
>   ss_quantity,
>   ss_wholesale_cost,
>   ss_list_price,
>   ss_sales_price,
>   ss_ext_discount_amt,
>   ss_ext_sales_price,
>   ss_ext_wholesale_cost,
>   ss_ext_list_price,
>   ss_ext_tax,
>   ss_coupon_amt,
>   ss_net_paid,
>   ss_net_paid_inc_tax,
>   ss_net_profit,
>   ss_sold_date_sk
> from store_sales_unpartitioned
> WHERE ss_sold_date_sk < 2451272
> distribute by ss_sold_date_sk;{noformat}
> Since this inserts into a partitioned table, it creates a file per 
> partition. Each statement manipulates hundreds of partitions. With the 
> current settings, the Hive implementation of this insert opens several 
> hundred files simultaneously (by my measurement, ~450). HDFS reserves a 
> whole block for each open file (even though the resulting files are not 
> large), and if there isn't enough disk space for all of the reservations, 
> these inserts can fail. This is a common problem in development 
> environments, and it is currently causing failures in the erasure coding tests.
> Impala uses clustered inserts, where the input is sorted and files are 
> written one at a time (per backend). This limits the number of 
> simultaneously open files, eliminating the corresponding disk space 
> reservation. Switching to Impala for populating tpcds.store_sales would 
> reduce the disk space requirement for an Impala development environment. 
> Alternatively, Hive likely has equivalent functionality for doing an 
> initial sort so that only one partition needs to be written at a time.
> This only applies to the text version of store_sales, which is created from 
> store_sales_unpartitioned. All other formats are created from the text 
> version of store_sales. Since the text store_sales is already partitioned in 
> the same way as the destination store_sales, Hive can be more efficient, 
> processing a small number of partitions at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
