[ 
https://issues.apache.org/jira/browse/IMPALA-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell reassigned IMPALA-9777:
-------------------------------------

    Assignee: Sahil Takiar  (was: Joe McDonnell)

> Reduce the diskspace requirements of loading the text version of 
> tpcds.store_sales
> ----------------------------------------------------------------------------------
>
>                 Key: IMPALA-9777
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9777
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Infrastructure
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: namenodeparse.py
>
>
> Currently, dataload for the Impala development environment uses Hive to 
> populate tpcds.store_sales. We use several insert statements that select from 
> tpcds.stores_sales_unpartitioned, which is loaded from text files. The 
> inserts have this form:
> {noformat}
> insert overwrite table {table_name} partition(ss_sold_date_sk)
> select ss_sold_time_sk,
>   ss_item_sk,
>   ss_customer_sk,
>   ss_cdemo_sk,
>   ss_hdemo_sk,
>   ss_addr_sk,
>   ss_store_sk,
>   ss_promo_sk,
>   ss_ticket_number,
>   ss_quantity,
>   ss_wholesale_cost,
>   ss_list_price,
>   ss_sales_price,
>   ss_ext_discount_amt,
>   ss_ext_sales_price,
>   ss_ext_wholesale_cost,
>   ss_ext_list_price,
>   ss_ext_tax,
>   ss_coupon_amt,
>   ss_net_paid,
>   ss_net_paid_inc_tax,
>   ss_net_profit,
>   ss_sold_date_sk
> from store_sales_unpartitioned
> WHERE ss_sold_date_sk < 2451272
> distribute by ss_sold_date_sk;{noformat}
> Since this is inserting into a partitioned table, it is creating a file per 
> partition. Each statement manipulates hundreds of partitions. With the 
> current settings, the Hive implementation of this insert opens several 
> hundred files simultaneously (by my measurement, ~450). HDFS reserves a whole 
> block for each file (even though the resulting files are not large), and if 
> there isn't enough disk space for all of the reservations, then these inserts 
> can fail. This is a common problem on development environments. This is 
> currently failing for erasure coding tests.
> Impala uses clustered inserts where the input is sorted and files are written 
> one at a time (per backend). This limits the number of simultaneously open 
> files, eliminating the corresponding disk space reservation. Switching 
> populating tpcds.store_sales to use Impala would reduce the diskspace 
> requirement for an Impala developer environment. Alternatively, there is 
> likely equivalent Hive functionality for doing an initial sort so that only 
> one partition needs to be written at a time.
> This only applies to the text version of store_sales, which is created from 
> store_sales_unpartitioned. All other formats are created from the text 
> version of store_sales. Since the text store_sales is already partitioned in 
> the same way as the destination store_sales, Hive can be more efficient, 
> processing a small number of partitions at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to