[ 
https://issues.apache.org/jira/browse/IMPALA-9777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118997#comment-17118997
 ] 

Joe McDonnell commented on IMPALA-9777:
---------------------------------------

[~stakiar] Yeah, unfortunately, it is a hacked-together script, and it doesn't 
work on EC namenode logs. I was testing in my regular minicluster deployment 
without EC. I would shut down HDFS, delete the existing namenode log, start up 
everything, and then run "bin/load-data.py -w tpcds -f". The log then has 
distinctive spikes corresponding to tpcds.store_sales inserts.

> Use Impala to load the text version of tpcds.store_sales
> --------------------------------------------------------
>
>                 Key: IMPALA-9777
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9777
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Infrastructure
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Major
>         Attachments: namenodeparse.py
>
>
> Currently, dataload for the Impala development environment uses Hive to 
> populate tpcds.store_sales. We use several insert statements that select from 
> tpcds.store_sales_unpartitioned, which is loaded from text files. The 
> inserts have this form:
> {noformat}
> insert overwrite table {table_name} partition(ss_sold_date_sk)
> select ss_sold_time_sk,
>   ss_item_sk,
>   ss_customer_sk,
>   ss_cdemo_sk,
>   ss_hdemo_sk,
>   ss_addr_sk,
>   ss_store_sk,
>   ss_promo_sk,
>   ss_ticket_number,
>   ss_quantity,
>   ss_wholesale_cost,
>   ss_list_price,
>   ss_sales_price,
>   ss_ext_discount_amt,
>   ss_ext_sales_price,
>   ss_ext_wholesale_cost,
>   ss_ext_list_price,
>   ss_ext_tax,
>   ss_coupon_amt,
>   ss_net_paid,
>   ss_net_paid_inc_tax,
>   ss_net_profit,
>   ss_sold_date_sk
> from store_sales_unpartitioned
> WHERE ss_sold_date_sk < 2451272
> distribute by ss_sold_date_sk;{noformat}
> Since this is inserting into a partitioned table, it creates a file per 
> partition. Each statement manipulates hundreds of partitions. The Hive 
> implementation of this insert opens several hundred files simultaneously (by 
> my measurement, ~450). HDFS reserves a whole block for each file (even though 
> the resulting files are not large), and if there isn't enough disk space for 
> all of the reservations, these inserts can fail. This is a common problem in 
> development environments.
> Impala uses clustered inserts, where the input is sorted and files are written 
> one at a time (per backend). This limits the number of simultaneously open 
> files, eliminating the corresponding disk space reservations. Switching to 
> Impala for populating tpcds.store_sales would reduce the disk space 
> requirement for an Impala development environment.
> This only applies to the text version of store_sales, which is created from 
> store_sales_unpartitioned. All other formats are created from the text 
> version of store_sales. Since the text store_sales is already partitioned in 
> the same way as the destination store_sales, Hive can be more efficient, 
> processing a small number of partitions at a time.
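
As a rough illustration of the reservation arithmetic described in the issue: 
assuming HDFS's default 128 MB block size and replication factor 1 (both are 
assumptions about the minicluster config; only the ~450 open files figure comes 
from the measurement above), the transient reservation during the Hive insert 
is tens of gigabytes, while writing one file at a time reserves a single block:

```python
# Back-of-envelope sketch of the disk pressure described in the issue.
# HDFS_BLOCK_SIZE_MB and REPLICATION are assumed values, not measured ones;
# OPEN_FILES is the ~450 simultaneously open files measured for the Hive insert.

HDFS_BLOCK_SIZE_MB = 128   # assumed dfs.blocksize default
REPLICATION = 1            # assumed minicluster replication factor
OPEN_FILES = 450           # simultaneously open partition files (measured)

def reserved_mb(open_files=OPEN_FILES,
                block_size_mb=HDFS_BLOCK_SIZE_MB,
                replication=REPLICATION):
    """Space HDFS reserves while files are open: one full block per open
    file per replica, regardless of how small the final files end up."""
    return open_files * block_size_mb * replication

print(f"~{reserved_mb() / 1024:.1f} GB reserved during the Hive insert")
print(f"~{reserved_mb(open_files=1) / 1024:.3f} GB with one file open at a time")
```

This is why limiting the number of simultaneously open writers (as Impala's 
clustered inserts do) shrinks the disk space requirement even though the total 
data written is unchanged.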



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
