[ 
https://issues.apache.org/jira/browse/IMPALA-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864358#comment-16864358
 ] 

Todd Lipcon commented on IMPALA-8630:
-------------------------------------

bq. It would be nice to avoid reconstructing the path and hashing it on every 
query
bq. ...not just the partition path but also the filename, it would reduce the 
cost of hashing in the scheduler....

Do you have any data to suggest that the reconstructing and hashing would be a 
bottleneck? Assuming a very high upper bound of a million files at 100 bytes 
each, the cost here is hashing 100 MB of data, which should take a few tens of 
milliseconds. For typical queries on tens of thousands of files, I can't imagine 
this showing up at all relative to other costs.
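
A rough way to check that estimate is to time a simple hash over a million synthetic 100-byte paths. The sketch below is only an illustration: it uses FNV-1a as a stand-in rather than Impala's HashUtil::Hash, and the names Fnv1a, MakePaths and TimeHashingMs as well as the path layout are made up for the example.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// 32-bit FNV-1a, used purely as a stand-in hash (assumption: this is NOT
// Impala's HashUtil::Hash, whose cost per byte may differ).
inline uint32_t Fnv1a(const char* data, size_t len, uint32_t seed) {
  uint32_t h = 2166136261u ^ seed;
  for (size_t i = 0; i < len; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 16777619u;
  }
  return h;
}

// Builds num_files synthetic paths, each padded (or truncated) to exactly
// path_len bytes. The path layout is hypothetical.
inline std::vector<std::string> MakePaths(size_t num_files, size_t path_len) {
  std::vector<std::string> paths;
  paths.reserve(num_files);
  for (size_t i = 0; i < num_files; ++i) {
    std::string p = "/warehouse/tbl/part=" + std::to_string(i) + "/000000_0";
    p.resize(path_len, 'x');
    paths.push_back(std::move(p));
  }
  return paths;
}

// Hashes every path once and returns the elapsed wall-clock milliseconds.
inline double TimeHashingMs(const std::vector<std::string>& paths,
                            uint32_t* sink) {
  auto start = std::chrono::steady_clock::now();
  uint32_t acc = 0;
  for (const std::string& p : paths) acc ^= Fnv1a(p.data(), p.size(), 0);
  auto end = std::chrono::steady_clock::now();
  *sink = acc;  // keep the result live so the loop is not optimized away
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Calling TimeHashingMs(MakePaths(1000000, 100), &sink) from a small driver gives the time to hash roughly 100 MB on the machine at hand. The number depends on hardware, compiler flags and the hash function, so it is a sanity check, not a measurement of the scheduler.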

bq. We could hash it there and put it in FbFileDesc

That would have a persistent memory cost on the catalogd, which seems like 
something we should avoid.



> Consistent remote placement should include partition information when 
> calculating placement
> -------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-8630
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8630
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.2.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Blocker
>
> For partitioned tables, the actual filenames within partitions may not have 
> large entropy. Impala includes information in its filenames that differs 
> across partitions, but low-entropy filenames are common for tables written by 
> the current CDH version of Hive. For example, in our minicluster, the TPC-DS 
> store_sales table has many partitions, but the actual filenames within 
> partitions are very simple:
> {noformat}
> hdfs dfs -ls /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452642
> Found 1 items
> -rwxr-xr-x 3 joe supergroup 379535 2019-06-05 15:16 
> /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452642/000000_0
> hdfs dfs -ls /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452640
> Found 1 items
> -rwxr-xr-x 3 joe supergroup 412959 2019-06-05 15:16 
> /test-warehouse/tpcds.store_sales/ss_sold_date_sk=2452640/000000_0{noformat}
> Right now, consistent remote placement uses the filename+offset without the 
> partition id.
> {code:java}
> uint32_t hash = HashUtil::Hash(hdfs_file_split->relative_path.data(),
>       hdfs_file_split->relative_path.length(), 0);
> {code}
> This would produce a poor balance of files across nodes when there is low 
> entropy in filenames. This should be amended to include the partition id, 
> which is already accessible on the THdfsFileSplit.
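
One way the proposed change could look is to chain the hash so that the partition id is mixed in after the path. This is a minimal sketch under stated assumptions, not the actual Impala patch: FileSplit is a hypothetical stand-in for THdfsFileSplit, the field names partition_id and offset are assumed, and HashBytes (FNV-1a) replaces HashUtil::Hash.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for THdfsFileSplit with only the fields used here.
// The field names are assumptions made for this example.
struct FileSplit {
  std::string relative_path;
  int64_t partition_id;
  int64_t offset;
};

// 32-bit FNV-1a over raw bytes, seeded so calls can be chained.
// Stand-in only; this is not Impala's HashUtil::Hash.
inline uint32_t HashBytes(const void* data, size_t len, uint32_t seed) {
  const unsigned char* bytes = static_cast<const unsigned char*>(data);
  uint32_t h = 2166136261u ^ seed;
  for (size_t i = 0; i < len; ++i) {
    h ^= bytes[i];
    h *= 16777619u;
  }
  return h;
}

// Hashes the path first, then feeds each result in as the seed for the next
// field, so two splits with the same filename but different partitions
// produce different placement hashes.
inline uint32_t PlacementHash(const FileSplit& split) {
  uint32_t h = HashBytes(split.relative_path.data(),
                         split.relative_path.size(), 0);
  h = HashBytes(&split.partition_id, sizeof(split.partition_id), h);
  h = HashBytes(&split.offset, sizeof(split.offset), h);
  return h;
}
```

With this shape, the two 000000_0 files from the listing above would no longer collide, since their partition ids differ. The extra work per split is hashing a few more bytes at scheduling time, with nothing additional stored in the catalog.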


