[
https://issues.apache.org/jira/browse/HIVE-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jimmy Xiang updated HIVE-8851:
------------------------------
Status: Open (was: Patch Available)
> Broadcast files for small tables via SparkContext.addFile() and
> SparkFiles.get() [Spark Branch]
> -----------------------------------------------------------------------------------------------
>
> Key: HIVE-8851
> URL: https://issues.apache.org/jira/browse/HIVE-8851
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Jimmy Xiang
> Fix For: spark-branch
>
> Attachments: HIVE-8851.1-spark.patch
>
>
> Currently, the files generated by SparkHashTableSinkOperator for small
> tables are written directly to HDFS with a high replication factor. When a
> map join happens, the map join operator loads these files into hash tables.
> Since multiple partitions can be processed on the same worker node, reading
> the same set of files multiple times is not ideal. The improvement can be
> made by calling SparkContext.addFile() on these files and using
> SparkFiles.get() to download them to the worker node just once.
> Please note that SparkFiles.get() is a static method. The code invoking this
> method needs to be in a static method, and that method needs to be
> synchronized because it may be called from different threads.
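A minimal sketch of the thread-safety pattern the description calls for: a synchronized static method that localizes each small-table file only once per worker, no matter how many task threads ask for it. The class name, the in-memory cache, and the fetch counter are hypothetical illustrations; in Hive the body of the `if` would call `SparkFiles.get(fileName)` to resolve a file previously registered on the driver via `SparkContext.addFile(path)`.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical loader illustrating the "synchronized static" requirement
// from the description above. SparkFiles.get() is static, so the caller
// wrapping it is static too, and synchronized so concurrent map-join task
// threads on the same worker trigger only one download per file.
public class SmallTableFileLoader {
    // Files this worker has already localized (stand-in for checking the
    // local Spark files directory).
    private static final Set<String> loaded = new HashSet<>();
    private static int fetchCount = 0;

    public static synchronized String getLocalPath(String fileName) {
        if (!loaded.contains(fileName)) {
            // Stand-in for SparkFiles.get(fileName), which downloads the
            // file broadcast earlier with SparkContext.addFile().
            fetchCount++;
            loaded.add(fileName);
        }
        return "/tmp/spark-files/" + fileName; // hypothetical local path
    }

    // Exposed only so the once-per-file behavior can be observed.
    public static synchronized int getFetchCount() {
        return fetchCount;
    }
}
```

Calling `getLocalPath("t1.hashtable")` repeatedly, even from different threads, performs the fetch step a single time; every subsequent call returns the cached local path.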
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)