[
https://issues.apache.org/jira/browse/HIVE-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stamatis Zampetakis updated HIVE-8851:
--------------------------------------
I cleared the fixVersion field since this ticket is not resolved. Please review
this ticket and, if the fix is already committed to a specific version, please
set the version accordingly and mark the ticket as RESOLVED.
According to the JIRA guidelines
(https://cwiki.apache.org/confluence/display/Hive/HowToContribute), the
fixVersion should be set only when the issue is resolved/closed.
> Broadcast files for small tables via SparkContext.addFile() and
> SparkFiles.get() [Spark Branch]
> -----------------------------------------------------------------------------------------------
>
> Key: HIVE-8851
> URL: https://issues.apache.org/jira/browse/HIVE-8851
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Jimmy Xiang
> Priority: Major
> Fix For: spark-branch
>
> Attachments: HIVE-8851.1-spark.patch, HIVE-8851.2-spark.patch
>
>
> Currently, files generated by SparkHashTableSinkOperator for small tables are
> written directly to HDFS with a high replication factor. When a map join
> happens, the map join operator loads these files into hash tables.
> Since multiple partitions can be processed on the same worker node, reading
> the same set of files multiple times is not ideal. This can be improved
> by calling SparkContext.addFile() on these files and using
> SparkFiles.get() to download them to the worker node just once.
> Please note that SparkFiles.get() is a static method. Code invoking this
> method needs to be in a static method. This calling method needs to be
> synchronized because it may be called from different threads.
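
For illustration only, the synchronized static lookup described above could be
sketched roughly as follows. This is not taken from the attached patches. The
class name SmallTableFileCache, the method localPath, and the resolver
parameter are all hypothetical; the resolver stands in for SparkFiles.get()
so that the sketch compiles without Spark on the classpath. On the driver
side, the files would first be registered with SparkContext.addFile().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: resolves each small-table file at most once per JVM. */
final class SmallTableFileCache {
    // File name -> local path, shared by all task threads in this JVM.
    private static final Map<String, String> LOCAL_PATHS = new HashMap<>();

    private SmallTableFileCache() {}

    /**
     * Returns the local path for fileName, invoking the resolver only on the
     * first request. Assumption: in a Spark job the resolver would be a call
     * to SparkFiles.get(fileName); it is passed in here to keep the sketch
     * self-contained.
     */
    static synchronized String localPath(String fileName,
                                         Function<String, String> resolver) {
        String path = LOCAL_PATHS.get(fileName);
        if (path == null) {
            // First request in this JVM: resolve and remember the location.
            path = resolver.apply(fileName);
            LOCAL_PATHS.put(fileName, path);
        }
        return path;
    }
}
```

Because the method is static and synchronized, tasks for different partitions
running in the same JVM share one lookup per file instead of each resolving it
separately.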
--
This message was sent by Atlassian Jira
(v8.20.10#820010)