Xuefu Zhang created HIVE-8851:
---------------------------------
Summary: Broadcast files for small tables via
SparkContext.addFile() and SparkFiles.get() [Spark Branch]
Key: HIVE-8851
URL: https://issues.apache.org/jira/browse/HIVE-8851
Project: Hive
Issue Type: Sub-task
Components: Spark
Reporter: Xuefu Zhang
Currently, the files that SparkHashTableSinkOperator generates for small tables
are written directly to HDFS with a high replication factor. When a map join
runs, the map join operator loads these files into hash tables. Because multiple
partitions can be processed on the same worker node, that node may read the same
set of files multiple times, which is not ideal. This can be improved by calling
SparkContext.addFile() on these files and using SparkFiles.get() to download
them to the worker node just once.
Please note that SparkFiles.get() is a static method. Code invoking this
method needs to be in a static method. This calling method needs to be
synchronized because it may be called from different threads.
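A minimal sketch of the synchronized static accessor described above, not a
proposed patch. The class name SmallTableFiles and the method localPath() are
hypothetical. The resolver argument stands in for SparkFiles.get() so that the
sketch runs without Spark on the classpath; on the driver side, the files would
first be registered with SparkContext.addFile().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of Hive or Spark.
final class SmallTableFiles {
    // Shared by every task thread running in the same JVM.
    private static final Map<String, String> LOCAL_PATHS = new HashMap<>();

    private SmallTableFiles() {
    }

    // Synchronized because tasks for different partitions may call this
    // method from different threads at the same time.
    // Assumption: inside a Spark task the resolver would be SparkFiles::get,
    // which maps a file name to its local path on the worker.
    static synchronized String localPath(String fileName,
                                         Function<String, String> resolver) {
        String path = LOCAL_PATHS.get(fileName);
        if (path == null) {
            // First request in this JVM: resolve once and remember the result.
            path = resolver.apply(fileName);
            LOCAL_PATHS.put(fileName, path);
        }
        return path;
    }
}
```

Inside a task, a call might look like
SmallTableFiles.localPath(fileName, SparkFiles::get), where fileName is the
name of a file previously passed to addFile(). Later calls for the same name
return the cached path without invoking the resolver again.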