[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2015-03-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343977#comment-14343977
 ] 

Xuefu Zhang commented on SPARK-3621:


{quote}
 you can go a step further if you wanted to and just read the data directly 
from HDFS into a (singleton) cache on the executor.
{quote}
Yeah. Sharing a cache within an executor is something we wanted to do, but it 
seems a little complicated due to concurrency and the lack of a job boundary at 
the executor.
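
The concurrency concern can be illustrated with a minimal, language-neutral sketch in Python. It assumes a single process-wide cache shared by concurrently running tasks; `ProcessLocalCache`, `get_or_load`, `evict`, and `CACHE` are hypothetical names, not Spark or Hive APIs. A lock ensures the loader runs at most once per key. Because nothing signals the end of a job, eviction has to be triggered explicitly, which is the "job boundary" difficulty mentioned above.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class ProcessLocalCache:
    """Hypothetical per-process cache: each key is loaded at most once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[Hashable, Any] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        # Holding the lock while loading serializes all loads. That is simple
        # and safe, but a slow load blocks tasks asking for other keys.
        with self._lock:
            if key not in self._values:
                self._values[key] = loader()
            return self._values[key]

    def evict(self, key: Hashable) -> None:
        # The cache cannot tell when a job has finished, so the caller has
        # to decide when an entry is no longer needed.
        with self._lock:
            self._values.pop(key, None)


# One shared instance per process, standing in for the "singleton" cache.
CACHE = ProcessLocalCache()
```

A finer-grained design (one lock per key) would let unrelated loads proceed in parallel, at the cost of more bookkeeping.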

 Provide a way to broadcast an RDD (instead of just a variable made of the 
 RDD) so that a job can access
 ---

 Key: SPARK-3621
 URL: https://issues.apache.org/jira/browse/SPARK-3621
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0, 1.1.0
Reporter: Xuefu Zhang

 In some cases, such as Hive's way of doing map-side join, it would be 
 beneficial to allow a client program to broadcast RDDs rather than just 
 variables made of these RDDs. Broadcasting a variable made of RDDs requires 
 that all RDD data be collected to the driver and that the variable be shipped 
 to the cluster after being made. It would perform better if the driver just 
 broadcast the RDDs and used the corresponding data in jobs (such as building 
 hashmaps at executors).
 Tez has a broadcast edge which can ship data from a previous stage to the next 
 stage, which doesn't require driver-side processing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2015-03-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343582#comment-14343582
 ] 

Xuefu Zhang commented on SPARK-3621:


For Hive's map join, we create a hash table out of a small table (a HadoopRDD in 
Spark's terms) after some transformations. We want to broadcast the hash table 
(which is written to HDFS) such that each executor will be able to access it to 
do the join. We thought of Spark's broadcast variable for this purpose. 
However, Spark's broadcast variable will ship the data to the driver and then 
broadcast it to every executor. We wanted to avoid this extra trip since the 
hash table is already in HDFS. Thus, we wanted a mechanism to broadcast the 
dataset and make it available (even better if in memory) at each executor, 
without shipping the dataset back to the driver. Referring to this dataset as an 
RDD might have caused the confusion in the first place.

Currently, we work around the problem by calling SparkContext.addFile() at 
the driver and accessing the file using SparkFiles.get() at the executor.




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2015-03-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343655#comment-14343655
 ] 

Xuefu Zhang commented on SPARK-3621:


addFile() can take an HDFS file, in which case no file is shipped from the 
driver. And SparkFiles.get() guarantees that the file is copied only once to 
the local node per executor.




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2015-03-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343646#comment-14343646
 ] 

Sean Owen commented on SPARK-3621:
--

In that case, it really doesn't involve Spark. The remote workers are reading 
from HDFS, and can read the closest source of the data directly. I think that's 
the good news; it's not even Spark-specific and is quite possible to write 
right now. Yes, it should not go through the driver. What is addFile() for, 
though? That would only work to send a file from the driver to workers, no?




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2015-03-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343663#comment-14343663
 ] 

Sean Owen commented on SPARK-3621:
--

Gotcha, yes that makes sense. I think you can go a step further if you wanted 
to and just read the data directly from HDFS into a (singleton) cache on the 
executor, and not even copy it to local disk first. This is not going to be 
much faster if at all; it's already fairly optimal to read directly from HDFS.

If the idea here is making a resource on HDFS available to all executors 
locally, yes that's already available as addFile() + HDFS file.




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2015-03-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342119#comment-14342119
 ] 

Sean Owen commented on SPARK-3621:
--

I'd like to resolve what the use case is here:

Is the request to load an entire RDD into memory on every executor?
If so, what is the use case? The setup implies this is a situation where the 
data is too large to be handled by the driver, but then putting it in memory 
everywhere is expensive.

If the goal is sharing between stages, how is this different from persisting an 
RDD, either on disk or in memory?




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2015-01-25 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291145#comment-14291145
 ] 

Xuefu Zhang commented on SPARK-3621:


I'm not sure I agree that this is not a problem. To broadcast is to make a 
certain dataset available to all nodes in the cluster. Existing broadcast 
functionality is limited to broadcasting data held in the driver, while this 
improvement requests that datasets that already exist in the cluster be 
broadcast to all nodes without shipping them from the cluster to the driver 
and then to all nodes in the cluster again.

Improvement is never a problem if we are not open to it. If for some reason 
this cannot be done, we need to understand the reason.




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2015-01-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291152#comment-14291152
 ] 

Sean Owen commented on SPARK-3621:
--

Hm, what is an example? I think you mean collect an RDD directly to every 
executor in its entirety. That's not an operation today but makes some sense. 

However my first question is, is this really something you need RDDs to do? You 
can already side load whatever you want on executors without involving the 
driver. 

The original description however talks about sharing data between stages. Is 
this not just a matter of persisting an RDD? This also does not involve the 
driver. 




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2014-09-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143009#comment-14143009
 ] 

Sean Owen commented on SPARK-3621:
--

If the data is shipped to the worker node, and the driver is the thing that can 
marshal the data to be sent, how is it different from a Broadcast variable? The 
broadcast can be done efficiently with the torrent-based broadcast, for 
example. 




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2014-09-22 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143231#comment-14143231
 ] 

Xuefu Zhang commented on SPARK-3621:


In my limited understanding, to broadcast a variable made of an RDD, you have 
to call RDD.collect() at the driver, which means data will be transferred to 
the driver. While broadcasting the variable might be very efficient, I'd like 
to avoid shipping data to the driver also.
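
The extra hop can be made concrete with a rough back-of-the-envelope model. This is only a sketch under simplifying assumptions (it counts bytes received at each destination and ignores serialization overhead, compression, and HDFS locality); the function names are hypothetical.

```python
def bytes_moved_collect_then_broadcast(dataset_bytes: int, num_executors: int) -> int:
    """Bytes received when the driver collects the data, then broadcasts it."""
    to_driver = dataset_bytes                     # collect() pulls one full copy
    to_executors = dataset_bytes * num_executors  # each executor receives a copy
    return to_driver + to_executors


def bytes_moved_direct_read(dataset_bytes: int, num_executors: int) -> int:
    """Bytes received when each executor reads the data from storage itself."""
    return dataset_bytes * num_executors


# Example: a 200 MB small table and 50 executors.
size = 200 * 1024 * 1024
extra = bytes_moved_collect_then_broadcast(size, 50) - bytes_moved_direct_read(size, 50)
# Under this model the difference is always one copy of the dataset, which
# must also fit in the driver's memory.
```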




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2014-09-22 Thread bc Wong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143933#comment-14143933
 ] 

bc Wong commented on SPARK-3621:


I think this is for the case of a map-side join where one of the tables is 
small.

[~xuefuz], if the driver is running in the cluster, then RDD.collect() means it 
reads from HDFS and then broadcasts the data to everyone. Right? That seems 
reasonable. I don't see another way to broadcast something. Alternatively, 
it's probably better for each executor to individually read that small HDFS 
file into its memory.




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2014-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142383#comment-14142383
 ] 

Sean Owen commented on SPARK-3621:
--

My understanding is that this is fairly fundamentally not possible in Spark. 
The metadata and machinery necessary to operate on RDDs are with the driver. 
RDDs are not accessible within transformations or actions. I'm interested in 
whether that is in fact true, how much of an issue it really is for 
Hive-on-Spark to use collect + broadcast, and whether these sorts of 
requirements can be met with join, cogroup, etc.




[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2014-09-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142466#comment-14142466
 ] 

Xuefu Zhang commented on SPARK-3621:


I understand that an RDD is a concept existing only in the driver. However, 
accessing the data in a Spark job doesn't have to be in the form of an RDD. An 
iterator over the underlying data is sufficient, as long as the data is already 
shipped to the node when the job starts to run. One way to identify the shipped 
RDD and the iterator afterwards could be a UUID.

Hive on Spark isn't using Spark's transformations to do map-join, or join in 
general. Hive's own implementation is to build hash maps for the small tables 
when the join starts, and then do key lookups while streaming through the big 
table. For this, small table data (which can be a result RDD of another Spark 
job) needs to be shipped to all nodes that do the join.
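
The build-then-probe pattern described above can be sketched in a few lines of Python. This is an illustration of a generic map-side hash join (inner join on a single key), not Hive's actual implementation; `build_hash_table` and `map_join` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple


def build_hash_table(small_rows: Iterable[Tuple[Hashable, str]]) -> Dict[Hashable, List[str]]:
    """Build phase: load the whole small table into memory, grouped by key."""
    table: Dict[Hashable, List[str]] = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)
    return table


def map_join(
    big_rows: Iterable[Tuple[Hashable, str]],
    table: Dict[Hashable, List[str]],
) -> Iterator[Tuple[Hashable, str, str]]:
    """Probe phase: stream the big table and look each key up in the hash table."""
    for key, big_value in big_rows:
        # .get() avoids inserting empty entries for keys that have no match.
        for small_value in table.get(key, ()):
            yield key, big_value, small_value


small = [(1, "a"), (2, "b"), (2, "c")]
big = iter([(2, "x"), (3, "y"), (1, "z")])  # consumed once, as a stream
joined = list(map_join(big, build_hash_table(small)))
# joined == [(2, "x", "b"), (2, "x", "c"), (1, "z", "a")]
```

Only the small table is materialized; the big table is never held in memory, which is why the small table has to be present on every node that runs the probe phase.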
