[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343977#comment-14343977 ]

Xuefu Zhang commented on SPARK-3621:

{quote}
you can go a step further if you wanted to and just read the data directly from HDFS into a (singleton) cache on the executor.
{quote}
Yeah. Sharing a cache within an executor is something we wanted to do, but it seems a little complicated because of concurrency and the lack of a job boundary at the executor.

Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
---
Key: SPARK-3621
URL: https://issues.apache.org/jira/browse/SPARK-3621
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.0.0, 1.1.0
Reporter: Xuefu Zhang

In some cases, such as Hive's way of doing map-side join, it would be beneficial to allow a client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires that all RDD data be collected to the driver and that the variable be shipped to the cluster after being made. It would be more performant if the driver just broadcast the RDDs and used the corresponding data in jobs (such as building hashmaps at executors). Tez has a broadcast edge which can ship data from a previous stage to the next stage, which doesn't require driver-side processing.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343582#comment-14343582 ]

Xuefu Zhang commented on SPARK-3621:

For Hive's map join, we create a hash table out of a small table (a HadoopRDD in Spark's terms) after some transformations. We want to broadcast the hash table (which is written to HDFS) so that each executor can access it to do the join. We thought of Spark's broadcast variable for this purpose. However, a Spark broadcast variable requires shipping the data to the driver, which then broadcasts it to every executor. We wanted to avoid this extra trip since the hash table is already in HDFS. Thus, we wanted a mechanism to broadcast the dataset and make it available (even better if in memory) at each executor, without shipping the dataset back to the driver. Referring to this dataset as an RDD might have caused the confusion in the first place. Currently, we work around the problem by calling SparkContext.addFile() at the driver and accessing the file using SparkFiles.get() at the executor.
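The workaround described above might look roughly like the following PySpark sketch. It is an illustration only, not Hive's actual code: the helper names (load_hash_table, join_partition, run_map_join), the tab-separated key/value file layout, and the arguments hdfs_uri and file_name are all assumptions made for the example. Only SparkContext.addFile(), SparkFiles.get() and RDD.mapPartitions() are real Spark calls, and PySpark is imported lazily so the pure-Python helpers can be tried without a cluster.

```python
import csv
from typing import Dict, Iterable, Iterator, Tuple


def load_hash_table(local_path: str) -> Dict[str, str]:
    """Parse a tab-separated key/value file into an in-memory hash table.

    The file layout is an assumption for this sketch.
    """
    with open(local_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # Skip blank or malformed lines that do not have a key and a value.
        return {row[0]: row[1] for row in reader if len(row) >= 2}


def join_partition(
    rows: Iterable[Tuple[str, str]], local_path: str
) -> Iterator[Tuple[str, str, str]]:
    """Executor side: load the small table once per partition, then look up keys."""
    table = load_hash_table(local_path)
    for key, value in rows:
        if key in table:
            yield key, value, table[key]


def run_map_join(sc, big_rdd, hdfs_uri: str, file_name: str):
    """Driver side: register the file, then resolve its local copy inside each task.

    hdfs_uri and file_name are hypothetical arguments, for example an
    hdfs:// URI and the last path component of that URI.
    """
    from pyspark import SparkFiles  # lazy import; requires PySpark

    sc.addFile(hdfs_uri)

    def per_partition(rows):
        # SparkFiles.get() returns the local path of a file added with addFile().
        return join_partition(rows, SparkFiles.get(file_name))

    return big_rdd.mapPartitions(per_partition)
```

In this sketch the table is parsed once per partition rather than once per executor; keeping it in memory across tasks is a separate concern.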
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343655#comment-14343655 ]

Xuefu Zhang commented on SPARK-3621:

addFile() can take an HDFS file, in which case no file is shipped from the driver. And SparkFiles.get() guarantees that the file is copied to the local node only once per executor.
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343646#comment-14343646 ]

Sean Owen commented on SPARK-3621:

In that case, it really doesn't involve Spark. The remote workers are reading from HDFS, and can read the closest source of the data directly. I think that's the good news; it's not even Spark-specific and is quite possible to write right now. Yes, it should not go through the driver. What is addFile() for, though? That would only work to send a file from the driver to workers, no?
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343663#comment-14343663 ]

Sean Owen commented on SPARK-3621:

Gotcha, yes, that makes sense. I think you can go a step further if you want to and just read the data directly from HDFS into a (singleton) cache on the executor, without even copying it to local disk first. This is not going to be much faster, if at all; it's already fairly optimal to read directly from HDFS. If the idea here is making a resource on HDFS available to all executors locally, yes, that's already available as addFile() + HDFS file.
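One way to picture the singleton cache suggested above is a process-wide map guarded by a lock, so that concurrent tasks in the same executor load a given dataset at most once. The sketch below is plain Python and only illustrates the idea; ExecutorCache, get_or_load and evict are hypothetical names, and the loader callable stands in for whatever code would read the data from HDFS.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class ExecutorCache:
    """Process-wide cache: each key is loaded at most once, even with concurrent callers."""

    _lock = threading.Lock()
    _entries: Dict[Hashable, Any] = {}

    @classmethod
    def get_or_load(cls, key: Hashable, loader: Callable[[], Any]) -> Any:
        # Fast path: the value is already cached, no lock needed to read it.
        try:
            return cls._entries[key]
        except KeyError:
            pass
        # Slow path: take the lock and re-check, so only one caller runs the loader.
        with cls._lock:
            if key not in cls._entries:
                cls._entries[key] = loader()
            return cls._entries[key]

    @classmethod
    def evict(cls, key: Hashable) -> None:
        # Nothing in this sketch knows when a job ends, so eviction is explicit.
        with cls._lock:
            cls._entries.pop(key, None)
```

Holding one lock while loading keeps the sketch short but also blocks loads of unrelated keys; a per-key lock would avoid that at the cost of more bookkeeping.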
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342119#comment-14342119 ]

Sean Owen commented on SPARK-3621:

I'd like to resolve what the use case is here: Is the request to load an entire RDD into memory on every executor? If so, what is the use case? The setup implies this is a situation where the data is too large to be handled by the driver, but then putting it in memory everywhere is expensive. If the goal is sharing between stages, how is this different from persisting an RDD, either on disk or in memory?
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291145#comment-14291145 ]

Xuefu Zhang commented on SPARK-3621:

I'm not sure I agree that this is not a problem. To broadcast is to make a certain dataset available to all nodes in the cluster. Existing broadcast functionality is limited to broadcasting data that is in the driver, while this improvement requests that datasets which already exist in the cluster be broadcast to all nodes without first shipping them from the cluster to the driver and then to all nodes in the cluster again. Improvement is never a problem if we are not open to it. If for some reason this cannot be done, we need to understand the reason.
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291152#comment-14291152 ]

Sean Owen commented on SPARK-3621:

Hm, what is an example? I think you mean collecting an RDD directly to every executor in its entirety. That's not an operation today but makes some sense. However, my first question is: is this really something you need RDDs to do? You can already side-load whatever you want on executors without involving the driver. The original description, however, talks about sharing data between stages. Is this not just a matter of persisting an RDD? This also does not involve the driver.
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143009#comment-14143009 ]

Sean Owen commented on SPARK-3621:

If the data is shipped to the worker node, and the driver is the thing that can marshal the data to be sent, how is it different from a Broadcast variable? The broadcast can be done efficiently with the torrent-based broadcast, for example.
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143231#comment-14143231 ]

Xuefu Zhang commented on SPARK-3621:

In my limited understanding, to broadcast a variable made of an RDD, you have to call RDD.collect() at the driver, which means the data will be transferred to the driver. While broadcasting the variable might be very efficient, I'd also like to avoid shipping data to the driver.
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143933#comment-14143933 ]

bc Wong commented on SPARK-3621:

I think this is for the case of a map-side join where one of the tables is small. [~xuefuz], if the driver is running in the cluster, then RDD.collect() means it reads from HDFS and then broadcasts the data to everyone. Right? That seems reasonable. I don't see another way to broadcast something. Alternatively, it's probably better for each executor to individually read that small HDFS file into its memory.
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142383#comment-14142383 ]

Sean Owen commented on SPARK-3621:

My understanding is that this is fairly fundamentally not possible in Spark. The metadata and machinery necessary to operate on RDDs are with the driver. RDDs are not accessible within transformations or actions. I'm interested in whether that is in fact true, in how much of an issue it really is for Hive-on-Spark to use collect + broadcast, and in whether these sorts of requirements can be met with join, cogroup, etc.
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142466#comment-14142466 ]

Xuefu Zhang commented on SPARK-3621:

I understand that an RDD is a concept existing only in the driver. However, accessing the data in a Spark job doesn't have to be in the form of an RDD. An iterator over the underlying data is sufficient, as long as the data has already been shipped to the node when the job starts to run. One way to identify the shipped RDD, and the iterator afterwards, could be a UUID. Hive on Spark isn't using Spark's transformations to do map-join, or joins in general. Hive's own implementation builds hash maps for the small tables when the join starts, and then does key lookups while streaming through the big table. For this, the small table data (which can be a result RDD of another Spark job) needs to be shipped to all nodes that do the join.
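The build-then-probe pattern described above can be sketched in a few lines of plain Python. This is a generic hash join over iterators, not Hive's implementation; build_hash_table and probe are hypothetical names, and rows are assumed to be simple (key, value) pairs.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Tuple[Hashable, str]


def build_hash_table(small_rows: Iterable[Row]) -> Dict[Hashable, List[str]]:
    """Build phase: read the whole small table into memory, grouped by join key."""
    table: Dict[Hashable, List[str]] = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)
    return table


def probe(
    big_rows: Iterable[Row], table: Dict[Hashable, List[str]]
) -> Iterator[Tuple[Hashable, str, str]]:
    """Probe phase: stream the big table and emit one row per matching small row."""
    for key, big_value in big_rows:
        # .get() avoids inserting empty entries for keys that have no match.
        for small_value in table.get(key, ()):
            yield key, big_value, small_value
```

Only the small table has to fit in memory on each node; the big table is consumed as an iterator and is never materialized.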