[jira] [Commented] (SPARK-4683) Add a beeline.cmd to run on Windows
[ https://issues.apache.org/jira/browse/SPARK-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233995#comment-14233995 ]

Apache Spark commented on SPARK-4683:

User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/3599

Add a beeline.cmd to run on Windows
Key: SPARK-4683
URL: https://issues.apache.org/jira/browse/SPARK-4683
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Matei Zaharia
Assignee: Cheng Lian
Priority: Critical

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao updated SPARK-4740:
Affects Version/s: 1.2.0

Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
Key: SPARK-4740
URL: https://issues.apache.org/jira/browse/SPARK-4740
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Zhang, Liye

When testing the current Spark master (1.3.0-SNAPSHOT) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service takes much longer than the NIO-based one. The network throughput of Netty is only about half that of NIO. We tested in standalone mode; the data set used for the test is 20 billion records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 10G NICs, 48 CPU cores per node, and 64GB of memory per executor. The number of reduce tasks is set to 1000.
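For anyone reproducing the comparison, Spark 1.2 let you switch the shuffle transfer service per run. An illustrative spark-defaults.conf fragment (the property name is recalled from Spark 1.2 and should be treated as an assumption, not something confirmed in this thread):

```properties
# Switch the shuffle transfer service to compare the two implementations
# (property name assumed from Spark 1.2; valid values were "netty" and "nio")
spark.shuffle.blockTransferService   nio
```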
[jira] [Created] (SPARK-4741) Do not destroy and re-create FileInputStream
Liang-Chi Hsieh created SPARK-4741:

Summary: Do not destroy and re-create FileInputStream
Key: SPARK-4741
URL: https://issues.apache.org/jira/browse/SPARK-4741
Project: Spark
Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Priority: Minor

The FileInputStream in DiskMapIterator is destroyed and recreated after each batch read. However, since we can change the reading position on that stream, it is unnecessary and inefficient to destroy and recreate it every time.
[jira] [Commented] (SPARK-4741) Do not destroy and re-create FileInputStream
[ https://issues.apache.org/jira/browse/SPARK-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234016#comment-14234016 ]

Apache Spark commented on SPARK-4741:

User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/3600

Do not destroy and re-create FileInputStream
Key: SPARK-4741
URL: https://issues.apache.org/jira/browse/SPARK-4741
Project: Spark
Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Priority: Minor

The FileInputStream in DiskMapIterator is destroyed and recreated after each batch read. However, since we can change the reading position on that stream, it is unnecessary and inefficient to destroy and recreate it every time.
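The idea behind SPARK-4741 can be sketched in plain Python (the real code is Scala's DiskMapIterator): keep one stream open for all batches and reposition it with seek(), instead of closing and reopening the file per batch.

```python
import os
import tempfile

def read_batches(path, offsets, batch_size):
    """Read fixed-size batches from the given byte offsets,
    reusing a single open file handle for every batch."""
    results = []
    with open(path, "rb") as f:        # one handle for all batches
        for off in offsets:
            f.seek(off)                # reposition instead of reopening
            results.append(f.read(batch_size))
    return results
```

The `read_batches` helper and its parameters are illustrative, not Spark API; the point is that a seekable stream makes destroy-and-recreate unnecessary.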
[jira] [Resolved] (SPARK-4719) Consolidate various narrow dep RDD classes with MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-4719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-4719.
Resolution: Fixed
Fix Version/s: 1.3.0

Consolidate various narrow dep RDD classes with MapPartitionsRDD
Key: SPARK-4719
URL: https://issues.apache.org/jira/browse/SPARK-4719
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.0
Reporter: Reynold Xin
Assignee: Reynold Xin
Fix For: 1.3.0

Seems like we don't really need MappedRDD, MappedValuesRDD, FlatMappedValuesRDD, FilteredRDD, GlommedRDD. They can all be implemented directly using MapPartitionsRDD.
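Why the specialized classes are redundant can be shown with a plain-Python sketch (Spark's actual code is Scala): map, filter, and glom can all be phrased as one per-partition transformation, so a single MapPartitionsRDD-style primitive suffices.

```python
def map_partitions(partitions, f):
    """Apply f to the iterator of each partition; the one primitive
    the specialized RDD classes can all be built on."""
    return [list(f(iter(p))) for p in partitions]

def mapped(partitions, g):              # what MappedRDD did
    return map_partitions(partitions, lambda it: (g(x) for x in it))

def filtered(partitions, pred):         # what FilteredRDD did
    return map_partitions(partitions, lambda it: (x for x in it if pred(x)))

def glommed(partitions):                # what GlommedRDD did
    return map_partitions(partitions, lambda it: iter([list(it)]))
```

The helper names are hypothetical; they mirror the consolidated RDD operators only in spirit.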
[jira] [Resolved] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-4685.
Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3598 [https://github.com/apache/spark/pull/3598]

Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
Key: SPARK-4685
URL: https://issues.apache.org/jira/browse/SPARK-4685
Project: Spark
Issue Type: New Feature
Components: Documentation
Reporter: Matei Zaharia
Priority: Trivial
Fix For: 1.2.0

Right now they're listed under other packages on the homepage of the JavaDoc docs.
[jira] [Updated] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
[ https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-4685:
Assignee: Kai Sasaki

Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
Key: SPARK-4685
URL: https://issues.apache.org/jira/browse/SPARK-4685
Project: Spark
Issue Type: New Feature
Components: Documentation
Reporter: Matei Zaharia
Assignee: Kai Sasaki
Priority: Trivial
Fix For: 1.2.0

Right now they're listed under other packages on the homepage of the JavaDoc docs.
[jira] [Resolved] (SPARK-4575) Documentation for the pipeline features
[ https://issues.apache.org/jira/browse/SPARK-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-4575.
Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3588 [https://github.com/apache/spark/pull/3588]

Documentation for the pipeline features
Key: SPARK-4575
URL: https://issues.apache.org/jira/browse/SPARK-4575
Project: Spark
Issue Type: Improvement
Components: Documentation, ML, MLlib
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
Fix For: 1.2.0

Add a user guide for the newly added ML pipeline feature.
[jira] [Created] (SPARK-4742) The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
Sasaki Toru created SPARK-4742:

Summary: The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
Key: SPARK-4742
URL: https://issues.apache.org/jira/browse/SPARK-4742
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0
Reporter: Sasaki Toru
Priority: Minor

When I write a Parquet file as an output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded, while RDD#saveAsText does zero padding.
[jira] [Commented] (SPARK-4742) The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
[ https://issues.apache.org/jira/browse/SPARK-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234081#comment-14234081 ]

Apache Spark commented on SPARK-4742:

User 'sasakitoa' has created a pull request for this issue: https://github.com/apache/spark/pull/3602

The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
Key: SPARK-4742
URL: https://issues.apache.org/jira/browse/SPARK-4742
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.2.0
Reporter: Sasaki Toru
Priority: Minor

When I write a Parquet file as an output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded, while RDD#saveAsText does zero padding.
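The motivation for zero padding can be illustrated with a small sketch (the helper is hypothetical; a width of 5 matches the familiar "part-00000" style of saveAsTextFile output): unpadded names sort badly, since "part-10" precedes "part-2" lexicographically.

```python
def part_file_name(split_id, padded=True):
    """Build a part-file name for a partition index,
    zero padded to width 5 unless padded=False."""
    return "part-%05d" % split_id if padded else "part-%d" % split_id
```

With padding, lexicographic and numeric order agree; without it, they diverge as soon as indices reach two digits.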
[jira] [Commented] (SPARK-4494) IDFModel.transform() add support for single vector
[ https://issues.apache.org/jira/browse/SPARK-4494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234091#comment-14234091 ]

Apache Spark commented on SPARK-4494:

User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/3603

IDFModel.transform() add support for single vector
Key: SPARK-4494
URL: https://issues.apache.org/jira/browse/SPARK-4494
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.1.1, 1.2.0
Reporter: Jean-Philippe Quemener
Priority: Minor

For now, when using the tf-idf implementation of MLlib, there is no way to map your data back onto e.g. labels or ids other than a hackish workaround with zipping:
{quote}
1. Persist input RDD.
2. Transform it to just vectors and apply IDFModel
3. zip with original RDD
4. transform label and new vector to LabeledPoint
{quote}
Source: [http://stackoverflow.com/questions/26897908/spark-mllib-tfidf-implementation-for-logisticregression]
I think, as in production a lot of users want to map their data back to some identifier, it would be a good improvement to allow using a single vector with IDFModel.transform().
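The zip workaround quoted above can be sketched in plain Python (the `transform` argument stands in for IDFModel.transform on an RDD of vectors; the helper name is hypothetical): strip the labels, transform the vectors in bulk, then zip the labels back on.

```python
def transform_with_labels(labeled, transform):
    """labeled: list of (label, vector) pairs.
    transform: bulk transformation over a list of vectors."""
    labels = [l for (l, _) in labeled]        # remember the labels/ids
    vectors = [v for (_, v) in labeled]       # step 2: vectors only
    return list(zip(labels, transform(vectors)))  # step 3: zip back
```

Accepting a single vector in IDFModel.transform() would make this whole dance unnecessary.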
[jira] [Commented] (SPARK-4726) NotSerializableException thrown on SystemDefaultHttpClient with stack not related to my functions
[ https://issues.apache.org/jira/browse/SPARK-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234142#comment-14234142 ]

Sean Owen commented on SPARK-4726:

You can use it, you just can't serialize these objects from the driver to the workers. You'll want to examine your code to see if you're accidentally creating a connection or client on the driver but then using it inside functions that are sent to the workers.

NotSerializableException thrown on SystemDefaultHttpClient with stack not related to my functions
Key: SPARK-4726
URL: https://issues.apache.org/jira/browse/SPARK-4726
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.0.1
Reporter: Dmitriy Makarenko

I get this stack trace that doesn't contain any of my functions:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.http.impl.client.SystemDefaultHttpClient
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:771)
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:714)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:698)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1198)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

As far as I know, SystemDefaultHttpClient is used inside the SolrJ library that I use, but it is in a separate jar from my project. All of my classes are Serializable.
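The pattern Sean Owen suggests can be sketched in plain Python (FakeClient is a stand-in for the non-serializable SystemDefaultHttpClient; the function names are hypothetical): construct the client inside the per-partition function that runs on the worker, never in driver scope, so nothing non-serializable is captured by the closure.

```python
class FakeClient:
    """Stand-in for a non-serializable HTTP client."""
    def get(self, url):
        return "ok:" + url

def process_partition(urls):
    # The client is created where it is used (on the worker),
    # so it is never shipped inside a serialized closure.
    client = FakeClient()
    return [client.get(u) for u in urls]
```

In real Spark code the equivalent is creating the client inside mapPartitions rather than referencing a driver-side field.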
[jira] [Commented] (SPARK-4734) [Streaming]limit the file Dstream size for each batch
[ https://issues.apache.org/jira/browse/SPARK-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234146#comment-14234146 ]

Sean Owen commented on SPARK-4734:

I don't quite understand this suggestion. In general, if processing time exceeds the batch duration, you simply need a longer batch duration or need to speed up your processing. Lots of small files are a problem in general for shuffle -- although less so for a sort-based shuffle. The basic solution there is: don't design your system to put lots of tiny files on HDFS. Are you suggesting capping the amount of data in each batch? This does not solve either problem. Either you are just running more, smaller batches, or you are dropping data. In any event this amounts to a significant change in semantics. This doesn't sound likely.

[Streaming]limit the file Dstream size for each batch
Key: SPARK-4734
URL: https://issues.apache.org/jira/browse/SPARK-4734
Project: Spark
Issue Type: New Feature
Components: Streaming
Reporter: 宿荣全
Priority: Minor

Streaming scans new files from HDFS and processes those files in each batch. The current streaming implementation has some problems:
1. When the number of new files (and their total size) in some batch is very large, the required processing time becomes very long and may exceed the slide duration, eventually delaying the dispatch of the next batch.
2. When the total size of the file DStream in one batch is very large, shuffling that data multiplies memory occupation, and the app becomes slow or is even terminated by the operating system.
So if we set an upper limit on the input data for each batch to control the batch processing time, the job dispatch delay and the processing delay will be alleviated.
Modification: add a new parameter spark.streaming.segmentSizeThreshold in InputDStream (the input data base class). The size of each batch segment is set via this parameter, either from [spark-defaults.conf] or in source code. All implementations of InputDStream take the corresponding action based on segmentSizeThreshold. This patch modifies FileInputDStream: when new files are found, their names and sizes are put into a queue, and elements are taken from it and packaged into a batch whose total size is at most segmentSizeThreshold. Please see the source for the detailed logic.
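The proposed queue-and-threshold behavior can be sketched in plain Python (semantics inferred from the description above; the helper name and exact tie-breaking are assumptions, not the patch itself): drain queued (name, size) entries into one batch until the threshold would be exceeded, leaving the rest for later batches.

```python
from collections import deque

def take_batch(file_queue, threshold):
    """Pop (name, size) entries from the front of the queue into one
    batch while the accumulated size stays within the threshold."""
    batch, total = [], 0
    while file_queue and total + file_queue[0][1] <= threshold:
        name, size = file_queue.popleft()
        batch.append(name)
        total += size
    return batch
```

Note one sharp edge of this semantics, which Sean Owen's objection touches on: a single file larger than the threshold would never be admitted, so deferred files can pile up instead of the problem going away.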
[jira] [Commented] (SPARK-4735) Spark SQL UDF doesn't support 0 arguments.
[ https://issues.apache.org/jira/browse/SPARK-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234153#comment-14234153 ]

Apache Spark commented on SPARK-4735:

User 'potix2' has created a pull request for this issue: https://github.com/apache/spark/pull/3604

Spark SQL UDF doesn't support 0 arguments.
Key: SPARK-4735
URL: https://issues.apache.org/jira/browse/SPARK-4735
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Cheng Hao
Priority: Minor

To reproduce:

val udf = () => Seq(1, 2, 3)
sqlCtx.registerFunction("myudf", udf)
sqlCtx.sql("select myudf() from tbl limit 1").collect.foreach(println)
[jira] [Created] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
Ivan Vergiliev created SPARK-4743:

Summary: Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
Key: SPARK-4743
URL: https://issues.apache.org/jira/browse/SPARK-4743
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Ivan Vergiliev

AggregateByKey and foldByKey in PairRDDFunctions both use the closure serializer to serialize and deserialize the initial value. This means that the Java serializer is always used, which can be very expensive if there's a large number of groups. Calling combineByKey manually and using the normal serializer instead of the closure one improved the performance on the dataset I'm testing with by about 30-35%.
I'm not familiar enough with the codebase to be certain that replacing the serializer here is OK, but it works correctly in my tests, and it's only serializing a single value of type U, which should be serializable by the default one since it can be the output of a job. Let me know if I'm missing anything.
[jira] [Commented] (SPARK-4743) Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
[ https://issues.apache.org/jira/browse/SPARK-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234165#comment-14234165 ]

Apache Spark commented on SPARK-4743:

User 'IvanVergiliev' has created a pull request for this issue: https://github.com/apache/spark/pull/3605

Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey
Key: SPARK-4743
URL: https://issues.apache.org/jira/browse/SPARK-4743
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Ivan Vergiliev
Labels: performance

AggregateByKey and foldByKey in PairRDDFunctions both use the closure serializer to serialize and deserialize the initial value. This means that the Java serializer is always used, which can be very expensive if there's a large number of groups. Calling combineByKey manually and using the normal serializer instead of the closure one improved the performance on the dataset I'm testing with by about 30-35%.
I'm not familiar enough with the codebase to be certain that replacing the serializer here is OK, but it works correctly in my tests, and it's only serializing a single value of type U, which should be serializable by the default one since it can be the output of a job. Let me know if I'm missing anything.
[jira] [Created] (SPARK-4744) Short Circuit evaluation for AND OR in code gen
Cheng Hao created SPARK-4744:

Summary: Short Circuit evaluation for AND OR in code gen
Key: SPARK-4744
URL: https://issues.apache.org/jira/browse/SPARK-4744
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Priority: Minor
[jira] [Commented] (SPARK-4744) Short Circuit evaluation for AND OR in code gen
[ https://issues.apache.org/jira/browse/SPARK-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234182#comment-14234182 ]

Apache Spark commented on SPARK-4744:

User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/3606

Short Circuit evaluation for AND OR in code gen
Key: SPARK-4744
URL: https://issues.apache.org/jira/browse/SPARK-4744
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Cheng Hao
Priority: Minor
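The goal of short-circuit evaluation can be illustrated in plain Python (the generated code this issue concerns would use Java's `&&`/`||`; this helper is only an analogy): for `a AND b`, evaluate b only when a is true.

```python
def short_circuit_and(eval_left, eval_right):
    """Evaluate the right operand only if the left is true,
    mimicking `&&` in generated code."""
    return eval_right() if eval_left() else False
```

The saving matters in generated expression code because the skipped operand may itself be an expensive subexpression.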
[jira] [Commented] (SPARK-2188) Support sbt/sbt for Windows
[ https://issues.apache.org/jira/browse/SPARK-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234209#comment-14234209 ]

Masayoshi TSUZUKI commented on SPARK-2188:

We have some bugs reported on JIRA about Windows. When we struggle with them, try to reproduce them, or fix them, we need build tools for Windows. Indeed we already have Maven, but sbt is much better for trial-and-error development, as you know.

Support sbt/sbt for Windows
Key: SPARK-2188
URL: https://issues.apache.org/jira/browse/SPARK-2188
Project: Spark
Issue Type: New Feature
Components: Build
Affects Versions: 1.0.0
Reporter: Pat McDonough

Add the equivalent of sbt/sbt for Windows users.
[jira] [Commented] (SPARK-1953) yarn client mode Application Master memory size is same as driver memory size
[ https://issues.apache.org/jira/browse/SPARK-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234212#comment-14234212 ]

Apache Spark commented on SPARK-1953:

User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/3607

yarn client mode Application Master memory size is same as driver memory size
Key: SPARK-1953
URL: https://issues.apache.org/jira/browse/SPARK-1953
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves

With Spark on YARN in client mode, the application master that gets created to allocate containers gets the same amount of memory as the driver running on the client (the --driver-memory option through spark-submit). This could definitely be more than what is really needed, thus wasting resources. The application master should be very small and require very little memory, since all it's doing is allocating and starting containers. We should allow the memory for the application master to be configured separately from the driver in client mode. We probably need to be careful about how we do this so as not to cause confusion about what the options do in the various modes.
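The separation proposed here later materialized as a dedicated AM setting. An illustrative spark-defaults.conf fragment (the spark.yarn.am.memory name is the eventual Spark option as recalled; relative to this thread it should be treated as an assumption):

```properties
# yarn-client mode: a small, fixed AM allocation, decoupled from the driver
spark.yarn.am.memory   512m
spark.driver.memory    8g
```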
[jira] [Updated] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhang, Liye updated SPARK-4740:
Attachment: Spark-perf Test Report.pdf

Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
Key: SPARK-4740
URL: https://issues.apache.org/jira/browse/SPARK-4740
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Zhang, Liye
Attachments: Spark-perf Test Report.pdf

When testing the current Spark master (1.3.0-SNAPSHOT) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty-based shuffle transfer service takes much longer than the NIO-based one. The network throughput of Netty is only about half that of NIO. We tested in standalone mode; the data set used for the test is 20 billion records, about 400GB in total. The spark-perf test runs on a 4-node cluster with 10G NICs, 48 CPU cores per node, and 64GB of memory per executor. The number of reduce tasks is set to 1000.
[jira] [Created] (SPARK-4745) get_existing_cluster() doesn't work with additional security groups
Alex DeBrie created SPARK-4745:

Summary: get_existing_cluster() doesn't work with additional security groups
Key: SPARK-4745
URL: https://issues.apache.org/jira/browse/SPARK-4745
Project: Spark
Issue Type: Bug
Components: EC2
Affects Versions: 1.1.0
Reporter: Alex DeBrie

The spark-ec2 script has a flag that allows you to add additional security groups to clusters when you launch. However, the get_existing_cluster() function cycles through active instances and only returns instances whose group_names == cluster_name + "-master" (or + "-slaves"), which is the group created by default. The get_existing_cluster() function is used to login to, stop, and destroy existing clusters, among other actions.
This is a pretty simple fix for which I've already submitted a [pull request|https://github.com/apache/spark/pull/3596]. It checks if cluster_name + "-master" is in the list of groups for each active instance. This means the cluster group can be one among many groups, rather than the sole group for an instance.
[jira] [Commented] (SPARK-4745) get_existing_cluster() doesn't work with additional security groups
[ https://issues.apache.org/jira/browse/SPARK-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234240#comment-14234240 ]

Apache Spark commented on SPARK-4745:

User 'alexdebrie' has created a pull request for this issue: https://github.com/apache/spark/pull/3596

get_existing_cluster() doesn't work with additional security groups
Key: SPARK-4745
URL: https://issues.apache.org/jira/browse/SPARK-4745
Project: Spark
Issue Type: Bug
Components: EC2
Affects Versions: 1.1.0
Reporter: Alex DeBrie

The spark-ec2 script has a flag that allows you to add additional security groups to clusters when you launch. However, the get_existing_cluster() function cycles through active instances and only returns instances whose group_names == cluster_name + "-master" (or + "-slaves"), which is the group created by default. The get_existing_cluster() function is used to login to, stop, and destroy existing clusters, among other actions.
This is a pretty simple fix for which I've already submitted a [pull request|https://github.com/apache/spark/pull/3596]. It checks if cluster_name + "-master" is in the list of groups for each active instance. This means the cluster group can be one among many groups, rather than the sole group for an instance.
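The described fix can be sketched in plain Python (the real change lives in the boto-based spark-ec2 script; this helper name is hypothetical): test membership of the cluster group in the instance's group list, instead of requiring it to be the instance's only group.

```python
def belongs_to_cluster(instance_group_names, cluster_name):
    """Return True if the instance carries the cluster's master or
    slave security group, regardless of any additional groups."""
    return (cluster_name + "-master" in instance_group_names
            or cluster_name + "-slaves" in instance_group_names)
```

Under the old equality check, an instance with any extra security group attached would silently stop matching its own cluster.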
[jira] [Commented] (SPARK-4727) Add dimensional RDDs (time series, spatial)
[ https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234269#comment-14234269 ]

Jeremy Freeman commented on SPARK-4727:

Great to brainstorm about this RJ! To some extent, we've been doing this over on the [Thunder|http://thefreemanlab.com/thunder/docs/] project. In particular, check out the {{TimeSeries}} and {{Images}} classes [here|https://github.com/freeman-lab/thunder/tree/master/python/thunder/rdds], which are essentially wrappers for specialized RDDs. Our basic abstraction is RDDs of ndarrays (1D for time series, 2D or 3D for images/volumes), with metadata (lazily propagated) for things like dimensionality and time base, coordinates embedded in keys, and useful methods on these objects like the ones you mention (e.g. filtering, Fourier transforms, cross-correlation). We've also worked on transformations between representations, for the common case of sequences of images corresponding to different time points. We haven't worked on custom partition strategies yet; I think that will be most important for image tiles drawn from a much larger image. There's cool work ongoing for that in GeoTrellis, see the [repo|https://github.com/geotrellis/geotrellis/tree/master/spark/src/main] and a [talk|http://spark-summit.org/2014/talk/geotrellis-adding-geospatial-capabilities-to-spark] from Rob.
FWIW, when we started it seemed more appropriate to build this into a specialized library, rather than Spark core. It's also something that benefits from using Python, due to a bevy of existing libraries for temporal and image data (though there are certainly analogs in Java/Scala). But it would be great to probe the community for general interest in these kinds of abstractions and methods.
Add dimensional RDDs (time series, spatial)
Key: SPARK-4727
URL: https://issues.apache.org/jira/browse/SPARK-4727
Project: Spark
Issue Type: Brainstorming
Components: Spark Core
Affects Versions: 1.1.0
Reporter: RJ Nowling

Certain types of data (time series, spatial) can benefit from specialized RDDs. I'd like to open a discussion about this.
For example, time series data should be ordered by time and would benefit from operations like:
* Subsampling (taking every n data points)
* Signal processing (correlations, FFTs, filtering)
* Windowing functions
Spatial data benefits from ordering and partitioning along a 2D or 3D grid. For example, path-finding algorithms can be optimized by only comparing points within a set distance, which can be computed more efficiently by partitioning data into a grid.
Although the operations on time series and spatial data may be different, there is some commonality in the sense of the data having ordered dimensions, and the implementations may overlap.
[jira] [Commented] (SPARK-1010) Update all unit tests to use SparkConf instead of system properties
[ https://issues.apache.org/jira/browse/SPARK-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234272#comment-14234272 ]

liu chang commented on SPARK-1010:

Please assign to me, I will fix it.

Update all unit tests to use SparkConf instead of system properties
Key: SPARK-1010
URL: https://issues.apache.org/jira/browse/SPARK-1010
Project: Spark
Issue Type: New Feature
Affects Versions: 0.9.0
Reporter: Patrick Wendell
Assignee: Nirmal
Priority: Minor
Labels: starter
[jira] [Commented] (SPARK-4181) Create separate options to control the client-mode AM resource allocation request
[ https://issues.apache.org/jira/browse/SPARK-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234289#comment-14234289 ] Thomas Graves commented on SPARK-4181: -- What exactly is the change you are proposing here? You reference other jiras that all have specific things to fix. Is this above and beyond those? Create separate options to control the client-mode AM resource allocation request - Key: SPARK-4181 URL: https://issues.apache.org/jira/browse/SPARK-4181 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor I found related discussion in https://github.com/apache/spark/pull/2115, SPARK-1953 and SPARK-1507. And recently I found some inconvenience in configuring properties like logging while we use yarn-client mode. So if no one else does the work, I will try it. Maybe I'll start in a few days and complete it within the next 1 or 2 weeks. As I'm not very familiar with Spark on YARN, any discussion and feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4181) Create separate options to control the client-mode AM resource allocation request
[ https://issues.apache.org/jira/browse/SPARK-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234305#comment-14234305 ] WangTaoTheTonic commented on SPARK-4181: Maybe I didn't describe it exactly here. What I want to do is pass extraJavaOptions and extraLibraryPath to the AM in yarn-client mode. Create separate options to control the client-mode AM resource allocation request - Key: SPARK-4181 URL: https://issues.apache.org/jira/browse/SPARK-4181 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor I found related discussion in https://github.com/apache/spark/pull/2115, SPARK-1953 and SPARK-1507. And recently I found some inconvenience in configuring properties like logging while we use yarn-client mode. So if no one else does the work, I will try it. Maybe I'll start in a few days and complete it within the next 1 or 2 weeks. As I'm not very familiar with Spark on YARN, any discussion and feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234307#comment-14234307 ] Yana Kadiyska commented on SPARK-4702: -- Just confirming that https://github.com/apache/spark/pull/3586 does fix the issue. Thanks! Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4181) Create separate options to control the client-mode AM resource allocation request
[ https://issues.apache.org/jira/browse/SPARK-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234333#comment-14234333 ] Thomas Graves commented on SPARK-4181: -- OK. As you discovered, extraJavaOptions (and possibly the others) is being discussed in https://github.com/apache/spark/pull/3409. What is your use case for extraLibraryPath? I couldn't think of one. Create separate options to control the client-mode AM resource allocation request - Key: SPARK-4181 URL: https://issues.apache.org/jira/browse/SPARK-4181 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor I found related discussion in https://github.com/apache/spark/pull/2115, SPARK-1953 and SPARK-1507. And recently I found some inconvenience in configuring properties like logging while we use yarn-client mode. So if no one else does the work, I will try it. Maybe I'll start in a few days and complete it within the next 1 or 2 weeks. As I'm not very familiar with Spark on YARN, any discussion and feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4298) The spark-submit cannot read Main-Class from Manifest.
[ https://issues.apache.org/jira/browse/SPARK-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234346#comment-14234346 ] Brennon York commented on SPARK-4298: - [~pwendell] could you take a look at this? This is an annoying issue our developers continue to run into, and we would like to see a fix pushed into the next release. Thanks! The spark-submit cannot read Main-Class from Manifest. -- Key: SPARK-4298 URL: https://issues.apache.org/jira/browse/SPARK-4298 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Linux spark-1.1.0-bin-hadoop2.4.tgz java version 1.7.0_72 Java(TM) SE Runtime Environment (build 1.7.0_72-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode) Reporter: Milan Straka Consider trivial {{test.scala}}: {code:title=test.scala|borderStyle=solid} import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object Main { def main(args: Array[String]) { val sc = new SparkContext() sc.stop() } } {code} When built with {{sbt}} and executed using {{spark-submit target/scala-2.10/test_2.10-1.0.jar}}, I get the following error: {code} Spark assembly has been built with Hive, including Datanucleus jars on classpath Error: Cannot load main class from JAR: file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar Run with --help for usage help or --verbose for debug output {code} When executed using {{spark-submit --class Main target/scala-2.10/test_2.10-1.0.jar}}, it works. 
The jar file has a correct MANIFEST.MF: {code:title=MANIFEST.MF|borderStyle=solid} Manifest-Version: 1.0 Implementation-Vendor: test Implementation-Title: test Implementation-Version: 1.0 Implementation-Vendor-Id: test Specification-Vendor: test Specification-Title: test Specification-Version: 1.0 Main-Class: Main {code} The problem is that in {{org.apache.spark.deploy.SparkSubmitArguments}}, line 127: {code} val jar = new JarFile(primaryResource) {code} the primaryResource has the String value {{file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar}}, which is a URI, but JarFile accepts only a filesystem path. One way to fix this would be: {code} val uri = new URI(primaryResource) val jar = new JarFile(uri.getPath) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
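The root cause is the URI-vs-path distinction: a {{file:}} URI string is not a filesystem path, so a path-taking API rejects it. The Scala fix quoted in the report strips the scheme via {{java.net.URI}}; a quick sketch of the same distinction in Python (purely illustrative, not Spark code):

```python
# A "file:" URI is not a filesystem path; parsing the URI and taking its
# path component recovers something a file API can open. Illustrative only.
from urllib.parse import urlparse

uri = "file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar"
path = urlparse(uri).path
print(path)  # /ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar
```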
[jira] [Commented] (SPARK-4616) SPARK_CONF_DIR is not effective in spark-submit
[ https://issues.apache.org/jira/browse/SPARK-4616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234347#comment-14234347 ] Brennon York commented on SPARK-4616: - [~pwendell] could you review this? Since this addresses a larger problem, I was hoping to get some feedback on this commit. Thanks! SPARK_CONF_DIR is not effective in spark-submit --- Key: SPARK-4616 URL: https://issues.apache.org/jira/browse/SPARK-4616 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: leo.luan SPARK_CONF_DIR is not effective in spark-submit, because of this line in spark-submit: DEFAULT_PROPERTIES_FILE=$SPARK_HOME/conf/spark-defaults.conf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
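The quoted line hard-codes the defaults file under {{$SPARK_HOME/conf}}, so an exported {{SPARK_CONF_DIR}} is silently ignored. The real fix belongs in the spark-submit shell script itself; this small Python model just illustrates the intended precedence (assumed behavior: prefer {{SPARK_CONF_DIR}} when set, fall back to {{$SPARK_HOME/conf}}):

```python
# Model of the intended lookup order for spark-defaults.conf:
# SPARK_CONF_DIR takes precedence, $SPARK_HOME/conf is the fallback.
import os

def default_properties_file(env):
    conf_dir = env.get("SPARK_CONF_DIR") or os.path.join(env["SPARK_HOME"], "conf")
    return os.path.join(conf_dir, "spark-defaults.conf")

print(default_properties_file({"SPARK_HOME": "/opt/spark"}))
# /opt/spark/conf/spark-defaults.conf
print(default_properties_file({"SPARK_HOME": "/opt/spark",
                               "SPARK_CONF_DIR": "/etc/spark/conf"}))
# /etc/spark/conf/spark-defaults.conf
```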
[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234383#comment-14234383 ] Thiago Souza commented on SPARK-546: What about #2? Did you file a new ticket? I'm quite interested in this! Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin Assignee: Aaron Staple Fix For: 1.2.0 RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4746) integration tests should be separated from faster unit tests
Imran Rashid created SPARK-4746: --- Summary: integration tests should be separated from faster unit tests Key: SPARK-4746 URL: https://issues.apache.org/jira/browse/SPARK-4746 Project: Spark Issue Type: Bug Reporter: Imran Rashid Priority: Trivial Currently there isn't a good way for a developer to skip the longer integration tests. This can slow down local development. See http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html One option is to use scalatest's notion of test tags to tag all integration tests, so they could easily be skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
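The tagging idea above is what ScalaTest's test tags provide: mark slow tests, then exclude that tag for fast local runs. A minimal language-neutral model of the mechanism, in Python (the names here are hypothetical, not the ScalaTest API):

```python
# Minimal model of tag-based test selection: run every test whose tag set
# is disjoint from the excluded tags. Names are hypothetical.
def run(tests, exclude=()):
    results = []
    for name, tags, fn in tests:
        if not set(tags) & set(exclude):
            results.append((name, fn()))
    return results

tests = [
    ("fast_unit",  [],              lambda: "ok"),
    ("slow_integ", ["integration"], lambda: "ok"),
]

# A developer's fast local run skips everything tagged "integration":
print(run(tests, exclude=["integration"]))  # [('fast_unit', 'ok')]
```

CI would simply call the runner with no exclusions so the integration tests still execute somewhere.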
[jira] [Commented] (SPARK-4727) Add dimensional RDDs (time series, spatial)
[ https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234399#comment-14234399 ] RJ Nowling commented on SPARK-4727: --- Thanks, Jeremy! Your work may cover my needs, and if not, it seems like a great place to contribute to! Was there some talk about encouraging people to build Spark libraries and putting together a community list? I'd love to see this sort of work advertised more. Add dimensional RDDs (time series, spatial) - Key: SPARK-4727 URL: https://issues.apache.org/jira/browse/SPARK-4727 Project: Spark Issue Type: Brainstorming Components: Spark Core Affects Versions: 1.1.0 Reporter: RJ Nowling Certain types of data (time series, spatial) can benefit from specialized RDDs. I'd like to open a discussion about this. For example, time series data should be ordered by time and would benefit from operations like: * Subsampling (taking every n data points) * Signal processing (correlations, FFTs, filtering) * Windowing functions Spatial data benefits from ordering and partitioning along a 2D or 3D grid. For example, path finding algorithms can be optimized by only comparing points within a set distance, which can be computed more efficiently by partitioning data into a grid. Although the operations on time series and spatial data may be different, there is some commonality in the sense of the data having ordered dimensions and the implementations may overlap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234404#comment-14234404 ] Reynold Xin commented on SPARK-546: --- Actually my experience implementing full join in a single shuffle is that it is fairly complicated and very hard to maintain. Since it is doable entirely in user code and given SparkSQL's SchemaRDD already supports it, I suggest not pulling this into Spark core. Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin Assignee: Aaron Staple Fix For: 1.2.0 RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234422#comment-14234422 ] Ryan Williams commented on SPARK-4747: -- [~vanzin] let me know what package you think it should go to and I'll make the change, if you like. Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
Ryan Williams created SPARK-4747: Summary: Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4739) spark.files.userClassPathFirst does not work in local[*] mode
[ https://issues.apache.org/jira/browse/SPARK-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234433#comment-14234433 ] Marcelo Vanzin commented on SPARK-4739: --- BTW my fix for SPARK-2996 (https://github.com/apache/spark/pull/3233) should also fix this. spark.files.userClassPathFirst does not work in local[*] mode - Key: SPARK-4739 URL: https://issues.apache.org/jira/browse/SPARK-4739 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Reporter: Tobias Pfeiffer The parameter spark.files.userClassPathFirst=true does not work when using spark-submit with \-\-master local\[3\]. In particular, even though my application jar file contains netty-3.9.4.Final, the older version from the spark-assembly jar file is loaded (cf. SPARK-4738). When using the same jars with \-\-master yarn-cluster and spark.yarn.user.classpath.first=true (cf. SPARK-2996), it works correctly and my bundled classes are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234434#comment-14234434 ] Marcelo Vanzin commented on SPARK-4747: --- I don't really have a recommendation aside from not the UI package. Maybe the package where all the other job tracking types are declared. Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4683) Add a beeline.cmd to run on Windows
[ https://issues.apache.org/jira/browse/SPARK-4683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4683. Resolution: Fixed Fix Version/s: 1.2.0 Add a beeline.cmd to run on Windows --- Key: SPARK-4683 URL: https://issues.apache.org/jira/browse/SPARK-4683 Project: Spark Issue Type: New Feature Components: SQL Reporter: Matei Zaharia Assignee: Cheng Lian Priority: Critical Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234445#comment-14234445 ] Patrick Wendell commented on SPARK-4747: Because this is an exposed API I'd prefer not to move it - I know many applications that build on this and it would break their code. It is slightly nicer to not nest it under the ui package but IMO it's not worth breaking user applications for this minor clean-up. Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4737: - Affects Version/s: 1.2.0 Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate - i.e. an RDD can have one partition that serializes cleanly but not others. To do this in the proper way, we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
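The failure mode described above, where one partition serializes cleanly while another does not, is easy to reproduce with any serializer. A hedged Python sketch using pickle (this only models the check; Spark's actual serializers and scheduler code differ):

```python
# One "partition" can serialize while another cannot, so probing only the
# first partition misses the failure. Illustrative; not Spark's serializer.
import pickle

partitions = [
    [1, 2, 3],            # serializes fine
    [1, 2, lambda x: x],  # lambdas are not picklable
]

def first_bad_partition(parts):
    """Return the index of the first unserializable partition, or None.
    Catching the error here lets the caller fail the job cleanly instead
    of crashing the scheduler."""
    for i, part in enumerate(parts):
        try:
            pickle.dumps(part)
        except Exception:
            return i
    return None

print(first_bad_partition(partitions))  # 1
```

Probing a single partition (as the DAGScheduler did) corresponds to calling `pickle.dumps(partitions[0])`, which succeeds here even though partition 1 would fail later.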
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234448#comment-14234448 ] Marcelo Vanzin commented on SPARK-4747: --- Ah. It's a @DeveloperApi... that makes it trickier to move around. :-/ Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234456#comment-14234456 ] Ryan Williams commented on SPARK-4747: -- OK, feel free to wontfix this then Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1
[ https://issues.apache.org/jira/browse/SPARK-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234469#comment-14234469 ] Michael Armbrust commented on SPARK-4702: - It did in my testing. Please let us know if you are still having problems. To answer your question above, heterogeneous schemas are not officially supported in either mode. Depending on which file gets picked up when convertMetastoreParquet=true, it may or may not work (assuming you are only adding columns). See [SPARK-3851] for more info. Querying non-existent partition produces exception in v1.2.0-rc1 - Key: SPARK-4702 URL: https://issues.apache.org/jira/browse/SPARK-4702 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yana Kadiyska Using HiveThriftServer2, when querying a non-existent partition I get an exception rather than an empty result set. This seems to be a regression -- I had an older build of master branch where this works. Build off of RC1.2 tag produces the following: 14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement: org.apache.hive.service.cli.HiveSQLException: java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) at org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source) at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313) at org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234474#comment-14234474 ] Patrick Wendell commented on SPARK-4747: Yeah - okay if you guys don't mind I'll probably close this as wont fix. Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4747) Move JobProgressListener out of org.apache.spark.ui.jobs
[ https://issues.apache.org/jira/browse/SPARK-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4747. Resolution: Won't Fix Move JobProgressListener out of org.apache.spark.ui.jobs Key: SPARK-4747 URL: https://issues.apache.org/jira/browse/SPARK-4747 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ryan Williams Priority: Minor [~vanzin] noted on [#2696|https://github.com/apache/spark/pull/2696/files#r19235280] that {{JobProgressListener}} should be moved out of {{ui.jobs}} since it doesn't really deal with UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234479#comment-14234479 ] Patrick Wendell commented on SPARK-4740: Thanks for reporting this. We've run a bunch of tests and never found Netty to be slower than NIO, so this is a helpful piece of feedback. One unique thing about your environment is that you have 48 cores per node. Do you observe the same effect if you limit the parallelism on each node to fewer cores? Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time -- Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Attachments: Spark-perf Test Report.pdf When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), the Netty based shuffle transferService takes much longer time than the NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for the test is 20 billion records with a total size of about 400GB. The spark-perf test is running on a 4-node cluster with 10G NICs, 48 cpu cores per node, and 64GB of memory per executor. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234479#comment-14234479 ] Patrick Wendell edited comment on SPARK-4740 at 12/4/14 7:00 PM: - Thanks for reporting this. We've run a bunch of tests and never found Netty to be slower than NIO, so this is a helpful piece of feedback. One unique thing about your environment is that you have 48 cores per node. Do you observe the same effect if you limit the parallelism on each node to fewer cores? /cc [~adav] [~rxin] was (Author: pwendell): Thanks for reporting this. We've run a bunch of tests and never found netty to be slower than NIO, so this is a helpful piece of feedback. One unique thing about your environment is that you have 48 cores per node. Do you observe the same effect if you limit the parallelism on each node to fewer cores?
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234489#comment-14234489 ] Reynold Xin commented on SPARK-4740: [~adav] Could it be the thread pool size being too small?
[jira] [Commented] (SPARK-4737) Prevent serialization errors from ever crashing the DAG scheduler
[ https://issues.apache.org/jira/browse/SPARK-4737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234538#comment-14234538 ] Michael Armbrust commented on SPARK-4737: - I think another big problem here is that the DAGScheduler restarts (somewhat silently) and comes back in a bad state. Perhaps if the DAGScheduler crashes we should kill the whole process if we aren't actually resilient to restarts. Prevent serialization errors from ever crashing the DAG scheduler - Key: SPARK-4737 URL: https://issues.apache.org/jira/browse/SPARK-4737 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Patrick Wendell Assignee: Matthew Cheah Priority: Blocker Currently in Spark we assume that when tasks are serialized in the TaskSetManager, the serialization cannot fail. We assume this because upstream in the DAGScheduler we attempt to catch any serialization errors by serializing a single partition. However, in some cases this upstream test is not accurate, i.e. an RDD can have one partition that serializes cleanly but others that do not. To do this properly we need to catch and propagate the exception at the time of serialization. The tricky bit is making sure it gets propagated in the right way.
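The guard being proposed can be illustrated outside of Spark (the helper and error names below are hypothetical, not the DAGScheduler's actual code): serialize every task eagerly and convert any failure into a typed error that the caller can treat as a task failure, instead of letting the exception escape the scheduling loop.

```python
import pickle


class TaskSerializationError(Exception):
    """Raised when a task's payload cannot be serialized."""


def serialize_tasks(tasks):
    # Attempt to serialize every task up front; a single bad partition
    # surfaces as a typed error rather than crashing the caller's loop.
    serialized = []
    for i, task in enumerate(tasks):
        try:
            serialized.append(pickle.dumps(task))
        except Exception as e:
            raise TaskSerializationError(
                "task %d failed to serialize: %s" % (i, e)) from e
    return serialized


good = [(1, "a"), (2, "b")]
bad = [(1, "a"), (2, lambda x: x)]  # lambdas are not picklable by the stdlib pickler

serialize_tasks(good)
try:
    serialize_tasks(bad)
except TaskSerializationError as e:
    print("caught:", e)
```

The point mirrors the description above: testing one partition up front is not enough, because serializability can differ per partition, so the check has to happen where each task is actually serialized.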
[jira] [Commented] (SPARK-4331) SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1
[ https://issues.apache.org/jira/browse/SPARK-4331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234556#comment-14234556 ] Michael Armbrust commented on SPARK-4331: - I'll add that scalastyle does not run on test code either. SBT Scalastyle doesn't work for the sources under hive's v0.12.0 and v0.13.1 Key: SPARK-4331 URL: https://issues.apache.org/jira/browse/SPARK-4331 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.3.0 Reporter: Kousuke Saruta v0.13.1 and v0.12.0 are not standard directory structures for sbt's scalastyle plugin, so scalastyle doesn't work for sources under those directories.
[jira] [Resolved] (SPARK-4253) Ignore spark.driver.host in yarn-cluster and standalone-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4253. --- Resolution: Fixed Fix Version/s: 1.2.0 1.1.2 Issue resolved by pull request 3112 [https://github.com/apache/spark/pull/3112] Ignore spark.driver.host in yarn-cluster and standalone-cluster mode Key: SPARK-4253 URL: https://issues.apache.org/jira/browse/SPARK-4253 Project: Spark Issue Type: Bug Components: YARN Reporter: WangTaoTheTonic Priority: Minor Fix For: 1.1.2, 1.2.0 Attachments: Cannot assign requested address.txt We don't actually know where the driver will be before it is launched in yarn-cluster mode. If we set the spark.driver.host property, Spark will create an Actor on the hostname or IP as set, which leads to an error. So we should ignore this config item in yarn-cluster mode. As [~joshrosen] pointed out, we should also ignore it in standalone-cluster mode.
[jira] [Updated] (SPARK-4253) Ignore spark.driver.host in yarn-cluster and standalone-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4253: -- Assignee: WangTaoTheTonic
[jira] [Assigned] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-4731: Assignee: Andrew Or Spark 1.1.1 launches broken EC2 clusters Key: SPARK-4731 URL: https://issues.apache.org/jira/browse/SPARK-4731 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.1 Environment: Spark 1.1.1 on MacOS X Reporter: Jey Kottalam Assignee: Andrew Or EC2 clusters launched using Spark 1.1.1's `spark-ec2` script with the `-v 1.1.1` flag fail to initialize the master and workers correctly. The `/root/spark` directory contains only the `conf` directory and doesn't have the `bin` and other directories. [~joshrosen] suggested that [spark-ec2 #81](https://github.com/mesos/spark-ec2/pull/81) might have fixed it, but I still see this problem after that was merged. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234666#comment-14234666 ] Andrew Or commented on SPARK-4731: -- This should work once https://github.com/mesos/spark-ec2/pull/82 is merged.
[jira] [Created] (SPARK-4748) PySpark can't read data in HDFS in YARN mode
Sebastián Ramírez created SPARK-4748: Summary: PySpark can't read data in HDFS in YARN mode Key: SPARK-4748 URL: https://issues.apache.org/jira/browse/SPARK-4748 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.1.1 Environment: Spark 1.1.1 precompiled for Hadoop 2.4, Hortonworks HDP 2.1, CentOS 6.6, (Anaconda 2.1.0 64-bit) Python 2.7.8, Numpy 1.9.0 Reporter: Sebastián Ramírez
Using *PySpark*, I'm unable to read and process data in *HDFS* in *YARN* cluster mode, but I can read data from HDFS in local mode. I have a 6-node cluster with Hortonworks HDP 2.1. The operating system is CentOS 6.6. I have installed Anaconda Python (which includes numpy) on every node for the user yarn.
h5. This works (*PySpark* local reading from HDFS):
When I start the console with:
{code}
IPYTHON=1 /home/hdfs/spark-1.1.1-bin-hadoop2.4/bin/pyspark --master local
{code}
Then I do (that file is in HDFS):
{code}
testdata = sc.textFile('/user/hdfs/testdata.csv')
{code}
And then:
{code}
testdata.first()
{code}
I get my data back:
{code}
u'asdf,qwer,1,M'
{code}
And if I do:
{code}
testdata.count()
{code}
It also works, I get:
{code}
500
{code}
h5. This also works (*Scala* in YARN cluster reading from HDFS):
When I start the console with:
{code}
/home/hdfs/spark-1.1.1-bin-hadoop2.4/bin/spark-shell --master yarn-client --num-executors 6 --executor-cores 2 --executor-memory 2G --driver-memory 2G
{code}
Then I do (that file is in HDFS):
{code}
val testdata = sc.textFile("/user/hdfs/testdata.csv")
{code}
And then:
{code}
testdata.first()
{code}
I get my data back:
{code}
res1: String = asdf,qwer,1,M
{code}
And if I do:
{code}
testdata.count()
{code}
It also works, I get:
{code}
res2: Long = 500
{code}
h5. This doesn't work (*PySpark* in YARN cluster reading from HDFS):
When I start the console with:
{code}
IPYTHON=1 /home/hdfs/spark-1.1.1-bin-hadoop2.4/bin/pyspark --master yarn-client --num-executors 6 --executor-cores 2 --executor-memory 2G --driver-memory 2G
{code}
Then I do (that file is in HDFS):
{code}
testdata = sc.textFile('/user/hdfs/testdata.csv')
{code}
And then:
{code}
testdata.first()
{code}
And I get some *INFO* logs, and then a *WARN*:
{code}
14/12/04 15:26:40 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, node05): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/home/hdfs/spark-1.1.1-bin-hadoop2.4/python/pyspark/rdd.py", line 1146, in takeUpToNumLeft
ImportError: No module named next
        org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
        org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)
14/12/04 15:26:40 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, node05, NODE_LOCAL, 1254 bytes)
14/12/04 15:26:40 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on executor node05: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/yarn/local/usercache/hdfs/filecache/44/spark-assembly-1.1.1-hadoop2.4.0.jar/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File
{code}
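A common cause of this class of worker-side ImportError (an assumption for this report, not a confirmed diagnosis) is that the YARN executors launch a different, older system Python than the driver's Anaconda interpreter. The usual check is to pin the executor interpreter via PYSPARK_PYTHON on every node; the Anaconda path below is illustrative:

```shell
# Hypothetical path: make executors use the same interpreter as the driver.
export PYSPARK_PYTHON=/opt/anaconda/bin/python
IPYTHON=1 /home/hdfs/spark-1.1.1-bin-hadoop2.4/bin/pyspark --master yarn-client
```

If the environment variable is not visible to the YARN containers, the interpreter mismatch persists even though the driver shell works, which matches the local-mode-works/cluster-mode-fails pattern above.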
[jira] [Created] (SPARK-4749) Allow initializing KMeans clusters using a seed
Nate Crosswhite created SPARK-4749: -- Summary: Allow initializing KMeans clusters using a seed Key: SPARK-4749 URL: https://issues.apache.org/jira/browse/SPARK-4749 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.1.0 Reporter: Nate Crosswhite Add an optional seed to MLlib KMeans clustering to allow initial cluster choices to be deterministic. Update the PySpark MLlib interface to also allow an optional seed parameter to be supplied.
[jira] [Commented] (SPARK-4749) Allow initializing KMeans clusters using a seed
[ https://issues.apache.org/jira/browse/SPARK-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234698#comment-14234698 ] Apache Spark commented on SPARK-4749: - User 'nxwhite-str' has created a pull request for this issue: https://github.com/apache/spark/pull/3610
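The requested behavior can be sketched in plain Python (this mirrors the idea only, not MLlib's actual k-means initializer): a seeded RNG makes the sampled initial centers reproducible across runs.

```python
import random


def init_centers(points, k, seed=None):
    # A fixed seed makes the choice of initial centers deterministic;
    # seed=None falls back to nondeterministic initialization.
    rng = random.Random(seed)
    return rng.sample(points, k)


points = [(float(i), float(i % 7)) for i in range(100)]
run1 = init_centers(points, 3, seed=42)
run2 = init_centers(points, 3, seed=42)
assert run1 == run2  # same seed, same initial clusters
```

Exposing `seed` as an optional keyword keeps existing callers unchanged while letting tests and reproducibility-sensitive jobs opt in.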
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234702#comment-14234702 ] Nicholas Chammas commented on SPARK-3431: - I think I'm on to something, but I need some help. I think I understand how to tell SBT to fork JVMs for tests, and I also think I got how to specify how the tests should be grouped in the various forked JVMs. It's not working because I think the forked JVMs are not getting passed all the options they need. Basically, I don't think that the reference to {{javaOptions}} [here in this line|https://github.com/nchammas/spark/blob/ab127b798dbfa9399833d546e627f9651b060918/project/SparkBuild.scala#L429] actually has all the options [defined earlier|https://github.com/nchammas/spark/blob/ab127b798dbfa9399833d546e627f9651b060918/project/SparkBuild.scala#L403-L418]. I don't know much Scala. If anyone could review what I have so far and give me some pointers, that would be great! You can see all the variations I've tried along with the associated output in [the open pull request|https://github.com/apache/spark/pull/3564]. I know we want to get this working with Maven, but I figured getting it to work first with SBT wouldn't be a bad thing. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run.
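The sbt pattern under discussion can be sketched as a build fragment (a sketch under stated assumptions: setting and type names come from sbt 0.13's standard `Keys` and `Tests` API, and the one-test-per-group split is illustrative, not Spark's actual grouping logic). Forked test JVMs only receive the options explicitly placed in their `ForkOptions`, so the `javaOptions in Test` value has to be threaded into every group:

```scala
// sbt build fragment (sketch): one forked JVM per test, each given the
// full set of test javaOptions rather than the defaults.
testGrouping in Test := (definedTests in Test).value.map { test =>
  new Tests.Group(
    name = test.name,
    tests = Seq(test),
    runPolicy = Tests.SubProcess(
      ForkOptions(runJVMOptions = (javaOptions in Test).value)
    )
  )
}
```

If a group's `ForkOptions` omits the options, the symptom is exactly what the comment describes: forking works, but the forked JVMs are missing system properties and memory settings the tests rely on.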
[jira] [Resolved] (SPARK-4745) get_existing_cluster() doesn't work with additional security groups
[ https://issues.apache.org/jira/browse/SPARK-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4745. --- Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 Issue resolved by pull request 3596 [https://github.com/apache/spark/pull/3596] get_existing_cluster() doesn't work with additional security groups --- Key: SPARK-4745 URL: https://issues.apache.org/jira/browse/SPARK-4745 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Alex DeBrie Fix For: 1.1.2, 1.2.1 The spark-ec2 script has a flag that allows you to add additional security groups to clusters when you launch. However, the get_existing_cluster() function cycles through active instances and only returns instances whose group_names == cluster_name + "-master" (or + "-slaves"), which is the group created by default. The get_existing_cluster() function is used to log in to, stop, and destroy existing clusters, among other actions. This is a pretty simple fix for which I've already submitted a [pull request|https://github.com/apache/spark/pull/3596]. It checks whether cluster_name + "-master" is in the list of groups for each active instance. This means the cluster group can be one among many groups, rather than the sole group for an instance.
[jira] [Updated] (SPARK-4745) get_existing_cluster() doesn't work with additional security groups
[ https://issues.apache.org/jira/browse/SPARK-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4745: -- Assignee: Alex DeBrie
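The fix described for get_existing_cluster() boils down to a membership test instead of an equality test (the function name below is hypothetical; the real change lives in the spark-ec2 script):

```python
def is_cluster_instance(instance_group_names, cluster_name):
    # The proposed check: the cluster's default group only has to be
    # *among* the instance's security groups, not its sole group.
    return (cluster_name + "-master" in instance_group_names
            or cluster_name + "-slaves" in instance_group_names)


# An instance that belongs to extra security groups is still matched:
assert is_cluster_instance(["spark-cluster-master", "extra-sg"], "spark-cluster")
# Instances from other clusters are not:
assert not is_cluster_instance(["other-cluster-master"], "spark-cluster")
```

With the original equality check, any instance launched with additional groups silently drops out of login, stop, and destroy operations.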
[jira] [Closed] (SPARK-4731) Spark 1.1.1 launches broken EC2 clusters
[ https://issues.apache.org/jira/browse/SPARK-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4731. Resolution: Fixed Fix Version/s: 1.1.1 Target Version/s: 1.1.1
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234758#comment-14234758 ] Aaron Davidson commented on SPARK-4740: --- Could you try setting spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads to 48? We have an artificial default maximum of 8 to limit off-heap memory usage, but it's possible this is not sufficient to saturate a 10 Gb/s link.
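The suggested change, as a spark-defaults.conf fragment (property names are taken from the comment above; whether 48 is the right value for a given cluster is the open question, and more threads mean more off-heap buffer memory):

```
spark.shuffle.io.serverThreads  48
spark.shuffle.io.clientThreads  48
```

The same pair can equally be passed per-job via `--conf` on spark-submit.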
[jira] [Updated] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4459: -- Affects Version/s: (was: 1.0.2) (was: 1.1.0) 1.1.2 1.2.0 JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0, 1.1.2 Reporter: Alok Saldanha Fix For: 1.1.1, 1.1.2
I believe this issue is essentially the same as SPARK-668. Original error:
{code}
[ERROR] /Users/saldaal1/workspace/JavaSparkSimpleApp/src/main/java/SimpleApp.java:[29,105] no suitable method found for groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<java.lang.String,java.lang.Long>,java.lang.Long>)
[ERROR] method org.apache.spark.api.java.JavaPairRDD.<K>groupBy(org.apache.spark.api.java.function.Function<scala.Tuple2<K,java.lang.Long>,K>) is not applicable
[ERROR] (inferred type does not conform to equality constraint(s)
{code}
from core/src/main/scala/org/apache/spark/api/java/JavaRDDLike.scala:
{code}
/**
 * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
 * mapping to that key.
 */
def groupBy[K](f: JFunction[T, K]): JavaPairRDD[K, JIterable[T]] = {
  implicit val ctagK: ClassTag[K] = fakeClassTag
  implicit val ctagV: ClassTag[JList[T]] = fakeClassTag
  JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag)))
}
{code}
Then in core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala:
{code}
class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
    (implicit val kClassTag: ClassTag[K], implicit val vClassTag: ClassTag[V])
  extends JavaRDDLike[(K, V), JavaPairRDD[K, V]] {
{code}
The problem is that the type parameter T in JavaRDDLike is Tuple2[K,V], which means the combined signature for groupBy in the JavaPairRDD is
{code}
groupBy[K](f: JFunction[Tuple2[K,V], K])
{code}
which imposes an unfortunate correlation between the Tuple2 and the return type of the grouping function, namely that the return type of the grouping function must be the same as the first type of the JavaPairRDD. If we compare the method signature to flatMap:
{code}
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U](f: FlatMapFunction[T, U]): JavaRDD[U] = {
  import scala.collection.JavaConverters._
  def fn = (x: T) => f.call(x).asScala
  JavaRDD.fromRDD(rdd.flatMap(fn)(fakeClassTag[U]))(fakeClassTag[U])
}
{code}
we see there should be an easy fix by changing the type parameter of the groupBy function from K to U.
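The suggested fix can be sketched as follows (this follows the proposal in the description; the committed patch may differ): a fresh type parameter decouples the grouping key from the pair RDD's own key type.

```scala
// Sketch: U is independent of T, so for JavaPairRDD[K, V] (where T = (K, V))
// the grouping function may return any key type, not just K.
def groupBy[U](f: JFunction[T, U]): JavaPairRDD[U, JIterable[T]] = {
  implicit val ctagK: ClassTag[U] = fakeClassTag
  implicit val ctagV: ClassTag[JList[T]] = fakeClassTag
  JavaPairRDD.fromRDD(groupByResultToJava(rdd.groupBy(f)(fakeClassTag)))
}
```

This matches the shape of flatMap[U], which already introduces its own type parameter instead of reusing one bound by the enclosing class.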
[jira] [Resolved] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4459. --- Resolution: Fixed Fix Version/s: 1.1.1 1.1.2 Issue resolved by pull request 3327 [https://github.com/apache/spark/pull/3327]
[jira] [Updated] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4459: -- Assignee: Alok Saldanha
[jira] [Updated] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4459: -- Affects Version/s: 1.0.0 JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.0, 1.2.0, 1.1.2 Reporter: Alok Saldanha Assignee: Alok Saldanha Fix For: 1.1.1, 1.1.2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4459) JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors
[ https://issues.apache.org/jira/browse/SPARK-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4459: -- Affects Version/s: (was: 1.1.2) (was: 1.2.0) (was: 1.0.0) 1.0.2 1.1.0 JavaRDDLike.groupBy[K](f: JFunction[T, K]) may fail with typechecking errors Key: SPARK-4459 URL: https://issues.apache.org/jira/browse/SPARK-4459 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2, 1.1.0 Reporter: Alok Saldanha Assignee: Alok Saldanha Fix For: 1.1.2, 1.2.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4652) Add docs about spark-git-repo option
[ https://issues.apache.org/jira/browse/SPARK-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4652: -- Assignee: Kai Sasaki Add docs about spark-git-repo option Key: SPARK-4652 URL: https://issues.apache.org/jira/browse/SPARK-4652 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor It was a little hard to understand how to use the --spark-git-repo option of the spark-ec2 script. Some additional documentation might be needed to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4652) Add docs about spark-git-repo option
[ https://issues.apache.org/jira/browse/SPARK-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4652. --- Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 Issue resolved by pull request 3513 [https://github.com/apache/spark/pull/3513] Add docs about spark-git-repo option Key: SPARK-4652 URL: https://issues.apache.org/jira/browse/SPARK-4652 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Kai Sasaki Assignee: Kai Sasaki Priority: Minor Fix For: 1.1.2, 1.2.1 It was a little hard to understand how to use the --spark-git-repo option of the spark-ec2 script. Some additional documentation might be needed to use it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234783#comment-14234783 ] Nicholas Chammas commented on SPARK-3431: - As an aside, I expect there to be some work required to let certain tests play nicely with one another. But if we figure out how to specify test groupings and make sure the forked JVMs are configured correctly, refactoring tests where necessary should be very doable. Parallelize execution of tests -- Key: SPARK-3431 URL: https://issues.apache.org/jira/browse/SPARK-3431 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common strategy to cut test time down is to parallelize the execution of the tests. Doing that may in turn require some prerequisite changes to be made to how certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
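The grouping idea in the comment above can be sketched generically (an illustrative sketch only, not Spark's actual dev/run-tests or SBT configuration): tests that must not interfere share a group and run serially, while distinct groups, each of which could map to a separate forked JVM, run in parallel.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelTestGroups {
    /** Runs each group's tests serially, but distinct groups concurrently. */
    static List<String> runGrouped(Map<String, List<Runnable>> groups, int parallelism)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        List<String> finished = Collections.synchronizedList(new ArrayList<>());
        groups.forEach((name, tests) -> pool.submit(() -> {
            tests.forEach(Runnable::run); // serial within a group
            finished.add(name);
        }));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, List<Runnable>> groups = new HashMap<>();
        groups.put("sql", List.of(() -> {}, () -> {}));   // e.g. tests sharing a metastore
        groups.put("core", List.of(() -> {}));
        System.out.println(runGrouped(groups, 2).size()); // prints 2
    }
}
```

The group names and the in-process thread pool are stand-ins; the point is only the serial-within-group, parallel-across-groups discipline the comment describes.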
[jira] [Updated] (SPARK-4136) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty
[ https://issues.apache.org/jira/browse/SPARK-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4136: - Target Version/s: 1.3.0 (was: 1.2.0) Under dynamic allocation, cancel outstanding executor requests when pending task queue is empty --- Key: SPARK-4136 URL: https://issues.apache.org/jira/browse/SPARK-4136 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
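The policy in the SPARK-4136 title can be stated in a few lines. A hypothetical sketch (field names invented for illustration, not Spark's allocation-manager code): once the pending task queue is empty, any executor requests that have not yet been fulfilled serve no purpose and can be withdrawn.

```java
public class CancelRequestsSketch {
    int pendingTasks;                 // tasks waiting to be scheduled
    int outstandingExecutorRequests;  // requested but not yet granted executors

    /** Number of outstanding requests to cancel under the proposed policy. */
    int requestsToCancel() {
        return pendingTasks == 0 ? outstandingExecutorRequests : 0;
    }

    public static void main(String[] args) {
        CancelRequestsSketch s = new CancelRequestsSketch();
        s.pendingTasks = 0;
        s.outstandingExecutorRequests = 5;
        System.out.println(s.requestsToCancel()); // prints 5: all 5 are wasted
        s.pendingTasks = 3;
        System.out.println(s.requestsToCancel()); // prints 0: requests still useful
    }
}
```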
[jira] [Created] (SPARK-4750) Dynamic allocation - we need to synchronize kills
Andrew Or created SPARK-4750: Summary: Dynamic allocation - we need to synchronize kills Key: SPARK-4750 URL: https://issues.apache.org/jira/browse/SPARK-4750 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or https://github.com/apache/spark/blob/ab8177da2defab1ecd8bc0cd5a21f07be5b8d2c5/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L337 Simple omission on my part. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4750) Dynamic allocation - we need to synchronize kills
[ https://issues.apache.org/jira/browse/SPARK-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234838#comment-14234838 ] Apache Spark commented on SPARK-4750: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/3612 Dynamic allocation - we need to synchronize kills - Key: SPARK-4750 URL: https://issues.apache.org/jira/browse/SPARK-4750 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or https://github.com/apache/spark/blob/ab8177da2defab1ecd8bc0cd5a21f07be5b8d2c5/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L337 Simple omission on my part. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
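Unsynchronized kill bookkeeping is a classic check-then-act race, which is presumably why SPARK-4750 calls for synchronization. A minimal sketch (hypothetical class, not the actual CoarseGrainedSchedulerBackend code): without the synchronized keyword, two concurrent kills of the same executor could both pass the membership check and double-count.

```java
import java.util.HashSet;
import java.util.Set;

public class KillSketch {
    private final Set<String> executors = new HashSet<>(Set.of("exec-1", "exec-2"));
    private int killed = 0;

    // synchronized makes the check (contains) and the act (remove) atomic;
    // without it, two threads can both see exec-1 present and both "kill" it.
    public synchronized boolean killExecutor(String id) {
        if (!executors.contains(id)) return false;
        executors.remove(id);
        killed++;
        return true;
    }

    public synchronized int killedCount() { return killed; }

    public static void main(String[] args) throws InterruptedException {
        KillSketch backend = new KillSketch();
        Runnable kill = () -> backend.killExecutor("exec-1");
        Thread a = new Thread(kill), b = new Thread(kill);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(backend.killedCount()); // prints 1: only one kill wins
    }
}
```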
[jira] [Created] (SPARK-4751) Support dynamic allocation for standalone mode
Andrew Or created SPARK-4751: Summary: Support dynamic allocation for standalone mode Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is equivalent to SPARK-3822 but for standalone mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4752) Classifier based on artificial neural network
Alexander Ulanov created SPARK-4752: --- Summary: Classifier based on artificial neural network Key: SPARK-4752 URL: https://issues.apache.org/jira/browse/SPARK-4752 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Implement classifier based on artificial neural network (ANN). Requirements: 1) Use the existing artificial neural network implementation https://issues.apache.org/jira/browse/SPARK-2352, https://github.com/apache/spark/pull/1290 2) Extend MLlib ClassificationModel trait, 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234855#comment-14234855 ] Alexander Ulanov edited comment on SPARK-4752 at 12/5/14 12:51 AM: --- The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It encodes the class label as a binary vector in the ANN output and selects the class based on biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then. was (Author: avulanov): The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It codes the class label as a binary vector in the ANN output and selects the class based on biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then. Classifier based on artificial neural network - Key: SPARK-4752 URL: https://issues.apache.org/jira/browse/SPARK-4752 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Original Estimate: 168h Remaining Estimate: 168h Implement classifier based on artificial neural network (ANN). 
Requirements: 1) Use the existing artificial neural network implementation https://issues.apache.org/jira/browse/SPARK-2352, https://github.com/apache/spark/pull/1290 2) Extend MLlib ClassificationModel trait, 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4752) Classifier based on artificial neural network
[ https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234855#comment-14234855 ] Alexander Ulanov commented on SPARK-4752: - The initial implementation can be found here: https://github.com/avulanov/spark/tree/annclassifier. It codes the class label as a binary vector in the ANN output and selects the class based on biggest output value. The implementation contains unit tests as well. The mentioned code uses the following PR: https://github.com/apache/spark/pull/1290. It is not yet merged into the main branch. I think that I should not make a pull request until then. Classifier based on artificial neural network - Key: SPARK-4752 URL: https://issues.apache.org/jira/browse/SPARK-4752 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Original Estimate: 168h Remaining Estimate: 168h Implement classifier based on artificial neural network (ANN). Requirements: 1) Use the existing artificial neural network implementation https://issues.apache.org/jira/browse/SPARK-2352, https://github.com/apache/spark/pull/1290 2) Extend MLlib ClassificationModel trait, 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training, 4) Be able to return the ANN model -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
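The encoding scheme the comments above describe, class label as a binary (one-hot) vector on the ANN output and predicted class as the index of the biggest output value, is easy to sketch on its own (illustrative code, not taken from the linked branch):

```java
import java.util.Arrays;

public class OneHotSketch {
    /** Encodes a 0-based class label as a binary vector of length numClasses. */
    static double[] encode(int label, int numClasses) {
        double[] v = new double[numClasses];
        v[label] = 1.0;
        return v;
    }

    /** Selects the class whose output value is biggest (argmax). */
    static int decode(double[] output) {
        int best = 0;
        for (int i = 1; i < output.length; i++)
            if (output[i] > output[best]) best = i;
        return best;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(2, 4))); // [0.0, 0.0, 1.0, 0.0]
        double[] netOut = {0.1, 0.2, 0.6, 0.1};            // hypothetical ANN output
        System.out.println(decode(netOut));                // prints 2
    }
}
```

Training targets are produced by encode, and decode turns the network's real-valued outputs back into a class index, which is what lets a regression-style ANN act as a classifier.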
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234876#comment-14234876 ] Saisai Shao commented on SPARK-4740: We also tested with a small dataset of about 40GB; Netty's performance is similar to NIO's. My guess is that Netty is not efficient when fetching a large number of shuffle blocks: in our 400GB case, each reduce task needs to fetch about 7000 shuffle blocks, and each shuffle block is only tens of KB in size. We will try increasing the shuffle thread number and test again. Judging from the call stack, all the shuffle clients are busy waiting on epoll_wait; I'm not sure whether this is expected. Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time -- Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Attachments: Spark-perf Test Report.pdf When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for test is 20 billion records, and the total size is about 400GB. Spark-perf test is Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
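The figures in the comment above are internally consistent, as a quick back-of-the-envelope check shows (assuming, for the sketch, that the shuffle data is spread evenly across reducers and blocks):

```java
public class ShuffleArithmetic {
    /** Average shuffle block size in KB, assuming an even spread of data. */
    static double perBlockKB(double totalGB, int reduceTasks, int blocksPerReducer) {
        double perReducerKB = totalGB * 1024 * 1024 / reduceTasks; // GB -> KB
        return perReducerKB / blocksPerReducer;
    }

    public static void main(String[] args) {
        // Figures from the comment: ~400GB shuffled, 1000 reduce tasks,
        // ~7000 blocks fetched per reduce task.
        double kb = perBlockKB(400.0, 1000, 7000);
        System.out.printf("~%.0f KB per shuffle block%n", kb); // ~60 KB: "tens of KB"
    }
}
```

This supports the comment's hypothesis that the workload is dominated by many small fetches rather than a few large transfers.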
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234902#comment-14234902 ] Saisai Shao commented on SPARK-4740: Besides, we also tested with a 24-core WSM CPU; the performance of Netty is still slower than NIO. Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time -- Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Attachments: Spark-perf Test Report.pdf When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. We tested with standalone mode, and the data set we used for test is 20 billion records, and the total size is about 400GB. Spark-perf test is Running on a 4 node cluster with 10G NIC, 48 cpu cores per node and each executor memory is 64GB. The reduce tasks number is set to 1000. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue was:This is equivalent to SPARK-3822 but for standalone mode. Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Affects Version/s: 1.2.0 Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because of the scheduling mechanisms in the standalone Master. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. This means an application could get executors of different sizes (in terms of cores) if we kill and then request executors. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. was: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because of the scheduling mechanisms in the standalone Master. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. This means an application could get executors of different sizes (in terms of cores) if we kill and then request executors. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, App 1 will get back an executor of half the number of cores. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. was: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because of the scheduling mechanisms in the standalone Master. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. This means an application could get executors of different sizes (in terms of cores) if we kill and then request executors. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. 
Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, App 1 will get back an executor of half the number of cores. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. was: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, App 1 will get back an executor of half the number of cores. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. 
Support dynamic allocation for standalone mode -- Key: SPARK-4751 URL: https://issues.apache.org/jira/browse/SPARK-4751 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4751) Support dynamic allocation for standalone mode
[ https://issues.apache.org/jira/browse/SPARK-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4751: - Description: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, the new executor that App 1 gets back will be smaller than the rest and can execute fewer tasks in parallel. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes. was: This is equivalent to SPARK-3822 but for standalone mode. This is actually a very tricky issue because the scheduling mechanism in the standalone Master uses different semantics. In standalone mode we allocate resources based on cores. By default, an application will grab all the cores in the cluster unless spark.cores.max is specified. Unfortunately, this means an application could get executors of different sizes (in terms of cores) if: 1) App 1 kills an executor 2) App 2, with spark.cores.max set, grabs a subset of cores on a worker 3) App 1 requests an executor In this case, App 1 will get back an executor of half the number of cores. Further, standalone mode is subject to the constraint that only one executor can be allocated on each worker per application. 
As a result, it is rather meaningless to request new executors if the existing ones are already spread out across all nodes.
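The three-step scenario in the description can be made concrete with a small simulation. This is a hypothetical sketch of the greedy, core-based allocation behavior described above, not Spark's actual Master code; the `Worker` and `launch_executor` names are illustrative only.

```python
# Sketch (not Spark's Master implementation) of how core-based greedy
# allocation in standalone mode can yield executors of uneven sizes.
# Standalone mode launches at most one executor per worker per application,
# sized by whatever cores happen to be free at request time.

class Worker:
    def __init__(self, name, cores):
        self.name = name
        self.free = cores

def launch_executor(worker, requested):
    """Grant whatever cores are free on the worker, up to the request."""
    granted = min(worker.free, requested)
    worker.free -= granted
    return granted

w = Worker("worker-1", cores=8)

# App 1 initially holds an 8-core executor on worker-1, then kills it.
app1_exec = launch_executor(w, 8)   # granted 8 cores
w.free += app1_exec                 # step 1: App 1 kills its executor

# Step 2: App 2, with spark.cores.max set, grabs a subset of the cores.
app2_exec = launch_executor(w, 4)   # granted 4 cores

# Step 3: App 1 requests an executor again, but only 4 cores remain,
# so its new executor is half the size of the original one.
app1_new = launch_executor(w, 8)
print(app1_new)  # 4
```

Under this model the imbalance follows directly from granting partial core counts rather than rejecting the request, which is the tension the issue describes.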
[jira] [Assigned] (SPARK-4421) Wrong link in spark-standalone.html
[ https://issues.apache.org/jira/browse/SPARK-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-4421: - Assignee: Josh Rosen Wrong link in spark-standalone.html --- Key: SPARK-4421 URL: https://issues.apache.org/jira/browse/SPARK-4421 Project: Spark Issue Type: Bug Components: Documentation Reporter: Masayoshi TSUZUKI Assignee: Josh Rosen Priority: Trivial Fix For: 1.1.2, 1.2.1 The link about building Spark on the documentation page Spark Standalone Mode (spark-standalone.html) is wrong. The link points to {{index.html#building}}, but that anchor only exists up to 0.9. The building guide was moved to another page ({{building-with-maven.html}} in 1.0 and 1.1, and {{building-spark.html}} in 1.2).
[jira] [Resolved] (SPARK-4421) Wrong link in spark-standalone.html
[ https://issues.apache.org/jira/browse/SPARK-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4421. --- Resolution: Fixed Fix Version/s: 1.2.1, 1.1.2 Issue resolved by pull request 3279 [https://github.com/apache/spark/pull/3279]
[jira] [Updated] (SPARK-4421) Wrong link in spark-standalone.html
[ https://issues.apache.org/jira/browse/SPARK-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4421: -- Assignee: Masayoshi TSUZUKI (was: Josh Rosen)
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234937#comment-14234937 ] Zhang, Liye commented on SPARK-4740: We found this issue while running the performance test for [SPARK-2926|https://issues.apache.org/jira/browse/SPARK-2926]. Since [SPARK-2926|https://issues.apache.org/jira/browse/SPARK-2926] takes less time in the reduce phase, the difference between Netty and NIO there is not too large, about 20%. So we tested the master branch, where the difference is more significant, more than 30%. Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time -- Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Attachments: Spark-perf Test Report.pdf When testing the current Spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc.), the Netty based shuffle transferService takes much longer than the NIO based shuffle transferService. The network throughput of Netty is only about half that of NIO. We tested in standalone mode; the data set used for the test is 20 billion records, about 400GB in total. The spark-perf test is running on a 4-node cluster with 10G NIC, 48 CPU cores per node, and 64GB memory per executor. The number of reduce tasks is set to 1000.
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234941#comment-14234941 ] Zhang, Liye commented on SPARK-4740: [~adav], I have tested by setting spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads to 48; the result does not change, Netty still takes the same 39 minutes for the reduce phase.
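For reference, the settings tried in this comment would look like the following in spark-defaults.conf. This is a sketch of the configuration under discussion, assuming the defaults file is the mechanism used (the keys themselves are the ones named in the comment; the value 48 matches the per-node core count from the issue description):

```
# spark-defaults.conf -- thread-count settings tried in this thread
spark.shuffle.io.serverThreads   48
spark.shuffle.io.clientThreads   48
```

The same keys can equivalently be passed via `--conf` on spark-submit.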
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234960#comment-14234960 ] Reynold Xin commented on SPARK-4740: Can you limit the number of cores to a lower value and see what happens? I.e., try it with 16 threads and see if the problem still exists. Thanks.
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234963#comment-14234963 ] Reynold Xin commented on SPARK-4740: Also, can you take a few more jstacks and paste them here? Thanks.
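Capturing a handful of thread dumps as requested can be scripted. Below is a minimal sketch that shells out to the JDK's `jstack` tool; the `pid` value and the `runner` parameter (injected so the loop is testable without a live JVM) are illustrative, not part of any Spark tooling.

```python
# Sketch: take several jstack thread dumps of an executor JVM, spaced apart.
# Assumes the JDK's `jstack` binary is on PATH and `pid` is the executor's
# process id (placeholder values below).
import subprocess
import time

def jstack_cmd(pid):
    """Build the jstack invocation for one dump (-l includes lock info)."""
    return ["jstack", "-l", str(pid)]

def capture_dumps(pid, count=5, interval_s=2, runner=subprocess.check_output):
    """Take `count` dumps `interval_s` seconds apart; return them as strings."""
    dumps = []
    for i in range(count):
        dumps.append(runner(jstack_cmd(pid)).decode("utf-8", "replace"))
        if i + 1 < count:
            time.sleep(interval_s)
    return dumps

# Example usage with a real executor pid:
# for i, d in enumerate(capture_dumps(12345, count=3)):
#     with open(f"jstack-{i}.txt", "w") as f:
#         f.write(d)
```

Several dumps taken a few seconds apart make it possible to distinguish threads that are persistently blocked from ones caught mid-transition.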
[jira] [Created] (SPARK-4753) Parquet2 does not prune based on OR filters on partition columns
Michael Armbrust created SPARK-4753: --- Summary: Parquet2 does not prune based on OR filters on partition columns Key: SPARK-4753 URL: https://issues.apache.org/jira/browse/SPARK-4753 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Michael Armbrust
[jira] [Updated] (SPARK-4753) Parquet2 does not prune based on OR filters on partition columns
[ https://issues.apache.org/jira/browse/SPARK-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4753: Priority: Blocker (was: Major)
[jira] [Commented] (SPARK-4753) Parquet2 does not prune based on OR filters on partition columns
[ https://issues.apache.org/jira/browse/SPARK-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234973#comment-14234973 ] Apache Spark commented on SPARK-4753: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/3613
[jira] [Updated] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhang, Liye updated SPARK-4740: --- Attachment: TestRunner sort-by-key - Thread dump for executor 1_files (48 Cores per node).zip
[jira] [Commented] (SPARK-4740) Netty's network bandwidth is much lower than NIO in spark-perf and Netty takes longer running time
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234983#comment-14234983 ] Zhang, Liye commented on SPARK-4740: [~rxin] I attached the thread dump of one executor (48 cores) during the reduce phase; please take a look. I'll try 16 cores later on.
[jira] [Created] (SPARK-4754) ExecutorAllocationManager should not take in SparkContext
Andrew Or created SPARK-4754: Summary: ExecutorAllocationManager should not take in SparkContext Key: SPARK-4754 URL: https://issues.apache.org/jira/browse/SPARK-4754 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or We should refactor ExecutorAllocationManager to not take in a SparkContext. Otherwise, new developers may try to add a lot of unnecessary pointers.
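The refactoring idea here is the classic "depend on a narrow interface, not the whole context" pattern. The sketch below illustrates that shape only; the class and method names are hypothetical and do not reflect Spark's actual implementation of this ticket.

```python
# Hypothetical sketch of the proposed refactor: instead of handing the
# allocation manager the entire SparkContext, pass only the small set of
# operations it actually needs, so developers cannot reach into unrelated
# context state through it.

class ExecutorAllocationClient:
    """Narrow interface: the few operations the manager really uses."""
    def request_executors(self, num: int) -> None: ...
    def kill_executor(self, executor_id: str) -> None: ...

class ExecutorAllocationManager:
    # Depends on the small client interface, not the full context.
    def __init__(self, client: ExecutorAllocationClient, max_executors: int):
        self.client = client
        self.max_executors = max_executors
        self.current = 0

    def scale_up(self, num: int) -> int:
        """Request up to `num` more executors, capped at max_executors."""
        granted = min(num, self.max_executors - self.current)
        if granted > 0:
            self.client.request_executors(granted)
            self.current += granted
        return granted
```

A narrow client interface also makes the manager trivially testable with a fake client, which is one common motivation for this kind of change.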