[jira] [Created] (SPARK-4828) sum and avg over empty table should return null
Adrian Wang created SPARK-4828: -- Summary: sum and avg over empty table should return null Key: SPARK-4828 URL: https://issues.apache.org/jira/browse/SPARK-4828 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
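Per standard SQL semantics, SUM and AVG over zero rows yield NULL rather than 0, which is the behavior the issue title asks for. A minimal plain-Java sketch of those semantics (hypothetical helper class, not Spark code):

```java
import java.util.Collections;
import java.util.List;

class EmptySumAvg {
    // SQL semantics: SUM/AVG over zero rows yield NULL (here: null), not 0.
    static Double sum(List<Double> rows) {
        if (rows.isEmpty()) return null;   // no rows -> NULL
        double s = 0.0;
        for (double v : rows) s += v;
        return s;
    }

    static Double avg(List<Double> rows) {
        Double s = sum(rows);
        return s == null ? null : s / rows.size();
    }

    public static void main(String[] args) {
        System.out.println(sum(Collections.<Double>emptyList())); // null
        System.out.println(avg(Collections.<Double>emptyList())); // null
    }
}
```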
[jira] [Commented] (SPARK-4828) sum and avg over empty table should return null
[ https://issues.apache.org/jira/browse/SPARK-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242388#comment-14242388 ] Apache Spark commented on SPARK-4828: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3675 sum and avg over empty table should return null --- Key: SPARK-4828 URL: https://issues.apache.org/jira/browse/SPARK-4828 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor
[jira] [Created] (SPARK-4829) eliminate expressions calculation in count expression
Adrian Wang created SPARK-4829: -- Summary: eliminate expressions calculation in count expression Key: SPARK-4829 URL: https://issues.apache.org/jira/browse/SPARK-4829 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang
[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242402#comment-14242402 ] Sean Owen commented on SPARK-4817: -- [~tianyi] I agree that it would be nice to add a parameter to {{print()}}. But that is precisely what SPARK-3326 already covered. It's fine to want to process all of an RDD, and then print some of it. But this does not require a new method at all. [~surq] I don't understand your examples, relative to what you say you want to do. The first example only prints elements, and does it with needless complexity. Why {{collect()}}? The second example also doesn't do anything but print, yet tries to manually run a new job. Something simple like this seems to be just what you want. It does something with the entire RDD, then prints just the first 100 elements: {code}
stream.foreachRDD { rdd =>
  rdd.foreach(row => /* do whatever you want with every element */)
  rdd.take(100).foreach(println)
}
{code} I think this also works fine: {code}
stream.foreachRDD { rdd =>
  rdd.foreach(row => /* do whatever you want with every element */)
}
stream.print(100)
{code} ... if SPARK-3326 is implemented to add an argument to {{print()}}. [streaming]Print the specified number of data and handle all of the elements in RDD --- Key: SPARK-4817 URL: https://issues.apache.org/jira/browse/SPARK-4817 Project: Spark Issue Type: New Feature Components: Streaming Reporter: 宿荣全 Priority: Minor The {{DStream.print}} function prints 10 elements (internally taking 11). A new function based on {{DStream.print}} is proposed: print the specified number of elements while handling all of the elements in the RDD. A typical scenario: {{val dstream = stream.map(...).filter(...).mapPartitions(...).print(...)}} The data after the filter needs to update a database inside {{mapPartitions}}, but not every record needs to be printed; only the top 20 are needed to inspect the processing.
[jira] [Commented] (SPARK-4829) eliminate expressions calculation in count expression
[ https://issues.apache.org/jira/browse/SPARK-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242413#comment-14242413 ] Apache Spark commented on SPARK-4829: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/3676 eliminate expressions calculation in count expression - Key: SPARK-4829 URL: https://issues.apache.org/jira/browse/SPARK-4829 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242436#comment-14242436 ] Paulo Motta commented on SPARK-2984: We're also facing a similar issue when using S3N, described in detail on this thread: https://www.mail-archive.com/user@spark.apache.org/msg17253.html Here is the relevant exception: {code} 2014-12-10 19:05:13,823 ERROR [sparkDriver-akka.actor.default-dispatcher-16] scheduler.JobScheduler (Logging.scala:logError(96)) - Error running job streaming job 141823830 ms.0 java.io.FileNotFoundException: File s3n://BUCKET/_temporary/0/task_201412101900_0039_m_33 does not exist. at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:506) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:995) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:878) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:845) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:803) at MyDumperClass$$anonfun$main$1.apply(IncrementalDumpsJenkins.scala:100) at MyDumperClass$$anonfun$main$1.apply(IncrementalDumpsJenkins.scala:79) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at
org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2014-12-10 19:05:13,829 INFO [Driver] yarn.ApplicationMaster (Logging.scala:logInfo(59)) - Unregistering ApplicationMaster with FAILED {code} I'm quite sure it has to do with eventual consistency on S3: files published to S3 are often not promptly visible (for instance, when you try {{s3cmd ls s3://mybucket/whatever}} soon after a file is posted); they only appear after a few seconds. Is there already a configuration for retrying to fetch S3 files if they're not found (maybe with some kind of exponential backoff)? Maybe this could be a solution, if it's not yet available. FileNotFoundException on _temporary directory - Key: SPARK-2984 URL: https://issues.apache.org/jira/browse/SPARK-2984 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Ash Priority: Critical We've seen several stacktraces and threads on the user mailing list where people are having issues with a {{FileNotFoundException}} stemming from an HDFS path containing {{_temporary}}. I ([~aash]) think this may be related to {{spark.speculation}}. I think the error condition might manifest in this circumstance: 1) task T starts on an executor E1 2) it takes a long time, so task T' is started on another executor E2 3) T finishes in E1 so moves its data from {{_temporary}} to the final destination and deletes the {{_temporary}} directory during cleanup 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but those files no longer exist!
Some samples: {noformat} 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0 java.io.FileNotFoundException: File hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) at
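Paulo asks whether a retry-with-exponential-backoff configuration exists for not-yet-visible S3 files; the thread doesn't cite one, but the idea can be sketched with a generic helper (hypothetical name, not an actual Spark or Hadoop API):

```java
import java.util.concurrent.Callable;

class Retry {
    // Retry an action with exponential backoff (base, 2*base, 4*base, ...);
    // rethrows the last exception once maxAttempts is exhausted.
    static <T> T withBackoff(Callable<T> action, int maxAttempts, long baseDelayMs) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e;
                Thread.sleep(baseDelayMs << (attempt - 1)); // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate an eventually-consistent listing: fails twice, then succeeds.
        int[] calls = {0};
        String r = withBackoff(() -> {
            if (++calls[0] < 3) throw new java.io.FileNotFoundException("not visible yet");
            return "found";
        }, 5, 10);
        System.out.println(r + " after " + calls[0] + " attempts");
    }
}
```

A jittered delay would be preferable in production to avoid synchronized retries across tasks.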
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242486#comment-14242486 ] Meethu Mathew commented on SPARK-4156: -- [~tgaloppo] The current version of the code has no predict function to return the cluster labels, i.e., the index of the cluster to which the point has maximum membership. We have written a predict function to return the cluster labels and the membership values. We would be happy to contribute this to your code. cc [~mengxr] Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Assignee: Travis Galoppo As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models
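The predict function described here is essentially an argmax over a point's per-cluster membership (responsibility) values produced by the EM E-step. A minimal sketch of that rule (hypothetical class, not the actual MLlib API):

```java
class GmmPredict {
    // Hard cluster assignment: index of the maximum posterior membership value.
    // membership[k] is the point's responsibility for cluster k (sums to ~1).
    static int predict(double[] membership) {
        int best = 0;
        for (int k = 1; k < membership.length; k++) {
            if (membership[k] > membership[best]) best = k;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(predict(new double[]{0.1, 0.7, 0.2})); // 1
    }
}
```

Returning both the label and the membership vector, as the comment proposes, preserves the soft-assignment information that the argmax alone discards.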
[jira] [Created] (SPARK-4830) Spark Java Application : java.lang.ClassNotFoundException
Mykhaylo Telizhyn created SPARK-4830: Summary: Spark Java Application : java.lang.ClassNotFoundException Key: SPARK-4830 URL: https://issues.apache.org/jira/browse/SPARK-4830 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Mykhaylo Telizhyn
[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering
[ https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242494#comment-14242494 ] Travis Galoppo commented on SPARK-4156: --- [~MeethuMathew] This would be great! If possible, please issue a pull request against my repo and I will merge it in as soon as possible. Add expectation maximization for Gaussian mixture models to MLLib clustering Key: SPARK-4156 URL: https://issues.apache.org/jira/browse/SPARK-4156 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Travis Galoppo Assignee: Travis Galoppo As an additional clustering algorithm, implement expectation maximization for Gaussian mixture models
[jira] [Updated] (SPARK-4830) Spark Java Application : java.lang.ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mykhaylo Telizhyn updated SPARK-4830: - Description: We have a Spark Streaming application that consumes messages from RabbitMQ and processes them. When generating hundreds of events on RabbitMQ and running our application on a Spark standalone cluster we see some java.lang.ClassNotFoundException exceptions in the log. Application Overview: Our domain model is a simple POJO that represents the RabbitMQ events we want to consume and contains some custom properties we are interested in: {code:title=Event.java|borderStyle=solid}
class Event implements java.io.Externalizable {
    // custom properties
    // custom implementation of writeExternal(), readExternal() methods
}
{code} We have implemented a custom Spark receiver that just receives messages from a RabbitMQ queue by means of a custom consumer (see Receiving messages by subscription at https://www.rabbitmq.com/api-guide.html), converts them to our custom domain event objects (com.xxx.Event) and stores them in Spark memory: {code:title=RabbitMQReceiver.java|borderStyle=solid}
byte[] body = // data received from Rabbit using custom consumer
Event event = new Event(body);
store(event); // store into Spark
{code} The main program is simple; it just sets up the Spark Streaming context: {code:title=SparkApplication.java|borderStyle=solid}
SparkConf sparkConf = new SparkConf().setAppName(APPLICATION_NAME);
sparkConf.setJars(SparkContext.jarOfClass(Application.class).toList());
{code} Initialize input streams: {code:title=SparkApplication.java|borderStyle=solid}
ReceiverInputDStream<Event> stream = // create input stream from RabbitMQ
JavaReceiverInputDStream<Event> events = new JavaReceiverInputDStream<Event>(stream, classTag(Event.class));
{code} Process events: {code:title=SparkApplication.java|borderStyle=solid}
events.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> {
        // process partition
    });
});
ssc.start();
ssc.awaitTermination();
{code} Application submission: The application is packaged into a single fat jar using the Maven Shade plugin (http://maven.apache.org/plugins/maven-shade-plugin/). It is compiled with Spark version 1.1.0. We run our application on a Spark 1.1.0 standalone cluster that consists of a driver host, a master host and two worker hosts. We submit the application from the driver host. On one of the workers we see java.lang.ClassNotFoundException exceptions. We see that the worker has downloaded application.jar and added it to the class loader: 14/11/27 10:26:59 INFO Executor: Fetching http://xx.xx.xx.xx:38287/jars/application.jar with timestamp 1417084011213 14/11/27 10:26:59 INFO Utils: Fetching http://xx.xx.xx.xx:38287/jars/application.jar to /tmp/fetchFileTemp8223721356974787443.tmp 14/11/27 10:27:00 INFO BlockManager: Removing RDD 4 14/11/27 10:27:01 INFO Executor: Adding file:/path/to/spark/work/app-20141127102651-0001/1/./application.jar to class loader ... 14/11/27 10:27:10 ERROR BlockManagerWorker: Exception handling buffer message java.lang.ClassNotFoundException: com.xxx.Event at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:344) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at
[jira] [Commented] (SPARK-4526) Gradient should be added batch computing interface
[ https://issues.apache.org/jira/browse/SPARK-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242504#comment-14242504 ] Apache Spark commented on SPARK-4526: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/3677 Gradient should be added batch computing interface -- Key: SPARK-4526 URL: https://issues.apache.org/jira/browse/SPARK-4526 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li If Gradient supported batch computing, we could use efficient numerical libraries (e.g., BLAS). In some cases this can improve performance by more than ten times.
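The point of a batch interface is that the gradient for a whole batch can be phrased as matrix operations, which a BLAS library can execute far faster than per-example loops. As an illustration only (not the MLlib Gradient interface), a plain-Java sketch of the batched least-squares gradient X^T(Xw - y), with explicit loops standing in for the BLAS GEMV calls:

```java
class BatchGradient {
    // Gradient of 0.5 * ||Xw - y||^2 over a batch is X^T (Xw - y).
    // Phrased this way, both steps map directly onto BLAS matrix-vector ops.
    static double[] gradient(double[][] X, double[] y, double[] w) {
        int n = X.length, d = w.length;
        double[] r = new double[n];               // residuals r = Xw - y
        for (int i = 0; i < n; i++) {
            double dot = 0.0;
            for (int j = 0; j < d; j++) dot += X[i][j] * w[j];
            r[i] = dot - y[i];
        }
        double[] g = new double[d];               // g = X^T r
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++) g[j] += X[i][j] * r[i];
        return g;
    }
}
```

A real implementation would hand X to a native BLAS routine once per batch instead of looping, which is where the order-of-magnitude speedup comes from.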
[jira] [Updated] (SPARK-4830) Spark Java Application : java.lang.ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mykhaylo Telizhyn updated SPARK-4830: - Description: h4. Application Overview: We have a Spark Streaming application that consumes messages from RabbitMQ and processes them. When generating hundreds of events on RabbitMQ and running our application on a Spark standalone cluster we see some {{java.lang.ClassNotFoundException}} exceptions in the log. Our domain model is a simple POJO that represents the RabbitMQ events we want to consume and contains some custom properties we are interested in: {code:title=com.xxx.Event.java|borderStyle=solid}
public class Event implements java.io.Externalizable {
    // custom properties
    // custom implementation of writeExternal(), readExternal() methods
}
{code} We have implemented a custom Spark Streaming receiver that just receives messages from a RabbitMQ queue by means of a custom consumer (see _Receiving messages by subscription_ at https://www.rabbitmq.com/api-guide.html), converts them to our custom domain event objects ({{com.xxx.Event}}) and stores them in Spark memory: {code:title=RabbitMQReceiver.java|borderStyle=solid}
byte[] body = // data received from Rabbit using custom consumer
Event event = new Event(body);
store(event); // store into Spark
{code} The main program is simple; it just sets up the Spark Streaming context: {code:title=Application.java|borderStyle=solid}
SparkConf sparkConf = new SparkConf().setAppName(APPLICATION_NAME);
sparkConf.setJars(SparkContext.jarOfClass(Application.class).toList());
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, new Duration(BATCH_DURATION_MS));
{code} Initialize input streams: {code:title=Application.java|borderStyle=solid}
ReceiverInputDStream<Event> stream = // create input stream from RabbitMQ
JavaReceiverInputDStream<Event> events = new JavaReceiverInputDStream<Event>(stream, classTag(Event.class));
{code} Process events: {code:title=Application.java|borderStyle=solid}
events.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> {
        // process partition
    });
});
ssc.start();
ssc.awaitTermination();
{code} h4. Application submission: The application is packaged as a single fat jar file using the Maven Shade plugin (http://maven.apache.org/plugins/maven-shade-plugin/). It is compiled with Spark version _1.1.0_. We run our application on a Spark version _1.1.0_ standalone cluster that consists of a driver host, a master host and two worker hosts. We submit the application from the driver host. On one of the workers we see {{java.lang.ClassNotFoundException}} exceptions: {panel:title=app.log|borderStyle=dashed|borderColor=#ccc|titleBGColor=#e3e4e1|bgColor=#f0f8ff} 14/11/27 10:27:10 ERROR BlockManagerWorker: Exception handling buffer message java.lang.ClassNotFoundException: com.xxx.Event at java.net.URLClassLoader$1.run(URLClassLoader.java:372) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:360) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:344) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126) at
[jira] [Updated] (SPARK-4830) Spark Streaming Java Application : java.lang.ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mykhaylo Telizhyn updated SPARK-4830: - Summary: Spark Streaming Java Application : java.lang.ClassNotFoundException (was: Spark Java Application : java.lang.ClassNotFoundException) Spark Streaming Java Application : java.lang.ClassNotFoundException --- Key: SPARK-4830 URL: https://issues.apache.org/jira/browse/SPARK-4830 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Mykhaylo Telizhyn
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242547#comment-14242547 ] koert kuipers commented on SPARK-3655: -- i updated the pullreq to use Iterables instead of TraversableOnce i also wanted to take this opportunity to one more time make a pitch for foldLeft. i think we should implement foldLeft because 1) it is a well known operation that perfectly fits many problems such as time series analysis 2) it does not need to make the in-memory assumption for the sorted values, which is crucial for a lot of problems 3) it is (i think?) the most basic api that does not need values in memory, since it uses a repeated operation that uses the values like a Traversable and builds the return value. no Iterator or TraversableOnce is exposed, so it does not have potential strange interactions with things like caching and downstream shuffles. 4) groupByKeysAndSortValues (which does keep values in memory) can be expressed in foldLeft trivially: groupByKeysAndSortValues(valueOrdering) = foldLeftByKey(valueOrdering, new ArrayBuffer[V])(_ += _) Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.
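The foldLeft pitch above can be made concrete with a single-machine sketch of what foldLeftByKey semantics would look like: each key's values are visited in sorted order exactly once and folded into an accumulator, so the sorted values never need to be materialized as a group. This is a hypothetical signature for illustration; the actual proposal operates on RDDs:

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Supplier;

class FoldLeftByKey {
    // Fold each key's values in sorted order; the fold consumes values one at a
    // time (like a Traversable), never exposing an in-memory group.
    static <K, V, B> Map<K, B> foldLeftByKey(List<Map.Entry<K, V>> pairs,
                                             Comparator<V> ordering,
                                             Supplier<B> zero,
                                             BiFunction<B, V, B> op) {
        Map<K, List<V>> byKey = new HashMap<>();
        for (Map.Entry<K, V> p : pairs)
            byKey.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        Map<K, B> out = new HashMap<>();
        for (Map.Entry<K, List<V>> e : byKey.entrySet()) {
            e.getValue().sort(ordering);          // secondary sort on values
            B acc = zero.get();
            for (V v : e.getValue()) acc = op.apply(acc, v);  // foldLeft
            out.put(e.getKey(), acc);
        }
        return out;
    }
}
```

As koert's point 4 notes, grouping-with-sorted-values falls out as the special case where the zero is an empty buffer and the operation is append.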
[jira] [Resolved] (SPARK-4806) Update Streaming Programming Guide for Spark 1.2
[ https://issues.apache.org/jira/browse/SPARK-4806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4806. -- Resolution: Done Target Version/s: 1.2.0, 1.3.0 (was: 1.2.0) Update Streaming Programming Guide for Spark 1.2 Key: SPARK-4806 URL: https://issues.apache.org/jira/browse/SPARK-4806 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Important updates to the streaming programming guide - Make the fault-tolerance properties easier to understand, with information about write ahead logs - Update the information about deploying the spark streaming app with information about Driver HA - Update Receiver guide to discuss reliable vs unreliable receivers.
[jira] [Commented] (SPARK-4814) Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
[ https://issues.apache.org/jira/browse/SPARK-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242611#comment-14242611 ] Cheng Lian commented on SPARK-4814: --- This assertion failure seems to be related to details of the RCFile format. I haven't found the root cause. Tested the original {{alter_merge_2.q}} test case under Hive 0.13.1 by running {code} $ cd itest/qtest $ mvn -Dtest=TestCliDriver -Phadoop-2 -Dqfile=alter_merge_2.q test {code} didn't observe similar {{AssertionError}}. Keep digging. Although in this case the assertion failure doesn't affect correctness, I'm not sure whether it's generally safe to ignore it... Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger Key: SPARK-4814 URL: https://issues.apache.org/jira/browse/SPARK-4814 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.1.0 Reporter: Sean Owen Follow up to SPARK-4159, wherein we noticed that Java tests weren't running in Maven, in part because a Java test actually fails with {{AssertionError}}. That code/test was fixed in SPARK-4850. The reason it wasn't caught by SBT tests was that they don't run with assertions on, and Maven's surefire does. 
Turning on assertions in the SBT build is trivial, adding one line: {code} javaOptions in Test += "-ea", {code} This reveals a test failure in the Scala test suites, though: {code} [info] - alter_merge_2 *** FAILED *** (1 second, 305 milliseconds) [info] Failed to execute query using catalyst: [info] Error: Job aborted due to stage failure: Task 1 in stage 551.0 failed 1 times, most recent failure: Lost task 1.0 in stage 551.0 (TID 1532, localhost): java.lang.AssertionError [info] at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryInteger.<init>(LazyBinaryInteger.java:51) [info] at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110) [info] at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171) [info] at org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166) [info] at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:318) [info] at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) [info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:132) [info] at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128) [info] at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) [info] at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:615) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:231) [info] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [info] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264) [info] at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
[info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) [info] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) [info] at org.apache.spark.scheduler.Task.run(Task.scala:56) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} The items for this JIRA are therefore: - Enable assertions in SBT - Fix this failure - Figure out why Maven scalatest didn't trigger it - may need assertions explicitly turned on too.
[jira] [Commented] (SPARK-4740) Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey
[ https://issues.apache.org/jira/browse/SPARK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242656#comment-14242656 ] Zhang, Liye commented on SPARK-4740: Hi [~adav], I missed there is another patch from [~rxin] set connectionsPerPeer to 1, the result in my last comment is with the default value. I have made another test with connectionsPerPeer set to 2 on 8HDDs, the result is a little better than connectionsPerPeer=1, but still can not compete with NIO. Seems the unbalance of Netty is not introduced by rxin's patch. It exists in the original master branch with HDD. Hi [~rxin], I tested your patch https://github.com/apache/spark/pull/3667 with 8HDDs and with spark.executor.memory=48GB, the result shows this patch doesn't get better performance, the reduce time with patch is longer than without the patch (37mins VS 31 mins). Netty's network throughput is about 1/2 of NIO's in spark-perf sortByKey Key: SPARK-4740 URL: https://issues.apache.org/jira/browse/SPARK-4740 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Zhang, Liye Assignee: Reynold Xin Attachments: (rxin patch better executor)TestRunner sort-by-key - Thread dump for executor 3_files.zip, (rxin patch normal executor)TestRunner sort-by-key - Thread dump for executor 0 _files.zip, Spark-perf Test Report 16 Cores per Executor.pdf, Spark-perf Test Report.pdf, TestRunner sort-by-key - Thread dump for executor 1_files (Netty-48 Cores per node).zip, TestRunner sort-by-key - Thread dump for executor 1_files (Nio-48 cores per node).zip, rxin_patch-on_4_node_cluster_48CoresPerNode(Unbalance).7z When testing current spark master (1.3.0-snapshot) with spark-perf (sort-by-key, aggregate-by-key, etc), Netty based shuffle transferService takes much longer time than NIO based shuffle transferService. The network throughput of Netty is only about half of that of NIO. 
We tested in standalone mode; the data set used for the test is 20 billion records, about 400 GB in total. The spark-perf test runs on a 4-node cluster with 10G NIC, 48 CPU cores per node, and 64 GB of executor memory per node. The number of reduce tasks is set to 1000.
[jira] [Created] (SPARK-4831) Current directory always on classpath with spark-submit
Daniel Darabos created SPARK-4831: - Summary: Current directory always on classpath with spark-submit Key: SPARK-4831 URL: https://issues.apache.org/jira/browse/SPARK-4831 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.1, 1.2.0 Reporter: Daniel Darabos Priority: Minor We had a situation where we were launching an application with spark-submit, and a file (play.plugins) was on the classpath twice, causing problems (trying to register plugins twice). Upon investigating how it got on the classpath twice, we found that it was present in one of our jars, and also in the current working directory. But the one in the current working directory should not be on the classpath. We never asked spark-submit to put the current directory on the classpath. I think this is caused by a line in [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]: {code} CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH {code} Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, which means the current working directory. We tried setting SPARK_CLASSPATH to a bogus value, but that is [not allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312]. What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can send a pull request for that I think. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2980) Python support for chi-squared test
[ https://issues.apache.org/jira/browse/SPARK-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242718#comment-14242718 ] Apache Spark commented on SPARK-2980: - User 'jbencook' has created a pull request for this issue: https://github.com/apache/spark/pull/3679 Python support for chi-squared test --- Key: SPARK-2980 URL: https://issues.apache.org/jira/browse/SPARK-2980 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4831) Current directory always on classpath with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242778#comment-14242778 ] Sean Owen commented on SPARK-4831: -- Hm, so I made a quick test, where I put a class {{Foo.class}} inside {{Foo.jar}} and then ran {{java -cp :otherstuff.jar Foo}}. It does not find the class, which suggests to me that it does not interpret that empty entry as meaning the local directory too. It doesn't work even if I put . on the classpath. That makes sense: the working directory contains JARs, in your case, not classes. However, it does find it if I leave {{Foo.class}} in the working directory, *if* I have an empty entry in the classpath. Is it perhaps finding an exploded directory of classes? Otherwise, I can't repro this directly in Java, I suppose. Current directory always on classpath with spark-submit --- Key: SPARK-4831 URL: https://issues.apache.org/jira/browse/SPARK-4831 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.1, 1.2.0 Reporter: Daniel Darabos Priority: Minor We had a situation where we were launching an application with spark-submit, and a file (play.plugins) was on the classpath twice, causing problems (trying to register plugins twice). Upon investigating how it got on the classpath twice, we found that it was present in one of our jars, and also in the current working directory. But the one in the current working directory should not be on the classpath. We never asked spark-submit to put the current directory on the classpath. I think this is caused by a line in [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]: {code} CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH {code} Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, which means the current working directory.
We tried setting SPARK_CLASSPATH to a bogus value, but that is [not allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312]. What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can send a pull request for that I think. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
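The fix Daniel proposes (only add SPARK_CLASSPATH when it's non-empty) boils down to joining classpath fragments while skipping empty ones. The actual script is bash; the sketch below just illustrates the joining logic and why the naive concatenation reproduces the bug (the jar names are made up):

```python
def build_classpath(*parts):
    """Join classpath fragments, skipping empty ones, so no empty
    ':'-delimited entry (which the JVM treats as the current working
    directory) sneaks onto the classpath."""
    return ":".join(p for p in parts if p)

# naive concatenation reproduces the bug: a leading ':' means an empty
# first entry, which the JVM interprets as the current directory
spark_classpath = ""  # the common case: SPARK_CLASSPATH was never set
submit_classpath = "myapp.jar:spark-assembly.jar"
naive = spark_classpath + ":" + submit_classpath
# naive.split(":")[0] is "" -> working directory ends up on the classpath

fixed = build_classpath(spark_classpath, submit_classpath)
# fixed has no empty entry: "myapp.jar:spark-assembly.jar"
```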
[jira] [Reopened] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped
[ https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-2892: -- OK. The issues may have a common cause but that can be deferred until the other JIRA is resolved. If it turns out to resolve this, great. Socket Receiver does not stop when streaming context is stopped --- Key: SPARK-2892 URL: https://issues.apache.org/jira/browse/SPARK-2892 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Running NetworkWordCount with {quote} ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); Thread.sleep(6) {quote} gives the following error {quote} 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10047 ms on localhost (1/1) 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at ReceiverTracker.scala:275) finished in 10.056 s 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at ReceiverTracker.scala:275, took 10.179263 s 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been terminated 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 - ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,)) 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after time 1407375433000 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost: {quote} -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3677) pom.xml and SparkBuild.scala are wrong: Scalastyle is never applied to the sources under yarn/common
[ https://issues.apache.org/jira/browse/SPARK-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3677. -- Resolution: Not a Problem Target Version/s: (was: 1.1.2, 1.2.1) Obsoleted by the restructuring of the YARN code. See discussion in the PR. pom.xml and SparkBuild.scala are wrong: Scalastyle is never applied to the sources under yarn/common - Key: SPARK-3677 URL: https://issues.apache.org/jira/browse/SPARK-3677 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we run sbt -Pyarn scalastyle or mvn package, scalastyle is not applied to the sources under yarn/common. I think this is caused by the directory structure.
[jira] [Resolved] (SPARK-3918) Forget Unpersist in RandomForest.scala(train Method)
[ https://issues.apache.org/jira/browse/SPARK-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3918. -- Resolution: Fixed Great, looks like this was in fact fixed for 1.2 then. Forget Unpersist in RandomForest.scala(train Method) Key: SPARK-3918 URL: https://issues.apache.org/jira/browse/SPARK-3918 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: All Reporter: junlong Assignee: Joseph K. Bradley Labels: decisiontree, train, unpersist Fix For: 1.2.0 Original Estimate: 10m Remaining Estimate: 10m In version 1.1.0 DecisionTree.scala, train Method, treeInput has been persisted in Memory, but without unpersist. It caused heavy DISK usage. In github version(1.2.0 maybe), RandomForest.scala, train Method, baggedInput has been persisted but without unpersisted too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3918) Forget Unpersist in RandomForest.scala(train Method)
[ https://issues.apache.org/jira/browse/SPARK-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3918: - Target Version/s: 1.2.0 (was: 1.1.0) Affects Version/s: (was: 1.2.0) 1.1.0 Fix Version/s: (was: 1.1.0) 1.2.0 Forget Unpersist in RandomForest.scala(train Method) Key: SPARK-3918 URL: https://issues.apache.org/jira/browse/SPARK-3918 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Environment: All Reporter: junlong Assignee: Joseph K. Bradley Labels: decisiontree, train, unpersist Fix For: 1.2.0 Original Estimate: 10m Remaining Estimate: 10m In version 1.1.0 DecisionTree.scala, train Method, treeInput has been persisted in Memory, but without unpersist. It caused heavy DISK usage. In github version(1.2.0 maybe), RandomForest.scala, train Method, baggedInput has been persisted but without unpersisted too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4458) Skip compilation of tests classes when using make-distribution
[ https://issues.apache.org/jira/browse/SPARK-4458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4458. -- Resolution: Won't Fix Given PR discussion, looks like a WontFix. Skip compilation of tests classes when using make-distribution -- Key: SPARK-4458 URL: https://issues.apache.org/jira/browse/SPARK-4458 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0, 1.1.0 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242808#comment-14242808 ] Sean Owen commented on SPARK-4675: -- The lower dimensional space is of course smaller. This makes it faster and more efficient to work with, which is an advantage to be sure at scale. But the real reason is that the original high-dimensional space is extremely sparse. Standard similarity measures are undefined for most pairs, or are 0. It's sort of a symptom of the curse of dimensionality. Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
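The "similar products" lookup described in SPARK-4675 amounts to ranking items by cosine similarity of their latent factor vectors. A minimal sketch, assuming toy hand-written factors (a real MatrixFactorizationModel would learn these via ALS):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense factor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similar_products(product_factors, query_id, k=2):
    """Rank products by cosine similarity to the query product's latent
    factor vector; the same logic applies to user factors."""
    q = product_factors[query_id]
    scored = [(pid, cosine(q, f))
              for pid, f in product_factors.items() if pid != query_id]
    return sorted(scored, key=lambda t: -t[1])[:k]

# hypothetical 2-dimensional latent factors for three products
factors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0]}
top = similar_products(factors, 1, k=1)
# product 2 comes out most similar to product 1
```

Working in the dense, low-dimensional factor space sidesteps the sparsity problem Sean describes: every pair of factor vectors has a well-defined similarity.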
[jira] [Commented] (SPARK-4779) PySpark Shuffle Fails Looking for Files that Don't Exist when low on Memory
[ https://issues.apache.org/jira/browse/SPARK-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242831#comment-14242831 ] Ilya Ganelin commented on SPARK-4779: - I've seen this issue on Scala as well. This happens during large shuffles when an intermediate stage of the shuffle map/reduce fails due to memory constraints. I have not received any suggestions on how to resolve it short of increasing available memory and shuffling smaller sizes. PySpark Shuffle Fails Looking for Files that Don't Exist when low on Memory --- Key: SPARK-4779 URL: https://issues.apache.org/jira/browse/SPARK-4779 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.1.0 Environment: ec2 launched cluster with scripts 6 Nodes c3.2xlarge Reporter: Brad Willard When Spark is tight on memory it starts saying files don't exist during shuffle causing tasks to fail and be rebuilt destroying performance. The same code works flawlessly with smaller datasets with less memory pressure I assume. 
14/12/06 18:39:37 WARN scheduler.TaskSetManager: Lost task 292.0 in stage 3.0 (TID 1099, ip-10-13-192-209.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /root/spark/python/pyspark/worker.py, line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File /root/spark/python/pyspark/serializers.py, line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /root/spark/python/pyspark/serializers.py, line 127, in dump_stream for obj in iterator: File /root/spark/python/pyspark/serializers.py, line 185, in _batched for item in iterator: File /root/spark/python/pyspark/shuffle.py, line 370, in _external_items self.mergeCombiners(self.serializer.load_stream(open(p)), IOError: [Errno 2] No such file or directory: '/mnt/spark/spark-local-20141206182702-8748/python/16070/66618000/1/18' org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124) org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:91) org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:87) org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) scala.collection.Iterator$$anon$12.next(Iterator.scala:357) org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) scala.collection.Iterator$$anon$12.next(Iterator.scala:357) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:335) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184) org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs
[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242913#comment-14242913 ] Ilya Ganelin commented on SPARK-3533: - I am looking into a solution for this. Add saveAsTextFileByKey() method to RDDs Key: SPARK-3533 URL: https://issues.apache.org/jira/browse/SPARK-3533 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Nicholas Chammas Users often have a single RDD of key-value pairs that they want to save to multiple locations based on the keys. For example, say I have an RDD like this: {code} a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0]) a.collect() [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')] a.keys().distinct().collect() ['B', 'F', 'N'] {code} Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple {{part-}} files, one per RDD partition. So the output would look something like: {code} /path/prefix/B [/part-1, /part-2, etc] /path/prefix/F [/part-1, /part-2, etc] /path/prefix/N [/part-1, /part-2, etc] {code} Though it may be possible to do this with some combination of {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the {{MultipleTextOutputFormat}} output format class, it isn't straightforward. It's not clear if it's even possible at all in PySpark. Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that makes it easy to save RDDs out to multiple locations at once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
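The requested behavior (one output directory per distinct key, each holding part files) can be sketched as a single-process Python function. In real Spark this would go through {{MultipleTextOutputFormat}} and write one part file per partition; the sketch below collapses everything to one part file per key:

```python
import os
import tempfile
from collections import defaultdict

def save_as_text_file_by_key(records, prefix):
    """Local sketch of the proposed saveAsTextFileByKey(): route each
    (key, value) record into <prefix>/<key>/part-00000."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[key].append(value)
    for key, values in buckets.items():
        d = os.path.join(prefix, str(key))
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, "part-00000"), "w") as f:
            f.write("\n".join(values))

out = tempfile.mkdtemp()
data = [("N", "Nick"), ("N", "Nancy"), ("B", "Bob")]
save_as_text_file_by_key(data, out)
# produces <out>/N/part-00000 and <out>/B/part-00000
```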
[jira] [Commented] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242915#comment-14242915 ] Apache Spark commented on SPARK-4728: - User 'rnowling' has created a pull request for this issue: https://github.com/apache/spark/pull/3680 Add exponential, log normal, and gamma distributions to data generator to MLlib --- Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
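The three distributions the JIRA asks for are all available in the Python standard library, which makes the expected behavior easy to sanity-check. This uses the stdlib {{random}} samplers as a stand-in for the commons-math3 samplers MLlib would actually use:

```python
import random

random.seed(42)  # deterministic for the sanity checks below

n = 10_000
# exponential with rate lambda = 2 (mean 1/lambda = 0.5)
exponential = [random.expovariate(2.0) for _ in range(n)]
# log normal: exp of N(mu=0, sigma=1), so always positive
log_normal = [random.lognormvariate(0.0, 1.0) for _ in range(n)]
# gamma with shape 9 and scale 0.5 (mean = shape * scale = 4.5)
gamma = [random.gammavariate(9.0, 0.5) for _ in range(n)]

mean_exp = sum(exponential) / n  # should land near 0.5
mean_gam = sum(gamma) / n        # should land near 4.5
```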
[jira] [Commented] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242924#comment-14242924 ] RJ Nowling commented on SPARK-4728: --- I posted a PR for this issue: https://github.com/apache/spark/pull/3680 Add exponential, log normal, and gamma distributions to data generator to MLlib --- Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4728) Add exponential, log normal, and gamma distributions to data generator to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-4728: -- Comment: was deleted (was: I posted a PR for this issue: https://github.com/apache/spark/pull/3680) Add exponential, log normal, and gamma distributions to data generator to MLlib --- Key: SPARK-4728 URL: https://issues.apache.org/jira/browse/SPARK-4728 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.1.0 Reporter: RJ Nowling Priority: Minor MLlib supports sampling from normal, uniform, and Poisson distributions. I'd like to add support for sampling from exponential, gamma, and log normal distributions, using the features of math3 like the other generators. Please assign this to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4675) Find similar products and similar users in MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243026#comment-14243026 ] Debasish Das commented on SPARK-4675: - Is there a metric like MAP / AUC kind of measure that can help us validate similarUsers and similarProducts ? Right now if I run column similarities with sparse vector on matrix factorization datasets for product similarities, it will assume all unvisited entries (which should be ?) as 0 and compute column similarities for...If the sparse vector has ? in place of 0 then basically all similarity calculation is incorrect...so in that sense it makes more sense to compute the similarities on the matrix factors... But then we are back to map-reduce calculation of rowSimilarities. Find similar products and similar users in MatrixFactorizationModel --- Key: SPARK-4675 URL: https://issues.apache.org/jira/browse/SPARK-4675 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Steven Bourke Priority: Trivial Labels: mllib, recommender Using the latent feature space that is learnt in MatrixFactorizationModel, I have added 2 new functions to find similar products and similar users. A user of the API can for example pass a product ID, and get the closest products. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4455) Exclude dependency on hbase-annotations module
[ https://issues.apache.org/jira/browse/SPARK-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243044#comment-14243044 ] Sean Owen commented on SPARK-4455: -- FWIW this change also unfortunately causes compiler warnings, so hopefully it can be undone some day. But it's the right thing for now. {code} [warn] Class org.apache.hadoop.hbase.classification.InterfaceAudience not found - continuing with a stub. [warn] Class org.apache.hadoop.hbase.classification.InterfaceStability not found - continuing with a stub. [warn] Class org.apache.hadoop.hbase.classification.InterfaceAudience not found - continuing with a stub. [warn] Class org.apache.hadoop.hbase.classification.InterfaceStability not found - continuing with a stub. [warn] Class org.apache.hadoop.hbase.classification.InterfaceAudience not found - continuing with a stub. ... {code} Exclude dependency on hbase-annotations module -- Key: SPARK-4455 URL: https://issues.apache.org/jira/browse/SPARK-4455 Project: Spark Issue Type: Bug Reporter: Ted Yu Assignee: Ted Yu Fix For: 1.2.0 As Patrick mentioned in the thread 'Has anyone else observed this build break?' : The error I've seen is this when building the examples project: {code} spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.2.0-SNAPSHOT: Could not find artifact jdk.tools:jdk.tools:jar:1.7 at specified path /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/../lib/tools.jar {code} The reason for this error is that hbase-annotations is using a system scoped dependency in their hbase-annotations pom, and this doesn't work with certain JDK layouts such as that provided on Mac OS: http://central.maven.org/maven2/org/apache/hbase/hbase-annotations/0.98.7-hadoop2/hbase-annotations-0.98.7-hadoop2.pom hbase-annotations module is transitively brought in through other HBase modules, we should exclude it from related modules. 
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243048#comment-14243048 ] Debasish Das commented on SPARK-4823: - [~srowen] did you implement map-reduce row similarities for user factors? What's the algorithm that you used? Any pointers will be really helpful... rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243055#comment-14243055 ] Sean Owen commented on SPARK-4823: -- I don't think MapReduce matters here. You can compute pairs of similarities with any framework, or try to do it on the fly. It's no different from column similarities, right? I don't think there's anything more to it than applying a similarity metric to all pairs of vectors. I think the JIRA is about exposing a method just for API convenience, not because it's conceptually different.
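The "similarity metric applied to all pairs of vectors" idea from the comment above can be sketched in plain Scala (this is purely illustrative and not MLlib's API; the names are hypothetical, and a real rowSimilarities would have to avoid materializing all ~n^2/2 pairs):

```scala
// Brute-force all-pairs cosine similarity over row vectors; a minimal
// sketch of what a rowSimilarities method would compute, not how MLlib
// would (or should) implement it at scale.
object RowSimilaritySketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // All distinct pairs (i, j) with i < j, with their cosine similarity.
  def rowSimilarities(rows: Array[Array[Double]]): Seq[((Int, Int), Double)] =
    for {
      i <- rows.indices
      j <- (i + 1) until rows.length
    } yield ((i, j), cosine(rows(i), rows(j)))
}
```

This quadratic enumeration is exactly what the issue description warns against for > 10^6 rows, which is why the JIRA asks for something smarter than brute force.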
[jira] [Commented] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243072#comment-14243072 ] Sean Owen commented on SPARK-1412: -- The PRs for SPARK-2253 and SPARK-1412 were abandoned. Are both a WontFix? Disable partial aggregation automatically when reduction factor is low -- Key: SPARK-1412 URL: https://issues.apache.org/jira/browse/SPARK-1412 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Michael Armbrust Priority: Minor Fix For: 1.2.0 Once we have seen enough rows in partial aggregation without observing any reduction, the aggregate operator should just turn off partial aggregation.
[jira] [Commented] (SPARK-4831) Current directory always on classpath with spark-submit
[ https://issues.apache.org/jira/browse/SPARK-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243074#comment-14243074 ] Daniel Darabos commented on SPARK-4831: --- bq. Is it perhaps finding an exploded directory of classes? Yes, that is exactly the situation. One instance of the file is in a jar, another is just there (free-floating) in the directory. It is a configuration file. (Actually it's in a conf directory, but Play looks for both play.plugins and conf/play.plugins with getResources in the classpath. So it finds both the copy inside the generated jar and the one in the conf directory of the project. We can of course work around this in numerous ways.) I think there is no reason for spark-submit to add an empty entry to the classpath. It will just lead to accidents like ours. If the user wants to add an empty entry, they can easily do so. I've sent https://github.com/apache/spark/pull/3678 as a possible fix. Thanks for investigating! Current directory always on classpath with spark-submit --- Key: SPARK-4831 URL: https://issues.apache.org/jira/browse/SPARK-4831 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.1, 1.2.0 Reporter: Daniel Darabos Priority: Minor We had a situation where we were launching an application with spark-submit, and a file (play.plugins) was on the classpath twice, causing problems (trying to register plugins twice). Upon investigating how it got on the classpath twice, we found that it was present in one of our jars, and also in the current working directory. But the one in the current working directory should not be on the classpath. We never asked spark-submit to put the current directory on the classpath.
I think this is caused by a line in [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]: {code} CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH {code} Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, which means the current working directory. We tried setting SPARK_CLASSPATH to a bogus value, but that is [not allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312]. What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can send a pull request for that I think. Thanks!
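The fix proposed above (only include SPARK_CLASSPATH when it is non-empty) amounts to joining classpath fragments while skipping empty ones, so that no implicit current-directory entry sneaks in. A minimal sketch of that idea (illustrative Scala, not the actual shell change in compute-classpath.sh; the function name is hypothetical):

```scala
// Join classpath fragments, dropping empty ones so that a string like
// ":" or "a::b" -- which the JVM treats as a current-directory entry --
// can never be produced.
def joinClasspath(entries: Seq[String]): String =
  entries.filter(_.nonEmpty).mkString(":")
```

With this, an unset SPARK_CLASSPATH simply contributes nothing, instead of contributing an empty entry that resolves to the working directory.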
[jira] [Commented] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243077#comment-14243077 ] Reynold Xin commented on SPARK-1412: I think we should still do it - it's just that the current AppendOnlyMap isn't really built for it. We will probably revisit this in the future.
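The heuristic the ticket describes, watch the first batch of input rows and give up on partial aggregation if the map is not shrinking the data, can be sketched like this (names and thresholds are hypothetical, and this is not Spark's Aggregator or AppendOnlyMap):

```scala
import scala.collection.mutable

// After `sampleSize` input rows, compare the number of distinct keys to
// the number of rows seen; if the reduction factor is too low, stop
// merging and pass rows through unaggregated to save memory.
class AdaptivePartialAgg[K, V](merge: (V, V) => V,
                               sampleSize: Int = 1000,
                               minReduction: Double = 0.5) {
  private val map = mutable.Map.empty[K, V]
  private var rowsSeen = 0
  private var passThrough = false
  private val skipped = mutable.ArrayBuffer.empty[(K, V)]

  def insert(k: K, v: V): Unit = {
    rowsSeen += 1
    if (passThrough) { skipped += ((k, v)); return }
    map.get(k) match {
      case Some(old) => map(k) = merge(old, v)
      case None      => map(k) = v
    }
    // Low reduction factor observed: partial aggregation is not helping.
    if (rowsSeen == sampleSize && map.size > minReduction * rowsSeen)
      passThrough = true
  }

  def output: Seq[(K, V)] = map.toSeq ++ skipped
}
```

For a high-cardinality aggregation the map stops growing after the sample window and rows flow straight to the shuffle; for a low-cardinality one the map keeps reducing as before.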
[jira] [Updated] (SPARK-1412) Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1412: --- Fix Version/s: (was: 1.2.0)
[jira] [Updated] (SPARK-1412) [SQL] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1412: --- Summary: [SQL] Disable partial aggregation automatically when reduction factor is low (was: Disable partial aggregation automatically when reduction factor is low)
[jira] [Resolved] (SPARK-1627) Support external aggregation in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-1627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1627. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) The discussion in https://github.com/apache/spark/pull/867 suggests this was subsumed by SPARK-2873. Support external aggregation in Spark SQL - Key: SPARK-1627 URL: https://issues.apache.org/jira/browse/SPARK-1627 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Chen Chao The Spark SQL Aggregator does not yet support external sorting, which is extremely important when data is much larger than memory.
[jira] [Updated] (SPARK-2253) [Core] Disable partial aggregation automatically when reduction factor is low
[ https://issues.apache.org/jira/browse/SPARK-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2253: --- Summary: [Core] Disable partial aggregation automatically when reduction factor is low (was: Disable partial aggregation automatically when reduction factor is low) [Core] Disable partial aggregation automatically when reduction factor is low - Key: SPARK-2253 URL: https://issues.apache.org/jira/browse/SPARK-2253 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Fix For: 1.3.0 Once we have seen enough rows in partial aggregation without observing any reduction, Aggregator should just turn off partial aggregation. This reduces memory usage for high cardinality aggregations. This one is for Spark core. There is another ticket tracking this for SQL.
[jira] [Resolved] (SPARK-1581) Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker
[ https://issues.apache.org/jira/browse/SPARK-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1581. -- Resolution: Won't Fix No follow-up from OP explaining the change, and so the PR was closed already. Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker --- Key: SPARK-1581 URL: https://issues.apache.org/jira/browse/SPARK-1581 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Christophe Clapp Priority: Minor Labels: Flume -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1559) Add conf dir to CLASSPATH in compute-classpath.sh dependent on whether SPARK_CONF_DIR is set
[ https://issues.apache.org/jira/browse/SPARK-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1559. -- Resolution: Duplicate The PR discussion suggests it was duplicated by the PR for SPARK-2058. Add conf dir to CLASSPATH in compute-classpath.sh dependent on whether SPARK_CONF_DIR is set Key: SPARK-1559 URL: https://issues.apache.org/jira/browse/SPARK-1559 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Albert Chu Priority: Minor Attachments: SPARK-1559.patch bin/load-spark-env.sh loads spark-env.sh from SPARK_CONF_DIR if it is set, or from $parent_dir/conf if it is not set. However, compute-classpath.sh adds $FWDIR/conf to the CLASSPATH regardless of whether SPARK_CONF_DIR is set. Attached patch fixes this. Pull request on github will also be sent.
[jira] [Commented] (SPARK-1532) provide option for more restrictive firewall rule in ec2/spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243086#comment-14243086 ] Sean Owen commented on SPARK-1532: -- [~foundart] Is this abandoned? your second PR was ready to merge but needed rebasing, then got closed. Looks like it was a good change that can be revived. provide option for more restrictive firewall rule in ec2/spark_ec2.py - Key: SPARK-1532 URL: https://issues.apache.org/jira/browse/SPARK-1532 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0 Reporter: Art Peel Priority: Minor When ec2/spark_ec2.py sets up firewall rules for various ports, it uses an extremely lenient hard-coded value for allowed IP addresses: '0.0.0.0/0' It would be very useful for deployments to allow specifying the allowed IP addresses as a command-line option to ec2/spark_ec2.py. This new configuration parameter should have as its default the current hard-coded value, '0.0.0.0/0', so the functionality of ec2/spark_ec2.py will change only for those users who specify the new option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1771) CoarseGrainedSchedulerBackend is not resilient to Akka restarts
[ https://issues.apache.org/jira/browse/SPARK-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1771. -- Resolution: Won't Fix The PR says this was abandoned in favor of SPARK-4004 CoarseGrainedSchedulerBackend is not resilient to Akka restarts --- Key: SPARK-1771 URL: https://issues.apache.org/jira/browse/SPARK-1771 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson The exception reported in SPARK-1769 was propagated through the CoarseGrainedSchedulerBackend, and caused an Actor restart of the DriverActor. Unfortunately, this actor does not seem to have been written with Akka restartability in mind. For instance, the new DriverActor has lost all state about the prior Executors without cleanly disconnecting them. This means that the driver actually has executors attached to it, but doesn't think it does, which leads to mayhem of various sorts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1700) PythonRDD leaks socket descriptors during cancellation
[ https://issues.apache.org/jira/browse/SPARK-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1700. -- Resolution: Fixed Fix Version/s: 1.0.0 The PR was https://github.com/apache/spark/pull/623 and says it was merged in 1.0. PythonRDD leaks socket descriptors during cancellation -- Key: SPARK-1700 URL: https://issues.apache.org/jira/browse/SPARK-1700 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Fix For: 1.0.0 Sockets from Spark to Python workers are not cleaned up over the duration of a job, causing the total number of opened file descriptors to grow to around the number of partitions in the job. Usually these go away if the job is successful, but in the case of cancellation (and possibly exceptions, though I haven't investigated), the socket file descriptors remain indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1888) enhance MEMORY_AND_DISK mode by dropping blocks in parallel
[ https://issues.apache.org/jira/browse/SPARK-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1888. -- Resolution: Duplicate From the end of the PR discussion, it looks like this was continued in SPARK-3000? The issue looks the same and the change looks like it touches the same files. enhance MEMORY_AND_DISK mode by dropping blocks in parallel --- Key: SPARK-1888 URL: https://issues.apache.org/jira/browse/SPARK-1888 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Wenchen Fan Assignee: Wenchen Fan Sometimes MEMORY_AND_DISK mode is slower than DISK_ONLY mode because of the lock on IO operations(dropping blocks in memory store). As the TODO says, the solution is: only synchronize the selecting of to-be-dropped blocks and do the dropping in parallel. I have a quick fix in my PR: https://github.com/apache/spark/pull/791#issuecomment-43567924 It's fragile currently but I'm working on it to make it more robust. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243101#comment-14243101 ] Ryan Williams commented on SPARK-4746: -- I don't have any experience with test tags, but this approach sounds good to me [~imranr]! integration tests should be separated from faster unit tests Key: SPARK-4746 URL: https://issues.apache.org/jira/browse/SPARK-4746 Project: Spark Issue Type: Bug Reporter: Imran Rashid Priority: Trivial Currently there isn't a good way for a developer to skip the longer integration tests. This can slow down local development. See http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html One option is to use scalatest's notion of test tags to tag all integration tests, so they could easily be skipped -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1880) Eliminate unnecessary job executions.
[ https://issues.apache.org/jira/browse/SPARK-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243103#comment-14243103 ] Sean Owen commented on SPARK-1880: -- Is this now a WontFix? The PR refers to this being subsumed by hash join changes related to SPARK-1800. Eliminate unnecessary job executions. - Key: SPARK-1880 URL: https://issues.apache.org/jira/browse/SPARK-1880 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin There are unnecessary job executions in {{BroadcastNestedLoopJoin}}. For an {{Inner}} or {{LeftOuter}} join, the preparation of {{rightOuterMatches}} for a {{RightOuter}} or {{FullOuter}} join is not necessary. And for {{RightOuter}} or {{FullOuter}}, it should use {{fold}} rather than {{count}} followed by {{reduce}}.
[jira] [Closed] (SPARK-1880) Eliminate unnecessary job executions.
[ https://issues.apache.org/jira/browse/SPARK-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-1880. -- Resolution: Won't Fix
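The SPARK-1880 description's point about {{fold}} versus {{count}}-then-{{reduce}} can be shown with plain Scala collections standing in for RDDs (a sketch, not the BroadcastNestedLoopJoin code): {{reduce}} throws on an empty collection, so guarding it needs a separate emptiness check, which in Spark means a whole extra job over the data, while {{fold}} handles the empty case with its zero element in a single pass.

```scala
// Two ways to sum, illustrating why fold is preferable: reduce needs a
// guarding pass (a count, i.e. an extra Spark job) to be safe on empty
// input, while fold is safe on empty input by construction.
def sumWithCountThenReduce(xs: Seq[Int]): Int =
  if (xs.isEmpty) 0            // the extra "count" pass
  else xs.reduce(_ + _)        // the actual aggregation pass

def sumWithFold(xs: Seq[Int]): Int =
  xs.fold(0)(_ + _)            // one pass, zero element covers empty input
```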
[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large
[ https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243109#comment-14243109 ] Sean Owen commented on SPARK-2016: -- Is this and SPARK-2017 now subsumed by SPARK-3644? The PRs for this and SPARK-2017 are closed, and discussion suggested it was to be continued in SPARK-3644. rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large Key: SPARK-2016 URL: https://issues.apache.org/jira/browse/SPARK-2016 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter Try running {code} sc.parallelize(1 to 100, 100).cache().count() {code} and open the storage UI for this RDD. It takes forever to load the page. When the number of partitions is very large, I think there are a few alternatives: 0. Only show the top 1000. 1. Pagination 2. Instead of grouping by RDD blocks, group by executors
[jira] [Resolved] (SPARK-2227) Support dfs command
[ https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2227. -- Resolution: Fixed Fix Version/s: 1.1.0 Looks like this was in fact merged in https://github.com/apache/spark/commit/51c8168377a89d20d0b2d7b9a28af58593a0fe0c Support dfs command - Key: SPARK-2227 URL: https://issues.apache.org/jira/browse/SPARK-2227 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Minor Fix For: 1.1.0 Potentially just delegate to Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2201. -- Resolution: Won't Fix I hope I understood this right, but the PR discussion seemed to end with suggesting that this would not go into Spark, but maybe a contrib repo, and that it was partly already implemented by other changes. Improve FlumeInputDStream's stability and make it scalable -- Key: SPARK-2201 URL: https://issues.apache.org/jira/browse/SPARK-2201 Project: Spark Issue Type: Improvement Components: Streaming Reporter: sunsc Currently: FlumeUtils.createStream(ssc, localhost, port); This means that only one Flume receiver can work with FlumeInputDStream, so the solution is not scalable. I use ZooKeeper to solve this problem: Spark Flume receivers register themselves under a ZooKeeper path when started, and a Flume agent gets the physical hosts and pushes events to them. Some work needs to be done here: 1. Receivers create temporary nodes in ZooKeeper, and listeners just watch those nodes. 2. When Spark FlumeReceivers start, they acquire a physical host (localhost's IP and an idle port) and register themselves with ZooKeeper. 3. A new Flume sink: in its appendEvents method, it gets the physical hosts and pushes data to them in a round-robin manner.
[jira] [Resolved] (SPARK-2193) Improve tasks' preferred locality by sorting tasks into a partial ordering
[ https://issues.apache.org/jira/browse/SPARK-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2193. -- Resolution: Won't Fix Last word appears to be that this was obviated by SPARK-2294 and https://github.com/apache/spark/pull/1313 Improve tasks' preferred locality by sorting tasks into a partial ordering --- Key: SPARK-2193 URL: https://issues.apache.org/jira/browse/SPARK-2193 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Zhihui Attachments: Improve Tasks Preferred Locality.pptx Now, the last executor(s) may not get their preferred task(s), although these tasks have been built into the pendingTasksForHosts map. Because executors pick up tasks sequentially, their preferred task(s) may be picked up by other executors. This can be eliminated by sorting tasks into a partial ordering: an executor picks up tasks by the host order of each task's preferredLocation, that is, it first picks up all tasks for which task.preferredLocations.1 = executor.hostName, then secondly…
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243133#comment-14243133 ] Reza Zadeh commented on SPARK-4823: --- Given that we're talking about RowMatrices, computing rowSimilarities the same way as columnSimilarities would require transposing the matrix, which is dangerous when the original matrix has many rows. RowMatrix assumes a single row should fit in memory on a single machine, but this might not happen after transposing a RowMatrix.
[jira] [Resolved] (SPARK-2127) Use application specific folders to dump metrics via CsvSink
[ https://issues.apache.org/jira/browse/SPARK-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2127. -- Resolution: Duplicate The PR that closes SPARK-3377 also closed the PR for this JIRA, and it looks like this is a subset of SPARK-3377. Use application specific folders to dump metrics via CsvSink Key: SPARK-2127 URL: https://issues.apache.org/jira/browse/SPARK-2127 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rahul Singhal Priority: Minor Currently when using the CsvSink, all applications' csv metrics are dumped in the root folder (configured via *.sink.csv.directory in metrics.properties). Also, some files that have common names (e.g. jvm.PS-MarkSweep.count.csv) are reused, and if one is running the same application multiple times, the metrics get appended to previously existing files. This makes it harder to parse these files and extract the information that one might be looking for. I suggest that a unique folder is created every time an application is run and used to dump the metrics from that particular run only. This unique folder could be created similar to the one that is currently created for logging application events (e.g. spark-pi-1402484928439).
[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large
[ https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243147#comment-14243147 ] Reynold Xin commented on SPARK-2016: cc [~andrewor14] can you comment on this?
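Two of the alternatives listed in the SPARK-2016 description, pagination and grouping by executor, can be sketched over a hypothetical per-block row type (illustrative only; not the actual storage UI code or its types):

```scala
// One table row per cached RDD block, as the storage page renders today.
case class BlockRow(executorId: String, sizeBytes: Long)

// Alternative 1: render only one page of block rows at a time.
def page(blocks: Seq[BlockRow], pageNum: Int, pageSize: Int): Seq[BlockRow] =
  blocks.slice(pageNum * pageSize, (pageNum + 1) * pageSize)

// Alternative 2: collapse blocks to one row per executor
// (block count, total cached bytes), bounding the table by cluster size.
def groupByExecutor(blocks: Seq[BlockRow]): Map[String, (Int, Long)] =
  blocks.groupBy(_.executorId)
    .map { case (exec, bs) => (exec, (bs.size, bs.map(_.sizeBytes).sum)) }
```

Either way, the page no longer renders a row per partition, which is what makes it unresponsive for very large partition counts.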
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243149#comment-14243149 ] Debasish Das commented on SPARK-2426: - [~mengxr] as per our discussion, QuadraticMinimizer and NNLS are both added to breeze and updated with breeze DenseMatrix and DenseVector...Inside breeze I did some interesting comparisons and that motivated me to port NNLS to breeze as well...I added all the testcases for QuadraticMinimizer and NNLS as well based on my experiments with MovieLens dataset... Here is the PR: https://github.com/scalanlp/breeze/pull/321 To run the Quadratic programming variants in breeze: runMain breeze.optimize.quadratic.QuadraticMinimizer 100 1 0.1 0.99 regParam = 0.1, beta = 0.99 is Elastic Net parameter It will randomly generate quadratic problems with 100 variables, 1 equality constraint and lower/upper bounds. This format is similar to PDCO QP generator (please look into my Matlab examples) 0.5x'Hx + c'x s.t Ax = B, lb <= x <= ub 1. Unconstrained minimization: breeze luSolve, cg and qp (dposv added to breeze through this PR) Minimize 0.5x'Hx + c'x ||qp - lu|| norm 4.312577233496585E-10 max-norm 1.3842793578078272E-10 ||cg - lu|| norm 4.167925029822007E-7 max-norm 1.0053204402282745E-7 dim 100 lu 86.007 qp 41.56 cg 102.627 ||qp - lu|| norm 4.267891623199082E-8 max-norm 6.681460718027665E-9 ||cg - lu|| norm 1.94497623480055E-7 max-norm 2.6288773824489908E-8 dim 500 lu 169.993 qp 78.001 cg 443.044 qp is faster than cg for smaller dimensions as expected. I also tried unconstrained BFGS but the results were not good. We are looking into it. 2.
Elastic Net formulation: 0.5 x'Hx + c'x + (1-beta)*L2(x) + beta*regParam*L1(x) beta = 0.99 Strong L1 regParam=0.1 ||owlqn - sparseqp|| norm 0.1653200701235298 inf-norm 0.051855911945906996 sparseQp 61.948 ms iters 227 owlqn 928.11 ms beta = 0.5 average L1 regParam=0.1 ||owlqn - sparseqp|| norm 0.15823773098501168 inf-norm 0.035153837685728107 sparseQp 69.934 ms iters 353 owlqn 882.104 ms beta = 0.01 mostly BFGS regParam=0.1 ||owlqn - sparseqp|| norm 0.17950035092790165 inf-norm 0.04718697692014828 sparseQp 80.411 ms iters 580 owlqn 988.313 ms ADMM based proximal formulation is faster for smaller dimension. Even as I scale dimension, I notice similar behavior: owlqn is taking longer to converge and the results are not the same. Look for example at the dim = 500 case: ||owlqn - sparseqp|| norm 10.946326189397649 inf-norm 1.412726586317294 sparseQp 830.593 ms iters 2417 owlqn 19848.932 ms I validated ADMM through Matlab scripts so there is something funky going on in OWLQN. 3. NNLS formulation: 0.5 x'Hx + c'x s.t x >= 0 Here I compared the ADMM based proximal formulation with the CG based projected gradient in NNLS. NNLS converges much more nicely, but its convergence criterion does not look the same as breeze CG's, though they should be the same. For now I ported it to breeze and we can call NNLS for x >= 0 and QuadraticMinimizer for other formulations dim = 100 posQp 16.367 ms iters 284 nnls 8.854 ms iters 107 dim = 500 posQp 303.184 ms iters 950 nnls 183.543 ms iters 517 NNLS on average looks 2X faster! 4. Bounds formulation: 0.5x'Hx + c'x s.t lb <= x <= ub Validated through Matlab scripts above. Here are the runtime numbers: dim = 100 boundsQp 15.654 ms iters 284 converged true dim = 500 boundsQp 311.613 ms iters 950 converged true 5. Equality and positivity: 0.5 x'Hx + c'x s.t \sum_i x_i = 1, x_i >= 0 Validated through Matlab scripts above.
Here are the runtime numbers:
dim = 100: Qp Equality 13.64 ms iters 184 converged true
dim = 500: Qp Equality 278.525 ms iters 890 converged true
With this change all copyrights are moved to breeze. Once it merges, I will update the Spark PR. With this change we can move the ALS code to breeze DenseMatrix and DenseVector as well. My focus next will be to get a truncated Newton method running for convex cost, since convex cost is required for PLSA, SVM and neural net formulations. I am still puzzled why BFGS/OWLQN is not working well for the unconstrained case / L1 optimization. If TRON works well for the unconstrained case, that's what I will use for NonlinearMinimizer. I am looking more into it. Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime
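To make the bounds formulation above (minimize 0.5x'Hx + c'x s.t. lb <= x <= ub) concrete, here is a minimal projected-gradient sketch in plain Scala. This is an illustration of the technique only, assuming a small dense H; it is not the breeze QuadraticMinimizer API, and all names are made up.

```scala
// Hypothetical sketch of the bounds-constrained QP: minimize 0.5*x'Hx + c'x
// subject to lb <= x <= ub, via projected gradient descent.
object BoundedQpSketch {
  // Repeats the step x <- clip(x - step * (Hx + c), lb, ub).
  def solve(h: Array[Array[Double]], c: Array[Double],
            lb: Array[Double], ub: Array[Double],
            step: Double, iters: Int): Array[Double] = {
    val n = c.length
    var x = Array.fill(n)(0.0)
    for (_ <- 0 until iters) {
      // Gradient of the quadratic objective: Hx + c.
      val grad = Array.tabulate(n) { i =>
        h(i).zip(x).map { case (hij, xj) => hij * xj }.sum + c(i)
      }
      // Gradient step followed by projection onto the box [lb, ub].
      x = Array.tabulate(n) { i =>
        math.min(ub(i), math.max(lb(i), x(i) - step * grad(i)))
      }
    }
    x
  }
}
```

For H = diag(2, 2) and c = (-2, -4), the unconstrained minimizer is (1, 2); with upper bound 1.5 the projected solution is (1, 1.5), which the sketch converges to.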
[jira] [Resolved] (SPARK-2381) streaming receiver crashed,but seems nothing happened
[ https://issues.apache.org/jira/browse/SPARK-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2381. -- Resolution: Won't Fix The PR comments didn't receive follow-up changes either, so per the comments here, this looks like WontFix. streaming receiver crashed, but seems nothing happened - Key: SPARK-2381 URL: https://issues.apache.org/jira/browse/SPARK-2381 Project: Spark Issue Type: Bug Components: Streaming Reporter: sunsc When we submit a streaming job and the receivers don't start normally, the application should stop itself.
[jira] [Commented] (SPARK-2016) rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large
[ https://issues.apache.org/jira/browse/SPARK-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243153#comment-14243153 ] Andrew Or commented on SPARK-2016: -- This was filed before SPARK-2316 (https://github.com/apache/spark/pull/1679) was fixed. At least on the backend side, this should now be much quicker than before. I don't know if we need some CSS magic to make the frontend side blazing fast too. Is this still reproducible? rdd in-memory storage UI becomes unresponsive when the number of RDD partitions is large Key: SPARK-2016 URL: https://issues.apache.org/jira/browse/SPARK-2016 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter Try running {code} sc.parallelize(1 to 100, 100).cache().count() {code} and open the storage UI for this RDD. It takes forever to load the page. When the number of partitions is very large, I think there are a few alternatives:
0. Only show the top 1000.
1. Pagination.
2. Instead of grouping by RDD blocks, group by executors.
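The pagination alternative above can be sketched in a few lines of Scala: instead of rendering every block row at once, the page slices a fixed-size window out of the full list. This is an illustrative sketch only; the names are made up and it is not the Spark UI code.

```scala
// Illustrative pagination over a large list of UI rows (e.g. RDD block entries):
// render only one fixed-size page at a time instead of the whole list.
object BlockPager {
  // The rows belonging to page `pageIndex` (0-based).
  def page[A](rows: Seq[A], pageSize: Int, pageIndex: Int): Seq[A] =
    rows.slice(pageIndex * pageSize, (pageIndex + 1) * pageSize)

  // Total number of pages needed for `total` rows (ceiling division).
  def pageCount(total: Int, pageSize: Int): Int =
    (total + pageSize - 1) / pageSize
}
```

With 2500 block rows and a page size of 1000, page 2 holds rows 2001 to 2500 and there are 3 pages in total.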
[jira] [Resolved] (SPARK-2368) Improve io.netty related handlers and clients in network.netty
[ https://issues.apache.org/jira/browse/SPARK-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2368. -- Resolution: Won't Fix PR discussion says that the OP abandoned this change as it was covered by other changes to Netty code. Improve io.netty related handlers and clients in network.netty -- Key: SPARK-2368 URL: https://issues.apache.org/jira/browse/SPARK-2368 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 1.0.0 Reporter: Binh Nguyen Priority: Minor One issue with current implementation is that FileServerHandler will just write to the channel without checking if the channel buffer is full. This could cause OOM on the receiving end.
[jira] [Resolved] (SPARK-2402) DiskBlockObjectWriter should update the initial position when reusing this object
[ https://issues.apache.org/jira/browse/SPARK-2402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2402. -- Resolution: Won't Fix There was disagreement about whether to merge this change, but it looks like it was closed and not merged in the end, because the object is not supposed to be reusable. DiskBlockObjectWriter should update the initial position when reusing this object - Key: SPARK-2402 URL: https://issues.apache.org/jira/browse/SPARK-2402 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.0.0 Reporter: Saisai Shao Priority: Minor The initial position of DiskBlockObjectWriter is not updated when closing and reopening, so reusing this object to write a file will lead to errors.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243207#comment-14243207 ] Debasish Das commented on SPARK-4823: - Even for matrix factorization, userFactors are user x rank. With modest ranks of 50 and users at 10M, I don't think it is possible to transpose the matrix and run columnSimilarities. Doing it on the fly is still O(n*n) complexity-wise, right? rowSimilarities --- Key: SPARK-4823 URL: https://issues.apache.org/jira/browse/SPARK-4823 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh RowMatrix has a columnSimilarities method to find cosine similarities between columns. A rowSimilarities method would be useful to find similarities between rows. This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12.
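The quadratic blow-up being discussed is easy to see in a brute-force sketch: all-pairs row cosine similarity on a local matrix produces n*(n-1)/2 entries, which is exactly what makes the naive approach infeasible at 10^6+ rows. This is a plain-Scala illustration, not the RowMatrix API.

```scala
// Brute-force rowSimilarities sketch for a small local matrix: cosine
// similarity for every pair of rows. Output size is quadratic in row count,
// which is the scalability problem the JIRA is about.
object RowSim {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na = math.sqrt(a.map(x => x * x).sum)
    val nb = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }

  // All pairs (i, j) with i < j: n*(n-1)/2 similarity entries.
  def rowSimilarities(rows: Array[Array[Double]]): Seq[((Int, Int), Double)] =
    for {
      i <- rows.indices
      j <- rows.indices if i < j
    } yield ((i, j), cosine(rows(i), rows(j)))
}
```

For 3 rows this already yields 3 pairs; for 10^6 rows it would yield on the order of 10^12, matching the estimate in the issue.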
[jira] [Resolved] (SPARK-2542) Exit Code Class should be renamed and placed package properly
[ https://issues.apache.org/jira/browse/SPARK-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2542. -- Resolution: Won't Fix PR discussion says this is WontFix. Exit Code Class should be renamed and placed package properly - Key: SPARK-2542 URL: https://issues.apache.org/jira/browse/SPARK-2542 Project: Spark Issue Type: Bug Reporter: Kousuke Saruta org.apache.spark.executor.ExecutorExitCode represents some of the exit codes. The name of the class suggests the set of exit codes of the Executor, but the exit codes defined in the class can be used not only by the Executor but also by, e.g., the Driver. Actually, DiskBlockManager uses ExecutorExitCode.DISK_STORE_FAILED_TO_CREATE_DIR, and DiskBlockManager can be used by the Driver. We should rename and move the class to a new package.
[jira] [Resolved] (SPARK-2671) BlockObjectWriter should create parent directory when the directory doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2671. -- Resolution: Won't Fix This is another case where the PR discussion indicates WontFix. BlockObjectWriter should create parent directory when the directory doesn't exist - Key: SPARK-2671 URL: https://issues.apache.org/jira/browse/SPARK-2671 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Kousuke Saruta Priority: Minor BlockObjectWriter#open expects the parent directory to be present. {code}
override def open(): BlockObjectWriter = {
  fos = new FileOutputStream(file, true)
  ts = new TimeTrackingOutputStream(fos)
  channel = fos.getChannel()
  lastValidPosition = initialPosition
  bs = compressStream(new BufferedOutputStream(ts, bufferSize))
  objOut = serializer.newInstance().serializeStream(bs)
  initialized = true
  this
}
{code} Normally, the parent directory is created by DiskBlockManager#createLocalDirs but, just in case, BlockObjectWriter#open should check for the existence of the directory and create it if it does not exist. I think a recoverable error should be recovered.
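The defensive check being proposed can be sketched as below: ensure the parent directory exists before opening the FileOutputStream, instead of assuming DiskBlockManager already created it. This is a minimal stand-alone sketch, not the BlockObjectWriter code; the object name is made up.

```scala
import java.io.{File, FileOutputStream}

// Sketch of "create parent directory when it doesn't exist" before opening
// a file for append, so a missing directory is recovered rather than fatal.
object SafeOpen {
  def open(file: File): FileOutputStream = {
    val parent = file.getParentFile
    if (parent != null && !parent.exists()) {
      // Recover from the missing-directory case instead of failing the write.
      parent.mkdirs()
    }
    new FileOutputStream(file, true)
  }
}
```

With this guard, opening a file under a not-yet-created directory succeeds, which is the recoverable-error behavior the reporter asks for.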
[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement
[ https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243221#comment-14243221 ] Sean Owen commented on SPARK-2604: -- PR comments suggest this was fixed by SPARK-2140? https://github.com/apache/spark/pull/1571 Spark Application hangs on yarn in edge case scenario of executor memory requirement Key: SPARK-2604 URL: https://issues.apache.org/jira/browse/SPARK-2604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Twinkle Sachdeva In a yarn environment, let's say: MaxAM = maximum allocatable memory; ExecMem = executor's memory. If (MaxAM > ExecMem && (MaxAM - ExecMem) < 384m), then maximum resource validation fails w.r.t. executor memory, the application master gets launched, but when resources are allocated and validated again, they are returned and the application appears to hang. The typical use case is to ask for executor memory = maximum allowed memory as per the yarn config.
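The edge case above can be illustrated with a tiny check: YARN adds a memory overhead (384 MB in the report) on top of the requested executor memory, so a request can pass a naive per-container comparison yet never be satisfiable. The constants and names here are illustrative only, not the actual Spark/YARN validation code.

```scala
// Illustration of the hang: a request that passes "executorMem <= max" can
// still be unallocatable once the fixed overhead is added.
object YarnMemCheck {
  val OverheadMb = 384 // overhead figure quoted in the report; illustrative

  // Satisfiable only if executor memory plus overhead fits in a container.
  def fits(maxAllocatableMb: Int, executorMemMb: Int): Boolean =
    executorMemMb + OverheadMb <= maxAllocatableMb

  // Naive check that ignores the overhead: passes validation, then hangs.
  def naiveCheck(maxAllocatableMb: Int, executorMemMb: Int): Boolean =
    executorMemMb <= maxAllocatableMb
}
```

Asking for executor memory equal to the maximum allocatable memory (the typical use case in the report) passes the naive check but can never be allocated, so the application waits forever.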
[jira] [Commented] (SPARK-2770) Rename spark-ganglia-lgpl to ganglia-lgpl
[ https://issues.apache.org/jira/browse/SPARK-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243244#comment-14243244 ] Sean Owen commented on SPARK-2770: -- Is this still active? The PR was attempted but got messed up, and I don't see another one. Rename spark-ganglia-lgpl to ganglia-lgpl - Key: SPARK-2770 URL: https://issues.apache.org/jira/browse/SPARK-2770 Project: Spark Issue Type: Improvement Components: Build Reporter: Chris Fregly Assignee: Chris Fregly Priority: Minor Fix For: 1.2.0
[jira] [Commented] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243250#comment-14243250 ] Sean Owen commented on SPARK-2750: -- Shall this be rolled into SPARK-3883? Both have an open PR. I'm not sure which is the better one to pursue, but they overlap. Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: WangTaoTheTonic Labels: https, ssl, webui Fix For: 1.0.3 Original Estimate: 96h Remaining Estimate: 96h Now I am trying to add https support for the web UI using the Jetty SSL integration. Below is the plan:
1. The web UI includes the Master UI, Worker UI, HistoryServer UI and Spark UI. Users can switch between https and http by configuring spark.http.policy as a JVM property for each process, with http as the default.
2. The web port of Master and Worker is decided in order of launch arguments, JVM property, system env and default port.
3. Below are some other configuration items:
spark.ssl.server.keystore.location The file or URL of the SSL key store
spark.ssl.server.keystore.password The password for the key store
spark.ssl.server.keystore.keypassword The password (if any) for the specific key within the key store
spark.ssl.server.keystore.type The type of the key store (default JKS)
spark.client.https.need-auth True if SSL needs client authentication
spark.ssl.server.truststore.location The file name or URL of the trust store location
spark.ssl.server.truststore.password The password for the trust store
spark.ssl.server.truststore.type The type of the trust store (default JKS)
Any feedback is welcome!
[jira] [Commented] (SPARK-3247) Improved support for external data sources
[ https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243253#comment-14243253 ] Matei Zaharia commented on SPARK-3247: -- For those looking to learn about the interface in more detail, there is a meetup video on it at https://www.youtube.com/watch?v=GQSNJAzxOr8. Improved support for external data sources -- Key: SPARK-3247 URL: https://issues.apache.org/jira/browse/SPARK-3247 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.2.0
[jira] [Resolved] (SPARK-2715) ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling
[ https://issues.apache.org/jira/browse/SPARK-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2715. -- Resolution: Won't Fix PR discussion says it is WontFix. ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling -- Key: SPARK-2715 URL: https://issues.apache.org/jira/browse/SPARK-2715 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor ExternalAppendOnlyMap adds a max limit on the number of spills and a max limit on the disk bytes written for spilling. Therefore, a task with data skew could be made to fail fast instead of running for a long time.
[jira] [Resolved] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)
[ https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2710. -- Resolution: Won't Fix PR discussion says that this should become an external library, given the new external data source API in 1.2. Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class) --- Key: SPARK-2710 URL: https://issues.apache.org/jira/browse/SPARK-2710 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Teng Qiu Spark SQL can take Parquet files or JSON files as a table directly (without being given a case class to define the schema). As a component named SQL, it should also be able to take a ResultSet from an RDBMS easily. I find that there is a JdbcRDD in core: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala, so I want to make a small change in this file to allow SQLContext to read the metadata from the PreparedStatement (reading metadata does not require actually executing the query). Then, in Spark SQL, SQLContext can create a SchemaRDD from the JdbcRDD and its metadata. Further, maybe we can add a feature to the sql-shell, so that users can use spark-thrift-server to join tables from different sources, such as: {code} CREATE TABLE jdbc_tbl1 AS JDBC connectionString username password initQuery bound ... CREATE TABLE parquet_files AS PARQUET hdfs://tmp/parquet_table/ SELECT parquet_files.colX, jdbc_tbl1.colY FROM parquet_files JOIN jdbc_tbl1 ON (parquet_files.id = jdbc_tbl1.id) {code} I think such a feature will be useful, like the facebook Presto engine does. Oh, and there is a small bug in JdbcRDD's compute(), in method close(): {code} if (null != conn && !stmt.isClosed()) conn.close() {code} should be {code} if (null != conn && !conn.isClosed()) conn.close() {code} Just a small write error :) but such a close method will never be able to close conn...
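The close() bug quoted above is worth seeing in miniature: guarding conn.close() on stmt.isClosed means the connection is never closed once the statement has been closed first. The stand-in class below is illustrative only, not the real JDBC types.

```scala
// Stand-in for a closeable JDBC resource (Connection or Statement).
class FakeResource {
  var closed = false
  def isClosed: Boolean = closed
  def close(): Unit = closed = true
}

object CloseBugDemo {
  // Buggy variant from the report: checks the statement, closes the connection.
  // Once stmt is already closed, conn is never closed (a leak).
  def buggyClose(conn: FakeResource, stmt: FakeResource): Unit =
    if (null != conn && !stmt.isClosed) conn.close()

  // Fixed variant: guard on the same object that is being closed.
  def fixedClose(conn: FakeResource, stmt: FakeResource): Unit =
    if (null != conn && !conn.isClosed) conn.close()
}
```

If the statement is closed before close() runs (the usual teardown order), the buggy guard skips conn.close() entirely, which is exactly the leak the reporter describes.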
[jira] [Resolved] (SPARK-2872) Fix conflict between code and doc in YarnClientSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2872. -- Resolution: Won't Fix Looks like this was obsoleted by subsequent changes to how YARN parses configuration, given the PR. Fix conflict between code and doc in YarnClientSchedulerBackend --- Key: SPARK-2872 URL: https://issues.apache.org/jira/browse/SPARK-2872 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Zhihui The doc says: system properties override environment variables. https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/scheduler/cluster/YarnClientSchedulerBackend.scala#L71 But the code conflicts with it.
[jira] [Resolved] (SPARK-2947) DAGScheduler resubmit the stage into an infinite loop
[ https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2947. -- Resolution: Duplicate Fix Version/s: (was: 1.0.3) (was: 1.2.0) Discussion indicates this was the same as SPARK-3224, which makes a more comprehensive change and has been resolved. DAGScheduler resubmit the stage into an infinite loop - Key: SPARK-2947 URL: https://issues.apache.org/jira/browse/SPARK-2947 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.0.2 Reporter: Guoqiang Li Priority: Blocker Stage to resubmit more than 5 times. This seems to be caused by {{FetchFailed.bmAddress}} is null . I don't know how to reproduce it. master log: {noformat} 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 52334 on executor 82: sanshan (PROCESS_LOCAL) 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 3060 bytes in 0 ms 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 52335 on executor 78: tuan231 (PROCESS_LOCAL) 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 3060 bytes in 0 ms 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141) 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission -- 5 
times --- 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at DealCF.scala:215) for resubmision due to a fetch failure 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 1.189, whose tasks have all completed, from pool 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 ms on jilin (progress: 280/280) 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 269) 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 2.1, whose tasks have all completed, from pool 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at DealCF.scala:207) finished in 129.544 s {noformat} worker: log {noformat} /1408/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, computing it 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, computing it 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18017 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18151 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, computing it 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, computing it 14/08/09 21:49:41 INFO 
executor.CoarseGrainedExecutorBackend: Got assigned task 18285 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18419 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally 14/08/09 21:49:41 INFO storage.BlockManager:
[jira] [Resolved] (SPARK-3099) Staging Directory is never deleted when we run job with YARN Client Mode
[ https://issues.apache.org/jira/browse/SPARK-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3099. -- Resolution: Won't Fix PR discussion says this was obsoleted by SPARK-2933. Staging Directory is never deleted when we run job with YARN Client Mode Key: SPARK-3099 URL: https://issues.apache.org/jira/browse/SPARK-3099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta When we run an application in YARN Cluster mode, the class 'ApplicationMaster' is used as the ApplicationMaster, which has a shutdown hook to clean up the staging directory (~/.sparkStaging). But when we run an application in YARN Client mode, the class 'ExecutorLauncher' used as the ApplicationMaster doesn't clean up the staging directory.
[jira] [Resolved] (SPARK-3038) delete history server logs when there are too many logs
[ https://issues.apache.org/jira/browse/SPARK-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3038. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) The PR says this is WontFix. delete history server logs when there are too many logs Key: SPARK-3038 URL: https://issues.apache.org/jira/browse/SPARK-3038 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.1 Reporter: wangfei Enhance the history server to delete logs automatically: 1. use spark.history.deletelogs.enable to enable this function; 2. when the number of app logs is greater than spark.history.maxsavedapplication, delete the older logs.
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243398#comment-14243398 ] Valeriy Avanesov commented on SPARK-2426: - What's the normalization constraint? Each row of W should sum up to 1 and each column of H should sum up to 1, with positivity? Yes. That is similar to PLSA, right, except that PLSA will have a bi-concave loss... There's a completely different loss... BTW, we've used a factorisation with the loss you've described as an initial approximation for PLSA. It gave a significant speed-up. Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS
[jira] [Resolved] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3124. -- Resolution: Fixed I'm a little unclear on the outcome, but in master, running {{mvn -Phive -Phive-thriftserver -Dhadoop.version=2.0.0-mr1-cdh4.3.0 dependency:tree}} says there is no Netty 3.2.2 anymore. So it looks fixed. Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flattened into the assembly package; however, the class (NioWorker) signature difference leads to a failure when launching the sparksql CLI/ThriftServer.
[jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances
[ https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243454#comment-14243454 ] Sean Owen commented on SPARK-3358: -- Is this then resolved by one of https://github.com/apache/spark/pull/2244 or https://github.com/apache/spark/pull/2259 ? PySpark worker fork()ing performance regression in m3.* / PVM instances --- Key: SPARK-3358 URL: https://issues.apache.org/jira/browse/SPARK-3358 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Environment: m3.* instances on EC2 Reporter: Josh Rosen SPARK-2764 (and some followup commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a socket and handled worker launch requests from Java. After my patch, this intermediate pool was removed and launch requests are handled directly in daemon.py. Unfortunately, this seems to have increased PySpark task launch latency when running on m3* class instances in EC2. Most of this difference can be attributed to m3 instances' more expensive fork() system calls. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances: {code}
import os

for x in range(1000):
    if os.fork() == 0:
        exit()
{code} On the r3.xlarge instance: {code}
real 0m0.761s
user 0m0.008s
sys  0m0.144s
{code} And on m3.xlarge: {code}
real 0m1.699s
user 0m0.012s
sys  0m1.008s
{code} I think this is due to HVM vs PVM EC2 instances using different virtualization technologies with different fork costs. It may be the case that this performance difference only appears in certain microbenchmarks and is masked by other performance improvements in PySpark, such as improvements to large group-bys. I'm in the process of re-running spark-perf benchmarks on m3 instances in order to confirm whether this impacts more realistic jobs.
[jira] [Resolved] (SPARK-3352) Rename Flume Polling stream to Pull Based stream
[ https://issues.apache.org/jira/browse/SPARK-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3352. -- Resolution: Won't Fix According to the PR, this is WontFix. Rename Flume Polling stream to Pull Based stream Key: SPARK-3352 URL: https://issues.apache.org/jira/browse/SPARK-3352 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243456#comment-14243456 ] Debasish Das commented on SPARK-2426: - [~akopich] I got good MAP results on recommendation datasets with the approximated PLSA formulation. I have not yet had time to compare that formulation with the Gibbs sampling based LDA PR: https://issues.apache.org/jira/browse/SPARK-1405. Did you compare them? Quadratic Minimization for MLlib ALS Key: SPARK-2426 URL: https://issues.apache.org/jira/browse/SPARK-2426 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.3.0 Reporter: Debasish Das Assignee: Debasish Das Original Estimate: 504h Remaining Estimate: 504h Current ALS supports least squares and nonnegative least squares. I presented ADMM and IPM based Quadratic Minimization solvers to be used for the following ALS problems: 1. ALS with bounds 2. ALS with L1 regularization 3. ALS with Equality constraint and bounds Initial runtime comparisons are presented at Spark Summit. http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark Based on Xiangrui's feedback I am currently comparing the ADMM based Quadratic Minimization solvers with IPM based QpSolvers and the default ALS/NNLS. I will keep updating the runtime comparison results. For integration the detailed plan is as follows: 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization 2. Integrate QuadraticMinimizer in mllib ALS
[jira] [Resolved] (SPARK-3229) spark.shuffle.safetyFraction and spark.storage.safetyFraction is not documented
[ https://issues.apache.org/jira/browse/SPARK-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3229. -- Resolution: Won't Fix Another one that says it's WontFix in the PR. spark.shuffle.safetyFraction and spark.storage.safetyFraction is not documented --- Key: SPARK-3229 URL: https://issues.apache.org/jira/browse/SPARK-3229 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta There are no descriptions about spark.shuffle.safetyFraction and spark.storage.safetyFraction.
[jira] [Resolved] (SPARK-3201) Yarn Client do not support the -X java opts
[ https://issues.apache.org/jira/browse/SPARK-3201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3201. -- Resolution: Won't Fix Looks like it was abandoned by the OP, possibly in favor of SPARK-1953 or SPARK-1507 Yarn Client do not support the -X java opts - Key: SPARK-3201 URL: https://issues.apache.org/jira/browse/SPARK-3201 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: SaintBacchus In yarn-client mode, it's not allowed to set the spark.driver.extraJavaOptions. I think it's very inconvenient if we want to set the -X java opts in the process of ExecutorLauncher.
[jira] [Resolved] (SPARK-3548) Display cache hit ratio on WebUI
[ https://issues.apache.org/jira/browse/SPARK-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3548. -- Resolution: Won't Fix I think this particular suggestion is WontFix since the idea of a hit ratio was declined in the PR discussion and the PR was closed. Display cache hit ratio on WebUI Key: SPARK-3548 URL: https://issues.apache.org/jira/browse/SPARK-3548 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta In Stage page, if cache hit ratio is displayed, it's useful for application / cache strategy tuning.
[jira] [Resolved] (SPARK-3433) Mima false-positives with @DeveloperAPI and @Experimental annotations
[ https://issues.apache.org/jira/browse/SPARK-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3433. -- Resolution: Fixed Fix Version/s: 1.2.0 The PR was https://github.com/apache/spark/pull/2285 and it was merged, looks like for 1.2: https://github.com/apache/spark/commit/ecf0c02935815f0d4018c0e30ec4c784e60a5db0 Mima false-positives with @DeveloperAPI and @Experimental annotations - Key: SPARK-3433 URL: https://issues.apache.org/jira/browse/SPARK-3433 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Prashant Sharma Priority: Minor Fix For: 1.2.0 In https://github.com/apache/spark/pull/2315, I found two cases where {{@DeveloperAPI}} and {{@Experimental}} annotations didn't prevent false-positive warnings from Mima. To reproduce this problem, run dev/mima as of https://github.com/JoshRosen/spark/commit/ec90e21947b615d4ef94a3a54cfd646924ccaf7c. The spurious warnings are listed at the top of https://gist.github.com/JoshRosen/5d8df835516dc367389d.
[jira] [Resolved] (SPARK-3719) Spark UI: complete/failed stages is better to show the total number of stages
[ https://issues.apache.org/jira/browse/SPARK-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3719. -- Resolution: Duplicate Target Version/s: (was: 1.1.1, 1.2.0) Apparently resolved by the very similar SPARK-4168, according to the PR. Spark UI: complete/failed stages is better to show the total number of stages Key: SPARK-3719 URL: https://issues.apache.org/jira/browse/SPARK-3719 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor
[jira] [Resolved] (SPARK-3712) add a new UpdateDStream to update a rdd dynamically
[ https://issues.apache.org/jira/browse/SPARK-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3712. -- Resolution: Won't Fix Target Version/s: (was: 1.1.1, 1.2.0) Withdrawn in the PR by the proposer. add a new UpdateDStream to update a rdd dynamically --- Key: SPARK-3712 URL: https://issues.apache.org/jira/browse/SPARK-3712 Project: Spark Issue Type: Improvement Components: Streaming Reporter: uncleGen Priority: Minor Maybe we can achieve the aim by using the foreachRDD function, but I feel it is awkward that way, because I need to pass a closure, like this: {code} val baseRDD = ... var updatedRDD = ... val inputStream = ... val func = (rdd: RDD[T], t: Time) => { updatedRDD = baseRDD.op(rdd) } inputStream.foreachRDD(func _) {code} In my PR, we can update an RDD like: {code} val updateStream = inputStream.updateRDD(baseRDD, func).asInstanceOf[T, V, U] {code} and obtain the updated RDD like this: {code} val updatedRDD = updateStream.getUpdatedRDD {code}
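The closure-based workaround above — re-deriving a dataset from a static base on every incoming batch — can be modeled without Spark. A toy Python sketch (the class and method names are invented for illustration, not Spark API):

```python
# Minimal model of the foreachRDD workaround: a mutable reference that is
# re-derived from the base dataset each time a batch arrives.
class UpdatableDataset:
    def __init__(self, base, op):
        self.base = base            # the static base dataset
        self.op = op                # how a batch combines with the base
        self.updated = list(base)   # current derived view

    def on_batch(self, batch):
        # analogous to inputStream.foreachRDD(func _): the closure
        # overwrites the derived view using the base and the new batch
        self.updated = self.op(self.base, batch)

    def get_updated(self):
        return self.updated

ds = UpdatableDataset([1, 2, 3], lambda base, batch: base + batch)
ds.on_batch([10, 20])
print(ds.get_updated())  # [1, 2, 3, 10, 20]
```

Each batch replaces, rather than accumulates into, the derived view — which is exactly the behavior the closure in the original snippet produces.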
[jira] [Resolved] (SPARK-3689) FileLogger should create new instance of FileSystem regardless of its scheme
[ https://issues.apache.org/jira/browse/SPARK-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3689. -- Resolution: Not a Problem Target Version/s: (was: 1.1.1, 1.2.0) The PR says it's not a problem actually. FileLogger should create new instance of FileSystem regardless of its scheme - Key: SPARK-3689 URL: https://issues.apache.org/jira/browse/SPARK-3689 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta FileLogger creates a new instance of FileSystem to avoid the effect of FileSystem#close from another module, but it expects only HDFS. We can use another filesystem for the directory where the event log is stored. {code} if (scheme == "hdfs") { conf.setBoolean("fs.hdfs.impl.disable.cache", true) } {code}
[jira] [Resolved] (SPARK-3663) Document SPARK_LOG_DIR and SPARK_PID_DIR
[ https://issues.apache.org/jira/browse/SPARK-3663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3663. -- Resolution: Fixed Fix Version/s: 1.3.0 The PR was merged, though looks like not for 1.2: https://github.com/apache/spark/commit/5c265ccde0c5594899ec61f9c1ea100ddff52da7 Document SPARK_LOG_DIR and SPARK_PID_DIR Key: SPARK-3663 URL: https://issues.apache.org/jira/browse/SPARK-3663 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Andrew Ash Assignee: Andrew Ash Fix For: 1.3.0 I'm using these two parameters in some puppet scripts for standalone deployment and realized that they're not documented anywhere. We should document them.
[jira] [Resolved] (SPARK-3636) It is not friendly to interrupt a Job when user passes different storageLevels to a RDD
[ https://issues.apache.org/jira/browse/SPARK-3636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3636. -- Resolution: Won't Fix Target Version/s: (was: 1.1.1) PR discussion says this is WontFix. It is not friendly to interrupt a Job when user passes different storageLevels to a RDD --- Key: SPARK-3636 URL: https://issues.apache.org/jira/browse/SPARK-3636 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor
[jira] [Commented] (SPARK-4817) [streaming]Print the specified number of data and handle all of the elements in RDD
[ https://issues.apache.org/jira/browse/SPARK-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243683#comment-14243683 ] 宿荣全 commented on SPARK-4817: [~srowen] I'm sorry I didn't describe the problem clearly. Consider a scenario with multiple outputs: *data from HDFS files - map - filter - map (each row updates a MySQL DB) - filter - map - print (print 20 rows to the console)* # {color:red}output to MySQL DB{color} # {color:red}output to console{color} With this patch (the function {{processAllAndPrintFirst}} is newly defined): {code} ssc.textFileStream(path).map(func1).filter(func2).map(f => { updataMysql(f) }).filter(func3).map(func4).processAllAndPrintFirst(20) {code} How would this scenario be done using {{foreachRDD}} and {{take}}, or {{print(num)}}? Both [ {{rdd.foreach}} and {{rdd.take}} ] and [ {{rdd.foreach}} and {{stream.print(100)}} ] will launch two jobs in each streaming batch. Compared with a single job, wouldn't that be less efficient? [streaming]Print the specified number of data and handle all of the elements in RDD --- Key: SPARK-4817 URL: https://issues.apache.org/jira/browse/SPARK-4817 Project: Spark Issue Type: New Feature Components: Streaming Reporter: 宿荣全 Priority: Minor DStream.print prints 10 elements but handles 11. A new function based on DStream.print is proposed: print the specified number of elements while handling all elements of the RDD. There is a work scene: val dstream = stream.map-filter-mapPartitions-print; the data after filter needs to update a database in mapPartitions, but we don't need to print every record — only the top 20, to view the data processing.
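The efficiency question above boils down to whether the per-element side effect and the preview of the first N elements can share a single pass over the data. A minimal single-pass sketch in plain Python (not the proposed Spark API; the function and argument names here are illustrative only):

```python
def process_all_and_print_first(items, side_effect, n):
    """Apply side_effect to every element, but keep only the first n for display."""
    head = []
    for item in items:
        side_effect(item)       # e.g. update the database for every row
        if len(head) < n:
            head.append(item)   # remember only the first n for printing
    for item in head:
        print(item)
    return head

seen = []
# all 100 elements get the side effect; only the first 20 are printed
process_all_and_print_first(range(100), seen.append, 20)
```

This is the one-job shape the commenter wants; the two-job alternative would iterate the data once for the side effect and a second time for `take(n)`.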
[jira] [Resolved] (SPARK-4713) SchemaRDD.unpersist() should not raise exception if it is not cached.
[ https://issues.apache.org/jira/browse/SPARK-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4713. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3572 [https://github.com/apache/spark/pull/3572] SchemaRDD.unpersist() should not raise exception if it is not cached. - Key: SPARK-4713 URL: https://issues.apache.org/jira/browse/SPARK-4713 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Minor Fix For: 1.3.0 Unpersisting an uncached RDD does not raise an exception, for example: {panel} val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data) distData.unpersist(true) {panel} But SchemaRDD throws an exception if it is not cached. Since SchemaRDD is a subclass of RDD, it should follow the same behavior.
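The requested semantics — unpersist as a silent no-op when nothing is cached — can be modeled without Spark. A toy Python sketch (the class and its fields are invented for illustration, not Spark's implementation):

```python
class CachedView:
    """Toy stand-in for an RDD-like object whose unpersist must be idempotent."""
    def __init__(self):
        self._cached = False

    def cache(self):
        self._cached = True
        return self

    def unpersist(self, blocking=True):
        # Desired behavior per the issue: silently do nothing when not
        # cached, instead of raising as the old SchemaRDD did.
        if self._cached:
            self._cached = False
        return self

v = CachedView()
v.unpersist()           # no exception even though nothing was cached
v.cache().unpersist()   # the normal cached -> uncached path still works
```

Making `unpersist` idempotent keeps the subclass substitutable for its parent, which is the core of the bug report.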
[jira] [Resolved] (SPARK-4662) Whitelist more Hive unittest
[ https://issues.apache.org/jira/browse/SPARK-4662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4662. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3522 [https://github.com/apache/spark/pull/3522] Whitelist more Hive unittest Key: SPARK-4662 URL: https://issues.apache.org/jira/browse/SPARK-4662 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Fix For: 1.3.0 Whitelist more Hive unit tests: create_like_tbl_props udf5 udf_java_method decimal_1 udf_pmod udf_to_double udf_to_float udf7 (this will fail in Hive 0.12)
[jira] [Resolved] (SPARK-4639) Pass maxIterations in as a parameter in Analyzer
[ https://issues.apache.org/jira/browse/SPARK-4639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4639. - Resolution: Fixed Issue resolved by pull request 3499 [https://github.com/apache/spark/pull/3499] Pass maxIterations in as a parameter in Analyzer Key: SPARK-4639 URL: https://issues.apache.org/jira/browse/SPARK-4639 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jacky Li Priority: Minor Fix For: 1.3.0 fix a TODO in Analyzer: // TODO: pass this in as a parameter val fixedPoint = FixedPoint(100)
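The hard-coded `FixedPoint(100)` is just the iteration cap of a fixed-point loop — apply the rules until the plan stops changing or the cap is hit — so the fix amounts to threading that cap through as a constructor parameter. A generic sketch (plain Python, not Catalyst's actual `RuleExecutor` classes):

```python
def fixed_point(f, x, max_iterations=100):
    """Apply f repeatedly until the value stops changing or the cap is hit."""
    for _ in range(max_iterations):
        nxt = f(x)
        if nxt == x:        # reached a fixed point: further applications are no-ops
            return x
        x = nxt
    return x                # give up after max_iterations

# integer halving reaches the fixed point 0
print(fixed_point(lambda n: n // 2, 37))  # 0
```

With `max_iterations` exposed as a parameter, callers (like the Analyzer's constructor) can tune the cap instead of inheriting the magic constant 100.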
[jira] [Resolved] (SPARK-4293) Make Cast be able to handle complex types.
[ https://issues.apache.org/jira/browse/SPARK-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4293. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3150 [https://github.com/apache/spark/pull/3150] Make Cast be able to handle complex types. -- Key: SPARK-4293 URL: https://issues.apache.org/jira/browse/SPARK-4293 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Priority: Critical Fix For: 1.3.0 Inserting data of type including {{ArrayType.containsNull == false}} or {{MapType.valueContainsNull == false}} or {{StructType.fields.exists(_.nullable == false)}} into Hive table will fail because {{Cast}} inserted by {{HiveMetastoreCatalog.PreInsertionCasts}} rule of {{Analyzer}} can't handle these types correctly. Complex type cast rule proposal: * Cast for non-complex types should be able to cast the same as before. * Cast for {{ArrayType}} can evaluate if ** Element type can cast ** Nullability rule doesn't break * Cast for {{MapType}} can evaluate if ** Key type can cast ** Nullability for casted key type is {{false}} ** Value type can cast ** Nullability rule for value type doesn't break * Cast for {{StructType}} can evaluate if ** The field size is the same ** Each field can cast ** Nullability rule for each field doesn't break * The nested structure should be the same. Nullability rule: * If the casted type is {{nullable == true}}, the target nullability should be {{true}}
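The nullability rule in the proposal — a cast may widen nullability but never narrow it — can be captured in a small compatibility check. A simplified sketch covering only atomic and array types (the tuple encoding of types is invented for illustration and is not Catalyst's type system):

```python
# Simplified type model:
#   ("atomic",)                       -- any non-complex type
#   ("array", elem_type, nullable)    -- ArrayType with containsNull flag
def nullable_ok(from_nullable, to_nullable):
    # If the source side may contain null, the target must accept null.
    return to_nullable or not from_nullable

def can_cast(src, dst):
    if src[0] == "array" and dst[0] == "array":
        # element types must cast AND nullability may only widen
        return can_cast(src[1], dst[1]) and nullable_ok(src[2], dst[2])
    if src[0] == "atomic" and dst[0] == "atomic":
        return True  # assume atomic casts resolve the same as before
    return False     # nested structure must match

# narrowing containsNull (True -> False) is rejected; widening is fine
print(can_cast(("array", ("atomic",), False), ("array", ("atomic",), True)))   # True
print(can_cast(("array", ("atomic",), True), ("array", ("atomic",), False)))   # False
```

The real proposal extends the same recursion to `MapType` keys/values and `StructType` fields, with the extra condition that casted map keys must be non-nullable.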
[jira] [Resolved] (SPARK-4828) sum and avg over empty table should return null
[ https://issues.apache.org/jira/browse/SPARK-4828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4828. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3675 [https://github.com/apache/spark/pull/3675] sum and avg over empty table should return null --- Key: SPARK-4828 URL: https://issues.apache.org/jira/browse/SPARK-4828 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor Fix For: 1.3.0
[jira] [Resolved] (SPARK-4825) CTAS fails to resolve when created using saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4825. - Resolution: Fixed Fix Version/s: 1.2.1 Target Version/s: 1.2.1 (was: 1.2.0) CTAS fails to resolve when created using saveAsTable Key: SPARK-4825 URL: https://issues.apache.org/jira/browse/SPARK-4825 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Michael Armbrust Assignee: Cheng Hao Priority: Critical Fix For: 1.2.1 While writing a test for a different issue, I found that saveAsTable seems to be broken: {code} test("save join to table") { val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)) sql("CREATE TABLE test1 (key INT, value STRING)") testData.insertInto("test1") sql("CREATE TABLE test2 (key INT, value STRING)") testData.insertInto("test2") testData.insertInto("test2") sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test") checkAnswer( table("test"), sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq) } sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test") org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan found, tree: 'CreateTableAsSelect None, test, false, None Aggregate [], [COUNT(value#336) AS _c0#334L] Join Inner, Some((key#335 = key#339)) MetastoreRelation default, test1, Some(a) MetastoreRelation default, test2, Some(b) {code}
[jira] [Resolved] (SPARK-4742) The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded
[ https://issues.apache.org/jira/browse/SPARK-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4742. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3602 [https://github.com/apache/spark/pull/3602] The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded Key: SPARK-4742 URL: https://issues.apache.org/jira/browse/SPARK-4742 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.3.0 When I use a Parquet file as an output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero-padded, while RDD#saveAsText does zero padding.
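Zero-padding a partition index is just fixed-width formatting, so Hadoop-style names such as part-00000 sort lexicographically in the same order as their numeric index. A quick illustration (the helper name and default width are made up for this sketch):

```python
def part_file_name(task_id, width=5):
    # e.g. part-00002 instead of part-2, so file names sort in numeric order
    return "part-%0*d" % (width, task_id)

print(part_file_name(2))    # part-00002
print(part_file_name(123))  # part-00123
```

Without the padding, a lexicographic listing would interleave part-10 between part-1 and part-2, which is the inconsistency the issue is about.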
[jira] [Resolved] (SPARK-4829) eliminate expressions calculation in count expression
[ https://issues.apache.org/jira/browse/SPARK-4829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4829. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3676 [https://github.com/apache/spark/pull/3676] eliminate expressions calculation in count expression - Key: SPARK-4829 URL: https://issues.apache.org/jira/browse/SPARK-4829 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Fix For: 1.3.0