[jira] [Assigned] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-1503: Assignee: Xiangrui Meng Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166465#comment-14166465 ] Xiangrui Meng commented on SPARK-1503: -- [~staple] Thanks for picking up this JIRA! TFOCS is a good place to start. We can support AT (Auslender and Teboulle) update, line search, and restart in the first version. It would be nice to take generic composite objective functions. Please note that this could become a big task. We definitely need to go through the design first. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
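At its core, Nesterov's method takes each gradient step from an extrapolated (momentum) point rather than from the current iterate. A minimal single-machine sketch in Scala with Breeze is shown below; the object name, the fixed step size, and the FISTA-style momentum schedule are illustrative assumptions only, not MLlib or TFOCS API. A distributed version would compute each gradient with an aggregation over the data RDD, much as the existing SGD optimizer does.
{code}
import breeze.linalg.{DenseVector => BDV}

// Minimal sketch of Nesterov's accelerated gradient method for a smooth,
// unconstrained objective. All names and the fixed step size are assumptions
// made for illustration.
object NesterovSketch {
  def minimize(
      gradient: BDV[Double] => BDV[Double], // gradient of the objective
      init: BDV[Double],
      stepSize: Double,
      numIterations: Int): BDV[Double] = {
    var x = init.copy // current iterate
    var y = init.copy // extrapolated (momentum) point
    var t = 1.0       // momentum coefficient
    for (_ <- 0 until numIterations) {
      val xNext = y - gradient(y) * stepSize        // gradient step at the extrapolated point
      val tNext = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
      y = xNext + (xNext - x) * ((t - 1.0) / tNext) // Nesterov extrapolation
      x = xNext
      t = tNext
    }
    x
  }
}
{code}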
[jira] [Updated] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1503: - Assignee: Aaron Staple (was: Xiangrui Meng) Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3897) Scala style: format example code
sjk created SPARK-3897: -- Summary: Scala style: format example code Key: SPARK-3897 URL: https://issues.apache.org/jira/browse/SPARK-3897 Project: Spark Issue Type: Sub-task Reporter: sjk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3897) Scala style: format example code
[ https://issues.apache.org/jira/browse/SPARK-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sjk updated SPARK-3897: --- https://github.com/apache/spark/pull/2754 Scala style: format example code Key: SPARK-3897 URL: https://issues.apache.org/jira/browse/SPARK-3897 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3897) Scala style: format example code
[ https://issues.apache.org/jira/browse/SPARK-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sjk updated SPARK-3897: --- Description: https://github.com/apache/spark/pull/2754 Scala style: format example code Key: SPARK-3897 URL: https://issues.apache.org/jira/browse/SPARK-3897 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk https://github.com/apache/spark/pull/2754 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name
[ https://issues.apache.org/jira/browse/SPARK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-3421. --- Resolution: Fixed StructField.toString should quote the name field to allow arbitrary character as struct field name -- Key: SPARK-3421 URL: https://issues.apache.org/jira/browse/SPARK-3421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian The original use case is something like this: {code} // JSON snippet with illegal characters in field names val json = """{"a(b)": {"c(d)": "hello"}}""" :: """{"a(b)": {"c(d)": "world"}}""" :: Nil val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json)) jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet") java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))), [1.37] failure: `,' expected but `(' found {code} The reason is that the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} as a struct field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2805) Update akka to version 2.3.4
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2805. Resolution: Fixed We've merged again with some modifications and we'll see if it works well in the maven builds. Update akka to version 2.3.4 Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Anand Avati Fix For: 1.2.0 akka-2.3 is the lowest version available in Scala 2.11 akka-2.3 depends on protobuf 2.5. Hadoop-1 requires protobuf 2.4.1. In order to reconcile the conflicting dependencies, need to release akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 within. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason
[ https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166555#comment-14166555 ] Denis Serduik commented on SPARK-2019: -- I have noticed the same problem with worker behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there is an error while serializing the closure. Spark workers die/disappear when job fails for nearly any reason Key: SPARK-2019 URL: https://issues.apache.org/jira/browse/SPARK-2019 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: sam We either have to reboot all the nodes, or run 'sudo service spark-worker restart' across our cluster. I don't think this should happen - the job failures are often not even that bad. There is a 5-upvote SO question here: http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails We shouldn't be giving restart privileges to our devs, and therefore our sysadm has to frequently restart the workers. When the sysadm is not around, there is nothing our devs can do. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason
[ https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166555#comment-14166555 ] Denis Serduik edited comment on SPARK-2019 at 10/10/14 8:39 AM: I have noticed the same problem with workers behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there was an error while serialization the closure. Also please notice that we run Spark in coarse-grained mode was (Author: dmaverick): I have noticed the same problem with workers behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there was an error while serialization the closure. Spark workers die/disappear when job fails for nearly any reason Key: SPARK-2019 URL: https://issues.apache.org/jira/browse/SPARK-2019 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: sam We either have to reboot all the nodes, or run 'sudo service spark-worker restart' across our cluster. I don't think this should happen - the job failures are often not even that bad. There is a 5 upvoted SO question here: http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails We shouldn't be giving restart privileges to our devs, and therefore our sysadm has to frequently restart the workers. When the sysadm is not around, there is nothing our devs can do. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason
[ https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166555#comment-14166555 ] Denis Serduik edited comment on SPARK-2019 at 10/10/14 8:40 AM: I have noticed the same problem with workers behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there was an error while serialization the closure. Also please note, we run Spark in coarse-grained mode was (Author: dmaverick): I have noticed the same problem with workers behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there was an error while serialization the closure. Also please notice that we run Spark in coarse-grained mode Spark workers die/disappear when job fails for nearly any reason Key: SPARK-2019 URL: https://issues.apache.org/jira/browse/SPARK-2019 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: sam We either have to reboot all the nodes, or run 'sudo service spark-worker restart' across our cluster. I don't think this should happen - the job failures are often not even that bad. There is a 5 upvoted SO question here: http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails We shouldn't be giving restart privileges to our devs, and therefore our sysadm has to frequently restart the workers. When the sysadm is not around, there is nothing our devs can do. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2348) In Windows having a enviorinment variable named 'classpath' gives error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166559#comment-14166559 ] AaronLin commented on SPARK-2348: - Why hasn't this issue been solved yet? Can anyone help? In Windows having a enviorinment variable named 'classpath' gives error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Operating System: Windows 7 Enterprise. If there is an environment variable named 'classpath', then starting 'spark-shell' gives the error below: mydir\spark\bin>spark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread "main" java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3889. Resolution: Fixed Fix Version/s: 1.2.0 JVM dies with SIGBUS, resulting in ConnectionManager failed ACK --- Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.2.0 Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multipleid=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2348) In Windows having a enviorinment variable named 'classpath' gives error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166566#comment-14166566 ] AaronLin commented on SPARK-2348: - I encounter this issue in sbt building spark1.1.0 (windows7 os), i solved this issue by changing one line in spark-class2.cmd --old- set JAVA_OPTS=-XX:MaxPermSize=128m %OUR_JAVA_OPTS% -Xms%OUR_JAVA_MEM% -Xmx%OUR_JAVA_MEM% -new-- set JAVA_OPTS=%OUR_JAVA_OPTS% -Djava.library.path=%SPARK_LIBRARY_PATH% -Dscala.usejavacp=true -Xms%OUR_JAVA_MEM% -Xmx%OUR_JAVA_MEM% --end it works. In Windows having a enviorinment variable named 'classpath' gives error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Operating System:: Windows 7 Enterprise If having enviorinment variable named 'classpath' gives then starting 'spark-shell' gives below error:: mydir\spark\binspark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found . ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler acces sed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found . ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread main java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca la:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar kILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass Loader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3898) History Web UI display incorrectly.
zzc created SPARK-3898: -- Summary: History Web UI display incorrectly. Key: SPARK-3898 URL: https://issues.apache.org/jira/browse/SPARK-3898 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: Spark 1.2.0-snapshot On Yarn Reporter: zzc After successfully running a Spark application, the history web UI displays incorrectly: App Name: Not Started Started: 1970/01/01 07:59:59 Spark User: Not Started Last Updated: 2014/10/10 14:50:39 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3898) History Web UI display incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zzc updated SPARK-3898: --- Fix Version/s: 1.2.0 History Web UI display incorrectly. --- Key: SPARK-3898 URL: https://issues.apache.org/jira/browse/SPARK-3898 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: Spark 1.2.0-snapshot On Yarn Reporter: zzc Fix For: 1.1.1, 1.2.0 After successfully running a Spark application, the history web UI displays incorrectly: App Name: Not Started Started: 1970/01/01 07:59:59 Spark User: Not Started Last Updated: 2014/10/10 14:50:39 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3898) History Web UI display incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zzc updated SPARK-3898: --- Fix Version/s: 1.1.1 History Web UI display incorrectly. --- Key: SPARK-3898 URL: https://issues.apache.org/jira/browse/SPARK-3898 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: Spark 1.2.0-snapshot On Yarn Reporter: zzc Fix For: 1.1.1, 1.2.0 After successfully running a Spark application, the history web UI displays incorrectly: App Name: Not Started Started: 1970/01/01 07:59:59 Spark User: Not Started Last Updated: 2014/10/10 14:50:39 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3826) enable hive-thriftserver support hive-0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3826: --- Affects Version/s: (was: 1.1.1) 1.1.0 enable hive-thriftserver support hive-0.13.1 Key: SPARK-3826 URL: https://issues.apache.org/jira/browse/SPARK-3826 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Currently hive-thriftserver does not support hive-0.13; make it support both 0.12 and 0.13. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3899) wrong links in streaming doc
wangfei created SPARK-3899: -- Summary: wrong links in streaming doc Key: SPARK-3899 URL: https://issues.apache.org/jira/browse/SPARK-3899 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166593#comment-14166593 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], Thank you for your comments and advice. {quote} Ok, first off, let me make sure I understand what you're doing. You start with 2 centers. You assign all the points. You then apply KMeans recursively to each cluster, splitting each center into 2 centers. Each instance of KMeans stops when the error is below a certain value or a fixed number of iterations have been run. {quote} You are right. The algorithm runs as you said. {quote} I think your analysis of the overall run time is good and probably what we expect. Can you break down the timing to see which parts are the most expensive? Maybe we can figure out where to optimize it. {quote} OK. I will measure the execution time of the parts of the implementation. {quote} 1. It might be good to convert everything to Breeze vectors before you do any operations – you need to convert the same vectors over and over again. KMeans converts them at the beginning and converts the vectors for the centers back at the end. {quote} I agree with you. I am struggling with this problem. After training the model, the user will likely want to select the data in a cluster, which is a subset of the whole input data. I think there are three approaches to realize this, as below. # We extract the centers and their `RDD \[Vector\]` data in a cluster during the training, like my implementation. # We extract the centers and their `RDD\[BV\[Double\]\]` data, and then convert the data into `RDD\[Vector\]` at the end. Converting from Breeze vectors to Spark vectors is very slow; that's why we didn't implement it. # We only extract the centers through the training, not their data. And then we apply the trained model to the input data with the `predict` method, like scikit-learn, in order to extract the part of the data in each cluster. This seems to be good. We would have to save the `RDD\[BV\[Double\]\]` data of each cluster throughout the clustering. Because we extract the `RDD\[Vector\]` data of each cluster after the training, I am worried that keeping the `RDD\[BV\[Double\]\]` data throughout the clustering is wasteful. And I am unsure how to elegantly save the data during the clustering. {quote} 2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, look into using a broadcast variable. See the latest KMeans implementation. This could improve performance by 10%+. 3. You may want to look into using reduceByKey or similar RDD operations – they will enable parallel reductions which will be faster than a loop on the master. {quote} I will give it a try. Thanks! Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment.
Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
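To make suggestions 2 and 3 in the comment above concrete, the assignment step can broadcast the centers and use reduceByKey for per-cluster partial sums. The sketch below is illustrative only; the names and the Breeze-based point representation are assumptions, not the actual patch.
{code}
import breeze.linalg.{squaredDistance, DenseVector => BDV}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits for reduceByKey
import org.apache.spark.rdd.RDD

// Sketch: assign each point to its closest center and compute per-cluster
// sums and counts in parallel, broadcasting the centers instead of capturing
// them in a closest-center-finder object.
object HierarchicalKMeansSketch {
  def assignAndSum(
      sc: SparkContext,
      data: RDD[BDV[Double]],
      centers: Array[BDV[Double]]): Map[Int, (BDV[Double], Long)] = {
    val bcCenters = sc.broadcast(centers) // shipped once per executor
    data.map { point =>
      val cs = bcCenters.value
      val closest = cs.indices.minBy(i => squaredDistance(point, cs(i)))
      (closest, (point, 1L))
    }.reduceByKey { case ((sum1, n1), (sum2, n2)) =>
      (sum1 + sum2, n1 + n2) // parallel partial reduction, no loop on the driver
    }.collectAsMap().toMap
  }
}
{code}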
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166705#comment-14166705 ] Oleg Zhurakousky commented on SPARK-3561: - Patrick, I think there is a misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core, and the existing pull request (https://github.com/apache/spark/pull/2422) clearly shows it. What I am proposing is to expose an integration point in Spark by means of extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution environments, but also to be easier to unit-test, since it would provide a clear separation between the _assembly_ and _execution_ layers, allowing them to be tested in isolation. I think this feature would benefit Spark tremendously, particularly given that several folks have already expressed their interest in this feature/direction. I appreciate your help and advice in getting this contribution into Spark. Thanks! Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166705#comment-14166705 ] Oleg Zhurakousky edited comment on SPARK-3561 at 10/10/14 12:10 PM: Patrick, I think there is misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core and existing pull request (https://github.com/apache/spark/pull/2422) clearly shows it. What I am proposing is to expose an integration point in Spark by means of extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution contexts, but it would also be very useful in unit-testing as it would provide a clear separation between _assembly_ and _execution_ layer allowing them to be tested in isolation. I think this feature would benefit Spark tremendously; particularly given how several folks have already expressed their interest in this feature/direction. Appreciate your help and advise in helping to get this contribution into Spark. Thanks! was (Author: ozhurakousky): Patrick, I think there is misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core and existing pull request (https://github.com/apache/spark/pull/2422) clearly shows it. What I am proposing is to expose an integration point in Spark by means of extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution environments, but it would also be very useful in unit-testing as it would provide a clear separation between _assembly_ and _execution_ layer allowing them to be tested in isolation. I think this feature would benefit Spark tremendously; particularly given how several folks have already expressed their interest in this feature/direction. Appreciate your help and advise in helping to get this contribution into Spark. Thanks! Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. 
An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
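As a rough illustration of the trait described in this proposal, a simplified sketch is below; the signatures are assumptions reduced to their essence, and the real definitions live in the linked pull request and design doc.
{code}
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Sketch of a pluggable execution strategy mirroring the four SparkContext
// operations named in the proposal (signatures simplified for illustration).
// A default implementation would delegate to the existing SparkContext code
// paths; an integrator could supply another one via a master URL of the form
// execution-context:foo.bar.MyJobExecutionContext.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T], func: Iterator[T] => U): Array[U]
}
{code}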
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166724#comment-14166724 ] Venkata Ramana G commented on SPARK-3892: - Can you explain in detail? Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166754#comment-14166754 ] Adrian Wang commented on SPARK-3892: Of course. We are using the `.typeName` method to build the formatted string and the JSON serialization, but in MapType it turns out to be `simpleName`; I assume it is a typo. The `simpleName` function is never used. [~lian cheng] Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166812#comment-14166812 ] Cheng Lian commented on SPARK-3892: --- Actually {{MapType.simpleName}} can simply be removed; it's not used anywhere. I forgot to remove it while refactoring. {{DataType.typeName}} is defined as: {code} def typeName: String = this.getClass.getSimpleName.stripSuffix("$").dropRight(4).toLowerCase {code} So concrete {{DataType}} classes don't need to override {{typeName}} as long as their name ends with {{Type}}. Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
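In other words, the default derivation strips a trailing {{$}} (for companion objects), drops the {{Type}} suffix, and lower-cases the rest; a tiny illustration of the string manipulation:
{code}
// Hypothetical REPL session showing how the default typeName is derived.
"MapType$".stripSuffix("$").dropRight(4).toLowerCase   // => "map"
"StructType".stripSuffix("$").dropRight(4).toLowerCase // => "struct"
{code}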
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166822#comment-14166822 ] Cheng Lian commented on SPARK-3892: --- [~adrian-wang] You're right, it's a typo. So would you mind changing the priority of this ticket to Minor? Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-3892: --- Priority: Minor (was: Major) Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-3892: --- Issue Type: Improvement (was: Bug) Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-3892: --- Summary: Map type do not need simpleName (was: Map type should have typeName) Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3876) Doing a RDD map/reduce within a DStream map fails with a high enough input rate
[ https://issues.apache.org/jira/browse/SPARK-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166832#comment-14166832 ] Saisai Shao commented on SPARK-3876: Hi [~afilip], is there any specific reason you need to do an RDD's map and reduce operations inside a DStream's map function? I don't think this code can work and be executed correctly on the remote side. This code translates into an RDD transformation in each batch duration, like: rdd.map { r => rdd1.map(c => op(c, r)).reduce(...) }.foreach(...) Since an RDD's transformations have to be divided into stages on the driver side and executed on the executor side, using an RDD remotely inside a closure will produce an error. If you want to use this RDD as a lookup table, you can build a local hashmap and broadcast it to the remote side for lookups. So maybe this is not a bug. Doing a RDD map/reduce within a DStream map fails with a high enough input rate --- Key: SPARK-3876 URL: https://issues.apache.org/jira/browse/SPARK-3876 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Andrei Filip Having a custom receiver that generates random strings at custom rates: JavaRandomSentenceReceiver A class that does work on a received string: class LengthGetter implements Serializable{ public int getStrLength(String s){ return s.length(); } } The following code: List<LengthGetter> objList = Arrays.asList(new LengthGetter(), new LengthGetter(), new LengthGetter()); final JavaRDD<LengthGetter> objRdd = sc.parallelize(objList); JavaInputDStream<String> sentences = jssc.receiverStream(new JavaRandomSentenceReceiver(frequency)); sentences.map(new Function<String, Integer>() { @Override public Integer call(final String input) throws Exception { Integer res = objRdd.map(new Function<LengthGetter, Integer>() { @Override public Integer call(LengthGetter lg) throws Exception { return lg.getStrLength(input); } }).reduce(new Function2<Integer, Integer, Integer>() { @Override public Integer call(Integer left, Integer right) throws Exception { return left + right; } }); return res; } }).print(); fails for high enough frequencies with the following stack trace: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3.0:0 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.NullPointerException org.apache.spark.rdd.RDD.map(RDD.scala:270) org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:72) org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:29) Other information that might be useful is that my current batch duration is set to 1sec and the frequencies for JavaRandomSentenceReceiver at which the application fails are as low as 2Hz (1Hz for example works) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
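For illustration, the suggested workaround looks roughly like the sketch below: the small lookup data is broadcast, and only the broadcast value is referenced inside the DStream's map closure. The names are placeholders standing in for the reporter's LengthGetter objects, not API from this ticket.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream

// Sketch: instead of calling map/reduce on another RDD inside the DStream's
// map function (RDD transformations must be planned on the driver), broadcast
// the small driver-local collection and use it in the closure.
object BroadcastLookupSketch {
  def sumLengths(
      sc: SparkContext,
      sentences: DStream[String],
      lengthGetters: Seq[String => Int]): Unit = {
    val bcGetters = sc.broadcast(lengthGetters) // small, driver-local data
    sentences.map { input =>
      bcGetters.value.map(getter => getter(input)).sum // no nested RDD operations
    }.print()
  }
}
{code}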
[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166838#comment-14166838 ] Ravindra Pesala commented on SPARK-3880: There is already some work going on in the direction of adding foreign data sources to Spark SQL: https://github.com/apache/spark/pull/2475. So I guess HBase is also a foreign data source, and it should fit into this design. Adding a new project/context for each data source may be cumbersome to maintain. Can we improve on the current PR to add DDL support? HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166840#comment-14166840 ] Adrian Wang commented on SPARK-3892: Yeah. Actually the original method is called simpleString, and now we have typeName. Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166848#comment-14166848 ] Cheng Lian commented on SPARK-3892: --- Ah, while working on the {{DataType}} JSON ser/de PR ([#2563|https://github.com/apache/spark/pull/2563]), I had at one point refactored {{simpleString}} to {{simpleName}}, and eventually arrived at the current version and removed all overrides from sub-classes. {{MapType.simpleName}} was not removed partly because it's a member of {{object MapType}}, which is not a subclass of {{DataType}}. Sorry for the trouble and confusion. Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166856#comment-14166856 ] Adrian Wang commented on SPARK-3892: Thanks for the explanation! I have created PR #2747 to change simpleName to typeName. Maybe it is also useful, since we defined this in object MapType; for class MapType, we already have the default one... Did I do anything wrong here? Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166886#comment-14166886 ] Cheng Lian commented on SPARK-3892: --- Please see my comments in the PR :) Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167006#comment-14167006 ] Yan commented on SPARK-3880: The new context is intended to be very lightweight. We noticed that SparkSQL is a very active project and there have been talks/JIRAs about SQLContext and data sources. As mentioned in the design, we are aware of the PR and the need to have a universal mechanism to access different types of data stores; we will keep a close watch on the latest movements and will definitely fit our efforts to those latest features and interfaces when they are ready and reasonably stable. In the meantime, the design is intended to be heavy on the HBase-specific data model, data access mechanisms and query optimizations, and to keep the integration part lightweight so it can be easily adjusted to future changes. The point is that we need to find some compromise between a rapidly changing project and the need to have a more or less stable context to base a new feature on. Chasing a constantly moving target is never easy, I guess. HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3900) ApplicationMaster's shutdown hook fails and IllegalStateException is thrown.
[ https://issues.apache.org/jira/browse/SPARK-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3900: -- Summary: ApplicationMaster's shutdown hook fails and IllegalStateException is thrown. (was: ApplicationMaster's shutdown hook fails to cleanup staging directory.) ApplicationMaster's shutdown hook fails and IllegalStateException is thrown. Key: SPARK-3900 URL: https://issues.apache.org/jira/browse/SPARK-3900 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Hadoop 0.23 Reporter: Kousuke Saruta Priority: Critical ApplicationMaster registers a shutdown hook and it calls ApplicationMaster#cleanupStagingDir. cleanupStagingDir invokes FileSystem.get(yarnConf) and it invokes FileSystem.getInternal. FileSystem.getInternal also registers shutdown hook. In FileSystem of hadoop 0.23, the shutdown hook registration does not consider whether shutdown is in progress or not (In 2.2, it's considered). {code} // 0.23 if (map.isEmpty() ) { ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY); } {code} {code} // 2.2 if (map.isEmpty() !ShutdownHookManager.get().isShutdownInProgress()) { ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY); } {code} Thus, in 0.23, another shutdown hook can be registered when ApplicationMaster's shutdown hook run. This issue cause IllegalStateException as follows. {code} java.lang.IllegalStateException: Shutdown in progress, cannot add a shutdownHook at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3900) ApplicationMaster's shutdown hook fails to cleanup staging directory.
Kousuke Saruta created SPARK-3900: - Summary: ApplicationMaster's shutdown hook fails to cleanup staging directory. Key: SPARK-3900 URL: https://issues.apache.org/jira/browse/SPARK-3900 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Hadoop 0.23 Reporter: Kousuke Saruta Priority: Critical ApplicationMaster registers a shutdown hook and it calls ApplicationMaster#cleanupStagingDir. cleanupStagingDir invokes FileSystem.get(yarnConf) and it invokes FileSystem.getInternal. FileSystem.getInternal also registers shutdown hook. In FileSystem of hadoop 0.23, the shutdown hook registration does not consider whether shutdown is in progress or not (In 2.2, it's considered). {code} // 0.23 if (map.isEmpty() ) { ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY); } {code} {code} // 2.2 if (map.isEmpty() !ShutdownHookManager.get().isShutdownInProgress()) { ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY); } {code} Thus, in 0.23, another shutdown hook can be registered when ApplicationMaster's shutdown hook run. This issue cause IllegalStateException as follows. {code} java.lang.IllegalStateException: Shutdown in progress, cannot add a shutdownHook at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
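One way to sidestep the failure, assuming the staging directory path is known before the hook is registered, is to resolve the FileSystem eagerly so that FileSystem.get (and its own shutdown-hook registration) never runs while shutdown is already in progress. The sketch below only illustrates the idea and is not the actual patch.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: obtain the FileSystem before any shutdown hook fires, so the hook
// itself never calls FileSystem.get (which on Hadoop 0.23 tries to register
// another hook and throws IllegalStateException during shutdown).
class StagingDirCleaner(yarnConf: Configuration, stagingDir: Path) {
  private val fs: FileSystem = FileSystem.get(yarnConf) // resolved eagerly

  def registerShutdownHook(): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread {
      override def run(): Unit = {
        fs.delete(stagingDir, true) // no FileSystem.get inside the hook
      }
    })
  }
}
{code}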
[jira] [Commented] (SPARK-3795) Add scheduler hooks/heuristics for adding and removing executors
[ https://issues.apache.org/jira/browse/SPARK-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167131#comment-14167131 ] Nan Zhu commented on SPARK-3795: Is this for YARN or standalone? Add scheduler hooks/heuristics for adding and removing executors Key: SPARK-3795 URL: https://issues.apache.org/jira/browse/SPARK-3795 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Andrew Or To support dynamic scaling of a Spark application, Spark's scheduler will need to have hooks around explicitly decommissioning executors. We'll also need basic heuristics governing when to start/stop executors based on load. An initial goal is to keep this very simple. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang resolved SPARK-3845. -- Resolution: Fixed Fix Version/s: 1.2.0 SQLContext(...) should inherit configurations from SparkContext --- Key: SPARK-3845 URL: https://issues.apache.org/jira/browse/SPARK-3845 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Fix For: 1.2.0 It's very confusing that Spark configurations (e.g. spark.serializer, spark.speculation, etc.) can be set in the spark-defaults.conf file, while SparkSQL configurations (e.g. spark.sql.inMemoryColumnarStorage.compressed, spark.sql.codegen, etc.) have to be set either via sqlContext.setConf or sql("SET ..."). When I do: val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext) I would expect sqlContext to recognize all the SQL configurations that come with sparkContext. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
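For illustration, the requested behavior is that spark.sql.* settings placed on the SparkConf (or in spark-defaults.conf) become visible through the SQLContext built from that SparkContext. The keys and values in the sketch below are examples only.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch of the expected behavior: a spark.sql.* key set alongside core Spark
// options should be picked up by the SQLContext created from the SparkContext.
object SqlConfInheritanceExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sql-conf-inheritance")
      .setMaster("local[2]")
      .set("spark.sql.codegen", "true") // SQL option set like any other Spark option
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // With the fix, this reflects the value from the SparkConf, not the default.
    println(sqlContext.getConf("spark.sql.codegen", "false"))
    sc.stop()
  }
}
{code}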
[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark
[ https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167144#comment-14167144 ] Nan Zhu commented on SPARK-2962: Hi, [~mrid...@yahoo-inc.com] I think this has been fixed in https://github.com/apache/spark/pull/1313/files, {code:title=TaskSetManager.scala|borderStyle=solid} if (tasks(index).preferredLocations == Nil) { addTo(pendingTasksWithNoPrefs) } {code} Now, only tasks without explicit preference is added to pendingTasksWithNoPrefs, and NO_PREF tasks are always scheduled after NODE_LOCAL Suboptimal scheduling in spark -- Key: SPARK-2962 URL: https://issues.apache.org/jira/browse/SPARK-2962 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: All Reporter: Mridul Muralidharan In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs are always scheduled with PROCESS_LOCAL pendingTasksWithNoPrefs contains tasks which currently do not have any alive locations - but which could come in 'later' : particularly relevant when spark app is just coming up and containers are still being added. This causes a large number of non node local tasks to be scheduled incurring significant network transfers in the cluster when running with non trivial datasets. The comment // Look for no-pref tasks after rack-local tasks since they can run anywhere. is misleading in the method code : locality levels start from process_local down to any, and so no prefs get scheduled much before rack. Also note that, currentLocalityIndex is reset to the taskLocality returned by this method - so returning PROCESS_LOCAL as the level will trigger wait times again. (Was relevant before recent change to scheduler, and might be again based on resolution of this issue). Found as part of writing test for SPARK-2931 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167152#comment-14167152 ] Burak Yavuz commented on SPARK-3434: [~ConcreteVitamin], any updates? Anything I can help out with? Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
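To make the "single RDD of sub-matrices" option in question 1 concrete, a rough sketch of what such a layout could look like. The types, block size, and keying scheme are assumptions for illustration (assuming an existing SparkContext sc), not a proposed design.
{code}
// Rough sketch of the single-RDD option: each element is
// ((blockRowIndex, blockColIndex), localSubMatrix).
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

val blocksPerSide = 4
val blockSize = 100

// Keyed by block coordinates so a custom Partitioner could later keep the
// blocks needed for multiplication or factorization co-located.
val blocks: RDD[((Int, Int), Matrix)] = sc.parallelize(
  for (i <- 0 until blocksPerSide; j <- 0 until blocksPerSide) yield
    ((i, j), Matrices.dense(blockSize, blockSize, new Array[Double](blockSize * blockSize)))
)
{code}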
[jira] [Commented] (SPARK-3823) Spark Hive SQL readColumn is not reset each time for a new query
[ https://issues.apache.org/jira/browse/SPARK-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167239#comment-14167239 ] Ravindra Pesala commented on SPARK-3823: It seems this issue is duplicate of https://issues.apache.org/jira/browse/SPARK-3559 Spark Hive SQL readColumn is not reset each time for a new query Key: SPARK-3823 URL: https://issues.apache.org/jira/browse/SPARK-3823 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu After a few queries running in the same hiveContext, hive.io.file.readcolumn.ids and hive.io.file.readcolumn.names values are added on by pre-running queries. e.g. running the following querys {code} hql(use sql_integration_ks) val container = hql(select * from double_table as aa JOIN boolean_table as bb on aa.type_id = bb.type_id) container.collect().foreach(println) val container = hql(select * from ascii_table ORDER BY type_id) container.collect().foreach(println) val container = hql(select shippers.shippername, COUNT(orders.orderid) AS numorders FROM orders LEFT JOIN shippers ON orders.shipperid=shippers.shipperid GROUP BY shippername) container.collect().foreach(println) val container = hql(select * from ascii_table where type_id 126) container.collect().length {code} The read column ids for the last query are [2, 0, 3, 1] read column names are : type_id,value,type_id,value,type_id,value,orderid,shipperid,shipper name, shipperid The source code is at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala#L80 hiveContext has a shared hiveconf which add readColumns for each query. It should be reset each time for a new hive query or remove the duplicate readColumn Ids -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
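A minimal sketch of the reset suggested at the end of the description. The two property names come from the report; clearing them to an empty string before each query is an assumption about the fix direction, not the actual patch.
{code}
import org.apache.hadoop.hive.conf.HiveConf

// Sketch only: clear the accumulated column-pruning state on the shared
// HiveConf before planning a new query, so ids/names do not leak across queries.
def resetReadColumns(hiveconf: HiveConf): Unit = {
  hiveconf.set("hive.io.file.readcolumn.ids", "")
  hiveconf.set("hive.io.file.readcolumn.names", "")
}
{code}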
[jira] [Commented] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167303#comment-14167303 ] Mridul Muralidharan commented on SPARK-3889: The status says fixed - what was done to resolve this ? I did not see a PR ... JVM dies with SIGBUS, resulting in ConnectionManager failed ACK --- Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.2.0 Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multipleid=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[T], Array[T]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics[T](predictionAndLabels: RDD[(Array[T], Array[T])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /** * @param k the position to compute the truncated precision * @return the average precision at the first k ranking positions */ def precision(k: Int): Double /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcgAtK: RDD[Array[Double]] /** * @param k the position to compute the truncated ndcg * @return the average ndcg at the first k ranking positions */ def ndcg(k: Int): Double } {code} was: Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[T], Array[T]]*. 
The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics[T](predictionAndLabels: RDD[(Array[T], Array[T])]) { /* Returns the precision@k for each query */ lazy val precAtK: RDD[Array[Double]] /** * @param k the position to compute the truncated precision * @return the average precision at the first k ranking positions */ def precision(k: Int): Double /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /* Returns the mean average precision (MAP) of all the queries */ lazy val meanAvePrec: Double /* Returns the normalized discounted cumulative gain for each query */ lazy val ndcgAtK: RDD[Array[Double]] /** * @param k the position to compute the truncated ndcg * @return the average ndcg at the first k ranking positions */ def ndcg(k: Int): Double } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
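A short usage sketch of the proposed API above (assuming an existing SparkContext sc). The class is still a proposal, so the names may change; the data is made up.
{code}
// Hypothetical usage of the proposed RankingMetrics class; the input is an
// RDD of (predicted ranking, ground-truth relevant items) pairs.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5)),
  (Array(4, 1, 5, 6, 2), Array(1, 2, 3))
))

val metrics = new RankingMetrics(predictionAndLabels)
println(metrics.precision(5))   // precision@5 averaged over the queries
println(metrics.ndcg(5))        // NDCG@5 averaged over the queries
println(metrics.meanAvePrec)    // mean average precision (MAP)
{code}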
[jira] [Commented] (SPARK-3795) Add scheduler hooks/heuristics for adding and removing executors
[ https://issues.apache.org/jira/browse/SPARK-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167457#comment-14167457 ] Andrew Or commented on SPARK-3795: -- It's agnostic to the cluster manager, but for now we will focus on Yarn (SPARK-3822). Later we will do the same for standalone and mesos. Add scheduler hooks/heuristics for adding and removing executors Key: SPARK-3795 URL: https://issues.apache.org/jira/browse/SPARK-3795 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Andrew Or To support dynamic scaling of a Spark application, Spark's scheduler will need to have hooks around explicitly decommissioning executors. We'll also need basic heuristics governing when to start/stop executors based on load. An initial goal is to keep this very simple. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167478#comment-14167478 ] Shivaram Venkataraman commented on SPARK-3434: -- ~brkyvz -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here. Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167478#comment-14167478 ] Shivaram Venkataraman edited comment on SPARK-3434 at 10/10/14 8:45 PM: [~brkyvz] -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here. was (Author: shivaram): ~brkyvz -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here. Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3886) Choose the batch size of serializer based on size of object
[ https://issues.apache.org/jira/browse/SPARK-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3886. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2740 [https://github.com/apache/spark/pull/2740] Choose the batch size of serializer based on size of object --- Key: SPARK-3886 URL: https://issues.apache.org/jira/browse/SPARK-3886 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 The default batch size (1024) may not work for huge objects, so it's better to choose the proper size based on the size of objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
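To illustrate the heuristic being described (this is a generic sketch of the idea, not the actual change in pull request 2740), batching can adapt to the observed serialized size instead of always using a fixed 1024 objects. The target size below is an arbitrary assumption.
{code}
// Illustrative sketch: grow or shrink the number of objects per serialized
// batch based on how many bytes the previous batch produced.
def nextBatchSize(currentBatchSize: Int, lastBatchBytes: Long,
                  targetBytes: Long = 64 * 1024): Int = {
  if (lastBatchBytes > 2 * targetBytes) math.max(1, currentBatchSize / 2)
  else if (lastBatchBytes < targetBytes / 2) currentBatchSize * 2
  else currentBatchSize
}

// Huge objects quickly drive the batch size down; tiny objects let it grow.
println(nextBatchSize(1024, lastBatchBytes = 16L * 1024 * 1024)) // 512
println(nextBatchSize(1024, lastBatchBytes = 8L * 1024))         // 2048
{code}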
[jira] [Updated] (SPARK-3886) Choose the batch size of serializer based on size of object
[ https://issues.apache.org/jira/browse/SPARK-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3886: -- Affects Version/s: 1.0.2 1.1.0 Choose the batch size of serializer based on size of object --- Key: SPARK-3886 URL: https://issues.apache.org/jira/browse/SPARK-3886 Project: Spark Issue Type: Improvement Affects Versions: 1.0.2, 1.1.0 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 The default batch size (1024) maybe will not work for huge objects, so it's better to choose the proper size based on the size of objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3901) Add SocketSink capability for Spark metrics
Sreepathi Prasanna created SPARK-3901: - Summary: Add SocketSink capability for Spark metrics Key: SPARK-3901 URL: https://issues.apache.org/jira/browse/SPARK-3901 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.0.0 Reporter: Sreepathi Prasanna Priority: Minor Fix For: 1.1.1 Spark depends on the Coda Hale metrics library to collect metrics. Today we can send metrics to console, CSV and JMX. We use Chukwa as a monitoring framework to monitor the Hadoop services. To extend the framework to collect Spark metrics, we need an additional SocketSink capability, which is not there at the moment in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3901) Add SocketSink capability for Spark metrics
[ https://issues.apache.org/jira/browse/SPARK-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167541#comment-14167541 ] Sreepathi Prasanna commented on SPARK-3901: --- For this, we need a SocketReporter class in Coda Hale, which I have submitted a request for: https://github.com/dropwizard/metrics/pull/685 Once this is reviewed and merged into Coda Hale, we can use a SocketSink class to send the metrics over a socket. Add SocketSink capability for Spark metrics --- Key: SPARK-3901 URL: https://issues.apache.org/jira/browse/SPARK-3901 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Sreepathi Prasanna Priority: Minor Fix For: 1.1.1 Original Estimate: 48h Remaining Estimate: 48h Spark depends on the Coda Hale metrics library to collect metrics. Today we can send metrics to console, CSV and JMX. We use Chukwa as a monitoring framework to monitor the Hadoop services. To extend the framework to collect Spark metrics, we need an additional SocketSink capability, which is not there at the moment in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
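As a rough illustration of what "sending metrics over a socket" means here, a self-contained sketch follows. It is generic code, not the pending SocketReporter/SocketSink patch; the record format, host, and port are placeholders.
{code}
import java.io.PrintWriter
import java.net.Socket

// Generic sketch: push newline-delimited "name value timestamp" records to a
// collector (e.g. a Chukwa adaptor listening on a TCP port).
def reportOnce(host: String, port: Int, metrics: Map[String, Double]): Unit = {
  val socket = new Socket(host, port)
  val out = new PrintWriter(socket.getOutputStream, true)
  try {
    val now = System.currentTimeMillis()
    metrics.foreach { case (name, value) => out.println(s"$name $value $now") }
  } finally {
    out.close()
    socket.close()
  }
}
{code}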
[jira] [Commented] (SPARK-3901) Add SocketSink capability for Spark metrics
[ https://issues.apache.org/jira/browse/SPARK-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167550#comment-14167550 ] Sreepathi Prasanna commented on SPARK-3901: --- I have the patch ready, but it will not work unless we have the SocketReporter in Coda Hale. Add SocketSink capability for Spark metrics --- Key: SPARK-3901 URL: https://issues.apache.org/jira/browse/SPARK-3901 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Sreepathi Prasanna Priority: Minor Fix For: 1.1.1 Original Estimate: 48h Remaining Estimate: 48h Spark depends on the Coda Hale metrics library to collect metrics. Today we can send metrics to console, CSV and JMX. We use Chukwa as a monitoring framework to monitor the Hadoop services. To extend the framework to collect Spark metrics, we need an additional SocketSink capability, which is not there at the moment in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3902) Stabilize AsyncRDDActions and expose its methods in Java API
Josh Rosen created SPARK-3902: - Summary: Stabilize AsyncRDDActions and expose its methods in Java API Key: SPARK-3902 URL: https://issues.apache.org/jira/browse/SPARK-3902 Project: Spark Issue Type: New Feature Components: Java API, Spark Core Reporter: Josh Rosen The AsyncRDDActions methods are currently the easiest way to determine Spark jobs' ids for use in progress-monitoring code (see SPARK-2636). AsyncRDDActions is currently marked as {{@Experimental}}; for 1.2, I think that we should stabilize this API and expose it in Java, too. One concern is whether there's a better async API design that we should prefer over this one as our stable API; I had some ideas for a more general API in SPARK-3626 (discussed in much greater detail on GitHub: https://github.com/apache/spark/pull/2482) but decided against the more general API due to its confusing cancellation semantics. Given this, I'd be comfortable stabilizing our current API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
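For context, a small sketch of the existing (still {{@Experimental}}) async actions this ticket proposes to stabilize, assuming an existing SparkContext sc. The import brings in the implicit conversion to AsyncRDDActions in current Spark versions.
{code}
// Sketch of the current experimental async API.
import org.apache.spark.SparkContext._   // implicit conversion to AsyncRDDActions
import scala.concurrent.Await
import scala.concurrent.duration._

val rdd = sc.parallelize(1 to 1000000, 100)

// countAsync returns a FutureAction, which can be cancelled; this is one
// reason these methods are the easiest current hook for monitoring code.
val future = rdd.countAsync()
// future.cancel()

println(Await.result(future, 10.minutes))
{code}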
[jira] [Commented] (SPARK-3626) Replace AsyncRDDActions with a more general async. API
[ https://issues.apache.org/jira/browse/SPARK-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167571#comment-14167571 ] Josh Rosen commented on SPARK-3626: --- I've opened SPARK-3902 to discuss stabilizing our current AsyncRDDActions APIs. Replace AsyncRDDActions with a more general async. API -- Key: SPARK-3626 URL: https://issues.apache.org/jira/browse/SPARK-3626 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen The experimental AsyncRDDActions APIs seem to only exist in order to enable job cancellation. We've been considering extending these APIs to support progress monitoring, but this would require stabilizing them so they're no longer {{@Experimental}}. Instead, I propose to replace all of the AsyncRDDActions with a mechanism based on job groups which allows arbitrary computations to be run in job groups and supports cancellation / monitoring of Spark jobs launched from those computations. (full design pending; see my GitHub PR for more details). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167647#comment-14167647 ] Aaron Davidson commented on SPARK-3889: --- Sorry, it was not linked: https://github.com/apache/spark/pull/2742 JVM dies with SIGBUS, resulting in ConnectionManager failed ACK --- Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.2.0 Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multipleid=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3855) Binding Exception when running PythonUDFs
[ https://issues.apache.org/jira/browse/SPARK-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3855: --- Component/s: PySpark Binding Exception when running PythonUDFs - Key: SPARK-3855 URL: https://issues.apache.org/jira/browse/SPARK-3855 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Michael Armbrust {code} from pyspark import * from pyspark.sql import * sc = SparkContext() sqlContext = SQLContext(sc) sqlContext.registerFunction(strlen, lambda string: len(string)) sqlContext.inferSchema(sc.parallelize([Row(a=test)])).registerTempTable(test) srdd = sqlContext.sql(SELECT strlen(a) FROM test WHERE strlen(a) 1) print srdd._jschema_rdd.baseSchemaRDD().queryExecution().toString() print srdd.collect() {code} output: {code} == Parsed Logical Plan == Project ['strlen('a) AS c0#1] Filter ('strlen('a) 1) UnresolvedRelation None, test, None == Analyzed Logical Plan == Project [c0#1] Project [pythonUDF#2 AS c0#1] EvaluatePython PythonUDF#strlen(a#0) Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) CAST(1, DoubleType)) EvaluatePython PythonUDF#strlen(a#0) SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525) == Optimized Logical Plan == Project [pythonUDF#2 AS c0#1] EvaluatePython PythonUDF#strlen(a#0) Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) 1.0) EvaluatePython PythonUDF#strlen(a#0) SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525) == Physical Plan == Project [pythonUDF#2 AS c0#1] BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#5] Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) 1.0) BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#3] ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525 Code Generation: false == RDD == 14/10/08 15:03:00 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 9) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#2 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:46) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:46) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at
[jira] [Updated] (SPARK-3855) Binding Exception when running PythonUDFs
[ https://issues.apache.org/jira/browse/SPARK-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3855: --- Component/s: SQL Binding Exception when running PythonUDFs - Key: SPARK-3855 URL: https://issues.apache.org/jira/browse/SPARK-3855 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Michael Armbrust {code} from pyspark import * from pyspark.sql import * sc = SparkContext() sqlContext = SQLContext(sc) sqlContext.registerFunction(strlen, lambda string: len(string)) sqlContext.inferSchema(sc.parallelize([Row(a=test)])).registerTempTable(test) srdd = sqlContext.sql(SELECT strlen(a) FROM test WHERE strlen(a) 1) print srdd._jschema_rdd.baseSchemaRDD().queryExecution().toString() print srdd.collect() {code} output: {code} == Parsed Logical Plan == Project ['strlen('a) AS c0#1] Filter ('strlen('a) 1) UnresolvedRelation None, test, None == Analyzed Logical Plan == Project [c0#1] Project [pythonUDF#2 AS c0#1] EvaluatePython PythonUDF#strlen(a#0) Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) CAST(1, DoubleType)) EvaluatePython PythonUDF#strlen(a#0) SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525) == Optimized Logical Plan == Project [pythonUDF#2 AS c0#1] EvaluatePython PythonUDF#strlen(a#0) Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) 1.0) EvaluatePython PythonUDF#strlen(a#0) SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525) == Physical Plan == Project [pythonUDF#2 AS c0#1] BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#5] Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) 1.0) BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#3] ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525 Code Generation: false == RDD == 14/10/08 15:03:00 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 9) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#2 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:46) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:46) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167868#comment-14167868 ] Aaron Staple commented on SPARK-1503: - [~mengxr] Thanks for the heads up! I’ll definitely go through TFOCS and am happy to work carefully and collaboratively on design. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3903) Create general data loading method for LabeledPoints
Joseph K. Bradley created SPARK-3903: Summary: Create general data loading method for LabeledPoints Key: SPARK-3903 URL: https://issues.apache.org/jira/browse/SPARK-3903 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3903) Create general data loading method for LabeledPoints
[ https://issues.apache.org/jira/browse/SPARK-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3903: - Description: Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile Return value: Order of RDDs follows the order of the paths. Note: This is named differently from loadLabeledPoints for 2 reasons: * different argument order (following loadLibSVMFile) * different return type was: Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile Create general data loading method for LabeledPoints Key: SPARK-3903 URL: https://issues.apache.org/jira/browse/SPARK-3903 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile Return value: Order of RDDs follows the order of the paths. Note: This is named differently from loadLabeledPoints for 2 reasons: * different argument order (following loadLibSVMFile) * different return type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3903) Create general data loading method for LabeledPoints
[ https://issues.apache.org/jira/browse/SPARK-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3903: - Assignee: Joseph K. Bradley Create general data loading method for LabeledPoints Key: SPARK-3903 URL: https://issues.apache.org/jira/browse/SPARK-3903 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile Return value: Order of RDDs follows the order of the paths. Note: This is named differently from loadLabeledPoints for 2 reasons: * different argument order (following loadLibSVMFile) * different return type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
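A hypothetical call against the proposed signature, to show how the multi-path loading would be used (assuming an existing SparkContext sc). The method does not exist yet and the paths are placeholders.
{code}
// Hypothetical usage of the proposed MLUtils.loadLabeledPointsFiles: the
// feature count is determined from a scan across *both* files, so train and
// test end up with consistent vector sizes.
import org.apache.spark.mllib.util.MLUtils

val Seq(training, test) = MLUtils.loadLabeledPointsFiles(
  sc,
  Seq("hdfs:///data/train.txt", "hdfs:///data/test.txt"))

println(s"train: ${training.count()} points, test: ${test.count()} points")
{code}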
[jira] [Updated] (SPARK-3898) History Web UI display incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zzc updated SPARK-3898: --- Fix Version/s: (was: 1.1.1) History Web UI display incorrectly. --- Key: SPARK-3898 URL: https://issues.apache.org/jira/browse/SPARK-3898 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: Spark 1.2.0-snapshot On Yarn Reporter: zzc Fix For: 1.2.0 After successfully run an spark application, history web ui display incorrectly: App Name:Not Started Started:1970/01/01 07:59:59 Spark User:Not Started Last Updated:2014/10/10 14:50:39 Exception message: 2014-10-10 14:51:14,284 - ERROR - org.apache.spark.Logging$class.logError(Logging.scala:96) - qtp1594785497-16851 -Exception in parsing Spark event log hdfs://wscluster/sparklogs/24.3g_15_5g_2c-1412923684977/EVENT_LOG_1 org.json4s.package$MappingException: Did not find value which can be converted into int at org.json4s.reflect.package$.fail(package.scala:96) at org.json4s.Extraction$.convert(Extraction.scala:554) at org.json4s.Extraction$.extract(Extraction.scala:331) at org.json4s.Extraction$.extract(Extraction.scala:42) at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21) at org.apache.spark.util.JsonProtocol$.blockManagerIdFromJson(JsonProtocol.scala:647) at org.apache.spark.util.JsonProtocol$.blockManagerAddedFromJson(JsonProtocol.scala:468) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:404) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$loadAppInfo(FsHistoryProvider.scala:181) at org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:99) at org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:55) at org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:53) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:88) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at
[jira] [Updated] (SPARK-3586) Support nested directories in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxj updated SPARK-3586: -- Issue Type: Bug (was: Improvement) Support nested directories in Spark Streaming - Key: SPARK-3586 URL: https://issues.apache.org/jira/browse/SPARK-3586 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: wangxj Priority: Minor Labels: patch Fix For: 1.1.0 For text files, there is the method streamingContext.textFileStream(dataDirectory). Spark Streaming will monitor the directory dataDirectory and process any files created in that directory, but files written into nested directories are not supported, e.g. streamingContext.textFileStream(/test). Look at the directory contents: /test/file1 /test/file2 /test/dr/file1 With this method, textFileStream can only read: /test/file1 /test/file2 /test/dr/ but the file /test/dr/file1 is not read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
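A minimal sketch of the call being discussed, with the reported limitation noted in comments. The directory name comes from the description; the batch interval and the assumption of an existing SparkContext sc are illustrative.
{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Only files created directly under /test are picked up today; a file written
// to the nested directory /test/dr is not seen by this stream.
val lines = ssc.textFileStream("/test")
lines.print()

ssc.start()
ssc.awaitTermination()
{code}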
[jira] [Updated] (SPARK-3867) ./python/run-tests failed when run with Python 2.6 and unittest2 is not installed
[ https://issues.apache.org/jira/browse/SPARK-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cocoatomo updated SPARK-3867: - Description: ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available. When using Python 2.6, it tries to import the unittest2 module, which is *not* a standard library in Python 2.6, so it fails with ImportError. commit: 1d72a30874a88bdbab75217f001cf2af409016e7 was: ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available. When using Python 2.6, it tries to import the unittest2 module, which is *not* a standard library in Python 2.6, so it fails with ImportError. ./python/run-tests failed when run with Python 2.6 and unittest2 is not installed Key: SPARK-3867 URL: https://issues.apache.org/jira/browse/SPARK-3867 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20 Reporter: cocoatomo Labels: pyspark, testing ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available. When using Python 2.6, it tries to import the unittest2 module, which is *not* a standard library in Python 2.6, so it fails with ImportError. commit: 1d72a30874a88bdbab75217f001cf2af409016e7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168018#comment-14168018 ] Davies Liu commented on SPARK-922: -- We do not use JSON heavily in PySpark, and users have several choices of JSON library in Python, so this should not be an issue, I think. We definitely need to upgrade to Python 2.7 (as the default); if some users need Python 2.6, it's easy to use it via PYSPARK_PYTHON. Update Spark AMI to Python 2.7 -- Key: SPARK-922 URL: https://issues.apache.org/jira/browse/SPARK-922 Project: Spark Issue Type: Task Components: EC2, PySpark Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0 Reporter: Josh Rosen Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org