Is there any way to select columns of a Dataset besides the combination of `expr` and `as`?
Hi all,

I have two questions about selecting columns of a Dataset. First, could you tell me if there is any way to select TypedColumn columns besides the combination of `expr` and `as`? Second, how can we alias a column like `expr("name").as[String]`?

I tried to select a column of a Dataset the same way as with a DataFrame, but I couldn't:

```
case class Person(id: Int, name: String)
val df = sc.parallelize(Seq((1, "Bob"), (2, "Tom"))).toDF("id", "name")
val ds = df.as[Person]

ds.select(expr("name").as[String]).show
+-----+
|value|
+-----+
|  Bob|
|  Tom|
+-----+

ds.select('id).show
:34: error: type mismatch;
 found   : Symbol
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select('id).show

ds.select($"id").show
:34: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select($"id").show

ds.select(ds("id")).show
:34: error: org.apache.spark.sql.Dataset[Person] does not take parameters
       ds.select(ds("id")).show

ds.select("id").show
:34: error: type mismatch;
 found   : String("id")
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select("id").show
```

Best,
Yu

--
Yu Ishikawa
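On the aliasing question, here is a hedged sketch of two possible workarounds; neither is confirmed against every Dataset version, so treat them as starting points rather than the definitive answer:

```scala
import org.apache.spark.sql.functions.expr

// Workaround sketch 1 (assumption: an alias set on the untyped Column
// survives the typed select; this may depend on the Spark version).
ds.select(expr("name").as("personName").as[String]).show()

// Workaround sketch 2: drop back to the untyped DataFrame API, where
// ordinary Column selection and aliasing are available, then re-type
// with .as[...] afterwards if needed.
ds.toDF().select($"id", $"name".as("personName")).show()
```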
How do we convert a Dataset that includes timestamp columns to an RDD?
```
he$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
  at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1064)
  at org.apache.spark.repl.Main$.main(Main.scala:31)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.sql.catalyst.util.DateTimeUtils$
Serialization stack:
  - object not serializable (class: org.apache.spark.sql.catalyst.util.DateTimeUtils$, value: org.apache.spark.sql.catalyst.util.DateTimeUtils$@216c782f)
  - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, name: staticObject, type: class java.lang.Object)
  - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, staticinvoke(org.apache.spark.sql.catalyst.util.DateTimeUtils$@216c782f,ObjectType(class java.sql.Timestamp),toJavaTimestamp,input[0, TimestampType],true))
  - writeObject data (class: scala.collection.immutable.$colon$colon)
  - object (class scala.collection.immutable.$colon$colon, List(staticinvoke(org.apache.spark.sql.catalyst.util.DateTimeUtils$@216c782f,ObjectType(class java.sql.Timestamp),toJavaTimestamp,input[0, TimestampType],true)))
  - field (class: org.apache.spark.sql.catalyst.expressions.NewInstance, name: arguments, type: interface scala.collection.Seq)
  - object (class org.apache.spark.sql.catalyst.expressions.NewInstance, newinstance(class $iwC$$iwC$TimestampExample,staticinvoke(org.apache.spark.sql.catalyst.util.DateTimeUtils$@216c782f,ObjectType(class java.sql.Timestamp),toJavaTimestamp,input[0, TimestampType],true),false,ObjectType(class $iwC$$iwC$TimestampExample),Some($iwC$$iwC@23b27380)))
  - field (class: org.apache.spark.sql.catalyst.encoders.ExpressionEncoder, name: fromRowExpression, type: class org.apache.spark.sql.catalyst.expressions.Expression)
  - object (class org.apache.spark.sql.catalyst.encoders.ExpressionEncoder, class[dt[0]: timestamp])
  - field (class: org.apache.spark.sql.Dataset, name: boundTEncoder, type: class org.apache.spark.sql.catalyst.encoders.ExpressionEncoder)
  - object (class org.apache.spark.sql.Dataset, [dt: timestamp])
  - field (class: org.apache.spark.sql.Dataset$$anonfun$rdd$1, name: $outer, type: class org.apache.spark.sql.Dataset)
  - object (class org.apache.spark.sql.Dataset$$anonfun$rdd$1, <function1>)
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
  ... 68 more
```

Thanks,
Yu

--
Yu Ishikawa
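For reference, a minimal sketch of the kind of code that seems to trigger this, inferred from the class name `TimestampExample` and the `dt: timestamp` field in the serialization stack above (not the exact original code, and it assumes the spark-shell implicits are in scope):

```scala
import java.sql.Timestamp

case class TimestampExample(dt: Timestamp)

val ds = Seq(TimestampExample(new Timestamp(System.currentTimeMillis())))
  .toDF("dt")
  .as[TimestampExample]

// Converting the Dataset back to an RDD serializes the encoder, which in
// turn drags in the non-serializable DateTimeUtils$ object per the trace.
ds.rdd.collect()
```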
Re: [ANNOUNCE] Announcing Spark 1.5.0
Great work, everyone!

--
Yu Ishikawa
[SparkR] lint script for SparkR
Hi all,

Shivaram and I added a lint script for SparkR, `dev/lint-r`, and it is already running on Jenkins. If there are any style violations in your patch, the Jenkins build will fail, so please make sure your patch passes the check on your local machine before sending a PR:

https://github.com/apache/spark/blob/master/dev/lint-r

We could also discuss the validation rules; I think there is still room for improvement. If you have any ideas, please join the discussion about the SparkR style guide:

https://issues.apache.org/jira/browse/SPARK-6813

Thanks Shivaram and Josh, I couldn't have done it without you.

Thanks,
Yu

--
Yu Ishikawa
Is `dev/lint-python` broken?
Hi all,

When I run `dev/lint-python` on the latest master branch, I get the following error. Is the lint script broken, or is there a problem with my environment?

```
$ ./dev/lint-python
./dev/lint-python: line 64: syntax error near unexpected token `>'
./dev/lint-python: line 64: `easy_install -d $PYLINT_HOME pylint==1.4.4 &>> $PYLINT_INSTALL_INFO'
```

If the redirect is the syntax error, I'll send a PR to fix it.

Thanks,
Yu

--
Yu Ishikawa
Re: Is `dev/lint-python` broken?
Hi Sean,

Thank you for answering my question. It seems that I was using an old version of bash, the default one on Mac OS X:

```
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.
```

Thanks,
Yu

--
Yu Ishikawa
Re: Is `dev/lint-python` broken?
I'm using OS X 10.10.4, and Xcode is version 6.4, so those aren't old. I guess the old bash version is causing the problem. I'll try installing another bash with brew.

--
Yu Ishikawa
Re: What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?
Hi Andrew,

I understand that there is no difference currently.

Thanks,
Yu

--
Yu Ishikawa
What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?
Hi all,

When we send a PR, it seems that two requests to run tests are sometimes sent to Jenkins. What is the difference between SparkPullRequestBuilder and SlowSparkPullRequestBuilder?

Thanks,
Yu

--
Yu Ishikawa
Re: [pyspark] What is the best way to run a minimal set of unit tests related to the module we are developing?
Thanks!
--Yu

2015-07-02 13:13 GMT+09:00 Reynold Xin <r...@databricks.com>:

> Run ./python/run-tests --help and you will see. :)
>
> On Wed, Jul 1, 2015 at 9:10 PM, Yu Ishikawa <yuu.ishikawa+sp...@gmail.com> wrote:
>
>> Hi all,
>>
>> When I develop pyspark modules, such as adding a spark.ml API in Python, I'd like to run a minimal set of unit tests related to the module I'm developing, again and again. In the previous version, that was easy: I just commented out the unrelated modules in the ./python/run-tests script. So what is the best way to run a minimal, module-scoped set of unit tests under the current version? Of course, I think it would be nice to be able to specify testing targets with the script, the way Scala's sbt does.
>>
>> Thanks,
>> Yu
[pyspark] What is the best way to run a minimal set of unit tests related to the module we are developing?
Hi all,

When I develop pyspark modules, such as adding a spark.ml API in Python, I'd like to run a minimal set of unit tests related to the module I'm developing, again and again. In the previous version, that was easy: I just commented out the unrelated modules in the ./python/run-tests script. So what is the best way to run a minimal, module-scoped set of unit tests under the current version? Of course, I think it would be nice to be able to specify testing targets with the script, the way Scala's sbt does.

Thanks,
Yu

--
Yu Ishikawa
[jenkins] ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
Hi all,

How do I deal with this error on the official Jenkins?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35412/console

```
Archiving unit tests logs...
Send successful.
Attempting to post to Github...
Post successful.
Archiving artifacts
WARN: No artifacts found that match the file pattern **/target/unit-tests.log. Configuration error?
WARN: java.lang.InterruptedException: no matches found within 1
Recording test results
ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
Finished: FAILURE
```

The unit tests related to the PR seem to have passed. However, Jenkins posted "Merged build finished. Test FAILed." to GitHub.

https://github.com/apache/spark/pull/6926

Thanks,
Yu

--
Yu Ishikawa
[pyspark][mllib] What is the best way to treat int and long int between python2.6/python3.4 and Java?
Hi all,

I did a survey on how int and long int types are treated between Python 2.6/Python 3.4 and Java; see below. When we want to return Long value(s) from Java to Python, or vice versa, what is the best way?

https://gist.github.com/yu-iskw/12e92c2d718ca41dea90

Based on that, Joseph and I are tackling [SPARK-6259] Python API for LDA. We wonder whether we should create a wrapper class for LDA documents or not. Do you have any ideas on how to implement it?

https://issues.apache.org/jira/browse/SPARK-6259

Thanks,
Yu

--
Yu Ishikawa
Re: [mllib] Refactoring some spark.mllib model classes in Python that do not inherit JavaModelWrapper
Hi Xiangrui,

I got it. I will try to refactor the model classes that don't inherit JavaModelWrapper and show the result to you.

Thanks,
Yu

--
Yu Ishikawa
Re: Workaround for problems with OS X + JIRA Client
Hi Sean,

That sounds interesting. I didn't know about that client; I will try it later. Thank you for sharing the information.

Yu

--
Yu Ishikawa
[mllib] Refactoring some spark.mllib model classes in Python that do not inherit JavaModelWrapper
Hi all,

I think we should refactor some machine learning model classes in Python to improve maintainability. By inheriting the JavaModelWrapper class, we can easily and directly call the Scala API for a model without going through PythonMLlibAPI. In some cases, a machine learning model class in Python holds complicated member variables, which makes it a little hard to implement import/export methods and a little troublesome to implement the same functionality in both Scala and Python. I also think standardizing how to create a model class in Python is important.

What do you think about this?

Thanks,
Yu

--
Yu Ishikawa
[mllib] Deprecate static train and use builder instead for Scala/Java
Hi all,

Joseph proposed using builder methods instead of static train() methods for Scala/Java. I agree with that idea, because we have many duplicated static train() methods. If you have any thoughts on this, please share them with us.

[SPARK-6682] Deprecate static train and use builder instead for Scala/Java
https://issues.apache.org/jira/browse/SPARK-6682

Thanks,
Yu Ishikawa

--
Yu Ishikawa
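For illustration, here is a sketch of the two styles under discussion, using the existing spark.mllib KMeans API and assuming a spark-shell `sc`:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(9.0, 9.0)))

// Static style: each parameter combination needs its own train() overload.
val model1 = KMeans.train(data, 2, 20)

// Builder style: a single entry point with individually set parameters.
val model2 = new KMeans()
  .setK(2)
  .setMaxIterations(20)
  .run(data)
```

The builder style avoids the combinatorial growth of overloads as new parameters are added, which is the duplication the proposal targets.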
Re: [mllib] Is there a bug when dividing a Breeze sparse vector in Spark v1.3.0-rc3?
Sorry for the delay in replying; I moved from Tokyo to New York to attend Spark Summit East. I verified the snapshot and the difference.

https://github.com/scalanlp/breeze/commit/f61d2f61137807651fc860404a244640e213f6d3

Thank you for your great work!

Yu Ishikawa

--
Yu Ishikawa
[mllib] Is there a bug when dividing a Breeze sparse vector in Spark v1.3.0-rc3?
Hi all,

Is there a bug when dividing a Breeze sparse vector in Spark v1.3.0-rc3? When I tried to divide a sparse vector in Spark v1.3.0-rc3, I got a wrong result if the target vector has any zero values.

Spark v1.3.0-rc3 depends on Breeze v0.11.1, and Breeze v0.11.1 seems to have a bug when dividing a sparse vector by a scalar value: when dividing a Breeze sparse vector which has any zero values, the result seems to be a zero vector. The same code runs fine on Spark v1.2.x, and there is no problem with multiplying a Breeze sparse vector. I asked the Breeze community about this problem in the issue below.

https://github.com/scalanlp/breeze/issues/382

For example:

```
test("dividing a breeze sparse vector") {
  val vec = Vectors.sparse(6, Array(0, 4), Array(0.0, 10.0)).toBreeze
  val n = 60.0
  val answer1 = vec :/ n
  val answer2 = vec.toDenseVector :/ n
  println(vec)
  println(answer1)
  println(answer2)
  assert(answer1.toDenseVector === answer2)
}

SparseVector((0,0.0), (4,10.0))
SparseVector()
DenseVector(0.0, 0.0, 0.0, 0.0, 0.1, 0.0)

DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) did not equal DenseVector(0.0, 0.0, 0.0, 0.0, 0.1, 0.0)
org.scalatest.exceptions.TestFailedException: DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) did not equal DenseVector(0.0, 0.0, 0.0, 0.0, 0.1, 0.0)
```

Thanks,
Yu Ishikawa

--
Yu Ishikawa
Re: [mllib] Is there a bug when dividing a Breeze sparse vector in Spark v1.3.0-rc3?
David Hall, the Breeze creator, told me that it's a bug, so I made a JIRA ticket about this issue. We need to upgrade Breeze from 0.11.1 to 0.11.2 or later in order to fix the bug, once the new version of Breeze is released.

[SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA
https://issues.apache.org/jira/browse/SPARK-6341

Thanks,
Yu Ishikawa

--
Yu Ishikawa
Re: [mllib] Which is the correct package to add a new algorithm?
Hi Joseph,

Thank you for your nice work and for telling us about the draft!

> During the next development cycle, new algorithms should be contributed to spark.mllib. Optionally, wrappers for new (and old) algorithms can be contributed to spark.ml.

I understand that we should contribute new algorithms to spark.mllib.

Thanks,
Yu

--
Yu Ishikawa
[mllib] Which is the correct package to add a new algorithm?
Hi all,

An alpha version of Spark ML exists in the current master branch on GitHub. If we want to add new machine learning algorithms, or modify algorithms that already exist, in which package should we implement them: org.apache.spark.mllib or org.apache.spark.ml?

Thanks,
Yu

--
Yu Ishikawa
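For reference, the two namespaces in question, with representative classes (illustrative, not exhaustive):

```scala
// RDD-based API: concrete algorithm implementations live here.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.regression.LabeledPoint

// Pipeline-based API (alpha at the time of writing): Estimator/Transformer
// wrappers around algorithms live here.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
```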
Re: [VOTE] Designating maintainers for some Spark components
+1 (binding)

On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia [hidden email] wrote:

> BTW, my own vote is obviously +1 (binding).
>
> Matei
>
> On Nov 5, 2014, at 5:31 PM, Matei Zaharia [hidden email] wrote:
>
>> Hi all,
>>
>> I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects.
>>
>> As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more.
>>
>> In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer.
>>
>> IMO, adopting this model would have two benefits:
>>
>> 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way.
>>
>> 2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks.
>>
>> We'd like to start in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows:
>>
>> - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component.
>> - Each component with maintainers will have at least 2 maintainers.
>> - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus.
>> - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer.
>>
>> If you'd like to see examples for this model, check out the following projects:
>>
>> - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
>> - Subversion: https://subversion.apache.org/docs/community-guide/roles.html
>>
>> Finally, I wanted to list our current proposal for initial components and maintainers. It would be good to get feedback on other components we might add, but please note that personnel discussions (e.g. I don't think Matei should maintain *that* component) should only happen on the private list. The initial components were chosen to include all public APIs and the main core components, and the maintainers were chosen from the most active contributors to those modules.
>>
>> - Spark core public API: Matei, Patrick, Reynold
>> - Job scheduler: Matei, Kay, Patrick
>> - Shuffle and network: Reynold, Aaron, Matei
>> - Block manager: Reynold, Aaron
>> - YARN: Tom, Andrew Or
>> - Python: Josh, Matei
>> - MLlib: Xiangrui, Matei
>> - SQL: Michael, Reynold
>> - Streaming: TD, Matei
>> - GraphX: Ankur, Joey, Reynold
>>
>> I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on Nov 8, 2014 at 6 PM PST.
>>
>> Matei

--
Yu Ishikawa
Re: JIRA + PR backlog
Great job! I didn't know about the Spark PR Dashboard.

Thanks,
Yu Ishikawa

--
Yu Ishikawa
[mllib] Sharing a simple benchmark result on the cost of casting Spark vectors to Breeze vectors
Hi all,

I wondered whether the cost of casting a Spark Vector to a Breeze vector is high or low, so I benchmarked simple addition, multiplication, and division operations over RDD[Vector] and RDD[BV[Double]]. I'd like to share the result with you. In conclusion, the cast cost was lower than I had expected. For more information, please read the report below if you are interested.

https://github.com/yu-iskw/benchmark-breeze-on-spark/blob/master/doc%2Fbenchmark-result.md

Best,
Yu Ishikawa

--
Yu Ishikawa
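For context, a minimal sketch of the conversion whose cost the benchmark measures. Note that `toBreeze` and `Vectors.fromBreeze` are package-private (private[spark]/private[mllib]), so this compiles only inside Spark's own source tree:

```scala
import breeze.linalg.{Vector => BV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val sv: Vector = Vectors.dense(1.0, 2.0, 3.0)
val bv: BV[Double] = sv.toBreeze            // Spark -> Breeze cast
val back: Vector = Vectors.fromBreeze(bv)   // Breeze -> Spark cast
```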
Standardized Distance Functions in MLlib
Hi all,

In my limited understanding of MLlib, it would be a good idea to support various distance functions in some machine learning algorithms. For example, we can only use the Euclidean distance metric in KMeans. I am working on contributing hierarchical clustering to MLlib (https://issues.apache.org/jira/browse/SPARK-2429), and I would like to support various distance functions in it. Should we support standardized distance functions in MLlib or not?

As you know, Spark depends on Breeze, so I think we have two approaches to using distance functions in MLlib: one is implementing some distance functions in MLlib itself; the other is wrapping the functions in Breeze. I am a bit worried about using Breeze directly in Spark. For example, we can't control Breeze's releases at all. I sent a PR before, but it has stalled. I'd like to get your thoughts on it, community.

https://github.com/apache/spark/pull/1964#issuecomment-54953348

Best,

--
Yu Ishikawa
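To make the first approach concrete, here is an illustrative sketch of what a pluggable distance function could look like; the trait and object names are hypothetical, not an existing MLlib API:

```scala
import org.apache.spark.mllib.linalg.Vector

// Hypothetical interface for a metric that a clustering algorithm could
// accept as a parameter instead of hard-coding Euclidean distance.
trait DistanceMeasure extends Serializable {
  def apply(v1: Vector, v2: Vector): Double
}

object EuclideanDistance extends DistanceMeasure {
  override def apply(v1: Vector, v2: Vector): Double =
    math.sqrt(v1.toArray.zip(v2.toArray).map { case (a, b) => (a - b) * (a - b) }.sum)
}

object ManhattanDistance extends DistanceMeasure {
  override def apply(v1: Vector, v2: Vector): Double =
    v1.toArray.zip(v2.toArray).map { case (a, b) => math.abs(a - b) }.sum
}
```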
Re: What is the best way to build my development version of Spark for testing on EC2?
Hi Evan,

Sorry for my late reply, and thank you for your comment.

> As far as cluster set up goes, I usually launch spot instances with the spark-ec2 scripts, and then check out a repo which contains a simple driver application for my code. Then I have something crude like bash scripts running my program and collecting output.

It's just as I thought. I agree with you.

> You could have a look at the spark-perf repo if you want something a little better principled/automatic.

I had overlooked this. I will give it a try.

Best,

--
Yu Ishikawa
What is the best way to build my development version of Spark for testing on EC2?
Hi all,

I am trying to contribute some machine learning algorithms to MLlib, and I must evaluate their performance on a cluster while changing the input data size, the number of CPU cores, and their parameters. I would like to build my development version of Spark on EC2 automatically. Is there already a build script for a development version, like the spark-ec2 script? Or, if you have any good ideas for evaluating the performance of a developing MLlib algorithm on a Spark cluster such as EC2, could you tell me?

Best,

--
Yu Ishikawa
Re: MLlib enable extension of the LabeledPoint class
Hi Niklas Wilcke,

As you said, it is difficult to extend the LabeledPoint class in mllib.regression. Do you want to extend the LabeledPoint class in order to use a label type other than Double? If your code is on GitHub, could you show it to us? I want to know what you want to do.

Community: by the way, I think a LabeledPoint class would be useful outside the mllib.regression package too. In particular, some estimation algorithms should be able to use a label type other than Double, such as String. A common generic labeled-point class would be useful in MLlib, and I'd like to get your thoughts on it. For example:

```
abstract class LabeledPoint[T](label: T, features: Vector)
```

thanks

--
Yu Ishikawa
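A small illustrative sketch of how the generic class proposed above might be used; the class name is hypothetical:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical generic labeled point, following the proposal above.
case class GenericLabeledPoint[T](label: T, features: Vector)

// A String-labeled example, which mllib.regression.LabeledPoint
// (whose label type is fixed to Double) cannot express.
val doc = GenericLabeledPoint("spam", Vectors.dense(0.0, 1.0, 3.0))
```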
Re: MLlib enable extension of the LabeledPoint class
Hi Egor Pahomov,

Thank you for your comment!

--
Yu Ishikawa
Re: [mllib] Add multiplication of large-scale matrices
Hi Xiangrui Meng,

Thank you for your comment and for creating the tickets. The ticket I created will be superseded by yours; I will close mine and then link it to yours later.

Best,
Yu Ishikawa
Re: [mllib] Add multiplication of large-scale matrices
Hi Jeremy,

Great work! I'm interested in it. If your code is on GitHub, could you let me know?

--
Yu Ishikawa
Re: [mllib] Add multiplication of large-scale matrices
Hi Rong,

Great job! Thank you for letting me know about your work. I will read the source code of saury later. Although AMPLab is working on implementing them, would you like to merge it into Spark?

Best,

--
Yu Ishikawa
[mllib] Add multiplication of large-scale matrices
Hi all,

There is a method to multiply a RowMatrix by a (local) Matrix. However, there is no method in Spark to multiply one large-scale matrix by another, and it would be helpful. Does anyone have a plan to add multiplication of large-scale matrices? Or shouldn't we support it in Spark?

thanks,
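For reference, a sketch of the existing method mentioned above, which multiplies a distributed RowMatrix by a local in-memory Matrix (assuming a spark-shell `sc`); the missing piece is distributed-times-distributed multiplication:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)))
val distributed = new RowMatrix(rows)

// A 2x2 local matrix (column-major values); small enough to ship to every task.
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))

// The result is again a distributed RowMatrix.
val product: RowMatrix = distributed.multiply(local)
```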
Re: [mllib] Add multiplication of large-scale matrices
Hi RJ,

Thank you for your comment. I am interested in having other matrix operations too. I will create a JIRA issue first.

thanks,
Re: [mllib] Add multiplication of large-scale matrices
Hi Evan,

That sounds interesting. Here is the ticket I created.

https://issues.apache.org/jira/browse/SPARK-3416

thanks,
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Hi all,

I am also interested in specifying a common framework. I am trying to implement hierarchical k-means and a hierarchical clustering like the single-link method with LSH.

https://issues.apache.org/jira/browse/SPARK-2966

If you have designed a standardized clustering algorithms API, please let me know.

best,
Yu Ishikawa
Re: Can I translate the documentation of Spark into Japanese?
Hi Kenichi Takagiwa,

Thank you for commenting. I am going to proceed with the translation; would you please help me? Further details will follow later.

Best,
Yu
Re: Can I translate the documentation of Spark into Japanese?
Hi Nick,

> I know some projects get translations crowdsourced via one website or other.

Thank you for your comments. I think crowdsourced translation is a good fit for a translation project on GitHub.

Best,
Yu
Can I translate the documentation of Spark into Japanese?
Hi all,

I'm Yu Ishikawa, from Japan. I would like to translate the documentation of Spark 1.0.x officially. If I translate it and send a pull request, can you merge it? And where is the best directory in which to create the Japanese documentation?

Best,
Yu