Is there any way to select columns of a Dataset in addition to the combination of `expr` and `as`?

2015-12-18 Thread Yu Ishikawa
Hi all, 

I have two questions about selecting columns of a Dataset.
First, could you tell me if there is any way to select TypedColumn columns
in addition to the combination of `expr` and `as`?
Second, how can we alias a column like `expr("name").as[String]`?

I tried to select a column of a Dataset the way I would with a DataFrame,
but I couldn't.

```
case class Person(id: Int, name: String)
val df = sc.parallelize(Seq((1, "Bob"), (2, "Tom"))).toDF("id", "name")
val ds = df.as[Person]

ds.select(expr("name").as[String]).show
+-----+
|value|
+-----+
|  Bob|
|  Tom|
+-----+

ds.select('id).show
<console>:34: error: type mismatch;
 found   : Symbol
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select('id).show

ds.select($"id").show
<console>:34: error: type mismatch;
 found   : org.apache.spark.sql.ColumnName
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select($"id").show

ds.select(ds("id")).show
<console>:34: error: org.apache.spark.sql.Dataset[Person] does not take parameters
       ds.select(ds("id")).show

ds.select("id").show
<console>:34: error: type mismatch;
 found   : String("id")
 required: org.apache.spark.sql.TypedColumn[Person,?]
       ds.select("id").show
```
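
For reference, the only pattern I have found that compiles is the `expr(...).as[...]` combination, including its multi-column overloads. Here is a minimal sketch against the Spark 1.6 Dataset API, reusing the `ds` defined above:

```
// Sketch: the expr(...).as[...] pattern extended to several columns.
// Dataset.select has typed overloads, so this yields a Dataset[(Int, String)].
import org.apache.spark.sql.functions.expr

val pairs = ds.select(expr("id").as[Int], expr("name").as[String])
pairs.show()
```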

Best,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Is-there-any-way-to-select-columns-of-Dataset-in-addition-to-the-combination-of-expr-and-as-tp15713.html



How do we convert a Dataset that includes timestamp columns to an RDD?

2015-12-16 Thread Yu Ishikawa
```
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1064)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.sql.catalyst.util.DateTimeUtils$
Serialization stack:
  - object not serializable (class: org.apache.spark.sql.catalyst.util.DateTimeUtils$, value: org.apache.spark.sql.catalyst.util.DateTimeUtils$@216c782f)
  - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, name: staticObject, type: class java.lang.Object)
  - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, staticinvoke(org.apache.spark.sql.catalyst.util.DateTimeUtils$@216c782f,ObjectType(class java.sql.Timestamp),toJavaTimestamp,input[0, TimestampType],true))
  - writeObject data (class: scala.collection.immutable.$colon$colon)
  - object (class scala.collection.immutable.$colon$colon, List(staticinvoke(org.apache.spark.sql.catalyst.util.DateTimeUtils$@216c782f,ObjectType(class java.sql.Timestamp),toJavaTimestamp,input[0, TimestampType],true)))
  - field (class: org.apache.spark.sql.catalyst.expressions.NewInstance, name: arguments, type: interface scala.collection.Seq)
  - object (class org.apache.spark.sql.catalyst.expressions.NewInstance, newinstance(class $iwC$$iwC$TimestampExample,staticinvoke(org.apache.spark.sql.catalyst.util.DateTimeUtils$@216c782f,ObjectType(class java.sql.Timestamp),toJavaTimestamp,input[0, TimestampType],true),false,ObjectType(class $iwC$$iwC$TimestampExample),Some($iwC$$iwC@23b27380)))
  - field (class: org.apache.spark.sql.catalyst.encoders.ExpressionEncoder, name: fromRowExpression, type: class org.apache.spark.sql.catalyst.expressions.Expression)
  - object (class org.apache.spark.sql.catalyst.encoders.ExpressionEncoder, class[dt[0]: timestamp])
  - field (class: org.apache.spark.sql.Dataset, name: boundTEncoder, type: class org.apache.spark.sql.catalyst.encoders.ExpressionEncoder)
  - object (class org.apache.spark.sql.Dataset, [dt: timestamp])
  - field (class: org.apache.spark.sql.Dataset$$anonfun$rdd$1, name: $outer, type: class org.apache.spark.sql.Dataset)
  - object (class org.apache.spark.sql.Dataset$$anonfun$rdd$1, )
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
  ... 68 more
```
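
For context, the trace above is thrown when calling `.rdd` on a Dataset with a timestamp column. Below is a minimal sketch that reproduces it, reconstructed from the stack trace (which mentions a `TimestampExample` class with a `dt: timestamp` field) rather than taken from the original code; it assumes a Spark 1.6-era spark-shell with `sqlContext.implicits._` imported:

```
// Reconstructed repro sketch (names taken from the trace, not the original code).
case class TimestampExample(dt: java.sql.Timestamp)

val ds = Seq(TimestampExample(new java.sql.Timestamp(0L))).toDS()

// Converting the Dataset to an RDD pulls the encoder's fromRowExpression,
// which references the non-serializable DateTimeUtils$ object, into a closure:
val rdd = ds.rdd  // throws java.io.NotSerializableException: ...DateTimeUtils$
```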

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-do-we-convert-a-Dataset-includes-timestamp-columns-to-RDD-tp15682.html



Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Yu Ishikawa
Great work, everyone!



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Announcing-Spark-1-5-0-tp14013p14015.html



[SparkR] lint script for SparkR

2015-09-01 Thread Yu Ishikawa
Hi all,

Shivaram and I added a lint script for SparkR, `dev/lint-r`, and it is
already running on Jenkins. If there are any validation problems in your
patch, the Jenkins build will fail.
Please make sure that your patch doesn't have any validation problems on
your local machine before sending a PR:
https://github.com/apache/spark/blob/master/dev/lint-r

We could also discuss the validation rules; I think there is still room
for improvement.
If you have any ideas, please join the discussion about the SparkR style guide:
https://issues.apache.org/jira/browse/SPARK-6813

Thanks Shivaram and Josh, I couldn't have done it without you.

Thanks
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-lint-script-for-SpakrR-tp13923.html



Is `dev/lint-python` broken?

2015-07-27 Thread Yu Ishikawa
Hi all,

When I run `dev/lint-python` on the latest master branch, I get the error
message below.
Is the lint script broken, or is there a problem with my environment?

```
$ ./dev/lint-python
./dev/lint-python: line 64: syntax error near unexpected token `&>>'
./dev/lint-python: line 64: `easy_install -d $PYLINT_HOME pylint==1.4.4 &>> $PYLINT_INSTALL_INFO'
```

If the redirect is the cause of the syntax error, I'll send a PR to fix it.

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Is-dev-lint-python-broken-tp13439.html



Re: Is `dev/lint-python` broken?

2015-07-27 Thread Yu Ishikawa
Hi Sean,

Thank you for answering my question.
It seems that I was using an old version of bash, which is the default bash on Mac.

```
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.
```

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Is-dev-lint-python-broken-tp13439p13441.html



Re: Is `dev/lint-python` broken?

2015-07-27 Thread Yu Ishikawa
I'm using OS X 10.10.4, and Xcode is version 6.4. Maybe it isn't old.
I guess the old bash version causes the problem. I'll try to install another
bash with brew.




-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Is-dev-lint-python-broken-tp13439p13443.html



Re: What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?

2015-07-22 Thread Yu Ishikawa
Hi Andrew,

I understand that there is no difference currently.

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-the-difference-between-SlowSparkPullRequestBuilder-and-SparkPullRequestBuilder-tp13377p13380.html



What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?

2015-07-21 Thread Yu Ishikawa
Hi all, 

When we send a PR, it sometimes seems that two requests to run tests are
submitted to Jenkins.
What is the difference between SparkPullRequestBuilder and
SlowSparkPullRequestBuilder?

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-the-difference-between-SlowSparkPullRequestBuilder-and-SparkPullRequestBuilder-tp13377.html



Re: [pyspark] What is the best way to run minimal unit tests related to the module we are developing?

2015-07-01 Thread Yu Ishikawa
Thanks! --Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/pyspark-What-is-the-best-way-to-run-a-minimum-unit-testing-related-to-our-developing-module-tp12987p12989.html



Re: [pyspark] What is the best way to run minimal unit tests related to the module we are developing?

2015-07-01 Thread Yu ISHIKAWA
Thanks!  --Yu

2015-07-02 13:13 GMT+09:00 Reynold Xin r...@databricks.com:

 Run

 ./python/run-tests --help

 and you will see. :)

 On Wed, Jul 1, 2015 at 9:10 PM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com
 wrote:

 Hi all,

 When I develop pyspark modules, such as adding a spark.ml API in Python,
 I'd
 like to run a minimum unit testing related to the developing module again
 and again.
 In the previous version, that was easy with commenting out unrelated
 modules
 in the ./python/run-tests script. So what is the best way to run a minimum
 unit testing related to our developing modules under the current version?
 Of course, I think it would be nice to be able to identify testing targets
 with the script like scala's sbt.

 Thanks,
 Yu



 -
 -- Yu Ishikawa
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/pyspark-What-is-the-best-way-to-run-a-minimum-unit-testing-related-to-our-developing-module-tp12987.html





[pyspark] What is the best way to run minimal unit tests related to the module we are developing?

2015-07-01 Thread Yu Ishikawa
Hi all,

When I develop pyspark modules, such as adding a spark.ml API in Python, I'd
like to run a minimal set of unit tests related to the module I'm developing,
again and again.
In previous versions, that was easy: I could comment out unrelated modules
in the ./python/run-tests script. So what is the best way to run minimal
unit tests related to the modules we are developing in the current version?
Of course, I think it would be nice to be able to specify test targets
with the script, like Scala's sbt.

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/pyspark-What-is-the-best-way-to-run-a-minimum-unit-testing-related-to-our-developing-module-tp12987.html



[jenkins] ERROR: Publisher 'Publish JUnit test result report' failed: No test report files were found. Configuration error?

2015-06-21 Thread Yu Ishikawa
Hi all,

How do I deal with the error on the official Jenkins?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35412/console

```
Archiving unit tests logs...
 Send successful.
Attempting to post to Github...
  Post successful.
Archiving artifacts
WARN: No artifacts found that match the file pattern
**/target/unit-tests.log. Configuration error?
WARN: java.lang.InterruptedException: no matches found within 1
Recording test results
ERROR: Publisher 'Publish JUnit test result report' failed: No test report
files were found. Configuration error?
Finished: FAILURE
```

It seems that the unit tests related to the PR passed. However, Jenkins
posted "Merged build finished. Test FAILed." to GitHub:
https://github.com/apache/spark/pull/6926

Thanks
Yu




-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/jenkins-ERROR-Publisher-Publish-JUnit-test-result-report-failed-No-test-report-files-were-found-Conf-tp12823.html



[pyspark][mllib] What is the best way to treat int and long int between python2.6/python3.4 and Java?

2015-06-20 Thread Yu Ishikawa
Hi all,

I have done a survey on how int and long int types are treated between
Python 2.6/3.4 and Java; see below. When we want to return Long value(s)
from Java to Python, or vice versa, what is the best way?
https://gist.github.com/yu-iskw/12e92c2d718ca41dea90

Based on that, Joseph and I are tackling [SPARK-6259] Python API for LDA.
We wonder whether or not we should create a wrapper class for LDA's document
type. Do you have any ideas on how to implement it?
https://issues.apache.org/jira/browse/SPARK-6259

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/pyspark-mllib-What-is-the-best-way-to-treat-int-and-long-int-between-python2-6-python3-4-and-Java-tp12811.html



Re: [mllib] Refactoring some spark.mllib model classes in Python that do not inherit JavaModelWrapper

2015-06-19 Thread Yu Ishikawa
Hi Xiangrui,

I got it. I will try to refactor the model classes that do not inherit
JavaModelWrapper and show the result to you.

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Refactoring-some-spark-mllib-model-classes-in-Python-not-inheriting-JavaModelWrapper-tp12781p12803.html



Re: Workaround for problems with OS X + JIRA Client

2015-06-19 Thread Yu Ishikawa
Hi Sean,

That sounds interesting. I didn't know about that client. I will try it later.
Thank you for sharing the information.

Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Workaround-for-problems-with-OS-X-JIRA-Client-tp12799p12804.html



[mllib] Refactoring some spark.mllib model classes in Python that do not inherit JavaModelWrapper

2015-06-17 Thread Yu Ishikawa
Hi all,

I think we should refactor some machine learning model classes in Python to
improve maintainability.
By inheriting the JavaModelWrapper class, we can easily and directly call
the Scala API for a model without going through PythonMLlibAPI.

In some cases, a machine learning model class in Python has complicated
member variables. That makes it a little hard to implement import/export
methods, and it is also troublesome to implement the same functionality in
both Scala and Python. I also think standardizing how to create a model
class in Python is important.

What do you think about it?

Thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Refactoring-some-spark-mllib-model-classes-in-Python-not-inheriting-JavaModelWrapper-tp12781.html



[mllib] Deprecate static train and use builder instead for Scala/Java

2015-04-06 Thread Yu Ishikawa
Hi all, 

Joseph proposed using just builder methods, instead of static train()
methods, for Scala/Java. I agree with that idea, because we have many
duplicated static train() methods. If you have any thoughts on it, please
share them with us.

[SPARK-6682] Deprecate static train and use builder instead for Scala/Java
https://issues.apache.org/jira/browse/SPARK-6682
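
To illustrate the duplication, here is a sketch of the two styles using spark.mllib's KMeans; both compile today, and the proposal would deprecate the static form:

```
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val data: RDD[Vector] = ???  // placeholder training data

// Static train() style: every parameter combination needs its own overload.
val model1 = KMeans.train(data, 3, 20)

// Builder style: one entry point, parameters set via setters.
val model2 = new KMeans().setK(3).setMaxIterations(20).run(data)
```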

Thanks
Yu Ishikawa




-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Deprecate-static-train-and-use-builder-instead-for-Scala-Java-tp11438.html



Re: [mllib] Are there any bugs in dividing Breeze sparse vectors in Spark v1.3.0-rc3?

2015-03-18 Thread Yu Ishikawa
Sorry for the delay in replying; I moved from Tokyo to New York in order to
attend Spark Summit East.
I verified the snapshot and the fix:
https://github.com/scalanlp/breeze/commit/f61d2f61137807651fc860404a244640e213f6d3

Thank you for your great work!
Yu Ishikawa



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056p11107.html



[mllib] Are there any bugs in dividing Breeze sparse vectors in Spark v1.3.0-rc3?

2015-03-15 Thread Yu Ishikawa
Hi all,

Are there any bugs in dividing a Breeze sparse vector in Spark v1.3.0-rc3?
When I tried to divide a sparse vector in Spark v1.3.0-rc3, I got a wrong
result if the target vector has any zero values.

Spark v1.3.0-rc3 depends on Breeze v0.11.1, and Breeze v0.11.1 seems to have
a bug in dividing a sparse vector by a scalar value. When dividing a Breeze
sparse vector which has any zero values, the result seems to be a zero
vector. However, the same code works on Spark v1.2.x.

In contrast, there is no problem multiplying a Breeze sparse vector. I asked
the Breeze community about this problem in the issue below:
https://github.com/scalanlp/breeze/issues/382

For example,
```
test("dividing a breeze sparse vector") {
  val vec = Vectors.sparse(6, Array(0, 4), Array(0.0, 10.0)).toBreeze
  val n = 60.0
  val answer1 = vec :/ n
  val answer2 = vec.toDenseVector :/ n
  println(vec)
  println(answer1)
  println(answer2)
  assert(answer1.toDenseVector === answer2)
}

SparseVector((0,0.0), (4,10.0))
SparseVector()
DenseVector(0.0, 0.0, 0.0, 0.0, 0.16666666666666666, 0.0)

DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) did not equal DenseVector(0.0, 0.0, 0.0, 0.0, 0.16666666666666666, 0.0)
org.scalatest.exceptions.TestFailedException: DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) did not equal DenseVector(0.0, 0.0, 0.0, 0.0, 0.16666666666666666, 0.0)
```
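
Until a fixed Breeze is released, here are two possible workarounds; this is only a sketch based on the observation above that multiplication is unaffected:

```
// Workaround sketches for the snippet above:
val viaDense = vec.toDenseVector :/ n  // densify first, then divide
val viaMul   = vec :* (1.0 / n)        // multiply by the reciprocal instead
```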

Thanks,
Yu Ishikawa



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056.html



Re: [mllib] Are there any bugs in dividing Breeze sparse vectors in Spark v1.3.0-rc3?

2015-03-15 Thread Yu Ishikawa
David Hall, the creator of Breeze, told me that it's a bug, so I filed a
JIRA ticket about this issue. We need to upgrade Breeze from 0.11.1 to
0.11.2 or later in order to fix the bug, once the new version of Breeze is
released.

[SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA
https://issues.apache.org/jira/browse/SPARK-6341

Thanks,
Yu Ishikawa



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056p11058.html



Re: [mllib] Which is the correct package to add a new algorithm?

2014-11-30 Thread Yu Ishikawa
Hi Joseph, 

Thank you for your nice work and for telling us about the draft!

 During the next development cycle, new algorithms should be contributed to 
 spark.mllib.  Optionally, wrappers for new (and old) algorithms can be 
 contributed to spark.ml. 

I understand that we should contribute new algorithms to spark.mllib.
thanks, 
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Which-is-the-correct-package-to-add-a-new-algorithm-tp9540p9575.html



[mllib] Which is the correct package to add a new algorithm?

2014-11-27 Thread Yu Ishikawa
Hi all, 

An alpha version of Spark ML exists in the current master branch on GitHub.
If we want to add new machine learning algorithms, or to modify algorithms
which already exist, in which package should we implement them:
org.apache.spark.mllib or org.apache.spark.ml?

thanks,
Yu



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Which-is-the-correct-package-to-add-a-new-algorithm-tp9540.html



Re: [VOTE] Designating maintainers for some Spark components

2014-11-11 Thread Yu Ishikawa
+1 (binding) 

On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia [hidden email] wrote:

 BTW, my own vote is obviously +1 (binding).

 Matei

  On Nov 5, 2014, at 5:31 PM, Matei Zaharia [hidden email] wrote:

  Hi all,

  I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects.

  As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more.

  In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer.

  IMO, adopting this model would have two benefits:

  1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way.

  2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks.

  We'd like to start with this in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows:

  - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component.
  - Each component with maintainers will have at least 2 maintainers.
  - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus.
  - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer.

  If you'd like to see examples for this model, check out the following projects:
  - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
  - Subversion: https://subversion.apache.org/docs/community-guide/roles.html

  Finally, I wanted to list our current proposal for initial components and maintainers. It would be good to get feedback on other components we might add, but please note that personnel discussions (e.g. I don't think Matei should maintain *that* component) should only happen on the private list. The initial components were chosen to include all public APIs and the main core components, and the maintainers were chosen from the most active contributors to those modules.

  - Spark core public API: Matei, Patrick, Reynold
  - Job scheduler: Matei, Kay, Patrick
  - Shuffle and network: Reynold, Aaron, Matei
  - Block manager: Reynold, Aaron
  - YARN: Tom, Andrew Or
  - Python: Josh, Matei
  - MLlib: Xiangrui, Matei
  - SQL: Michael, Reynold
  - Streaming: TD, Matei
  - GraphX: Ankur, Joey, Reynold

  I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on Nov 8, 2014 at 6 PM PST.

  Matei
 



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Designating-maintainers-for-some-Spark-components-tp9115p9281.html

Re: JIRA + PR backlog

2014-11-11 Thread Yu Ishikawa
Great job!
I didn't know about the Spark PR Dashboard.

Thanks
Yu Ishikawa



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/JIRA-PR-backlog-tp9157p9282.html



[mllib] Sharing a simple benchmark result on the cost of casting Spark vectors to Breeze vectors

2014-10-15 Thread Yu Ishikawa
Hi all,

I wondered whether the cost of casting Spark vectors to Breeze vectors is
high or low, so I benchmarked simple operations (addition, multiplication
and division) on RDD[Vector] and RDD[BV[Double]]. I'd like to share the
simple benchmark result with you.

In conclusion, the cast cost was lower than I had expected.
For more information, please read the report below if you are interested:
https://github.com/yu-iskw/benchmark-breeze-on-spark/blob/master/doc%2Fbenchmark-result.md
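
For reference, here is a sketch of the kind of cast the benchmark measures. Note that `toBreeze` and `fromBreeze` are `private[mllib]`, so code like this has to live inside the org.apache.spark.mllib package; the values are illustrative:

```
import breeze.linalg.{Vector => BV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val v: Vector = Vectors.dense(1.0, 2.0, 3.0)
val bv: BV[Double] = v.toBreeze           // Spark -> Breeze cast being measured
val back: Vector = Vectors.fromBreeze(bv) // and the Breeze -> Spark direction
```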

Best,
Yu Ishikawa



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Share-the-simple-benchmark-result-about-the-cast-cost-from-Spark-vector-to-Breeze-vector-tp8793.html



Standardized Distance Functions in MLlib

2014-10-08 Thread Yu Ishikawa
Hi all, 

In my limited understanding of MLlib, it would be a good idea to support
various distance functions in some machine learning algorithms. For example,
we can only use the Euclidean distance metric in KMeans. I am also working
on contributing hierarchical clustering to MLlib
(https://issues.apache.org/jira/browse/SPARK-2429), and I would like to
support various distance functions in it.

Should we support standardized distance functions in MLlib or not?
As you know, Spark depends on Breeze, so I think we have two approaches to
using distance functions in MLlib. One is implementing some distance
functions in MLlib itself; the other is wrapping the functions of Breeze.
I am a bit worried about using Breeze directly in Spark: for example, we
can't control the release schedule of Breeze at all.

I sent a PR before, but it has stalled. I'd like to get your thoughts on it,
community:
https://github.com/apache/spark/pull/1964#issuecomment-54953348
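
To make the idea concrete, here is a sketch of what a pluggable distance function could look like; this trait is illustrative only and does not exist in MLlib today:

```
import org.apache.spark.mllib.linalg.Vector

// A minimal pluggable distance-function abstraction (hypothetical).
trait DistanceMeasure extends Serializable {
  def apply(v1: Vector, v2: Vector): Double
}

object EuclideanDistance extends DistanceMeasure {
  override def apply(v1: Vector, v2: Vector): Double = {
    val (a, b) = (v1.toArray, v2.toArray)
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }
}
```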

Best,



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Standardized-Distance-Functions-in-MLlib-tp8697.html



Re: What is the best way to build my development version of Spark for testing on EC2?

2014-10-06 Thread Yu Ishikawa
Hi Evan,

Sorry for the late reply, and thank you for your comment.

 As far as cluster set up goes, I usually launch spot instances with the
 spark-ec2 scripts, 
 and then check out a repo which contains a simple driver application for
 my code. 
 Then I have something crude like bash scripts running my program and
 collecting output. 

It's just as you thought, and I agree with you.

 You could have a look at the spark-perf repo if you want something a
 little better principled/automatic. 

I had overlooked this. I will give it a try.

best,



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-the-best-way-to-build-my-developing-Spark-for-testing-on-EC2-tp8638p8677.html



What is the best way to build my development version of Spark for testing on EC2?

2014-10-02 Thread Yu Ishikawa
Hi all, 

I am trying to contribute some machine learning algorithms to MLlib.
I need to evaluate their performance on a cluster, changing the input data
size, the number of CPU cores, and their parameters.

I would like to build my development version of Spark on EC2 automatically.
Is there already a build script for a development version, like the
spark-ec2 script?
Or, if you have any good ideas for evaluating the performance of a
developing MLlib algorithm on a Spark cluster like EC2, could you tell me?

Best,



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-the-best-way-to-build-my-developing-Spark-for-testing-on-EC2-tp8638.html



Re: MLlib enable extension of the LabeledPoint class

2014-09-25 Thread Yu Ishikawa
Hi Niklas Wilcke,

As you said, it is difficult to extend the LabeledPoint class in
mllib.regression.
Do you want to extend the LabeledPoint class in order to use a type other
than Double?
If your code is on GitHub, could you show it to us? I'd like to know what
you want to do.

To the community:
By the way, I think the LabeledPoint class is very useful outside the
mllib.regression package.
In particular, some estimation algorithms should be able to use a label
type other than Double, such as String. A common generic labeled-point
class would be useful in MLlib.
I'd like to get your thoughts on it.

For example,
```
abstract class LabeledPoint[T](label: T, features: Vector)
```
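
With such a class, a String-labeled variant becomes straightforward. A hypothetical sketch, assuming the generic `LabeledPoint[T]` above:

```
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical: relies on the generic LabeledPoint[T] sketched above.
case class StringLabeledPoint(label: String, features: Vector)
  extends LabeledPoint[String](label, features)

val p = StringLabeledPoint("spam", Vectors.dense(0.1, 0.9))
```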

thanks






-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-enable-extension-of-the-LabeledPoint-class-tp8546p8549.html



Re: MLlib enable extension of the LabeledPoint class

2014-09-25 Thread Yu Ishikawa
Hi Egor Pahomov, 

Thank you for your comment!



-
-- Yu Ishikawa
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-enable-extension-of-the-LabeledPoint-class-tp8546p8551.html



Re: [mllib] Add multiplication of large-scale matrices

2014-09-08 Thread Yu Ishikawa
Hi Xiangrui Meng,

Thank you for your comment and for creating the tickets.

The ticket I created will be folded into yours. I will close my ticket and
then link it to yours later.

Best,
Yu Ishikawa



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291p8333.html



Re: [mllib] Add multiplication of large-scale matrices

2014-09-06 Thread Yu Ishikawa
Hi Jeremy,

Great work!

I'm interested in your work. If your code is on GitHub, could you let me
know?

-- Yu Ishikawa



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291p8309.html



Re: [mllib] Add multiplication of large-scale matrices

2014-09-06 Thread Yu Ishikawa
Hi Rong, 

Great job! Thank you for letting me know about your work.
I will read the source code of saury later.

Since AMPLab is also working on implementing this, would you like to merge
your work into Spark?

Best,

-- Yu Ishikawa




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291p8310.html



[mllib] Add multiplication of large-scale matrices

2014-09-05 Thread Yu Ishikawa
Hi all, 

It seems that there is a method to multiply a RowMatrix by a (local) Matrix.
However, there is no method to multiply one large-scale matrix by another
in Spark, and it would be helpful.
Does anyone have a plan to add multiplication of large-scale matrices?
Or shouldn't we support it in Spark?
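
For reference, here is a sketch of the distributed-by-local multiplication that does exist today; distributed-by-distributed is the missing piece this thread asks about:

```
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ???  // placeholder: a large row-distributed matrix with 2 columns

// Multiply the distributed RowMatrix by a small local 2x2 identity matrix.
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val product: RowMatrix = mat.multiply(local)
```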

thanks,



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291.html



Re: [mllib] Add multiplication of large-scale matrices

2014-09-05 Thread Yu Ishikawa
Hi RJ,

Thank you for your comment. I am interested in having other matrix
operations too.
As a first step, I will create a JIRA issue.

thanks,



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291p8293.html



Re: [mllib] Add multiplication of large-scale matrices

2014-09-05 Thread Yu Ishikawa
Hi Evan, 

That sounds interesting.

Here is the ticket which I created.
https://issues.apache.org/jira/browse/SPARK-3416

thanks,



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291p8296.html



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-13 Thread Yu Ishikawa
Hi all,

I am also interested in specifying a common framework.
I am trying to implement a hierarchical k-means and a hierarchical
clustering algorithm (a single-link-like method) with LSH:
https://issues.apache.org/jira/browse/SPARK-2966

If you have designed a standardized clustering-algorithm API, please let
me know.


best,
Yu Ishikawa



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7822.html



Re: Can I translate the documentation of Spark into Japanese?

2014-07-31 Thread Yu Ishikawa
Hi Kenichi Takagiwa,

Thank you for commenting.
I am going to proceed with the translation; would you please help me?
I will send further details later.

Best,

Yu



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Can-I-translate-the-documentations-of-Spark-in-Japanese-tp7538p7614.html


Re: Can I translate the documentation of Spark into Japanese?

2014-07-31 Thread Yu Ishikawa
Hi Nick,

 I know some projects get translations crowdsourced via one website or
 other.

Thank you for your comments.
I think crowdsourced translation is a good fit for a translation project
on GitHub.

Best,

Yu




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Can-I-translate-the-documentations-of-Spark-in-Japanese-tp7538p7615.html


Can I translate the documentation of Spark into Japanese?

2014-07-27 Thread Yu Ishikawa
Hi all,

I'm Yu Ishikawa, from Japan.
I would like to translate the documentation of Spark 1.0.x officially.
If I translate it and send a pull request, can you merge it?
And where is the best directory in which to put the Japanese documentation?

Best,
Yu



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Can-I-translate-the-documentations-of-Spark-in-Japanese-tp7538.html