[jira] [Updated] (SPARK-6721) IllegalStateException

2015-04-06 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Rodríguez Trejo updated SPARK-6721:

Description: 
I get the following exception when using saveAsNewAPIHadoopFile:
{code}
15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open
at org.bson.util.Assertions.isTrue(Assertions.java:36)
at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
at com.mongodb.DBCollection.insert(DBCollection.java:161)
at com.mongodb.DBCollection.insert(DBCollection.java:107)
at com.mongodb.DBCollection.save(DBCollection.java:1049)
at com.mongodb.DBCollection.save(DBCollection.java:1014)
at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Before Spark 1.3.0 this would crash the application, but now the data simply 
remains unprocessed.

There is no close call anywhere in the code.
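For context, a minimal sketch of the kind of write that hits this code path, assuming the mongo-hadoop connector ({{com.mongodb.hadoop.MongoOutputFormat}}), an existing SparkContext {{sc}}, and an illustrative output URI:
{code}
import org.apache.hadoop.conf.Configuration
import org.bson.{BSONObject, BasicBSONObject}
import com.mongodb.hadoop.MongoOutputFormat

// URI and collection names are placeholders for this sketch.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1:27017/test.example")

// A couple of placeholder documents; the connector assigns _id for null keys.
val docs = sc.parallelize(Seq("a", "b")).map { v =>
  val doc: BSONObject = new BasicBSONObject("value", v)
  (null.asInstanceOf[Object], doc)
}

docs.saveAsNewAPIHadoopFile(
  "file:///tmp/unused",      // path placeholder; the Mongo output format writes to the URI above
  classOf[Object],
  classOf[BSONObject],
  classOf[MongoOutputFormat[Object, BSONObject]],
  outputConfig)
{code}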

  was:
I get the following exception when using saveAsNewAPIHadoopFile:
bq. 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
10.0.2.15): java.lang.IllegalStateException: open
at org.bson.util.Assertions.isTrue(Assertions.java:36)
at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
at com.mongodb.DBCollection.insert(DBCollection.java:161)
at com.mongodb.DBCollection.insert(DBCollection.java:107)
at com.mongodb.DBCollection.save(DBCollection.java:1049)
at com.mongodb.DBCollection.save(DBCollection.java:1014)
at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Before Spark 1.3.0 this would result in the application crashing, but now the 
data just remains unprocessed.

There is no close instruction at any part of the code.


 IllegalStateException
 -

 Key: SPARK-6721
 URL: https://issues.apache.org/jira/browse/SPARK-6721
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.2.0, 1.2.1, 1.3.0
 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
Reporter: Luis Rodríguez Trejo
  Labels: MongoDB, java.lang.IllegalStateexception, 
 saveAsNewAPIHadoopFile

 I get the following exception when using saveAsNewAPIHadoopFile:
 {code}
 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
 10.0.2.15): java.lang.IllegalStateException: open
 at org.bson.util.Assertions.isTrue(Assertions.java:36)
 at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
 at com.mongodb.DBCollection.insert(DBCollection.java:161)
 at com.mongodb.DBCollection.insert(DBCollection.java:107)
 at com.mongodb.DBCollection.save(DBCollection.java:1049)
 at com.mongodb.DBCollection.save(DBCollection.java:1014)
 at 
 com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 

[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481534#comment-14481534
 ] 

Davies Liu commented on SPARK-6700:
---

There is one failure here: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2036/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/run_Python_application_in_yarn_cluster_mode/

and here: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2025/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.deploy.yarn/YarnClusterSuite/run_Python_application_in_yarn_cluster_mode/

Is it related to hadoop2.3?

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 

[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481455#comment-14481455
 ] 

Joseph K. Bradley commented on SPARK-6682:
--

As you're suggesting, a wrapper mechanism like that won't be an acceptable 
solution, since it would be a confusing, difficult-to-document API.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]
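 A rough sketch of what the deprecation could look like (hypothetical code, not 
 the actual MLlib classes; names and the version string are placeholders):
 {code}
 // Hypothetical sketch of deprecating a static train() in favor of a builder.
 class NaiveBayes private (private var lambda: Double) {
   def this() = this(1.0)                       // default parameter enforced here
   def setLambda(lambda: Double): this.type = { this.lambda = lambda; this }
   def run(data: Seq[(Double, Array[Double])]): NaiveBayesModel = new NaiveBayesModel(lambda)
 }
 class NaiveBayesModel(val lambda: Double)
 object NaiveBayes {
   @deprecated("Use new NaiveBayes().setLambda(...).run(data) instead", "1.4.0")
   def train(data: Seq[(Double, Array[Double])], lambda: Double = 1.0): NaiveBayesModel =
     new NaiveBayes().setLambda(lambda).run(data)
 }
 {code}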



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2015-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481464#comment-14481464
 ] 

Joseph K. Bradley commented on SPARK-3702:
--

Using Vector types is better since they store values as Array[Double], which 
avoids creating an object for every value.  If you're thinking about feature 
names/metadata, the Metadata capability in DataFrame will be able to handle 
metadata for each feature in Vector columns.
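For example, a small sketch using the standard MLlib vector factory (the values are illustrative):
{code}
import org.apache.spark.mllib.linalg.Vectors

// A dense Vector keeps its values in one Array[Double], so there is no
// per-value object allocation.
val features = Vectors.dense(Array(0.5, 1.25, 3.0))
println(features.size)   // 3
println(features(1))     // 1.25
{code}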

 Standardize MLlib classes for learners, models
 --

 Key: SPARK-3702
 URL: https://issues.apache.org/jira/browse/SPARK-3702
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Blocker

 Summary: Create a class hierarchy for learning algorithms and the models 
 those algorithms produce.
 This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
 of subtasks).  See the requires links below for subtasks.
 Goals:
 * give intuitive structure to API, both for developers and for generated 
 documentation
 * support meta-algorithms (e.g., boosting)
 * support generic functionality (e.g., evaluation)
 * reduce code duplication across classes
 [Design doc for class hierarchy | 
 https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]






[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-06 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481874#comment-14481874
 ] 

Burak Yavuz commented on SPARK-6407:


I actually worked on this over the weekend for fun and have a streaming, 
gradient-descent-based matrix factorization model implemented here: 
https://github.com/brkyvz/streaming-matrix-factorization

It is a very naive implementation, but it might be something to build on top of. 
I will publish a Spark Package for it as soon as I get the tests in. The model 
it uses for predicting the rating for user `u` and product `p` is:
{code}
r = U(u) * P^T(p) + bu(u) + bp(p) + mu
{code}
where U(u) is the u'th row of the user matrix, P(p) is the p'th row of the 
product matrix, bu(u) is the u'th element of the user bias vector, bp(p) is the 
p'th element of the product bias vector, and mu is the global average.
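In code form, that prediction rule is roughly the following (a standalone sketch; the factor rows and biases are assumed inputs of matching length):
{code}
// Sketch of r = U(u) * P^T(p) + bu(u) + bp(p) + mu for one (user, product) pair.
def predictRating(userFactors: Array[Double], productFactors: Array[Double],
                  userBias: Double, productBias: Double, globalAverage: Double): Double = {
  val dot = userFactors.zip(productFactors).map { case (uf, pf) => uf * pf }.sum
  dot + userBias + productBias + globalAverage
}
{code}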

 Streaming ALS for Collaborative Filtering
 -

 Key: SPARK-6407
 URL: https://issues.apache.org/jira/browse/SPARK-6407
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Felix Cheung
Priority: Minor

 Like MLlib's ALS implementation for recommendation, but applied to streaming.
 Similar to streaming linear regression and logistic regression, could we apply 
 gradient updates to batches of data and reuse the existing MLlib implementation?






[jira] [Created] (SPARK-6725) Model export/import for Pipeline API

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6725:


 Summary: Model export/import for Pipeline API
 Key: SPARK-6725
 URL: https://issues.apache.org/jira/browse/SPARK-6725
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical


This is an umbrella JIRA for adding model export/import to the spark.ml API.  
This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
format, not for other formats like PMML.

This will require the following steps:
* Add export/import for all PipelineStages supported by spark.ml
** This will include some Transformers which are not Models.
** These can use almost the same format as the spark.mllib model save/load 
functions, but the model metadata must store a different class name (marking 
the class as a spark.ml class).
* After all PipelineStages support save/load, add an interface which forces 
future additions to support save/load.







[jira] [Created] (SPARK-6722) Model import/export for StreamingKMeansModel

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6722:


 Summary: Model import/export for StreamingKMeansModel
 Key: SPARK-6722
 URL: https://issues.apache.org/jira/browse/SPARK-6722
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


CC: [~freeman-lab] Is this API stable enough to merit adding import/export 
(which will require supporting the model format version from now on)?






[jira] [Commented] (SPARK-5988) Model import/export for PowerIterationClusteringModel

2015-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481891#comment-14481891
 ] 

Joseph K. Bradley commented on SPARK-5988:
--

Feel free to go ahead!  I just assigned it to you.  Thanks!

 Model import/export for PowerIterationClusteringModel
 -

 Key: SPARK-5988
 URL: https://issues.apache.org/jira/browse/SPARK-5988
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Xusen Yin

 Add save/load for PowerIterationClusteringModel






[jira] [Updated] (SPARK-5988) Model import/export for PowerIterationClusteringModel

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5988:
-
Assignee: Xusen Yin

 Model import/export for PowerIterationClusteringModel
 -

 Key: SPARK-5988
 URL: https://issues.apache.org/jira/browse/SPARK-5988
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Xusen Yin

 Add save/load for PowerIterationClusteringModel






[jira] [Updated] (SPARK-6692) Add an option for client to kill AM when it is killed

2015-04-06 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated SPARK-6692:
-
Summary: Add an option for client to kill AM when it is killed  (was: Make 
it possible to kill AM in YARN cluster mode when the client is terminated)

 Add an option for client to kill AM when it is killed
 -

 Key: SPARK-6692
 URL: https://issues.apache.org/jira/browse/SPARK-6692
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor
  Labels: yarn

 I understand that the yarn-cluster mode is designed for a fire-and-forget 
 model; therefore, terminating the yarn client doesn't kill the AM.
 However, it is very common for users to submit Spark jobs via a job scheduler 
 (e.g. Apache Oozie) or a remote job server (e.g. Netflix Genie), where it is 
 expected that killing the yarn client will terminate the AM. 
 It is true that the yarn-client mode can be used in such cases. But then the 
 yarn client sometimes needs a lot of heap memory for big jobs if it runs in 
 yarn-client mode. In fact, the yarn-cluster mode is ideal for big jobs 
 because the AM can be given arbitrary heap memory, unlike the yarn client. So it 
 would be very useful to make it possible to kill the AM even in yarn-cluster 
 mode.
 In addition, Spark jobs often become zombie jobs if users ctrl-c them as soon 
 as they're accepted (but not yet running). Although they're eventually 
 shut down after the AM timeout, it would be nice if the AM could be killed 
 immediately in such cases too.






[jira] [Updated] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed

2015-04-06 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6222:
---
Fix Version/s: 1.4.0
   1.3.1

 [STREAMING] All data may not be recovered from WAL when driver is killed
 

 Key: SPARK-6222
 URL: https://issues.apache.org/jira/browse/SPARK-6222
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Hari Shreedharan
Priority: Blocker
 Fix For: 1.3.1, 1.4.0

 Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch


 When testing for our next release, our internal tests written by [~wypoon] 
 caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs 
 FlumePolling stream to read data from Flume, then kills the Application 
 Master. Once YARN restarts it, the test waits until no more data is to be 
 written and verifies the original against the data on HDFS. This was passing 
 in 1.2.0, but is failing now.
 Since the test ties into Cloudera's internal infrastructure and build 
 process, it cannot be directly run on an Apache build. But I have been 
 working on isolating the commit that may have caused the regression. I have 
 confirmed that it was caused by SPARK-5147 (PR # 
 [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several 
 times using the test and the failure is consistently reproducible. 
 To re-confirm, I reverted just this one commit (and Clock consolidation one 
 to avoid conflicts), and the issue was no longer reproducible.
 Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0
 /cc [~tdas], [~pwendell]






[jira] [Commented] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481639#comment-14481639
 ] 

Apache Spark commented on SPARK-6606:
-

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/4145

 Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd 
 object.
 -

 Key: SPARK-6606
 URL: https://issues.apache.org/jira/browse/SPARK-6606
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: SuYan

 1. Using code like the example below, the accumulator is found to be deserialized twice.
 First:
 {code}
 task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
 {code}
 Second:
 {code}
 val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
   ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
 {code}
 The first deserialization is not what is expected, because a ResultTask or 
 ShuffleMapTask carries a partition object.
 In the class
 {code}
 CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner)
 {code}
 the CoGroupPartition may contain a CoGroupSplitDep:
 {code}
 NarrowCoGroupSplitDep(
     rdd: RDD[_],
     splitIndex: Int,
     var split: Partition
   ) extends CoGroupSplitDep
 {code}
 That *NarrowCoGroupSplitDep* pulls in the rdd object, which results in the 
 first deserialization.
 Example:
 {code}
 val acc1 = sc.accumulator(0, "test1")
 val acc2 = sc.accumulator(0, "test2")
 val rdd1 = sc.parallelize((1 to 10).toSeq, 3)
 val rdd2 = sc.parallelize((1 to 10).toSeq, 3)
 val combine1 = rdd1.map { case a => (a, 1) }.combineByKey(
   a => {
     acc1 += 1
     a
   },
   (a: Int, b: Int) => {
     a + b
   },
   (a: Int, b: Int) => {
     a + b
   }, new HashPartitioner(3), mapSideCombine = false)
 val combine2 = rdd2.map { case a => (a, 1) }.combineByKey(
   a => {
     acc2 += 1
     a
   },
   (a: Int, b: Int) => {
     a + b
   },
   (a: Int, b: Int) => {
     a + b
   }, new HashPartitioner(3), mapSideCombine = false)
 combine1.cogroup(combine2, new HashPartitioner(3)).count()
 {code}






[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481711#comment-14481711
 ] 

Xiangrui Meng commented on SPARK-6407:
--

Attached the comment from Chunnan Yao in SPARK-6711:

On-line Collaborative Filtering (CF) has been widely used and studied. Re-training 
a CF model from scratch every time new data comes in is very inefficient 
(http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
 However, in the Spark community we see little discussion about collaborative 
filtering on streaming data. Given streaming k-means, streaming logistic 
regression, and the ongoing incremental model training of the Naive Bayes 
classifier (SPARK-4144), we think it is meaningful to consider streaming 
collaborative filtering support in MLlib. 

We have already been considering this issue during the past week. We plan 
to refer to this paper
(https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on 
SGD instead of ALS, which is easier to tackle with streaming data. 

Fortunately, the authors of this paper have implemented their algorithm as a 
GitHub project, based on Storm:
https://github.com/MrChrisJohnson/CollabStream

 Streaming ALS for Collaborative Filtering
 -

 Key: SPARK-6407
 URL: https://issues.apache.org/jira/browse/SPARK-6407
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Felix Cheung
Priority: Minor

 Like MLlib's ALS implementation for recommendation, but applied to streaming.
 Similar to streaming linear regression and logistic regression, could we apply 
 gradient updates to batches of data and reuse the existing MLlib implementation?






[jira] [Closed] (SPARK-6711) Support parallelized online matrix factorization for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-6711.

Resolution: Duplicate

 Support parallelized online matrix factorization for Collaborative Filtering 
 -

 Key: SPARK-6711
 URL: https://issues.apache.org/jira/browse/SPARK-6711
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chunnan Yao
   Original Estimate: 840h
  Remaining Estimate: 840h

 On-line Collaborative Filtering (CF) has been widely used and studied. Re-training 
 a CF model from scratch every time new data comes in is very inefficient 
 (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model).
  However, in the Spark community we see little discussion about collaborative 
 filtering on streaming data. Given streaming k-means, streaming logistic 
 regression, and the ongoing incremental model training of the Naive Bayes 
 classifier (SPARK-4144), we think it is meaningful to consider streaming 
 collaborative filtering support in MLlib. 
 We have already been considering this issue during the past week. We 
 plan to refer to this paper
 (https://www.cs.utexas.edu/~cjohnson/ParallelCollabFilt.pdf). It is based on 
 SGD instead of ALS, which is easier to tackle with streaming data. 
 Fortunately, the authors of this paper have implemented their algorithm as a 
 GitHub project, based on Storm:
 https://github.com/MrChrisJohnson/CollabStream






[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Assignee: Kai Sasaki

 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
 --

 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Assignee: Kai Sasaki
Priority: Minor

 Implement correct normL1 and normL2 test.
 continuation: https://github.com/apache/spark/pull/5359
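 For reference, a minimal Scala sketch of what such a test exercises, assuming 
 an existing SparkContext {{sc}} and the MLlib {{Statistics.colStats}} API (the 
 sample data is illustrative):
 {code}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.stat.Statistics

 // Column 0 holds (1.0, -3.0): normL1 = 4.0, normL2 = sqrt(10).
 // Column 1 holds (-2.0, 4.0): normL1 = 6.0, normL2 = sqrt(20).
 val data = sc.parallelize(Seq(Vectors.dense(1.0, -2.0), Vectors.dense(-3.0, 4.0)))
 val summary = Statistics.colStats(data)
 println(summary.normL1)
 println(summary.normL2)
 {code}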






[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Target Version/s: 1.4.0
   Fix Version/s: (was: 1.4.0)

 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
 --

 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor

 Implement correct normL1 and normL2 test.
 continuation: https://github.com/apache/spark/pull/5359






[jira] [Closed] (SPARK-6718) Improve the test on normL1/normL2 of summary statistics

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-6718.

Resolution: Duplicate

 Improve the test on normL1/normL2 of summary statistics
 ---

 Key: SPARK-6718
 URL: https://issues.apache.org/jira/browse/SPARK-6718
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Kai Sasaki
Priority: Minor

 As discussed on https://github.com/apache/spark/pull/5359, we should improve 
 the unit test there.






[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Component/s: PySpark

 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
 --

 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor

 Implement correct normL1 and normL2 test.
 continuation: https://github.com/apache/spark/pull/5359






[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Affects Version/s: (was: 1.3.0)
   1.4.0

 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
 --

 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor

 Implement correct normL1 and normL2 test.
 continuation: https://github.com/apache/spark/pull/5359






[jira] [Updated] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6720:
-
Issue Type: Improvement  (was: Bug)

 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
 --

 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Kai Sasaki
Priority: Minor

 Implement correct normL1 and normL2 test.
 continuation: https://github.com/apache/spark/pull/5359






[jira] [Updated] (SPARK-6713) Iterators in columnSimilarities to allow flatMap spill

2015-04-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6713:
-
Assignee: Reza Zadeh

 Iterators in columnSimilarities to allow flatMap spill
 --

 Key: SPARK-6713
 URL: https://issues.apache.org/jira/browse/SPARK-6713
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Reza Zadeh
Assignee: Reza Zadeh
 Fix For: 1.4.0


 We should use Iterators in columnSimilarities to allow mapPartitionsWithIndex 
 to spill to disk. This can be needed with a dense and large column: this way 
 Spark can spill the pairs onto disk instead of building all the pairs before 
 handing them to Spark.
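 Roughly, the idea is to return an Iterator from the per-partition function 
 instead of materializing a buffer, so Spark can consume and spill the pairs 
 lazily. A generic sketch (the pair generation is illustrative, not the actual 
 columnSimilarities code):
 {code}
 import org.apache.spark.rdd.RDD

 // Emit column-pair contributions lazily from each partition instead of
 // building them all in memory first.
 def pairContributions(rows: RDD[Array[Double]]): RDD[((Int, Int), Double)] =
   rows.mapPartitionsWithIndex { (_, it) =>
     it.flatMap { row =>
       for {
         i <- row.indices.iterator
         j <- row.indices.iterator if j > i
       } yield ((i, j), row(i) * row(j))
     }
   }
 {code}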






[jira] [Created] (SPARK-6724) Model import/export for FPGrowth

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6724:


 Summary: Model import/export for FPGrowth
 Key: SPARK-6724
 URL: https://issues.apache.org/jira/browse/SPARK-6724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor


Note: experimental model API






[jira] [Created] (SPARK-6723) Model import/export for ChiSqSelector

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6723:


 Summary: Model import/export for ChiSqSelector
 Key: SPARK-6723
 URL: https://issues.apache.org/jira/browse/SPARK-6723
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor









[jira] [Commented] (SPARK-6710) Wrong initial bias in GraphX SVDPlusPlus

2015-04-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482063#comment-14482063
 ] 

Reynold Xin commented on SPARK-6710:


[~michaelmalak] would you like to submit a pull request for this?

 Wrong initial bias in GraphX SVDPlusPlus
 

 Key: SPARK-6710
 URL: https://issues.apache.org/jira/browse/SPARK-6710
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Michael Malak
  Labels: easyfix
   Original Estimate: 2h
  Remaining Estimate: 2h

 In the initialization portion of GraphX SVDPlusPlus, the initialization of 
 biases appears to be incorrect. Specifically, in line 
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96
  
 instead of 
 (vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1)) 
 it should probably be 
 (vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / 
 scala.math.sqrt(msg.get._1)) 
 That is, the biases bu and bi (both represented as the third component of the 
 Tuple4[] above, depending on whether the vertex is a user or an item), 
 described in equation (1) of the Koren paper, are supposed to be small 
 offsets to the mean (represented by the variable u, signifying the Greek 
 letter mu) to account for peculiarities of individual users and items. 
 Initializing these biases to wrong values should theoretically not matter 
 given enough iterations of the algorithm, but some quick empirical testing 
 shows it has trouble converging at all, even after many orders of magnitude 
 additional iterations. 
 This perhaps could be the source of previously reported trouble with 
 SVDPlusPlus. 
 http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-SVDPlusPlus-problem-td12885.html
  






[jira] [Created] (SPARK-6728) Improve performance of py4j for large bytearray

2015-04-06 Thread Davies Liu (JIRA)
Davies Liu created SPARK-6728:
-

 Summary: Improve performance of py4j for large bytearray
 Key: SPARK-6728
 URL: https://issues.apache.org/jira/browse/SPARK-6728
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu


PySpark relies on py4j to transfer function arguments and return values between 
Python and the JVM, and it is very slow at passing a large bytearray (larger than 10M). 

In MLlib, it's possible to have a Vector with more than 100M of bytes, which will 
need a few GB of memory and may crash.

The reason is that py4j uses a text protocol: it encodes the bytearray as 
base64 and does multiple string concatenations. 

A binary protocol would help a lot; an issue has been created for py4j: 
https://github.com/bartdag/py4j/issues/159






[jira] [Assigned] (SPARK-6229) Support SASL encryption in network/common module

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6229:
---

Assignee: (was: Apache Spark)

 Support SASL encryption in network/common module
 

 Key: SPARK-6229
 URL: https://issues.apache.org/jira/browse/SPARK-6229
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin

 After SASL support has been added to network/common, supporting encryption 
 should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
 Since the latter requires a valid kerberos login to work (and so doesn't 
 really work with executors), encryption would require the use of DIGEST-MD5.






[jira] [Assigned] (SPARK-6229) Support SASL encryption in network/common module

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6229:
---

Assignee: Apache Spark

 Support SASL encryption in network/common module
 

 Key: SPARK-6229
 URL: https://issues.apache.org/jira/browse/SPARK-6229
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin
Assignee: Apache Spark

 After SASL support has been added to network/common, supporting encryption 
 should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
 Since the latter requires a valid kerberos login to work (and so doesn't 
 really work with executors), encryption would require the use of DIGEST-MD5.






[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482157#comment-14482157
 ] 

Apache Spark commented on SPARK-6229:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5377

 Support SASL encryption in network/common module
 

 Key: SPARK-6229
 URL: https://issues.apache.org/jira/browse/SPARK-6229
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin

 After SASL support has been added to network/common, supporting encryption 
 should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
 Since the latter requires a valid kerberos login to work (and so doesn't 
 really work with executors), encryption would require the use of DIGEST-MD5.






[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-04-06 Thread Patrick Walsh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482244#comment-14482244
 ] 

Patrick Walsh commented on SPARK-5281:
--

I also have this issue with Spark 1.3.0.  Even example snippets where case 
classes are used in the RDDs trigger the problem.  For me, this happens from 
Eclipse and from sbt.

 Registering table on RDD is giving MissingRequirementError
 --

 Key: SPARK-5281
 URL: https://issues.apache.org/jira/browse/SPARK-5281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: sarsol
Priority: Critical

 The application crashes on this line {{rdd.registerTempTable(temp)}} in the 1.2 
 version when using sbt or the Eclipse Scala IDE.
 Stacktrace:
 {code}
 Exception in thread main scala.reflect.internal.MissingRequirementError: 
 class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
 primordial classloader with boot classpath 
 [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
  Files\Java\jre7\lib\resources.jar;C:\Program 
 Files\Java\jre7\lib\rt.jar;C:\Program 
 Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
 Files\Java\jre7\lib\jsse.jar;C:\Program 
 Files\Java\jre7\lib\jce.jar;C:\Program 
 Files\Java\jre7\lib\charsets.jar;C:\Program 
 Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
   at 
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
   at 
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
   at 
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
   at 
 com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
   at 
 scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
   at scala.App$class.main(App.scala:71)
 {code}






[jira] [Commented] (SPARK-6704) integrate SparkR docs build tool into Spark doc build

2015-04-06 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481972#comment-14481972
 ] 

Davies Liu commented on SPARK-6704:
---

Great, thanks!

 integrate SparkR docs build tool into Spark doc build
 -

 Key: SPARK-6704
 URL: https://issues.apache.org/jira/browse/SPARK-6704
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Davies Liu
Priority: Blocker

 We should integrate the SparkR docs build tool into Spark one.






[jira] [Created] (SPARK-6729) DriverQuirks get can get OutOfBounds exception is some cases

2015-04-06 Thread Volodymyr Lyubinets (JIRA)
Volodymyr Lyubinets created SPARK-6729:
--

 Summary: DriverQuirks get can get OutOfBounds exception is some 
cases
 Key: SPARK-6729
 URL: https://issues.apache.org/jira/browse/SPARK-6729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Volodymyr Lyubinets
Priority: Minor


The function uses .substring(0, X), which will trigger OutOfBoundsException if 
string length is less than X. A better way to do this is to use startsWith, 
which won't error out in this case. I'll propose a patch shortly.
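A small sketch of the difference (the URL value and prefix are illustrative):
{code}
val url = "jdbc:h2"                            // shorter than the checked prefix

// url.substring(0, 11) == "jdbc:mysql:"       // throws StringIndexOutOfBoundsException
val isMySql = url.startsWith("jdbc:mysql:")    // safely returns false
println(isMySql)
{code}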






[jira] [Assigned] (SPARK-6729) DriverQuirks get can get OutOfBounds exception is some cases

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6729:
---

Assignee: (was: Apache Spark)

 DriverQuirks get can get OutOfBounds exception is some cases
 

 Key: SPARK-6729
 URL: https://issues.apache.org/jira/browse/SPARK-6729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Volodymyr Lyubinets
Priority: Minor

 The function uses .substring(0, X), which will trigger OutOfBoundsException 
 if string length is less than X. A better way to do this is to use 
 startsWith, which won't error out in this case. I'll propose a patch shortly.






[jira] [Commented] (SPARK-6729) DriverQuirks get can get OutOfBounds exception is some cases

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482193#comment-14482193
 ] 

Apache Spark commented on SPARK-6729:
-

User 'vlyubin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5378

 DriverQuirks get can get OutOfBounds exception is some cases
 

 Key: SPARK-6729
 URL: https://issues.apache.org/jira/browse/SPARK-6729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Volodymyr Lyubinets
Priority: Minor

 The function uses .substring(0, X), which will trigger OutOfBoundsException 
 if string length is less than X. A better way to do this is to use 
 startsWith, which won't error out in this case. I'll propose a patch shortly.






[jira] [Created] (SPARK-6726) Model export/import for spark.ml: LogisticRegression

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6726:


 Summary: Model export/import for spark.ml: LogisticRegression
 Key: SPARK-6726
 URL: https://issues.apache.org/jira/browse/SPARK-6726
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley









[jira] [Updated] (SPARK-6728) Improve performance of py4j for large bytearray

2015-04-06 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6728:

Affects Version/s: 1.3.0

 Improve performance of py4j for large bytearray
 ---

 Key: SPARK-6728
 URL: https://issues.apache.org/jira/browse/SPARK-6728
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Davies Liu

 PySpark relies on py4j to transfer function arguments and return values between 
 Python and the JVM, and it is very slow at passing a large bytearray (larger than 10M). 
 In MLlib, it's possible to have a Vector with more than 100M of bytes, which 
 will need a few GB of memory and may crash.
 The reason is that py4j uses a text protocol: it encodes the bytearray as 
 base64 and does multiple string concatenations. 
 A binary protocol would help a lot; an issue has been created for py4j: 
 https://github.com/bartdag/py4j/issues/159






[jira] [Updated] (SPARK-6728) Improve performance of py4j for large bytearray

2015-04-06 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6728:

Priority: Critical  (was: Major)
Target Version/s: 1.4.0

 Improve performance of py4j for large bytearray
 ---

 Key: SPARK-6728
 URL: https://issues.apache.org/jira/browse/SPARK-6728
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.3.0
Reporter: Davies Liu
Priority: Critical

 PySpark relies on py4j to transfer function arguments and return values between 
 Python and the JVM, and it is very slow at passing a large bytearray (larger than 10M). 
 In MLlib, it's possible to have a Vector with more than 100M of bytes, which 
 will need a few GB of memory and may crash.
 The reason is that py4j uses a text protocol: it encodes the bytearray as 
 base64 and does multiple string concatenations. 
 A binary protocol would help a lot; an issue has been created for py4j: 
 https://github.com/bartdag/py4j/issues/159






[jira] [Created] (SPARK-6727) Model export/import for spark.ml: HashingTF

2015-04-06 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6727:


 Summary: Model export/import for spark.ml: HashingTF
 Key: SPARK-6727
 URL: https://issues.apache.org/jira/browse/SPARK-6727
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley









[jira] [Assigned] (SPARK-6729) DriverQuirks get can get OutOfBounds exception is some cases

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6729:
---

Assignee: Apache Spark

 DriverQuirks get can get OutOfBounds exception is some cases
 

 Key: SPARK-6729
 URL: https://issues.apache.org/jira/browse/SPARK-6729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Volodymyr Lyubinets
Assignee: Apache Spark
Priority: Minor

 The function uses .substring(0, X), which will trigger OutOfBoundsException 
 if string length is less than X. A better way to do this is to use 
 startsWith, which won't error out in this case. I'll propose a patch shortly.






[jira] [Commented] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-04-06 Thread Sai Nishanth Parepally (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482297#comment-14482297
 ] 

Sai Nishanth Parepally commented on SPARK-3219:
---

[~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering 
going to be merged into MLlib? I would like to use Jaccard distance as a 
distance metric for k-means clustering.

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns
  Labels: clustering

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.
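 For illustration only, a pluggable distance function could look roughly like this (a sketch; this is 
 not the API of the linked project):
 {code:scala}
 trait PointDistance extends Serializable {
   def apply(a: Array[Double], b: Array[Double]): Double
 }

 // Squared Euclidean distance is itself a Bregman divergence; other divergences
 // could be plugged in behind the same trait.
 object SquaredEuclidean extends PointDistance {
   def apply(a: Array[Double], b: Array[Double]): Double =
     a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
 }
 {code}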






[jira] [Commented] (SPARK-6721) IllegalStateException

2015-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482367#comment-14482367
 ] 

Sean Owen commented on SPARK-6721:
--

(Also IllegalStateException isn't a useful JIRA name -- please edit it to 
something more meaningful, like including mongo)

 IllegalStateException
 -

 Key: SPARK-6721
 URL: https://issues.apache.org/jira/browse/SPARK-6721
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.2.0, 1.2.1, 1.3.0
 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
Reporter: Luis Rodríguez Trejo
  Labels: MongoDB, java.lang.IllegalStateexception, 
 saveAsNewAPIHadoopFile

 I get the following exception when using saveAsNewAPIHadoopFile:
 {code}
 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
 10.0.2.15): java.lang.IllegalStateException: open
 at org.bson.util.Assertions.isTrue(Assertions.java:36)
 at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
 at com.mongodb.DBCollection.insert(DBCollection.java:161)
 at com.mongodb.DBCollection.insert(DBCollection.java:107)
 at com.mongodb.DBCollection.save(DBCollection.java:1049)
 at com.mongodb.DBCollection.save(DBCollection.java:1014)
 at 
 com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 Before Spark 1.3.0 this would result in the application crashing, but now the 
 data just remains unprocessed.
 There is no close instruction at any part of the code.
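 For reference, a hedged sketch of the call pattern being described (given a SparkContext {{sc}}; the 
 output URI, database, collection and types are placeholders, and it assumes the mongo-hadoop 
 connector's MongoOutputFormat, which writes to the configured URI rather than the path argument):
 {code:scala}
 import org.apache.hadoop.conf.Configuration
 import org.apache.spark.SparkContext._   // pair RDD functions on Spark 1.2
 import org.bson.BasicBSONObject
 import com.mongodb.hadoop.MongoOutputFormat

 val outputConfig = new Configuration()
 outputConfig.set("mongo.output.uri", "mongodb://10.0.2.15:27017/mydb.mycollection")

 val docs = sc.parallelize(Seq(("id1", new BasicBSONObject("value", Int.box(1)))))
 docs.saveAsNewAPIHadoopFile(
   "file:///tmp/unused",                                  // path is not used by MongoOutputFormat
   classOf[Object],
   classOf[BasicBSONObject],
   classOf[MongoOutputFormat[Object, BasicBSONObject]],
   outputConfig)
 {code}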






[jira] [Updated] (SPARK-6730) Can't have table as identifier in OPTIONS

2015-04-06 Thread Alex Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Liu updated SPARK-6730:

Description: 
The following query fails because there is an identifier, {{table}}, in OPTIONS:

{code}
CREATE TEMPORARY TABLE ddlTable
USING org.apache.spark.sql.cassandra
OPTIONS (
 table test1,
 keyspace test
)
{code} 

The following error results:

{code}

]   java.lang.RuntimeException: [1.2] failure: ``insert'' expected but 
identifier CREATE found
[info] 
[info]  CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra 
OPTIONS (  table test1,  keyspace dstest  )   
[info]  ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info]   at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
{code}

  was:
The following query fails because there is an identifier, {{table}}, in OPTIONS:

{code}
CREATE TEMPORARY TABLE ddlTable
USING org.apache.spark.sql.cassandra
OPTIONS (
 table test1,
 keyspace test
{code} 

The following error results:

{code}

]   java.lang.RuntimeException: [1.2] failure: ``insert'' expected but 
identifier CREATE found
[info] 
[info]  CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra 
OPTIONS (  table test1,  keyspace dstest  )   
[info]  ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 

[jira] [Created] (SPARK-6730) Can't have table as identifier in OPTIONS

2015-04-06 Thread Alex Liu (JIRA)
Alex Liu created SPARK-6730:
---

 Summary: Can't have table as identifier in OPTIONS
 Key: SPARK-6730
 URL: https://issues.apache.org/jira/browse/SPARK-6730
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Alex Liu


The following query fails because there is an identifier, {{table}}, in OPTIONS:

{code}
CREATE TEMPORARY TABLE ddlTable
USING org.apache.spark.sql.cassandra
OPTIONS (
 table test1,
 keyspace test
{code} 

The following error results:

{code}

]   java.lang.RuntimeException: [1.2] failure: ``insert'' expected but 
identifier CREATE found
[info] 
[info]  CREATE TEMPORARY TABLE ddlTable USING org.apache.spark.sql.cassandra 
OPTIONS (  table test1,  keyspace dstest  )   
[info]  ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:40)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:130)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
[info]   at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info]   at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info]   at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info]   at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
[info]   at 
org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:134)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:134)
{code}
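
A minimal sketch of reproducing this from Scala (assuming a {{SQLContext}} named {{sqlContext}} and the 
Cassandra data source from the report; option values are written as quoted string literals, as the 
data sources DDL syntax expects):

{code:scala}
val ddl =
  """CREATE TEMPORARY TABLE ddlTable
    |USING org.apache.spark.sql.cassandra
    |OPTIONS (
    |  table "test1",
    |  keyspace "test"
    |)""".stripMargin
sqlContext.sql(ddl)   // fails on 1.3.0: "table" is not accepted as an option key (SPARK-6730)
{code}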






[jira] [Commented] (SPARK-6721) IllegalStateException

2015-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482366#comment-14482366
 ] 

Sean Owen commented on SPARK-6721:
--

Isn't this an error / config problem in Mongo rather than Spark?

 IllegalStateException
 -

 Key: SPARK-6721
 URL: https://issues.apache.org/jira/browse/SPARK-6721
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.2.0, 1.2.1, 1.3.0
 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
Reporter: Luis Rodríguez Trejo
  Labels: MongoDB, java.lang.IllegalStateexception, 
 saveAsNewAPIHadoopFile

 I get the following exception when using saveAsNewAPIHadoopFile:
 {code}
 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 
 10.0.2.15): java.lang.IllegalStateException: open
 at org.bson.util.Assertions.isTrue(Assertions.java:36)
 at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
 at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
 at com.mongodb.DBCollection.insert(DBCollection.java:161)
 at com.mongodb.DBCollection.insert(DBCollection.java:107)
 at com.mongodb.DBCollection.save(DBCollection.java:1049)
 at com.mongodb.DBCollection.save(DBCollection.java:1014)
 at 
 com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 Before Spark 1.3.0 this would result in the application crashing, but now the 
 data just remains unprocessed.
 There is no close instruction at any part of the code.






[jira] [Updated] (SPARK-6599) Improve reliability and usability of Kinesis-based Spark Streaming

2015-04-06 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6599:

Summary: Improve reliability and usability of Kinesis-based Spark Streaming 
 (was: Add Kinesis Direct API)

 Improve reliability and usability of Kinesis-based Spark Streaming
 --

 Key: SPARK-6599
 URL: https://issues.apache.org/jira/browse/SPARK-6599
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das








[jira] [Updated] (SPARK-2960) Spark executables fail to start via symlinks

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2960:
-
Component/s: Deploy

 Spark executables fail to start via symlinks
 

 Key: SPARK-2960
 URL: https://issues.apache.org/jira/browse/SPARK-2960
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Shay Rojansky
Priority: Minor

 The current scripts (e.g. pyspark) fail to run when they are executed via 
 symlinks. A common Linux scenario would be to have Spark installed somewhere 
 (e.g. /opt) and have a symlink to it in /usr/bin.






[jira] [Created] (SPARK-6732) Scala existentials warning during compilation

2015-04-06 Thread Raymond Tay (JIRA)
Raymond Tay created SPARK-6732:
--

 Summary: Scala existentials warning during compilation
 Key: SPARK-6732
 URL: https://issues.apache.org/jira/browse/SPARK-6732
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
 Environment: operating system: OSX Yosemite
scala version: 2.10.4
hardware: 2.7 GHz Intel Core i7, 16 GB 1600 MHz DDR3

Reporter: Raymond Tay
Priority: Minor


Certain parts of the Scala code were detected to use existentials, but the 
Scala language import can be included in the source file to prevent such warnings.






[jira] [Commented] (SPARK-6343) Make doc more explicit regarding network connectivity requirements

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482496#comment-14482496
 ] 

Apache Spark commented on SPARK-6343:
-

User 'parente' has created a pull request for this issue:
https://github.com/apache/spark/pull/5382

 Make doc more explicit regarding network connectivity requirements
 --

 Key: SPARK-6343
 URL: https://issues.apache.org/jira/browse/SPARK-6343
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Peter Parente
Priority: Minor

 As a new user of Spark, I read through the official documentation before 
 attempting to stand up my own cluster and write my own driver application. 
 But only after attempting to run my app remotely against my cluster did I 
 realize that full network connectivity (layer 3) is necessary between my 
 driver program and worker nodes (i.e., my driver was *listening* for 
 connections from my workers).
 I returned to the documentation to see how I had missed this requirement. On 
 a second read-through, I saw that the doc hints at it in a few places (e.g., 
 [driver 
 config|http://spark.apache.org/docs/1.2.0/configuration.html#networking], 
 [submitting applications 
 suggestion|http://spark.apache.org/docs/1.2.0/submitting-applications.html], 
 [cluster overview|http://spark.apache.org/docs/1.2.0/cluster-overview.html])  
 but never outright says it.
 I think it would help would-be users better understand how Spark works to 
 state the network connectivity requirements right up-front in the overview 
 section of the doc. I suggest revising the diagram and accompanying text 
 found on the [overview 
 page|http://spark.apache.org/docs/1.2.0/cluster-overview.html]:
 !http://spark.apache.org/docs/1.2.0/img/cluster-overview.png!
 so that it depicts at least the directionality of the network connections 
 initiated (perhaps like so):
 !http://i.imgur.com/2dqGbCr.png!
 and states that the driver must listen for and accept connections from other 
 Spark components on a variety of ports.
 Please treat my diagram and text as strawmen: I expect more experienced Spark 
 users and developers will have better ideas on how to convey these 
 requirements.






[jira] [Created] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-06 Thread Raymond Tay (JIRA)
Raymond Tay created SPARK-6733:
--

 Summary: Suppression of usage of Scala existential code should be 
done
 Key: SPARK-6733
 URL: https://issues.apache.org/jira/browse/SPARK-6733
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.3.0
 Environment: OS: OSX Yosemite
Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay


The inclusion of this statement in the file 

{code:scala}
import scala.language.existentials
{code}

should have suppressed all warnings regarding the use of Scala existential code.
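
For context, a minimal sketch of the kind of code that provokes the existentials feature warning and 
how the import suppresses it (the example is illustrative, not taken from the Spark sources):

{code:scala}
import scala.language.existentials   // suppresses the feature warning for the inferred type below

object ExistentialExample {
  // The least upper bound of these two Class values is an existential type
  // (roughly Class[_ >: Int with String]), which is what the compiler flags.
  val classes = Seq(classOf[String], classOf[Int])
  val first: Class[_] = classes.head
}
{code}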






[jira] [Resolved] (SPARK-6729) DriverQuirks.get can get an OutOfBounds exception in some cases

2015-04-06 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-6729.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

 DriverQuirks.get can get an OutOfBounds exception in some cases
 

 Key: SPARK-6729
 URL: https://issues.apache.org/jira/browse/SPARK-6729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
Priority: Minor
 Fix For: 1.4.0


 The function uses .substring(0, X), which will trigger OutOfBoundsException 
 if string length is less than X. A better way to do this is to use 
 startsWith, which won't error out in this case. I'll propose a patch shortly.






[jira] [Updated] (SPARK-6729) DriverQuirks.get can get an OutOfBounds exception in some cases

2015-04-06 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-6729:
--
Assignee: Volodymyr Lyubinets

 DriverQuirks.get can get an OutOfBounds exception in some cases
 

 Key: SPARK-6729
 URL: https://issues.apache.org/jira/browse/SPARK-6729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Volodymyr Lyubinets
Assignee: Volodymyr Lyubinets
Priority: Minor
 Fix For: 1.4.0


 The function uses .substring(0, X), which will trigger OutOfBoundsException 
 if string length is less than X. A better way to do this is to use 
 startsWith, which won't error out in this case. I'll propose a patch shortly.






[jira] [Commented] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set

2015-04-06 Thread Kostas Sakellis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482414#comment-14482414
 ] 

Kostas Sakellis commented on SPARK-6506:


I ran into this issue too by running:
bq. spark-submit  --master yarn-cluster examples/pi.py 4

it looks like I only had to set: spark.yarn.appMasterEnv.SPARK_HOME=/bogus to 
get it going:
bq. spark-submit --conf spark.yarn.appMasterEnv.SPARK_HOME=/bogus --master 
yarn-cluster pi.py 4


 python support yarn cluster mode requires SPARK_HOME to be set
 --

 Key: SPARK-6506
 URL: https://issues.apache.org/jira/browse/SPARK-6506
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Thomas Graves

 We added support for Python running in yarn-cluster mode in 
 https://issues.apache.org/jira/browse/SPARK-5173, but it requires that 
 SPARK_HOME be set in the environment variables for the application master and 
 executors.  It doesn't have to be set to anything real, but it fails if it's not 
 set.  See the command at the end of: https://github.com/apache/spark/pull/3976






[jira] [Assigned] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6731:
---

Assignee: (was: Apache Spark)

 Upgrade Apache commons-math3 to 3.4.1
 -

 Key: SPARK-6731
 URL: https://issues.apache.org/jira/browse/SPARK-6731
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Punya Biswal

 Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. 
 The current version (3.4.1) includes approximate percentile statistics (among 
 other things).
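 For example (a sketch; it assumes the {{PSquarePercentile}} estimator in the 3.x line is the 
 approximate-percentile support referred to above):
 {code:scala}
 import org.apache.commons.math3.stat.descriptive.rank.PSquarePercentile

 object ApproxMedian {
   def main(args: Array[String]): Unit = {
     val median = new PSquarePercentile(50.0)            // P-square algorithm: streaming, constant memory
     (1 to 1000000).foreach(i => median.increment(i.toDouble))
     println(median.getResult)                           // approximately 500000
   }
 }
 {code}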






[jira] [Commented] (SPARK-6731) Upgrade Apache commons-math3 to 3.4.1

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482460#comment-14482460
 ] 

Apache Spark commented on SPARK-6731:
-

User 'punya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5380

 Upgrade Apache commons-math3 to 3.4.1
 -

 Key: SPARK-6731
 URL: https://issues.apache.org/jira/browse/SPARK-6731
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Punya Biswal

 Spark depends on Apache commons-math3 version 3.1.1, which is 2 years old. 
 The current version (3.4.1) includes approximate percentile statistics (among 
 other things).






[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-04-06 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482291#comment-14482291
 ] 

William Benton commented on SPARK-5281:
---

As [~marmbrus] recently pointed out on the user list, this happens when you 
don't have all of the dependencies for Scala reflection loaded by the 
primordial classloader.  For running apps from sbt, setting {{fork := true}} 
should do the trick.  For running a REPL from sbt, try [this 
workaround|http://chapeau.freevariable.com/2015/04/spark-sql-repl.html].  
(Sorry to not have a solution for Eclipse.)
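
For running apps from sbt, the workaround is a one-line build setting, e.g. in build.sbt:

{code:scala}
// Run the application in a forked JVM so the Scala reflection classes are loaded
// by that JVM's own classloaders rather than sbt's layered ones.
fork := true
{code}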

 Registering table on RDD is giving MissingRequirementError
 --

 Key: SPARK-5281
 URL: https://issues.apache.org/jira/browse/SPARK-5281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: sarsol
Priority: Critical

 Application crashes on this line  {{rdd.registerTempTable(temp)}}  in 1.2 
 version when using sbt or Eclipse SCALA IDE
 Stacktrace:
 {code}
 Exception in thread main scala.reflect.internal.MissingRequirementError: 
 class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
 primordial classloader with boot classpath 
 [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
  Files\Java\jre7\lib\resources.jar;C:\Program 
 Files\Java\jre7\lib\rt.jar;C:\Program 
 Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
 Files\Java\jre7\lib\jsse.jar;C:\Program 
 Files\Java\jre7\lib\jce.jar;C:\Program 
 Files\Java\jre7\lib\charsets.jar;C:\Program 
 Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
   at 
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
   at 
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
   at 
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
   at 
 com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
   at 
 scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
   at scala.App$class.main(App.scala:71)
 {code}






[jira] [Updated] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself

2015-04-06 Thread Chris Fregly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Fregly updated SPARK-6514:

Target Version/s: 1.4.0  (was: 1.3.1)

 For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as 
 the Kinesis stream itself  
 

 Key: SPARK-6514
 URL: https://issues.apache.org/jira/browse/SPARK-6514
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Chris Fregly

 This was not supported when I originally wrote this receiver.
 It is now supported.  Also, upgrade to the latest Kinesis Client Library 
 (KCL), which is 1.2, I believe.






[jira] [Commented] (SPARK-6734) Support GenericUDTF.close for Generate

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482567#comment-14482567
 ] 

Apache Spark commented on SPARK-6734:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/5383

 Support GenericUDTF.close for Generate
 --

 Key: SPARK-6734
 URL: https://issues.apache.org/jira/browse/SPARK-6734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 Some third-party UDTF extensions generate more rows in the 
 GenericUDTF.close() method, which is supported by Hive.
 https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
 However, Spark SQL ignores GenericUDTF.close(), which causes bugs when 
 porting jobs from Hive to Spark SQL.






[jira] [Assigned] (SPARK-6734) Support GenericUDTF.close for Generate

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6734:
---

Assignee: Apache Spark

 Support GenericUDTF.close for Generate
 --

 Key: SPARK-6734
 URL: https://issues.apache.org/jira/browse/SPARK-6734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark

 Some third-party UDTF extensions generate more rows in the 
 GenericUDTF.close() method, which is supported by Hive.
 https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
 However, Spark SQL ignores GenericUDTF.close(), which causes bugs when 
 porting jobs from Hive to Spark SQL.






[jira] [Created] (SPARK-6734) Support GenericUDTF.close for Generate

2015-04-06 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-6734:


 Summary: Support GenericUDTF.close for Generate
 Key: SPARK-6734
 URL: https://issues.apache.org/jira/browse/SPARK-6734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao


Some third-party UDTF extensions generate more rows in the 
GenericUDTF.close() method, which is supported by Hive.

https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF

However, Spark SQL ignores GenericUDTF.close(), which causes bugs when 
porting jobs from Hive to Spark SQL.
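
For illustration, a minimal sketch of the pattern (a hypothetical UDTF, not from the issue, that 
forwards one extra summary row from close(); those are the rows lost if close() is never honored):

{code:scala}
import java.util.{Arrays => JArrays}
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class CountingEcho extends GenericUDTF {
  private var seen = 0L

  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector =
    ObjectInspectorFactory.getStandardStructObjectInspector(
      JArrays.asList("value"),
      JArrays.asList(PrimitiveObjectInspectorFactory.javaStringObjectInspector: ObjectInspector))

  override def process(args: Array[AnyRef]): Unit = {
    seen += 1
    forward(Array[AnyRef](String.valueOf(args(0))))       // echo each input row
  }

  // Hive calls close() after the last input row; rows forwarded here are silently
  // dropped if the engine ignores close(), which is the bug described above.
  override def close(): Unit =
    forward(Array[AnyRef](s"total rows: $seen"))
}
{code}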






[jira] [Assigned] (SPARK-6734) Support GenericUDTF.close for Generate

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6734:
---

Assignee: (was: Apache Spark)

 Support GenericUDTF.close for Generate
 --

 Key: SPARK-6734
 URL: https://issues.apache.org/jira/browse/SPARK-6734
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 Some third-party UDTF extensions generate more rows in the 
 GenericUDTF.close() method, which is supported by Hive.
 https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
 However, Spark SQL ignores GenericUDTF.close(), which causes bugs when 
 porting jobs from Hive to Spark SQL.






[jira] [Assigned] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6733:
---

Assignee: (was: Apache Spark)

 Suppression of usage of Scala existential code should be done
 -

 Key: SPARK-6733
 URL: https://issues.apache.org/jira/browse/SPARK-6733
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.3.0
 Environment: OS: OSX Yosemite
 Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay

 The inclusion of this statement in the file 
 {code:scala}
 import scala.language.existentials
 {code}
 should have suppressed all warnings regarding the use of Scala existential 
 code.






[jira] [Commented] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482630#comment-14482630
 ] 

Apache Spark commented on SPARK-6733:
-

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/5384

 Suppression of usage of Scala existential code should be done
 -

 Key: SPARK-6733
 URL: https://issues.apache.org/jira/browse/SPARK-6733
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.3.0
 Environment: OS: OSX Yosemite
 Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay

 The inclusion of this statement in the file 
 {code:scala}
 import scala.language.existentials
 {code}
 should have suppressed all warnings regarding the use of Scala existential 
 code.






[jira] [Assigned] (SPARK-6733) Suppression of usage of Scala existential code should be done

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6733:
---

Assignee: Apache Spark

 Suppression of usage of Scala existential code should be done
 -

 Key: SPARK-6733
 URL: https://issues.apache.org/jira/browse/SPARK-6733
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.3.0
 Environment: OS: OSX Yosemite
 Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay
Assignee: Apache Spark

 The inclusion of this statement in the file 
 {code:scala}
 import scala.language.existentials
 {code}
 should have suppressed all warnings regarding the use of Scala existential 
 code.






[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing

2015-04-06 Thread Svend Vanderveken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481055#comment-14481055
 ] 

Svend Vanderveken commented on SPARK-6630:
--

Oh, OK. For the record (and my education...), could you clarify how this 
breaks binary compatibility?  Do you mean that client code written against an 
older version of Spark would no longer work on this version? 

 SparkConf.setIfMissing should only evaluate the assigned value if indeed 
 missing
 

 Key: SPARK-6630
 URL: https://issues.apache.org/jira/browse/SPARK-6630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Svend Vanderveken
Priority: Minor

 The method setIfMissing() in SparkConf is currently systematically evaluating 
 the right hand side of the assignment even if not used. This leads to 
 unnecessary computation, like in the case of 
 {code}
   conf.setIfMissing(spark.driver.host, Utils.localHostName())
 {code}
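 A hedged sketch of the idea (not Spark's actual signature; as the discussion on this ticket notes, 
 changing the parameter to by-name has binary-compatibility implications):
 {code:scala}
 class LazyConf {
   private val settings = scala.collection.mutable.Map.empty[String, String]

   // `value` is a by-name parameter, so the right-hand side is only evaluated
   // when the key is actually missing.
   def setIfMissing(key: String, value: => String): this.type = {
     if (!settings.contains(key)) settings(key) = value
     this
   }
 }
 {code}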






[jira] [Updated] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6673:
-
Target Version/s: 1.4.0  (was: 1.3.1, 1.4.0)

 spark-shell.cmd can't start even when spark was built in Windows
 

 Key: SPARK-6673
 URL: https://issues.apache.org/jira/browse/SPARK-6673
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.3.0
Reporter: Masayoshi TSUZUKI
Assignee: Masayoshi TSUZUKI
Priority: Blocker
 Fix For: 1.4.0


 spark-shell.cmd can't start.
 {code}
 bin\spark-shell.cmd --master local
 {code}
 will get
 {code}
 Failed to find Spark assembly JAR.
 You need to build Spark before running this program.
 {code}
 even when we have built spark.
 This is because of the lack of the environment variable {{SPARK_SCALA_VERSION}}, which 
 is used in {{spark-class2.cmd}}.
 In the Linux scripts, this value is set to {{2.10}} or {{2.11}} by default in 
 {{load-spark-env.sh}}, but there is no equivalent script on Windows.
 As a workaround, executing
 {code}
 set SPARK_SCALA_VERSION=2.10
 {code}
 before running spark-shell.cmd lets it start successfully.






[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-06 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480983#comment-14480983
 ] 

Yu Ishikawa commented on SPARK-6682:


I got it. I think the only way to realize an automatic mechanism is to execute 
the builder methods in Scala/Java from Python. That is, we would make a wrapper 
mechanism for the machine learning algorithms like Python's 
`JavaModelWrapper`. However, I don't think that is a very good idea, 
because of the readability of the code and the documentation.

- Pros
-- We don't need to implement builder methods in Python once we implement them 
in Scala/Java.
- Cons
-- Python documentation for the builder methods is not generated, because they 
are not implemented in Python.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]






[jira] [Commented] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480988#comment-14480988
 ] 

Lianhui Wang commented on SPARK-6700:
-

I do not think this is related to SPARK-6506 because YarnClusterSuite sets 
SPARK_HOME. I just ran the YarnClusterSuite test, and the Python application 
test in YarnClusterSuite passed. [~davies], can you share your 
unit-test.log or appMaster.log?

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   at org.scalatest.Suite$class.run(Suite.scala:1424)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at 

[jira] [Resolved] (SPARK-6673) spark-shell.cmd can't start even when spark was built in Windows

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6673.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5328
[https://github.com/apache/spark/pull/5328]

 spark-shell.cmd can't start even when spark was built in Windows
 

 Key: SPARK-6673
 URL: https://issues.apache.org/jira/browse/SPARK-6673
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.3.0
Reporter: Masayoshi TSUZUKI
Assignee: Masayoshi TSUZUKI
Priority: Blocker
 Fix For: 1.4.0


 spark-shell.cmd can't start.
 {code}
 bin\spark-shell.cmd --master local
 {code}
 will get
 {code}
 Failed to find Spark assembly JAR.
 You need to build Spark before running this program.
 {code}
 even when we have built spark.
 This is because of the lack of the environment variable {{SPARK_SCALA_VERSION}}, which 
 is used in {{spark-class2.cmd}}.
 In the Linux scripts, this value is set to {{2.10}} or {{2.11}} by default in 
 {{load-spark-env.sh}}, but there is no equivalent script on Windows.
 As a workaround, executing
 {code}
 set SPARK_SCALA_VERSION=2.10
 {code}
 before running spark-shell.cmd lets it start successfully.






[jira] [Comment Edited] (SPARK-6700) flaky test: run Python application in yarn-cluster mode

2015-04-06 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14480988#comment-14480988
 ] 

Lianhui Wang edited comment on SPARK-6700 at 4/6/15 6:49 AM:
-

I do not think this is related to SPARK-6506 because YarnClusterSuite sets 
SPARK_HOME. I just ran the YarnClusterSuite test, and the Python application 
test in YarnClusterSuite passed. [~davies], can you share your 
unit-test.log or appMaster.log? In addition, I think you can try again, because 
there may be other errors causing the failure. 


was (Author: lianhuiwang):
I do not think this is related to SPARK-6506 because YarnClusterSuite sets 
SPARK_HOME. I just ran the YarnClusterSuite test, and the Python application 
test in YarnClusterSuite passed. [~davies], can you share your 
unit-test.log or appMaster.log?

 flaky test: run Python application in yarn-cluster mode 
 

 Key: SPARK-6700
 URL: https://issues.apache.org/jira/browse/SPARK-6700
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Davies Liu
Assignee: Lianhui Wang
Priority: Critical
  Labels: test, yarn

 org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in 
 yarn-cluster mode
 Failing for the past 1 build (Since Failed#2025 )
 Took 12 sec.
 Error Message
 {code}
 Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
 Stacktrace
 sbt.ForkMain$ForkError: Process 
 List(/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.3/label/centos/bin/spark-submit,
  --master, yarn-cluster, --num-executors, 1, --properties-file, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/spark3554401802242467930.properties,
  --py-files, /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test2.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/test.py, 
 /tmp/spark-451f65e7-8e13-404f-ae7a-12a0d0394f09/result8930129095246825990.tmp)
  exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:1122)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite.org$apache$spark$deploy$yarn$YarnClusterSuite$$runSpark(YarnClusterSuite.scala:259)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply$mcV$sp(YarnClusterSuite.scala:160)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.apache.spark.deploy.yarn.YarnClusterSuite$$anonfun$4.apply(YarnClusterSuite.scala:146)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
   

[jira] [Resolved] (SPARK-6687) In the hadoop 0.23 profile, hadoop pulls in an older version of netty which conflicts with akka's netty

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6687.
--
Resolution: Not A Problem

I'm not sure what the problem is here, so closing until there's any follow up.

 In the hadoop 0.23 profile, hadoop pulls in an older version of netty which 
 conflicts with akka's netty 
 

 Key: SPARK-6687
 URL: https://issues.apache.org/jira/browse/SPARK-6687
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Sai Nishanth Parepally

 Excerpt from {{mvn -Dverbose dependency:tree}} for spark-core; note the 
 org.jboss.netty:netty dependency:
 [INFO] |  |  +- 
 org.apache.hadoop:hadoop-mapreduce-client-app:jar:0.23.10:compile
 [INFO] |  |  |  +- 
 org.apache.hadoop:hadoop-mapreduce-client-common:jar:0.23.10:compile
 [INFO] |  |  |  |  +- 
 (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for 
 duplicate)
 [INFO] |  |  |  |  +- 
 (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted 
 for duplicate)
 [INFO] |  |  |  |  +- 
 org.apache.hadoop:hadoop-yarn-server-common:jar:0.23.10:compile
 [INFO] |  |  |  |  |  +- 
 (org.apache.hadoop:hadoop-yarn-common:jar:0.23.10:compile - omitted for 
 duplicate)
 [INFO] |  |  |  |  |  +- (org.apache.zookeeper:zookeeper:jar:3.4.5:compile - 
 version managed from 3.4.2; omitted for duplicate)
 [INFO] |  |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - 
 version managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  |  +- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  |  +- 
 (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - omitted for duplicate)
 [INFO] |  |  |  |  |  +- (commons-io:commons-io:jar:2.1:compile - omitted for 
 duplicate)
 [INFO] |  |  |  |  |  +- (com.google.inject:guice:jar:3.0:compile - omitted 
 for duplicate)
 [INFO] |  |  |  |  |  +- 
 (com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.8:compile
  - omitted for duplicate)
 [INFO] |  |  |  |  |  +- (com.sun.jersey:jersey-server:jar:1.8:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  |  \- 
 (com.sun.jersey.contribs:jersey-guice:jar:1.8:compile - omitted for duplicate)
 [INFO] |  |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
 omitted for duplicate)
 [INFO] |  |  |  +- 
 org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:0.23.10:compile
 [INFO] |  |  |  |  +- 
 (org.apache.hadoop:hadoop-mapreduce-client-core:jar:0.23.10:compile - omitted 
 for duplicate)
 [INFO] |  |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - 
 omitted for duplicate)
 [INFO] |  |  |  |  \- (org.jboss.netty:netty:jar:3.2.4.Final:compile - 
 omitted for duplicate)
 [INFO] |  |  |  +- (com.google.protobuf:protobuf-java:jar:2.4.0a:compile - 
 omitted for duplicate)
 [INFO] |  |  |  +- (org.slf4j:slf4j-api:jar:1.7.10:compile - version managed 
 from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  +- (org.slf4j:slf4j-log4j12:jar:1.7.10:compile - version 
 managed from 1.6.1; omitted for duplicate)
 [INFO] |  |  |  +- (org.apache.hadoop:hadoop-hdfs:jar:0.23.10:compile - 
 omitted for duplicate)
 [INFO] |  |  |  \- org.jboss.netty:netty:jar:3.2.4.Final:compile






[jira] [Commented] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing

2015-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481064#comment-14481064
 ] 

Sean Owen commented on SPARK-6630:
--

Yeah, because the second argument becomes a function producing a String rather 
than a String itself. Code compiled against older versions of Spark is expected 
to run as far as possible on newer ones, and that old code would not find the 
String method. We could add an overload, but then I am not sure what happens to 
current code; I think it would continue to bind to the String overload, 
defeating the purpose.
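
To illustrate the concern, a minimal sketch (hypothetical class names, not 
Spark's actual source) of why switching the parameter to by-name breaks 
previously compiled callers:

{code}
// Old signature: the value is evaluated eagerly at the call site.
class OldConf {
  def setIfMissing(key: String, value: String): OldConf = this
}

// Proposed signature: the value becomes a by-name parameter, which compiles
// to a Function0[String] argument on the JVM.
class NewConf {
  def setIfMissing(key: String, value: => String): NewConf = this
}

// A jar compiled against OldConf links to setIfMissing(String, String);
// NewConf only exposes setIfMissing(String, Function0) after erasure, so an
// old call site fails at runtime with a NoSuchMethodError.
{code}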

 SparkConf.setIfMissing should only evaluate the assigned value if indeed 
 missing
 

 Key: SPARK-6630
 URL: https://issues.apache.org/jira/browse/SPARK-6630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Svend Vanderveken
Priority: Minor

 The method setIfMissing() in SparkConf currently always evaluates the 
 right-hand side of the assignment, even when the value is not used. This leads 
 to unnecessary computation, as in:
 {code}
   conf.setIfMissing("spark.driver.host", Utils.localHostName())
 {code}






[jira] [Created] (SPARK-6719) Update spark.apache.org/mllib page to 1.3

2015-04-06 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6719:


 Summary: Update spark.apache.org/mllib page to 1.3
 Key: SPARK-6719
 URL: https://issues.apache.org/jira/browse/SPARK-6719
 Project: Spark
  Issue Type: Task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


The current web page is outdated.






[jira] [Updated] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6569:
-
  Priority: Trivial  (was: Minor)
  Assignee: Platon Potapov
Issue Type: Improvement  (was: Bug)

 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Assignee: Platon Potapov
Priority: Trivial
 Fix For: 1.4.0


 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the part.fromOffset placeholder is not correctly substituted with a value 
 (see the sketch below)
 * does the condition really warrant a warning being logged?
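
 A self-contained sketch of the interpolation slip behind the first bullet 
 (simplified; not the actual KafkaRDD logging code):
 {code}
 object InterpolationDemo extends App {
   val fromOffset = 42L
   // Without the s prefix, the placeholder is printed literally, exactly as in
   // the warnings quoted above.
   println("Beginning offset ${fromOffset} is the same as ending offset, skipping")
   // With the s interpolator, the value is substituted.
   println(s"Beginning offset ${fromOffset} is the same as ending offset, skipping")
 }
 {code}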






[jira] [Resolved] (SPARK-6569) Kafka directInputStream logs what appear to be incorrect warnings

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6569.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5366
[https://github.com/apache/spark/pull/5366]

 Kafka directInputStream logs what appear to be incorrect warnings
 -

 Key: SPARK-6569
 URL: https://issues.apache.org/jira/browse/SPARK-6569
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
 Environment: Spark 1.3.0
Reporter: Platon Potapov
Priority: Minor
 Fix For: 1.4.0


 During what appears to be normal operation of streaming from a Kafka topic, 
 the following log records are observed, logged periodically:
 {code}
 [Stage 391:==  (3 + 0) / 
 4]
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 2015-03-27 12:49:54 WARN KafkaRDD: Beginning offset ${part.fromOffset} is the 
 same as ending offset skipping raw 0
 {code}
 * the part.fromOffset placeholder is not correctly substituted with a value
 * does the condition really warrant a warning being logged?






[jira] [Resolved] (SPARK-6630) SparkConf.setIfMissing should only evaluate the assigned value if indeed missing

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6630.
--
Resolution: Won't Fix

The idea was good, but it probably can't be reconciled with binary 
compatibility at this point without significantly more change, so closing. If 
there's a particularly expensive computation we want to avoid, we can fix that 
directly by checking whether the property exists before computing and setting a 
new value.
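
A minimal sketch of that guard (assuming only SparkConf's public contains/set 
methods; computeLocalHostName is a hypothetical stand-in for the expensive 
computation):

{code}
import org.apache.spark.SparkConf

def computeLocalHostName(): String =
  java.net.InetAddress.getLocalHost.getHostName

val conf = new SparkConf()
// Check for the property explicitly instead of relying on setIfMissing to
// defer evaluation of the value.
if (!conf.contains("spark.driver.host")) {
  conf.set("spark.driver.host", computeLocalHostName())
}
{code}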

 SparkConf.setIfMissing should only evaluate the assigned value if indeed 
 missing
 

 Key: SPARK-6630
 URL: https://issues.apache.org/jira/browse/SPARK-6630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Svend Vanderveken
Priority: Minor

 The method setIfMissing() in SparkConf currently always evaluates the 
 right-hand side of the assignment, even when the value is not used. This leads 
 to unnecessary computation, as in:
 {code}
   conf.setIfMissing("spark.driver.host", Utils.localHostName())
 {code}






[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481132#comment-14481132
 ] 

Sean Owen commented on SPARK-5261:
--

In the new code you pasted, I don't see a difference between the two runs. Is 
the point that the result isn't deterministic even with a fixed seed, and that 
it might be sensitive to the order in which it encounters the words?

 In some cases ,The value of word's vector representation is too big
 ---

 Key: SPARK-5261
 URL: https://issues.apache.org/jira/browse/SPARK-5261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Guoqiang Li

 Get data:
 {code:none}
 normalize_text() {
   awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
 s/'/ ' /g -e s/“/\/g -e s/”/\/g \
   -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ 
 ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
   -e 's/«/ /g' | tr 0-9  
 }
 wget 
 http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
 gzip -d news.2013.en.shuffled.gz
 normalize_text  news.2013.en.shuffled  data.txt
 {code}
 {code:none}
 import org.apache.spark.mllib.feature.Word2Vec
 val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(5)
 val model = word2Vec.fit(text)
 model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
 model.getVectors.size
 = 
 res1: Float = 375059.84
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(5)
 val model = word2Vec.fit(text)
 model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
 model.getVectors.size
 = 
 res3: Float = 1661285.2 
 {code}
 The average absolute value of the word's vector representation is 60731.8
 {code}
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(1)
 {code}
 The average  absolute value of the word's vector representation is 0.13889






[jira] [Assigned] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6720:
---

Assignee: (was: Apache Spark)

 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
 --

 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
 Fix For: 1.4.0


 Implement correct normL1 and normL2 test.
 continuation: https://github.com/apache/spark/pull/5359






[jira] [Commented] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481176#comment-14481176
 ] 

Apache Spark commented on SPARK-6720:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/5374

 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
 --

 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
 Fix For: 1.4.0


 Implement correct normL1 and normL2 test.
 continuation: https://github.com/apache/spark/pull/5359






[jira] [Assigned] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6720:
---

Assignee: Apache Spark

 PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
 --

 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Assignee: Apache Spark
Priority: Minor
 Fix For: 1.4.0


 Implement correct normL1 and normL2 test.
 continuation: https://github.com/apache/spark/pull/5359






[jira] [Commented] (SPARK-2960) Spark executables fail to start via symlinks

2015-04-06 Thread Danil Mironov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481228#comment-14481228
 ] 

Danil Mironov commented on SPARK-2960:
--

This has now formed a loop of three tickets (SPARK-2960, SPARK-3482 and 
SPARK-4162), all three resolved as duplicates; two PRs (#1875 and #2386) are 
closed but not merged. Apparently this issue isn't progressing at all.

Is there anything that can be done to break through?

I could draft a new PR; can this ticket be re-opened?

 Spark executables fail to start via symlinks
 

 Key: SPARK-2960
 URL: https://issues.apache.org/jira/browse/SPARK-2960
 Project: Spark
  Issue Type: Bug
Reporter: Shay Rojansky
Priority: Minor

 The current scripts (e.g. pyspark) fail to run when they are executed via 
 symlinks. A common Linux scenario would be to have Spark installed somewhere 
 (e.g. /opt) and have a symlink to it in /usr/bin.






[jira] [Created] (SPARK-6720) PySpark MultivariateStatisticalSummary unit test for normL1 and normL2

2015-04-06 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-6720:
-

 Summary: PySpark MultivariateStatisticalSummary unit test for 
normL1 and normL2
 Key: SPARK-6720
 URL: https://issues.apache.org/jira/browse/SPARK-6720
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Kai Sasaki
Priority: Minor
 Fix For: 1.4.0


Implement correct normL1 and normL2 test.

continuation: https://github.com/apache/spark/pull/5359
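
For reference, the statistics under test look like this on the Scala side (a 
small illustrative snippet for spark-shell, where sc is the SparkContext; the 
ticket itself targets the PySpark tests):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Two rows, two columns.
val data = sc.parallelize(Seq(Vectors.dense(1.0, -2.0), Vectors.dense(3.0, 4.0)))
val summary = Statistics.colStats(data)
println(summary.normL1)  // column-wise sum of absolute values: [4.0, 6.0]
println(summary.normL2)  // column-wise Euclidean norm: [sqrt(10.0), sqrt(20.0)]
{code}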






[jira] [Assigned] (SPARK-2991) RDD transforms for scan and scanLeft

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2991:
---

Assignee: Apache Spark  (was: Erik Erlandson)

 RDD transforms for scan and scanLeft 
 -

 Key: SPARK-2991
 URL: https://issues.apache.org/jira/browse/SPARK-2991
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Erik Erlandson
Assignee: Apache Spark
Priority: Minor
  Labels: features

 Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) 
 and scanLeft(z)(f) (sequential prefix scan)
 Discussion of a scanLeft implementation:
 http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/
 Discussion of scan:
 http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/
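
 A minimal sketch of the sequential-prefix idea (not the cascade-RDD 
 implementation from the linked posts); it assumes f is associative with z as 
 its identity, and collects one folded value per partition on the driver:
 {code}
 import scala.reflect.ClassTag
 import org.apache.spark.rdd.RDD

 def rddScanLeft[T: ClassTag](rdd: RDD[T], z: T)(f: (T, T) => T): RDD[T] = {
   // Fold each partition locally, then prefix-scan those per-partition totals
   // on the driver to obtain the starting value for every partition.
   val totals  = rdd.mapPartitions(it => Iterator(it.foldLeft(z)(f))).collect()
   val offsets = rdd.sparkContext.broadcast(totals.scanLeft(z)(f))
   rdd.mapPartitionsWithIndex { (i, it) =>
     val scanned = it.scanLeft(offsets.value(i))(f)
     if (i == 0) scanned else scanned.drop(1)  // drop duplicated boundary values
   }
 }
 {code}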






[jira] [Assigned] (SPARK-2991) RDD transforms for scan and scanLeft

2015-04-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2991:
---

Assignee: Erik Erlandson  (was: Apache Spark)

 RDD transforms for scan and scanLeft 
 -

 Key: SPARK-2991
 URL: https://issues.apache.org/jira/browse/SPARK-2991
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Erik Erlandson
Assignee: Erik Erlandson
Priority: Minor
  Labels: features

 Provide RDD transforms analogous to Scala scan(z)(f) (parallel prefix scan) 
 and scanLeft(z)(f) (sequential prefix scan)
 Discussion of a scanLeft implementation:
 http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with-cascade-rdds/
 Discussion of scan:
 http://erikerlandson.github.io/blog/2014/08/12/implementing-parallel-prefix-scan-as-a-spark-rdd-transform/






[jira] [Updated] (SPARK-6205) UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError

2015-04-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6205:
-
Fix Version/s: 1.3.2

 UISeleniumSuite fails for Hadoop 2.x test with NoClassDefFoundError
 ---

 Key: SPARK-6205
 URL: https://issues.apache.org/jira/browse/SPARK-6205
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.3.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.3.2, 1.4.0


 {code}
 mvn -DskipTests -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 clean 
 install
 mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.6.0 test 
 -DwildcardSuites=org.apache.spark.ui.UISeleniumSuite -Dtest=none -pl core/ 
 {code}
 will produce:
 {code}
 UISeleniumSuite:
 *** RUN ABORTED ***
   java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
   ...
 {code}
 It doesn't seem to happen without the various profiles set above.
 The fix is simple, although it sounds weird; Selenium's dependency on 
 {{xml-apis:xml-apis}} must be manually included in core's test dependencies. 
 This probably has something to do with Hadoop 2 vs 1 dependency changes and 
 the fact that Maven test deps aren't transitive, AFAIK.
 PR coming...






[jira] [Commented] (SPARK-3702) Standardize MLlib classes for learners, models

2015-04-06 Thread Peter Rudenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481342#comment-14481342
 ] 

Peter Rudenko commented on SPARK-3702:
--

For tree-based algorithms, I'm curious whether there would be a performance 
benefit from passing DataFrame columns directly rather than a single column of 
vector type. E.g.:

{code}
class GBT extends Estimator with HasInputCols

val model = new GBT().setInputCols("col1", "col2", "col3", ...)
{code}





 Standardize MLlib classes for learners, models
 --

 Key: SPARK-3702
 URL: https://issues.apache.org/jira/browse/SPARK-3702
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Blocker

 Summary: Create a class hierarchy for learning algorithms and the models 
 those algorithms produce.
 This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
 of subtasks).  See the requires links below for subtasks.
 Goals:
 * give intuitive structure to API, both for developers and for generated 
 documentation
 * support meta-algorithms (e.g., boosting)
 * support generic functionality (e.g., evaluation)
 * reduce code duplication across classes
 [Design doc for class hierarchy | 
 https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]






[jira] [Commented] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream

2015-04-06 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481266#comment-14481266
 ] 

Cody Koeninger commented on SPARK-6431:
---

I think this got mis-diagnosed on the mailing list, sorry for the confusion.

The only way I've been able to reproduce that exception is by trying to start a 
stream for a topic that doesn't exist at all.  Alberto, did you actually run 
kafka-topics.sh --create before starting the job, or in some other way create 
the topic?  Pretty sure what happened here is that your topic didn't exist the 
first time you ran the job.  Your brokers were set to auto-create topics, so it 
did exist the next time you ran the job.  Putting a message into the topic 
didn't have anything to do with it.

Here's why I think that's what happened.  The following console session is an 
example, where the topic empty existed prior to starting the console but had no 
messages.  Topic hasonemessage existed and had one message in it.  Topic 
doesntexistyet didn't exist at the beginning of the console session.

The metadata apis return the same info for existing-but-empty topics as they do 
for topics with messages in them:

scala> kc.getPartitions(Set("empty")).right
res0: 
scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]]
 = RightProjection(Right(
Set([empty,0], [empty,1])))

scala> kc.getPartitions(Set("hasonemessage")).right
res1: 
scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]]
 = RightProjection(Right(Set([hasonemessage,0], [hasonemessage,1])))


Leader offsets are both 0 for the empty topic, as you'd expect:

scala> kc.getLatestLeaderOffsets(kc.getPartitions(Set("empty")).right.get)
res5: 
Either[org.apache.spark.streaming.kafka.KafkaCluster.Err,Map[kafka.common.TopicAndPartition,org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset]]
 = Right(Map([empty,1] -> LeaderOffset(localhost,9094,0), [empty,0] -> 
LeaderOffset(localhost,9093,0)))

And one of the leader offsets is 1 for the topic with one message:

scala> 
kc.getLatestLeaderOffsets(kc.getPartitions(Set("hasonemessage")).right.get)
res6: 
Either[org.apache.spark.streaming.kafka.KafkaCluster.Err,Map[kafka.common.TopicAndPartition,org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset]]
 = Right(Map([hasonemessage,0] -> LeaderOffset(localhost,9092,1), 
[hasonemessage,1] -> LeaderOffset(localhost,9093,0)))


The first time a metadata request is made against the non-existing topic, it 
returns empty:

kc.getPartitions(Set("doesntexistyet")).right
res2: 
scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]]
 = RightProjection(Right(Set()))


But if your brokers are configured with auto.create.topics.enable set to true, 
that metadata request alone is enough to trigger creation of the topic.  
Requesting it again shows that the topic has been created:

scala> kc.getPartitions(Set("doesntexistyet")).right
res3: 
scala.util.Either.RightProjection[org.apache.spark.streaming.kafka.KafkaCluster.Err,Set[kafka.common.TopicAndPartition]]
 = RightProjection(Right(Set([doesntexistyet,0], [doesntexistyet,1])))


If you don't think that explains what happened, please let me know if you have 
a way of reproducing that exception against an existing-but-empty topic, 
because I can't.

As far as what to do about this, my instinct is to just improve the error 
handling for the getPartitions call.  If the topic doesn't exist yet, it 
shouldn't return an empty set; it should return an error.
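
A minimal sketch of that error-handling idea (a hypothetical helper around the 
KafkaCluster calls shown in the console session above, purely illustrative 
since KafkaCluster isn't public API here; not the eventual fix):

{code}
import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.KafkaCluster

def partitionsOrError(
    kc: KafkaCluster,
    topics: Set[String]): Either[String, Set[TopicAndPartition]] = {
  kc.getPartitions(topics) match {
    // Metadata came back and the topics really have partitions.
    case Right(parts) if parts.nonEmpty => Right(parts)
    // Metadata came back empty: most likely the topics don't exist (yet).
    case Right(_) => Left(s"No partitions found for topics $topics; do they exist?")
    // Broker errors while fetching metadata.
    case Left(errs) => Left(errs.mkString(", "))
  }
}
{code}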


 Couldn't find leader offsets exception when creating KafkaDirectStream
 --

 Key: SPARK-6431
 URL: https://issues.apache.org/jira/browse/SPARK-6431
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Alberto

 When I try to create an InputDStream using the createDirectStream method of 
 the KafkaUtils class and the Kafka topic does not have any messages yet, I am 
 getting the following error:
 org.apache.spark.SparkException: Couldn't find leader offsets for Set()
 org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't 
 find leader offsets for Set()
   at 
 org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413)
 If I put a message in the topic before creating the DirectStream everything 
 works fine.






[jira] [Comment Edited] (SPARK-3702) Standardize MLlib classes for learners, models

2015-04-06 Thread Peter Rudenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481342#comment-14481342
 ] 

Peter Rudenko edited comment on SPARK-3702 at 4/6/15 4:06 PM:
--

For tree-based algorithms, I'm curious whether there would be a performance 
benefit (assuming a reimplementation of the decision tree) from passing 
DataFrame columns directly rather than a single column of vector type. E.g.:

{code}
class GBT extends Estimator with HasInputCols

val model = new GBT().setInputCols("col1", "col2", "col3", ...)
{code}

and split the dataset using the DataFrame API.




was (Author: prudenko):
For trees based algorithms curious whether there would be performance benefit 
by passing directly Dataframe columns rather than single column with vector 
type. E.g.:

{code}
class GBT extends Estimator with HasInputCols

val model = new GBT.setInputCols(col1,col2, col3, ...)
{code}





 Standardize MLlib classes for learners, models
 --

 Key: SPARK-3702
 URL: https://issues.apache.org/jira/browse/SPARK-3702
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Blocker

 Summary: Create a class hierarchy for learning algorithms and the models 
 those algorithms produce.
 This is a super-task of several sub-tasks (but JIRA does not allow subtasks 
 of subtasks).  See the requires links below for subtasks.
 Goals:
 * give intuitive structure to API, both for developers and for generated 
 documentation
 * support meta-algorithms (e.g., boosting)
 * support generic functionality (e.g., evaluation)
 * reduce code duplication across classes
 [Design doc for class hierarchy | 
 https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]






[jira] [Updated] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-5261:
---
Description: 
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
s/'/ ' /g -e s/“/\/g -e s/”/\/g \
  -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9  
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text  news.2013.en.shuffled  data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res3: Float = 1661285.2 
{code}
The average absolute value of the word's vector representation is 60731.8

{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(1)
{code}
The average  absolute value of the word's vector representation is 0.13889

  was:
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
s/'/ ' /g -e s/“/\/g -e s/”/\/g \
  -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9  
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text  news.2013.en.shuffled  data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res3: Float = 1661285.2 
{code}
The average absolute value of the word's vector representation is 60731.8

{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(1)
{code}
The average  absolute value of the word's vector representation is 0.13889


 In some cases ,The value of word's vector representation is too big
 ---

 Key: SPARK-5261
 URL: https://issues.apache.org/jira/browse/SPARK-5261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Guoqiang Li

 Get data:
 {code:none}
 normalize_text() {
   awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
 s/'/ ' /g -e s/“/\/g -e s/”/\/g \
   -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ 
 ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
   -e 's/«/ /g' | tr 0-9  
 }
 wget 
 http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
 gzip -d news.2013.en.shuffled.gz
 normalize_text  news.2013.en.shuffled  data.txt
 {code}
 {code:none}
 import org.apache.spark.mllib.feature.Word2Vec
 val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(100)
 val model = word2Vec.fit(text)
 model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
 model.getVectors.size
 = 
 res1: Float = 375059.84
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(5)

[jira] [Updated] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-5261:
---
Description: 
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
s/'/ ' /g -e s/“/\/g -e s/”/\/g \
  -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9  
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text  news.2013.en.shuffled  data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res3: Float = 1661285.2 


 val word2Vec = new Word2Vec()
 word2Vec.
setVectorSize(100).
setSeed(42L).
setNumIterations(5).
setNumPartitions(1)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
 0.13889
{code}

  was:
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
s/'/ ' /g -e s/“/\/g -e s/”/\/g \
  -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9  
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text  news.2013.en.shuffled  data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res3: Float = 1661285.2 


 val word2Vec = new Word2Vec()
 word2Vec.
setVectorSize(100).
setSeed(42L).
setNumIterations(5).
setNumPartitions(1)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
 0.13889
{code}


 In some cases ,The value of word's vector representation is too big
 ---

 Key: SPARK-5261
 URL: https://issues.apache.org/jira/browse/SPARK-5261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Guoqiang Li

 Get data:
 {code:none}
 normalize_text() {
   awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
 s/'/ ' /g -e s/“/\/g -e s/”/\/g \
   -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ 
 ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
   -e 's/«/ /g' | tr 0-9  
 }
 wget 
 http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
 gzip -d news.2013.en.shuffled.gz
 normalize_text  news.2013.en.shuffled  data.txt
 {code}
 {code:none}
 import org.apache.spark.mllib.feature.Word2Vec
 val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(5)
 val model = word2Vec.fit(text)
 model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
 model.getVectors.size
 = 
 res1: Float = 375059.84
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(100)
 val model = word2Vec.fit(text)
 model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 

[jira] [Commented] (SPARK-6577) SparseMatrix should be supported in PySpark

2015-04-06 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481395#comment-14481395
 ] 

Manoj Kumar commented on SPARK-6577:


Let us please take the discussion to the Pull Request. Thanks!

 SparseMatrix should be supported in PySpark
 ---

 Key: SPARK-6577
 URL: https://issues.apache.org/jira/browse/SPARK-6577
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Assignee: Manoj Kumar








[jira] [Commented] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14481378#comment-14481378
 ] 

Guoqiang Li commented on SPARK-5261:


I'm sorry, the "after" one's minCount is 100.

 In some cases ,The value of word's vector representation is too big
 ---

 Key: SPARK-5261
 URL: https://issues.apache.org/jira/browse/SPARK-5261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Guoqiang Li

 Get data:
 {code:none}
 normalize_text() {
   awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
 s/'/ ' /g -e s/“/\/g -e s/”/\/g \
   -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ 
 ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
   -e 's/«/ /g' | tr 0-9  
 }
 wget 
 http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
 gzip -d news.2013.en.shuffled.gz
 normalize_text  news.2013.en.shuffled  data.txt
 {code}
 {code:none}
 import org.apache.spark.mllib.feature.Word2Vec
 val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(100)
 val model = word2Vec.fit(text)
 model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
 model.getVectors.size
 = 
 res1: Float = 375059.84
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(5)
 val model = word2Vec.fit(text)
 model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
 model.getVectors.size
 = 
 res3: Float = 1661285.2 
 {code}
 The average absolute value of the word's vector representation is 60731.8
 {code}
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(1)
 {code}
 The average  absolute value of the word's vector representation is 0.13889






[jira] [Updated] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-04-06 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-5261:
---
Description: 
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
s/'/ ' /g -e s/“/\/g -e s/”/\/g \
  -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9  
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text  news.2013.en.shuffled  data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res3: Float = 1661285.2 


 val word2Vec = new Word2Vec()
 word2Vec.
setVectorSize(100).
setSeed(42L).
setNumIterations(5).
setNumPartitions(1)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
 0.13889
{code}

  was:
Get data:
{code:none}
normalize_text() {
  awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
s/'/ ' /g -e s/“/\/g -e s/”/\/g \
  -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ ( 
/g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9  
}
wget 
http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text  news.2013.en.shuffled  data.txt
{code}
{code:none}
import org.apache.spark.mllib.feature.Word2Vec

val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(100)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res1: Float = 375059.84


val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36).
  setMinCount(5)

val model = word2Vec.fit(text)
model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
model.getVectors.size
= 
res3: Float = 1661285.2 
{code}
The average absolute value of the word's vector representation is 60731.8

{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(1)
{code}
The average  absolute value of the word's vector representation is 0.13889


 In some cases ,The value of word's vector representation is too big
 ---

 Key: SPARK-5261
 URL: https://issues.apache.org/jira/browse/SPARK-5261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Guoqiang Li

 Get data:
 {code:none}
 normalize_text() {
   awk '{print tolower($0);}' | sed -e s/’/'/g -e s/′/'/g -e s/''/ /g -e 
 s/'/ ' /g -e s/“/\/g -e s/”/\/g \
   -e 's//  /g' -e 's/\./ \. /g' -e 's/br \// /g' -e 's/, / , /g' -e 's/(/ 
 ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
   -e 's/«/ /g' | tr 0-9  
 }
 wget 
 http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
 gzip -d news.2013.en.shuffled.gz
 normalize_text  news.2013.en.shuffled  data.txt
 {code}
 {code:none}
 import org.apache.spark.mllib.feature.Word2Vec
 val text = sc.textFile(dataPath).map { t => t.split(" ").toIterable }
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(100)
 val model = word2Vec.fit(text)
 model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
 model.getVectors.size
 = 
 res1: Float = 375059.84
 val word2Vec = new Word2Vec()
 word2Vec.
   setVectorSize(100).
   setSeed(42L).
   setNumIterations(5).
   setNumPartitions(36).
   setMinCount(5)
 val model = word2Vec.fit(text)