[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-07-02 Thread Hector Yee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049659#comment-14049659
 ] 

Hector Yee commented on SPARK-1547:
---

Just generic log loss with L1 regularization should suffice. Most of the work 
is in feature engineering anyway. There is no hurry at all; I already have several 
implementations outside MLlib that I am using. It would just be convenient to 
have another implementation to compare against.
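
For concreteness, a minimal sketch of the per-example gradient of generic log loss with an L1 subgradient term, in plain Scala (illustrative only, not an MLlib API; labels are assumed to be 0/1 and lambda is the regularization strength):

{code}
import scala.math.{exp, signum}

// Gradient of log loss for one example, plus an L1 subgradient term.
def logLossGradient(weights: Array[Double], features: Array[Double],
                    label: Double, lambda: Double): Array[Double] = {
  val margin = (weights, features).zipped.map(_ * _).sum
  val p = 1.0 / (1.0 + exp(-margin))   // predicted probability of the positive class
  weights.indices.map { i =>
    (p - label) * features(i) + lambda * signum(weights(i))
  }.toArray
}
{code}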

 Add gradient boosting algorithm to MLlib
 

 Key: SPARK-1547
 URL: https://issues.apache.org/jira/browse/SPARK-1547
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Manish Amde
Assignee: Manish Amde

 This task requires adding the gradient boosting algorithm to Spark MLlib. The 
 implementation needs to adapt the gradient boosting algorithm to the scalable 
 tree implementation.
 The task involves:
 - Comparing the various tradeoffs and finalizing the algorithm before 
 implementation
 - Code implementation
 - Unit tests
 - Functional tests
 - Performance tests
 - Documentation



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1525) TaskSchedulerImpl should decrease availableCpus by spark.task.cpus not 1

2014-07-02 Thread YanTang Zhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YanTang Zhai closed SPARK-1525.
---

Resolution: Fixed

 TaskSchedulerImpl should decrease availableCpus by spark.task.cpus not 1
 

 Key: SPARK-1525
 URL: https://issues.apache.org/jira/browse/SPARK-1525
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: YanTang Zhai
Priority: Minor

 TaskSchedulerImpl always decreases availableCpus by 1 in the resourceOffers process, 
 even when spark.task.cpus is greater than 1, which causes too many tasks to be 
 scheduled on a node when spark.task.cpus is greater than 1.
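
As an illustration of the intended behavior (a self-contained sketch, not the actual TaskSchedulerImpl code): the scheduler should subtract spark.task.cpus per launched task, not a constant 1.

{code}
// Illustrative only: how many tasks fit on an offer when each task needs cpusPerTask cores.
object CpusPerTaskExample {
  def tasksThatFit(availableCpus: Int, cpusPerTask: Int): Int =
    availableCpus / cpusPerTask   // decrement by cpusPerTask, not by 1, per launched task

  def main(args: Array[String]): Unit = {
    // A 4-core offer with spark.task.cpus = 2 should accept only 2 tasks, not 4.
    println(tasksThatFit(availableCpus = 4, cpusPerTask = 2))  // 2
  }
}
{code}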



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-786) Clean up old work directories in standalone worker

2014-07-02 Thread anurag tangri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049671#comment-14049671
 ] 

anurag tangri commented on SPARK-786:
-

Hi,
We are also facing this issue.

Could somebody assign this ticket to me?

I would like to work on this.


Thanks,
Anurag Tangri

 Clean up old work directories in standalone worker
 --

 Key: SPARK-786
 URL: https://issues.apache.org/jira/browse/SPARK-786
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 0.7.2
Reporter: Matei Zaharia

 We should add a setting to clean old work directories after X days. 
 Otherwise, the directory gets filled forever with shuffle files and logs.
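
For illustration, the setting could look something like the properties below (the keys and defaults shown are assumptions, not an existing API at the time of this issue):

{noformat}
# Hypothetical standalone-worker cleanup settings (names illustrative)
spark.worker.cleanup.enabled     true      # enable periodic cleanup of old work dirs
spark.worker.cleanup.interval    1800      # seconds between cleanup checks
spark.worker.cleanup.appDataTtl  604800    # delete application work dirs older than 7 days
{noformat}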



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Eustache (JIRA)
Eustache created SPARK-2341:
---

 Summary: loadLibSVMFile doesn't handle regression datasets
 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Priority: Minor


Many datasets exist in LibSVM format for regression tasks [1] but currently the 
loadLibSVMFile primitive doesn't handle regression datasets.

More precisely, the LabelParser is either a MulticlassLabelParser or a 
BinaryLabelParser. What happens then is that the file is loaded in 
multiclass mode: each target value is interpreted as a class name!

The fix would be to write a RegressionLabelParser which converts target values 
to Double and plug it into the loadLibSVMFile routine.

[1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 
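
A minimal sketch of what such a parser could look like (assuming the LabelParser trait exposes a single parse(labelString: String): Double method, as the existing binary/multiclass parsers suggest; treat the exact trait shape and visibility as assumptions):

{code}
import org.apache.spark.mllib.util.LabelParser

// Hypothetical parser: interpret the label as a continuous regression target.
object RegressionLabelParser extends LabelParser {
  override def parse(labelString: String): Double = labelString.toDouble
}
{code}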



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2339:


Fix Version/s: 1.1.0

 SQL parser in sql-core is case sensitive, but a table alias is converted to 
 lower case when we create Subquery
 --

 Key: SPARK-2339
 URL: https://issues.apache.org/jira/browse/SPARK-2339
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Yin Huai
 Fix For: 1.1.0


 Reported by 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
 After we get the table from the catalog, because the table has an alias, we 
 temporarily insert a Subquery. Then, we convert the table alias to lower case 
 regardless of whether the parser is case sensitive.
 To see the issue ...
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Person(name: String, age: Int)
 val people = sc.textFile("examples/src/main/resources/people.txt")
   .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
 people.registerAsTable("people")
 sqlContext.sql("select PEOPLE.name from people PEOPLE")
 {code}
 The plan is ...
 {code}
 == Query Plan ==
 Project ['PEOPLE.name]
  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:176
 {code}
 You can find that PEOPLE.name is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)
Yijie Shen created SPARK-2342:
-

 Summary: Evaluation helper's output type doesn't conform to input 
type
 Key: SPARK-2342
 URL: https://issues.apache.org/jira/browse/SPARK-2342
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yijie Shen
Priority: Minor


In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala,
protected final def n2(i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any
is intended to do computations for numeric Add/Minus/Multiply.
Just as the comment suggests, those expressions are supposed to be in the same 
data type, and also the return type.
But in the code, function f is cast to the signature:
(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int
I think this is a typo and the correct signature should be:
(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yijie Shen updated SPARK-2342:
--

Description: 
In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
{code}protected final def n2(i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any{code}
is intended to do computations for numeric Add/Minus/Multiply.
Just as the comment suggests: {quote}Those expressions are supposed to be in 
the same data type, and also the return type.{quote}
But in the code, function f is cast to the signature:
{code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
I think this is a typo and the correct signature should be:
{code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}

  was:
In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
protected final def n2(i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any
is intended to do computations for numeric Add/Minus/Multiply.
Just as the comment suggests, those expressions are supposed to be in the same 
data type, and also the return type.
But in the code, function f is cast to the signature:
(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int
I think this is a typo and the correct signature should be:
(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType


 Evaluation helper's output type doesn't conform to input type
 -

 Key: SPARK-2342
 URL: https://issues.apache.org/jira/browse/SPARK-2342
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yijie Shen
Priority: Minor
  Labels: easyfix

 In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
 {code}protected final def n2(i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any{code}
 is intended to do computations for numeric Add/Minus/Multiply.
 Just as the comment suggests: {quote}Those expressions are supposed to be in 
 the same data type, and also the return type.{quote}
 But in the code, function f is cast to the signature:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
 I think this is a typo and the correct signature should be:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049732#comment-14049732
 ] 

Xiangrui Meng commented on SPARK-2341:
--

Just set `multiclass = true` to load double values.
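
For example (assuming the 1.0 MLUtils overload that takes a multiclass flag and an existing SparkContext `sc`):

{code}
import org.apache.spark.mllib.util.MLUtils

// Labels are parsed as raw double values instead of being mapped to 0/1 classes.
val data = MLUtils.loadLibSVMFile(sc, "data/regression.libsvm", multiclass = true)
{code}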

 loadLibSVMFile doesn't handle regression datasets
 -

 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Priority: Minor
  Labels: easyfix

 Many datasets exist in LibSVM format for regression tasks [1] but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. What happens then is that the file is loaded but in 
 multiclass mode : each target value is interpreted as a class name !
 The fix would be to write a RegressionLabelParser which converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Eustache (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049755#comment-14049755
 ] 

Eustache commented on SPARK-2341:
-

I see that LabelParser with multiclass=true works for the regression setting.

What I fail to understand is how this is related to multiclass. Is the naming appropriate?

In any case, shouldn't we provide a name that explicitly mentions regression?






 loadLibSVMFile doesn't handle regression datasets
 -

 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Priority: Minor
  Labels: easyfix

 Many datasets exist in LibSVM format for regression tasks [1] but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. What happens then is that the file is loaded but in 
 multiclass mode : each target value is interpreted as a class name !
 The fix would be to write a RegressionLabelParser which converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765
 ] 

Xiangrui Meng commented on SPARK-2341:
--

It is a little awkward to have both `regression` and `multiclass` as input 
arguments. I agree that a more accurate name would be `multiclassOrRegression`, 
but it is certainly too long. We tried to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, 
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. 
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed 
directly from the label string will be used as the label value.
{code}

It would be good if we could improve the documentation to make it clearer, but 
I don't feel it is necessary to change the API.


 loadLibSVMFile doesn't handle regression datasets
 -

 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Priority: Minor
  Labels: easyfix

 Many datasets exist in LibSVM format for regression tasks [1] but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. What happens then is that the file is loaded but in 
 multiclass mode : each target value is interpreted as a class name !
 The fix would be to write a RegressionLabelParser which converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765
 ] 

Xiangrui Meng edited comment on SPARK-2341 at 7/2/14 9:09 AM:
--

It is a little awkward to have both `regression` and `multiclass` as input 
arguments. I agree that a more accurate name would be `multiclassOrRegression` or 
`multiclassOrContinuous`, but either is certainly too long. We tried to make this 
clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, 
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. 
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed 
directly from the label string will be used as the label value.
{code}

It would be good if we could improve the documentation to make it clearer, but 
I don't feel it is necessary to change the API.



was (Author: mengxr):
It is a little awkward to have both `regression` and `multiclass` as input 
arguments. I agree that a correct name should be `multiclassOrRegression`. But 
it is certainly too long. We tried to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, 
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. 
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed 
directly from the label string will be used as the label value.
{code}

It would be good if we can improve the documentation to make it clearer. But 
for the API, I don't feel that it is necessary to change.


 loadLibSVMFile doesn't handle regression datasets
 -

 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Priority: Minor
  Labels: easyfix

 Many datasets exist in LibSVM format for regression tasks [1] but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. What happens then is that the file is loaded but in 
 multiclass mode : each target value is interpreted as a class name !
 The fix would be to write a RegressionLabelParser which converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1681) Handle hive support correctly in ./make-distribution.sh

2014-07-02 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1681:
---

Summary: Handle hive support correctly in ./make-distribution.sh  (was: 
Handle hive support correctly in ./make-distribution)

 Handle hive support correctly in ./make-distribution.sh
 ---

 Key: SPARK-1681
 URL: https://issues.apache.org/jira/browse/SPARK-1681
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 When Hive support is enabled we should copy the datanucleus jars to the 
 packaged distribution. The simplest way would be to create a lib_managed 
 folder in the final distribution so that the compute-classpath script 
 searches in exactly the same way whether or not it's a release.
 A slightly nicer solution is to put the jars inside of `/lib` and have some 
 fancier check for the jar location in the compute-classpath script.
 We should also document how to run Spark SQL on YARN when hive support is 
 enabled. In particular how to add the necessary jars to spark-submit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2306) BoundedPriorityQueue is private and not registered with Kryo

2014-07-02 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049818#comment-14049818
 ] 

Daniel Darabos commented on SPARK-2306:
---

You're the best, Ankit! Thanks!

 BoundedPriorityQueue is private and not registered with Kryo
 

 Key: SPARK-2306
 URL: https://issues.apache.org/jira/browse/SPARK-2306
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Daniel Darabos

 Because BoundedPriorityQueue is private and not registered with Kryo, RDD.top 
 cannot be used when using Kryo (the recommended configuration).
 Curiously, BoundedPriorityQueue is registered by GraphKryoRegistrator, but 
 that's the wrong registrator. (Is there one for Spark Core?)
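
As a hedged illustration of the workaround being discussed, an application-side registrator can register the class reflectively (whether users should have to do this for a private class is exactly the point of the issue):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Illustrative workaround: register the private class by name so that the
// BoundedPriorityQueue used internally by RDD.top can be serialized by Kryo.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(Class.forName("org.apache.spark.util.BoundedPriorityQueue"))
  }
}
{code}

It would then be wired in by setting spark.kryo.registrator to the registrator's fully qualified class name.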



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1884) Shark failed to start

2014-07-02 Thread Pete MacKinnon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049877#comment-14049877
 ] 

Pete MacKinnon commented on SPARK-1884:
---

This is due to the version of protobuf-java provided by Shark being older 
(2.4.1) than what's needed by Hadoop 2.4 (2.5.0). See SPARK-2338.

 Shark failed to start
 -

 Key: SPARK-1884
 URL: https://issues.apache.org/jira/browse/SPARK-1884
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.1
 Environment: ubuntu 14.04, spark 0.9.1, hive 0.13.0, hadoop 2.4.0 
 (stand alone), scala 2.11.0
Reporter: Wei Cui
Priority: Blocker

 Hadoop, Hive, and Spark work fine.
 When starting Shark, it fails with the following messages:
 Starting the Shark Command Line Client
 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.input.dir.recursive 
 is deprecated. Instead, use 
 mapreduce.input.fileinputformat.input.dir.recursive
 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.max.split.size is 
 deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size is 
 deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
 14/05/19 16:47:21 INFO Configuration.deprecation: 
 mapred.min.split.size.per.rack is deprecated. Instead, use 
 mapreduce.input.fileinputformat.split.minsize.per.rack
 14/05/19 16:47:21 INFO Configuration.deprecation: 
 mapred.min.split.size.per.node is deprecated. Instead, use 
 mapreduce.input.fileinputformat.split.minsize.per.node
 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks is 
 deprecated. Instead, use mapreduce.job.reduces
 14/05/19 16:47:21 INFO Configuration.deprecation: 
 mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
 mapreduce.reduce.speculative
 14/05/19 16:47:22 WARN conf.Configuration: 
 org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to 
 override final parameter: mapreduce.job.end-notification.max.retry.interval;  
 Ignoring.
 14/05/19 16:47:22 WARN conf.Configuration: 
 org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to 
 override final parameter: mapreduce.cluster.local.dir;  Ignoring.
 14/05/19 16:47:22 WARN conf.Configuration: 
 org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to 
 override final parameter: mapreduce.job.end-notification.max.attempts;  
 Ignoring.
 14/05/19 16:47:22 WARN conf.Configuration: 
 org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to 
 override final parameter: mapreduce.cluster.temp.dir;  Ignoring.
 Logging initialized using configuration in 
 jar:file:/usr/local/shark/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
 Hive history 
 file=/tmp/root/hive_job_log_root_14857@ubuntu_201405191647_897494215.txt
 6.004: [GC 279616K->18440K(1013632K), 0.0438980 secs]
 6.445: [Full GC 59125K->7949K(1013632K), 0.0685160 secs]
 Reloading cached RDDs from previous Shark sessions... (use -skipRddReload 
 flag to skip reloading)
 7.535: [Full GC 104136K->13059K(1013632K), 0.0885820 secs]
 8.459: [Full GC 61237K->18031K(1013632K), 0.0820400 secs]
 8.662: [Full GC 29832K->8958K(1013632K), 0.0869700 secs]
 8.751: [Full GC 13433K->8998K(1013632K), 0.0856520 secs]
 10.435: [Full GC 72246K->14140K(1013632K), 0.1797530 secs]
 Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: 
 java.lang.RuntimeException: Unable to instantiate 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
   at 
 org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072)
   at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49)
   at shark.SharkCliDriver.init(SharkCliDriver.scala:283)
   at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)
   at shark.SharkCliDriver.main(SharkCliDriver.scala)
 Caused by: java.lang.RuntimeException: Unable to instantiate 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient
   at 
 org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139)
   at 
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.init(RetryingMetaStoreClient.java:51)
   at 
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61)
   at 
 org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288)
   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299)
   at 
 org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070)
   ... 4 more
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 

[jira] [Commented] (SPARK-1850) Bad exception if multiple jars exist when running PySpark

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049895#comment-14049895
 ] 

Matthew Farrellee commented on SPARK-1850:
--

[~andrewor14] -

I think this should be closed as resolved by SPARK-2242.

The current output for the error is:

{noformat}
$ ./dist/bin/pyspark
Python 2.7.5 (default, Feb 19 2014, 13:47:28) 
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/context.py", line 95, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/context.py", line 191, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/java_gateway.py", line 66, in launch_gateway
    raise Exception(error_msg)
Exception: Launching GatewayServer failed with exit code 1!(Warning: unexpected 
output detected.)

Found multiple Spark assembly jars in 
/home/matt/Documents/Repositories/spark/dist/lib:
spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4-.jar
spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
Please remove all but one jar.
{noformat}

 Bad exception if multiple jars exist when running PySpark
 -

 Key: SPARK-1850
 URL: https://issues.apache.org/jira/browse/SPARK-1850
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.0.1


 {code}
 Found multiple Spark assembly jars in 
 /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
 Traceback (most recent call last):
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", line 43, in <module>
     sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", pyFiles=add_files)
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 94, in __init__
     SparkContext._ensure_initialized(self, gateway=gateway)
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 180, in _ensure_initialized
     SparkContext._gateway = gateway or launch_gateway()
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", line 49, in launch_gateway
     gateway_port = int(proc.stdout.readline())
 ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
 {code}
 It's trying to read the Java gateway port as an int from the sub-process' 
 STDOUT. However, what it read was an error message, which is clearly not an 
 int. We should differentiate between these cases and just propagate the 
 original message if it's not an int. Right now, this exception is not very 
 helpful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1550) Successive creation of spark context fails in pyspark, if the previous initialization of spark context had failed.

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049918#comment-14049918
 ] 

Matthew Farrellee commented on SPARK-1550:
--

This issue as reported is no longer present in Spark 1.0, where defaults are 
provided for the app name and master.

{code}
$ SPARK_HOME=dist 
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.1-src.zip python
Python 2.7.5 (default, Feb 19 2014, 13:47:28) 
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyspark import SparkContext
>>> sc=SparkContext('local')
[successful creation of context]
{code}

I believe this should be closed as resolved. /cc: [~pwendell]

 Successive creation of spark context fails in pyspark, if the previous 
 initialization of spark context had failed.
 --

 Key: SPARK-1550
 URL: https://issues.apache.org/jira/browse/SPARK-1550
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Prabin Banka
  Labels: pyspark, sparkcontext

 For example:
 In PySpark, if we try to initialize a SparkContext with insufficient 
 arguments, sc=SparkContext('local'),
 it fails with an exception:
 Exception: An application name must be set in your configuration
 This is all fine. 
 However, any successive creation of a SparkContext with correct arguments 
 also fails:
 s1=SparkContext('local', 'test1')
 AttributeError: 'SparkContext' object has no attribute 'master'



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1257) Endless running task when using pyspark with input file containing a long line

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049933#comment-14049933
 ] 

Matthew Farrellee commented on SPARK-1257:
--

I recommend closing this as resolved, with the option for the filer to reopen if 
the issue reproduces in 1.0. /cc: [~pwendell] [~joshrosen]

 Endless running task when using pyspark with input file containing a long line
 --

 Key: SPARK-1257
 URL: https://issues.apache.org/jira/browse/SPARK-1257
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0
Reporter: Hanchen Su

 When launching any pyspark application with an input file containing a very 
 long line (about 7 characters), the job hangs and never stops. 
 The application UI shows that there is a task running endlessly.
 There is no problem using the Scala version with the same input.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1030) unneeded file required when running pyspark program using yarn-client

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049929#comment-14049929
 ] 

Matthew Farrellee commented on SPARK-1030:
--

Using pyspark to submit is deprecated in Spark 1.0 in favor of spark-submit. I 
think this should be closed as resolved/workfix. /cc: [~pwendell] [~joshrosen]

 unneeded file required when running pyspark program using yarn-client
 -

 Key: SPARK-1030
 URL: https://issues.apache.org/jira/browse/SPARK-1030
 Project: Spark
  Issue Type: Bug
  Components: Deploy, PySpark, YARN
Affects Versions: 0.8.1
Reporter: Diana Carroll
Assignee: Josh Rosen

 I can successfully run a pyspark program using the yarn-client master using 
 the following command:
 {code}
 SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar
  \
 SPARK_YARN_APP_JAR=~/testdata.txt pyspark \
 test1.py
 {code}
 However, the SPARK_YARN_APP_JAR doesn't make any sense; it's a Python 
 program, and therefore there's no JAR. If I don't set the value, or if I set 
 the value to a non-existent file, Spark gives me an error message.
 {code}
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 None.org.apache.spark.api.java.JavaSparkContext.
 : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
 {code}
 or
 {code}
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 None.org.apache.spark.api.java.JavaSparkContext.
 : java.io.FileNotFoundException: File file:dummy.txt does not exist
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
 {code}
 My program is very simple:
 {code}
 from pyspark import SparkContext
 def main():
     sc = SparkContext("yarn-client", "Simple App")
     logData = sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
     numjpgs = logData.filter(lambda s: '.jpg' in s).count()
     print "Number of JPG requests: " + str(numjpgs)
 {code}
 Although it reads the SPARK_YARN_APP_JAR file, it doesn't use the file at 
 all; I can point it at anything, as long as it's a valid, accessible file, 
 and it works the same.
 Although there's an obvious workaround for this bug, it's high priority from 
 my perspective because I'm working on a course to teach people how to do 
 this, and it's really hard to explain why this variable is needed!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor

2014-07-02 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049937#comment-14049937
 ] 

Matthew Farrellee commented on SPARK-1284:
--

[~jblomo] -

Will you add a reproducer script to this issue?

I did a simple test based on what you suggested with the tip of master and could 
not reproduce it:

{code}
$ ./dist/bin/pyspark
Python 2.7.5 (default, Feb 19 2014, 13:47:28) 
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
...
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
  /_/

Using Python version 2.7.5 (default, Feb 19 2014 13:47:28)
SparkContext available as sc.
>>> data = sc.textFile('/etc/passwd')
14/07/02 07:03:59 INFO MemoryStore: ensureFreeSpace(32816) called with curMem=0, maxMem=308910489
14/07/02 07:03:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.0 KB, free 294.6 MB)
>>> data.cache()
/etc/passwd MappedRDD[1] at textFile at NativeMethodAccessorImpl.java:-2
>>> data.take(10)
...[expected output]...
>>> data.flatMap(lambda line: line.split(':')).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y).collect()
...[expected output, no hang]...
{code}

 pyspark hangs after IOError on Executor
 ---

 Key: SPARK-1284
 URL: https://issues.apache.org/jira/browse/SPARK-1284
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Jim Blomo

 When running a reduceByKey over a cached RDD, Python fails with an exception, 
 but the failure is not detected by the task runner.  Spark and the pyspark 
 shell hang waiting for the task to finish.
 The error is:
 {code}
 PySpark worker failed with exception:
 Traceback (most recent call last):
   File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in dump_stream
     self.serializer.dump_stream(self._batched(iterator), stream)
   File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in dump_stream
     self._write_with_length(obj, stream)
   File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in _write_with_length
     stream.write(serialized)
 IOError: [Errno 104] Connection reset by peer
 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as 4257 bytes in 47 ms
 Traceback (most recent call last):
   File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in launch_worker
     worker(listen_sock)
   File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker
     outfile.flush()
 IOError: [Errno 32] Broken pipe
 {code}
 I can reproduce the error by running take(10) on the cached RDD before 
 running reduceByKey (which looks at the whole input file).
 Affects Version 1.0.0-SNAPSHOT (4d88030486)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-07-02 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049939#comment-14049939
 ] 

Alexander Ulanov commented on SPARK-1473:
-

Is anybody working on this issue?

 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Priority: Minor
  Labels: features
 Fix For: 1.1.0


 For classification tasks involving large feature spaces in the order of tens 
 of thousands or higher (e.g., text classification with n-grams, where n > 1), 
 it is often useful to rank and filter features that are irrelevant thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A feature evaluation interface which is flexible needs to be designed and at 
 least two methods should be implemented with Information Gain being a 
 priority as it has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below) which are more practical for lower 
 dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J.,  Luján, M. (2012). Conditional
 likelihood maximisation: a unifying framework for information theoretic
 feature selection.*The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, George. An extensive empirical study of feature selection metrics 
 for text classification. The Journal of machine learning research 3 (2003): 
 1289-1305.
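
For concreteness, a self-contained sketch of the information-gain score for a single binary feature and binary label (illustrative only, not a proposed MLlib API):

{code}
// Information gain IG(Y; X) = H(Y) - H(Y|X) for a binary feature X and binary
// label Y, computed from the four joint counts. Illustrative only.
object InfoGainExample {
  private def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  /** n11 = #(x=1,y=1), n10 = #(x=1,y=0), n01 = #(x=0,y=1), n00 = #(x=0,y=0) */
  def informationGain(n11: Double, n10: Double, n01: Double, n00: Double): Double = {
    val total = n11 + n10 + n01 + n00
    val hY = entropy(Seq(n11 + n01, n10 + n00))
    val hYgivenX = ((n11 + n10) / total) * entropy(Seq(n11, n10)) +
                   ((n01 + n00) / total) * entropy(Seq(n01, n00))
    hY - hYgivenX
  }

  def main(args: Array[String]): Unit =
    println(informationGain(40, 10, 5, 45)) // higher score = more informative feature
}
{code}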



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049942#comment-14049942
 ] 

Sean Owen commented on SPARK-2341:
--

I've been a bit uncomfortable with how the MLlib API conflates categorical 
values and numbers, since they aren't numbers in general. Treating them as 
numbers is a convenience in some cases, and common in papers, but feels like 
suboptimal software design -- should a user have to convert categoricals to 
some numeric representation? To me it invites confusion, and this is one 
symptom. So I am not sure that multiclass should mean "parse the target as a 
double" to begin with.

OK, it's not the issue here. But since we're on the subject of an experimental API 
that is subject to change, this is something related that could be improved 
along the way, and it's my #1 wish for MLlib at the moment. I'd really like to 
work on a change that accommodates classes as, say, strings at least, and does 
not presume doubles. But I am trying to figure out whether anyone agrees with that. 

 loadLibSVMFile doesn't handle regression datasets
 -

 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Priority: Minor
  Labels: easyfix

 Many datasets exist in LibSVM format for regression tasks [1] but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. What happens then is that the file is loaded but in 
 multiclass mode : each target value is interpreted as a class name !
 The fix would be to write a RegressionLabelParser which converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC

2014-07-02 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050005#comment-14050005
 ] 

Guoqiang Li commented on SPARK-1989:


In this case we should also trigger garbage collection on the driver.
Related work: 
https://github.com/witgo/spark/compare/taskEvent

 Exit executors faster if they get into a cycle of heavy GC
 --

 Key: SPARK-1989
 URL: https://issues.apache.org/jira/browse/SPARK-1989
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Matei Zaharia
 Fix For: 1.1.0


 I've seen situations where an application is allocating too much memory 
 across its tasks + cache to proceed, but Java gets into a cycle where it 
 repeatedly runs full GCs, frees up a bit of the heap, and continues instead 
 of giving up. This then leads to timeouts and confusing error messages. It 
 would be better to crash with OOM sooner. The JVM has options to support 
 this: http://java.dzone.com/articles/tracking-excessive-garbage.
 The right solution would probably be:
 - Add some config options used by spark-submit to set -XX:GCTimeLimit and 
 -XX:GCHeapFreeLimit, with more conservative values than the defaults (e.g. 90% 
 time limit, 5% free limit), as sketched below
 - Make sure we pass these into the Java options for executors in each 
 deployment mode
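
A hedged sketch of what the first bullet could look like using the existing executor Java options (the HotSpot flags -XX:GCTimeLimit and -XX:GCHeapFreeLimit work together with -XX:+UseGCOverheadLimit; the 90/5 values simply mirror the suggestion above):

{code}
import org.apache.spark.SparkConf

// Illustrative only: make the executor JVM throw OutOfMemoryError once 90% of
// time is spent in GC while less than 5% of the heap is being recovered.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseGCOverheadLimit -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=5")
{code}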



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2343) QueueInputDStream with oneAtATime=false does not dequeue items

2014-07-02 Thread Manuel Laflamme (JIRA)
Manuel Laflamme created SPARK-2343:
--

 Summary: QueueInputDStream with oneAtATime=false does not dequeue 
items
 Key: SPARK-2343
 URL: https://issues.apache.org/jira/browse/SPARK-2343
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 0.9.1, 0.9.0
Reporter: Manuel Laflamme
Priority: Minor


QueueInputDStream does not dequeue items when used with the oneAtATime flag 
disabled. The same items are reprocessed for every batch. 
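
A self-contained illustration of the underlying collection behavior (plain Scala, not the DStream code itself): appending a mutable Queue to a buffer does not consume its elements, whereas dequeueAll drains it, which is what the oneAtATime=false path presumably needs to do.

{code}
import scala.collection.mutable.{ArrayBuffer, Queue}

// Illustrative only: copying a mutable Queue leaves its contents in place, so
// the same items would be seen again on the next batch; dequeueAll drains it.
object DequeueAllExample {
  def main(args: Array[String]): Unit = {
    val queue = Queue(1, 2, 3)

    val copied = new ArrayBuffer[Int]()
    copied ++= queue               // queue still holds 1, 2, 3 afterwards
    println(queue.size)            // 3  -> items would be reprocessed

    val drained = queue.dequeueAll(_ => true)
    println(queue.size)            // 0  -> items are consumed exactly once
    println(drained)               // the three items, returned as a Seq
  }
}
{code}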



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1850) Bad exception if multiple jars exist when running PySpark

2014-07-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050318#comment-14050318
 ] 

Andrew Or commented on SPARK-1850:
--

Ye, I will change it.

 Bad exception if multiple jars exist when running PySpark
 -

 Key: SPARK-1850
 URL: https://issues.apache.org/jira/browse/SPARK-1850
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.0.1


 {code}
 Found multiple Spark assembly jars in 
 /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
 Traceback (most recent call last):
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", line 43, in <module>
     sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", pyFiles=add_files)
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 94, in __init__
     SparkContext._ensure_initialized(self, gateway=gateway)
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 180, in _ensure_initialized
     SparkContext._gateway = gateway or launch_gateway()
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", line 49, in launch_gateway
     gateway_port = int(proc.stdout.readline())
 ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
 {code}
 It's trying to read the Java gateway port as an int from the sub-process' 
 STDOUT. However, what it read was an error message, which is clearly not an 
 int. We should differentiate between these cases and just propagate the 
 original message if it's not an int. Right now, this exception is not very 
 helpful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1850) Bad exception if multiple jars exist when running PySpark

2014-07-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-1850.


Resolution: Fixed

 Bad exception if multiple jars exist when running PySpark
 -

 Key: SPARK-1850
 URL: https://issues.apache.org/jira/browse/SPARK-1850
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.0.1


 {code}
 Found multiple Spark assembly jars in 
 /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10:
 Traceback (most recent call last):
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", line 43, in <module>
     sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", pyFiles=add_files)
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 94, in __init__
     SparkContext._ensure_initialized(self, gateway=gateway)
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", line 180, in _ensure_initialized
     SparkContext._gateway = gateway or launch_gateway()
   File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", line 49, in launch_gateway
     gateway_port = int(proc.stdout.readline())
 ValueError: invalid literal for int() with base 10: 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n'
 {code}
 It's trying to read the Java gateway port as an int from the sub-process' 
 STDOUT. However, what it read was an error message, which is clearly not an 
 int. We should differentiate between these cases and just propagate the 
 original message if it's not an int. Right now, this exception is not very 
 helpful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2328) Add execution of `SHOW TABLES` before `TestHive.reset()`.

2014-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2328.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Takuya Ueshin

 Add execution of `SHOW TABLES` before `TestHive.reset()`.
 -

 Key: SPARK-2328
 URL: https://issues.apache.org/jira/browse/SPARK-2328
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.0.1, 1.1.0


 Unfortunately, {{PruningSuite}} is executed first among the Hive tests, and 
 {{TestHive.reset()}} breaks the test environment.
 To prevent this, we must run a query before calling reset for the first time.
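
Illustratively, the change amounts to something like the following (assuming the TestHive singleton and its hql method as they existed in 1.0):

{code}
import org.apache.spark.sql.hive.test.TestHive

TestHive.hql("SHOW TABLES")   // touch the metastore before the first reset
TestHive.reset()
{code}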



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2186) Spark SQL DSL support for simple aggregations such as SUM and AVG

2014-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2186.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

 Spark SQL DSL support for simple aggregations such as SUM and AVG
 -

 Key: SPARK-2186
 URL: https://issues.apache.org/jira/browse/SPARK-2186
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Zongheng Yang
Priority: Minor
 Fix For: 1.0.1, 1.1.0


 Inspired by this thread 
 (http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-td7874.html):
  Spark SQL doesn't seem to have DSL support for simple aggregations such as 
 AVG and SUM. It'd be nice if the user could avoid writing a SQL query and 
 instead write something like:
 {code}
 data.select('country, 'age.avg, 'hits.sum).groupBy('country).collect()
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2287) Make ScalaReflection be able to handle Generic case classes.

2014-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2287.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1
 Assignee: Takuya Ueshin

 Make ScalaReflection be able to handle Generic case classes.
 

 Key: SPARK-2287
 URL: https://issues.apache.org/jira/browse/SPARK-2287
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin
Assignee: Takuya Ueshin
 Fix For: 1.0.1, 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050347#comment-14050347
 ] 

Michael Armbrust commented on SPARK-2342:
-

This does look like a typo (though maybe one that doesn't matter due to 
erasure?).  That said, if you make a PR I'll certainly merge it.  Thanks!

 Evaluation helper's output type doesn't conform to input type
 -

 Key: SPARK-2342
 URL: https://issues.apache.org/jira/browse/SPARK-2342
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yijie Shen
Priority: Minor
  Labels: easyfix

 In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala
 {code}protected final def n2(i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any{code}
 is intended to do computations for numeric Add/Minus/Multiply.
 Just as the comment suggests: {quote}Those expressions are supposed to be in 
 the same data type, and also the return type.{quote}
 But in the code, function f is cast to the signature:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
 I think this is a typo and the correct signature should be:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack

2014-07-02 Thread Chen He (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050381#comment-14050381
 ] 

Chen He commented on SPARK-2277:


This is interesting. I will take a look.

 Make TaskScheduler track whether there's host on a rack
 ---

 Key: SPARK-2277
 URL: https://issues.apache.org/jira/browse/SPARK-2277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Rui Li

 When TaskSetManager adds a pending task, it checks whether the task's 
 preferred location is available. For a RACK_LOCAL task, we consider the 
 preferred rack available if such a rack is defined for the preferred host. 
 This is incorrect, as there may be no alive hosts on that rack at all. 
 Therefore, TaskScheduler should track the hosts on each rack and provide an 
 API for TaskSetManager to check whether there is a host alive on a specific rack.
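
A minimal sketch of the kind of bookkeeping and API being proposed (names are hypothetical, not the actual TaskScheduler interface):

{code}
import scala.collection.mutable

// Illustrative sketch only; the names here are hypothetical.
class RackTracker {
  private val hostsByRack = new mutable.HashMap[String, mutable.HashSet[String]]

  def addExecutor(host: String, rack: Option[String]): Unit =
    rack.foreach(r => hostsByRack.getOrElseUpdate(r, new mutable.HashSet[String]) += host)

  def removeExecutor(host: String, rack: Option[String]): Unit =
    rack.foreach { r =>
      hostsByRack.get(r).foreach { hosts =>
        hosts -= host
        if (hosts.isEmpty) hostsByRack -= r
      }
    }

  /** What TaskSetManager would call before treating a rack-local location as valid. */
  def hasHostAliveOnRack(rack: String): Boolean = hostsByRack.contains(rack)
}
{code}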



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1054) Get Cassandra support in Spark Core/Spark Cassandra Module

2014-07-02 Thread Rohit Rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Rai updated SPARK-1054:
-

Summary: Get Cassandra support in Spark Core/Spark Cassandra Module  (was: 
Contribute Calliope Core to Spark as spark-cassandra)

 Get Cassandra support in Spark Core/Spark Cassandra Module
 --

 Key: SPARK-1054
 URL: https://issues.apache.org/jira/browse/SPARK-1054
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Rohit Rai
  Labels: calliope, cassandra

 Calliope is a library providing an interface to consume data from Cassandra 
 to Spark and store RDDs from Spark to Cassandra. 
 Built as a wrapper over Cassandra's Hadoop I/O, it provides a simplified and 
 very generic API to consume and produce data from and to Cassandra. It 
 allows you to consume data from legacy as well as CQL3 Cassandra storage. It 
 can also harness C* to speed up your process by fetching only the relevant 
 data, using CQL3 and C*'s secondary indexes. Though it currently 
 uses only the Hadoop I/O formats for Cassandra, in the near future we see the same 
 API harnessing other means of consuming Cassandra data, like using the 
 StorageProxy or even reading from SSTables directly.
 Beyond the basic data fetch functionality, the Calliope API harnesses Scala and 
 its implicit parameters and conversions so you can work at a higher 
 abstraction, dealing with tuples/objects instead of Cassandra's Rows/Columns in 
 your MapReduce jobs.
 Over the past few months we have seen the combination of Spark+Cassandra gaining 
 a lot of traction, and we feel Calliope provides the path of least friction 
 for developers to start working with this combination.
 We have been using this in production for over a year now and the Calliope 
 early access repository has 30+ users. I am filing this issue to start a 
 discussion around whether we would want Calliope to be a part of Spark and, if 
 yes, what would be involved in doing so.
 You can read more about Calliope here -
 http://tuplejump.github.io/calliope



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1054) Get Cassandra support in Spark Core/Spark Cassandra Module

2014-07-02 Thread Rohit Rai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050544#comment-14050544
 ] 

Rohit Rai commented on SPARK-1054:
--

With https://github.com/datastax/cassandra-driver-spark from DataStax, we 
should work on getting a unified standard API in Spark, taking the good things from 
both worlds.

 Get Cassandra support in Spark Core/Spark Cassandra Module
 --

 Key: SPARK-1054
 URL: https://issues.apache.org/jira/browse/SPARK-1054
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Rohit Rai
  Labels: calliope, cassandra

 Calliope is a library providing an interface to consume data from Cassandra 
 to spark and store RDDs from Spark to Cassandra. 
 Building as wrapper over Cassandra's Hadoop I/O it provides a simplified and 
 very generic API to consume and produces data from and to Cassandra. It 
 allows you to consume data from Legacy as well as CQL3 Cassandra Storage.  It 
 can also harness C* to speed up your process by fetching only the relevant 
 data from C* harnessing CQL3 and C*'s secondary indexes. Though it currently 
 uses only the Hadoop I/O formats for Cassandra in near future we see the same 
 API harnessing other means of consuming Cassandra data like using the 
 StorageProxy or even reading from SSTables directly.
 Over the basic data fetch functionality, the Calliope API harnesses Scala and 
 it's implicit parameters and conversions for you to work on a higher 
 abstraction dealing with tuples/objects instead of Cassandra's Row/Columns in 
 your MapRed jobs.
 Over past few months we have seen the combination of Spark+Cassandra gaining 
 a lot of traction. And we feel Calliope provides the path of least friction 
 for developers to start working with this combination.
 We have been using this ins production for over a year now and the Calliope 
 early access repository has 30+ users.  I am putting this issue to start a 
 discussion around whether we would want Calliope to be a part of Spark and if 
 yes, what will be involved in doing so.
 You can read more about Calliope here -
 http://tuplejump.github.io/calliope



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

2014-07-02 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-2345:
---

 Summary: ForEachDStream should have an option of running the 
foreachfunc on Spark
 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan


Today the generated Job simply calls the foreachfunc, but does not run it on 
Spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

2014-07-02 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050659#comment-14050659
 ] 

Hari Shreedharan commented on SPARK-2345:
-

Currently, a job like saveAsTextFile or saveAsHadoopFile on the DStream 
will cause the rdd.save calls to be executed via sparkContext.runJob, which in 
turn calls the foreachfunc that is passed to the ForEachDStream. So the case 
where this DStream is saved off works fine. 

But if you simply call register and have the foreachfunc do some processing and 
custom writes, that processing may end up running locally on the driver.

 ForEachDStream should have an option of running the foreachfunc on Spark
 

 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan

 Today the generated Job simply calls the foreachfunc, but does not run it on 
 Spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2346) Error parsing table names that starts from numbers

2014-07-02 Thread Alexander Albul (JIRA)
Alexander Albul created SPARK-2346:
--

 Summary: Error parsing table names that starts from numbers
 Key: SPARK-2346
 URL: https://issues.apache.org/jira/browse/SPARK-2346
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Alexander Albul


Looks like org.apache.spark.sql.catalyst.SqlParser cannot parse table names 
when they start with numbers.

Steps to reproduce:

{code:title=Test.scala|borderStyle=solid}
case class Data(value: String)

object Test {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "sql")
    val sqlSc = new SQLContext(sc)
    import sqlSc._

    sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
    sql("SELECT * FROM '123_table'").collect().foreach(println)
  }
}
{code}

And here is an exception:

{quote}
Exception in thread main java.lang.RuntimeException: [1.15] failure: ``('' 
expected but 123_table found

SELECT * FROM '123_table'
  ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
at io.ubix.spark.Test$.main(Test.scala:24)
at io.ubix.spark.Test.main(Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{quote}

When I change 123_table to table_123, the problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark

2014-07-02 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050670#comment-14050670
 ] 

Hari Shreedharan commented on SPARK-2345:
-

Looks like we'd have to do this in a new DStream, since the ForEachDStream 
takes an (RDD[T], Time) => Unit, but to call runJob we'd have to pass in an 
(Iterator[T], Time) => Unit. I am not sure how much value this adds, but it does 
seem like, if you are not using one of the built-in save/collect methods, you'd 
have to explicitly run the function through context.runJob(...).

Do you think this makes sense, [~tdas], [~pwendell]?
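
As a rough sketch of the adaptation being discussed (the names here are illustrative, not a proposed API), an (RDD[T], Time) => Unit foreach function could submit the per-partition work through runJob itself:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Wraps a per-partition handler so the resulting (RDD[T], Time) => Unit function
// ships the work to the executors via runJob instead of running it on the driver.
def runOnCluster[T](handlePartition: Iterator[T] => Unit): (RDD[T], Time) => Unit = {
  (rdd: RDD[T], time: Time) =>
    rdd.context.runJob(rdd, (iter: Iterator[T]) => handlePartition(iter))
}
{code}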

 ForEachDStream should have an option of running the foreachfunc on Spark
 

 Key: SPARK-2345
 URL: https://issues.apache.org/jira/browse/SPARK-2345
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan

 Today the generated Job simply calls the foreachfunc, but does not run it on 
 Spark itself using the sparkContext.runJob method.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2346) Error parsing table names that starts from numbers

2014-07-02 Thread Alexander Albul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Albul updated SPARK-2346:
---

Description: 
Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
when they start with numbers.

Steps to reproduce:

{code:title=Test.scala|borderStyle=solid}
case class Data(value: String)

object Test {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "sql")
    val sqlSc = new SQLContext(sc)
    import sqlSc._

    sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
    sql("SELECT * FROM '123_table'").collect().foreach(println)
  }
}
{code}

And here is an exception:

{quote}
Exception in thread main java.lang.RuntimeException: [1.15] failure: ``('' 
expected but 123_table found

SELECT * FROM '123_table'
  ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
at io.ubix.spark.Test$.main(Test.scala:24)
at io.ubix.spark.Test.main(Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{quote}

When I change 123_table to table_123, the problem disappears.

  was:
Looks like org.apache.spark.sql.catalyst.SqlParser cannot parse table names 
when they start from numbers.

Steps to reproduce:

{code:title=Test.scala|borderStyle=solid}
case class Data(value: String)

object Test {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "sql")
    val sqlSc = new SQLContext(sc)
    import sqlSc._

    sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
    sql("SELECT * FROM '123_table'").collect().foreach(println)
  }
}
{code}

And here is an exception:

{quote}
Exception in thread main java.lang.RuntimeException: [1.15] failure: ``('' 
expected but 123_table found

SELECT * FROM '123_table'
  ^
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
at io.ubix.spark.Test$.main(Test.scala:24)
at io.ubix.spark.Test.main(Test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{quote}

When i am changing from 123_table to table_123 problem disappears.


 Error parsing table names that starts from numbers
 --

 Key: SPARK-2346
 URL: https://issues.apache.org/jira/browse/SPARK-2346
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Alexander Albul
  Labels: Parser, SQL

 Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
 when they start with numbers.
 Steps to reproduce:
 {code:title=Test.scala|borderStyle=solid}
 case class Data(value: String)
 object Test {
   def main(args: Array[String]) {
     val sc = new SparkContext("local", "sql")
     val sqlSc = new SQLContext(sc)
     import sqlSc._
     sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
     sql("SELECT * FROM '123_table'").collect().foreach(println)
   }
 }
 {code}
 And here is an exception:
 {quote}
 Exception in thread main java.lang.RuntimeException: [1.15] failure: ``('' 
 expected but 123_table found
 SELECT * FROM '123_table'
   ^
   at scala.sys.package$.error(package.scala:27)
   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
   at io.ubix.spark.Test$.main(Test.scala:24)
   at io.ubix.spark.Test.main(Test.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  

[jira] [Created] (SPARK-2347) Graph object can not be set to StorageLevel.MEMORY_ONLY_SER

2014-07-02 Thread Baoxu Shi (JIRA)
Baoxu Shi created SPARK-2347:


 Summary: Graph object can not be set to 
StorageLevel.MEMORY_ONLY_SER
 Key: SPARK-2347
 URL: https://issues.apache.org/jira/browse/SPARK-2347
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
 Environment: Spark standalone with 5 workers and 1 driver
Reporter: Baoxu Shi


I'm creating a Graph object using 

Graph(vertices, edges, null, StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_ONLY)

But that throws a java.io.NotSerializableException on both the workers and the driver. 

14/07/02 16:30:26 ERROR BlockManagerWorker: Exception handling buffer message
java.io.NotSerializableException: org.apache.spark.graphx.impl.VertexPartition
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at 
org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:106)
at 
org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:30)
at 
org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:988)
at 
org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:997)
at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:392)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:358)
at 
org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
at 
org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at 
org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
at 
org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
at 
org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:662)
at 
org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:504)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Even when the driver does not throw this exception, it throws 

java.io.FileNotFoundException: 
/tmp/spark-local-20140702151845-9620/2a/shuffle_2_25_3 (No such file or 
directory)

I know that VertexPartition is not supposed to be serializable, so is there any 
workaround for this?
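
One possible workaround, as a sketch rather than a confirmed fix: serialized storage levels need a serializer that knows about the GraphX internals, so configuring Kryo with the GraphX registrator may avoid the NotSerializableException. The sketch below mirrors the call shape from the description above.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, GraphKryoRegistrator}
import org.apache.spark.storage.StorageLevel

object GraphSerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("graph-ser-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[GraphKryoRegistrator].getName)
    val sc = new SparkContext(conf)

    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "a-b")))

    // Same call shape as in the description, but with serialized storage levels
    // and Kryo registered for the GraphX partition classes.
    val graph = Graph(vertices, edges, "default",
      StorageLevel.MEMORY_ONLY_SER, StorageLevel.MEMORY_ONLY_SER)
    println(graph.vertices.count())
    sc.stop()
  }
}
{code}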



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-02 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050721#comment-14050721
 ] 

Yin Huai commented on SPARK-2339:
-

Also, the names of those registered tables are case sensitive, but the names of Hive 
tables are case insensitive. This will cause confusion when a user is using 
HiveContext. I think it may be good to treat all identifiers as case insensitive 
when a user is using HiveContext, and to make HiveContext.sql an alias of 
HiveContext.hql (basically, do not expose catalyst's SQLParser in HiveContext).
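
A minimal sketch of what that proposal could look like (an assumption for illustration, not the actual change): route everything through Hive's case-insensitive parser by having sql delegate to hql.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SchemaRDD
import org.apache.spark.sql.hive.HiveContext

// Sketch only: a HiveContext whose sql() bypasses catalyst's SQLParser entirely.
class CaseInsensitiveHiveContext(sc: SparkContext) extends HiveContext(sc) {
  override def sql(sqlText: String): SchemaRDD = hql(sqlText)
}
{code}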

 SQL parser in sql-core is case sensitive, but a table alias is converted to 
 lower case when we create Subquery
 --

 Key: SPARK-2339
 URL: https://issues.apache.org/jira/browse/SPARK-2339
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Yin Huai
 Fix For: 1.1.0


 Reported by 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
 After we get the table from the catalog, because the table has an alias, we 
 will temporarily insert a Subquery. Then, we convert the table alias to lower 
 case no matter if the parser is case sensitive or not.
 To see the issue ...
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Person(name: String, age: Int)
 val people = sc.textFile("examples/src/main/resources/people.txt")
   .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
 people.registerAsTable("people")
 sqlContext.sql("select PEOPLE.name from people PEOPLE")
 {code}
 The plan is ...
 {code}
 == Query Plan ==
 Project ['PEOPLE.name]
  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:176
 {code}
 You can find that PEOPLE.name is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error

2014-07-02 Thread Chirag Todarka (JIRA)
Chirag Todarka created SPARK-2348:
-

 Summary: In Windows, having an environment variable named 
'classpath' gives an error
 Key: SPARK-2348
 URL: https://issues.apache.org/jira/browse/SPARK-2348
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Windows 7 Enterprise
Reporter: Chirag Todarka


Operating System: Windows 7 Enterprise
If an environment variable named 'classpath' is set, then starting 
'spark-shell' gives the error below:

mydir\spark\bin>spark-shell

Failed to initialize compiler: object scala.runtime in compiler mirror not found
.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler acces
sed before init set up.  Assuming no postInit code.

Failed to initialize compiler: object scala.runtime in compiler mirror not found
.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programatically, settings.usejavacp.value = true.
Exception in thread main java.lang.AssertionError: assertion failed: null
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca
la:202)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar
kILoop.scala:929)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
Loader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1305) Support persisting RDD's directly to Tachyon

2014-07-02 Thread Henry Saputra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Saputra updated SPARK-1305:
-

Comment: was deleted

(was: Sorry to comment on old JIRA but does anyone have PR for this ticket?)

 Support persisting RDD's directly to Tachyon
 

 Key: SPARK-1305
 URL: https://issues.apache.org/jira/browse/SPARK-1305
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Reporter: Patrick Wendell
Assignee: Haoyuan Li
Priority: Blocker
 Fix For: 1.0.0


 This is already an ongoing pull request - in a nutshell we want to support 
 Tachyon as a storage level in Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2348) In Windows, having an environment variable named 'classpath' gives an error

2014-07-02 Thread Chirag Todarka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050757#comment-14050757
 ] 

Chirag Todarka commented on SPARK-2348:
---

[~pwendell]
[~cheffpj]

Hi Patrick/Pat,

I am new to the project and want to contribute to it. 
I hope this will be a great starting point for me, so please assign it to me 
if possible.

Regards,
Chirag Todarka

 In Windows, having an environment variable named 'classpath' gives an error
 ---

 Key: SPARK-2348
 URL: https://issues.apache.org/jira/browse/SPARK-2348
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Windows 7 Enterprise
Reporter: Chirag Todarka

 Operating System: Windows 7 Enterprise
 If an environment variable named 'classpath' is set, then starting 
 'spark-shell' gives the error below:
 mydir\spark\bin>spark-shell
 Failed to initialize compiler: object scala.runtime in compiler mirror not 
 found
 .
 ** Note that as of 2.8 scala does not assume use of the java classpath.
 ** For the old behavior pass -usejavacp to scala, or if using a Settings
 ** object programatically, settings.usejavacp.value = true.
 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler 
 acces
 sed before init set up.  Assuming no postInit code.
 Failed to initialize compiler: object scala.runtime in compiler mirror not 
 found
 .
 ** Note that as of 2.8 scala does not assume use of the java classpath.
 ** For the old behavior pass -usejavacp to scala, or if using a Settings
 ** object programatically, settings.usejavacp.value = true.
 Exception in thread main java.lang.AssertionError: assertion failed: null
 at scala.Predef$.assert(Predef.scala:179)
 at 
 org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca
 la:202)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar
 kILoop.scala:929)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
 scala:884)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
 scala:884)
 at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
 Loader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2346) Error parsing table names that starts with numbers

2014-07-02 Thread Alexander Albul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Albul updated SPARK-2346:
---

Summary: Error parsing table names that starts with numbers  (was: Error 
parsing table names that starts from numbers)

 Error parsing table names that starts with numbers
 --

 Key: SPARK-2346
 URL: https://issues.apache.org/jira/browse/SPARK-2346
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Alexander Albul
  Labels: Parser, SQL

 Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names 
 when they start with numbers.
 Steps to reproduce:
 {code:title=Test.scala|borderStyle=solid}
 case class Data(value: String)
 object Test {
   def main(args: Array[String]) {
     val sc = new SparkContext("local", "sql")
     val sqlSc = new SQLContext(sc)
     import sqlSc._
     sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table")
     sql("SELECT * FROM '123_table'").collect().foreach(println)
   }
 }
 {code}
 And here is an exception:
 {quote}
 Exception in thread main java.lang.RuntimeException: [1.15] failure: ``('' 
 expected but 123_table found
 SELECT * FROM '123_table'
   ^
   at scala.sys.package$.error(package.scala:27)
   at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47)
   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70)
   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150)
   at io.ubix.spark.Test$.main(Test.scala:24)
   at io.ubix.spark.Test.main(Test.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 {quote}
 When I change 123_table to table_123, the problem disappears.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1614) Move Mesos protobufs out of TaskState

2014-07-02 Thread Martin Zapletal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050804#comment-14050804
 ] 

Martin Zapletal commented on SPARK-1614:


I am considering moving the protobufs to a new object, something like object 
org.apache.spark.MesosTaskState. Is that an acceptable solution with regard to 
the requirements (avoiding the conflicts)? If not, can you please suggest where 
would be the best place for them?
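
A rough sketch of the kind of separate object being proposed (the object name and the exact state mapping are illustrative assumptions, not the final design; it would need to live in the org.apache.spark package to see the private[spark] TaskState enumeration):

{code}
package org.apache.spark

import org.apache.mesos.Protos.{TaskState => MesosTaskState}

private[spark] object MesosTaskStateUtil {
  import TaskState.TaskState

  // Spark task state -> Mesos protobuf task state.
  def toMesos(state: TaskState): MesosTaskState = state match {
    case TaskState.LAUNCHING => MesosTaskState.TASK_STARTING
    case TaskState.RUNNING   => MesosTaskState.TASK_RUNNING
    case TaskState.FINISHED  => MesosTaskState.TASK_FINISHED
    case TaskState.FAILED    => MesosTaskState.TASK_FAILED
    case TaskState.KILLED    => MesosTaskState.TASK_KILLED
    case TaskState.LOST      => MesosTaskState.TASK_LOST
  }

  // Mesos protobuf task state -> Spark task state.
  def fromMesos(state: MesosTaskState): TaskState = state match {
    case MesosTaskState.TASK_STAGING | MesosTaskState.TASK_STARTING |
         MesosTaskState.TASK_RUNNING      => TaskState.RUNNING
    case MesosTaskState.TASK_FINISHED     => TaskState.FINISHED
    case MesosTaskState.TASK_FAILED       => TaskState.FAILED
    case MesosTaskState.TASK_KILLED       => TaskState.KILLED
    case MesosTaskState.TASK_LOST         => TaskState.LOST
  }
}
{code}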

 Move Mesos protobufs out of TaskState
 -

 Key: SPARK-1614
 URL: https://issues.apache.org/jira/browse/SPARK-1614
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 0.9.1
Reporter: Shivaram Venkataraman
Priority: Minor
  Labels: Starter

 To isolate usage of Mesos protobufs it would be good to move them out of 
 TaskState into either a new class (MesosUtils ?) or 
 CoarseGrainedMesos{Executor, Backend}.
 This would allow applications to build Spark to run without including 
 protobuf from Mesos in their shaded jars.  This is one way to avoid protobuf 
 conflicts between Mesos and Hadoop 
 (https://issues.apache.org/jira/browse/MESOS-1203)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap

2014-07-02 Thread Andrew Or (JIRA)
Andrew Or created SPARK-2349:


 Summary: Fix NPE in ExternalAppendOnlyMap
 Key: SPARK-2349
 URL: https://issues.apache.org/jira/browse/SPARK-2349
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or


It throws an NPE on null keys.
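
A minimal repro sketch (an assumption about how the NPE surfaces, not text from the JIRA): with spilling enabled, key aggregation goes through ExternalAppendOnlyMap, so a null key in a groupByKey can hit the null-key path.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object NullKeySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("null-key-sketch")
      .set("spark.shuffle.spill", "true")  // route aggregation through ExternalAppendOnlyMap
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq((null.asInstanceOf[String], 1), ("a", 2), ("a", 3)))
    println(pairs.groupByKey().collect().toSeq)
    sc.stop()
  }
}
{code}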



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack

2014-07-02 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050886#comment-14050886
 ] 

Mridul Muralidharan commented on SPARK-2277:


I am not sure I follow this requirement.
For preferred locations, we populate their corresponding racks (if available) 
as preferred racks.

For available executor hosts, we look up the rack they belong to, and then see 
if that rack is preferred or not.

This, of course, assumes a host is only on a single rack.


What exactly is the behavior you are expecting from the scheduler?

 Make TaskScheduler track whether there's host on a rack
 ---

 Key: SPARK-2277
 URL: https://issues.apache.org/jira/browse/SPARK-2277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Rui Li

 When TaskSetManager adds a pending task, it checks whether the task's 
 preferred location is available. For a RACK_LOCAL task, we consider the 
 preferred rack available if such a rack is defined for the preferred host. 
 This is incorrect, as there may be no alive hosts on that rack at all. 
 Therefore, TaskScheduler should track the hosts on each rack and provide an 
 API for TaskSetManager to check whether there is a host alive on a specific rack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)
Andrew Or created SPARK-2350:


 Summary: Master throws NPE
 Key: SPARK-2350
 URL: https://issues.apache.org/jira/browse/SPARK-2350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from a list while iterating through it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2350:
-

Description: 
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from a list while iterating through it.

{code}
for (driver <- waitingDrivers) {
  if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
    launchDriver(worker, driver)
    waitingDrivers -= driver
  }
}
{code}

  was:
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from a list while iterating through this.

{code}
  for (driver <- waitingDrivers) {
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)
      waitingDrivers -= driver
    }
  }
{code}


 Master throws NPE
 -

 Key: SPARK-2350
 URL: https://issues.apache.org/jira/browse/SPARK-2350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


 ... if we launch a driver and there are more waiting drivers to be launched. 
 This is because we remove from a list while iterating through it.
 {code}
 for (driver <- waitingDrivers) {
   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
     launchDriver(worker, driver)
     waitingDrivers -= driver
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2350:
-

Description: 
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from a list while iterating through it.

{code}
  for (driver <- waitingDrivers) {
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)
      waitingDrivers -= driver
    }
  }
{code}

  was:... if we launch a driver and there are more waiting drivers to be 
launched. This is because we remove from a list while iterating through this.


 Master throws NPE
 -

 Key: SPARK-2350
 URL: https://issues.apache.org/jira/browse/SPARK-2350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


 ... if we launch a driver and there are more waiting drivers to be launched. 
 This is because we remove from a list while iterating through it.
 {code}
   for (driver <- waitingDrivers) {
     if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
       launchDriver(worker, driver)
       waitingDrivers -= driver
     }
   }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2350:
-

Description: 
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from a list while iterating through it.

Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).

{code}
for (driver <- waitingDrivers) {
  if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
    launchDriver(worker, driver)
    waitingDrivers -= driver
  }
}
{code}

  was:
... if we launch a driver and there are more waiting drivers to be launched. 
This is because we remove from a list while iterating through this.

{code}
for (driver <- waitingDrivers) {
  if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
    launchDriver(worker, driver)
    waitingDrivers -= driver
  }
}
{code}


 Master throws NPE
 -

 Key: SPARK-2350
 URL: https://issues.apache.org/jira/browse/SPARK-2350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


 ... if we launch a driver and there are more waiting drivers to be launched. 
 This is because we remove from a list while iterating through it.
 Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
 commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
 {code}
 for (driver <- waitingDrivers) {
   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
     launchDriver(worker, driver)
     waitingDrivers -= driver
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050891#comment-14050891
 ] 

Andrew Or commented on SPARK-2350:
--

In general, if the Master dies because of an exception, it automatically restarts 
and the exception message is hidden in the logs. It took a while for 
[~ilikerps] and me to find the exception while scrolling through the logs. 
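
For illustration, a self-contained sketch of the safe pattern (an assumption, not necessarily the committed fix): iterate over a snapshot of the buffer so removing elements does not disturb the iteration.

{code}
import scala.collection.mutable.ArrayBuffer

object SnapshotIterationSketch {
  def main(args: Array[String]): Unit = {
    val waiting = ArrayBuffer("driver-1", "driver-2", "driver-3")
    // .toList takes a snapshot, so mutating `waiting` inside the loop is safe.
    for (d <- waiting.toList) {
      if (d != "driver-2") {
        waiting -= d
      }
    }
    println(waiting)  // ArrayBuffer(driver-2)
  }
}
{code}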

 Master throws NPE
 -

 Key: SPARK-2350
 URL: https://issues.apache.org/jira/browse/SPARK-2350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


 ... if we launch a driver and there are more waiting drivers to be launched. 
 This is because we remove from a list while iterating through it.
 Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
 commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
 {code}
 for (driver <- waitingDrivers) {
   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
     launchDriver(worker, driver)
     waitingDrivers -= driver
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050891#comment-14050891
 ] 

Andrew Or edited comment on SPARK-2350 at 7/3/14 12:07 AM:
---

In general, if the Master dies because of an exception, it automatically restarts 
and the exception message is hidden in the logs. In the meantime, the symptoms 
are not indicative of a Master having thrown an exception and restarted. It 
took a while for [~ilikerps] and me to find the exception as we were scrolling 
through the logs.


was (Author: andrewor):
In general, if Master dies because of an exception, it automatically restarts 
and the exception message is hidden in the logs. It took a while for 
[~ilikerps] and I to find the exception as we are scrolling through the logs. 

 Master throws NPE
 -

 Key: SPARK-2350
 URL: https://issues.apache.org/jira/browse/SPARK-2350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


 ... if we launch a driver and there are more waiting drivers to be launched. 
 This is because we remove from a list while iterating through it.
 Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
 commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
 {code}
 for (driver <- waitingDrivers) {
   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
     launchDriver(worker, driver)
     waitingDrivers -= driver
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2350) Master throws NPE

2014-07-02 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050894#comment-14050894
 ] 

Andrew Or commented on SPARK-2350:
--

This is the root cause of SPARK-2154

 Master throws NPE
 -

 Key: SPARK-2350
 URL: https://issues.apache.org/jira/browse/SPARK-2350
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
 Fix For: 1.1.0


 ... if we launch a driver and there are more waiting drivers to be launched. 
 This is because we remove from a list while iterating through it.
 Here is the culprit from Master.scala (L487 as of the creation of this JIRA, 
 commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c).
 {code}
 for (driver <- waitingDrivers) {
   if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
     launchDriver(worker, driver)
     waitingDrivers -= driver
   }
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack

2014-07-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050951#comment-14050951
 ] 

Rui Li commented on SPARK-2277:
---

Suppose task1 prefers node1, but node1 is not available at the moment. However, 
we know node1 is on rack1, which makes task1 prefer rack1 for RACK_LOCAL 
locality. The problem is that we don't know whether there is an alive host on 
rack1, so we cannot check the availability of this preference.
Please let me know if I misunderstand anything :)
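
A hedged sketch of the bookkeeping this would need (names are assumptions, not the merged change): the scheduler tracks alive hosts per rack and exposes the check TaskSetManager needs.

{code}
import scala.collection.mutable

// Tracks which hosts are alive on each rack so RACK_LOCAL preferences can be validated.
class RackTracker {
  private val hostsByRack = new mutable.HashMap[String, mutable.HashSet[String]]

  def addExecutorHost(host: String, rack: Option[String]): Unit =
    rack.foreach(r => hostsByRack.getOrElseUpdate(r, new mutable.HashSet[String]) += host)

  def removeExecutorHost(host: String, rack: Option[String]): Unit =
    rack.foreach { r =>
      hostsByRack.get(r).foreach { hosts =>
        hosts -= host
        if (hosts.isEmpty) hostsByRack -= r
      }
    }

  // The check TaskSetManager needs before treating a rack as a valid RACK_LOCAL preference.
  def hasHostAliveOnRack(rack: String): Boolean = hostsByRack.contains(rack)
}
{code}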

 Make TaskScheduler track whether there's host on a rack
 ---

 Key: SPARK-2277
 URL: https://issues.apache.org/jira/browse/SPARK-2277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Rui Li

 When TaskSetManager adds a pending task, it checks whether the task's 
 preferred location is available. For a RACK_LOCAL task, we consider the 
 preferred rack available if such a rack is defined for the preferred host. 
 This is incorrect, as there may be no alive hosts on that rack at all. 
 Therefore, TaskScheduler should track the hosts on each rack and provide an 
 API for TaskSetManager to check whether there is a host alive on a specific rack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack

2014-07-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050952#comment-14050952
 ] 

Rui Li commented on SPARK-2277:
---

PR created at:
https://github.com/apache/spark/pull/1212

 Make TaskScheduler track whether there's host on a rack
 ---

 Key: SPARK-2277
 URL: https://issues.apache.org/jira/browse/SPARK-2277
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Rui Li

 When TaskSetManager adds a pending task, it checks whether the task's 
 preferred location is available. For a RACK_LOCAL task, we consider the 
 preferred rack available if such a rack is defined for the preferred host. 
 This is incorrect, as there may be no alive hosts on that rack at all. 
 Therefore, TaskScheduler should track the hosts on each rack and provide an 
 API for TaskSetManager to check whether there is a host alive on a specific rack.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050982#comment-14050982
 ] 

Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:52 AM:
---

[~marmbrus] I fix the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.


was (Author: yijieshen):
[~marmbrus] Fix the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.

 Evaluation helper's output type doesn't conform to input type
 -

 Key: SPARK-2342
 URL: https://issues.apache.org/jira/browse/SPARK-2342
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yijie Shen
Priority: Minor
  Labels: easyfix

 In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala,
 {code}protected final def n2(i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any{code}
 is intended to do computations for Numeric Add/Minus/Multiply.
 Just as the comment suggests: {quote}Those expressions are supposed to be in 
 the same data type, and also the return type.{quote}
 But in the code, function f is cast to the function signature:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
 I think this is a typo and the correct signature should be:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}
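
For reference, a standalone toy illustration (not Catalyst's actual code) of the corrected shape, where the helper returns the numeric type it computes with instead of Int:

{code}
// Toy n2-style helper: the return type matches the operands' numeric type.
def n2[T](x: T, y: T, f: (Numeric[T], T, T) => T)(implicit num: Numeric[T]): T =
  f(num, x, y)

val sum = n2(1.5, 2.25, (n: Numeric[Double], a: Double, b: Double) => n.plus(a, b))
println(sum)  // 3.75 -- would be lossy if the helper were typed to return Int
{code}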



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050982#comment-14050982
 ] 

Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:51 AM:
---

[~marmbrus] Fix the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.


was (Author: yijieshen):
Fix the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.

 Evaluation helper's output type doesn't conform to input type
 -

 Key: SPARK-2342
 URL: https://issues.apache.org/jira/browse/SPARK-2342
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yijie Shen
Priority: Minor
  Labels: easyfix

 In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala,
 {code}protected final def n2(i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any{code}
 is intended to do computations for Numeric Add/Minus/Multiply.
 Just as the comment suggests: {quote}Those expressions are supposed to be in 
 the same data type, and also the return type.{quote}
 But in the code, function f is cast to the function signature:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
 I think this is a typo and the correct signature should be:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type

2014-07-02 Thread Yijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050982#comment-14050982
 ] 

Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:52 AM:
---

[~marmbrus], I fix the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.


was (Author: yijieshen):
[~marmbrus] I fix the typo in PR: https://github.com/apache/spark/pull/1283.
Please check it, thanks.

 Evaluation helper's output type doesn't conform to input type
 -

 Key: SPARK-2342
 URL: https://issues.apache.org/jira/browse/SPARK-2342
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Yijie Shen
Priority: Minor
  Labels: easyfix

 In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala,
 {code}protected final def n2(i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any{code}
 is intended to do computations for Numeric Add/Minus/Multiply.
 Just as the comment suggests: {quote}Those expressions are supposed to be in 
 the same data type, and also the return type.{quote}
 But in the code, function f is cast to the function signature:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code}
 I think this is a typo and the correct signature should be:
 {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2351) Add Artificial Neural Network (ANN) to Spark

2014-07-02 Thread Bert Greevenbosch (JIRA)
Bert Greevenbosch created SPARK-2351:


 Summary: Add Artificial Neural Network (ANN) to Spark
 Key: SPARK-2351
 URL: https://issues.apache.org/jira/browse/SPARK-2351
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
 Environment: MLLIB code
Reporter: Bert Greevenbosch


It would be good if the Machine Learning Library contained Artificial Neural 
Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2352) Add Artificial Neural Network (ANN) to Spark

2014-07-02 Thread Bert Greevenbosch (JIRA)
Bert Greevenbosch created SPARK-2352:


 Summary: Add Artificial Neural Network (ANN) to Spark
 Key: SPARK-2352
 URL: https://issues.apache.org/jira/browse/SPARK-2352
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
 Environment: MLLIB code
Reporter: Bert Greevenbosch


It would be good if the Machine Learning Library contained Artificial Neural 
Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-2351) Add Artificial Neural Network (ANN) to Spark

2014-07-02 Thread Bert Greevenbosch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bert Greevenbosch closed SPARK-2351.


Resolution: Duplicate

Duplicate of SPARK-2352.

 Add Artificial Neural Network (ANN) to Spark
 

 Key: SPARK-2351
 URL: https://issues.apache.org/jira/browse/SPARK-2351
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
 Environment: MLLIB code
Reporter: Bert Greevenbosch

 It would be good if the Machine Learning Library contained Artificial Neural 
 Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.2#6252)