[jira] [Commented] (SPARK-5585) Flaky test: Python regression

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304729#comment-14304729
 ] 

Apache Spark commented on SPARK-5585:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4358

> Flaky test: Python regression
> -
>
> Key: SPARK-5585
> URL: https://issues.apache.org/jira/browse/SPARK-5585
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>Priority: Critical
>  Labels: flaky-test
>
> Hey [~davies] any chance you can take a look at this? The master build is 
> having random python failures fairly often. Not quite sure what is going on:
> {code}
> 0inputs+128outputs (0major+13320minor)pagefaults 0swaps
> Run mllib tests ...
> Running test: pyspark/mllib/classification.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.43user 0.12system 0:14.85elapsed 3%CPU (0avgtext+0avgdata 94272maxresident)k
> 0inputs+280outputs (0major+12627minor)pagefaults 0swaps
> Running test: pyspark/mllib/clustering.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.35user 0.11system 0:12.63elapsed 3%CPU (0avgtext+0avgdata 93568maxresident)k
> 0inputs+88outputs (0major+12532minor)pagefaults 0swaps
> Running test: pyspark/mllib/feature.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.28user 0.08system 0:05.73elapsed 6%CPU (0avgtext+0avgdata 93424maxresident)k
> 0inputs+32outputs (0major+12548minor)pagefaults 0swaps
> Running test: pyspark/mllib/linalg.py
> 0.16user 0.05system 0:00.22elapsed 98%CPU (0avgtext+0avgdata 
> 89888maxresident)k
> 0inputs+0outputs (0major+8099minor)pagefaults 0swaps
> Running test: pyspark/mllib/rand.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.25user 0.08system 0:05.42elapsed 6%CPU (0avgtext+0avgdata 87872maxresident)k
> 0inputs+0outputs (0major+11849minor)pagefaults 0swaps
> Running test: pyspark/mllib/recommendation.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.32user 0.09system 0:11.42elapsed 3%CPU (0avgtext+0avgdata 94256maxresident)k
> 0inputs+32outputs (0major+11797minor)pagefaults 0swaps
> Running test: pyspark/mllib/regression.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.53user 0.17system 0:23.53elapsed 3%CPU (0avgtext+0avgdata 99600maxresident)k
> 0inputs+48outputs (0major+12402minor)pagefaults 0swaps
> Running test: pyspark/mllib/stat/_statistics.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.29user 0.09system 0:08.03elapsed 4%CPU (0avgtext+0avgdata 92656maxresident)k
> 0inputs+48outputs (0major+12508minor)pagefaults 0swaps
> Running test: pyspark/mllib/tree.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.57user 0.16system 0:25.30elapsed 2%CPU (0avgtext+0avgdata 94400maxresident)k
> 0inputs+144outputs (0major+12600minor)pagefaults 0swaps
> Running test: pyspark/mllib/util.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.20user 0.06system 0:08.08elapsed 3%CPU (0avgtext+0avgdata 92768maxresident)k
> 0inputs+56outputs (0major+12474minor)pagefaults 0swaps
> Running test: pyspark/mllib/tests.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> .F/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> ./usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarnin

[jira] [Commented] (SPARK-5585) Flaky test: Python regression

2015-02-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304724#comment-14304724
 ] 

Davies Liu commented on SPARK-5585:
---

[~pwendell] I cannot reproduce it locally; I will add a seed for it and test it 
several times in Jenkins.
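
As a rough illustration of that kind of change, here is a minimal sketch assuming 
the flaky test draws its input data from Python's and NumPy's random generators 
(the test name and data below are hypothetical, not the actual MLlib test code):

{code}
import random
import unittest

import numpy as np


class RegressionDataTest(unittest.TestCase):
    """Illustrative only: pin the RNGs so data generation is reproducible."""

    def setUp(self):
        # Fixing the seeds makes a failure replayable across Jenkins runs,
        # instead of depending on whatever random data happened to be drawn.
        random.seed(42)
        np.random.seed(42)

    def test_fitted_slope_is_close_to_truth(self):
        # Hypothetical stand-in for the real MLlib regression test body.
        x = np.random.rand(100)
        y = 3.0 * x + np.random.normal(scale=0.01, size=100)
        slope, _ = np.polyfit(x, y, 1)
        self.assertAlmostEqual(slope, 3.0, places=1)


if __name__ == "__main__":
    unittest.main()
{code}

Pinning the seeds turns a random Jenkins failure into one that can be replayed 
locally.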

> Flaky test: Python regression
> -
>
> Key: SPARK-5585
> URL: https://issues.apache.org/jira/browse/SPARK-5585
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>Priority: Critical
>  Labels: flaky-test
>
> Hey [~davies] any chance you can take a look at this? The master build is 
> having random python failures fairly often. Not quite sure what is going on:
> {code}
> 0inputs+128outputs (0major+13320minor)pagefaults 0swaps
> Run mllib tests ...
> Running test: pyspark/mllib/classification.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.43user 0.12system 0:14.85elapsed 3%CPU (0avgtext+0avgdata 94272maxresident)k
> 0inputs+280outputs (0major+12627minor)pagefaults 0swaps
> Running test: pyspark/mllib/clustering.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.35user 0.11system 0:12.63elapsed 3%CPU (0avgtext+0avgdata 93568maxresident)k
> 0inputs+88outputs (0major+12532minor)pagefaults 0swaps
> Running test: pyspark/mllib/feature.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.28user 0.08system 0:05.73elapsed 6%CPU (0avgtext+0avgdata 93424maxresident)k
> 0inputs+32outputs (0major+12548minor)pagefaults 0swaps
> Running test: pyspark/mllib/linalg.py
> 0.16user 0.05system 0:00.22elapsed 98%CPU (0avgtext+0avgdata 
> 89888maxresident)k
> 0inputs+0outputs (0major+8099minor)pagefaults 0swaps
> Running test: pyspark/mllib/rand.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.25user 0.08system 0:05.42elapsed 6%CPU (0avgtext+0avgdata 87872maxresident)k
> 0inputs+0outputs (0major+11849minor)pagefaults 0swaps
> Running test: pyspark/mllib/recommendation.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.32user 0.09system 0:11.42elapsed 3%CPU (0avgtext+0avgdata 94256maxresident)k
> 0inputs+32outputs (0major+11797minor)pagefaults 0swaps
> Running test: pyspark/mllib/regression.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.53user 0.17system 0:23.53elapsed 3%CPU (0avgtext+0avgdata 99600maxresident)k
> 0inputs+48outputs (0major+12402minor)pagefaults 0swaps
> Running test: pyspark/mllib/stat/_statistics.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.29user 0.09system 0:08.03elapsed 4%CPU (0avgtext+0avgdata 92656maxresident)k
> 0inputs+48outputs (0major+12508minor)pagefaults 0swaps
> Running test: pyspark/mllib/tree.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.57user 0.16system 0:25.30elapsed 2%CPU (0avgtext+0avgdata 94400maxresident)k
> 0inputs+144outputs (0major+12600minor)pagefaults 0swaps
> Running test: pyspark/mllib/util.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.20user 0.06system 0:08.08elapsed 3%CPU (0avgtext+0avgdata 92768maxresident)k
> 0inputs+56outputs (0major+12474minor)pagefaults 0swaps
> Running test: pyspark/mllib/tests.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> .F/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> ./usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarnin

[jira] [Commented] (SPARK-5587) Support change database owner

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304721#comment-14304721
 ] 

Apache Spark commented on SPARK-5587:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/4357

> Support change database owner 
> --
>
> Key: SPARK-5587
> URL: https://issues.apache.org/jira/browse/SPARK-5587
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: wangfei
>
> Support changing the database owner:
> create database db_alter_onr;
> describe database db_alter_onr;
> alter database db_alter_onr set owner user user1;
> describe database db_alter_onr;
> alter database db_alter_onr set owner role role1;
> describe database db_alter_onr;






[jira] [Created] (SPARK-5587) Support change database owner

2015-02-03 Thread wangfei (JIRA)
wangfei created SPARK-5587:
--

 Summary: Support change database owner 
 Key: SPARK-5587
 URL: https://issues.apache.org/jira/browse/SPARK-5587
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei


Support changing the database owner:
create database db_alter_onr;
describe database db_alter_onr;

alter database db_alter_onr set owner user user1;
describe database db_alter_onr;

alter database db_alter_onr set owner role role1;
describe database db_alter_onr;
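
For context, a sketch of how these statements could be exercised from Spark SQL 
once the feature is in place, assuming the PySpark HiveContext API (the database, 
user, and role names are taken from the example above; this is illustrative, not 
the proposed implementation):

{code}
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="alter-db-owner-example")
sqlContext = HiveContext(sc)

# Once ALTER DATABASE ... SET OWNER is supported, the statements from the
# description can be issued directly against the Hive metastore:
sqlContext.sql("CREATE DATABASE db_alter_onr")
print(sqlContext.sql("DESCRIBE DATABASE db_alter_onr").collect())
sqlContext.sql("ALTER DATABASE db_alter_onr SET OWNER USER user1")
print(sqlContext.sql("DESCRIBE DATABASE db_alter_onr").collect())
sqlContext.sql("ALTER DATABASE db_alter_onr SET OWNER ROLE role1")
print(sqlContext.sql("DESCRIBE DATABASE db_alter_onr").collect())
{code}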







[jira] [Created] (SPARK-5586) Automatically provide sqlContext in Spark shell

2015-02-03 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5586:
--

 Summary: Automatically provide sqlContext in Spark shell
 Key: SPARK-5586
 URL: https://issues.apache.org/jira/browse/SPARK-5586
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell, SQL
Reporter: Patrick Wendell
Assignee: Patrick Wendell
 Fix For: 1.3.0


A simple patch, but we should create a sqlContext (and, if supported by the 
build, a Hive context) in the Spark shell when it's created, and import the 
DSL. We can just call it sqlContext. This would save us so much time writing 
code examples :P
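
For the Python shell, a minimal sketch of the idea: build the context once at 
startup and fall back to a plain SQLContext when Hive support is not available. 
This is illustrative only, not the actual shell startup code, and the try/except 
merely sketches the intent of probing for Hive support:

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

sc = SparkContext(appName="shell")

# Expose a ready-made sqlContext in the shell: prefer a HiveContext when the
# build includes Hive support, otherwise fall back to a plain SQLContext.
try:
    sqlContext = HiveContext(sc)
except Exception:
    sqlContext = SQLContext(sc)

# With the context pre-created, code examples can start with sqlContext.sql(...)
# instead of constructing a context by hand every time.
{code}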






[jira] [Updated] (SPARK-5586) Automatically provide sqlContext in Spark shell

2015-02-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5586:
---
Fix Version/s: (was: 1.3.0)

> Automatically provide sqlContext in Spark shell
> ---
>
> Key: SPARK-5586
> URL: https://issues.apache.org/jira/browse/SPARK-5586
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SQL
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>
> A simple patch, but we should create a sqlContext (and, if supported by the 
> build, a Hive context) in the Spark shell when it's created, and import the 
> DSL. We can just call it sqlContext. This would save us so much time writing 
> code examples :P






[jira] [Updated] (SPARK-5586) Automatically provide sqlContext in Spark shell

2015-02-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5586:
---
Priority: Critical  (was: Major)

> Automatically provide sqlContext in Spark shell
> ---
>
> Key: SPARK-5586
> URL: https://issues.apache.org/jira/browse/SPARK-5586
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SQL
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>
> A simple patch, but we should create a sqlContext (and, if supported by the 
> build, a Hive context) in the Spark shell when it's created, and import the 
> DSL. We can just call it sqlContext. This would save us so much time writing 
> code examples :P






[jira] [Commented] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304710#comment-14304710
 ] 

Apache Spark commented on SPARK-5068:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/4356

> When the path not found in the hdfs,we can't get the result
> ---
>
> Key: SPARK-5068
> URL: https://issues.apache.org/jira/browse/SPARK-5068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: jeanlyn
>
> When the partition path exists in the metastore but not in HDFS, it causes 
> problems like the following:
> {noformat}
> hive> show partitions partition_test;
> OK
> dt=1
> dt=2
> dt=3
> dt=4
> Time taken: 0.168 seconds, Fetched: 4 row(s)
> {noformat}
> {noformat}
> hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
> Found 3 items
> drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
> /user/jeanlyn/warehouse/partition_test/dt=1
> drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
> /user/jeanlyn/warehouse/partition_test/dt=3
> drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
> /user/jeanlyn/warehouse/partition_test/dt=4
> {noformat}
> When I run the SQL
> {noformat}
> select * from partition_test limit 10
> {noformat}
> in *hive* there is no problem, but when I run it in *spark-sql* I get the 
> following error:
> {noformat}
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: 
> Input path does not exist: 
> hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
> at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
> at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
> at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
> at org.apache.spark.sql.hive.testpartition.main(test.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.intellij.rt.execu
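
A minimal PySpark repro sketch for the scenario above, assuming a HiveContext and 
the partition_test table from the description (illustrative only; the exception 
surfaces when Spark computes the input splits for the scan):

{code}
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="missing-partition-repro")
sqlContext = HiveContext(sc)

# Partitions dt=1..4 are registered in the metastore, but the dt=2 directory
# has been deleted from HDFS. Hive tolerates the missing directory; Spark SQL
# fails with InvalidInputException while listing the input paths.
rows = sqlContext.sql("SELECT * FROM partition_test LIMIT 10").collect()
for row in rows:
    print(row)
{code}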

[jira] [Resolved] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit

2015-02-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-5341.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Support maven coordinates in spark-shell and spark-submit
> -
>
> Key: SPARK-5341
> URL: https://issues.apache.org/jira/browse/SPARK-5341
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Shell
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Critical
> Fix For: 1.3.0
>
>
> This feature will allow users to provide the Maven coordinates of jars they 
> wish to use in their Spark application. Coordinates can be supplied as a 
> comma-delimited list, e.g.:
> ```spark-submit --maven org.apache.example.a,org.apache.example.b```
> This feature will also be added to spark-shell, where it is even more 
> important to have it.






[jira] [Updated] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit

2015-02-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5341:
---
Assignee: Burak Yavuz

> Support maven coordinates in spark-shell and spark-submit
> -
>
> Key: SPARK-5341
> URL: https://issues.apache.org/jira/browse/SPARK-5341
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Shell
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Critical
> Fix For: 1.3.0
>
>
> This feature will allow users to provide the Maven coordinates of jars they 
> wish to use in their Spark application. Coordinates can be supplied as a 
> comma-delimited list, e.g.:
> ```spark-submit --maven org.apache.example.a,org.apache.example.b```
> This feature will also be added to spark-shell, where it is even more 
> important to have it.






[jira] [Resolved] (SPARK-4969) Add binaryRecords support to streaming

2015-02-03 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4969.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

> Add binaryRecords support to streaming
> --
>
> Key: SPARK-4969
> URL: https://issues.apache.org/jira/browse/SPARK-4969
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Streaming
>Affects Versions: 1.2.0
>Reporter: Jeremy Freeman
>Priority: Minor
> Fix For: 1.3.0
>
>
> As of Spark 1.2 there is support for loading fixed length records from flat 
> binary files. This is a useful way to load dense numerical array data into 
> Spark, especially in scientific computing applications.
> We should add support for loading this same file type in Spark Streaming, 
> both in Scala/Java and in Python. 
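
A sketch of what this could look like from Python, pairing the existing 
fixed-length batch API with the proposed streaming counterpart that monitors a 
directory (the paths and the 16-byte record length are illustrative):

{code}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="binary-records-example")
recordLength = 16  # illustrative fixed record size in bytes

# Batch API referenced above: each element is a byte string of exactly
# recordLength bytes, read from the flat binary files under the path.
records = sc.binaryRecords("hdfs:///data/fixed-width", recordLength)
print(records.count())

# Streaming counterpart proposed here: each new binary file dropped into the
# directory is split into fixed-length records as it arrives.
ssc = StreamingContext(sc, 10)
stream = ssc.binaryRecordsStream("hdfs:///data/incoming", recordLength)
stream.count().pprint()

ssc.start()
ssc.awaitTermination()
{code}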






[jira] [Updated] (SPARK-5585) Flaky test: Python regression

2015-02-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5585:
---
Priority: Critical  (was: Major)

> Flaky test: Python regression
> -
>
> Key: SPARK-5585
> URL: https://issues.apache.org/jira/browse/SPARK-5585
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>Priority: Critical
>  Labels: flaky-test
>
> Hey [~davies] any chance you can take a look at this? The master build is 
> having random python failures fairly often. Not quite sure what is going on:
> {code}
> 0inputs+128outputs (0major+13320minor)pagefaults 0swaps
> Run mllib tests ...
> Running test: pyspark/mllib/classification.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.43user 0.12system 0:14.85elapsed 3%CPU (0avgtext+0avgdata 94272maxresident)k
> 0inputs+280outputs (0major+12627minor)pagefaults 0swaps
> Running test: pyspark/mllib/clustering.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.35user 0.11system 0:12.63elapsed 3%CPU (0avgtext+0avgdata 93568maxresident)k
> 0inputs+88outputs (0major+12532minor)pagefaults 0swaps
> Running test: pyspark/mllib/feature.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.28user 0.08system 0:05.73elapsed 6%CPU (0avgtext+0avgdata 93424maxresident)k
> 0inputs+32outputs (0major+12548minor)pagefaults 0swaps
> Running test: pyspark/mllib/linalg.py
> 0.16user 0.05system 0:00.22elapsed 98%CPU (0avgtext+0avgdata 
> 89888maxresident)k
> 0inputs+0outputs (0major+8099minor)pagefaults 0swaps
> Running test: pyspark/mllib/rand.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.25user 0.08system 0:05.42elapsed 6%CPU (0avgtext+0avgdata 87872maxresident)k
> 0inputs+0outputs (0major+11849minor)pagefaults 0swaps
> Running test: pyspark/mllib/recommendation.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.32user 0.09system 0:11.42elapsed 3%CPU (0avgtext+0avgdata 94256maxresident)k
> 0inputs+32outputs (0major+11797minor)pagefaults 0swaps
> Running test: pyspark/mllib/regression.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.53user 0.17system 0:23.53elapsed 3%CPU (0avgtext+0avgdata 99600maxresident)k
> 0inputs+48outputs (0major+12402minor)pagefaults 0swaps
> Running test: pyspark/mllib/stat/_statistics.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.29user 0.09system 0:08.03elapsed 4%CPU (0avgtext+0avgdata 92656maxresident)k
> 0inputs+48outputs (0major+12508minor)pagefaults 0swaps
> Running test: pyspark/mllib/tree.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.57user 0.16system 0:25.30elapsed 2%CPU (0avgtext+0avgdata 94400maxresident)k
> 0inputs+144outputs (0major+12600minor)pagefaults 0swaps
> Running test: pyspark/mllib/util.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.20user 0.06system 0:08.08elapsed 3%CPU (0avgtext+0avgdata 92768maxresident)k
> 0inputs+56outputs (0major+12474minor)pagefaults 0swaps
> Running test: pyspark/mllib/tests.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> .F/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> ./usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is de

[jira] [Updated] (SPARK-5585) Flaky test: Python regression

2015-02-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5585:
---
Labels: flaky-test  (was: )

> Flaky test: Python regression
> -
>
> Key: SPARK-5585
> URL: https://issues.apache.org/jira/browse/SPARK-5585
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>Priority: Critical
>  Labels: flaky-test
>
> Hey [~davies] any chance you can take a look at this? The master build is 
> having random python failures fairly often. Not quite sure what is going on:
> {code}
> 0inputs+128outputs (0major+13320minor)pagefaults 0swaps
> Run mllib tests ...
> Running test: pyspark/mllib/classification.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.43user 0.12system 0:14.85elapsed 3%CPU (0avgtext+0avgdata 94272maxresident)k
> 0inputs+280outputs (0major+12627minor)pagefaults 0swaps
> Running test: pyspark/mllib/clustering.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.35user 0.11system 0:12.63elapsed 3%CPU (0avgtext+0avgdata 93568maxresident)k
> 0inputs+88outputs (0major+12532minor)pagefaults 0swaps
> Running test: pyspark/mllib/feature.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.28user 0.08system 0:05.73elapsed 6%CPU (0avgtext+0avgdata 93424maxresident)k
> 0inputs+32outputs (0major+12548minor)pagefaults 0swaps
> Running test: pyspark/mllib/linalg.py
> 0.16user 0.05system 0:00.22elapsed 98%CPU (0avgtext+0avgdata 
> 89888maxresident)k
> 0inputs+0outputs (0major+8099minor)pagefaults 0swaps
> Running test: pyspark/mllib/rand.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.25user 0.08system 0:05.42elapsed 6%CPU (0avgtext+0avgdata 87872maxresident)k
> 0inputs+0outputs (0major+11849minor)pagefaults 0swaps
> Running test: pyspark/mllib/recommendation.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.32user 0.09system 0:11.42elapsed 3%CPU (0avgtext+0avgdata 94256maxresident)k
> 0inputs+32outputs (0major+11797minor)pagefaults 0swaps
> Running test: pyspark/mllib/regression.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.53user 0.17system 0:23.53elapsed 3%CPU (0avgtext+0avgdata 99600maxresident)k
> 0inputs+48outputs (0major+12402minor)pagefaults 0swaps
> Running test: pyspark/mllib/stat/_statistics.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.29user 0.09system 0:08.03elapsed 4%CPU (0avgtext+0avgdata 92656maxresident)k
> 0inputs+48outputs (0major+12508minor)pagefaults 0swaps
> Running test: pyspark/mllib/tree.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.57user 0.16system 0:25.30elapsed 2%CPU (0avgtext+0avgdata 94400maxresident)k
> 0inputs+144outputs (0major+12600minor)pagefaults 0swaps
> Running test: pyspark/mllib/util.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.20user 0.06system 0:08.08elapsed 3%CPU (0avgtext+0avgdata 92768maxresident)k
> 0inputs+56outputs (0major+12474minor)pagefaults 0swaps
> Running test: pyspark/mllib/tests.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> .F/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> ./usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is depreca

[jira] [Created] (SPARK-5585) Flaky test: Python regression

2015-02-03 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5585:
--

 Summary: Flaky test: Python regression
 Key: SPARK-5585
 URL: https://issues.apache.org/jira/browse/SPARK-5585
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Patrick Wendell
Assignee: Davies Liu


Hey [~davies] any chance you can take a look at this? The master build is 
having random python failures fairly often. Not quite sure what is going on:

{code}
0inputs+128outputs (0major+13320minor)pagefaults 0swaps
Run mllib tests ...
Running test: pyspark/mllib/classification.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.43user 0.12system 0:14.85elapsed 3%CPU (0avgtext+0avgdata 94272maxresident)k
0inputs+280outputs (0major+12627minor)pagefaults 0swaps
Running test: pyspark/mllib/clustering.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.35user 0.11system 0:12.63elapsed 3%CPU (0avgtext+0avgdata 93568maxresident)k
0inputs+88outputs (0major+12532minor)pagefaults 0swaps
Running test: pyspark/mllib/feature.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.28user 0.08system 0:05.73elapsed 6%CPU (0avgtext+0avgdata 93424maxresident)k
0inputs+32outputs (0major+12548minor)pagefaults 0swaps
Running test: pyspark/mllib/linalg.py
0.16user 0.05system 0:00.22elapsed 98%CPU (0avgtext+0avgdata 89888maxresident)k
0inputs+0outputs (0major+8099minor)pagefaults 0swaps
Running test: pyspark/mllib/rand.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.25user 0.08system 0:05.42elapsed 6%CPU (0avgtext+0avgdata 87872maxresident)k
0inputs+0outputs (0major+11849minor)pagefaults 0swaps
Running test: pyspark/mllib/recommendation.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.32user 0.09system 0:11.42elapsed 3%CPU (0avgtext+0avgdata 94256maxresident)k
0inputs+32outputs (0major+11797minor)pagefaults 0swaps
Running test: pyspark/mllib/regression.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.53user 0.17system 0:23.53elapsed 3%CPU (0avgtext+0avgdata 99600maxresident)k
0inputs+48outputs (0major+12402minor)pagefaults 0swaps
Running test: pyspark/mllib/stat/_statistics.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.29user 0.09system 0:08.03elapsed 4%CPU (0avgtext+0avgdata 92656maxresident)k
0inputs+48outputs (0major+12508minor)pagefaults 0swaps
Running test: pyspark/mllib/tree.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.57user 0.16system 0:25.30elapsed 2%CPU (0avgtext+0avgdata 94400maxresident)k
0inputs+144outputs (0major+12600minor)pagefaults 0swaps
Running test: pyspark/mllib/util.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
0.20user 0.06system 0:08.08elapsed 3%CPU (0avgtext+0avgdata 92768maxresident)k
0inputs+56outputs (0major+12474minor)pagefaults 0swaps
Running test: pyspark/mllib/tests.py
tput: No value for $TERM and no -T specified
Spark assembly has been built with Hive, including Datanucleus jars on classpath
.F/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)
./usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)
/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)
/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)
/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)
./usr/lib64/python2.6/site-packages/numpy/lib/uti

[jira] [Updated] (SPARK-5585) Flaky test: Python regression

2015-02-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5585:
---
Affects Version/s: 1.3.0

> Flaky test: Python regression
> -
>
> Key: SPARK-5585
> URL: https://issues.apache.org/jira/browse/SPARK-5585
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>Priority: Critical
>  Labels: flaky-test
>
> Hey [~davies] any chance you can take a look at this? The master build is 
> having random python failures fairly often. Not quite sure what is going on:
> {code}
> 0inputs+128outputs (0major+13320minor)pagefaults 0swaps
> Run mllib tests ...
> Running test: pyspark/mllib/classification.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.43user 0.12system 0:14.85elapsed 3%CPU (0avgtext+0avgdata 94272maxresident)k
> 0inputs+280outputs (0major+12627minor)pagefaults 0swaps
> Running test: pyspark/mllib/clustering.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.35user 0.11system 0:12.63elapsed 3%CPU (0avgtext+0avgdata 93568maxresident)k
> 0inputs+88outputs (0major+12532minor)pagefaults 0swaps
> Running test: pyspark/mllib/feature.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.28user 0.08system 0:05.73elapsed 6%CPU (0avgtext+0avgdata 93424maxresident)k
> 0inputs+32outputs (0major+12548minor)pagefaults 0swaps
> Running test: pyspark/mllib/linalg.py
> 0.16user 0.05system 0:00.22elapsed 98%CPU (0avgtext+0avgdata 
> 89888maxresident)k
> 0inputs+0outputs (0major+8099minor)pagefaults 0swaps
> Running test: pyspark/mllib/rand.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.25user 0.08system 0:05.42elapsed 6%CPU (0avgtext+0avgdata 87872maxresident)k
> 0inputs+0outputs (0major+11849minor)pagefaults 0swaps
> Running test: pyspark/mllib/recommendation.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.32user 0.09system 0:11.42elapsed 3%CPU (0avgtext+0avgdata 94256maxresident)k
> 0inputs+32outputs (0major+11797minor)pagefaults 0swaps
> Running test: pyspark/mllib/regression.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.53user 0.17system 0:23.53elapsed 3%CPU (0avgtext+0avgdata 99600maxresident)k
> 0inputs+48outputs (0major+12402minor)pagefaults 0swaps
> Running test: pyspark/mllib/stat/_statistics.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.29user 0.09system 0:08.03elapsed 4%CPU (0avgtext+0avgdata 92656maxresident)k
> 0inputs+48outputs (0major+12508minor)pagefaults 0swaps
> Running test: pyspark/mllib/tree.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.57user 0.16system 0:25.30elapsed 2%CPU (0avgtext+0avgdata 94400maxresident)k
> 0inputs+144outputs (0major+12600minor)pagefaults 0swaps
> Running test: pyspark/mllib/util.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> 0.20user 0.06system 0:08.08elapsed 3%CPU (0avgtext+0avgdata 92768maxresident)k
> 0inputs+56outputs (0major+12474minor)pagefaults 0swaps
> Running test: pyspark/mllib/tests.py
> tput: No value for $TERM and no -T specified
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> .F/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> ./usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or 
> function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
>   VisibleDeprecationWarning)
> /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: 
> VisibleDeprecationWarning: `rank` is deprecated

[jira] [Updated] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5529:
---
Description: 
When I run a Spark job, one executor hangs. After 120s its BlockManager is 
removed by the driver, but it takes another half hour before the executor itself 
is removed by the driver. Here is the log:
{code}
15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
exceeds 120000ms

15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
10.215.143.14: remote Akka client disassociated
15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
now gated for [5000] ms. Reason is: [Disassociated].
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
10.215.143.14): ExecutorLostFailure (executor 1 lost)
15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
non-existent executor 1
15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
from BlockManagerMaster.
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
{code}

  was:
When I run a Spark job, one executor hangs. After 120s its BlockManager is 
removed by the driver, but it takes another half hour before the executor itself 
is removed by the driver. Here is the log:
15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
exceeds 120000ms

15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
10.215.143.14: remote Akka client disassociated
15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
now gated for [5000] ms. Reason is: [Disassociated].
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
10.215.143.14): ExecutorLostFailure (executor 1 lost)
15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
non-existent executor 1
15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
from BlockManagerMaster.
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor



> Executor is still hold while BlockManager has been removed
> --
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> When I run a Spark job, one executor hangs. After 120s its BlockManager is 
> removed by the driver, but it takes another half hour before the executor 
> itself is removed by the driver. Here is the log:
> {code}
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 120000ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
> {code}






[jira] [Closed] (SPARK-5583) Support unique join in hive context

2015-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-5583.
--
Resolution: Won't Fix

Going to close this one as won't fix since it is a weird syntax that only Hive 
has (and as far as I know not that many Hive users know about it).

The patch is already on GitHub. We can merge that in the future if there is 
strong demand. 

> Support unique join in hive context
> ---
>
> Key: SPARK-5583
> URL: https://issues.apache.org/jira/browse/SPARK-5583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: wangfei
>
> Support unique join in hive context:
> FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c 
> (c.key)
> SELECT a.key, b.key, c.key;






[jira] [Commented] (SPARK-5475) Java 8 tests are like maintenance overhead.

2015-02-03 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304662#comment-14304662
 ] 

Prashant Sharma commented on SPARK-5475:


And this is how it looks after running it in Maven, with the command:
{noformat}
build/mvn clean install -Pjava8-tests -DskipTests -T 6
{noformat}

http://pastebin.com/SxeHUpEY

> Java 8 tests are like maintenance overhead. 
> 
>
> Key: SPARK-5475
> URL: https://issues.apache.org/jira/browse/SPARK-5475
> Project: Spark
>  Issue Type: Bug
>Reporter: Prashant Sharma
>
> Having tests that validate that the same code is compatible with Java 8 and 
> Java 7 is like asserting that Java 8 is backward compatible with Java 7 while 
> still supporting Java 8 features (lambda expressions, to be precise). This was 
> once necessary because ASM was not compatible with Java 8.
> Running java8-tests on the current code base results in more than 100 
> compilation errors; it feels as if they are never run, given that these 
> compilation errors have existed for a pretty long period. So IMHO we should 
> really remove them if we don't plan to maintain them.
> Thoughts?






[jira] [Updated] (SPARK-5584) Add Maven Enforcer Plugin dependencyConvergence rule (fail false)

2015-02-03 Thread Markus Dale (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Dale updated SPARK-5584:
---
Description: 
The Spark Maven build uses the Maven Enforcer plugin but does not have a rule 
for dependencyConvergence (no version conflicts between dependencies/transitive 
dependencies). 

Adding the dependencyConvergence rule to the maven-enforcer-plugin configuration 
in the main pom.xml of the current 1.3.0-SNAPSHOT:

{noformat}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.3.1</version>
  <executions>
    <execution>
      <id>enforce-versions</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requireMavenVersion>
            <version>3.0.4</version>
          </requireMavenVersion>
          <requireJavaVersion>
            <version>${java.version}</version>
          </requireJavaVersion>
          <dependencyConvergence/>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
{noformat}

And running with:
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package -Denforcer.fail=false 
&> output.txt

identified a lot of dependency convergence problems (one of which re-opens 
SPARK-3039 and was fixed by excluding the transitive dependency and explicitly 
including the desired version of the library).

Many convergence errors like:

Dependency convergence error for com.thoughtworks.paranamer:paranamer:2.3 paths 
to dependency are:
+-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT
  +-org.apache.hadoop:hadoop-client:2.4.0
+-org.apache.hadoop:hadoop-common:2.4.0
  +-org.apache.avro:avro:1.7.6
+-com.thoughtworks.paranamer:paranamer:2.3
and
+-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT
  +-org.json4s:json4s-jackson_2.10:3.2.10
+-org.json4s:json4s-core_2.10:3.2.10
  +-com.thoughtworks.paranamer:paranamer:2.6

[WARNING] 
Dependency convergence error for io.netty:netty:3.8.0.Final paths to dependency 
are:
+-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT
  +-org.spark-project.akka:akka-remote_2.10:2.3.4-spark
+-io.netty:netty:3.8.0.Final
and
+-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT
  +-org.seleniumhq.selenium:selenium-java:2.42.2
+-org.webbitserver:webbit:0.4.14
  +-io.netty:netty:3.5.2.Final


  was:
The Spark Maven build uses the Maven Enforcer plugin but does not have a rule 
for dependencyConvergence (no version conflicts between dependencies/transitive 
dependencies). 

Adding the dependencyConvergence rule to the maven-enforcer-plugin configuration 
in the main pom.xml of the current 1.3.0-SNAPSHOT:

{noformat}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.3.1</version>
  <executions>
    <execution>
      <id>enforce-versions</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requireMavenVersion>
            <version>3.0.4</version>
          </requireMavenVersion>
          <requireJavaVersion>
            <version>${java.version}</version>
          </requireJavaVersion>
          <dependencyConvergence/>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
{noformat}

And running with:
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package -Denforcer.fail=false 
&> output.txt

identified a lot of dependency convergence problems (one of which re-opens 
SPARK-3039 and was fixed by excluding the transitive dependency and explicitly 
including the desired version of the library).


> Add Maven Enforcer Plugin dependencyConvergence rule (fail false)
> -
>
> Key: SPARK-5584
> URL: https://issues.apache.org/jira/browse/SPARK-5584
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Markus Dale
>Priority: Minor
>
> The Spark Maven build uses the Maven Enforcer plugin but does not have a rule 
> for dependencyConvergence (no version conflicts between 
> dependencies/transitive dependencies). 
> Adding the dependencyConvergence rule to the maven-enforcer-plugin 
> configuration in the main pom.xml of the current 1.3.0-SNAPSHOT:
> {noformat}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-enforcer-plugin</artifactId>
>   <version>1.3.1</version>
>   <executions>
>     <execution>
>       <id>enforce-versions</id>
>       <goals>
>         <goal>enforce</goal>
>       </goals>
>       <configuration>
>         <rules>
>           <requireMavenVersion>
>             <version>3.0.4</version>
>           </requireMavenVersion>
>           <requireJavaVersion>
>             <version>${java.version}</version>
>           </requireJavaVersion>
>           <dependencyConvergence/>
>         </rules>
>       </configuration>
>     </execution>
>   </executions>
> </plugin>
> {noformat}
> And running with:
> mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package 
> -Denforcer.fail=false &> output.txt
> identified a lot of dependency convergence problems (one of which re-opens 
> SPARK-3039 and was fixed by excluding the transitive dependency and explicitly 
> including the desired version of the library).
> Many convergence errors like:
> Dependency convergence error for com.thoughtworks.paranamer:paranamer:2.3 
> paths to dependency are:
> +-org.apache.spark:spark-core_2.10:1.3.

[jira] [Resolved] (SPARK-5237) UDTF don't work with multi-alias of multi-columns as output on SparK SQL

2015-02-03 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang resolved SPARK-5237.

Resolution: Duplicate

SPARK-5383 should solved this.

> UDTF don't work with multi-alias of multi-columns as output on SparK SQL
> 
>
> Key: SPARK-5237
> URL: https://issues.apache.org/jira/browse/SPARK-5237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yi Zhou
>
> Hive query with UDTF don't work on Spark SQL like below example
> SELECT extract_sentiment(pr_item_sk,pr_review_content) AS (pr_item_sk, 
> review_sentence, sentiment, sentiment_word)
> FROM product_reviews;






[jira] [Created] (SPARK-5584) Add Maven Enforcer Plugin dependencyConvergence rule (fail false)

2015-02-03 Thread Markus Dale (JIRA)
Markus Dale created SPARK-5584:
--

 Summary: Add Maven Enforcer Plugin dependencyConvergence rule 
(fail false)
 Key: SPARK-5584
 URL: https://issues.apache.org/jira/browse/SPARK-5584
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.2.0
Reporter: Markus Dale
Priority: Minor


The Spark Maven build uses the Maven Enforcer plugin but does not have a rule 
for dependencyConvergence (no version conflicts between dependencies/transitive 
dependencies). 

Adding the dependencyConvergence rule to the maven-enforcer-plugin configuration 
in the main pom.xml of the current 1.3.0-SNAPSHOT:

{noformat}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.3.1</version>
  <executions>
    <execution>
      <id>enforce-versions</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requireMavenVersion>
            <version>3.0.4</version>
          </requireMavenVersion>
          <requireJavaVersion>
            <version>${java.version}</version>
          </requireJavaVersion>
          <dependencyConvergence/>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
{noformat}

And running with:
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package -Denforcer.fail=false 
&> output.txt

identified a lot of dependency convergence problems (one of which re-opens 
SPARK-3039 and was fixed by excluding the transitive dependency and explicitly 
including the desired version of the library).






[jira] [Resolved] (SPARK-4795) Redesign the "primitive type => Writable" implicit APIs to make them be activated automatically

2015-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4795.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Shixiong Zhu

> Redesign the "primitive type => Writable" implicit APIs to make them be 
> activated automatically
> ---
>
> Key: SPARK-4795
> URL: https://issues.apache.org/jira/browse/SPARK-4795
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.3.0
>
>
> Redesign the "primitive type => Writable" implicit APIs so that they are 
> activated automatically, without breaking compatibility.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5583) Support unique join in hive context

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304624#comment-14304624
 ] 

Apache Spark commented on SPARK-5583:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/4354

> Support unique join in hive context
> ---
>
> Key: SPARK-5583
> URL: https://issues.apache.org/jira/browse/SPARK-5583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: wangfei
>
> Support unique join in hive context:
> FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c 
> (c.key)
> SELECT a.key, b.key, c.key;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5578) Provide a convenient way for Scala users to use UDFs

2015-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5578.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Provide a convenient way for Scala users to use UDFs
> 
>
> Key: SPARK-5578
> URL: https://issues.apache.org/jira/browse/SPARK-5578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Dsl.udf(...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5583) Support unique join in hive context

2015-02-03 Thread wangfei (JIRA)
wangfei created SPARK-5583:
--

 Summary: Support unique join in hive context
 Key: SPARK-5583
 URL: https://issues.apache.org/jira/browse/SPARK-5583
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei


Support unique join in hive context:

FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c 
(c.key)
SELECT a.key, b.key, c.key;
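
For context, a sketch of how this would be invoked from Scala once UNIQUEJOIN is supported. It assumes an existing SparkContext sc and tables T1, T2 and T3; it is illustrative only, not something that works today.

{code}
import org.apache.spark.sql.hive.HiveContext

// Illustrative only: the statement below is the unsupported syntax this issue asks for.
val hiveContext = new HiveContext(sc)
hiveContext.sql(
  """FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c (c.key)
    |SELECT a.key, b.key, c.key""".stripMargin).collect()
{code}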




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5367) support star expression in udf

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304611#comment-14304611
 ] 

Apache Spark commented on SPARK-5367:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/4353

> support star expression in udf
> --
>
> Key: SPARK-5367
> URL: https://issues.apache.org/jira/browse/SPARK-5367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> Spark SQL does not currently support star expressions in UDFs; the following 
> SQL fails:
> ```
> select concat( * ) from src
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5582) History server does not list anything if log root contains an empty directory

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304599#comment-14304599
 ] 

Apache Spark commented on SPARK-5582:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4352

> History server does not list anything if log root contains an empty directory
> -
>
> Key: SPARK-5582
> URL: https://issues.apache.org/jira/browse/SPARK-5582
> Project: Spark
>  Issue Type: Bug
>Reporter: Marcelo Vanzin
>
> As summary says. Exception from logs:
> {noformat}
> 15/02/03 17:35:10.292 
> pool-1-thread-1-ScalaTest-running-FsHistoryProviderSuite ERROR 
> FsHistoryProvider: Exception in checking for event log updates
> java.lang.UnsupportedOperationException: empty.max
> at 
> scala.collection.TraversableOnce$class.max(TraversableOnce.scala:216)
> at scala.collection.AbstractTraversable.max(Traversable.scala:105)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$getModificationTime(FsHistoryProvider.scala:315)
> {noformat}
> Patch coming up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Tor Myklebust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304588#comment-14304588
 ] 

Tor Myklebust commented on SPARK-5472:
--

If the data in the underlying table changes, this code might not work reliably; 
some partitions might have new data and others won't.  If you change the schema 
of the underlying table after you make it visible to Spark SQL, retrieving data 
will (probably) blow up.  Whatever behaviour you might observe from this code 
when given a changing underlying table will not be behaviour you can rely on.

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.
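
As a point of reference, a sketch of the kind of ad-hoc glue code described above, using the existing JdbcRDD from Spark core. The JDBC URL, table and column names are placeholders, and an existing SparkContext sc is assumed.

{code}
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Today the user supplies the connection factory, a bounded query and the
// ResultSet-to-row mapping by hand; the proposal is to derive this from the
// database schema instead.
val people = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://localhost/testdb"),
  "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
  lowerBound = 1, upperBound = 100000, numPartitions = 10,
  mapRow = (rs: ResultSet) => (rs.getInt("id"), rs.getString("name")))
people.count()
{code}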



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5582) History server does not list anything if log root contains an empty directory

2015-02-03 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-5582:
-

 Summary: History server does not list anything if log root 
contains an empty directory
 Key: SPARK-5582
 URL: https://issues.apache.org/jira/browse/SPARK-5582
 Project: Spark
  Issue Type: Bug
Reporter: Marcelo Vanzin


As summary says. Exception from logs:

{noformat}
15/02/03 17:35:10.292 pool-1-thread-1-ScalaTest-running-FsHistoryProviderSuite 
ERROR FsHistoryProvider: Exception in checking for event log updates
java.lang.UnsupportedOperationException: empty.max
at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:216)
at scala.collection.AbstractTraversable.max(Traversable.scala:105)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$getModificationTime(FsHistoryProvider.scala:315)
{noformat}

Patch coming up.
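
The failure comes from calling .max on an empty Scala collection. A minimal sketch of the guard pattern (not the actual patch):

{code}
// Return a sentinel when there are no log files instead of throwing
// UnsupportedOperationException("empty.max").
def latestModTime(times: Seq[Long]): Long =
  if (times.isEmpty) -1L else times.max

// Alternatively, stay total with an Option.
def latestModTimeOpt(times: Seq[Long]): Option[Long] =
  if (times.isEmpty) None else Some(times.max)
{code}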



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2440) Enable HistoryServer to display lots of Application History

2015-02-03 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-2440.
---
Resolution: Fixed

I'll mark this as fixed since the current history server doesn't have that 
limitation anymore.

> Enable HistoryServer to display lots of Application History
> ---
>
> Key: SPARK-2440
> URL: https://issues.apache.org/jira/browse/SPARK-2440
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Kousuke Saruta
>
> In the current implementation, HistoryServer displays 250 records by default.
> Sometimes we'd like to see more than 250 records and configure it to list 
> more, but the current implementation puts all the records on a single page, 
> which is not useful.
> To make matters worse, the initial launch of HistoryServer is very slow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-02-03 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304577#comment-14304577
 ] 

Corey J. Nolet commented on SPARK-5260:
---

I'm thinking all the schema-specific functions should be pulled out into an 
object called JsonSchemaFunctions. The allKeysWithValueTypes() and 
createSchema() functions should be exposed via the public API and documented 
well for their intended use.

In the project where I use these functions, I run allKeysWithValueTypes() over 
my entire RDD as it is being saved to a sequence file, and I use an 
Accumulator[Set[(String, DataType)]] to aggregate all the schema elements for 
the RDD into a final Set. I can then store off the schema and later call 
createSchema() to get the final StructType that can be used with the SQL table. 
I also had to write an isConflicted(Set[(String, DataType)]) function to 
determine whether a JSON object or JSON array was also encountered as a 
primitive type in one of the records in the RDD, or vice versa.
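
A rough sketch of the accumulator approach described above, simplified to accumulate (field name, type name) string pairs rather than real DataType values. The input path and the inferKeyTypes helper are hypothetical stand-ins for allKeysWithValueTypes(); an existing SparkContext sc is assumed.

{code}
import org.apache.spark.AccumulatorParam

// Set-union accumulator for schema fragments gathered across the RDD.
object SchemaSetParam extends AccumulatorParam[Set[(String, String)]] {
  def zero(initial: Set[(String, String)]): Set[(String, String)] = Set.empty
  def addInPlace(a: Set[(String, String)], b: Set[(String, String)]): Set[(String, String)] = a ++ b
}

val schemaAcc = sc.accumulator(Set.empty[(String, String)])(SchemaSetParam)

// Hypothetical per-record inference, e.g. Set(("user.name", "StringType")).
def inferKeyTypes(json: String): Set[(String, String)] = Set.empty

val jsonRdd = sc.textFile("/data/records.json")  // placeholder path
jsonRdd.foreach(record => schemaAcc += inferKeyTypes(record))
val aggregatedSchema = schemaAcc.value  // available on the driver after the job
{code}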

> Expose JsonRDD.allKeysWithValueTypes() in a utility class 
> --
>
> Key: SPARK-5260
> URL: https://issues.apache.org/jira/browse/SPARK-5260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Corey J. Nolet
>Assignee: Corey J. Nolet
>
> I have found this method extremely useful when implementing my own strategy 
> for inferring a schema from parsed json. For now, I've actually copied the 
> method right out of the JsonRDD class into my own project but I think it 
> would be immensely useful to keep the code in Spark and expose it publicly 
> somewhere else- like an object called JsonSchema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5580) Grep bug in compute-classpath.sh

2015-02-03 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi closed SPARK-5580.

Resolution: Fixed

> Grep bug in compute-classpath.sh
> 
>
> Key: SPARK-5580
> URL: https://issues.apache.org/jira/browse/SPARK-5580
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yadong Qi
> Fix For: 1.3.0
>
>
> When I test Spark, I often need to swap the assembly jar to test a different 
> version, so I move spark-assembly.*hadoop.*.jar to 
> spark-assembly.*hadoop.*.jar.bak.
> But then I get the error "Found multiple Spark assembly jars in 
> $assembly_folder:". Only actual jar files should be matched, so the grep 
> expression needs to be anchored with "^" at the beginning and "$" at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-03 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509
 ] 

Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:40 AM:
-

The phenomenon is:
blockManagerSlave times out and BlockManagerMasterActor removes its 
blockManager, but the executor hosting that blockManager does not time out 
because its Akka heartbeat is still normal. Since the blockManager lives inside 
the executor, removing the blockManager should also remove the executor.
This matters especially when dynamicAllocation is enabled: allocationManager 
listens for onBlockManagerRemoved and removes the executor, but 
CoarseGrainedSchedulerBackend still keeps it in executorDataMap.
[~rxin] [~andrewor14] [~sandyr] When BlockManagerMasterActor removes a 
blockManager due to a BlockManager timeout, we need to check whether the 
executor on that blockManager has already been removed; if it has not, we 
should remove the executor first. How about solving the problem this way?



was (Author: lianhuiwang):
the phenomenon is:
blockManagerSlave is timeout  and BlockManagerMasterActor will remove this 
blockManager, but executor on this blockManager is not timeout because akka's 
heartbeat is normal.Because blockManager is in executor, if blockManager is 
removed, executor on this blockManager should be removed too.
Especially when dynamicAllocation is enabled, allocationManager listen 
onBlockManagerRemoved and remove this executor. but actually in 
CoarseGrainedSchedulerBackend it is still in executorDataMap.
[~andrewor14]  [~sandyr] when BlockManagerMasterActor remove blockmanager due 
to timeout of BlockManager, we need to check whether executor on this 
blockmanager has been removed. if its executor has not been removed, we should 
firstly remove this executor. how about this way to solve this problem? 


> Executor is still hold while BlockManager has been removed
> --
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> When I run a spark job, one executor is hold, after 120s, blockManager is 
> removed by driver, but after half an hour before the executor is remove by  
> driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 12ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-03 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509
 ] 

Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:39 AM:
-

The phenomenon is:
blockManagerSlave times out and BlockManagerMasterActor removes its 
blockManager, but the executor hosting that blockManager does not time out 
because its Akka heartbeat is still normal. Since the blockManager lives inside 
the executor, removing the blockManager should also remove the executor.
This matters especially when dynamicAllocation is enabled: allocationManager 
listens for onBlockManagerRemoved and removes the executor, but 
CoarseGrainedSchedulerBackend still keeps it in executorDataMap.
[~andrewor14] [~sandyr] When BlockManagerMasterActor removes a blockManager due 
to a BlockManager timeout, we need to check whether the executor on that 
blockManager has already been removed; if it has not, we should remove the 
executor first. How about solving the problem this way?



was (Author: lianhuiwang):
the phenomenon is:
blockManagerSlave is timeout  and BlockManagerMasterActor will remove this 
blockManager, but executor on this blockManager is not timeout because akka's 
heartbeat is normal.Because blockManager is in executor, if blockManager is 
removed, executor on this blockManager should be removed too.
Especially when dynamicAllocation is enabled, allocationManager listen 
onBlockManagerRemoved and remove this executor. but actually in 
CoarseGrainedSchedulerBackend it is still in executorDataMap.
[~andrewor14]  when BlockManagerMasterActor remove blockmanager due to timeout 
of BlockManager, we need to check whether executor on this blockmanager has 
been removed. if its executor has not been removed, we should firstly remove 
this executor. how about this way to solve this problem? 


> Executor is still hold while BlockManager has been removed
> --
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> When I run a spark job, one executor is hold, after 120s, blockManager is 
> removed by driver, but after half an hour before the executor is remove by  
> driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 12ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition

2015-02-03 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-5581:
-

 Summary: When writing sorted map output file, avoid open / close 
between each partition
 Key: SPARK-5581
 URL: https://issues.apache.org/jira/browse/SPARK-5581
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.3.0
Reporter: Sandy Ryza


{code}
  // Bypassing merge-sort; get an iterator by partition and just write 
everything directly.
  for ((id, elements) <- this.partitionedIterator) {
if (elements.hasNext) {
  val writer = blockManager.getDiskWriter(
blockId, outputFile, ser, fileBufferSize, 
context.taskMetrics.shuffleWriteMetrics.get)
  for (elem <- elements) {
writer.write(elem)
  }
  writer.commitAndClose()
  val segment = writer.fileSegment()
  lengths(id) = segment.length
}
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-03 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509
 ] 

Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:37 AM:
-

The phenomenon is:
blockManagerSlave times out and BlockManagerMasterActor removes its 
blockManager, but the executor hosting that blockManager does not time out 
because its Akka heartbeat is still normal. Since the blockManager lives inside 
the executor, removing the blockManager should also remove the executor.
This matters especially when dynamicAllocation is enabled: allocationManager 
listens for onBlockManagerRemoved and removes the executor, but 
CoarseGrainedSchedulerBackend still keeps it in executorDataMap.
[~andrewor14] When BlockManagerMasterActor removes a blockManager due to a 
BlockManager timeout, we need to check whether the executor on that 
blockManager has already been removed; if it has not, we should remove the 
executor first. How about solving the problem this way?



was (Author: lianhuiwang):
the phenomenon is:
blockManagerSlave is timeout  and BlockManagerMasterActor will remove this 
blockManager, but executor on this blockManager is not timeout because akka's 
heartbeat is normal.Because blockManager is in executor, if blockManager is 
removed, executor on this blockManager should be removed too.
Especially when dynamicAllocation is enabled, allocationManager listen 
onBlockManagerRemoved and remove this executor. but actually in 
CoarseGrainedSchedulerBackend it is still in executorDataMap.
[~andrewor14]  when BlockManagerMasterActor remove blockmanager due to timeout 
of BlockManager, we need to check whether executor on this blockmanager has 
been removed. if its executor has not been removed, we should firstly remove 
this executor. how about this way to solve this problem?


> Executor is still hold while BlockManager has been removed
> --
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> When I run a spark job, one executor is hold, after 120s, blockManager is 
> removed by driver, but after half an hour before the executor is remove by  
> driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 12ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5580) Grep bug in compute-classpath.sh

2015-02-03 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-5580:
-
Affects Version/s: 1.2.0

> Grep bug in compute-classpath.sh
> 
>
> Key: SPARK-5580
> URL: https://issues.apache.org/jira/browse/SPARK-5580
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yadong Qi
> Fix For: 1.3.0
>
>
> When I test Spark, I often need to swap the assembly jar to test a different 
> version, so I move spark-assembly.*hadoop.*.jar to 
> spark-assembly.*hadoop.*.jar.bak.
> But then I get the error "Found multiple Spark assembly jars in 
> $assembly_folder:". Only actual jar files should be matched, so the grep 
> expression needs to be anchored with "^" at the beginning and "$" at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5580) Grep bug in compute-classpath.sh

2015-02-03 Thread Yadong Qi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yadong Qi updated SPARK-5580:
-
Fix Version/s: 1.3.0

> Grep bug in compute-classpath.sh
> 
>
> Key: SPARK-5580
> URL: https://issues.apache.org/jira/browse/SPARK-5580
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yadong Qi
> Fix For: 1.3.0
>
>
> When I test Spark, I often need to swap the assembly jar to test a different 
> version, so I move spark-assembly.*hadoop.*.jar to 
> spark-assembly.*hadoop.*.jar.bak.
> But then I get the error "Found multiple Spark assembly jars in 
> $assembly_folder:". Only actual jar files should be matched, so the grep 
> expression needs to be anchored with "^" at the beginning and "$" at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5580) Grep bug in compute-classpath.sh

2015-02-03 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-5580:


 Summary: Grep bug in compute-classpath.sh
 Key: SPARK-5580
 URL: https://issues.apache.org/jira/browse/SPARK-5580
 Project: Spark
  Issue Type: Bug
Reporter: Yadong Qi


When I test Spark, I often need to swap the assembly jar to test a different 
version, so I move spark-assembly.*hadoop.*.jar to 
spark-assembly.*hadoop.*.jar.bak.
But then I get the error "Found multiple Spark assembly jars in 
$assembly_folder:". Only actual jar files should be matched, so the grep 
expression needs to be anchored with "^" at the beginning and "$" at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-03 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509
 ] 

Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:29 AM:
-

The phenomenon is:
blockManagerSlave times out and BlockManagerMasterActor removes its 
blockManager, but the executor hosting that blockManager does not time out 
because its Akka heartbeat is still normal. Since the blockManager lives inside 
the executor, removing the blockManager should also remove the executor.
This matters especially when dynamicAllocation is enabled: allocationManager 
listens for onBlockManagerRemoved and removes the executor, but 
CoarseGrainedSchedulerBackend still keeps it in executorDataMap.
[~andrewor14] When BlockManagerMasterActor removes a blockManager due to a 
BlockManager timeout, we need to check whether the executor on that 
blockManager has already been removed; if it has not, we should remove the 
executor first. How about solving the problem this way?



was (Author: lianhuiwang):
the phenomenon is:
blockManagerSlave is timeout  and BlockManagerMasterActor will remove this 
blockManager, but executor on this blockManager is not timeout because akka's 
heartbeat is normal.
when dynamicAllocation is enabled, allocationManager listen 
onBlockManagerRemoved and remove this executor. but actually in 
CoarseGrainedSchedulerBackend it is still in executorDataMap. At this time it 
is wrong.
[~andrewor14]  when BlockManagerMasterActor remove blockmanager due to timeout 
of BlockManager, we need to check whether executor on this blockmanager has 
been removed. if its executor has not been removed, we should firstly remove 
this executor. how about this way to solve this problem?


> Executor is still hold while BlockManager has been removed
> --
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> When I run a spark job, one executor is hold, after 120s, blockManager is 
> removed by driver, but after half an hour before the executor is remove by  
> driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 12ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-03 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509
 ] 

Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:27 AM:
-

The phenomenon is:
blockManagerSlave times out and BlockManagerMasterActor removes its 
blockManager, but the executor hosting that blockManager does not time out 
because its Akka heartbeat is still normal.
When dynamicAllocation is enabled, allocationManager listens for 
onBlockManagerRemoved and removes the executor, but 
CoarseGrainedSchedulerBackend still keeps it in executorDataMap, which is wrong.
[~andrewor14] When BlockManagerMasterActor removes a blockManager due to a 
BlockManager timeout, we need to check whether the executor on that 
blockManager has already been removed; if it has not, we should remove the 
executor first. How about solving the problem this way?



was (Author: lianhuiwang):
the phenomenon is:
blockManagerSlave is timeout  and BlockManagerMasterActor will remove this 
blockManager, but executor on this blockManager is not timeout because akka's 
heartbeat is normal.
when dynamicAllocation is enabled, allocationManager listen 
onBlockManagerRemoved and remove this executor. but actually in 
CoarseGrainedSchedulerBackend it is still in executorDataMap. at this time it 
is wrong.
[~andrewor14]  when BlockManagerMasterActor remove blockmanager due to timeout 
of BlockManager, we need to check whether executor on this blockmanager has 
been removed. if its executor has not been removed, we should firstly remove 
this executor. how about this way to solve this problem?


> Executor is still hold while BlockManager has been removed
> --
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> When I run a spark job, one executor is hold, after 120s, blockManager is 
> removed by driver, but after half an hour before the executor is remove by  
> driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 12ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-03 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509
 ] 

Lianhui Wang commented on SPARK-5529:
-

The phenomenon is:
blockManagerSlave times out and BlockManagerMasterActor removes its 
blockManager, but the executor hosting that blockManager does not time out 
because its Akka heartbeat is still normal.
When dynamicAllocation is enabled, allocationManager listens for 
onBlockManagerRemoved and removes the executor, but 
CoarseGrainedSchedulerBackend still keeps it in executorDataMap, which is wrong.
[~andrewor14] When BlockManagerMasterActor removes a blockManager due to a 
BlockManager timeout, we need to check whether the executor on that 
blockManager has already been removed; if it has not, we should remove the 
executor first. How about solving the problem this way?


> Executor is still hold while BlockManager has been removed
> --
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> When I run a spark job, one executor is hold, after 120s, blockManager is 
> removed by driver, but after half an hour before the executor is remove by  
> driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 12ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-03 Thread Corey J. Nolet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304510#comment-14304510
 ] 

Corey J. Nolet commented on SPARK-5140:
---

I think the problem is that when actions are performed on RDDs in multiple 
threads, the SparkContext on the driver that's scheduling the DAG should be 
able to see that the two RDDs depend on the same parent and synchronize them so 
that the parent only runs once, whether it is being cached or not (you'd assume 
the parent would be getting cached, but I think this change would still help 
cases where it isn't).

The fact that I did:

val rdd1 = input data -> transform data -> groupBy -> etc... -> cache
val rdd2 = future { rdd1.transform.groupBy.saveAsSequenceFile() }
val rdd3 = future { rdd1.transform.groupBy.saveAsSequenceFile() }

has unexpected results: rdd1 is assigned an id and then run completely 
separately for rdd2 and rdd3. I would have expected that, whether cached or 
not, when run in separate threads rdd1 would be assigned an id, rdd2 would 
cause it to begin running through its stages, and rdd3 would pause because it 
is waiting on rdd1's stages to complete. What I see instead is that, after rdd2 
and rdd3 both run concurrently and each recompute rdd1, the storage tab shows 
rdd1 as 200% cached. This causes issues when I have 50 or so RDDs calling 
saveAsSequenceFile() that share dependencies on parent RDDs (which may not 
always be known at creation time without introspecting my own tree).

Right now I basically have to do the scheduling myself: determine what depends 
on what and run things concurrently on my own. It seems like the DAG scheduler 
should already know this and be able to make use of it.
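
A runnable sketch of the pattern being described, assuming an existing SparkContext sc; the output paths are placeholders.

{code}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import org.apache.spark.SparkContext._

// Shared, cached parent plus two derived jobs launched from separate threads.
// Today each future can trigger its own computation of `parent`; the request
// is that the second job wait on the first one's parent stages instead.
val parent = sc.parallelize(0 until 1000).map(i => (i % 10, i)).groupByKey().cache()
val job1 = Future { parent.mapValues(_.sum).saveAsTextFile("/tmp/out1") }
val job2 = Future { parent.mapValues(_.size).saveAsTextFile("/tmp/out2") }
Await.result(Future.sequence(Seq(job1, job2)), 10.minutes)
{code}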

> Two RDDs which are scheduled concurrently should be able to wait on parent in 
> all cases
> ---
>
> Key: SPARK-5140
> URL: https://issues.apache.org/jira/browse/SPARK-5140
> Project: Spark
>  Issue Type: New Feature
>Reporter: Corey J. Nolet
>  Labels: features
>
> Not sure if this would change too much of the internals to be included in the 
> 1.2.1 but it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the 
> result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for other guy to 
> finish caching" logic is on a per-task basis, and it only works on tasks that 
> happen to be executing on the same machine. 
> bq. Once a partition is cached, we will schedule tasks that touch that 
> partition on that executor. The problem here, though, is that the cache is in 
> progress, and so the tasks are still scheduled randomly (or with whatever 
> locality the data source has), so tasks which end up on different machines 
> will not see that the cache is already in progress.
> {code}
> Here was my test, by the way:
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i 
> }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.second)
> {code}
> bq. Note that I run the future 4 times in parallel. I found that the first 
> run has all tasks take 10 seconds. The second has about 50% of its tasks take 
> 10 seconds, and the rest just wait for the first stage to finish. The last 
> two runs have no tasks that take 10 seconds; all wait for the first two 
> stages to finish.
> What we want is the ability to fire off a job and have the DAG figure out 
> that two RDDs depend on the same parent so that when the children are 
> scheduled concurrently, the first one to start will activate the parent and 
> both will wait on the parent. When the parent is done, they will both be able 
> to finish their work concurrently. We are trying to use this pattern by 
> having the parent cache results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5526) expression [date '2011-01-01' = cast(timestamp('2011-01-01 23:24:25') as date)] return false

2015-02-03 Thread xukun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xukun closed SPARK-5526.

Resolution: Fixed

This issue was fixed by #4325.

> expression [date '2011-01-01' = cast(timestamp('2011-01-01 23:24:25') as 
> date)] return false
> 
>
> Key: SPARK-5526
> URL: https://issues.apache.org/jira/browse/SPARK-5526
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: xukun
>
> Setup for the test case:
> create table date_1(d date);
> insert overwrite table date_1 select cast('2011-01-01' as date) from src 
> tablesample (1 rows);
> In Hive, executing {select date '2011-01-01' = cast(timestamp('2011-01-01 
> 23:24:25') as date) from date_1 limit 1;} returns true, but in Spark SQL it 
> returns false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5577) Create a convenient way for Python users to register SQL UDFs

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304494#comment-14304494
 ] 

Apache Spark commented on SPARK-5577:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4351

> Create a convenient way for Python users to register SQL UDFs
> -
>
> Key: SPARK-5577
> URL: https://issues.apache.org/jira/browse/SPARK-5577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304446#comment-14304446
 ] 

Apache Spark commented on SPARK-2945:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/4350

> Allow specifying num of executors in the context configuration
> --
>
> Key: SPARK-2945
> URL: https://issues.apache.org/jira/browse/SPARK-2945
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.0
> Environment: Ubuntu precise, on YARN (CDH 5.1.0)
>Reporter: Shay Rojansky
>
> Running on YARN, the only way to specify the number of executors seems to be 
> on the command line of spark-submit, via the --num-executors switch.
> In many cases this is too early. Our Spark app receives some cmdline 
> arguments which determine the amount of work that needs to be done - and that 
> affects the number of executors it ideally requires. Ideally, the Spark 
> context configuration would support specifying this like any other config 
> param.
> Our current workaround is a wrapper script that determines how much work is 
> needed, and which itself launches spark-submit with the number passed to 
> --num-executors - it's a shame to have to do this.
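
For context, a sketch of what setting this from the context configuration looks like. The property name spark.executor.instances is the YARN-side counterpart of --num-executors; treat the exact name and values here as assumptions of the sketch rather than a settled API.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Decide the executor count programmatically instead of on the spark-submit
// command line.
val neededExecutors = 8  // e.g. derived from the amount of work to do
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.executor.instances", neededExecutors.toString)
val sc = new SparkContext(conf)
{code}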



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5529) Executor is still hold while BlockManager has been removed

2015-02-03 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304433#comment-14304433
 ] 

Hong Shen commented on SPARK-5529:
--

The executor is only lost when Akka throws a DisassociatedEvent.

> Executor is still hold while BlockManager has been removed
> --
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> When I run a spark job, one executor is hold, after 120s, blockManager is 
> removed by driver, but after half an hour before the executor is remove by  
> driver. Here is the log:
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 12ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5520) Make FP-Growth implementation take generic item types

2015-02-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5520.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4340
[https://github.com/apache/spark/pull/4340]

> Make FP-Growth implementation take generic item types
> -
>
> Key: SPARK-5520
> URL: https://issues.apache.org/jira/browse/SPARK-5520
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Jacky Li
>Priority: Critical
> Fix For: 1.3.0
>
>
> There is no technical restriction on the item types in the FP-Growth 
> implementation. We used String in the first PR for simplicity. Maybe we could 
> make the type generic before 1.3 (and specialize it for Int/Long).
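
A rough sketch of what a generic item type could look like. The class and method below are illustrative, not the merged MLlib API, and the body only counts frequent single items as a stand-in for the real FP-Growth computation; an existing SparkContext is assumed for building the input RDD.

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

class GenericFPGrowthSketch(minCount: Long) {
  // Item is generic (with a ClassTag) instead of being fixed to String.
  def run[Item: ClassTag](transactions: RDD[Array[Item]]): Map[Item, Long] = {
    transactions
      .flatMap(_.distinct)                             // one occurrence per transaction
      .countByValue()                                  // driver-side counts per item
      .filter { case (_, count) => count >= minCount } // keep frequent items only
      .toMap
  }
}
{code}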



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5153) flaky test of "Reliable Kafka input stream with multiple topics"

2015-02-03 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304423#comment-14304423
 ] 

Saisai Shao commented on SPARK-5153:


Hi TD, thanks a lot for your PR. Currently I have no better solution than 
increasing the timeout threshold; as I remember, the Kafka unit tests deal with 
this the same way. I will check again to see whether we can solve it more 
elegantly.

> flaky test of "Reliable Kafka input stream with multiple topics"
> 
>
> Key: SPARK-5153
> URL: https://issues.apache.org/jira/browse/SPARK-5153
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Nan Zhu
>  Labels: flaky-test
> Fix For: 1.3.0, 1.2.2
>
>
> I have seen several irrelevant PR failed on this test
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25254/consoleFull
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25248/
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25251/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2

2015-02-03 Thread Florian Verhein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304412#comment-14304412
 ] 

Florian Verhein commented on SPARK-5552:


Thanks [~sowen]. 

So it wouldn't fit in the spark repo itself (the only change there would be to 
add an option in spark_ec2.py to use an alternate spark-ec2 repo/branch). It 
would naturally live in spark-ec2, as it  involves changes to spark-ec2 for 
both use cases
- Image creation is based on the work soon to be added to spark-ec2 for this: 
https://issues.apache.org/jira/browse/SPARK-3821
- Cluster deployment+configuration is done using the spark-ec2 scripts 
themselves (but with many modifications/fixes).

Since there is a dependency between the image and the configuration (init.sh 
and setup.sh) scripts, it's not possible to solve this with just an AMI.

The extra components (actually, just vowpal wabbit and more python libraries - 
the rest already exists in spark-ec2 AMI) are just added to the image for data 
science convenience.


> Automated data science AMI creation and data science cluster deployment on EC2
> --
>
> Key: SPARK-5552
> URL: https://issues.apache.org/jira/browse/SPARK-5552
> Project: Spark
>  Issue Type: New Feature
>  Components: EC2
>Reporter: Florian Verhein
>
> Issue created RE: 
> https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read 
> for background)
> Goal:
> Extend spark-ec2 scripts to create an automated data science cluster 
> deployment on EC2, suitable for almost(?)-production use.
> Use cases: 
> - A user can build their own custom data science AMIs from a CentOS minimal 
> image by calling a packer configuration (good defaults should be provided, 
> some options for flexibility)
> - A user can then easily deploy a new (correctly configured) cluster using 
> these AMIs, and do so as quickly as possible.
> Components/modules: Spark + tachyon + hdfs (on instance storage) + python + R 
> + vowpal wabbit + any rpms + ... + ganglia
> Focus is on reliability (rather than e.g. supporting many versions / dev 
> testing) and speed of deployment.
> Use hadoop 2 so option to lift into yarn later.
> My current solution is here: 
> https://github.com/florianverhein/spark-ec2/tree/packer. It includes other 
> fixes/improvements as needed to get it working.
> Now that it seems to work (but has deviated a lot more from the existing code 
> base than I was expecting), I'm wondering what to do with it...
> Keen to hear ideas if anyone is interested. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5574) Utils.createDirectory ignores namePrefix

2015-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5574:
---
Assignee: Imran Rashid

> Utils.createDirectory ignores namePrefix
> 
>
> Key: SPARK-5574
> URL: https://issues.apache.org/jira/browse/SPARK-5574
> Project: Spark
>  Issue Type: Bug
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Trivial
>
> this is really minor, I just noticed it as I was trying to find the 
> "blockmgr" dir during some debugging, and then realized that the 
> {{namePrefix}} is ignored in  {{Utils.createDirectory}}.  Also via 
> {{Utils.createTempDir}} this effects these dirs:
> * httpd
> * userFiles
> * broadcast
> I'll submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Anand Mohan Tumuluri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304383#comment-14304383
 ] 

Anand Mohan Tumuluri commented on SPARK-5472:
-

Thanks again [~rxin] and [~tmyklebu]. My bad, I had only checked the table 
creation scripts before and made assumptions.
This would satisfy our use case very well.
The custom partitioning conditions would also remove the need to use SQL 
conditionals.

One more question, how do I 'get' new data that got inserted in the source 
table(s)? Would 'refresh table' work for this?

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5579) Provide support for project using SQL expression

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304378#comment-14304378
 ] 

Apache Spark commented on SPARK-5579:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4348

> Provide support for project using SQL expression
> 
>
> Key: SPARK-5579
> URL: https://issues.apache.org/jira/browse/SPARK-5579
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Would be nice to allow something like
> df.selectExpr("abs(colA)", "colB")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5579) Provide support for project using SQL expression

2015-02-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5579:
--

 Summary: Provide support for project using SQL expression
 Key: SPARK-5579
 URL: https://issues.apache.org/jira/browse/SPARK-5579
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin
Assignee: Reynold Xin


Would be nice to allow something like

df.selectExpr("abs(colA)", "colB")




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5460) RandomForest should catch exceptions when removing checkpoint files

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304352#comment-14304352
 ] 

Apache Spark commented on SPARK-5460:
-

User 'x1-' has created a pull request for this issue:
https://github.com/apache/spark/pull/4347

> RandomForest should catch exceptions when removing checkpoint files
> ---
>
> Key: SPARK-5460
> URL: https://issues.apache.org/jira/browse/SPARK-5460
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> RandomForest can optionally use checkpointing.  When it tries to remove 
> checkpoint files, it could fail (if a user has write but not delete access on 
> some filesystem).  There should be a try-catch to catch exceptions when 
> trying to remove checkpoint files in NodeIdCache.
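
A minimal sketch of the defensive deletion being asked for; the path and configuration here are illustrative, not the actual NodeIdCache internals.

{code}
import java.io.IOException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val checkpointPath = new Path("/tmp/checkpoints/rdd-42")  // illustrative path
val fs = FileSystem.get(new Configuration())
try {
  fs.delete(checkpointPath, true)
} catch {
  case e: IOException =>
    // e.g. write-but-not-delete permissions: log and continue instead of failing the job
    println(s"Could not remove checkpoint file $checkpointPath: ${e.getMessage}")
}
{code}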



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5578) Provide a convenient way for Scala users to use UDFs

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304349#comment-14304349
 ] 

Apache Spark commented on SPARK-5578:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4345

> Provide a convenient way for Scala users to use UDFs
> 
>
> Key: SPARK-5578
> URL: https://issues.apache.org/jira/browse/SPARK-5578
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
>
> Dsl.udf(...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5578) Provide a convenient way for Scala users to use UDFs

2015-02-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5578:
--

 Summary: Provide a convenient way for Scala users to use UDFs
 Key: SPARK-5578
 URL: https://issues.apache.org/jira/browse/SPARK-5578
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker


Dsl.udf(...).
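
A purely illustrative sketch of the kind of usage this sub-task is asking for; the exact entry point and semantics are up to the PR, and {{df}} and the column name are assumptions.

{code}
// Hypothetical usage: wrap a plain Scala function as a column expression.
val plusOne = udf((x: Int) => x + 1)
df.select(plusOne(df("age")))
{code}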




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5577) Create a convenient way for Python users to register SQL UDFs

2015-02-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5577:
--

 Summary: Create a convenient way for Python users to register SQL 
UDFs
 Key: SPARK-5577
 URL: https://issues.apache.org/jira/browse/SPARK-5577
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5554) Add more tests and docs for DataFrame Python API

2015-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5554:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-5166

> Add more tests and docs for DataFrame Python API
> 
>
> Key: SPARK-5554
> URL: https://issues.apache.org/jira/browse/SPARK-5554
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0
>
>
> more tests for DataFrame Python API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5554) Add more tests and docs for DataFrame Python API

2015-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5554.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Davies Liu

> Add more tests and docs for DataFrame Python API
> 
>
> Key: SPARK-5554
> URL: https://issues.apache.org/jira/browse/SPARK-5554
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0
>
>
> more tests for DataFrame Python API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5280) Import RDF graphs into GraphX

2015-02-03 Thread lukovnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304070#comment-14304070
 ] 

lukovnikov edited comment on SPARK-5280 at 2/3/15 11:28 PM:


Started working on it: 
https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala
 and 
https://github.com/lukovnikov/spark/blob/rdfloaderhash/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala

The second one computes hashes for VertexIds instead of building a dictionary 
over the whole RDF input and broadcasting it, as the first one does.

I will test soon, write comments, and make a pull request.


was (Author: lukovnikov):
started working on it: 
https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala
 and 
https://github.com/lukovnikov/spark/blob/rdfloaderhash/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala
 

The second one computes hashes for VertexIds instead of building a whole 
dictionary of the whole RDF input and broadcasting it as the first one does.
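
A sketch of the hashing idea the second loader is described as taking (an assumption about the approach, not the loader's actual code): map each RDF IRI straight to a 64-bit VertexId rather than building and broadcasting a dictionary.

{code}
import org.apache.spark.graphx.VertexId

// FNV-1a 64-bit hash; purely illustrative, and collisions are not handled here.
def vertexIdOf(iri: String): VertexId = {
  var h = 0xcbf29ce484222325L
  iri.foreach { c =>
    h ^= c.toLong
    h *= 0x100000001b3L
  }
  h
}
{code}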

> Import RDF graphs into GraphX
> -
>
> Key: SPARK-5280
> URL: https://issues.apache.org/jira/browse/SPARK-5280
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: lukovnikov
>
> RDF (Resource Description Framework) models knowledge in a graph and is 
> heavily used on the Semantic Web and beyond.
> GraphX should include a way to import RDF data easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5280) Import RDF graphs into GraphX

2015-02-03 Thread lukovnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304070#comment-14304070
 ] 

lukovnikov edited comment on SPARK-5280 at 2/3/15 11:27 PM:


started working on it: 
https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala
 and 
https://github.com/lukovnikov/spark/blob/rdfloaderhash/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala
 

The second one computes hashes for VertexIds instead of building a whole 
dictionary of the whole RDF input and broadcasting it as the first one does.


was (Author: lukovnikov):
started working on it: 
https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala

> Import RDF graphs into GraphX
> -
>
> Key: SPARK-5280
> URL: https://issues.apache.org/jira/browse/SPARK-5280
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: lukovnikov
>
> RDF (Resource Description Framework) models knowledge in a graph and is 
> heavily used on the Semantic Web and beyond.
> GraphX should include a way to import RDF data easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304298#comment-14304298
 ] 

Joseph K. Bradley commented on SPARK-5021:
--

That BLAS implementation is actually part of MLlib (see the imports).  You may 
need to generalize it to work with SparseVector, but it should belong in 
mllib.linalg.BLAS.
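
To make the suggestion concrete, a sketch of the shape such a generalization might take (an illustrative helper, not existing mllib.linalg.BLAS code): only the stored entries of the sparse operand are visited, which is what gives the linear-in-non-zeros cost mentioned in the issue.

{code}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

// dot(sparse, dense) touching only the non-zero entries of the sparse vector.
def sparseDot(x: SparseVector, y: DenseVector): Double = {
  var i = 0
  var sum = 0.0
  while (i < x.indices.length) {
    sum += x.values(i) * y.values(x.indices(i))
    i += 1
  }
  sum
}
{code}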

> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode

2015-02-03 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-4986.
--
   Resolution: Fixed
Fix Version/s: 1.2.2

> Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
> --
>
> Key: SPARK-4986
> URL: https://issues.apache.org/jira/browse/SPARK-4986
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Jesper Lundgren
>Priority: Blocker
> Fix For: 1.3.0, 1.2.2
>
>
> When using the graceful stop API of Spark Streaming in a Spark Standalone 
> cluster, the stop signal never reaches the receivers. I have tested this with 
> Spark 1.2 and Kafka receivers. 
> ReceiverTracker will send a StopReceiver message to ReceiverSupervisorImpl.
> In local mode ReceiverSupervisorImpl receives this message, but in Standalone 
> cluster mode the message seems to be lost.
> (I have modified the code to send my own string message as a stop signal from 
> ReceiverTracker to ReceiverSupervisorImpl, and it works as a workaround.)
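
For reference, this is the graceful stop call in question (assuming {{ssc}} is the application's StreamingContext); it is this stop signal that reportedly never reaches the receivers in standalone cluster mode.

{code}
// Request a graceful shutdown: stop receivers first, then let queued batches finish.
ssc.stop(stopSparkContext = true, stopGracefully = true)
{code}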



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5557) spark-shell failed to start

2015-02-03 Thread Ben Mabey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304252#comment-14304252
 ] 

Ben Mabey commented on SPARK-5557:
--

I ran a git bisect and this is the first bad commit:

https://github.com/apache/spark/commit/7930d2bef0e2c7f62456e013124455061dfe6dc8

The commit adds a Jetty dependency, so that seems like the culprit.




> spark-shell failed to start
> ---
>
> Key: SPARK-5557
> URL: https://issues.apache.org/jira/browse/SPARK-5557
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Guoqiang Li
>Priority: Blocker
>
> the log:
> {noformat}
> 5/02/03 19:06:39 INFO spark.HttpServer: Starting HTTP Server
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> javax/servlet/http/HttpServletResponse
>   at 
> org.apache.spark.HttpServer.org$apache$spark$HttpServer$$doStart(HttpServer.scala:75)
>   at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62)
>   at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62)
>   at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1774)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1765)
>   at org.apache.spark.HttpServer.start(HttpServer.scala:62)
>   at org.apache.spark.repl.SparkIMain.(SparkIMain.scala:130)
>   at 
> org.apache.spark.repl.SparkILoop$SparkILoopInterpreter.(SparkILoop.scala:185)
>   at 
> org.apache.spark.repl.SparkILoop.createInterpreter(SparkILoop.scala:214)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:946)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:942)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1039)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:403)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:77)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> javax.servlet.http.HttpServletResponse
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 25 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5576) saveAsTable into Hive fails due to duplicate columns

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304229#comment-14304229
 ] 

Apache Spark commented on SPARK-5576:
-

User 'danosipov' has created a pull request for this issue:
https://github.com/apache/spark/pull/4346

> saveAsTable into Hive fails due to duplicate columns
> 
>
> Key: SPARK-5576
> URL: https://issues.apache.org/jira/browse/SPARK-5576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Dan Osipov
>
> Loading JSON files infers a case-sensitive schema, which results in an error 
> when attempting to save to Hive.
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.hive._
> val hive = new HiveContext(sc)
> val data = hive.jsonFile("/path/")
> data.saveAsTable("table")
> {code}
> Results in an error:
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name 
> data-errorcode in the table definition.
> Outputting the schema shows the problem field:
>  |-- data-errorCode: string (nullable = true)
>  |-- data-errorcode: string (nullable = true)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5576) saveAsTable into Hive fails due to duplicate columns

2015-02-03 Thread Dan Osipov (JIRA)
Dan Osipov created SPARK-5576:
-

 Summary: saveAsTable into Hive fails due to duplicate columns
 Key: SPARK-5576
 URL: https://issues.apache.org/jira/browse/SPARK-5576
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Dan Osipov


Loading JSON files infers a case-sensitive schema, which results in an error 
when attempting to save to Hive.

{code}
import org.apache.spark.sql._
import org.apache.spark.sql.hive._
val hive = new HiveContext(sc)
val data = hive.jsonFile("/path/")
data.saveAsTable("table")
{code}

Results in an error:
org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name 
data-errorcode in the table definition.

Outputting the schema shows the problem field:
 |-- data-errorCode: string (nullable = true)
 |-- data-errorcode: string (nullable = true)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5574) Utils.createDirectory ignores namePrefix

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304190#comment-14304190
 ] 

Apache Spark commented on SPARK-5574:
-

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/4344

> Utils.createDirectory ignores namePrefix
> 
>
> Key: SPARK-5574
> URL: https://issues.apache.org/jira/browse/SPARK-5574
> Project: Spark
>  Issue Type: Bug
>Reporter: Imran Rashid
>Priority: Trivial
>
> this is really minor, I just noticed it as I was trying to find the 
> "blockmgr" dir during some debugging, and then realized that the 
> {{namePrefix}} is ignored in {{Utils.createDirectory}}.  Also, via 
> {{Utils.createTempDir}}, this affects these dirs:
> * httpd
> * userFiles
> * broadcast
> I'll submit a PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-02-03 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-5575:
---

 Summary: Artificial neural networks for MLlib deep learning
 Key: SPARK-5575
 URL: https://issues.apache.org/jira/browse/SPARK-5575
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Alexander Ulanov


Goal: Implement various types of artificial neural networks

Motivation: deep learning trend

Requirements: 
1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and 
Backpropagation should be implemented as traits or interfaces, so they can be 
easily extended or reused (see the sketch after this list)
2) Implement complex abstractions, such as feed-forward and recurrent networks
3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
autoencoder (sparse and denoising), stacked autoencoder, restricted Boltzmann 
machines (RBM), deep belief networks (DBN), etc.
4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
poolers, etc.
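
A minimal sketch of the trait-based abstractions item 1 calls for; names and signatures are illustrative, not a proposed MLlib API.

{code}
import org.apache.spark.mllib.linalg.Vector

trait Layer {
  def forward(input: Vector): Vector                        // compute this layer's activations
  def backward(input: Vector, outputDelta: Vector): Vector  // propagate the error backwards
}

trait ErrorFunction {
  def loss(output: Vector, target: Vector): Double
  def gradient(output: Vector, target: Vector): Vector
}
{code}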



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5574) Utils.createDirectory ignores namePrefix

2015-02-03 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-5574:
---

 Summary: Utils.createDirectory ignores namePrefix
 Key: SPARK-5574
 URL: https://issues.apache.org/jira/browse/SPARK-5574
 Project: Spark
  Issue Type: Bug
Reporter: Imran Rashid
Priority: Trivial


this is really minor, I just noticed it as I was trying to find the "blockmgr" 
dir during some debugging, and then realized that the {{namePrefix}} is ignored 
in {{Utils.createDirectory}}.  Also, via {{Utils.createTempDir}}, this affects 
these dirs:

* httpd
* userFiles
* broadcast

I'll submit a PR.
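
A sketch of the kind of fix this implies (not the actual PR; the real {{Utils.createDirectory}} has retry and error handling that is omitted here): include the prefix in the generated directory name instead of dropping it.

{code}
import java.io.File
import java.util.UUID

// Illustrative: the generated name should start with namePrefix rather than a fixed "spark".
def createDirectory(root: String, namePrefix: String = "spark"): File = {
  val dir = new File(root, s"$namePrefix-${UUID.randomUUID()}")
  dir.mkdirs()
  dir
}
{code}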



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-03 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5140:
---
Fix Version/s: (was: 1.2.1)
   (was: 1.3.0)

> Two RDDs which are scheduled concurrently should be able to wait on parent in 
> all cases
> ---
>
> Key: SPARK-5140
> URL: https://issues.apache.org/jira/browse/SPARK-5140
> Project: Spark
>  Issue Type: New Feature
>Reporter: Corey J. Nolet
>  Labels: features
>
> Not sure if this would change too much of the internals to be included in the 
> 1.2.1 but it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the 
> result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for other guy to 
> finish caching" logic is on a per-task basis, and it only works on tasks that 
> happen to be executing on the same machine. 
> bq. Once a partition is cached, we will schedule tasks that touch that 
> partition on that executor. The problem here, though, is that the cache is in 
> progress, and so the tasks are still scheduled randomly (or with whatever 
> locality the data source has), so tasks which end up on different machines 
> will not see that the cache is already in progress.
> {code}
> Here was my test, by the way:
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i 
> }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.second)
> {code}
> bq. Note that I run the future 4 times in parallel. I found that the first 
> run has all tasks take 10 seconds. The second has about 50% of its tasks take 
> 10 seconds, and the rest just wait for the first stage to finish. The last 
> two runs have no tasks that take 10 seconds; all wait for the first two 
> stages to finish.
> What we want is the ability to fire off a job and have the DAG figure out 
> that two RDDs depend on the same parent so that when the children are 
> scheduled concurrently, the first one to start will activate the parent and 
> both will wait on the parent. When the parent is done, they will both be able 
> to finish their work concurrently. We are trying to use this pattern by 
> having the parent cache results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases

2015-02-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304114#comment-14304114
 ] 

Sean Owen commented on SPARK-5140:
--

Is this not substantially answered by just materializing the cached RDD right 
after you cache it? Then anything that happens afterwards already sees a cached 
RDD. Is the request basically to automatically persist and unpersist RDDs to 
implement this? I suppose the issue is simply that this is hard to figure out. 
Even if you can figure out that 2 RDDs can be computed in parallel, and need to 
be, and depend on one parent, it's not obvious you can just persist the RDD 
automatically. I guess the question is what, specifically, this change would 
look like?
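
A sketch of the workaround being suggested, reusing the test from the ticket: materialize the cached parent once, blocking, before launching the concurrent jobs, so every later count sees an already-cached RDD. (The 10-second sleep mirrors the observation above that each task takes 10 seconds.)

{code}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.concurrent.duration._

val rdd = sc.parallelize(0 until 8).map { i => Thread.sleep(10000); i }.cache()
rdd.count()  // pay the caching cost exactly once, before any concurrent jobs start
val futures = (0 until 4).map { _ => Future { rdd.count() } }
Await.result(Future.sequence(futures), 120.seconds)
{code}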

> Two RDDs which are scheduled concurrently should be able to wait on parent in 
> all cases
> ---
>
> Key: SPARK-5140
> URL: https://issues.apache.org/jira/browse/SPARK-5140
> Project: Spark
>  Issue Type: New Feature
>Reporter: Corey J. Nolet
>  Labels: features
> Fix For: 1.3.0, 1.2.1
>
>
> Not sure if this would change too much of the internals to be included in the 
> 1.2.1 but it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the 
> result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for other guy to 
> finish caching" logic is on a per-task basis, and it only works on tasks that 
> happen to be executing on the same machine. 
> bq. Once a partition is cached, we will schedule tasks that touch that 
> partition on that executor. The problem here, though, is that the cache is in 
> progress, and so the tasks are still scheduled randomly (or with whatever 
> locality the data source has), so tasks which end up on different machines 
> will not see that the cache is already in progress.
> {code}
> Here was my test, by the way:
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i 
> }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.second)
> {code}
> bq. Note that I run the future 4 times in parallel. I found that the first 
> run has all tasks take 10 seconds. The second has about 50% of its tasks take 
> 10 seconds, and the rest just wait for the first stage to finish. The last 
> two runs have no tasks that take 10 seconds; all wait for the first two 
> stages to finish.
> What we want is the ability to fire off a job and have the DAG figure out 
> that two RDDs depend on the same parent so that when the children are 
> scheduled concurrently, the first one to start will activate the parent and 
> both will wait on the parent. When the parent is done, they will both be able 
> to finish their work concurrently. We are trying to use this pattern by 
> having the parent cache results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5573) Support explode in DataFrame DSL

2015-02-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5573:
--

 Summary: Support explode in DataFrame DSL
 Key: SPARK-5573
 URL: https://issues.apache.org/jira/browse/SPARK-5573
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust
Priority: Blocker


The DSL is missing explode support. We should enable developers to explode a 
column, or explode multiple columns.
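
For comparison, a sketch of the route that exists today, which goes through SQL rather than the DSL; a HiveContext named {{hiveContext}} and a table {{events}} with an array column {{tags}} are assumptions for illustration. The proposal is to expose the same operation directly on DataFrame.

{code}
// Assumed setup: hiveContext is a HiveContext, events has an array<string> column `tags`.
// This is the LATERAL VIEW route available via SQL today; the DSL has no equivalent yet.
val exploded = hiveContext.sql(
  "SELECT id, tag FROM events LATERAL VIEW explode(tags) t AS tag")
{code}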



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-03 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304099#comment-14304099
 ] 

Manoj Kumar edited comment on SPARK-5021 at 2/3/15 10:01 PM:
-

Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
I'm not sure how to do this, other than writing my own implementation that does 
not depend on NativeBlas for sparse data. Is that okay?


was (Author: mechcoder):
Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
 I'm not sure how to do this, other than doing an own implementation which does 
not depend on NativeBlas. Is that okay?

> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-03 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304099#comment-14304099
 ] 

Manoj Kumar edited comment on SPARK-5021 at 2/3/15 10:02 PM:
-

Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
I'm not sure how to do this, other than writing my own implementation that does 
not depend on NativeBlas for a SparseVector. Is that okay?


was (Author: mechcoder):
Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
 I'm not sure how to do this, other than doing an own implementation which does 
not depend on NativeBlas for sparse data. Is that okay?

> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input

2015-02-03 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304099#comment-14304099
 ] 

Manoj Kumar commented on SPARK-5021:


Hi. I'm almost there. I have one last question.

In this line, 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223
I'm not sure how to do this, other than writing my own implementation that does 
not depend on NativeBlas. Is that okay?

> GaussianMixtureEM should be faster for SparseVector input
> -
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors.  It would 
> be nice if it were faster for SparseVectors (running in time linear in the 
> number of non-zero values).
> However, this may not be too important since clustering should rarely be done 
> in high dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5153) flaky test of "Reliable Kafka input stream with multiple topics"

2015-02-03 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-5153.
--
   Resolution: Fixed
Fix Version/s: 1.2.2
   1.3.0

> flaky test of "Reliable Kafka input stream with multiple topics"
> 
>
> Key: SPARK-5153
> URL: https://issues.apache.org/jira/browse/SPARK-5153
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Nan Zhu
>  Labels: flaky-test
> Fix For: 1.3.0, 1.2.2
>
>
> I have seen several irrelevant PR failed on this test
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25254/consoleFull
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25248/
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25251/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5280) Import RDF graphs into GraphX

2015-02-03 Thread lukovnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304070#comment-14304070
 ] 

lukovnikov commented on SPARK-5280:
---

started working on it: 
https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala

> Import RDF graphs into GraphX
> -
>
> Key: SPARK-5280
> URL: https://issues.apache.org/jira/browse/SPARK-5280
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: lukovnikov
>
> RDF (Resource Description Framework) models knowledge in a graph and is 
> heavily used on the Semantic Web and beyond.
> GraphX should include a way to import RDF data easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode

2015-02-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304052#comment-14304052
 ] 

Marcelo Vanzin commented on SPARK-5388:
---

Hi Patrick,

Most of my questions are related to the protocol specification attached to this 
bug. So when I ask about something, I generally mean that the specification is 
vague about that. If the implementation made a choice about that thing, it just 
means that the implementation should be the specification, and everybody should 
just ignore the document attached to this bug. And we can then move the 
discussion to the PR itself.

bq.  The intention for this is really just to take single RPC that was using 
Akka and add a stable version of it that we are okay supporting long term. 

That's fine, but I'd really like the spec to actually be very clear about what 
this means. For example, the very last sentence:

bq. n. This set of fields must remain compatible across Spark version

See my previous comment, where I asked the same question: what does that mean? 
Does that mean that you can never add any fields to existing messages? You 
mention the code does some version negotiation, but the spec doesn't mention 
that. So maybe that negotiation is the answer to my question?

Anyway, I'm just a little concerned that there's still some vagueness in the 
spec, for a protocol that is supposed to be stable from the get go.



> Provide a stable application submission gateway in standalone cluster mode
> --
>
> Key: SPARK-5388
> URL: https://issues.apache.org/jira/browse/SPARK-5388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: Stable Spark Standalone Submission.pdf
>
>
> The existing submission gateway in standalone mode is not compatible across 
> Spark versions. If you have a newer version of Spark submitting to an older 
> version of the standalone Master, it is currently not guaranteed to work. The 
> goal is to provide a stable REST interface to replace this channel.
> The first cut implementation will target standalone cluster mode because 
> there are very few messages exchanged. The design, however, should be general 
> enough to potentially support this for other cluster managers too. Note that 
> this is not necessarily required in YARN because we already use YARN's stable 
> interface to submit applications there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode

2015-02-03 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304010#comment-14304010
 ] 

Patrick Wendell commented on SPARK-5388:


The intention for this is really just to take a single RPC that was using Akka 
and add a stable version of it that we are okay supporting long term. It 
doesn't preclude moving to Avro or some other RPC framework as a general thing 
we use across all of Spark. However, that design choice was intentionally 
excluded from this decision given all the complexities you bring up. As for 
doing some basic message dispatching on our own - there is only a small amount 
of very straightforward code related to this. Adopting Avro would be overkill 
for this.

In the current implementation the client and server exchange Spark versions, so 
this is the basis of reasoning about version changes - maybe it wasn't in the 
design doc. In terms of evolvability, the way you do this is that you only add 
new functionality over time, and you never remove fields from messages. This is 
similar to the API contract of the history logs with the history server. So the 
idea is that newer clients would implement a superset of the messages and 
fields of older ones.

Adding v1 seems like a good idea in case this evolves into something public or 
more well specified over time. It would just be good to define precisely what 
it means to advance that version identifier. That all matters a lot more if we 
want it to be something others interact with.
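
An illustrative sketch (not the actual submission protocol) of the additive-evolution contract described above: a later version of a message may add optional fields, but existing fields are never removed or repurposed, so older peers can keep parsing what they understand.

{code}
// Version 1 of a hypothetical request message.
case class SubmitRequestV1(
    clientSparkVersion: String,
    appResource: String,
    mainClass: String)

// A later revision only adds an optional field; nothing is removed or renamed,
// so an older server can simply ignore what it does not know about.
case class SubmitRequestV1_1(
    clientSparkVersion: String,
    appResource: String,
    mainClass: String,
    driverMemory: Option[String] = None)
{code}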

> Provide a stable application submission gateway in standalone cluster mode
> --
>
> Key: SPARK-5388
> URL: https://issues.apache.org/jira/browse/SPARK-5388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: Stable Spark Standalone Submission.pdf
>
>
> The existing submission gateway in standalone mode is not compatible across 
> Spark versions. If you have a newer version of Spark submitting to an older 
> version of the standalone Master, it is currently not guaranteed to work. The 
> goal is to provide a stable REST interface to replace this channel.
> The first cut implementation will target standalone cluster mode because 
> there are very few messages exchanged. The design, however, should be general 
> enough to potentially support this for other cluster managers too. Note that 
> this is not necessarily required in YARN because we already use YARN's stable 
> interface to submit applications there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5548) Flaky test: org.apache.spark.util.AkkaUtilsSuite.remote fetch ssl on - untrusted server

2015-02-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303992#comment-14303992
 ] 

Apache Spark commented on SPARK-5548:
-

User 'jacek-lewandowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/4343

> Flaky test: org.apache.spark.util.AkkaUtilsSuite.remote fetch ssl on - 
> untrusted server
> ---
>
> Key: SPARK-5548
> URL: https://issues.apache.org/jira/browse/SPARK-5548
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Jacek Lewandowski
>  Labels: flaky-test
>
> {code}
> sbt.ForkMain$ForkError: Expected exception 
> java.util.concurrent.TimeoutException to be thrown, but 
> akka.actor.ActorNotFound was thrown.
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
>   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
>   at 
> org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply$mcV$sp(AkkaUtilsSuite.scala:373)
>   at 
> org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349)
>   at 
> org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(AkkaUtilsSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at org.apache.spark.util.AkkaUtilsSuite.runTest(AkkaUtilsSuite.scala:37)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterAll$$super$run(AkkaUtilsSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.util.AkkaUtilsSuite.run(AkkaUtilsSuite.scala:37)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.Thre

[jira] [Commented] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames

2015-02-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303982#comment-14303982
 ] 

Yin Huai commented on SPARK-5420:
-

h3. End user APIs added to SQLContext (load related)
h4. Load data through a data source and create a DataFrame
{code}
// This method is used to load data through a file-based data source (e.g.
// Parquet). We will use the default data source; right now, it is Parquet.
def load(path: String): DataFrame
def load(
  dataSourceName: String,
  option: (String, String),
  options: (String, String)*): DataFrame
// This is for Java users.
def load(
  dataSourceName: String,
  options: java.util.Map[String, String]): DataFrame
{code}

h3. End user APIs added to HiveContext (load related)
h4. Create a metastore table for the existing data
{code}
// This method is used to create a table from a file-based data source.
// We will use the default data source; right now, it is Parquet.
def createTable(tableName: String, path: String, allowExisting: Boolean): Unit
def createTable(
  tableName: String,
  dataSourceName: String,
  allowExisting: Boolean,
  option: (String, String),
  options: (String, String)*): Unit
def createTable(
  tableName: String,
  dataSourceName: String,
  schema: StructType,
  allowExisting: Boolean,
  option: (String, String),
  options: (String, String)*): Unit
// This one is for Java users.
def createTable(
  tableName: String,
  dataSourceName: String,
  allowExisting: Boolean,
  options: java.util.Map[String, String]): Unit
// This one is for Java users.
def createTable(
  tableName: String,
  dataSourceName: String,
  schema: StructType,
  allowExisting: Boolean,
  options: java.util.Map[String, String]): Unit
{code} 
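
A hedged usage sketch of the load APIs listed above; the context objects ({{sqlContext}}, {{hive}}), the paths, and the data source name string are assumptions for illustration only.

{code}
// Load through the default (Parquet) data source.
val users = sqlContext.load("/data/users.parquet")

// Load through a named data source with options (source name is an assumption).
val logs = sqlContext.load("org.apache.spark.sql.json", "path" -> "/data/logs.json")

// Register existing data as a metastore table (HiveContext).
hive.createTable("users_table", "/data/users.parquet", allowExisting = true)
{code}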

> Cross-language load/store functions for creating and saving DataFrames
> --
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> We should have standard APIs for loading or saving a table from a data 
> store. Per comment discussion:
> {code}
> def loadData(datasource: String, parameters: Map[String, String]): DataFrame
> def loadData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> def storeData(datasource: String, parameters: Map[String, String]): DataFrame
> def storeData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> {code}
> Python should have this too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames

2015-02-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303967#comment-14303967
 ] 

Yin Huai commented on SPARK-5420:
-

I am copying the summary of write-related interfaces from 
[SPARK-5501|https://issues.apache.org/jira/browse/SPARK-5501?focusedCommentId=14303760&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14303760]
 here.
h3. End user APIs added to DataFrame (write related)
h4. Save a DataFrame as a table
When a user is using *HiveContext*, he/she can save a DataFrame as a table. The 
metadata of this table will be stored in metastore.
{code}
// When a data source name is not specified, we will use our default one
// (configured by spark.sql.default.datasource). Right now, it is Parquet.
def saveAsTable(tableName: String): Unit
def saveAsTable(
  tableName: String,
  dataSourceName: String,
  option: (String, String),
  options: (String, String)*): Unit
// This is for Java users.
def saveAsTable(
  tableName: String,
  dataSourceName: String,
  options: java.util.Map[String, String]): Unit
{code}

h4. Save a DataFrame to a data source
Users can save a DataFrame with a data source.
{code}
// This method is used to save a DataFrame to a file-based data source (e.g.
// Parquet). We will use the default data source; right now, it is Parquet.
def save(path: String): Unit
def save(
  dataSourceName: String,
  option: (String, String),
  options: (String, String)*): Unit
// This is for Java users.
def save(
  dataSourceName: String,
  options: java.util.Map[String, String]): Unit
{code}

h4. Insert data into a table from a DataFrame
Users can insert the data of a DataFrame into an existing table created by the 
data source API.
{code}
// Appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String): Unit
// When overwrite is true, inserts the data of this DataFrame into the table
// tableName, overwriting existing data.
// When overwrite is false, appends the data of this DataFrame to the table
// tableName.
def insertInto(tableName: String, overwrite: Boolean): Unit
{code}
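
A hedged usage sketch of the write APIs above; {{df}} is assumed to be an existing DataFrame, and the table names, paths, and data source name are illustrative.

{code}
df.saveAsTable("events")                          // default data source (Parquet), via HiveContext
df.save("/data/events.parquet")                   // save to a path with the default data source
df.save("org.apache.spark.sql.parquet", "path" -> "/data/events_copy.parquet")
df.insertInto("events", overwrite = false)        // append into an existing data source table
{code}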

> Cross-language load/store functions for creating and saving DataFrames
> --
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> We should have standard APIs for loading or saving a table from a data 
> store. Per comment discussion:
> {code}
> def loadData(datasource: String, parameters: Map[String, String]): DataFrame
> def loadData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> def storeData(datasource: String, parameters: Map[String, String]): DataFrame
> def storeData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> {code}
> Python should have this too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5501) Write support for the data source API

2015-02-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303760#comment-14303760
 ] 

Yin Huai edited comment on SPARK-5501 at 2/3/15 8:55 PM:
-

h3. End user APIs added to DataFrame (write related)
h4. Save a DataFrame as a table
When a user is using *HiveContext*, he/she can save a DataFrame as a table. The 
metadata of this table will be stored in metastore.
{code}
// When a data source name is not specified, we will use our default one
// (configured by spark.sql.default.datasource). Right now, it is Parquet.
def saveAsTable(tableName: String): Unit
def saveAsTable(
  tableName: String,
  dataSourceName: String,
  option: (String, String),
  options: (String, String)*): Unit
// This is for Java users.
def saveAsTable(
  tableName: String,
  dataSourceName: String,
  options: java.util.Map[String, String]): Unit
{code}

h4. Save a DataFrame to a data source
Users can save a DataFrame with a data source.
{code}
// This method is used to save a DataFrame to a file-based data source (e.g.
// Parquet). We will use the default data source; right now, it is Parquet.
def save(path: String): Unit
def save(
  dataSourceName: String,
  option: (String, String),
  options: (String, String)*): Unit
// This is for Java users.
def save(
  dataSourceName: String,
  options: java.util.Map[String, String]): Unit
{code}

h4. Insert data into a table from a DataFrame
Users can insert the data of a DataFrame into an existing table created by the 
data source API.
{code}
// Appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String): Unit
// When overwrite is true, inserts the data of this DataFrame into the table
// tableName, overwriting existing data.
// When overwrite is false, appends the data of this DataFrame to the table
// tableName.
def insertInto(tableName: String, overwrite: Boolean): Unit
{code}


was (Author: yhuai):
h3. End user APIs added to DataFrame
h4. Save a DataFrame as a table
When a user is using *HiveContext*, he/she can save a DataFrame as a table. The 
metadata of this table will be stored in metastore.
{code}
// When a data source name is not specified, we will use our default one 
(configured by spark.sql.default.datasource). Right now, it is Parquet.
def saveAsTable(tableName: String): Unit
def saveAsTable(
  tableName: String,
  dataSourceName: String,
  option: (String, String),
  options: (String, String)*): Unit
// This is for Java users.
def saveAsTable(
  tableName: String,
  dataSourceName: String,
  options: java.util.Map[String, String]): Unit
{code}

h4. Save a DataFrame to a data source
Users can save a DataFrame with a data source.
{code}
//This method is used to save a DataFrame to a file based data source (e.g. 
Parquet). We will use the default data source . Right now, it is Parquet.
def save(path: String): Unit
def save(
  dataSourceName: String,
  option: (String, String),
  options: (String, String)*): Unit
// This is for Java users.
def save(
  dataSourceName: String,
  options: java.util.Map[String, String]): Unit
{code}

h4. Insert data into a table from a DataFrame
Users can insert the data of DataFrame to an existing table created by the data 
source API.
{code}
// Appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String): Unit
// When overwrite is true, inserts the data of this DataFrame to the table 
tableName and overwrite existing data.
// When overwrite is false, A=appends the data of this DataFrame to the table 
tableName.
def insertInto(tableName: String, overwrite: Boolean): Unit
{code}

> Write support for the data source API
> -
>
> Key: SPARK-5501
> URL: https://issues.apache.org/jira/browse/SPARK-5501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames

2015-02-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303771#comment-14303771
 ] 

Yin Huai edited comment on SPARK-5420 at 2/3/15 8:54 PM:
-

This JIRA is fixed by the attached PR. I am resolving it.


was (Author: yhuai):
This JIRA is fixed by the attached PR. A summary of added interfaces can be 
found in https://issues.apache.org/jira/browse/SPARK-5501. 

> Cross-language load/store functions for creating and saving DataFrames
> --
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> We should have standard API's for loading or saving a table from a data 
> store. Per comment discussion:
> {code}
> def loadData(datasource: String, parameters: Map[String, String]): DataFrame
> def loadData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> def storeData(datasource: String, parameters: Map[String, String]): DataFrame
> def storeData(datasource: String, parameters: java.util.Map[String, String]): 
> DataFrame
> {code}
> Python should have this too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)

2015-02-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-4768.
-
Resolution: Duplicate

SPARK-4987 has resolved the issues. I am resolving this one.

> Add Support For Impala Encoded Timestamp (INT96)
> 
>
> Key: SPARK-4768
> URL: https://issues.apache.org/jira/browse/SPARK-4768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Pat McDonough
>Assignee: Yin Huai
>Priority: Blocker
> Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, 
> string_timestamp.gz
>
>
> Impala is using INT96 for timestamps. Spark SQL should be able to read this 
> data despite the fact that it is not part of the spec.
> Perhaps adding a flag to act like impala when reading parquet (like we do for 
> strings already) would be useful.
> Here's an example of the error you might see:
> {code}
> Caused by: java.lang.RuntimeException: Potential loss of precision: cannot 
> convert INT96
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
> at 
> org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:66)
> at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)

2015-02-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4768:

Fix Version/s: 1.3.0

> Add Support For Impala Encoded Timestamp (INT96)
> 
>
> Key: SPARK-4768
> URL: https://issues.apache.org/jira/browse/SPARK-4768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Pat McDonough
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
> Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, 
> string_timestamp.gz
>
>
> Impala is using INT96 for timestamps. Spark SQL should be able to read this 
> data despite the fact that it is not part of the spec.
> Perhaps adding a flag to act like impala when reading parquet (like we do for 
> strings already) would be useful.
> Here's an example of the error you might see:
> {code}
> Caused by: java.lang.RuntimeException: Potential loss of precision: cannot 
> convert INT96
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
> at 
> org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:66)
> at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)

2015-02-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303960#comment-14303960
 ] 

Yin Huai edited comment on SPARK-4768 at 2/3/15 8:51 PM:
-

SPARK-4987 has resolved the issue. I am resolving this one.


was (Author: yhuai):
SPARK-4987 has resolved the issues. I am resolving this one.

> Add Support For Impala Encoded Timestamp (INT96)
> 
>
> Key: SPARK-4768
> URL: https://issues.apache.org/jira/browse/SPARK-4768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Pat McDonough
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
> Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, 
> string_timestamp.gz
>
>
> Impala is using INT96 for timestamps. Spark SQL should be able to read this 
> data despite the fact that it is not part of the spec.
> Perhaps adding a flag to act like impala when reading parquet (like we do for 
> strings already) would be useful.
> Here's an example of the error you might see:
> {code}
> Caused by: java.lang.RuntimeException: Potential loss of precision: cannot 
> convert INT96
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
> at 
> org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:66)
> at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Tor Myklebust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303961#comment-14303961
 ] 

Tor Myklebust commented on SPARK-5472:
--

It probably does not handle SQL ARRAY types in a sane way.  I would guess that 
type mapping would throw an error if you try to read from a table that has an 
ARRAY column.  I would also guess that type mapping would throw an error if you 
try to write a DataFrame that has an ARRAY column.

JDBCRDD handles partitioning however you instruct it to.  If you give no 
instructions, the entire table is a single partition.  If you give it a 
JDBCPartitioningInfo object, it divides the specified range of the specified 
column into the appropriate number of slices.  If you give it a list of WHERE 
clauses, each WHERE clause corresponds to one partition.
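
To make the three partitioning modes concrete, here is a rough sketch against the {{SQLContext}} JDBC entry points described above, assuming an existing {{SQLContext}} named {{sqlContext}}; the exact method names, the connection URL, the table, and the column bounds are illustrative assumptions, not a reference to the final API:
{code}
// One partition: the whole table is read in a single query.
val whole = sqlContext.jdbc("jdbc:postgresql://host/db", "people")

// Range partitioning: split the "id" column between 0 and 10000 into 8 slices,
// producing one partition (and one generated WHERE clause) per slice.
val ranged = sqlContext.jdbc("jdbc:postgresql://host/db", "people", "id", 0L, 10000L, 8)

// Explicit WHERE clauses: each syntactically valid condition becomes one partition.
val custom = sqlContext.jdbc(
  "jdbc:postgresql://host/db",
  "people",
  Array("country = 'CA'", "country = 'US'", "country NOT IN ('CA', 'US')"))
{code}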

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5548) Flaky test: org.apache.spark.util.AkkaUtilsSuite.remote fetch ssl on - untrusted server

2015-02-03 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303953#comment-14303953
 ] 

Jacek Lewandowski commented on SPARK-5548:
--

You are absolutely right [~joshrosen]. This is due to the inconsistent behaviour of the {{Await.result}} and {{resolveOne}} methods: the first fails with {{TimeoutException}}, while the second (in case of a timeout) fails with {{ActorNotFoundException}}. I'll fix it right away.
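
A minimal sketch of the mismatch using Akka's standard actor-selection API; the system name, host, and actor path are made up for illustration:
{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.Await
import scala.concurrent.duration._
import akka.actor.{ActorNotFound, ActorSystem}
import akka.util.Timeout

val system = ActorSystem("driver")
val timeout = Timeout(5.seconds)
// A selection that will not resolve in time (the remote address is unreachable).
val selection = system.actorSelection("akka.tcp://remote@unreachable-host:7077/user/worker")

try {
  Await.result(selection.resolveOne(timeout.duration), timeout.duration)
} catch {
  case _: ActorNotFound => // what resolveOne's future actually fails with on timeout
  case _: TimeoutException => // what the flaky test originally intercepted
}
{code}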


> Flaky test: org.apache.spark.util.AkkaUtilsSuite.remote fetch ssl on - 
> untrusted server
> ---
>
> Key: SPARK-5548
> URL: https://issues.apache.org/jira/browse/SPARK-5548
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Jacek Lewandowski
>  Labels: flaky-test
>
> {code}
> sbt.ForkMain$ForkError: Expected exception 
> java.util.concurrent.TimeoutException to be thrown, but 
> akka.actor.ActorNotFound was thrown.
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at org.scalatest.Assertions$class.intercept(Assertions.scala:1004)
>   at org.scalatest.FunSuite.intercept(FunSuite.scala:1555)
>   at 
> org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply$mcV$sp(AkkaUtilsSuite.scala:373)
>   at 
> org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349)
>   at 
> org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(AkkaUtilsSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at org.apache.spark.util.AkkaUtilsSuite.runTest(AkkaUtilsSuite.scala:37)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterAll$$super$run(AkkaUtilsSuite.scala:37)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.util.AkkaUtilsSuite.run(AkkaUtilsSuite.scala:37)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call

[jira] [Resolved] (SPARK-4987) Parquet support for timestamp type

2015-02-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-4987.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Thank you [~adrian-wang]!

> Parquet support for timestamp type
> --
>
> Key: SPARK-4987
> URL: https://issues.apache.org/jira/browse/SPARK-4987
> Project: Spark
>  Issue Type: New Feature
>Reporter: Adrian Wang
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4709) Spark SQL support error reading Parquet with timestamp type field

2015-02-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-4709.
-
  Resolution: Duplicate
Target Version/s:   (was: 1.1.0)

Seems it duplicates SPARK-4987. I am resolving it.

> Spark SQL support error reading Parquet with timestamp type field
> -
>
> Key: SPARK-4709
> URL: https://issues.apache.org/jira/browse/SPARK-4709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Felix Cheung
>Priority: Critical
>
> Have a data set on Parquet format (created by Hive) with a field of the 
> timestamp type. Reading this causes an exception:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val p = sqlContext.parquetFile("hdfs:///data/parquetdata")
> java.lang.RuntimeException: Potential loss of precision: cannot convert INT96
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:66)
>   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
>   at $iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC.<init>(<console>:28)
>   at $iwC.<init>(<console>:30)
>   at <init>(<console>:32)
>   at .<init>(<console>:36)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl

[jira] [Updated] (SPARK-4987) Parquet support for timestamp type

2015-02-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4987:

Target Version/s: 1.3.0

> Parquet support for timestamp type
> --
>
> Key: SPARK-4987
> URL: https://issues.apache.org/jira/browse/SPARK-4987
> Project: Spark
>  Issue Type: New Feature
>Reporter: Adrian Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Tor Myklebust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303949#comment-14303949
 ] 

Tor Myklebust commented on SPARK-5472:
--

This is good feedback.  Thanks.

You don't actually have to pass real table names to JDBCRDD.  For instance, 
`(SELECT name, id FROM people)` is a perfectly valid table name to JDBCRDD.  As 
long as `SELECT columnlist FROM tablename WHERE conditions` is a valid SQL 
query, anything goes.  So, insofar as you trust the underlying database to 
optimise `SELECT columnlist FROM tablename WHERE filters AND 
partitioningcondition` into something reasonable, you should be able to avoid 
creating a view in the external database.

Custom partitioning can be done with the new JDBCRDD API as well; there is an 
interface in SQLContext that just takes a list of syntactically-valid 
conditions and creates one partition per condition.
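
For example, a sketch of the pattern described above, assuming an existing {{SQLContext}} named {{sqlContext}}; the {{jdbc}} entry point, connection URL, and alias are illustrative assumptions:
{code}
// A parenthesised query can stand in wherever a table name is expected, because the
// generated statement has the shape: SELECT columnlist FROM tablename WHERE conditions.
val subset = sqlContext.jdbc(
  "jdbc:postgresql://host/db",
  "(SELECT name, id FROM people WHERE id > 100) AS people_subset")
{code}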

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303943#comment-14303943
 ] 

Reynold Xin commented on SPARK-5472:


Actually I think it already supports arbitrary queries. There is even a test 
case for it: 

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L241

Would that satisfy your use case? 

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303933#comment-14303933
 ] 

Reynold Xin edited comment on SPARK-5472 at 2/3/15 8:39 PM:


What if we expand the JDBC data source to support arbitrary queries, in 
addition to tables/views?




was (Author: rxin):
What if we expand the JDBC data source to support arbitrary queries?



> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5472:
---
Comment: was deleted

(was: What if we expand the JDBC data source to support arbitrary queries?

)

> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303933#comment-14303933
 ] 

Reynold Xin commented on SPARK-5472:


What if we expand the JDBC data source to support arbitrary queries?



> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database

2015-02-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303934#comment-14303934
 ] 

Reynold Xin commented on SPARK-5472:


What if we expand the JDBC data source to support arbitrary queries?



> Add support for reading from and writing to a JDBC database
> ---
>
> Key: SPARK-5472
> URL: https://issues.apache.org/jira/browse/SPARK-5472
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Tor Myklebust
>Assignee: Tor Myklebust
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It would be nice to be able to make a table in a JDBC database appear as a 
> table in Spark SQL.  This would let users, for instance, perform a JOIN 
> between a DataFrame in Spark SQL with a table in a Postgres database.
> It might also be nice to be able to go the other direction -- save a 
> DataFrame to a database -- for instance in an ETL job.
> Edited to clarify:  Both of these tasks are certainly possible to accomplish 
> at the moment with a little bit of ad-hoc glue code.  However, there is no 
> fundamental reason why the user should need to supply the table schema and 
> some code for pulling data out of a ResultSet row into a Catalyst Row 
> structure when this information can be derived from the schema of the 
> database table itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-02-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303927#comment-14303927
 ] 

Reynold Xin commented on SPARK-5260:


BTW I've also added you to the contributor list.

> Expose JsonRDD.allKeysWithValueTypes() in a utility class 
> --
>
> Key: SPARK-5260
> URL: https://issues.apache.org/jira/browse/SPARK-5260
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Corey J. Nolet
>
> I have found this method extremely useful when implementing my own strategy 
> for inferring a schema from parsed json. For now, I've actually copied the 
> method right out of the JsonRDD class into my own project but I think it 
> would be immensely useful to keep the code in Spark and expose it publicly 
> somewhere else- like an object called JsonSchema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


