[jira] [Commented] (SPARK-5585) Flaky test: Python regression
[ https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304729#comment-14304729 ] Apache Spark commented on SPARK-5585: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4358 > Flaky test: Python regression > - > > Key: SPARK-5585 > URL: https://issues.apache.org/jira/browse/SPARK-5585 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Patrick Wendell >Assignee: Davies Liu >Priority: Critical > Labels: flaky-test > > Hey [~davies] any chance you can take a look at this? The master build is > having random python failures fairly often. Not quite sure what is going on: > {code} > 0inputs+128outputs (0major+13320minor)pagefaults 0swaps > Run mllib tests ... > Running test: pyspark/mllib/classification.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.43user 0.12system 0:14.85elapsed 3%CPU (0avgtext+0avgdata 94272maxresident)k > 0inputs+280outputs (0major+12627minor)pagefaults 0swaps > Running test: pyspark/mllib/clustering.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.35user 0.11system 0:12.63elapsed 3%CPU (0avgtext+0avgdata 93568maxresident)k > 0inputs+88outputs (0major+12532minor)pagefaults 0swaps > Running test: pyspark/mllib/feature.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.28user 0.08system 0:05.73elapsed 6%CPU (0avgtext+0avgdata 93424maxresident)k > 0inputs+32outputs (0major+12548minor)pagefaults 0swaps > Running test: pyspark/mllib/linalg.py > 0.16user 0.05system 0:00.22elapsed 98%CPU (0avgtext+0avgdata > 89888maxresident)k > 0inputs+0outputs (0major+8099minor)pagefaults 0swaps > Running test: pyspark/mllib/rand.py > 
tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.25user 0.08system 0:05.42elapsed 6%CPU (0avgtext+0avgdata 87872maxresident)k > 0inputs+0outputs (0major+11849minor)pagefaults 0swaps > Running test: pyspark/mllib/recommendation.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.32user 0.09system 0:11.42elapsed 3%CPU (0avgtext+0avgdata 94256maxresident)k > 0inputs+32outputs (0major+11797minor)pagefaults 0swaps > Running test: pyspark/mllib/regression.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.53user 0.17system 0:23.53elapsed 3%CPU (0avgtext+0avgdata 99600maxresident)k > 0inputs+48outputs (0major+12402minor)pagefaults 0swaps > Running test: pyspark/mllib/stat/_statistics.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.29user 0.09system 0:08.03elapsed 4%CPU (0avgtext+0avgdata 92656maxresident)k > 0inputs+48outputs (0major+12508minor)pagefaults 0swaps > Running test: pyspark/mllib/tree.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.57user 0.16system 0:25.30elapsed 2%CPU (0avgtext+0avgdata 94400maxresident)k > 0inputs+144outputs (0major+12600minor)pagefaults 0swaps > Running test: pyspark/mllib/util.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.20user 0.06system 0:08.08elapsed 3%CPU (0avgtext+0avgdata 92768maxresident)k > 0inputs+56outputs (0major+12474minor)pagefaults 0swaps > Running test: pyspark/mllib/tests.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > 
classpath > .F/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: > VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or > function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`. > VisibleDeprecationWarning) > ./usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: > VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or > function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`. > VisibleDeprecationWarning) > /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: > VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or > function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`. > VisibleDeprecationWarnin
[jira] [Commented] (SPARK-5585) Flaky test: Python regression
[ https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304724#comment-14304724 ] Davies Liu commented on SPARK-5585: --- [~pwendell] I cannot reproduce it locally; I will add a seed for it and test it several times on Jenkins.
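The fix Davies mentions, seeding the random data generation, can be sketched in plain Python. This is an illustrative, hypothetical example of why a seed de-flakes a statistical test, not the actual MLlib test code:

```python
import random

def generate_points(n, seed=None):
    """Generate synthetic regression points y = 2x + noise; a fixed seed
    makes the data (and hence any assertion on it) deterministic."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        points.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))
    return points

# Without a seed, two runs differ and a tight tolerance check is flaky;
# with a seed, every run sees identical data:
a = generate_points(100, seed=42)
b = generate_points(100, seed=42)
print(a == b)  # True
```

The trade-off is that a seeded test no longer exercises new random inputs on each run, which is why re-running it several times on Jenkins before merging is a reasonable sanity check.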
[jira] [Commented] (SPARK-5587) Support change database owner
[ https://issues.apache.org/jira/browse/SPARK-5587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304721#comment-14304721 ] Apache Spark commented on SPARK-5587: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4357 > Support change database owner > -- > > Key: SPARK-5587 > URL: https://issues.apache.org/jira/browse/SPARK-5587 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > > Support changing the database owner: > create database db_alter_onr; > describe database db_alter_onr; > alter database db_alter_onr set owner user user1; > describe database db_alter_onr; > alter database db_alter_onr set owner role role1; > describe database db_alter_onr; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5587) Support change database owner
wangfei created SPARK-5587: -- Summary: Support change database owner Key: SPARK-5587 URL: https://issues.apache.org/jira/browse/SPARK-5587 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei
[jira] [Created] (SPARK-5586) Automatically provide sqlContext in Spark shell
Patrick Wendell created SPARK-5586: -- Summary: Automatically provide sqlContext in Spark shell Key: SPARK-5586 URL: https://issues.apache.org/jira/browse/SPARK-5586 Project: Spark Issue Type: Improvement Components: Spark Shell, SQL Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.3.0 A simple patch, but we should create a sqlContext (and, if supported by the build, a Hive context) in the Spark shell when it's created, and import the DSL. We can just call it sqlContext. This would save us so much time writing code examples :P
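The proposed shell initialization, try a Hive-backed context first and fall back to a plain SQL context if the build lacks Hive support, can be sketched generically. The factory names and error handling below are assumptions for illustration, not Spark's actual shell code:

```python
def pick_context(factories):
    """Return the first context that can be constructed, trying factories
    in preference order (hypothetical sketch of Hive-first fallback)."""
    for make in factories:
        try:
            return make()
        except Exception:
            continue  # this build can't provide that context; try the next
    raise RuntimeError("no usable context")

def hive_context():
    # Simulate a build compiled without Hive support.
    raise ImportError("build without Hive")

def sql_context():
    return "sqlContext"

print(pick_context([hive_context, sql_context]))  # -> sqlContext
```

Whatever the fallback order, the shell binds the result to a single well-known name (`sqlContext`), so examples work unchanged on both Hive and non-Hive builds.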
[jira] [Updated] (SPARK-5586) Automatically provide sqlContext in Spark shell
[ https://issues.apache.org/jira/browse/SPARK-5586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5586: --- Fix Version/s: (was: 1.3.0)
[jira] [Updated] (SPARK-5586) Automatically provide sqlContext in Spark shell
[ https://issues.apache.org/jira/browse/SPARK-5586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5586: --- Priority: Critical (was: Major)
[jira] [Commented] (SPARK-5068) When the path not found in the hdfs,we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304710#comment-14304710 ] Apache Spark commented on SPARK-5068: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/4356 > When the path not found in the hdfs,we can't get the result > --- > > Key: SPARK-5068 > URL: https://issues.apache.org/jira/browse/SPARK-5068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: jeanlyn > > When a partition path is found in the metastore but not in HDFS, it > will cause some problems, as follows: > {noformat} > hive> show partitions partition_test; > OK > dt=1 > dt=2 > dt=3 > dt=4 > Time taken: 0.168 seconds, Fetched: 4 row(s) > {noformat} > {noformat} > hive> dfs -ls /user/jeanlyn/warehouse/partition_test; > Found 3 items > drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 > /user/jeanlyn/warehouse/partition_test/dt=1 > drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 > /user/jeanlyn/warehouse/partition_test/dt=3 > drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 > /user/jeanlyn/warehouse/partition_test/dt=4 > {noformat} > When I run the SQL > {noformat} > select * from partition_test limit 10 > {noformat} in *hive*, there is no problem, but when I run it in *spark-sql* I get the > following error: > {noformat} > Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: > Input path does not exist: > hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) > at 
scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) > at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) > at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) > at 
org.apache.spark.rdd.RDD.collect(RDD.scala:780) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) > at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) > at org.apache.spark.sql.hive.testpartition$.main(test.scala:23) > at org.apache.spark.sql.hive.testpartition.main(test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execu
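The failure mode above, a stale metastore entry (`dt=2`) pointing at a directory that no longer exists, can be guarded against by filtering partitions to those actually present on the filesystem. A minimal plain-Python sketch of that guard (illustrative only; this is not the actual fix in the linked PR):

```python
import os
import tempfile

def existing_partitions(root, partitions):
    """Keep only partitions whose directory actually exists on disk, so a
    stale metastore entry does not abort the whole read (hypothetical guard)."""
    return [p for p in partitions if os.path.isdir(os.path.join(root, p))]

# Reproduce the scenario from the report: metastore lists dt=1..4,
# but dt=2 is missing from the warehouse directory.
root = tempfile.mkdtemp()
for p in ("dt=1", "dt=3", "dt=4"):
    os.makedirs(os.path.join(root, p))

print(existing_partitions(root, ["dt=1", "dt=2", "dt=3", "dt=4"]))
# -> ['dt=1', 'dt=3', 'dt=4']
```

This matches Hive's observed behavior of silently skipping the missing partition rather than throwing `InvalidInputException`.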
[jira] [Resolved] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-5341. Resolution: Fixed Fix Version/s: 1.3.0 > Support maven coordinates in spark-shell and spark-submit > - > > Key: SPARK-5341 > URL: https://issues.apache.org/jira/browse/SPARK-5341 > Project: Spark > Issue Type: New Feature > Components: Deploy, Spark Shell >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Critical > Fix For: 1.3.0 > > > This feature will allow users to provide the maven coordinates of jars they > wish to use in their spark application. Coordinates can be a comma-delimited > list and be supplied like: > ```spark-submit --maven org.apache.example.a,org.apache.example.b``` > This feature will also be added to spark-shell (where it is more critical to > have this feature)
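To illustrate what the feature resolves, a `group:artifact:version` Maven coordinate maps mechanically onto a jar path in a repository. The sketch below shows that mapping in plain Python; the coordinate values are made up, and real resolution (transitive dependencies, mirrors, caching) is delegated to a dependency resolver, not hand-built like this:

```python
def coordinate_to_path(coord):
    """Map a 'group:artifact:version' Maven coordinate to its conventional
    repository path (illustrative; a real resolver also fetches dependencies)."""
    group, artifact, version = coord.split(":")
    return "{}/{}/{}/{}-{}.jar".format(
        group.replace(".", "/"),  # dots in the group id become directories
        artifact, version, artifact, version,
    )

# A comma-delimited list, as the flag accepts it:
coords = "com.example:lib-a:1.0,com.example:lib-b:2.1".split(",")
for c in coords:
    print(coordinate_to_path(c))
# com/example/lib-a/1.0/lib-a-1.0.jar
# com/example/lib-b/2.1/lib-b-2.1.jar
```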
[jira] [Updated] (SPARK-5341) Support maven coordinates in spark-shell and spark-submit
[ https://issues.apache.org/jira/browse/SPARK-5341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5341: --- Assignee: Burak Yavuz
[jira] [Resolved] (SPARK-4969) Add binaryRecords support to streaming
[ https://issues.apache.org/jira/browse/SPARK-4969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4969. -- Resolution: Fixed Fix Version/s: 1.3.0 > Add binaryRecords support to streaming > -- > > Key: SPARK-4969 > URL: https://issues.apache.org/jira/browse/SPARK-4969 > Project: Spark > Issue Type: Improvement > Components: PySpark, Streaming >Affects Versions: 1.2.0 >Reporter: Jeremy Freeman >Priority: Minor > Fix For: 1.3.0 > > > As of Spark 1.2 there is support for loading fixed length records from flat > binary files. This is a useful way to load dense numerical array data into > Spark, especially in scientific computing applications. > We should add support for loading this same file type in Spark Streaming, > both in Scala/Java and in Python.
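The fixed-length record model this feature builds on can be sketched in plain Python: the file is simply chopped into equal-sized byte chunks, one record each. The 8-byte layout (two little-endian float32 values) is assumed here purely for illustration:

```python
import io
import struct

RECORD_LEN = 8  # two float32 values per record (assumed layout)

def read_binary_records(stream, record_length):
    """Yield fixed-length byte records from a flat binary stream, the model
    behind loading dense numerical data; any trailing partial record is dropped."""
    while True:
        chunk = stream.read(record_length)
        if len(chunk) < record_length:
            break
        yield chunk

# Build a tiny flat binary "file" of three records and read it back:
data = b"".join(struct.pack("<ff", float(i), float(i) * 2) for i in range(3))
records = [struct.unpack("<ff", r) for r in read_binary_records(io.BytesIO(data), RECORD_LEN)]
print(records)  # [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
```

The streaming variant proposed here applies the same chunking to files as they appear in a monitored directory, so each batch yields complete records.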
[jira] [Updated] (SPARK-5585) Flaky test: Python regression
[ https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5585: --- Priority: Critical (was: Major)
[jira] [Updated] (SPARK-5585) Flaky test: Python regression
[ https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5585: --- Labels: flaky-test (was: )
[jira] [Created] (SPARK-5585) Flaky test: Python regression
Patrick Wendell created SPARK-5585: -- Summary: Flaky test: Python regression Key: SPARK-5585 URL: https://issues.apache.org/jira/browse/SPARK-5585 Project: Spark Issue Type: Bug Components: MLlib Reporter: Patrick Wendell Assignee: Davies Liu Hey [~davies] any chance you can take a look at this? The master build is having random python failures fairly often. Not quite sure what is going on.
[jira] [Updated] (SPARK-5585) Flaky test: Python regression
[ https://issues.apache.org/jira/browse/SPARK-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5585: --- Affects Version/s: 1.3.0 > Flaky test: Python regression > - > > Key: SPARK-5585 > URL: https://issues.apache.org/jira/browse/SPARK-5585 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Patrick Wendell >Assignee: Davies Liu >Priority: Critical > Labels: flaky-test > > Hey [~davies] any chance you can take a look at this? The master build is > having random python failures fairly often. Not quite sure what is going on: > {code} > 0inputs+128outputs (0major+13320minor)pagefaults 0swaps > Run mllib tests ... > Running test: pyspark/mllib/classification.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.43user 0.12system 0:14.85elapsed 3%CPU (0avgtext+0avgdata 94272maxresident)k > 0inputs+280outputs (0major+12627minor)pagefaults 0swaps > Running test: pyspark/mllib/clustering.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.35user 0.11system 0:12.63elapsed 3%CPU (0avgtext+0avgdata 93568maxresident)k > 0inputs+88outputs (0major+12532minor)pagefaults 0swaps > Running test: pyspark/mllib/feature.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.28user 0.08system 0:05.73elapsed 6%CPU (0avgtext+0avgdata 93424maxresident)k > 0inputs+32outputs (0major+12548minor)pagefaults 0swaps > Running test: pyspark/mllib/linalg.py > 0.16user 0.05system 0:00.22elapsed 98%CPU (0avgtext+0avgdata > 89888maxresident)k > 0inputs+0outputs (0major+8099minor)pagefaults 0swaps > Running test: pyspark/mllib/rand.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > 
classpath > 0.25user 0.08system 0:05.42elapsed 6%CPU (0avgtext+0avgdata 87872maxresident)k > 0inputs+0outputs (0major+11849minor)pagefaults 0swaps > Running test: pyspark/mllib/recommendation.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.32user 0.09system 0:11.42elapsed 3%CPU (0avgtext+0avgdata 94256maxresident)k > 0inputs+32outputs (0major+11797minor)pagefaults 0swaps > Running test: pyspark/mllib/regression.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.53user 0.17system 0:23.53elapsed 3%CPU (0avgtext+0avgdata 99600maxresident)k > 0inputs+48outputs (0major+12402minor)pagefaults 0swaps > Running test: pyspark/mllib/stat/_statistics.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.29user 0.09system 0:08.03elapsed 4%CPU (0avgtext+0avgdata 92656maxresident)k > 0inputs+48outputs (0major+12508minor)pagefaults 0swaps > Running test: pyspark/mllib/tree.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.57user 0.16system 0:25.30elapsed 2%CPU (0avgtext+0avgdata 94400maxresident)k > 0inputs+144outputs (0major+12600minor)pagefaults 0swaps > Running test: pyspark/mllib/util.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > 0.20user 0.06system 0:08.08elapsed 3%CPU (0avgtext+0avgdata 92768maxresident)k > 0inputs+56outputs (0major+12474minor)pagefaults 0swaps > Running test: pyspark/mllib/tests.py > tput: No value for $TERM and no -T specified > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > .F/usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: > VisibleDeprecationWarning: `rank` is 
deprecated; use the `ndim` attribute or > function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`. > VisibleDeprecationWarning) > ./usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: > VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or > function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`. > VisibleDeprecationWarning) > /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: > VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or > function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`. > VisibleDeprecationWarning) > /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:2499: > VisibleDeprecationWarning: `rank` is deprecated
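The repeated VisibleDeprecationWarning in the log above comes from test code still calling NumPy's deprecated `rank` function. A minimal sketch of the migration the warning asks for (plain NumPy, independent of the Spark tests):

```python
import numpy as np

m = np.array([[1.0, 2.0],
              [2.0, 4.0]])

# The deprecated np.rank(m) returned the number of dimensions,
# NOT the linear-algebra rank -- hence the two suggested replacements:
ndim = m.ndim                        # number of array dimensions: 2
mat_rank = np.linalg.matrix_rank(m)  # linear-algebra rank: 1 (rows are proportional)
```

Note the distinction: for this matrix the two replacements give different answers, which is exactly why the old name was confusing.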
[jira] [Updated] (SPARK-5529) Executor is still held while BlockManager has been removed
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5529: --- Description: When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} was: When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. 
Here is the log: 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor > Executor is still hold while BlockManager has been removed > -- > > Key: SPARK-5529 > URL: https://issues.apache.org/jira/browse/SPARK-5529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Hong Shen > > When I run a spark job, one executor is hold, after 120s, blockManager is > removed by driver, but after half an hour before the executor is remove by > driver. Here is the log: > {code} > 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms > exceeds 12ms > > 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on > 10.215.143.14: remote Akka client disassociated > 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote > system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is > now gated for [5000] ms. Reason is: [Disassociated]. 
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet > 0.0 > 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, > 10.215.143.14): ExecutorLostFailure (executor 1 lost) > 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove > non-existent executor 1 > 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) > 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 > from BlockManagerMaster. > 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5583) Support unique join in hive context
[ https://issues.apache.org/jira/browse/SPARK-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-5583. -- Resolution: Won't Fix Going to close this one as won't fix since it is a weird syntax that only Hive has (and as far as I know not that many Hive users know about it). The patch is already on GitHub. We can merge that in the future if there is strong demand. > Support unique join in hive context > --- > > Key: SPARK-5583 > URL: https://issues.apache.org/jira/browse/SPARK-5583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > > Support unique join in hive context: > FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c > (c.key) > SELECT a.key, b.key, c.key; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
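For readers unfamiliar with the syntax being closed here: Hive's UNIQUEJOIN with PRESERVE on every table emits one output row for each key that appears in any of the tables, roughly a multi-way full outer join on the key. A hypothetical Python sketch of those semantics (the dict-based tables are illustrative, not Hive's implementation):

```python
# Hypothetical sketch of UNIQUEJOIN ... PRESERVE semantics: every key that
# appears in ANY preserved table yields an output row, with None standing
# in for tables that lack the key (like a multi-way FULL OUTER JOIN).
def unique_join(*tables):
    # each table is a dict: key -> value
    all_keys = set()
    for t in tables:
        all_keys.update(t)
    return {k: tuple(t.get(k) for t in tables) for k in sorted(all_keys)}

t1 = {"a": 1, "b": 2}
t2 = {"b": 20, "c": 30}
t3 = {"c": 300}
result = unique_join(t1, t2, t3)
# e.g. result["a"] == (1, None, None): key "a" exists only in t1
```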
[jira] [Commented] (SPARK-5475) Java 8 tests are like maintenance overhead.
[ https://issues.apache.org/jira/browse/SPARK-5475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304662#comment-14304662 ] Prashant Sharma commented on SPARK-5475: And this is how it looks after running in Maven, with the command {noformat} build/mvn clean install -Pjava8-tests -DskipTests -T 6 {noformat} http://pastebin.com/SxeHUpEY > Java 8 tests are like maintenance overhead. > > > Key: SPARK-5475 > URL: https://issues.apache.org/jira/browse/SPARK-5475 > Project: Spark > Issue Type: Bug >Reporter: Prashant Sharma > > Having tests that validate that the same code is compatible with Java 8 and Java 7 is like asserting that Java 8 is backward compatible with Java 7 and still supports Java 8 features (lambda expressions, to be precise). This was once necessary, as ASM was not compatible with Java 8, and so on. > Running java8-tests on the current code base results in more than 100 compilation errors; it feels as if they are never run. This is based on the fact that the compilation errors have existed for a pretty long period. So IMHO, we should really remove them if we don't plan to maintain them. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5584) Add Maven Enforcer Plugin dependencyConvergence rule (fail false)
[ https://issues.apache.org/jira/browse/SPARK-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Dale updated SPARK-5584: --- Description: The Spark Maven build uses the Maven Enforcer plugin but does not have a rule for dependencyConvergence (no version conflicts between dependencies/transitive dependencies). Putting this in the current 1.3.0-SNAPSHOT in main pom.xml by adding dependencyConvergence rule: {noformat} org.apache.maven.plugins maven-enforcer-plugin 1.3.1 enforce-versions enforce 3.0.4 ${java.version} {noformat} And running with: mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package -Denforcer.fail=false &> output.txt identified a lot of dependency convergence problems (one of them re-opening SPARK-3039 and fixed via exclude transitive dependency/explicit include of desired version of library). Many convergence errors like: Dependency convergence error for com.thoughtworks.paranamer:paranamer:2.3 paths to dependency are: +-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT +-org.apache.hadoop:hadoop-client:2.4.0 +-org.apache.hadoop:hadoop-common:2.4.0 +-org.apache.avro:avro:1.7.6 +-com.thoughtworks.paranamer:paranamer:2.3 and +-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT +-org.json4s:json4s-jackson_2.10:3.2.10 +-org.json4s:json4s-core_2.10:3.2.10 +-com.thoughtworks.paranamer:paranamer:2.6 [WARNING] Dependency convergence error for io.netty:netty:3.8.0.Final paths to dependency are: +-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT +-org.spark-project.akka:akka-remote_2.10:2.3.4-spark +-io.netty:netty:3.8.0.Final and +-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT +-org.seleniumhq.selenium:selenium-java:2.42.2 +-org.webbitserver:webbit:0.4.14 +-io.netty:netty:3.5.2.Final was: The Spark Maven build uses the Maven Enforcer plugin but does not have a rule for dependencyConvergence (no version conflicts between dependencies/transitive dependencies). 
Putting this in the current 1.3.0-SNAPSHOT in main pom.xml by adding dependencyConvergence rule: {noformat} org.apache.maven.plugins maven-enforcer-plugin 1.3.1 enforce-versions enforce 3.0.4 ${java.version} {noformat} And running with: mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package -Denforcer.fail=false &> output.txt identified a lot of dependency convergence problems (one of them re-opening SPARK-3039 and fixed via exclude transitive dependency/explicit include of desired version of library). > Add Maven Enforcer Plugin dependencyConvergence rule (fail false) > - > > Key: SPARK-5584 > URL: https://issues.apache.org/jira/browse/SPARK-5584 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.2.0 >Reporter: Markus Dale >Priority: Minor > > The Spark Maven build uses the Maven Enforcer plugin but does not have a rule > for dependencyConvergence (no version conflicts between > dependencies/transitive dependencies). > Putting this in the current 1.3.0-SNAPSHOT in main pom.xml by adding > dependencyConvergence rule: > {noformat} > > org.apache.maven.plugins > maven-enforcer-plugin > 1.3.1 > > > enforce-versions > > enforce > > > > > 3.0.4 > > > ${java.version} > > > > > > > > {noformat} > And running with: > mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package > -Denforcer.fail=false &> output.txt > identified a lot of dependency convergence problems (one of them re-opening > SPARK-3039 and fixed via exclude transitive dependency/explicit include of > desired version of library). > Many convergence errors like: > Dependency convergence error for com.thoughtworks.paranamer:paranamer:2.3 > paths to dependency are: > +-org.apache.spark:spark-core_2.10:1.3.
[jira] [Resolved] (SPARK-5237) UDTF doesn't work with multi-alias of multi-columns as output in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-5237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang resolved SPARK-5237. Resolution: Duplicate SPARK-5383 should solve this. > UDTF doesn't work with multi-alias of multi-columns as output in Spark SQL > > > Key: SPARK-5237 > URL: https://issues.apache.org/jira/browse/SPARK-5237 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Yi Zhou > > Hive queries with a UDTF don't work in Spark SQL, for example: > SELECT extract_sentiment(pr_item_sk,pr_review_content) AS (pr_item_sk, > review_sentence, sentiment, sentiment_word) > FROM product_reviews; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5584) Add Maven Enforcer Plugin dependencyConvergence rule (fail false)
Markus Dale created SPARK-5584: -- Summary: Add Maven Enforcer Plugin dependencyConvergence rule (fail false) Key: SPARK-5584 URL: https://issues.apache.org/jira/browse/SPARK-5584 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.2.0 Reporter: Markus Dale Priority: Minor The Spark Maven build uses the Maven Enforcer plugin but does not have a rule for dependencyConvergence (no version conflicts between dependencies/transitive dependencies). Putting this in the current 1.3.0-SNAPSHOT main pom.xml by adding the dependencyConvergence rule:
{noformat}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.3.1</version>
  <executions>
    <execution>
      <id>enforce-versions</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <requireMavenVersion>
            <version>3.0.4</version>
          </requireMavenVersion>
          <requireJavaVersion>
            <version>${java.version}</version>
          </requireJavaVersion>
          <dependencyConvergence/>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
{noformat}
And running with: mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package -Denforcer.fail=false &> output.txt identified a lot of dependency convergence problems (one of them re-opening SPARK-3039, fixed by excluding the transitive dependency and explicitly including the desired version of the library). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4795) Redesign the "primitive type => Writable" implicit APIs to make them be activated automatically
[ https://issues.apache.org/jira/browse/SPARK-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4795. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Shixiong Zhu > Redesign the "primitive type => Writable" implicit APIs to make them be > activated automatically > --- > > Key: SPARK-4795 > URL: https://issues.apache.org/jira/browse/SPARK-4795 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 1.3.0 > > > Try to redesign the "primitive type => Writable" implicit APIs to make them > be activated automatically and without breaking compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5583) Support unique join in hive context
[ https://issues.apache.org/jira/browse/SPARK-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304624#comment-14304624 ] Apache Spark commented on SPARK-5583: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4354 > Support unique join in hive context > --- > > Key: SPARK-5583 > URL: https://issues.apache.org/jira/browse/SPARK-5583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > > Support unique join in hive context: > FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c > (c.key) > SELECT a.key, b.key, c.key; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5578) Provide a convenient way for Scala users to use UDFs
[ https://issues.apache.org/jira/browse/SPARK-5578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5578. Resolution: Fixed Fix Version/s: 1.3.0 > Provide a convenient way for Scala users to use UDFs > > > Key: SPARK-5578 > URL: https://issues.apache.org/jira/browse/SPARK-5578 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Blocker > Fix For: 1.3.0 > > > Dsl.udf(...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5583) Support unique join in hive context
wangfei created SPARK-5583: -- Summary: Support unique join in hive context Key: SPARK-5583 URL: https://issues.apache.org/jira/browse/SPARK-5583 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Support unique join in hive context: FROM UNIQUEJOIN PRESERVE T1 a (a.key), PRESERVE T2 b (b.key), PRESERVE T3 c (c.key) SELECT a.key, b.key, c.key; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5367) support star expression in udf
[ https://issues.apache.org/jira/browse/SPARK-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304611#comment-14304611 ] Apache Spark commented on SPARK-5367: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4353 > support star expression in udf > -- > > Key: SPARK-5367 > URL: https://issues.apache.org/jira/browse/SPARK-5367 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: wangfei > Fix For: 1.3.0 > > > Currently Spark SQL does not support a star expression in a UDF; the following SQL fails: > ``` > select concat( * ) from src > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
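One way to think about the requested feature: before evaluating the UDF, the analyzer would expand `*` into the table's column list. A hypothetical sketch of that expansion step (names and structure are illustrative, not Spark's internals):

```python
# Hypothetical sketch: support a star expression inside a function call
# by splicing the table's column names in place of "*", i.e. rewriting
# concat(*) on a table with columns (key, value) as concat(key, value).
def expand_star(args, columns):
    out = []
    for a in args:
        if a == "*":
            out.extend(columns)   # splice in every column of the table
        else:
            out.append(a)
    return out

cols = ["key", "value"]
expanded = expand_star(["*"], cols)
# expanded == ["key", "value"], so concat(*) becomes concat(key, value)
```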
[jira] [Commented] (SPARK-5582) History server does not list anything if log root contains an empty directory
[ https://issues.apache.org/jira/browse/SPARK-5582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304599#comment-14304599 ] Apache Spark commented on SPARK-5582: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/4352 > History server does not list anything if log root contains an empty directory > - > > Key: SPARK-5582 > URL: https://issues.apache.org/jira/browse/SPARK-5582 > Project: Spark > Issue Type: Bug >Reporter: Marcelo Vanzin > > As summary says. Exception from logs: > {noformat} > 15/02/03 17:35:10.292 > pool-1-thread-1-ScalaTest-running-FsHistoryProviderSuite ERROR > FsHistoryProvider: Exception in checking for event log updates > java.lang.UnsupportedOperationException: empty.max > at > scala.collection.TraversableOnce$class.max(TraversableOnce.scala:216) > at scala.collection.AbstractTraversable.max(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$getModificationTime(FsHistoryProvider.scala:315) > {noformat} > Patch coming up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304588#comment-14304588 ] Tor Myklebust commented on SPARK-5472: -- If the data in the underlying table changes, this code might not work reliably; some partitions might have new data and others won't. If you change the schema of the underlying table after you make it visible to Spark SQL, retrieving data will (probably) blow up. Whatever behaviour you might observe from this code when given a changing underlying table will not be behaviour you can rely on. > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
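A toy illustration of the caveat above: two partitions of the same logical table fetched at different times can straddle a concurrent write, producing a combined result that matches no single snapshot of the table (pure Python, just to show the race, not the JDBC data source itself):

```python
# Two partitions read the same logical table at different times; a write
# lands in between, so the combined result matches neither snapshot.
table = [1, 2, 3, 4]

def read_partition(rows, lo, hi):
    # stands in for one partition's range query: lo <= value < hi
    return [r for r in rows if lo <= r < hi]

part1 = read_partition(table, 0, 3)   # first partition fetched: [1, 2]
table.append(2)                       # concurrent insert into part1's range
part2 = read_partition(table, 3, 10)  # second partition fetched later: [3, 4]
combined = part1 + part2
# combined == [1, 2, 3, 4]; the new row 2 is silently missing even though
# it existed before the second fetch completed.
```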
[jira] [Created] (SPARK-5582) History server does not list anything if log root contains an empty directory
Marcelo Vanzin created SPARK-5582: - Summary: History server does not list anything if log root contains an empty directory Key: SPARK-5582 URL: https://issues.apache.org/jira/browse/SPARK-5582 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin As summary says. Exception from logs: {noformat} 15/02/03 17:35:10.292 pool-1-thread-1-ScalaTest-running-FsHistoryProviderSuite ERROR FsHistoryProvider: Exception in checking for event log updates java.lang.UnsupportedOperationException: empty.max at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:216) at scala.collection.AbstractTraversable.max(Traversable.scala:105) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$getModificationTime(FsHistoryProvider.scala:315) {noformat} Patch coming up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
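The `empty.max` failure above is Scala's `Seq.max` thrown on an empty collection; Python's `max()` fails the same way. A sketch of the defensive pattern the fix presumably needs, shown in Python (the directory handling here is illustrative, not FsHistoryProvider's code):

```python
import os
import tempfile

log_dir = tempfile.mkdtemp()  # stands in for an empty event-log directory
files = os.listdir(log_dir)   # empty list

# max(files) here would raise: ValueError: max() arg is an empty sequence.
# Supplying a default (or skipping empty directories entirely) avoids
# the crash and lets the scan continue past the empty directory.
mod_time = max(
    (os.path.getmtime(os.path.join(log_dir, f)) for f in files),
    default=0.0,
)
```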
[jira] [Resolved] (SPARK-2440) Enable HistoryServer to display lots of Application History
[ https://issues.apache.org/jira/browse/SPARK-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-2440. --- Resolution: Fixed I'll mark this as fixed since the current history server doesn't have that limitation anymore. > Enable HistoryServer to display lots of Application History > --- > > Key: SPARK-2440 > URL: https://issues.apache.org/jira/browse/SPARK-2440 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.0.0 >Reporter: Kousuke Saruta > > In the current implementation of HistoryServer, it displays 250 records by default. > Sometimes we'd like to see more than 250 records and configure it to list more, but the current implementation lists all the records on a single page, which is not useful. > And to make matters worse, the initial launch of HistoryServer is very slow. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304577#comment-14304577 ] Corey J. Nolet commented on SPARK-5260: --- I'm thinking all the schema-specific functions should be pulled out into an object called JsonSchemaFunctions. allKeysWithValueTypes() and createSchema() functions should be exposed via the public API and commented well based on their use. For the project I have that's using these functions, I am actually using the allKeysWithValueTypes() over my entire RDD as it's being saved to a sequence file and I'm using an Accumulator[Set[(String, DataType)]] that is aggregating all the schema elements for the RDD into a final Set where I can then store off the schema and later call "CreateSchema()" to get the final StructType that can be used with the sql table. I had to write a isConflicted(Set[(String, DataType)]]) function as well to determine if it's possible that a JSON object or JSON array was also encountered as a primitive type in one of the records in the RDD or vice versa. > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet >Assignee: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
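A hedged sketch of the workflow described above: collect (key, type) pairs per JSON record, merge them into one set (the role the Accumulator plays), and detect keys observed under conflicting types. Function names here are illustrative, not Spark's JsonRDD API:

```python
import json

def keys_with_value_types(record):
    # one record's contribution: the set of (key, type-name) pairs
    return {(k, type(v).__name__) for k, v in json.loads(record).items()}

def is_conflicted(schema):
    # a key seen under two different types appears twice in the set
    keys = [k for k, _ in schema]
    return len(keys) != len(set(keys))

records = ['{"a": 1, "b": "x"}', '{"a": 2}', '{"b": ["y"]}']
merged = set()
for rec in records:              # the loop stands in for an Accumulator merge
    merged |= keys_with_value_types(rec)

conflicted = is_conflicted(merged)  # "b" was seen as both str and list
```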
[jira] [Closed] (SPARK-5580) Grep bug in compute-classpath.sh
[ https://issues.apache.org/jira/browse/SPARK-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yadong Qi closed SPARK-5580. Resolution: Fixed > Grep bug in compute-classpath.sh > > > Key: SPARK-5580 > URL: https://issues.apache.org/jira/browse/SPARK-5580 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: Yadong Qi > Fix For: 1.3.0 > > > When I test Spark, I often need to swap the assembly jar to test different versions, so I move spark-assembly.*hadoop.*.jar to spark-assembly.*hadoop.*.jar.bak. > But then I get the error "Found multiple Spark assembly jars in $assembly_folder:". The check only needs to match the jar itself, so the grep expression needs to begin with "^" and end with "$".
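The anchoring fix described in the ticket can be reproduced in isolation. Below is a minimal sketch; the file names are made up, and the pattern is a simplification of the one used in compute-classpath.sh:

```shell
# Create a scratch directory with a real assembly jar and a renamed backup.
cd "$(mktemp -d)"
touch spark-assembly-1.2.0-hadoop2.4.0.jar
touch spark-assembly-1.1.0-hadoop2.4.0.jar.bak

# Unanchored pattern: the .bak file still contains ".jar", so both files
# match, and the script would report "Found multiple Spark assembly jars".
ls | grep "spark-assembly.*hadoop.*\.jar"

# Anchored pattern: "$" rejects the trailing ".bak", so only the real jar matches.
ls | grep "^spark-assembly.*hadoop.*\.jar$"
```

With the anchors, a renamed .jar.bak backup no longer matches the expression, so the "Found multiple Spark assembly jars" error disappears.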
[jira] [Comment Edited] (SPARK-5529) Executor is still hold while BlockManager has been removed
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304509#comment-14304509 ] Lianhui Wang edited comment on SPARK-5529 at 2/4/15 2:40 AM: - The phenomenon is: the blockManagerSlave times out and BlockManagerMasterActor removes its blockManager, but the executor holding that blockManager does not time out because Akka's heartbeat is still normal. Because the blockManager lives inside the executor, removing the blockManager should remove the executor too. In particular, when dynamicAllocation is enabled, the allocationManager listens for onBlockManagerRemoved and removes the executor, yet CoarseGrainedSchedulerBackend still has it in executorDataMap. [~rxin] [~andrewor14] [~sandyr] When BlockManagerMasterActor removes a blockManager due to a timeout, we need to check whether the executor on that blockManager has been removed; if it has not, we should first remove the executor. How about solving the problem this way? was (Author: lianhuiwang): the phenomenon is: blockManagerSlave is timeout and BlockManagerMasterActor will remove this blockManager, but executor on this blockManager is not timeout because akka's heartbeat is normal.Because blockManager is in executor, if blockManager is removed, executor on this blockManager should be removed too. Especially when dynamicAllocation is enabled, allocationManager listen onBlockManagerRemoved and remove this executor. but actually in CoarseGrainedSchedulerBackend it is still in executorDataMap. [~andrewor14] [~sandyr] when BlockManagerMasterActor remove blockmanager due to timeout of BlockManager, we need to check whether executor on this blockmanager has been removed. if its executor has not been removed, we should firstly remove this executor. how about this way to solve this problem?
> Executor is still hold while BlockManager has been removed > -- > > Key: SPARK-5529 > URL: https://issues.apache.org/jira/browse/SPARK-5529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Hong Shen > > When I run a Spark job, one executor hangs; after 120s its blockManager is removed by the driver, but it takes another half hour before the executor itself is removed by the driver. Here is the log: > 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 120000ms > > 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated > 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. > 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 > 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) > 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 > 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) > 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. > 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
[jira] [Created] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
Sandy Ryza created SPARK-5581: - Summary: When writing sorted map output file, avoid open / close between each partition Key: SPARK-5581 URL: https://issues.apache.org/jira/browse/SPARK-5581 Project: Spark Issue Type: Improvement Affects Versions: 1.3.0 Reporter: Sandy Ryza
{code}
// Bypassing merge-sort; get an iterator by partition and just write everything directly.
for ((id, elements) <- this.partitionedIterator) {
  if (elements.hasNext) {
    val writer = blockManager.getDiskWriter(
      blockId, outputFile, ser, fileBufferSize, context.taskMetrics.shuffleWriteMetrics.get)
    for (elem <- elements) {
      writer.write(elem)
    }
    writer.commitAndClose()
    val segment = writer.fileSegment()
    lengths(id) = segment.length
  }
}
{code}
[jira] [Created] (SPARK-5580) Grep bug in compute-classpath.sh
Yadong Qi created SPARK-5580: Summary: Grep bug in compute-classpath.sh Key: SPARK-5580 URL: https://issues.apache.org/jira/browse/SPARK-5580 Project: Spark Issue Type: Bug Reporter: Yadong Qi When I test Spark, I often need to swap the assembly jar to test different versions, so I move spark-assembly.*hadoop.*.jar to spark-assembly.*hadoop.*.jar.bak. But then I get the error "Found multiple Spark assembly jars in $assembly_folder:". The check only needs to match the jar itself, so the grep expression needs to begin with "^" and end with "$".
[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304510#comment-14304510 ] Corey J. Nolet commented on SPARK-5140: --- I think the problem is that when actions are performed on RDDs in multiple threads, the SparkContext on the driver that's scheduling the DAG should be able to see that the two RDDs depend on the same parents and synchronize them so that only one will run at a time, whether the parent is being cached or not (you'd assume the parent would be getting cached, but I think this change would still help cases where it isn't). The fact that I did: val rdd1 = input data -> transform data -> groupBy -> etc... -> cache val rdd2 = future { rdd1.transform.groupBy.saveAsSequenceFile() } val rdd3 = future { rdd1.transform.groupBy.saveAsSequenceFile() } has unexpected results: rdd1 was assigned an id and run completely separately for rdd2 and rdd3. I would have expected, whether cached or not, that when run in separate threads, rdd1 would be assigned an id, then rdd2 would cause it to begin running through its stages, and rdd3 would pause because it is waiting on rdd1's id to complete its stages. What I see instead is that, after rdd2 and rdd3 both run concurrently calculating rdd1, the storage for rdd1 is 200% cached. This causes issues when I have 50 or so RDDs calling saveAsSequenceFile() that all have different shared dependencies on parent RDDs (which may not always be known at creation time without introspecting them in my own tree). Now I basically have to do the scheduling myself: determine what depends on what and run things concurrently on my own. It seems like the DAG scheduler should already know this and be able to make use of it.
> Two RDDs which are scheduled concurrently should be able to wait on parent in > all cases > --- > > Key: SPARK-5140 > URL: https://issues.apache.org/jira/browse/SPARK-5140 > Project: Spark > Issue Type: New Feature >Reporter: Corey J. Nolet > Labels: features > > Not sure if this would change too much of the internals to be included in the > 1.2.1 but it would be very helpful if it could be. > This ticket is from a discussion between myself and [~ilikerps]. Here's the > result of some testing that [~ilikerps] did: > bq. I did some testing as well, and it turns out the "wait for other guy to > finish caching" logic is on a per-task basis, and it only works on tasks that > happen to be executing on the same machine. > bq. Once a partition is cached, we will schedule tasks that touch that > partition on that executor. The problem here, though, is that the cache is in > progress, and so the tasks are still scheduled randomly (or with whatever > locality the data source has), so tasks which end up on different machines > will not see that the cache is already in progress. > {code} > Here was my test, by the way: > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent._ > import scala.concurrent.duration._ > val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i > }).cache() > val futures = (0 until 4).map { _ => Future { rdd.count } } > Await.result(Future.sequence(futures), 120.second) > {code} > bq. Note that I run the future 4 times in parallel. I found that the first > run has all tasks take 10 seconds. The second has about 50% of its tasks take > 10 seconds, and the rest just wait for the first stage to finish. The last > two runs have no tasks that take 10 seconds; all wait for the first two > stages to finish. 
> What we want is the ability to fire off a job and have the DAG figure out > that two RDDs depend on the same parent so that when the children are > scheduled concurrently, the first one to start will activate the parent and > both will wait on the parent. When the parent is done, they will both be able > to finish their work concurrently. We are trying to use this pattern by > having the parent cache results.
[jira] [Closed] (SPARK-5526) expression [date '2011-01-01' = cast(timestamp('2011-01-01 23:24:25') as date)] return false
[ https://issues.apache.org/jira/browse/SPARK-5526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xukun closed SPARK-5526. Resolution: Fixed The issue is fixed by #4325. > expression [date '2011-01-01' = cast(timestamp('2011-01-01 23:24:25') as > date)] return false > > > Key: SPARK-5526 > URL: https://issues.apache.org/jira/browse/SPARK-5526 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: xukun > > Preparation for the test case: > create table date_1(d date); > insert overwrite table date_1 select cast('2011-01-01' as date) from src tablesample (1 rows); > In Hive, executing {select date '2011-01-01' = cast(timestamp('2011-01-01 23:24:25') as date) from date_1 limit 1;} returns true, but in Spark SQL it returns false.
[jira] [Commented] (SPARK-5577) Create a convenient way for Python users to register SQL UDFs
[ https://issues.apache.org/jira/browse/SPARK-5577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304494#comment-14304494 ] Apache Spark commented on SPARK-5577: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4351 > Create a convenient way for Python users to register SQL UDFs > - > > Key: SPARK-5577 > URL: https://issues.apache.org/jira/browse/SPARK-5577 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu >
[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration
[ https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304446#comment-14304446 ] Apache Spark commented on SPARK-2945: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/4350 > Allow specifying num of executors in the context configuration > -- > > Key: SPARK-2945 > URL: https://issues.apache.org/jira/browse/SPARK-2945 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.0.0 > Environment: Ubuntu precise, on YARN (CDH 5.1.0) >Reporter: Shay Rojansky > > Running on YARN, the only way to specify the number of executors seems to be > on the command line of spark-submit, via the --num-executors switch. > In many cases this is too early. Our Spark app receives some cmdline > arguments which determine the amount of work that needs to be done - and that > affects the number of executors it ideally requires. Ideally, the Spark > context configuration would support specifying this like any other config > param. > Our current workaround is a wrapper script that determines how much work is > needed, and which itself launches spark-submit with the number passed to > --num-executors - it's a shame to have to do this.
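The wrapper-script workaround described above could be sketched roughly as follows. The sizing rule, input layout, class name, and jar name are all hypothetical, and the final command is echoed rather than executed so the sketch runs without a Spark installation:

```shell
#!/bin/sh
# Hypothetical sizing rule: one executor per 10 input files, minimum 1.
INPUT_DIR="${1:-.}"
NUM_FILES=$(ls "$INPUT_DIR" 2>/dev/null | wc -l)
NUM_EXECUTORS=$(( NUM_FILES / 10 + 1 ))

# The real wrapper would invoke spark-submit directly; the command is echoed
# here so the sketch stays runnable anywhere.
echo spark-submit --num-executors "$NUM_EXECUTORS" --class com.example.App app.jar
```

Deriving the executor count from the input size keeps cluster usage proportional to the work, which is exactly what the reporter would like the SparkContext configuration to support directly.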
[jira] [Commented] (SPARK-5529) Executor is still hold while BlockManager has been removed
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304433#comment-14304433 ] Hong Shen commented on SPARK-5529: -- The executor is lost when Akka throws a DisassociatedEvent. > Executor is still hold while BlockManager has been removed > -- > > Key: SPARK-5529 > URL: https://issues.apache.org/jira/browse/SPARK-5529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Hong Shen > > When I run a Spark job, one executor hangs; after 120s its blockManager is > removed by the driver, but it takes another half an hour before the executor is removed by the > driver. Here is the log: > 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms > exceeds 12ms > > 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on > 10.215.143.14: remote Akka client disassociated > 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote > system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is > now gated for [5000] ms. Reason is: [Disassociated]. > 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet > 0.0 > 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, > 10.215.143.14): ExecutorLostFailure (executor 1 lost) > 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove > non-existent executor 1 > 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) > 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 > from BlockManagerMaster. > 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5520) Make FP-Growth implementation take generic item types
[ https://issues.apache.org/jira/browse/SPARK-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5520. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4340 [https://github.com/apache/spark/pull/4340] > Make FP-Growth implementation take generic item types > - > > Key: SPARK-5520 > URL: https://issues.apache.org/jira/browse/SPARK-5520 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Jacky Li >Priority: Critical > Fix For: 1.3.0 > > > There is no technical restriction on the item types in the FP-Growth > implementation. We used String in the first PR for simplicity. Maybe we could > make the type generic before 1.3 (and specialize it for Int/Long). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
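The genericity requested in SPARK-5520 is essentially "any hashable/comparable item type" rather than String only. A simplified Python sketch of the first counting pass of FP-Growth, written to be generic over the item type (illustrative only, not the MLlib implementation):

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence

def frequent_items(transactions: Sequence[Iterable[Hashable]], min_support: float):
    """First pass of FP-Growth: count distinct items per transaction and keep
    those meeting min support. Works for any hashable item type (str, int, tuple, ...)."""
    counts = Counter(item for t in transactions for item in set(t))
    threshold = min_support * len(transactions)
    return {item: c for item, c in counts.items() if c >= threshold}
```

The same function handles `["a", "b"]` and `[1, 2]` transactions unchanged, which is the payoff of making the item type a parameter instead of hard-coding String.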
[jira] [Commented] (SPARK-5153) flaky test of "Reliable Kafka input stream with multiple topics"
[ https://issues.apache.org/jira/browse/SPARK-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304423#comment-14304423 ] Saisai Shao commented on SPARK-5153: Hi TD, thanks a lot for your PR. Currently I have no better solution than increasing the timeout threshold; as I recall, the Kafka unit tests deal with it the same way. I will check again to see whether we can solve this more elegantly. > flaky test of "Reliable Kafka input stream with multiple topics" > > > Key: SPARK-5153 > URL: https://issues.apache.org/jira/browse/SPARK-5153 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0 >Reporter: Nan Zhu > Labels: flaky-test > Fix For: 1.3.0, 1.2.2 > > > I have seen several unrelated PRs fail on this test > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25254/consoleFull > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25248/ > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25251/console -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5552) Automated data science AMI creation and data science cluster deployment on EC2
[ https://issues.apache.org/jira/browse/SPARK-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304412#comment-14304412 ] Florian Verhein commented on SPARK-5552: Thanks [~sowen]. So it wouldn't fit in the spark repo itself (the only change there would be to add an option in spark_ec2.py to use an alternate spark-ec2 repo/branch). It would naturally live in spark-ec2, as it involves changes to spark-ec2 for both use cases - Image creation is based on the work soon to be added to spark-ec2 for this: https://issues.apache.org/jira/browse/SPARK-3821 - Cluster deployment+configuration is done using the spark-ec2 scripts themselves (but with many modifications/fixes). Since there is a dependency between the image and the configuration (init.sh and setup.sh) scripts, it's not possible to solve this with just an AMI. The extra components (actually, just vowpal wabbit and more python libraries - the rest already exists in spark-ec2 AMI) are just added to the image for data science convenience. > Automated data science AMI creation and data science cluster deployment on EC2 > -- > > Key: SPARK-5552 > URL: https://issues.apache.org/jira/browse/SPARK-5552 > Project: Spark > Issue Type: New Feature > Components: EC2 >Reporter: Florian Verhein > > Issue created RE: > https://github.com/mesos/spark-ec2/pull/90#issuecomment-72597154 (please read > for background) > Goal: > Extend spark-ec2 scripts to create an automated data science cluster > deployment on EC2, suitable for almost(?)-production use. > Use cases: > - A user can build their own custom data science AMIs from a CentOS minimal > image by calling a packer configuration (good defaults should be provided, > some options for flexibility) > - A user can then easily deploy a new (correctly configured) cluster using > these AMIs, and do so as quickly as possible. > Components/modules: Spark + tachyon + hdfs (on instance storage) + python + R > + vowpal wabbit + any rpms + ... 
+ ganglia > Focus is on reliability (rather than e.g. supporting many versions / dev > testing) and speed of deployment. > Use hadoop 2 so option to lift into yarn later. > My current solution is here: > https://github.com/florianverhein/spark-ec2/tree/packer. It includes other > fixes/improvements as needed to get it working. > Now that it seems to work (but has deviated a lot more from the existing code > base than I was expecting), I'm wondering what to do with it... > Keen to hear ideas if anyone is interested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5574) Utils.createDirectory ignores namePrefix
[ https://issues.apache.org/jira/browse/SPARK-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5574: --- Assignee: Imran Rashid > Utils.createDirectory ignores namePrefix > > > Key: SPARK-5574 > URL: https://issues.apache.org/jira/browse/SPARK-5574 > Project: Spark > Issue Type: Bug >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Trivial > > this is really minor, I just noticed it as I was trying to find the > "blockmgr" dir during some debugging, and then realized that the > {{namePrefix}} is ignored in {{Utils.createDirectory}}. Also via > {{Utils.createTempDir}} this effects these dirs: > * httpd > * userFiles > * broadcast > I'll submit a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304383#comment-14304383 ] Anand Mohan Tumuluri commented on SPARK-5472: - Thanks again [~rxin] and [~tmyklebu]. My bad, I only checked the table creation scripts in before and made assumptions. This would very well satisfy our use case. The custom partitioning conditions would remove the need to use SQL conditionals as well. One more question, how do I 'get' new data that got inserted in the source table(s)? Would 'refresh table' work for this? > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
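The "custom partitioning conditions" mentioned in the SPARK-5472 comment come down to splitting a numeric column range into per-partition WHERE clauses, so each Spark partition issues its own bounded query. A sketch of that idea (illustrative helper, not the actual Spark API):

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Split the half-open range [lower, upper) on a numeric column into one
    WHERE clause per partition, mirroring how a JDBC source parallelizes reads."""
    width = (upper - lower) // num_partitions or 1
    preds = []
    for i in range(num_partitions):
        lo = lower + i * width
        if i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")  # last partition takes the remainder
        else:
            preds.append(f"{column} >= {lo} AND {column} < {lo + width}")
    return preds
```

As the comment notes, arbitrary predicate lists also let users partition on non-numeric criteria, removing the need for SQL conditionals in the query itself.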
[jira] [Commented] (SPARK-5579) Provide support for project using SQL expression
[ https://issues.apache.org/jira/browse/SPARK-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304378#comment-14304378 ] Apache Spark commented on SPARK-5579: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4348 > Provide support for project using SQL expression > > > Key: SPARK-5579 > URL: https://issues.apache.org/jira/browse/SPARK-5579 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > Would be nice to allow something like > df.selectExpr("abs(colA)", "colB") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5579) Provide support for project using SQL expression
Reynold Xin created SPARK-5579: -- Summary: Provide support for project using SQL expression Key: SPARK-5579 URL: https://issues.apache.org/jira/browse/SPARK-5579 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Assignee: Reynold Xin Would be nice to allow something like df.selectExpr("abs(colA)", "colB") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5460) RandomForest should catch exceptions when removing checkpoint files
[ https://issues.apache.org/jira/browse/SPARK-5460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304352#comment-14304352 ] Apache Spark commented on SPARK-5460: - User 'x1-' has created a pull request for this issue: https://github.com/apache/spark/pull/4347 > RandomForest should catch exceptions when removing checkpoint files > --- > > Key: SPARK-5460 > URL: https://issues.apache.org/jira/browse/SPARK-5460 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > RandomForest can optionally use checkpointing. When it tries to remove > checkpoint files, it could fail (if a user has write but not delete access on > some filesystem). There should be a try-catch to catch exceptions when > trying to remove checkpoint files in NodeIdCache. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5578) Provide a convenient way for Scala users to use UDFs
[ https://issues.apache.org/jira/browse/SPARK-5578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304349#comment-14304349 ] Apache Spark commented on SPARK-5578: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4345 > Provide a convenient way for Scala users to use UDFs > > > Key: SPARK-5578 > URL: https://issues.apache.org/jira/browse/SPARK-5578 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Blocker > > Dsl.udf(...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5578) Provide a convenient way for Scala users to use UDFs
Reynold Xin created SPARK-5578: -- Summary: Provide a convenient way for Scala users to use UDFs Key: SPARK-5578 URL: https://issues.apache.org/jira/browse/SPARK-5578 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Blocker Dsl.udf(...). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5577) Create a convenient way for Python users to register SQL UDFs
Reynold Xin created SPARK-5577: -- Summary: Create a convenient way for Python users to register SQL UDFs Key: SPARK-5577 URL: https://issues.apache.org/jira/browse/SPARK-5577 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5554) Add more tests and docs for DataFrame Python API
[ https://issues.apache.org/jira/browse/SPARK-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5554: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-5166 > Add more tests and docs for DataFrame Python API > > > Key: SPARK-5554 > URL: https://issues.apache.org/jira/browse/SPARK-5554 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.3.0 > > > more tests for DataFrame Python API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5554) Add more tests and docs for DataFrame Python API
[ https://issues.apache.org/jira/browse/SPARK-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5554. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Davies Liu > Add more tests and docs for DataFrame Python API > > > Key: SPARK-5554 > URL: https://issues.apache.org/jira/browse/SPARK-5554 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.3.0 > > > more tests for DataFrame Python API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5280) Import RDF graphs into GraphX
[ https://issues.apache.org/jira/browse/SPARK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304070#comment-14304070 ] lukovnikov edited comment on SPARK-5280 at 2/3/15 11:28 PM: started working on it: https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala and https://github.com/lukovnikov/spark/blob/rdfloaderhash/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala The second one computes hashes for VertexIds instead of building a whole dictionary of the whole RDF input and broadcasting it as the first one does. Will test soon, write comments and make a pull request. was (Author: lukovnikov): started working on it: https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala and https://github.com/lukovnikov/spark/blob/rdfloaderhash/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala The second one computes hashes for VertexIds instead of building a whole dictionary of the whole RDF input and broadcasting it as the first one does. > Import RDF graphs into GraphX > - > > Key: SPARK-5280 > URL: https://issues.apache.org/jira/browse/SPARK-5280 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: lukovnikov > > RDF (Resource Description Framework) models knowledge in a graph and is > heavily used on the Semantic Web and beyond. > GraphX should include a way to import RDF data easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5280) Import RDF graphs into GraphX
[ https://issues.apache.org/jira/browse/SPARK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304070#comment-14304070 ] lukovnikov edited comment on SPARK-5280 at 2/3/15 11:27 PM: started working on it: https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala and https://github.com/lukovnikov/spark/blob/rdfloaderhash/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala The second one computes hashes for VertexIds instead of building a whole dictionary of the whole RDF input and broadcasting it as the first one does. was (Author: lukovnikov): started working on it: https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala > Import RDF graphs into GraphX > - > > Key: SPARK-5280 > URL: https://issues.apache.org/jira/browse/SPARK-5280 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: lukovnikov > > RDF (Resource Description Framework) models knowledge in a graph and is > heavily used on the Semantic Web and beyond. > GraphX should include a way to import RDF data easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304298#comment-14304298 ] Joseph K. Bradley commented on SPARK-5021: -- That BLAS implementation is actually part of MLlib (see the imports). You may need to generalize it to work with SparseVector, but it should belong in mllib.linalg.BLAS. > GaussianMixtureEM should be faster for SparseVector input > - > > Key: SPARK-5021 > URL: https://issues.apache.org/jira/browse/SPARK-5021 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > > GaussianMixtureEM currently converts everything to dense vectors. It would > be nice if it were faster for SparseVectors (running in time linear in the > number of non-zero values). > However, this may not be too important since clustering should rarely be done > in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
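Running "in time linear in the number of non-zero values", as SPARK-5021 asks, typically reduces to merge-style sparse kernels like the dot product below. This is a generic Python sketch of the technique, not MLlib's BLAS code; it assumes indices are sorted ascending, as in MLlib's SparseVector:

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors in O(nnz_a + nnz_b) by walking both
    sorted index arrays in lockstep, touching only stored entries."""
    i = j = 0
    total = 0.0
    while i < len(indices_a) and j < len(indices_b):
        if indices_a[i] == indices_b[j]:
            total += values_a[i] * values_b[j]
            i += 1
            j += 1
        elif indices_a[i] < indices_b[j]:
            i += 1
        else:
            j += 1
    return total
```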
[jira] [Resolved] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4986. -- Resolution: Fixed Fix Version/s: 1.2.2 > Graceful shutdown for Spark Streaming does not work in Standalone cluster mode > -- > > Key: SPARK-4986 > URL: https://issues.apache.org/jira/browse/SPARK-4986 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0 >Reporter: Jesper Lundgren >Priority: Blocker > Fix For: 1.3.0, 1.2.2 > > > When using the graceful stop API of Spark Streaming in Spark Standalone > cluster the stop signal never reaches the receivers. I have tested this with > Spark 1.2 and Kafka receivers. > ReceiverTracker will send StopReceiver message to ReceiverSupervisorImpl. > In local mode ReceiverSupervisorImpl receives this message but in Standalone > cluster mode the message seems to be lost. > (I have modified the code to send my own string message as a stop signal from > ReceiverTracker to ReceiverSupervisorImpl and it works as a workaround.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5557) spark-shell failed to start
[ https://issues.apache.org/jira/browse/SPARK-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304252#comment-14304252 ] Ben Mabey commented on SPARK-5557: -- I ran a git bisect and this is the first bad commit: https://github.com/apache/spark/commit/7930d2bef0e2c7f62456e013124455061dfe6dc8 The commit adds a Jetty dep so that seems like the culprit. > spark-shell failed to start > --- > > Key: SPARK-5557 > URL: https://issues.apache.org/jira/browse/SPARK-5557 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Guoqiang Li >Priority: Blocker > > the log: > {noformat} > 5/02/03 19:06:39 INFO spark.HttpServer: Starting HTTP Server > Exception in thread "main" java.lang.NoClassDefFoundError: > javax/servlet/http/HttpServletResponse > at > org.apache.spark.HttpServer.org$apache$spark$HttpServer$$doStart(HttpServer.scala:75) > at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) > at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:62) > at > org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1774) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1765) > at org.apache.spark.HttpServer.start(HttpServer.scala:62) > at org.apache.spark.repl.SparkIMain.(SparkIMain.scala:130) > at > org.apache.spark.repl.SparkILoop$SparkILoopInterpreter.(SparkILoop.scala:185) > at > org.apache.spark.repl.SparkILoop.createInterpreter(SparkILoop.scala:214) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:946) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:942) > at > 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:942) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1039) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:403) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:77) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > javax.servlet.http.HttpServletResponse > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 25 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5576) saveAsTable into Hive fails due to duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304229#comment-14304229 ] Apache Spark commented on SPARK-5576: - User 'danosipov' has created a pull request for this issue: https://github.com/apache/spark/pull/4346 > saveAsTable into Hive fails due to duplicate columns > > > Key: SPARK-5576 > URL: https://issues.apache.org/jira/browse/SPARK-5576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: Dan Osipov > > Loading JSON files infers case sensitive schema, which results in an error if > attempting to save to Hive. > {code} > import org.apache.spark.sql._ > import org.apache.spark.sql.hive._ > val hive = new HiveContext(sc) > val data = hive.jsonFile("/path/") > data.saveAsTable("table") > {code} > Results in an error: > org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name > data-errorcode in the table definition. > Outputting the schema shows the problem field: > |-- data-errorCode: string (nullable = true) > |-- data-errorcode: string (nullable = true) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5576) saveAsTable into Hive fails due to duplicate columns
Dan Osipov created SPARK-5576: - Summary: saveAsTable into Hive fails due to duplicate columns Key: SPARK-5576 URL: https://issues.apache.org/jira/browse/SPARK-5576 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Dan Osipov Loading JSON files infers case sensitive schema, which results in an error if attempting to save to Hive. {code} import org.apache.spark.sql._ import org.apache.spark.sql.hive._ val hive = new HiveContext(sc) val data = hive.jsonFile("/path/") data.saveAsTable("table") {code} Results in an error: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: Duplicate column name data-errorcode in the table definition. Outputting the schema shows the problem field: |-- data-errorCode: string (nullable = true) |-- data-errorcode: string (nullable = true) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
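Hive treats column names case-insensitively, so the JSON-inferred schema in SPARK-5576 (`data-errorCode` vs. `data-errorcode`) clashes only at save time. A quick way to detect such collisions up front; this helper is a sketch for illustration and not part of Spark:

```python
from collections import defaultdict

def duplicate_columns_ignoring_case(field_names):
    """Group schema field names by their lower-cased form and report clashes,
    mimicking Hive's case-insensitive column-name rule."""
    groups = defaultdict(list)
    for name in field_names:
        groups[name.lower()].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Running this over the inferred schema's field names before `saveAsTable` would surface the duplicate pair instead of a HiveException mid-write.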
[jira] [Commented] (SPARK-5574) Utils.createDirectory ignores namePrefix
[ https://issues.apache.org/jira/browse/SPARK-5574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304190#comment-14304190 ] Apache Spark commented on SPARK-5574: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/4344 > Utils.createDirectory ignores namePrefix > > > Key: SPARK-5574 > URL: https://issues.apache.org/jira/browse/SPARK-5574 > Project: Spark > Issue Type: Bug >Reporter: Imran Rashid >Priority: Trivial > > this is really minor, I just noticed it as I was trying to find the > "blockmgr" dir during some debugging, and then realized that the > {{namePrefix}} is ignored in {{Utils.createDirectory}}. Also via > {{Utils.createTempDir}} this effects these dirs: > * httpd > * userFiles > * broadcast > I'll submit a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5575) Artificial neural networks for MLlib deep learning
Alexander Ulanov created SPARK-5575: --- Summary: Artificial neural networks for MLlib deep learning Key: SPARK-5575 URL: https://issues.apache.org/jira/browse/SPARK-5575 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Goal: Implement various types of artificial neural networks Motivation: deep learning trend Requirements: 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused 2) Implement complex abstractions, such as feed forward and recurrent networks 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), autoencoder (sparse and denoising), stacked autoencoder, restricted boltzmann machines (RBM), deep belief networks (DBN) etc. 4) Implement or reuse supporting constructs, such as classifiers, normalizers, poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5574) Utils.createDirectory ignores namePrefix
Imran Rashid created SPARK-5574: --- Summary: Utils.createDirectory ignores namePrefix Key: SPARK-5574 URL: https://issues.apache.org/jira/browse/SPARK-5574 Project: Spark Issue Type: Bug Reporter: Imran Rashid Priority: Trivial this is really minor, I just noticed it as I was trying to find the "blockmgr" dir during some debugging, and then realized that the {{namePrefix}} is ignored in {{Utils.createDirectory}}. Also via {{Utils.createTempDir}} this effects these dirs: * httpd * userFiles * broadcast I'll submit a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
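The fix for SPARK-5574 is to make the generated directory name actually incorporate the prefix. A Python sketch of the intended behavior (Spark's real implementation is Scala in Utils.scala; the retry count here is an arbitrary illustrative choice):

```python
import os
import tempfile
import uuid

def create_directory(root, name_prefix="spark"):
    """Create a unique directory under root whose name uses name_prefix,
    i.e. the behavior Utils.createDirectory was meant to have."""
    for _ in range(10):  # retry on the unlikely name collision
        path = os.path.join(root, f"{name_prefix}-{uuid.uuid4()}")
        try:
            os.makedirs(path)
            return path
        except FileExistsError:
            continue
    raise IOError("failed to create a directory under " + root)
```

With this shape, a "blockmgr" prefix yields directories like `blockmgr-<uuid>`, which is exactly what makes them findable during debugging.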
[jira] [Updated] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5140: --- Fix Version/s: (was: 1.2.1) (was: 1.3.0) > Two RDDs which are scheduled concurrently should be able to wait on parent in > all cases > --- > > Key: SPARK-5140 > URL: https://issues.apache.org/jira/browse/SPARK-5140 > Project: Spark > Issue Type: New Feature >Reporter: Corey J. Nolet > Labels: features > > Not sure if this would change too much of the internals to be included in the > 1.2.1 but it would be very helpful if it could be. > This ticket is from a discussion between myself and [~ilikerps]. Here's the > result of some testing that [~ilikerps] did: > bq. I did some testing as well, and it turns out the "wait for other guy to > finish caching" logic is on a per-task basis, and it only works on tasks that > happen to be executing on the same machine. > bq. Once a partition is cached, we will schedule tasks that touch that > partition on that executor. The problem here, though, is that the cache is in > progress, and so the tasks are still scheduled randomly (or with whatever > locality the data source has), so tasks which end up on different machines > will not see that the cache is already in progress. > {code} > Here was my test, by the way: > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent._ > import scala.concurrent.duration._ > val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(1); i > }).cache() > val futures = (0 until 4).map { _ => Future { rdd.count } } > Await.result(Future.sequence(futures), 120.second) > {code} > bq. Note that I run the future 4 times in parallel. I found that the first > run has all tasks take 10 seconds. The second has about 50% of its tasks take > 10 seconds, and the rest just wait for the first stage to finish. The last > two runs have no tasks that take 10 seconds; all wait for the first two > stages to finish. 
> What we want is the ability to fire off a job and have the DAG scheduler figure out that two RDDs depend on the same parent, so that when the children are scheduled concurrently, the first one to start will activate the parent and both will wait on it. When the parent is done, they will both be able to finish their work concurrently. We are trying to use this pattern by having the parent cache its results.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5140) Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
[ https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304114#comment-14304114 ] Sean Owen commented on SPARK-5140:

Is this not substantially answered by just materializing the cached RDD right after you cache it? Then anything that happens afterwards already sees a cached RDD. Is the request basically to automatically persist and unpersist RDDs to implement this? I suppose the issue is simply that this is hard to figure out. Even if you can figure out that two RDDs can be computed in parallel, and need to be, and depend on one parent, it's not obvious that you can just persist the RDD automatically. I guess the question is: what specifically would this change look like?

> Two RDDs which are scheduled concurrently should be able to wait on parent in all cases
>
> Key: SPARK-5140
> URL: https://issues.apache.org/jira/browse/SPARK-5140
> Project: Spark
> Issue Type: New Feature
> Reporter: Corey J. Nolet
> Labels: features
> Fix For: 1.3.0, 1.2.1
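Sean Owen's workaround above, materialize the cached RDD before firing the concurrent jobs, amounts to funneling all children through one memoized parent computation. The behaviour the ticket asks Spark's scheduler to provide across executors can be sketched in plain Scala (no Spark required): a `lazy val` parent is computed exactly once, the first accessor runs it, and concurrent accessors block until it is ready. `SharedParentDemo` and all names here are illustrative, not Spark internals:

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object SharedParentDemo extends App {
  val parentRuns = new AtomicInteger(0)

  // The "parent": expensive work that should happen exactly once.
  // Scala's lazy val gives the desired coordination: the first accessor
  // computes it; concurrent accessors block until it is initialized.
  lazy val parent: Seq[Int] = {
    parentRuns.incrementAndGet()
    Thread.sleep(100) // stand-in for the slow stage in the test above
    (0 until 8).map(_ * 2)
  }

  // Four "children" fired concurrently, like the four rdd.count futures.
  val futures = (0 until 4).map(_ => Future(parent.sum))
  val results = Await.result(Future.sequence(futures), 10.seconds)

  assert(results.forall(_ == 56)) // 0 + 2 + ... + 14 = 56
  assert(parentRuns.get == 1)     // parent computed exactly once
}
```

The point of the feature request is that Spark performs this coordination per-task and per-machine, whereas a driver-side `lazy val` (or an explicit `rdd.count()` right after `cache()`) coordinates globally before any child job is submitted.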
[jira] [Created] (SPARK-5573) Support explode in DataFrame DSL
Reynold Xin created SPARK-5573:
Summary: Support explode in DataFrame DSL
Key: SPARK-5573
URL: https://issues.apache.org/jira/browse/SPARK-5573
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Assignee: Michael Armbrust
Priority: Blocker

The DSL is missing explode support. We should enable developers to explode a column, or explode multiple columns.
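For readers unfamiliar with `explode`: it maps one input row to zero or more output rows, i.e. a relational `flatMap` over a collection-valued column. A plain-Scala sketch of the semantics such a DSL method would expose (the `Row` case class and field names are made up for illustration, not the DataFrame API):

```scala
// One input row per (user, tags) pair; exploding tags yields one
// output row per (user, tag).
case class Row(user: String, tags: Seq[String])

val input = Seq(
  Row("alice", Seq("spark", "sql")),
  Row("bob",   Seq("graphx")),
  Row("carol", Seq())            // an empty collection explodes to zero rows
)

// explode(tags): flatMap each row into one row per element of the column.
val exploded: Seq[(String, String)] =
  input.flatMap(r => r.tags.map(tag => (r.user, tag)))

// exploded == Seq(("alice","spark"), ("alice","sql"), ("bob","graphx"))
```

Exploding multiple columns would apply the same expansion to each listed column in turn.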
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304099#comment-14304099 ] Manoj Kumar edited comment on SPARK-5021 at 2/3/15 10:01 PM:

Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing my own implementation which does not depend on NativeBlas for sparse data. Is that okay?

was (Author: mechcoder): Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing my own implementation which does not depend on NativeBlas. Is that okay?

> GaussianMixtureEM should be faster for SparseVector input
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Assignee: Manoj Kumar
>
> GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values).
> However, this may not be too important, since clustering should rarely be done in high dimensions.
[jira] [Comment Edited] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304099#comment-14304099 ] Manoj Kumar edited comment on SPARK-5021 at 2/3/15 10:02 PM:

Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing my own implementation which does not depend on NativeBlas for a SparseVector. Is that okay?

was (Author: mechcoder): Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing my own implementation which does not depend on NativeBlas for sparse data. Is that okay?

> GaussianMixtureEM should be faster for SparseVector input
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
[jira] [Commented] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304099#comment-14304099 ] Manoj Kumar commented on SPARK-5021:

Hi. I'm almost there. I have one last question. In this line, https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L223 I'm not sure how to do this, other than writing my own implementation which does not depend on NativeBlas. Is that okay?

> GaussianMixtureEM should be faster for SparseVector input
>
> Key: SPARK-5021
> URL: https://issues.apache.org/jira/browse/SPARK-5021
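For context on the line being discussed: it performs a symmetric rank-1 update (BLAS `syr`: A := alpha * x * xᵀ + A) on a covariance accumulator. A pure-JVM sketch of the sparse variant being proposed, iterating only over the non-zero entries instead of calling NativeBlas, might look like the following. The method name and array layout are illustrative assumptions, not MLlib's actual internals:

```scala
/** Symmetric rank-1 update A := alpha * x * x^T + A for a sparse x.
  * A is a dense n x n matrix in column-major order; x is given as
  * parallel (indices, values) arrays of its non-zero entries.
  * Cost is O(nnz^2) rather than the dense O(n^2).
  */
def sparseSyr(alpha: Double,
              indices: Array[Int], values: Array[Double],
              a: Array[Double], n: Int): Unit = {
  var i = 0
  while (i < indices.length) {
    val row = indices(i)
    val vi  = alpha * values(i)
    var j = 0
    while (j < indices.length) {
      // column-major: element (row, col) lives at col * n + row
      a(indices(j) * n + row) += vi * values(j)
      j += 1
    }
    i += 1
  }
}

// x = (0, 0, 3, 0): only index 2 is non-zero, so only A(2,2) changes.
val a = Array.fill(16)(0.0)                 // 4 x 4 zero matrix
sparseSyr(1.0, Array(2), Array(3.0), a, 4)
assert(a(2 * 4 + 2) == 9.0)                 // A(2,2) = 3 * 3
assert(a.sum == 9.0)                        // no other entry touched
```

Since the double loop visits every (i, j) pair of non-zeros, both triangles of the symmetric result are updated; a triangular-only variant would halve the work.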
[jira] [Resolved] (SPARK-5153) flaky test of "Reliable Kafka input stream with multiple topics"
[ https://issues.apache.org/jira/browse/SPARK-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-5153.
Resolution: Fixed
Fix Version/s: 1.2.2, 1.3.0

> flaky test of "Reliable Kafka input stream with multiple topics"
>
> Key: SPARK-5153
> URL: https://issues.apache.org/jira/browse/SPARK-5153
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.2.0
> Reporter: Nan Zhu
> Labels: flaky-test
> Fix For: 1.3.0, 1.2.2
>
> I have seen several unrelated PRs fail on this test:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25254/consoleFull
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25248/
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25251/console
[jira] [Commented] (SPARK-5280) Import RDF graphs into GraphX
[ https://issues.apache.org/jira/browse/SPARK-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304070#comment-14304070 ] lukovnikov commented on SPARK-5280:

Started working on it: https://github.com/lukovnikov/spark/blob/rdfloader/graphx/src/main/scala/org/apache/spark/graphx/loaders/RDFLoader.scala

> Import RDF graphs into GraphX
>
> Key: SPARK-5280
> URL: https://issues.apache.org/jira/browse/SPARK-5280
> Project: Spark
> Issue Type: New Feature
> Components: GraphX
> Reporter: lukovnikov
>
> RDF (Resource Description Framework) models knowledge in a graph and is heavily used on the Semantic Web and beyond.
> GraphX should include a way to import RDF data easily.
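As context for the loader linked above: the usual approach is to parse each N-Triples line into a (subject, predicate, object) triple and hash the node labels to GraphX vertex IDs. A Spark-free sketch of that per-line step follows; the regex and helper names are assumptions for illustration, not necessarily what RDFLoader.scala does:

```scala
// Minimal N-Triples line: <subject> <predicate> <object> .
val Triple = """<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s+\.""".r

/** Hash an RDF node label to a GraphX-style vertex id (Long). */
def vertexId(label: String): Long = label.hashCode.toLong

/** Parse one line into (srcId, dstId, predicate), the shape GraphX
  * edges need; comments, literals, and blank nodes are skipped here. */
def parseLine(line: String): Option[(Long, Long, String)] =
  line.trim match {
    case Triple(s, p, o) => Some((vertexId(s), vertexId(o), p))
    case _               => None
  }

val edge = parseLine(
  "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .")
assert(edge.exists(_._3 == "http://example.org/knows"))
assert(parseLine("# a comment").isEmpty)
```

In a real loader this function would be mapped over `sc.textFile(path)` and the resulting edge triplets fed to `Graph.fromEdges` or similar; handling literals and blank nodes needs a fuller parser than this regex.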
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304052#comment-14304052 ] Marcelo Vanzin commented on SPARK-5388:

Hi Patrick,

Most of my questions are about the protocol specification attached to this bug, so when I ask about something, I generally mean that the specification is vague on that point. If the implementation has made a choice there, that just means the implementation is effectively the specification, and everybody should ignore the document attached to this bug; we can then move the discussion to the PR itself.

bq. The intention for this is really just to take a single RPC that was using Akka and add a stable version of it that we are okay supporting long term.

That's fine, but I'd really like the spec to be very clear about what that means. For example, the very last sentence:

bq. n. This set of fields must remain compatible across Spark versions

See my previous comment, where I asked the same question: what does that mean? Does it mean that you can never add any fields to existing messages? You mention the code does some version negotiation, but the spec doesn't mention that, so maybe that negotiation is the answer to my question. Anyway, I'm just a little concerned that there's still some vagueness in the spec, for a protocol that is supposed to be stable from the get-go.

> Provide a stable application submission gateway in standalone cluster mode
>
> Key: SPARK-5388
> URL: https://issues.apache.org/jira/browse/SPARK-5388
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Andrew Or
> Assignee: Andrew Or
> Priority: Blocker
> Attachments: Stable Spark Standalone Submission.pdf
>
> The existing submission gateway in standalone mode is not compatible across Spark versions. If you have a newer version of Spark submitting to an older version of the standalone Master, it is currently not guaranteed to work. The goal is to provide a stable REST interface to replace this channel.
> The first cut implementation will target standalone cluster mode because there are very few messages exchanged. The design, however, should be general enough to potentially support this for other cluster managers too. Note that this is not necessarily required in YARN, because we already use YARN's stable interface to submit applications there.
[jira] [Commented] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304010#comment-14304010 ] Patrick Wendell commented on SPARK-5388:

The intention here is really just to take a single RPC that was using Akka and add a stable version of it that we are okay supporting long term. It doesn't preclude moving to Avro or some other RPC framework as a general thing we use across all of Spark; that design choice was intentionally excluded from this decision, given all the complexities you bring up. As for doing some basic message dispatching on our own: there is only a small amount of very straightforward code involved, and adopting Avro would be overkill for this.

In the current implementation the client and server exchange Spark versions, so that is the basis for reasoning about version changes; maybe it wasn't in the design doc. In terms of evolvability, the way you do this is that you only add new functionality over time, and you never remove fields from messages. This is similar to the API contract between the history logs and the history server. The idea is that newer clients implement a superset of the messages and fields of older ones.

Adding v1 seems like a good idea in case this evolves into something public or more well specified over time. It would just be good to define precisely what it means to advance that version identifier. That all matters a lot more if we want it to be something others interact with.

> Provide a stable application submission gateway in standalone cluster mode
>
> Key: SPARK-5388
> URL: https://issues.apache.org/jira/browse/SPARK-5388
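The compatibility rule described above, exchange versions up front and only ever add fields, never remove them, is essentially the "tolerant reader" pattern. A plain-Scala sketch of the receiving side; the field and message names are hypothetical, not the actual submission protocol:

```scala
// A message is just named fields; unknown fields must be ignored,
// known required fields must be present.
type Message = Map[String, String]

val requiredFields = Set("action", "sparkVersion", "appResource")

/** Accept any message carrying the required fields, silently ignoring
  * fields added by newer clients ("only add, never remove"). */
def validate(msg: Message): Either[String, Message] = {
  val missing = requiredFields.diff(msg.keySet)
  if (missing.isEmpty) Right(msg)
  else Left(s"missing fields: ${missing.mkString(", ")}")
}

// An older server still accepts a newer client's message that carries
// an extra field the server has never heard of:
val newerClientMsg: Message = Map(
  "action"        -> "SubmitApplication",
  "sparkVersion"  -> "1.4.0",
  "appResource"   -> "hdfs://apps/app.jar",
  "shinyNewField" -> "ignored by older servers"
)
assert(validate(newerClientMsg).isRight)
assert(validate(Map("action" -> "SubmitApplication")).isLeft)
```

Under this discipline, "advancing v1" would be reserved for changes that break exactly this property, e.g. removing or re-typing a required field.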
[jira] [Commented] (SPARK-5548) Flaky test: org.apache.spark.util.AkkaUtilsSuite.remote fetch ssl on - untrusted server
[ https://issues.apache.org/jira/browse/SPARK-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303992#comment-14303992 ] Apache Spark commented on SPARK-5548: - User 'jacek-lewandowski' has created a pull request for this issue: https://github.com/apache/spark/pull/4343 > Flaky test: org.apache.spark.util.AkkaUtilsSuite.remote fetch ssl on - > untrusted server > --- > > Key: SPARK-5548 > URL: https://issues.apache.org/jira/browse/SPARK-5548 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Patrick Wendell >Assignee: Jacek Lewandowski > Labels: flaky-test > > {code} > sbt.ForkMain$ForkError: Expected exception > java.util.concurrent.TimeoutException to be thrown, but > akka.actor.ActorNotFound was thrown. > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) > at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) > at > org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply$mcV$sp(AkkaUtilsSuite.scala:373) > at > org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349) > at > org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(AkkaUtilsSuite.scala:37) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > at org.apache.spark.util.AkkaUtilsSuite.runTest(AkkaUtilsSuite.scala:37) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterAll$$super$run(AkkaUtilsSuite.scala:37) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at org.apache.spark.util.AkkaUtilsSuite.run(AkkaUtilsSuite.scala:37) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) > at sbt.ForkMain$Run$2.call(ForkMain.java:294) > at sbt.ForkMain$Run$2.call(ForkMain.java:284) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.Thre
[jira] [Commented] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303982#comment-14303982 ] Yin Huai commented on SPARK-5420:

h3. End user APIs added to SQLContext (load related)
h4. Load data through a data source and create a DataFrame
{code}
// This method is used to load data through a file-based data source (e.g. Parquet).
// We will use the default data source; right now, it is Parquet.
def load(path: String): DataFrame

def load(
    dataSourceName: String,
    option: (String, String),
    options: (String, String)*): DataFrame

// This is for Java users.
def load(
    dataSourceName: String,
    options: java.util.Map[String, String]): DataFrame
{code}

h3. End user APIs added to HiveContext (load related)
h4. Create a metastore table for existing data
{code}
// This method is used to create a table from a file-based data source.
// We will use the default data source; right now, it is Parquet.
def createTable(tableName: String, path: String, allowExisting: Boolean): Unit

def createTable(
    tableName: String,
    dataSourceName: String,
    allowExisting: Boolean,
    option: (String, String),
    options: (String, String)*): Unit

def createTable(
    tableName: String,
    dataSourceName: String,
    schema: StructType,
    allowExisting: Boolean,
    option: (String, String),
    options: (String, String)*): Unit

// This one is for Java users.
def createTable(
    tableName: String,
    dataSourceName: String,
    allowExisting: Boolean,
    options: java.util.Map[String, String]): Unit

// This one is for Java users.
def createTable(
    tableName: String,
    dataSourceName: String,
    schema: StructType,
    allowExisting: Boolean,
    options: java.util.Map[String, String]): Unit
{code}

> Cross-language load/store functions for creating and saving DataFrames
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Patrick Wendell
> Assignee: Yin Huai
> Priority: Blocker
> Fix For: 1.3.0
>
> We should have standard APIs for loading or saving a table from a data store. Per comment discussion:
> {code}
> def loadData(datasource: String, parameters: Map[String, String]): DataFrame
> def loadData(datasource: String, parameters: java.util.Map[String, String]): DataFrame
> def storeData(datasource: String, parameters: Map[String, String]): DataFrame
> def storeData(datasource: String, parameters: java.util.Map[String, String]): DataFrame
> {code}
> Python should have this too.
[jira] [Commented] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303967#comment-14303967 ] Yin Huai commented on SPARK-5420:

I am copying the summary of write-related interfaces from [SPARK-5501|https://issues.apache.org/jira/browse/SPARK-5501?focusedCommentId=14303760&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14303760] to here.

h3. End user APIs added to DataFrame (write related)
h4. Save a DataFrame as a table
When a user is using *HiveContext*, he/she can save a DataFrame as a table. The metadata of this table will be stored in the metastore.
{code}
// When a data source name is not specified, we will use our default one
// (configured by spark.sql.default.datasource). Right now, it is Parquet.
def saveAsTable(tableName: String): Unit

def saveAsTable(
    tableName: String,
    dataSourceName: String,
    option: (String, String),
    options: (String, String)*): Unit

// This is for Java users.
def saveAsTable(
    tableName: String,
    dataSourceName: String,
    options: java.util.Map[String, String]): Unit
{code}

h4. Save a DataFrame to a data source
Users can save a DataFrame with a data source.
{code}
// This method is used to save a DataFrame to a file-based data source (e.g. Parquet).
// We will use the default data source; right now, it is Parquet.
def save(path: String): Unit

def save(
    dataSourceName: String,
    option: (String, String),
    options: (String, String)*): Unit

// This is for Java users.
def save(
    dataSourceName: String,
    options: java.util.Map[String, String]): Unit
{code}

h4. Insert data into a table from a DataFrame
Users can insert the data of a DataFrame into an existing table created by the data source API.
{code}
// Appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String): Unit

// When overwrite is true, inserts the data of this DataFrame into the table
// tableName, overwriting existing data.
// When overwrite is false, appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String, overwrite: Boolean): Unit
{code}

> Cross-language load/store functions for creating and saving DataFrames
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
[jira] [Comment Edited] (SPARK-5501) Write support for the data source API
[ https://issues.apache.org/jira/browse/SPARK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303760#comment-14303760 ] Yin Huai edited comment on SPARK-5501 at 2/3/15 8:55 PM:

h3. End user APIs added to DataFrame (write related)
h4. Save a DataFrame as a table
When a user is using *HiveContext*, he/she can save a DataFrame as a table. The metadata of this table will be stored in the metastore.
{code}
// When a data source name is not specified, we will use our default one
// (configured by spark.sql.default.datasource). Right now, it is Parquet.
def saveAsTable(tableName: String): Unit

def saveAsTable(
    tableName: String,
    dataSourceName: String,
    option: (String, String),
    options: (String, String)*): Unit

// This is for Java users.
def saveAsTable(
    tableName: String,
    dataSourceName: String,
    options: java.util.Map[String, String]): Unit
{code}

h4. Save a DataFrame to a data source
Users can save a DataFrame with a data source.
{code}
// This method is used to save a DataFrame to a file-based data source (e.g. Parquet).
// We will use the default data source; right now, it is Parquet.
def save(path: String): Unit

def save(
    dataSourceName: String,
    option: (String, String),
    options: (String, String)*): Unit

// This is for Java users.
def save(
    dataSourceName: String,
    options: java.util.Map[String, String]): Unit
{code}

h4. Insert data into a table from a DataFrame
Users can insert the data of a DataFrame into an existing table created by the data source API.
{code}
// Appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String): Unit

// When overwrite is true, inserts the data of this DataFrame into the table
// tableName, overwriting existing data.
// When overwrite is false, appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String, overwrite: Boolean): Unit
{code}

was (Author: yhuai):
h3. End user APIs added to DataFrame
h4. Save a DataFrame as a table
When a user is using *HiveContext*, he/she can save a DataFrame as a table. The metadata of this table will be stored in the metastore.
{code}
// When a data source name is not specified, we will use our default one
// (configured by spark.sql.default.datasource). Right now, it is Parquet.
def saveAsTable(tableName: String): Unit

def saveAsTable(
    tableName: String,
    dataSourceName: String,
    option: (String, String),
    options: (String, String)*): Unit

// This is for Java users.
def saveAsTable(
    tableName: String,
    dataSourceName: String,
    options: java.util.Map[String, String]): Unit
{code}

h4. Save a DataFrame to a data source
Users can save a DataFrame with a data source.
{code}
// This method is used to save a DataFrame to a file-based data source (e.g. Parquet).
// We will use the default data source; right now, it is Parquet.
def save(path: String): Unit

def save(
    dataSourceName: String,
    option: (String, String),
    options: (String, String)*): Unit

// This is for Java users.
def save(
    dataSourceName: String,
    options: java.util.Map[String, String]): Unit
{code}

h4. Insert data into a table from a DataFrame
Users can insert the data of a DataFrame into an existing table created by the data source API.
{code}
// Appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String): Unit

// When overwrite is true, inserts the data of this DataFrame into the table
// tableName, overwriting existing data.
// When overwrite is false, appends the data of this DataFrame to the table tableName.
def insertInto(tableName: String, overwrite: Boolean): Unit
{code}

> Write support for the data source API
>
> Key: SPARK-5501
> URL: https://issues.apache.org/jira/browse/SPARK-5501
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Yin Huai
> Assignee: Yin Huai
> Priority: Blocker
> Fix For: 1.3.0
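Pulling the interfaces summarized above together, a typical round trip might look like the following. This is an illustrative sketch against the proposed signatures only: it assumes a running `HiveContext` named `hiveContext` and DataFrames `df` and `moreEvents`, and is not runnable outside a Spark application:

```scala
// Save with the default data source (Parquet), then via an explicit source name:
df.save("/data/events.parquet")
df.save("parquet", "path" -> "/data/events.parquet")

// Persist as a metastore table (HiveContext only):
df.saveAsTable("events")

// Add more rows to the existing table:
moreEvents.insertInto("events")                    // append
moreEvents.insertInto("events", overwrite = true)  // replace existing data

// Load it back, by path (default source) or by explicit source name:
val reloaded  = hiveContext.load("/data/events.parquet")
val viaSource = hiveContext.load("parquet", "path" -> "/data/events.parquet")
```

The `"path" -> ...` option key is an assumption about how a file-based source would receive its location through the `(String, String)` option pairs; the signatures themselves are taken verbatim from the summaries above.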
[jira] [Comment Edited] (SPARK-5420) Cross-language load/store functions for creating and saving DataFrames
[ https://issues.apache.org/jira/browse/SPARK-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303771#comment-14303771 ] Yin Huai edited comment on SPARK-5420 at 2/3/15 8:54 PM:

This JIRA is fixed by the attached PR. I am resolving it.

was (Author: yhuai): This JIRA is fixed by the attached PR. A summary of the added interfaces can be found in https://issues.apache.org/jira/browse/SPARK-5501.

> Cross-language load/store functions for creating and saving DataFrames
>
> Key: SPARK-5420
> URL: https://issues.apache.org/jira/browse/SPARK-5420
[jira] [Resolved] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)
[ https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-4768. - Resolution: Duplicate SPARK-4987 has resolved the issues. I am resolving this one. > Add Support For Impala Encoded Timestamp (INT96) > > > Key: SPARK-4768 > URL: https://issues.apache.org/jira/browse/SPARK-4768 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Pat McDonough >Assignee: Yin Huai >Priority: Blocker > Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, > string_timestamp.gz > > > Impala is using INT96 for timestamps. Spark SQL should be able to read this > data despite the fact that it is not part of the spec. > Perhaps adding a flag to act like impala when reading parquet (like we do for > strings already) would be useful. > Here's an example of the error you might see: > {code} > Caused by: java.lang.RuntimeException: Potential loss of precision: cannot > convert INT96 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441) > at > org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66) > at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141) > {code}
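For context, the Impala-style INT96 timestamps discussed above pack a 12-byte value: 8 little-endian bytes of nanoseconds within the day followed by a 4-byte little-endian Julian day number. A simplified Python sketch of that decoding (illustrative only, not the code Spark ultimately shipped):

```python
import struct

JULIAN_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def int96_to_unix_micros(raw):
    """Decode a 12-byte INT96 timestamp to microseconds since the Unix epoch.
    Layout: 8-byte little-endian nanos-of-day, then 4-byte little-endian
    Julian day (the layout Impala/Hive use in Parquet)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_UNIX_EPOCH
    return days * 86_400_000_000 + nanos_of_day // 1000
```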
[jira] [Updated] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)
[ https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4768: Fix Version/s: 1.3.0 > Add Support For Impala Encoded Timestamp (INT96) > > > Key: SPARK-4768 > URL: https://issues.apache.org/jira/browse/SPARK-4768 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Pat McDonough >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.3.0 > > Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, > string_timestamp.gz > > > Impala is using INT96 for timestamps. Spark SQL should be able to read this > data despite the fact that it is not part of the spec. > Perhaps adding a flag to act like impala when reading parquet (like we do for > strings already) would be useful. > Here's an example of the error you might see: > {code} > Caused by: java.lang.RuntimeException: Potential loss of precision: cannot > convert INT96 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at 
scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441) > at > org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66) > at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141) > {code}
[jira] [Comment Edited] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)
[ https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303960#comment-14303960 ] Yin Huai edited comment on SPARK-4768 at 2/3/15 8:51 PM: - SPARK-4987 has resolved the issue. I am resolving this one. was (Author: yhuai): SPARK-4987 has resolved the issues. I am resolving this one. > Add Support For Impala Encoded Timestamp (INT96) > > > Key: SPARK-4768 > URL: https://issues.apache.org/jira/browse/SPARK-4768 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Pat McDonough >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.3.0 > > Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, > string_timestamp.gz > > > Impala is using INT96 for timestamps. Spark SQL should be able to read this > data despite the fact that it is not part of the spec. > Perhaps adding a flag to act like impala when reading parquet (like we do for > strings already) would be useful. > Here's an example of the error you might see: > {code} > Caused by: java.lang.RuntimeException: Potential loss of precision: cannot > convert INT96 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441) > at > org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66) > at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141) > {code}
[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303961#comment-14303961 ] Tor Myklebust commented on SPARK-5472: -- It probably does not handle SQL ARRAY types in a sane way. I would guess that type mapping would throw an error if you try to read from a table that has an ARRAY column. I would also guess that type mapping would throw an error if you try to write a DataFrame that has an ARRAY column. JDBCRDD handles partitioning however you instruct it to. If you give no instructions, the entire table is a single partition. If you give it a JDBCPartitioningInfo object, it divides the specified range of the specified column into the appropriate number of slices. If you give it a list of WHERE clauses, each WHERE clause corresponds to one partition. > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself. 
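Tor's description of range-based partitioning can be sketched as slicing a numeric column into one WHERE clause per partition, with the first slice also collecting NULLs so no row is dropped. This is a simplified illustration of the assumed behavior, not Spark's exact JDBCPartitioningInfo code:

```python
def range_partition_clauses(column, lower, upper, num_partitions):
    """Slice [lower, upper) on `column` into num_partitions WHERE clauses.
    The first slice also picks up NULLs so every row lands in some partition."""
    stride = (upper - lower) // num_partitions
    clauses, bound = [], lower
    for i in range(num_partitions):
        lo = None if i == 0 else bound
        bound += stride
        hi = None if i == num_partitions - 1 else bound
        if lo is None and hi is None:
            clauses.append("1 = 1")  # single partition: the whole table
        elif lo is None:
            clauses.append("%s < %d OR %s IS NULL" % (column, hi, column))
        elif hi is None:
            clauses.append("%s >= %d" % (column, lo))
        else:
            clauses.append("%s >= %d AND %s < %d" % (column, lo, column, hi))
    return clauses
```

Each clause would then be appended to the per-partition query, so the database does the filtering.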
[jira] [Commented] (SPARK-5548) Flaky test: org.apache.spark.util.AkkaUtilsSuite.remote fetch ssl on - untrusted server
[ https://issues.apache.org/jira/browse/SPARK-5548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303953#comment-14303953 ] Jacek Lewandowski commented on SPARK-5548: -- You are absolutely right [~joshrosen]. This is due to the inconsistent behaviour of {{Await.result}} and {{resolveOne}} methods. The first one fails with {{TimeoutException}} while the second (in case of timeout) fails with {{ActorNotFoundException}}. I'll fix it right away. > Flaky test: org.apache.spark.util.AkkaUtilsSuite.remote fetch ssl on - > untrusted server > --- > > Key: SPARK-5548 > URL: https://issues.apache.org/jira/browse/SPARK-5548 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Patrick Wendell >Assignee: Jacek Lewandowski > Labels: flaky-test > > {code} > sbt.ForkMain$ForkError: Expected exception > java.util.concurrent.TimeoutException to be thrown, but > akka.actor.ActorNotFound was thrown. > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:496) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) > at org.scalatest.Assertions$class.intercept(Assertions.scala:1004) > at org.scalatest.FunSuite.intercept(FunSuite.scala:1555) > at > org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply$mcV$sp(AkkaUtilsSuite.scala:373) > at > org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349) > at > org.apache.spark.util.AkkaUtilsSuite$$anonfun$8.apply(AkkaUtilsSuite.scala:349) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.scalatest.Suite$class.withFixture(Suite.scala:1122) 
> at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(AkkaUtilsSuite.scala:37) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > at org.apache.spark.util.AkkaUtilsSuite.runTest(AkkaUtilsSuite.scala:37) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) > at org.scalatest.Suite$class.run(Suite.scala:1424) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) > at > 
org.apache.spark.util.AkkaUtilsSuite.org$scalatest$BeforeAndAfterAll$$super$run(AkkaUtilsSuite.scala:37) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256) > at org.apache.spark.util.AkkaUtilsSuite.run(AkkaUtilsSuite.scala:37) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) > at sbt.ForkMain$Run$2.call
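The fix described above amounts to the test accepting either of two exception types rather than exactly one. The actual change widens a ScalaTest `intercept` expectation; the following is only a language-neutral Python analogue of that pattern:

```python
from contextlib import contextmanager

@contextmanager
def intercept_any(*exc_types):
    """Pass if the body raises any of exc_types; fail otherwise. Analogous to
    widening intercept[TimeoutException] so the ActorNotFound-style failure
    seen in the flaky run is also accepted."""
    try:
        yield
    except exc_types:
        return  # one of the expected failures was observed
    raise AssertionError("expected one of %r to be raised" % (exc_types,))
```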
[jira] [Resolved] (SPARK-4987) Parquet support for timestamp type
[ https://issues.apache.org/jira/browse/SPARK-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-4987. - Resolution: Fixed Fix Version/s: 1.3.0 Thank you [~adrian-wang]! > Parquet support for timestamp type > -- > > Key: SPARK-4987 > URL: https://issues.apache.org/jira/browse/SPARK-4987 > Project: Spark > Issue Type: New Feature >Reporter: Adrian Wang > Fix For: 1.3.0 > >
[jira] [Resolved] (SPARK-4709) Spark SQL support error reading Parquet with timestamp type field
[ https://issues.apache.org/jira/browse/SPARK-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-4709. - Resolution: Duplicate Target Version/s: (was: 1.1.0) Seems it duplicates SPARK-4987. I am resolving it. > Spark SQL support error reading Parquet with timestamp type field > - > > Key: SPARK-4709 > URL: https://issues.apache.org/jira/browse/SPARK-4709 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Felix Cheung >Priority: Critical > > Have a data set on Parquet format (created by Hive) with a field of the > timestamp type. Reading this causes an exception: > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > val p = sqlContext.parquetFile("hdfs:///data/parquetdata") > java.lang.RuntimeException: Potential loss of precision: cannot convert INT96 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310) > at > org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441) > at > org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66) > at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:17) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:22) > at $iwC$$iwC$$iwC$$iwC.(:24) > at $iwC$$iwC$$iwC.(:26) > at $iwC$$iwC.(:28) > at $iwC.(:30) > at (:32) > at .(:36) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776) > at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627) > at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907) > at > 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl
[jira] [Updated] (SPARK-4987) Parquet support for timestamp type
[ https://issues.apache.org/jira/browse/SPARK-4987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4987: Target Version/s: 1.3.0 > Parquet support for timestamp type > -- > > Key: SPARK-4987 > URL: https://issues.apache.org/jira/browse/SPARK-4987 > Project: Spark > Issue Type: New Feature >Reporter: Adrian Wang >
[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303949#comment-14303949 ] Tor Myklebust commented on SPARK-5472: -- This is good feedback. Thanks. You don't actually have to pass real table names to JDBCRDD. For instance, `(SELECT name, id FROM people)` is a perfectly valid table name to JDBCRDD. As long as `SELECT columnlist FROM tablename WHERE conditions` is a valid SQL query, anything goes. So, insofar as you trust the underlying database to optimise `SELECT columnlist FROM tablename WHERE filters AND partitioningcondition` into something reasonable, you should be able to avoid creating a view in the external database. Custom partitioning can be done with the new JDBCRDD API as well; there is an interface in SQLContext that just takes a list of syntactically-valid conditions and creates one partition per condition. > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself. 
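The subquery-as-table-name trick described above can be shown with a small query-builder sketch. This is illustrative Python only (the real assembly happens inside JDBCRDD), and note that some databases additionally require an alias on a derived table:

```python
def build_select(table, columns, filters=()):
    """Assemble `SELECT columnlist FROM tablename WHERE conditions`.
    `table` may be a plain table name or a parenthesized subquery such as
    "(SELECT name, id FROM people)", which the database then treats as a
    derived table and is free to optimise together with the filters."""
    sql = "SELECT %s FROM %s" % (", ".join(columns), table)
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    return sql
```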
[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303943#comment-14303943 ] Reynold Xin commented on SPARK-5472: Actually I think it already supports arbitrary queries. There is even a test case for it: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala#L241 Would that satisfy your use case? > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself.
[jira] [Comment Edited] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303933#comment-14303933 ] Reynold Xin edited comment on SPARK-5472 at 2/3/15 8:39 PM: What if we expand the JDBC data source to support arbitrary queries, in addition to tables/views? was (Author: rxin): What if we expand the JDBC data source to support arbitrary queries? > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself.
[jira] [Issue Comment Deleted] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5472: --- Comment: was deleted (was: What if we expand the JDBC data source to support arbitrary queries? ) > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself.
[jira] [Commented] (SPARK-5472) Add support for reading from and writing to a JDBC database
[ https://issues.apache.org/jira/browse/SPARK-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303934#comment-14303934 ] Reynold Xin commented on SPARK-5472: What if we expand the JDBC data source to support arbitrary queries? > Add support for reading from and writing to a JDBC database > --- > > Key: SPARK-5472 > URL: https://issues.apache.org/jira/browse/SPARK-5472 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Tor Myklebust >Assignee: Tor Myklebust >Priority: Blocker > Fix For: 1.3.0 > > > It would be nice to be able to make a table in a JDBC database appear as a > table in Spark SQL. This would let users, for instance, perform a JOIN > between a DataFrame in Spark SQL with a table in a Postgres database. > It might also be nice to be able to go the other direction -- save a > DataFrame to a database -- for instance in an ETL job. > Edited to clarify: Both of these tasks are certainly possible to accomplish > at the moment with a little bit of ad-hoc glue code. However, there is no > fundamental reason why the user should need to supply the table schema and > some code for pulling data out of a ResultSet row into a Catalyst Row > structure when this information can be derived from the schema of the > database table itself.
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303927#comment-14303927 ] Reynold Xin commented on SPARK-5260: BTW I've also added you to the contributor list. > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed json. For now, I've actually copied the > method right out of the JsonRDD class into my own project but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else- like an object called JsonSchema.