[jira] [Assigned] (SPARK-20264) asm should be non-test dependency in sql/core

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20264:


Assignee: Reynold Xin  (was: Apache Spark)

> asm should be non-test dependency in sql/core
> -
>
> Key: SPARK-20264
> URL: https://issues.apache.org/jira/browse/SPARK-20264
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The sql/core module currently declares asm as a test-scope dependency. It should 
> actually be a normal (compile-scope) dependency, since the core module already 
> defines it and sql/core picks it up transitively. This occasionally confuses IntelliJ.






[jira] [Assigned] (SPARK-20264) asm should be non-test dependency in sql/core

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20264:


Assignee: Apache Spark  (was: Reynold Xin)

> asm should be non-test dependency in sql/core
> -
>
> Key: SPARK-20264
> URL: https://issues.apache.org/jira/browse/SPARK-20264
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> The sql/core module currently declares asm as a test-scope dependency. It should 
> actually be a normal (compile-scope) dependency, since the core module already 
> defines it and sql/core picks it up transitively. This occasionally confuses IntelliJ.






[jira] [Commented] (SPARK-20264) asm should be non-test dependency in sql/core

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961700#comment-15961700
 ] 

Apache Spark commented on SPARK-20264:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17574

> asm should be non-test dependency in sql/core
> -
>
> Key: SPARK-20264
> URL: https://issues.apache.org/jira/browse/SPARK-20264
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The sql/core module currently declares asm as a test-scope dependency. It should 
> actually be a normal (compile-scope) dependency, since the core module already 
> defines it and sql/core picks it up transitively. This occasionally confuses IntelliJ.






[jira] [Created] (SPARK-20264) asm should be non-test dependency in sql/core

2017-04-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20264:
---

 Summary: asm should be non-test dependency in sql/core
 Key: SPARK-20264
 URL: https://issues.apache.org/jira/browse/SPARK-20264
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin


The sql/core module currently declares asm as a test-scope dependency. It should 
actually be a normal (compile-scope) dependency, since the core module already 
defines it and sql/core picks it up transitively. This occasionally confuses IntelliJ.







[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-04-07 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961687#comment-15961687
 ] 

Wenchen Fan commented on SPARK-18055:
-

I think this is a different issue, can you open a new ticket for it? thanks!

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>Assignee: Michael Armbrust
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on a Dataset column which is of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  ||-- element: string
> {code}
> flatMap works on the RDD:
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error:
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}
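As a further sketch, not taken from this ticket and untested against the specific classloader 
failure above: passing an explicit encoder for the result type avoids the implicit, 
reflection-based encoder derivation that the stack trace points at. The element types below 
are assumptions made purely for illustration.

{code}
// Hedged sketch only: Dataset.flatMap with an explicitly supplied encoder,
// so no TypeTag-based derivation is attempted for the result type.
// The (id, outputs) element type is an assumption for illustration.
import org.apache.spark.sql.{Dataset, Encoders}

def flattenOutputs(ds: Dataset[(String, Seq[String])]): Dataset[String] =
  ds.flatMap { case (_, outputs) => outputs }(Encoders.STRING)
{code}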






[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets

2017-04-07 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961686#comment-15961686
 ] 

Wenchen Fan commented on SPARK-19352:
-

I don't think Spark will provide API support for this feature (does Hive really 
have it?), but the implementation is quite stable now, so you can follow the 
example in this ticket to write out sorted data.

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen when using a smaller sized dataset by making the time 
> span smaller (354MB, with the same number of columns).






[jira] [Closed] (SPARK-20262) AssertNotNull should throw NullPointerException

2017-04-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li closed SPARK-20262.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2

> AssertNotNull should throw NullPointerException
> ---
>
> Key: SPARK-20262
> URL: https://issues.apache.org/jira/browse/SPARK-20262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.2, 2.2.0
>
>







[jira] [Commented] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC

2017-04-07 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961677#comment-15961677
 ] 

Hyukjin Kwon commented on SPARK-20259:
--

Could you describe the current status and why it should be like that?

> Support push down join optimizations in DataFrameReader when loading from JDBC
> --
>
> Key: SPARK-20259
> URL: https://issues.apache.org/jira/browse/SPARK-20259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.1.0
>Reporter: John Muller
>Priority: Minor
>
> Given two dataframes loaded from the same JDBC connection:
> {code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid}
> val ordersDF = spark.read
>   .format("jdbc")
>   .option("url", "jdbc:postgresql:dbserver")
>   .option("dbtable", "northwind.orders")
>   .option("user", "username")
>   .option("password", "password")
>   .load().toDS
>   
> val productDF = spark.read
>   .format("jdbc")
>   .option("url", "jdbc:postgresql:dbserver")
>   .option("dbtable", "northwind.product")
>   .option("user", "username")
>   .option("password", "password")
>   .load().toDS
>   
> ordersDF.createOrReplaceTempView("orders")
> productDF.createOrReplaceTempView("product")
> // Followed by a join between them:
> val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o 
> INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name")
> {code}
> Catalyst should optimize the query to be:
> SELECT northwind.product.name, SUM(northwind.orders.qty)
> FROM northwind.orders
> INNER JOIN northwind.product ON
>   northwind.orders.product_id = northwind.product.product_id
> GROUP BY p.name






[jira] [Commented] (SPARK-19935) SparkSQL unsupports to create a hive table which is mapped for HBase table

2017-04-07 Thread Xiaochen Ouyang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961670#comment-15961670
 ] 

Xiaochen Ouyang commented on SPARK-19935:
-

Not yet! I tried to work on this issue, but it currently only supports creating 
and scanning the table.
Running an insert command throws an exception, because Spark 2.x has no 
HBaseFileFormat for writing to HBase.
The relevant code is as follows:
  @transient private lazy val outputFormat =
    jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]

> SparkSQL unsupports to create a hive table which is mapped for HBase table
> --
>
> Key: SPARK-19935
> URL: https://issues.apache.org/jira/browse/SPARK-19935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark2.0.2
>Reporter: Xiaochen Ouyang
>
> SparkSQL does not support the following command:
>  CREATE TABLE spark_test(key int, value string)   
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")   
> TBLPROPERTIES ("hbase.table.name" = "xyz");






[jira] [Comment Edited] (SPARK-19935) SparkSQL unsupports to create a hive table which is mapped for HBase table

2017-04-07 Thread Xiaochen Ouyang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961670#comment-15961670
 ] 

Xiaochen Ouyang edited comment on SPARK-19935 at 4/8/17 4:04 AM:
-

[~wangchao2017] Not yet! I tried to work on this issue, but it currently only 
supports creating and scanning the table.
Running an insert command throws an exception, because Spark 2.x has no 
HBaseFileFormat for writing to HBase.
The relevant code is as follows:
  @transient private lazy val outputFormat =
    jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]


was (Author: ouyangxc.zte):
Not yet! I tried to work on this issue, but it currently only supports creating 
and scanning the table.
Running an insert command throws an exception, because Spark 2.x has no 
HBaseFileFormat for writing to HBase.
The relevant code is as follows:
  @transient private lazy val outputFormat =
    jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]

> SparkSQL unsupports to create a hive table which is mapped for HBase table
> --
>
> Key: SPARK-19935
> URL: https://issues.apache.org/jira/browse/SPARK-19935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark2.0.2
>Reporter: Xiaochen Ouyang
>
> SparkSQL does not support the following command:
>  CREATE TABLE spark_test(key int, value string)   
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")   
> TBLPROPERTIES ("hbase.table.name" = "xyz");






[jira] [Resolved] (SPARK-20246) Should check determinism when pushing predicates down through aggregation

2017-04-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20246.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.2.0
   2.1.2
   2.0.3

> Should check determinism when pushing predicates down through aggregation
> -
>
> Key: SPARK-20246
> URL: https://issues.apache.org/jira/browse/SPARK-20246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Weiluo Ren
>Assignee: Wenchen Fan
>  Labels: correctness
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> {code}
> import org.apache.spark.sql.functions._
> spark.range(1,1000).distinct.withColumn("random", rand()).filter(col("random") > 0.3).orderBy("random").show
> {code}
> gives a wrong result.
> The optimized logical plan shows that the filter with the non-deterministic 
> predicate is pushed beneath the aggregate operator, which should not happen.
> cc [~lian cheng]
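For context, a minimal sketch of the check the ticket title asks for; this is not the actual 
Catalyst rule, and the object name is invented. It only shows how a filter condition can be 
split into conjuncts and partitioned by Expression.deterministic, so that non-deterministic 
predicates are never candidates for pushing below an aggregate:

{code}
// Minimal sketch, not the real optimizer rule.
import org.apache.spark.sql.catalyst.expressions.{Expression, PredicateHelper}

object DeterministicPushdownSketch extends PredicateHelper {
  // Left: deterministic conjuncts that are safe to push below the Aggregate.
  // Right: non-deterministic conjuncts that must stay above it.
  def splitForPushdown(condition: Expression): (Seq[Expression], Seq[Expression]) =
    splitConjunctivePredicates(condition).partition(_.deterministic)
}
{code}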






[jira] [Updated] (SPARK-20263) create empty dataframes in sparkR

2017-04-07 Thread Ott Toomet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ott Toomet updated SPARK-20263:
---
   Priority: Minor  (was: Trivial)
Description: 
SparkR 2.1 does not support creating empty dataframes, nor conversion of empty 
R dataframes to spark ones:

createDataFrame(data.frame(a=integer()))

gives 

Error in takeRDD(x, 1)[[1]] : subscript out of bounds

  was:
Spark 2.1 does not support creating empty dataframes, nor conversion of empty R 
dataframes to spark ones:

createDataFrame(data.frame(a=integer()))

gives 

Error in takeRDD(x, 1)[[1]] : subscript out of bounds


> create empty dataframes in sparkR
> -
>
> Key: SPARK-20263
> URL: https://issues.apache.org/jira/browse/SPARK-20263
> Project: Spark
>  Issue Type: Wish
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Ott Toomet
>Priority: Minor
>
> SparkR 2.1 does not support creating empty dataframes, nor conversion of 
> empty R dataframes to spark ones:
> createDataFrame(data.frame(a=integer()))
> gives 
> Error in takeRDD(x, 1)[[1]] : subscript out of bounds






[jira] [Created] (SPARK-20263) create empty dataframes in sparkR

2017-04-07 Thread Ott Toomet (JIRA)
Ott Toomet created SPARK-20263:
--

 Summary: create empty dataframes in sparkR
 Key: SPARK-20263
 URL: https://issues.apache.org/jira/browse/SPARK-20263
 Project: Spark
  Issue Type: Wish
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Ott Toomet
Priority: Trivial


Spark 2.1 does not support creating empty dataframes, nor conversion of empty R 
dataframes to spark ones:

createDataFrame(data.frame(a=integer()))

gives 

Error in takeRDD(x, 1)[[1]] : subscript out of bounds






[jira] [Assigned] (SPARK-20262) AssertNotNull should throw NullPointerException

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20262:


Assignee: Reynold Xin  (was: Apache Spark)

> AssertNotNull should throw NullPointerException
> ---
>
> Key: SPARK-20262
> URL: https://issues.apache.org/jira/browse/SPARK-20262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-20262) AssertNotNull should throw NullPointerException

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20262:


Assignee: Apache Spark  (was: Reynold Xin)

> AssertNotNull should throw NullPointerException
> ---
>
> Key: SPARK-20262
> URL: https://issues.apache.org/jira/browse/SPARK-20262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-20262) AssertNotNull should throw NullPointerException

2017-04-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20262:
---

 Summary: AssertNotNull should throw NullPointerException
 Key: SPARK-20262
 URL: https://issues.apache.org/jira/browse/SPARK-20262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Commented] (SPARK-20262) AssertNotNull should throw NullPointerException

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961584#comment-15961584
 ] 

Apache Spark commented on SPARK-20262:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17573

> AssertNotNull should throw NullPointerException
> ---
>
> Key: SPARK-20262
> URL: https://issues.apache.org/jira/browse/SPARK-20262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Created] (SPARK-20261) EventLoggingListener may not truly flush the logger when a compression codec is used

2017-04-07 Thread Brian Cho (JIRA)
Brian Cho created SPARK-20261:
-

 Summary: EventLoggingListener may not truly flush the logger when 
a compression codec is used
 Key: SPARK-20261
 URL: https://issues.apache.org/jira/browse/SPARK-20261
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Brian Cho
Priority: Minor


Log events with flushLogger set to true are supposed to be flushed immediately to 
update the event history. However, this does not happen with some compression 
codecs, e.g. LZ4BlockOutputStream, because the compressed stream can hold on to 
the update until the compression block is filled.
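To make the failure mode concrete, here is a purely illustrative Scala sketch; it is not 
LZ4BlockOutputStream or any Spark class, and the class name and block size are invented. It 
only shows how a block-buffering stream can leave recently written events invisible until a 
block fills or the stream is closed:

{code}
import java.io.{ByteArrayOutputStream, OutputStream}

// Hypothetical block-buffering stream, for illustration only.
class BlockBufferedOutputStream(out: OutputStream, blockSize: Int = 64 * 1024)
    extends OutputStream {
  private val buffer = new ByteArrayOutputStream()

  override def write(b: Int): Unit = {
    buffer.write(b)
    if (buffer.size() >= blockSize) {  // only complete blocks are emitted
      buffer.writeTo(out)
      buffer.reset()
    }
  }

  // Flushing the underlying stream does not include the bytes still sitting in
  // the partially filled block, which is the behavior this ticket describes.
  override def flush(): Unit = out.flush()

  override def close(): Unit = {
    buffer.writeTo(out)  // the partial block is only written out on close
    buffer.reset()
    out.flush()
    out.close()
  }
}
{code}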






[jira] [Assigned] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20260:


Assignee: (was: Apache Spark)

> MLUtils parseLibSVMRecord has incorrect string interpolation for error message
> --
>
> Key: SPARK-20260
> URL: https://issues.apache.org/jira/browse/SPARK-20260
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Vijay Krishna Ramesh
>Priority: Minor
>
> The error message is missing string interpolation, which causes it to not 
> actually display the line that failed. See 
> https://github.com/apache/spark/pull/17572/files for a trivial fix.






[jira] [Assigned] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20260:


Assignee: Apache Spark

> MLUtils parseLibSVMRecord has incorrect string interpolation for error message
> --
>
> Key: SPARK-20260
> URL: https://issues.apache.org/jira/browse/SPARK-20260
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Vijay Krishna Ramesh
>Assignee: Apache Spark
>Priority: Minor
>
> The error message is missing string interpolation, which causes it to not 
> actually display the line that failed. See 
> https://github.com/apache/spark/pull/17572/files for a trivial fix.






[jira] [Commented] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961491#comment-15961491
 ] 

Apache Spark commented on SPARK-20260:
--

User 'vijaykramesh' has created a pull request for this issue:
https://github.com/apache/spark/pull/17572

> MLUtils parseLibSVMRecord has incorrect string interpolation for error message
> --
>
> Key: SPARK-20260
> URL: https://issues.apache.org/jira/browse/SPARK-20260
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Vijay Krishna Ramesh
>Priority: Minor
>
> The error message is missing string interpolation, which causes it to not 
> actually display the line that failed. See 
> https://github.com/apache/spark/pull/17572/files for a trivial fix.






[jira] [Created] (SPARK-20260) MLUtils parseLibSVMRecord has incorrect string interpolation for error message

2017-04-07 Thread Vijay Krishna Ramesh (JIRA)
Vijay Krishna Ramesh created SPARK-20260:


 Summary: MLUtils parseLibSVMRecord has incorrect string 
interpolation for error message
 Key: SPARK-20260
 URL: https://issues.apache.org/jira/browse/SPARK-20260
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Vijay Krishna Ramesh
Priority: Minor


The error message is missing string interpolation, which causes it to not actually 
display the line that failed. See https://github.com/apache/spark/pull/17572/files 
for a trivial fix.
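For illustration (the message text and sample value below are made up, not the actual MLUtils 
source), this is the kind of bug the PR fixes: a message that uses $-placeholders but lacks the 
s interpolator prints the placeholder literally, so the offending line never shows up:

{code}
val current = "1 1:0.5 3:0.7"  // a record that failed a parsing sanity check

// Without the `s` prefix the placeholder is printed literally and the bad line is lost:
println("indices should be one-based and in ascending order; found current=$current")

// With string interpolation the failing record is actually shown:
println(s"indices should be one-based and in ascending order; found current=$current")
{code}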






[jira] [Resolved] (SPARK-20255) FileIndex hierarchy inconsistency

2017-04-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20255.
-
   Resolution: Fixed
 Assignee: Adrian Ionescu
Fix Version/s: 2.2.0

> FileIndex hierarchy inconsistency
> -
>
> Key: SPARK-20255
> URL: https://issues.apache.org/jira/browse/SPARK-20255
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Adrian Ionescu
>Assignee: Adrian Ionescu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the 
> following inconsistency: 
> On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and 
> {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements 
> {{listLeafFiles}} which does all the listing of files. However, the latter is 
> only used by {{InMemoryFileIndex}}.
> I'm hereby proposing to move this method (and all its dependencies) to the 
> implementation class that actually uses it, and thus unclutter the 
> {{PartitioningAwareFileIndex}} interface.






[jira] [Updated] (SPARK-20258) SparkR logistic regression example did not converge in programming guide

2017-04-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20258:
-
Fix Version/s: 2.2.0

> SparkR logistic regression example did not converge in programming guide
> 
>
> Key: SPARK-20258
> URL: https://issues.apache.org/jira/browse/SPARK-20258
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.2.0
>
>
> SparkR logistic regression example did not converge in programming guide. All 
> estimates are essentially zero:
> {code}
> training2 <- read.df("data/mllib/sample_binary_classification_data.txt", 
> source = "libsvm")
> df_list2 <- randomSplit(training2, c(7,3), 2)
> binomialDF <- df_list2[[1]]
> binomialTestDF <- df_list2[[2]]
> binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")
> 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to 
> singular covariance matrix. Retrying with Quasi-Newton solver.
> > summary(binomialGLM)
> Deviance Residuals: 
> (Note: These are approximate quantiles with relative error <= 0.01)
> Min   1Q   Median   3Q  Max  
> -2.4828e-06  -2.4063e-06   2.2778e-06   2.4350e-06   2.7722e-06  
> Coefficients:
>  Estimate
> (Intercept)9.0255e+00
> features_0 0.e+00
> features_1 0.e+00
> features_2 0.e+00
> features_3 0.e+00
> features_4 0.e+00
> features_5 0.e+00
> features_6 0.e+00
> features_7 0.e+00
> {code}






[jira] [Resolved] (SPARK-20258) SparkR logistic regression example did not converge in programming guide

2017-04-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20258.
--
Resolution: Fixed
  Assignee: Wayne Zhang

> SparkR logistic regression example did not converge in programming guide
> 
>
> Key: SPARK-20258
> URL: https://issues.apache.org/jira/browse/SPARK-20258
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>
> SparkR logistic regression example did not converge in programming guide. All 
> estimates are essentially zero:
> {code}
> training2 <- read.df("data/mllib/sample_binary_classification_data.txt", 
> source = "libsvm")
> df_list2 <- randomSplit(training2, c(7,3), 2)
> binomialDF <- df_list2[[1]]
> binomialTestDF <- df_list2[[2]]
> binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")
> 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to 
> singular covariance matrix. Retrying with Quasi-Newton solver.
> > summary(binomialGLM)
> Deviance Residuals: 
> (Note: These are approximate quantiles with relative error <= 0.01)
> Min   1Q   Median   3Q  Max  
> -2.4828e-06  -2.4063e-06   2.2778e-06   2.4350e-06   2.7722e-06  
> Coefficients:
>  Estimate
> (Intercept)9.0255e+00
> features_0 0.e+00
> features_1 0.e+00
> features_2 0.e+00
> features_3 0.e+00
> features_4 0.e+00
> features_5 0.e+00
> features_6 0.e+00
> features_7 0.e+00
> {code}






[jira] [Commented] (SPARK-18934) Writing to dynamic partitions does not preserve sort order if spill occurs

2017-04-07 Thread Charles Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961390#comment-15961390
 ] 

Charles Pritchard commented on SPARK-18934:
---

Possibly fixed in https://issues.apache.org/jira/browse/SPARK-19563; it appears to 
be out of scope in Spark per the comment in 
https://issues.apache.org/jira/browse/SPARK-19352

> Writing to dynamic partitions does not preserve sort order if spill occurs
> --
>
> Key: SPARK-18934
> URL: https://issues.apache.org/jira/browse/SPARK-18934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Junegunn Choi
>
> When writing to dynamic partitions, the task sorts the input data by the 
> partition key (also with bucket key if used), so that it can write to one 
> partition at a time using a single writer. And if spill occurs during the 
> process, {{UnsafeSorterSpillMerger}} is used to merge partial sequences of 
> data.
> However, the merge process only considers the partition key, so that the sort 
> order within a partition specified via {{sortWithinPartitions}} or {{SORT 
> BY}} is not preserved.
> We can reproduce the problem in the Spark shell. Make sure to start the shell in 
> local mode with small driver memory (e.g. 1G) so that spills occur.
> {code}
> // FileFormatWriter
> sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
>   .repartition(1, 'part).sortWithinPartitions("value")
>   .write.mode("overwrite").format("orc").partitionBy("part")
>   .saveAsTable("test_sort_within")
> spark.read.table("test_sort_within").show
> {code}
> {noformat}
> +---++
> |  value|part|
> +---++
> |  2|   0|
> |8388610|   0|
> |  4|   0|
> |8388612|   0|
> |  6|   0|
> |8388614|   0|
> |  8|   0|
> |8388616|   0|
> | 10|   0|
> |8388618|   0|
> | 12|   0|
> |8388620|   0|
> | 14|   0|
> |8388622|   0|
> | 16|   0|
> |8388624|   0|
> | 18|   0|
> |8388626|   0|
> | 20|   0|
> |8388628|   0|
> +---++
> {noformat}
> We can confirm the issue using orc dump.
> {noformat}
> > java -jar orc-tools-1.3.0-SNAPSHOT-uber.jar meta -d 
> > part=0/part-r-0-96c022f0-a173-40cc-b2e5-9d02fed4213e.snappy.orc | head 
> > -20
> {"value":2}
> {"value":8388610}
> {"value":4}
> {"value":8388612}
> {"value":6}
> {"value":8388614}
> {"value":8}
> {"value":8388616}
> {"value":10}
> {"value":8388618}
> {"value":12}
> {"value":8388620}
> {"value":14}
> {"value":8388622}
> {"value":16}
> {"value":8388624}
> {"value":18}
> {"value":8388626}
> {"value":20}
> {"value":8388628}
> {noformat}
> {{SparkHiveDynamicPartitionWriterContainer}} has the same problem.
> {code}
> // Insert into an existing Hive table with dynamic partitions
> //   CREATE TABLE TEST_SORT_WITHIN (VALUE INT) PARTITIONED BY (PART INT) 
> STORED AS ORC
> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
> sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
>   .repartition(1, 'part).sortWithinPartitions("value")
>   .write.mode("overwrite").insertInto("test_sort_within")
> spark.read.table("test_sort_within").show
> {code}
> I was able to fix the problem by appending a numeric index column to the 
> sorting key which effectively makes the sort stable. I'll create a pull 
> request on GitHub but since I'm not really familiar with the internals of 
> Spark, I'm not sure if my approach is valid or idiomatic. So please let me 
> know if there are better ways to handle this, or if you want to address the 
> issue differently.






[jira] [Commented] (SPARK-19352) Sorting issues on relatively big datasets

2017-04-07 Thread Charles Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961389#comment-15961389
 ] 

Charles Pritchard commented on SPARK-19352:
---

[~cloud_fan] Is there something on the roadmap to get that guarantee? We need 
guaranteed sorting from a general performance perspective, but it's also a 
baseline feature of Hive (AKA: "SORT BY") to be able to sort data into a file 
in a partition.

> Sorting issues on relatively big datasets
> -
>
> Key: SPARK-19352
> URL: https://issues.apache.org/jira/browse/SPARK-19352
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: Spark version 2.1.0
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_102
> macOS 10.12.3
>Reporter: Ivan Gozali
>
> _More details, including the script to generate the synthetic dataset 
> (requires pandas and numpy) are in this GitHub gist._
> https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f
> Given a relatively large synthetic time series dataset of various users 
> (4.1GB), when attempting to:
> * partition this dataset by user ID
> * sort the time series data for each user by timestamp
> * write each partition to a single CSV file
> then some files are unsorted in a very specific manner. In one of the 
> supposedly sorted files, the rows looked as follows:
> {code}
> 2014-01-01T00:00:00.000-08:00,-0.07,0.39,-0.39
> 2014-12-31T02:07:30.000-08:00,0.34,-0.62,-0.22
> 2014-01-01T00:00:05.000-08:00,-0.07,-0.52,0.47
> 2014-12-31T02:07:35.000-08:00,-0.15,-0.13,-0.14
> 2014-01-01T00:00:10.000-08:00,-1.31,-1.17,2.24
> 2014-12-31T02:07:40.000-08:00,-1.28,0.88,-0.43
> {code}
> The above is attempted using the following Scala/Spark code:
> {code}
> val inpth = "/tmp/gen_data_3cols_small"
> spark
> .read
> .option("inferSchema", "true")
> .option("header", "true")
> .csv(inpth)
> .repartition($"userId")
> .sortWithinPartitions("timestamp")
> .write
> .partitionBy("userId")
> .option("header", "true")
> .csv(inpth + "_sorted")
> {code}
> This issue is not seen when using a smaller sized dataset by making the time 
> span smaller (354MB, with the same number of columns).






[jira] [Created] (SPARK-20259) Support push down join optimizations in DataFrameReader when loading from JDBC

2017-04-07 Thread John Muller (JIRA)
John Muller created SPARK-20259:
---

 Summary: Support push down join optimizations in DataFrameReader 
when loading from JDBC
 Key: SPARK-20259
 URL: https://issues.apache.org/jira/browse/SPARK-20259
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 1.6.2
Reporter: John Muller
Priority: Minor


Given two dataframes loaded from the same JDBC connection:

{code:title=UnoptimizedJDBCJoin.scala|borderStyle=solid}
val ordersDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "northwind.orders")
  .option("user", "username")
  .option("password", "password")
  .load().toDS
  
val productDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "northwind.product")
  .option("user", "username")
  .option("password", "password")
  .load().toDS
  
ordersDF.createOrReplaceTempView("orders")
productDF.createOrReplaceTempView("product")

// Followed by a join between them:
val ordersByProduct = sql("SELECT p.name, SUM(o.qty) AS qty FROM orders AS o 
INNER JOIN product AS p ON o.product_id = p.product_id GROUP BY p.name")
{code}

Catalyst should optimize the query to be:
SELECT northwind.product.name, SUM(northwind.orders.qty)
FROM northwind.orders
INNER JOIN northwind.product ON
  northwind.orders.product_id = northwind.product.product_id
GROUP BY p.name
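
Until such an optimization exists, a hedged sketch of the usual workaround (reusing the 
hypothetical northwind tables and connection options above): the dbtable option also accepts a 
derived table, so the join and aggregation can be evaluated by the database itself rather than 
by Spark:

{code:title=PushedDownViaDbtable.scala|borderStyle=solid}
val ordersByProductPushed = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable",
    """(SELECT p.name, SUM(o.qty) AS qty
      |   FROM northwind.orders AS o
      |   INNER JOIN northwind.product AS p ON o.product_id = p.product_id
      |  GROUP BY p.name) AS orders_by_product""".stripMargin)
  .option("user", "username")
  .option("password", "password")
  .load()
{code}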






[jira] [Assigned] (SPARK-20258) SparkR logistic regression example did not converge in programming guide

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20258:


Assignee: Apache Spark

> SparkR logistic regression example did not converge in programming guide
> 
>
> Key: SPARK-20258
> URL: https://issues.apache.org/jira/browse/SPARK-20258
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>
> SparkR logistic regression example did not converge in programming guide. All 
> estimates are essentially zero:
> {code}
> training2 <- read.df("data/mllib/sample_binary_classification_data.txt", 
> source = "libsvm")
> df_list2 <- randomSplit(training2, c(7,3), 2)
> binomialDF <- df_list2[[1]]
> binomialTestDF <- df_list2[[2]]
> binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")
> 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to 
> singular covariance matrix. Retrying with Quasi-Newton solver.
> > summary(binomialGLM)
> Deviance Residuals: 
> (Note: These are approximate quantiles with relative error <= 0.01)
> Min   1Q   Median   3Q  Max  
> -2.4828e-06  -2.4063e-06   2.2778e-06   2.4350e-06   2.7722e-06  
> Coefficients:
>  Estimate
> (Intercept)9.0255e+00
> features_0 0.e+00
> features_1 0.e+00
> features_2 0.e+00
> features_3 0.e+00
> features_4 0.e+00
> features_5 0.e+00
> features_6 0.e+00
> features_7 0.e+00
> {code}






[jira] [Commented] (SPARK-20258) SparkR logistic regression example did not converge in programming guide

2017-04-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961274#comment-15961274
 ] 

Felix Cheung commented on SPARK-20258:
--

Thanks!

> SparkR logistic regression example did not converge in programming guide
> 
>
> Key: SPARK-20258
> URL: https://issues.apache.org/jira/browse/SPARK-20258
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>
> SparkR logistic regression example did not converge in programming guide. All 
> estimates are essentially zero:
> {code}
> training2 <- read.df("data/mllib/sample_binary_classification_data.txt", 
> source = "libsvm")
> df_list2 <- randomSplit(training2, c(7,3), 2)
> binomialDF <- df_list2[[1]]
> binomialTestDF <- df_list2[[2]]
> binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")
> 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to 
> singular covariance matrix. Retrying with Quasi-Newton solver.
> > summary(binomialGLM)
> Deviance Residuals: 
> (Note: These are approximate quantiles with relative error <= 0.01)
> Min   1Q   Median   3Q  Max  
> -2.4828e-06  -2.4063e-06   2.2778e-06   2.4350e-06   2.7722e-06  
> Coefficients:
>  Estimate
> (Intercept)9.0255e+00
> features_0 0.e+00
> features_1 0.e+00
> features_2 0.e+00
> features_3 0.e+00
> features_4 0.e+00
> features_5 0.e+00
> features_6 0.e+00
> features_7 0.e+00
> {code}






[jira] [Commented] (SPARK-20258) SparkR logistic regression example did not converge in programming guide

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961273#comment-15961273
 ] 

Apache Spark commented on SPARK-20258:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17571

> SparkR logistic regression example did not converge in programming guide
> 
>
> Key: SPARK-20258
> URL: https://issues.apache.org/jira/browse/SPARK-20258
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>
> SparkR logistic regression example did not converge in programming guide. All 
> estimates are essentially zero:
> {code}
> training2 <- read.df("data/mllib/sample_binary_classification_data.txt", 
> source = "libsvm")
> df_list2 <- randomSplit(training2, c(7,3), 2)
> binomialDF <- df_list2[[1]]
> binomialTestDF <- df_list2[[2]]
> binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")
> 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to 
> singular covariance matrix. Retrying with Quasi-Newton solver.
> > summary(binomialGLM)
> Deviance Residuals: 
> (Note: These are approximate quantiles with relative error <= 0.01)
> Min   1Q   Median   3Q  Max  
> -2.4828e-06  -2.4063e-06   2.2778e-06   2.4350e-06   2.7722e-06  
> Coefficients:
>  Estimate
> (Intercept)9.0255e+00
> features_0 0.e+00
> features_1 0.e+00
> features_2 0.e+00
> features_3 0.e+00
> features_4 0.e+00
> features_5 0.e+00
> features_6 0.e+00
> features_7 0.e+00
> {code}






[jira] [Assigned] (SPARK-20258) SparkR logistic regression example did not converge in programming guide

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20258:


Assignee: (was: Apache Spark)

> SparkR logistic regression example did not converge in programming guide
> 
>
> Key: SPARK-20258
> URL: https://issues.apache.org/jira/browse/SPARK-20258
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>
> SparkR logistic regression example did not converge in programming guide. All 
> estimates are essentially zero:
> {code}
> training2 <- read.df("data/mllib/sample_binary_classification_data.txt", 
> source = "libsvm")
> df_list2 <- randomSplit(training2, c(7,3), 2)
> binomialDF <- df_list2[[1]]
> binomialTestDF <- df_list2[[2]]
> binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")
> 17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to 
> singular covariance matrix. Retrying with Quasi-Newton solver.
> > summary(binomialGLM)
> Deviance Residuals: 
> (Note: These are approximate quantiles with relative error <= 0.01)
> Min   1Q   Median   3Q  Max  
> -2.4828e-06  -2.4063e-06   2.2778e-06   2.4350e-06   2.7722e-06  
> Coefficients:
>  Estimate
> (Intercept)9.0255e+00
> features_0 0.e+00
> features_1 0.e+00
> features_2 0.e+00
> features_3 0.e+00
> features_4 0.e+00
> features_5 0.e+00
> features_6 0.e+00
> features_7 0.e+00
> {code}






[jira] [Created] (SPARK-20258) SparkR logistic regression example did not converge in programming guide

2017-04-07 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-20258:
---

 Summary: SparkR logistic regression example did not converge in 
programming guide
 Key: SPARK-20258
 URL: https://issues.apache.org/jira/browse/SPARK-20258
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang


SparkR logistic regression example did not converge in programming guide. All 
estimates are essentially zero:

{code}
training2 <- read.df("data/mllib/sample_binary_classification_data.txt", source 
= "libsvm")
df_list2 <- randomSplit(training2, c(7,3), 2)
binomialDF <- df_list2[[1]]
binomialTestDF <- df_list2[[2]]
binomialGLM <- spark.glm(binomialDF, label ~ features, family = "binomial")


17/04/07 11:42:03 WARN WeightedLeastSquares: Cholesky solver failed due to 
singular covariance matrix. Retrying with Quasi-Newton solver.

> summary(binomialGLM)

Deviance Residuals: 
(Note: These are approximate quantiles with relative error <= 0.01)
Min   1Q   Median   3Q  Max  
-2.4828e-06  -2.4063e-06   2.2778e-06   2.4350e-06   2.7722e-06  

Coefficients:
 Estimate
(Intercept)9.0255e+00
features_0 0.e+00
features_1 0.e+00
features_2 0.e+00
features_3 0.e+00
features_4 0.e+00
features_5 0.e+00
features_6 0.e+00
features_7 0.e+00
{code}






[jira] [Created] (SPARK-20257) Fix test for directory created to work when running as R CMD check

2017-04-07 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-20257:


 Summary: Fix test for directory created to work when running as R 
CMD check
 Key: SPARK-20257
 URL: https://issues.apache.org/jira/browse/SPARK-20257
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Felix Cheung









[jira] [Commented] (SPARK-20257) Fix test for directory created to work when running as R CMD check

2017-04-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961221#comment-15961221
 ] 

Felix Cheung commented on SPARK-20257:
--

Please see the PR https://github.com/apache/spark/pull/17516 for info

And this is the test 
https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3058
"No extra files are created in SPARK_HOME by starting session and making calls"

> Fix test for directory created to work when running as R CMD check
> --
>
> Key: SPARK-20257
> URL: https://issues.apache.org/jira/browse/SPARK-20257
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>







[jira] [Updated] (SPARK-20197) CRAN check fail with package installation

2017-04-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20197:
-
Target Version/s: 2.1.1, 2.2.0
   Fix Version/s: 2.2.0
  2.1.1

> CRAN check fail with package installation 
> --
>
> Key: SPARK-20197
> URL: https://issues.apache.org/jira/browse/SPARK-20197
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
>  Failed 
> -
>   1. Failure: No extra files are created in SPARK_HOME by starting session 
> and making calls (@test_sparkSQL.R#2858)
>   length(sparkRFilesBefore) > 0 isn't true.






[jira] [Commented] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-04-07 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961193#comment-15961193
 ] 

Xin Wu commented on SPARK-20256:


I am working on a fix and creating simulated test cases for this issue. 

> Fail to start SparkContext/SparkSession with Hive support enabled when user 
> does not have read/write privilege to Hive metastore warehouse dir
> --
>
> Key: SPARK-20256
> URL: https://issues.apache.org/jira/browse/SPARK-20256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Xin Wu
>Priority: Critical
>
> In a cluster with a production Hive deployment, when a user wants to run 
> spark-shell against the production Hive metastore, hive-site.xml is copied to 
> SPARK_HOME/conf. When spark-shell starts, it checks whether the "default" 
> database exists in the Hive metastore. Since this user may not have READ/WRITE 
> access to the Hive warehouse directory configured by Hive itself, the resulting 
> permission error prevents spark-shell, or any Spark application with Hive 
> support enabled, from starting at all. 
> Example error:
> {code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
>   at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
>   at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   ... 47 elided
> Caused by: java.lang.reflect.InvocationTargetException: 
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:java.security.AccessControlException: Permission 
> denied: user=notebook, access=READ, 
> inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> );
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> 

[jira] [Updated] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-04-07 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-20256:
---
Description: 
In a cluster with a production Hive deployment, when a user wants to run 
spark-shell against the production Hive metastore, hive-site.xml is copied to 
SPARK_HOME/conf. When spark-shell starts, it checks whether the "default" 
database exists in the Hive metastore. Since this user may not have READ/WRITE 
access to the Hive warehouse directory configured by Hive itself, the resulting 
permission error prevents spark-shell, or any Spark application with Hive 
support enabled, from starting at all. 

Example error:
{code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
  at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: 
org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
);
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
  ... 58 more
Caused by: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at 

[jira] [Resolved] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example

2017-04-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20026.
--
   Resolution: Fixed
 Assignee: Wayne Zhang
Fix Version/s: 2.2.0

> Document R GLM Tweedie family support in programming guide and code example
> ---
>
> Key: SPARK-20026
> URL: https://issues.apache.org/jira/browse/SPARK-20026
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Wayne Zhang
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20256) Fail to start SparkContext/SparkSession with Hive support enabled when user does not have read/write privilege to Hive metastore warehouse dir

2017-04-07 Thread Xin Wu (JIRA)
Xin Wu created SPARK-20256:
--

 Summary: Fail to start SparkContext/SparkSession with Hive support 
enabled when user does not have read/write privilege to Hive metastore 
warehouse dir
 Key: SPARK-20256
 URL: https://issues.apache.org/jira/browse/SPARK-20256
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.1.1, 2.2.0
Reporter: Xin Wu
Priority: Critical


In a cluster setup with a production Hive deployment, when the user wants to run 
spark-shell against the production Hive metastore, hive-site.xml is copied to 
SPARK_HOME/conf. When spark-shell starts, it checks whether the "default" 
database exists in the Hive metastore. Since this user may not have READ/WRITE 
access to the configured Hive warehouse directory, which is owned by Hive 
itself, this permission error prevents spark-shell, or any Spark application 
with Hive support enabled, from starting at all.

Example error:
{code}To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
  at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
  at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
  at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
  ... 47 elided
Caused by: java.lang.reflect.InvocationTargetException: 
org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=notebook, access=READ, inode="/apps/hive/warehouse":hive:hadoop:drwxrwx---
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:320)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1728)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1712)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1686)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8238)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1933)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1455)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
);
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
  ... 58 more
Caused by: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 

[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar

2017-04-07 Thread Paul Zaczkieiwcz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961180#comment-15961180
 ] 

Paul Zaczkieiwcz commented on SPARK-18055:
--

[~marmbrus]: I ran into this issue when using a custom 
org.apache.spark.sql.expressions.Aggregator in Spark 2.0.2.

{code:java}
// the Aggregator definition was elided in the original post
val aggregator: Aggregator = ...
df.groupByKey(s => CookieId(s.cookie_id))
  .agg(aggregator.toColumn)
{code}

I got a very similar {{scala.ScalaReflectionException}}, which is how I found 
this ticket. Is there an easy way around this, short of either converting my 
brand-new {{Aggregator}} into a {{UserDefinedAggregateFunction}} or installing 
a patched version of Spark onto my cluster?

> Dataset.flatMap can't work with types from customized jar
> -
>
> Key: SPARK-18055
> URL: https://issues.apache.org/jira/browse/SPARK-18055
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>Assignee: Michael Armbrust
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
> Attachments: test-jar_2.11-1.0.jar
>
>
> Try to apply flatMap() on a Dataset column which is of type
> com.A.B
> Here's a schema of a dataset:
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- outputs: array (nullable = true)
>  |    |-- element: string
> {code}
> flatMap works on RDD
> {code}
>  ds.rdd.flatMap(_.outputs)
> {code}
> flatMap doesn't work on the Dataset and gives the following error
> {code}
> ds.flatMap(_.outputs)
> {code}
> The exception:
> {code}
> scala.ScalaReflectionException: class com.A.B in JavaMirror … not found
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
> at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
> at 
> line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
> at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
> at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
> at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
> {code}
> Spoke to Michael Armbrust and he confirmed it as a Dataset bug.
> There is a workaround using explode()
> {code}
> ds.select(explode(col("outputs")))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20255) FileIndex hierarchy inconsistency

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20255:


Assignee: Apache Spark

> FileIndex hierarchy inconsistency
> -
>
> Key: SPARK-20255
> URL: https://issues.apache.org/jira/browse/SPARK-20255
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Adrian Ionescu
>Assignee: Apache Spark
>Priority: Minor
>
> Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the 
> following inconsistency: 
> On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and 
> {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements 
> {{listLeafFiles}} which does all the listing of files. However, the latter is 
> only used by {{InMemoryFileIndex}}.
> I'm hereby proposing to move this method (and all its dependencies) to the 
> implementation class that actually uses it, and thus unclutter the 
> {{PartitioningAwareFileIndex}} interface.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20255) FileIndex hierarchy inconsistency

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961173#comment-15961173
 ] 

Apache Spark commented on SPARK-20255:
--

User 'adrian-ionescu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17570

> FileIndex hierarchy inconsistency
> -
>
> Key: SPARK-20255
> URL: https://issues.apache.org/jira/browse/SPARK-20255
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Adrian Ionescu
>Priority: Minor
>
> Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the 
> following inconsistency: 
> On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and 
> {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements 
> {{listLeafFiles}} which does all the listing of files. However, the latter is 
> only used by {{InMemoryFileIndex}}.
> I'm hereby proposing to move this method (and all its dependencies) to the 
> implementation class that actually uses it, and thus unclutter the 
> {{PartitioningAwareFileIndex}} interface.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20255) FileIndex hierarchy inconsistency

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20255:


Assignee: (was: Apache Spark)

> FileIndex hierarchy inconsistency
> -
>
> Key: SPARK-20255
> URL: https://issues.apache.org/jira/browse/SPARK-20255
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Adrian Ionescu
>Priority: Minor
>
> Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the 
> following inconsistency: 
> On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and 
> {{leafDirToChildrenFiles}} as abstract, but on the other it fully implements 
> {{listLeafFiles}} which does all the listing of files. However, the latter is 
> only used by {{InMemoryFileIndex}}.
> I'm hereby proposing to move this method (and all its dependencies) to the 
> implementation class that actually uses it, and thus unclutter the 
> {{PartitioningAwareFileIndex}} interface.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961171#comment-15961171
 ] 

Apache Spark commented on SPARK-20253:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17569

> Remove unnecessary nullchecks of a return value from Spark runtime routines 
> in generated Java code
> --
>
> Key: SPARK-20253
> URL: https://issues.apache.org/jira/browse/SPARK-20253
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> While we know that several Spark runtime routines never return null (e.g. 
> {{UnsafeArrayData.toDoubleArray()}}), the code generated by Catalyst always 
> checks whether the return value is null or not.
> Removing this null check would reduce both the Java bytecode size and the 
> native code size.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20253:


Assignee: (was: Apache Spark)

> Remove unnecessary nullchecks of a return value from Spark runtime routines 
> in generated Java code
> --
>
> Key: SPARK-20253
> URL: https://issues.apache.org/jira/browse/SPARK-20253
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> While we know that several Spark runtime routines never return null (e.g. 
> {{UnsafeArrayData.toDoubleArray()}}), the code generated by Catalyst always 
> checks whether the return value is null or not.
> Removing this null check would reduce both the Java bytecode size and the 
> native code size.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20253:


Assignee: Apache Spark

> Remove unnecessary nullchecks of a return value from Spark runtime routines 
> in generated Java code
> --
>
> Key: SPARK-20253
> URL: https://issues.apache.org/jira/browse/SPARK-20253
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> While we know that several Spark runtime routines never return null (e.g. 
> {{UnsafeArrayData.toDoubleArray()}}), the code generated by Catalyst always 
> checks whether the return value is null or not.
> Removing this null check would reduce both the Java bytecode size and the 
> native code size.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961167#comment-15961167
 ] 

Apache Spark commented on SPARK-20254:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17569

> SPARK-19716 generates unnecessary data conversion for Dataset with primitive 
> array
> --
>
> Key: SPARK-20254
> URL: https://issues.apache.org/jira/browse/SPARK-20254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the 
> current implementation generates {{mapobjects()}} at {{DeserializeToObject}} 
> in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to 
> store an array into {{GenericArrayData}}.
> cc: [~cloud_fan]
>  
> {code}
> val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
> ds.count
> val ds2 = ds.map(e => e)
> ds2.explain(true)
> ds2.show
> {code}
> Plans before SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(upcast(getcolumnbyordinal(0, 
> ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: 
> "scala.Array").toDoubleArray), obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject cast(value#2 as array).toDoubleArray, 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Optimized Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
> +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>+- Scan ExternalRDDScan[obj#1]
> == Physical Plan ==
> *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- *MapElements , obj#24: [D
>+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- *InMemoryTableScan [value#2]
> +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>   +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- Scan ExternalRDDScan[obj#1]
> {code}
> Plans after SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(unresolvedmapobjects(, 
> getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed 

[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data

2017-04-07 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961170#comment-15961170
 ] 

Andrew Ash commented on SPARK-20144:


This is a regression from 1.6 to the 2.x line.  [~marmbrus] recommended 
modifying {{spark.sql.files.openCostInBytes}} as a workaround in this post:

http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-within-partitions-is-not-maintained-in-parquet-td18618.html#a18627
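
A hedged sketch of that workaround (the threshold and path below are 
illustrative; pick a value larger than your biggest file so that small parquet 
files are not packed and reordered into shared partitions):
{code}
// assumes an existing SparkSession named `spark`
spark.conf.set("spark.sql.files.openCostInBytes", (8L * 1024 * 1024 * 1024).toString)
val df = spark.read.parquet("/path/to/data")  // illustrative path
{code}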

> spark.read.parquet no longer maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was produced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20255) FileIndex hierarchy inconsistency

2017-04-07 Thread Adrian Ionescu (JIRA)
Adrian Ionescu created SPARK-20255:
--

 Summary: FileIndex hierarchy inconsistency
 Key: SPARK-20255
 URL: https://issues.apache.org/jira/browse/SPARK-20255
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Adrian Ionescu
Priority: Minor


Trying to get a grip on the {{FileIndex}} hierarchy, I was confused by the 
following inconsistency: 

On the one hand, {{PartitioningAwareFileIndex}} defines {{leafFiles}} and 
{{leafDirToChildrenFiles}} as abstract, but on the other it fully implements 
{{listLeafFiles}} which does all the listing of files. However, the latter is 
only used by {{InMemoryFileIndex}}.

I'm hereby proposing to move this method (and all its dependencies) to the 
implementation class that actually uses it, and thus unclutter the 
{{PartitioningAwareFileIndex}} interface.
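
A rough, illustrative sketch of the proposed shape (simplified stand-in names 
and stubbed bodies; the real Spark classes carry more state and parameters):
{code}
import org.apache.hadoop.fs.{FileStatus, Path}

abstract class PartitioningAwareFileIndexSketch {
  protected def leafFiles: Map[Path, FileStatus]
  protected def leafDirToChildrenFiles: Map[Path, Array[FileStatus]]
  // listLeafFiles is no longer declared or implemented here
}

class InMemoryFileIndexSketch(rootPaths: Seq[Path]) extends PartitioningAwareFileIndexSketch {
  // the bulk listing logic is private to the only class that needs it
  private def listLeafFiles(paths: Seq[Path]): Map[Path, FileStatus] = Map.empty // stub
  override protected def leafFiles: Map[Path, FileStatus] = listLeafFiles(rootPaths)
  override protected def leafDirToChildrenFiles: Map[Path, Array[FileStatus]] = Map.empty // stub
}
{code}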



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20254:


Assignee: (was: Apache Spark)

> SPARK-19716 generates unnecessary data conversion for Dataset with primitive 
> array
> --
>
> Key: SPARK-20254
> URL: https://issues.apache.org/jira/browse/SPARK-20254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the 
> current implementation generates {{mapobjects()}} at {{DeserializeToObject}} 
> in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to 
> store an array into {{GenericArrayData}}.
> cc: [~cloud_fan]
>  
> {code}
> val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
> ds.count
> val ds2 = ds.map(e => e)
> ds2.explain(true)
> ds2.show
> {code}
> Plans before SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(upcast(getcolumnbyordinal(0, 
> ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: 
> "scala.Array").toDoubleArray), obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject cast(value#2 as array).toDoubleArray, 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Optimized Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
> +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>+- Scan ExternalRDDScan[obj#1]
> == Physical Plan ==
> *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- *MapElements , obj#24: [D
>+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- *InMemoryTableScan [value#2]
> +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>   +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- Scan ExternalRDDScan[obj#1]
> {code}
> Plans after SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(unresolvedmapobjects(, 
> getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject [staticinvoke(class 
> 

[jira] [Assigned] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20254:


Assignee: Apache Spark

> SPARK-19716 generates unnecessary data conversion for Dataset with primitive 
> array
> --
>
> Key: SPARK-20254
> URL: https://issues.apache.org/jira/browse/SPARK-20254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the 
> current implementation generates {{mapobjects()}} at {{DeserializeToObject}} 
> in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to 
> store an array into {{GenericArrayData}}.
> cc: [~cloud_fan]
>  
> {code}
> val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
> ds.count
> val ds2 = ds.map(e => e)
> ds2.explain(true)
> ds2.show
> {code}
> Plans before SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(upcast(getcolumnbyordinal(0, 
> ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: 
> "scala.Array").toDoubleArray), obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject cast(value#2 as array).toDoubleArray, 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Optimized Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
> +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>+- Scan ExternalRDDScan[obj#1]
> == Physical Plan ==
> *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- *MapElements , obj#24: [D
>+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- *InMemoryTableScan [value#2]
> +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>   +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- Scan ExternalRDDScan[obj#1]
> {code}
> Plans after SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(unresolvedmapobjects(, 
> getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject [staticinvoke(class 

[jira] [Commented] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961126#comment-15961126
 ] 

Apache Spark commented on SPARK-20254:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17568

> SPARK-19716 generates unnecessary data conversion for Dataset with primitive 
> array
> --
>
> Key: SPARK-20254
> URL: https://issues.apache.org/jira/browse/SPARK-20254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the 
> current implementation generates {{mapobjects()}} at {{DeserializeToObject}} 
> in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to 
> store an array into {{GenericArrayData}}.
> cc: [~cloud_fan]
>  
> {code}
> val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
> ds.count
> val ds2 = ds.map(e => e)
> ds2.explain(true)
> ds2.show
> {code}
> Plans before SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(upcast(getcolumnbyordinal(0, 
> ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: 
> "scala.Array").toDoubleArray), obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject cast(value#2 as array).toDoubleArray, 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Optimized Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
> +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>+- Scan ExternalRDDScan[obj#1]
> == Physical Plan ==
> *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- *MapElements , obj#24: [D
>+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- *InMemoryTableScan [value#2]
> +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>   +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- Scan ExternalRDDScan[obj#1]
> {code}
> Plans after SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(unresolvedmapobjects(, 
> getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed 

[jira] [Updated] (SPARK-20254) SPARK-19716 generates unnecessary data conversion for Dataset with primitive array

2017-04-07 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-20254:
-
Summary: SPARK-19716 generates unnecessary data conversion for Dataset with 
primitive array  (was: SPARK-19716 generates inefficient Java code from a 
primitive array of Dataset)

> SPARK-19716 generates unnecessary data conversion for Dataset with primitive 
> array
> --
>
> Key: SPARK-20254
> URL: https://issues.apache.org/jira/browse/SPARK-20254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Since {{unresolvedmapobjects}} is newly introduced by SPARK-19716, the 
> current implementation generates {{mapobjects()}} at {{DeserializeToObject}} 
> in {{Analyzed Logical Plan}}. This {{mapObject()}} introduces Java code to 
> store an array into {{GenericArrayData}}.
> cc: [~cloud_fan]
>  
> {code}
> val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
> ds.count
> val ds2 = ds.map(e => e)
> ds2.explain(true)
> ds2.show
> {code}
> Plans before SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(upcast(getcolumnbyordinal(0, 
> ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: 
> "scala.Array").toDoubleArray), obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject cast(value#2 as array).toDoubleArray, 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Optimized Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
> +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>+- Scan ExternalRDDScan[obj#1]
> == Physical Plan ==
> *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- *MapElements , obj#24: [D
>+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- *InMemoryTableScan [value#2]
> +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>   +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- Scan ExternalRDDScan[obj#1]
> {code}
> Plans after SPARK-19716
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(unresolvedmapobjects(, 
> getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]

[jira] [Issue Comment Deleted] (SPARK-20243) DebugFilesystem.assertNoOpenStreams thread race

2017-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20243:
--
Comment: was deleted

(was: This needs detail to be a JIRA.)

> DebugFilesystem.assertNoOpenStreams thread race
> ---
>
> Key: SPARK-20243
> URL: https://issues.apache.org/jira/browse/SPARK-20243
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> Introduced by SPARK-19946.
> DebugFilesystem.assertNoOpenStreams gets the size of the openStreams 
> ConcurrentHashMap and then later, if the size was > 0, accesses the first 
> element in openStreams.values. But, the ConcurrentHashMap might be cleared by 
> another thread between getting its size and accessing it, resulting in an 
> exception when trying to call .head on an empty collection.
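
A minimal sketch of the race-free pattern (stand-in names, not the actual Spark 
code): copy the values once, then check and read only the copy, so a concurrent 
clear() cannot strike between the size check and the read.
{code}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

val openStreams = new ConcurrentHashMap[String, Throwable]()  // stand-in for the real map

def assertNoOpenStreamsSketch(): Unit = {
  // single copy of the current values; the checks below cannot race with clear()
  val snapshot = openStreams.values.asScala.toList
  if (snapshot.nonEmpty) {
    throw new IllegalStateException(
      s"there are ${snapshot.size} possibly leaked file streams", snapshot.head)
  }
}
{code}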



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19991) FileSegmentManagedBuffer performance improvement.

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961110#comment-15961110
 ] 

Apache Spark commented on SPARK-19991:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17567
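
The quoted description below boils down to avoiding a per-call exception on the 
hot path. A minimal sketch of that lookup pattern (illustrative class name, no 
caching; not necessarily what the pull request above does):
{code}
import org.apache.hadoop.conf.Configuration

class HadoopConfigProviderSketch(conf: Configuration) {
  def get(name: String, defaultValue: String): String = {
    val value = conf.get(name)  // Hadoop returns null when the key is absent
    if (value != null) value else defaultValue  // no exception allocated on a miss
  }
}
{code}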

> FileSegmentManagedBuffer performance improvement.
> -
>
> Key: SPARK-19991
> URL: https://issues.apache.org/jira/browse/SPARK-19991
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Guoqiang Li
>Priority: Minor
>
> When we do not set the configuration items 
> {{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}}, 
> each call to the FileSegmentManagedBuffer.nioByteBuffer or 
> FileSegmentManagedBuffer.createInputStream method creates a 
> NoSuchElementException instance, which is a time-consuming operation.
> The shuffle-server thread's stack:
> {noformat}
> "shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 
> nid=0x28d12 runnable [0x7f71af93e000]
>java.lang.Thread.State: RUNNABLE
> at java.lang.Throwable.fillInStackTrace(Native Method)
> at java.lang.Throwable.fillInStackTrace(Throwable.java:783)
> - locked <0x0007a930f080> (a java.util.NoSuchElementException)
> at java.lang.Throwable.(Throwable.java:265)
> at java.lang.Exception.(Exception.java:66)
> at java.lang.RuntimeException.(RuntimeException.java:62)
> at 
> java.util.NoSuchElementException.(NoSuchElementException.java:57)
> at 
> org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38)
> at 
> org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31)
> at 
> org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50)
> at 
> org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157)
> at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728)
> at 
> org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835)
> at 
> org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017)
> at 
> org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256)
> at 
> org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
> at 
> org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
> at 
> 

[jira] [Commented] (SPARK-20219) Schedule tasks based on size of input from ShuffledRDD

2017-04-07 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15961104#comment-15961104
 ] 

Imran Rashid commented on SPARK-20219:
--

I'm not saying I don't think this is a good proposal.  I'm just trying to be 
realistic about this change vs. other things in the works, and to try to set 
priorities, to set expectations.

I'll take another quick look at the PR and see if I can think of another way to 
do this.

> Schedule tasks based on size of input from ShuffledRDD
> ---
>
> Key: SPARK-20219
> URL: https://issues.apache.org/jira/browse/SPARK-20219
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: screenshot-1.png
>
>
> When data in a ShuffledRDD is highly skewed, it makes sense to launch the 
> tasks that process much more input as soon as possible. The current 
> scheduling mechanism in *TaskSetManager* is quite simple:
> {code}
>   for (i <- (0 until numTasks).reverse) {
>     addPendingTask(i)
>   }
> {code}
> In the scenario where the "large tasks" sit in the bottom half of the tasks 
> array, launching the tasks with much more input early can significantly 
> reduce the time cost and save resources when *dynamic allocation* is disabled.
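
A standalone sketch of the idea (the names below are illustrative, not Spark's; 
it only shows the ordering, not the TaskSetManager plumbing):
{code}
// Offer the indices of the largest tasks first instead of plain reverse order.
def pendingTaskOrder(inputSizes: Array[Long]): Seq[Int] =
  inputSizes.indices.sortBy(i => -inputSizes(i))

// Example: the 2 GB task (index 1) is enqueued before the 1 MB and 10 KB tasks.
val order = pendingTaskOrder(Array(1L << 20, 2L << 30, 10L << 10))
// order == Vector(1, 0, 2)
{code}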



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20254) SPARK-19716 generates inefficient Java code from a primitive array of Dataset

2017-04-07 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-20254:
-
Description: 
Since {{unresolvedmapobjects}} was newly introduced by SPARK-19716, the current 
implementation generates {{mapobjects()}} at {{DeserializeToObject}} in the 
{{Analyzed Logical Plan}}. This {{mapobjects()}} introduces Java code to store 
an array into {{GenericArrayData}}.
cc: [~cloud_fan]
 
{code}
val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
ds.count
val ds2 = ds.map(e => e)
ds2.explain(true)
ds2.show
{code}

Plans before SPARK-19716
{code}
== Parsed Logical Plan ==
'SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- 'MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, 
ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: 
"scala.Array").toDoubleArray), obj#23: [D
  +- SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
 +- ExternalRDD [obj#1]

== Analyzed Logical Plan ==
value: array
SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- DeserializeToObject cast(value#2 as array).toDoubleArray, obj#23: 
[D
  +- SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
 +- ExternalRDD [obj#1]

== Optimized Logical Plan ==
SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- DeserializeToObject value#2.toDoubleArray, obj#23: [D
  +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas)
+- *SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
   +- Scan ExternalRDDScan[obj#1]

== Physical Plan ==
*SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- *MapElements , obj#24: [D
   +- *DeserializeToObject value#2.toDoubleArray, obj#23: [D
  +- *InMemoryTableScan [value#2]
+- InMemoryRelation [value#2], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
  +- *SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
 +- Scan ExternalRDDScan[obj#1]

{code}

Plans after SPARK-19716
{code}
== Parsed Logical Plan ==
'SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- 'MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- 'DeserializeToObject 
unresolveddeserializer(unresolvedmapobjects(, getcolumnbyordinal(0, 
ArrayType(DoubleType,false)), None).toDoubleArray), obj#23: [D
  +- SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
 +- ExternalRDD [obj#1]

== Analyzed Logical Plan ==
value: array
SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- DeserializeToObject mapobjects(MapObjects_loopValue4, 
MapObjects_loopIsNull4, DoubleType, 
assertnotnull(lambdavariable(MapObjects_loopValue4, MapObjects_loopIsNull4, 
DoubleType, true), - array element class: "scala.Double", - root class: 
"scala.Array"), value#2, None, MapObjects_builderValue4).toDoubleArray, obj#23: 
[D
  +- SerializeFromObject [staticinvoke(class 

[jira] [Updated] (SPARK-20254) SPARK-19716 generates inefficient Java code from a primitive array of Dataset

2017-04-07 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-20254:
-
Summary: SPARK-19716 generates inefficient Java code from a primitive array 
of Dataset  (was: SPARK-19716 generate inefficient Java code from a primitive 
array of Dataset)

> SPARK-19716 generates inefficient Java code from a primitive array of Dataset
> -
>
> Key: SPARK-20254
> URL: https://issues.apache.org/jira/browse/SPARK-20254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Since {{unresolvedmapobjects}} was newly introduced by SPARK-19716, the 
> current implementation generates {{mapobjects()}} at {{DeserializeToObject}}. 
> This {{mapobjects()}} introduces Java code that stores the array into 
> {{GenericArrayData}}.
> cc: [~cloud_fan]
>  
> {code}
> val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
> ds.count
> val ds2 = ds.map(e => e)
> ds2.explain(true)
> ds2.show
> {code}
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(upcast(getcolumnbyordinal(0, 
> ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: 
> "scala.Array").toDoubleArray), obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject cast(value#2 as array).toDoubleArray, 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Optimized Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, 
> deserialized, 1 replicas)
> +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>+- Scan ExternalRDDScan[obj#1]
> == Physical Plan ==
> *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- *MapElements , obj#24: [D
>+- *DeserializeToObject value#2.toDoubleArray, obj#23: [D
>   +- *InMemoryTableScan [value#2]
> +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>   +- *SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- Scan ExternalRDDScan[obj#1]
> {code}
> {code}
> == Parsed Logical Plan ==
> 'SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#25]
> +- 'MapElements , class [D, 
> [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
>+- 'DeserializeToObject 
> unresolveddeserializer(unresolvedmapobjects(, 
> getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), 
> obj#23: [D
>   +- SerializeFromObject [staticinvoke(class 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
> ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
> value#2]
>  +- ExternalRDD [obj#1]
> == Analyzed Logical Plan ==
> value: array
> SerializeFromObject 

[jira] [Created] (SPARK-20254) SPARK-19716 generate inefficient Java code from a primitive array of Dataset

2017-04-07 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20254:


 Summary: SPARK-19716 generate inefficient Java code from a 
primitive array of Dataset
 Key: SPARK-20254
 URL: https://issues.apache.org/jira/browse/SPARK-20254
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


Since {{unresolvedmapobjects}} was newly introduced by SPARK-19716, the current 
implementation generates {{mapobjects()}} at {{DeserializeToObject}}. This 
{{mapobjects()}} introduces Java code that stores the array into 
{{GenericArrayData}}.
cc: [~cloud_fan]
 
{code}
val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
ds.count
val ds2 = ds.map(e => e)
ds2.explain(true)
ds2.show
{code}

{code}
== Parsed Logical Plan ==
'SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- 'MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, 
ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: 
"scala.Array").toDoubleArray), obj#23: [D
  +- SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
 +- ExternalRDD [obj#1]

== Analyzed Logical Plan ==
value: array
SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- DeserializeToObject cast(value#2 as array).toDoubleArray, obj#23: 
[D
  +- SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
 +- ExternalRDD [obj#1]

== Optimized Logical Plan ==
SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- DeserializeToObject value#2.toDoubleArray, obj#23: [D
  +- InMemoryRelation [value#2], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas)
+- *SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
   +- Scan ExternalRDDScan[obj#1]

== Physical Plan ==
*SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- *MapElements , obj#24: [D
   +- *DeserializeToObject value#2.toDoubleArray, obj#23: [D
  +- *InMemoryTableScan [value#2]
+- InMemoryRelation [value#2], true, 1, StorageLevel(disk, 
memory, deserialized, 1 replicas)
  +- *SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
 +- Scan ExternalRDDScan[obj#1]

{code}

{code}
== Parsed Logical Plan ==
'SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- 'MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- 'DeserializeToObject 
unresolveddeserializer(unresolvedmapobjects(, getcolumnbyordinal(0, 
ArrayType(DoubleType,false)), None).toDoubleArray), obj#23: [D
  +- SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#2]
 +- ExternalRDD [obj#1]

== Analyzed Logical Plan ==
value: array
SerializeFromObject [staticinvoke(class 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, 
ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS 
value#25]
+- MapElements , class [D, 
[StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
   +- DeserializeToObject mapobjects(MapObjects_loopValue4, 
MapObjects_loopIsNull4, DoubleType, 
assertnotnull(lambdavariable(MapObjects_loopValue4, MapObjects_loopIsNull4, 
DoubleType, true), - array element class: "scala.Double", - root class: 
"scala.Array"), value#2, None, 

[jira] [Assigned] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19518:


Assignee: Apache Spark

> IGNORE NULLS in first_value / last_value should be supported in SQL statements
> --
>
> Key: SPARK-19518
> URL: https://issues.apache.org/jira/browse/SPARK-19518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ferenc Erdelyi
>Assignee: Apache Spark
>
> https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark 2; 
> however, it does not work in SQL statements, as it is not implemented in Hive 
> yet: https://issues.apache.org/jira/browse/HIVE-11189



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements

2017-04-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19518:


Assignee: (was: Apache Spark)

> IGNORE NULLS in first_value / last_value should be supported in SQL statements
> --
>
> Key: SPARK-19518
> URL: https://issues.apache.org/jira/browse/SPARK-19518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ferenc Erdelyi
>
> https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark 2; 
> however, it does not work in SQL statements, as it is not implemented in Hive 
> yet: https://issues.apache.org/jira/browse/HIVE-11189



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements

2017-04-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960985#comment-15960985
 ] 

Apache Spark commented on SPARK-19518:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17566

> IGNORE NULLS in first_value / last_value should be supported in SQL statements
> --
>
> Key: SPARK-19518
> URL: https://issues.apache.org/jira/browse/SPARK-19518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ferenc Erdelyi
>
> https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark 2; 
> however, it does not work in SQL statements, as it is not implemented in Hive 
> yet: https://issues.apache.org/jira/browse/HIVE-11189
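For context, a small illustration of the gap (my own sketch, not from the ticket; the 
session, toy data and column names below are made up): the DataFrame/Dataset API 
already exposes the ignore-nulls flag, and only the SQL-statement syntax is missing.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().master("local[*]").appName("ignore-nulls-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("a", Option(1)), ("a", None: Option[Int]), ("b", Option(3))).toDF("k", "v")

// DataFrame/Dataset API: works today; nulls are skipped when ignoreNulls = true.
df.groupBy("k").agg(first("v", ignoreNulls = true).as("first_v")).show()

// SQL statements: the standard syntax this ticket asks for would look like
//   SELECT k, FIRST_VALUE(v) IGNORE NULLS OVER (PARTITION BY k ORDER BY v) FROM t
{code}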



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20181) Avoid noisy Jetty WARN log when failing to bind a port

2017-04-07 Thread Derek Dagit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derek Dagit closed SPARK-20181.
---
Resolution: Invalid

This is no longer an issue in master because the log level is already set such 
that the message with the stack trace does not appear.
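For anyone still on older builds, the noise can also be suppressed from user code by 
raising the level of the shaded Jetty logger; a rough sketch (the logger name matches 
the {{org.spark_project.jetty}} package in the stack trace quoted below and may differ 
in other builds):
{code}
import org.apache.log4j.{Level, Logger}

// Hide the BindException stack trace that Jetty logs at WARN while Spark
// retries ephemeral ports. Adjust the logger name to your build's shaded package.
Logger.getLogger("org.spark_project.jetty").setLevel(Level.ERROR)
{code}
The same setting can be applied in log4j.properties instead of code.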

> Avoid noisy Jetty WARN log when failing to bind a port
> --
>
> Key: SPARK-20181
> URL: https://issues.apache.org/jira/browse/SPARK-20181
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Derek Dagit
>Priority: Minor
>
> As a user, I would like to suppress the Jetty WARN log about failing to bind 
> to a port already in use, so that my logs are less noisy.
> Currently, Jetty code prints the stack trace of the BindException at WARN 
> level. In the context of starting a service on an ephemeral port, this is not 
> a useful warning, and it is exceedingly verbose.
> {noformat}
> 17/03/06 14:57:26 WARN AbstractLifeCycle: FAILED 
> ServerConnector@79476a4e{HTTP/1.1}{0.0.0.0:4040}: java.net.BindException: 
> Address already in use
> java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at 
> org.spark_project.jetty.server.ServerConnector.open(ServerConnector.java:321)
>   at 
> org.spark_project.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
>   at 
> org.spark_project.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at org.spark_project.jetty.server.Server.doStart(Server.java:366)
>   at 
> org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
>   at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:306)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:316)
>   at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2175)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2166)
>   at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:316)
>   at org.apache.spark.ui.WebUI.bind(WebUI.scala:139)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at 
> org.apache.spark.SparkContext$$anonfun$10.apply(SparkContext.scala:448)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.SparkContext.(SparkContext.scala:448)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2282)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
>   at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
>   at $line3.$read$$iw$$iw.(:15)
>   at $line3.$read$$iw.(:31)
>   at $line3.$read.(:33)
>   at $line3.$read$.(:37)
>   at $line3.$read$.()
>   at $line3.$eval$.$print$lzycompute(:7)
>   at $line3.$eval$.$print(:6)
>   at $line3.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at 

[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:28 PM:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 13 (so 12 successive 
aggregations).

Please see the following chart:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

For the most part, the amount of time the job takes to complete from n = 13 is 
spent after this message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Except for the master, the cluster remains idle. Right after that, the job 
starts again and completes.


was (Author: quentin):
Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

Please see the following chart:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

For the most part, the amount of time the job takes to complete from n = 12 is 
spent after this message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Except for the master, the cluster remains idle. Right after that, the job 
starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 

[jira] [Commented] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960965#comment-15960965
 ] 

Sean Owen commented on SPARK-20227:
---

See https://issues.apache.org/jira/browse/SPARK-20226 for something possibly 
similar. It'd be useful to get a thread dump to see where the time is being 
spent.
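If attaching jstack to the driver is awkward, a rough in-process alternative (my 
sketch, not from the ticket) is to schedule a stack dump on a daemon thread before 
triggering the join, so that it fires while the cluster appears idle:
{code}
import scala.collection.JavaConverters._

// Dump every live thread's stack ~2 minutes from now, on a daemon thread,
// so the dump fires while the driver is stuck in the long-running call.
val dumper = new Thread(new Runnable {
  def run(): Unit = {
    Thread.sleep(120 * 1000L)
    Thread.getAllStackTraces.asScala.foreach { case (t, frames) =>
      println(s"=== ${t.getName} (${t.getState}) ===")
      frames.foreach(f => println(s"    at $f"))
    }
  }
})
dumper.setDaemon(true)
dumper.start()
{code}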

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 
> 1 from BlockManagerMaster.
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
> 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
> 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
> ip-172-30-0-149.ec2.internal killed by driver.
> 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has 
> been removed (new total is 0)
> {code}
> All executors are inactive and thus killed after 60 seconds, the master 
> spends some CPU on a process that hangs indefinitely, and the workers are 
> idle.
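As an aside (my sketch, not from the ticket): assuming the time goes into planning the 
49 nested window projections built by the withColumn loop, expressing all the 
aggregates in a single select keeps the plan flat. The same idea in Scala, with the 
{{df}} parameter standing for the 50-column DataFrame from the reproduction above:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// One select with all the window aggregates, instead of 49 chained withColumn
// calls that each stack another Project on top of the plan.
def aggregateInOneSelect(df: DataFrame): DataFrame = {
  val keyCol = df.columns.head
  val window = Window.partitionBy(keyCol)
  df.select(col(keyCol) +: df.columns.tail.map(c => sum(col(c)).over(window).as(c)): _*)
}
{code}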



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:26 PM:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following figure:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

For the most part, the amount of time the job takes to complete from n = 12 is 
spent after this message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Except for the master, the cluster remains idle. Right after that, the job 
starts again and completes.


was (Author: quentin):
Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following figure:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 

[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:26 PM:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

Please see the following chart:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

For the most part, the amount of time the job takes to complete from n = 12 is 
spent after this message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Except for the master, the cluster remains idle. Right after that, the job 
starts again and completes.


was (Author: quentin):
Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following figure:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

For the most part, the amount of time the job takes to complete from n = 12 is 
spent after this message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Except for the master, the cluster remains idle. Right after that, the job 
starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 

[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:24 PM:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following figure:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.


was (Author: quentin):
Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following time:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 
> 1 from BlockManagerMaster.
> 17/04/05 14:39:29 INFO 

[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:24 PM:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finished in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following time:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.


was (Author: quentin):
Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finished in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following time:
!https://image.ibb.co/fxRL85/figure_1.png!

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 
> 1 from BlockManagerMaster.
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
> 17/04/05 14:39:29 INFO 

[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:24 PM:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finishes in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following time:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.


was (Author: quentin):
Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finished in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following time:
!https://image.ibb.co/fxRL85/figure_1.png!
n on x axis, time for the job to complete in seconds on y axis.
The job takes as much as 1 hour and 47 minutes to complete for n = 14.

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 
> 1 from BlockManagerMaster.
> 17/04/05 14:39:29 INFO 

[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:23 PM:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finished in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following time:
!https://image.ibb.co/fxRL85/figure_1.png!

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.


was (Author: quentin):
Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finished in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following time:
!https://image.ibb.co/fxRL85/figure_1.png|n on x axis, time for the job to 
complete in seconds on y axis!

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 
> 1 from BlockManagerMaster.
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
> 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
> 17/04/05 14:39:29 

[jira] [Comment Edited] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge edited comment on SPARK-20227 at 4/7/17 3:22 PM:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finished in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following timings:
!https://image.ibb.co/fxRL85/figure_1.png|n on x axis, time for the job to 
complete in seconds on y axis!

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.


was (Author: quentin):
Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finished in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following timings:
!https://ibb.co/d9QrFk|n on x axis, time for the job to complete in seconds on 
y axis!

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> from pyspark.sql import Row, Window
> from pyspark.sql.functions import sum
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 
> 1 from BlockManagerMaster.
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
> 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 

[jira] [Commented] (SPARK-20227) Job hangs when joining a lot of aggregated columns

2017-04-07 Thread Quentin Auge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960953#comment-15960953
 ] 

Quentin Auge commented on SPARK-20227:
--

Well, you were right to ask. After further investigation, it seems the job does 
not hang forever, but takes much longer to finish when n is large enough.

I made a mistake when I mentioned the job finished in a sensible amount of time 
for n = 20. Really, it starts to hang badly from n = 12 (so 11 successive 
aggregations).

See the following timings:
!https://ibb.co/d9QrFk|n on x axis, time for the job to complete in seconds on 
y axis!

The significant amount of time the job takes from n = 12 is spent after this 
message:
{code}
17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
it has been idle for 60 seconds (new desired total will be 0)
17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
executor 1.
17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 
from BlockManagerMaster.
17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
removeExecutor
17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
ip-172-30-0-149.ec2.internal killed by driver.
17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has been 
removed (new total is 0)
{code}

Right after that, the job starts again and completes.

> Job hangs when joining a lot of aggregated columns
> --
>
> Key: SPARK-20227
> URL: https://issues.apache.org/jira/browse/SPARK-20227
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: AWS emr-5.4.0, master: m4.xlarge, core: 4 m4.xlarge
>Reporter: Quentin Auge
>
> I'm trying to replace a lot of different columns in a dataframe with 
> aggregates of themselves, and then join the resulting dataframe.
> {code}
> from pyspark.sql import Row, Window
> from pyspark.sql.functions import sum
> # Create a dataframe with 1 row and 50 columns
> n = 50
> df = sc.parallelize([Row(*range(n))]).toDF()
> cols = df.columns
> # Replace each column values with aggregated values
> window = Window.partitionBy(cols[0])
> for col in cols[1:]:
> df = df.withColumn(col, sum(col).over(window))
> # Join
> other_df = sc.parallelize([Row(0)]).toDF()
> result = other_df.join(df, on = cols[0])
> result.show()
> {code}
> Spark hangs forever when executing the last line. The strange thing is, it 
> depends on the number of columns. Spark does not hang for n = 5, 10, or 20 
> columns. For n = 50 and beyond, it does.
> {code}
> 17/04/05 14:39:28 INFO ExecutorAllocationManager: Removing executor 1 because 
> it has been idle for 60 seconds (new desired total will be 0)
> 17/04/05 14:39:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling 
> executor 1.
> 17/04/05 14:39:29 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Trying to remove executor 
> 1 from BlockManagerMaster.
> 17/04/05 14:39:29 INFO BlockManagerMasterEndpoint: Removing block manager 
> BlockManagerId(1, ip-172-30-0-149.ec2.internal, 35666, None)
> 17/04/05 14:39:29 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
> 17/04/05 14:39:29 INFO YarnScheduler: Executor 1 on 
> ip-172-30-0-149.ec2.internal killed by driver.
> 17/04/05 14:39:29 INFO ExecutorAllocationManager: Existing executor 1 has 
> been removed (new total is 0)
> {code}
> All executors are inactive and thus killed after 60 seconds, the master 
> spends some CPU on a process that hangs indefinitely, and the workers are 
> idle.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2017-04-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960925#comment-15960925
 ] 

Steve Loughran commented on SPARK-2984:
---

For s3a commits, HADOOP-13786 is going to be the fix. This is not some trivial 
"directory missing" problem; it is the fundamental issue that rename() isn't 
how you commit work into an eventually consistent object store whose S3 client 
mimics rename by copying all the data from one blob to another. See [the design 
document|https://github.com/steveloughran/hadoop/blob/s3guard/HADOOP-13786-committer/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3a_committer.md].

It's currently at demo quality: it appears to work, but it has not been tested 
at scale or with fault injection. Netflix have been using the core code, though. 
That works for ORC; Parquet will take a bit more work, because bits of the Spark 
code explicitly look for ParquetOutputCommitter subclasses, so something will 
need to be done there.



> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> 

[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings

2017-04-07 Thread Torsten Scholak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960922#comment-15960922
 ] 

Torsten Scholak edited comment on SPARK-16784 at 4/7/17 3:01 PM:
-

I'm having this exact problem. I need to be able to change the log settings 
depending on the job and/or the application. The method illustrated above, i.e. 
specifying

spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties,
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties

using spark-submit with "--files log4j.properties" does indeed NOT work.
However, I was surprised to find it suggested as a solution on Stack Overflow 
(http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs),
 since it is in direct contradiction to the issue described in this ticket.
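For concreteness, the invocation being described (which does not work in cluster mode) is roughly the following; the application artifact name is a placeholder:

spark-submit --deploy-mode cluster \
  --files log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  my-app.jar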

[~jbacon] have you created a follow-up ticket?


was (Author: tscholak):
I'm having this exact problem. I need to be able to change the log settings 
depending on the job and/or the application. The method illustrated above, i.e. 
specifying

spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties,
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties

using spark-submit with "--files log4j.properties" does indeed NOT work.
However, I was surprised to find it suggested as a solution on Stack Overflow 
(http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs),
 since it is in direct contradiction to issue described in this ticket.

[~jbacon] have you created a follow-up ticket?

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16784) Configurable log4j settings

2017-04-07 Thread Torsten Scholak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960922#comment-15960922
 ] 

Torsten Scholak commented on SPARK-16784:
-

I'm having this exact problem. I need to be able to change the log settings 
depending on the job and/or the application. The method illustrated above, i.e. 
specifying

spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties,
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties

using spark-submit with "--files log4j.properties" does indeed NOT work.
However, I was surprised to find it suggested as a solution on Stack Overflow 
(http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs),
 since it is in direct contradiction to issue described in this ticket.

[~jbacon] have you created a follow-up ticket?

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2017-04-07 Thread Hemang Nagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960906#comment-15960906
 ] 

Hemang Nagar commented on SPARK-2984:
-

Is there any work going on for this issue, or anything related to it? It seems 
nobody has been able to resolve it, and a lot of people, including me, are 
hitting this problem.

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
> {noformat}
> and
> {noformat}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. 
> Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
>   at 
> 

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-07 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960868#comment-15960868
 ] 

Barry Becker commented on SPARK-20226:
--

Only 11 columns. I did not want to wait 10 or 20 minutes on each run, so I only 
used 11. If I went to 14 columns it would take over 10 minutes. I guess I could 
try it again with 14 columns and see how much the flag helps; maybe in that case 
it would make a bigger difference, but even waiting a minute for such a small 
dataset seems too long.

> Call to sqlContext.cacheTable takes an incredibly long time in some cases
> -
>
> Key: SPARK-20226
> URL: https://issues.apache.org/jira/browse/SPARK-20226
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: linux or windows
>Reporter: Barry Becker
>  Labels: cache
> Attachments: profile_indexer2.PNG, xyzzy.csv
>
>
> I have a case where the call to sqlContext.cacheTable can take an arbitrarily 
> long time depending on the number of columns that are referenced in a 
> withColumn expression applied to a dataframe.
> The dataset is small (20 columns 7861 rows). The sequence to reproduce is the 
> following:
> 1) add a new column that references 8 - 14 of the columns in the dataset. 
>- If I add 8 columns, then the call to cacheTable is fast - like *5 
> seconds*
>- If I add 11 columns, then it is slow - like *60 seconds*
>- and if I add 14 columns, then it basically *takes forever* - I gave up 
> after 10 minutes or so.
>   The Column expression that is added, is basically just concatenating 
> the columns together in a single string. If a number is concatenated on a 
> string (or vice versa) the number is first converted to a string.
>   The expression looks something like this:
> {code}
> `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + 
> `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + 
> `Penalty Amount` + `Interest Amount`
> {code}
> which we then convert to a Column expression that looks like this:
> {code}
> UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), 
> UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), 
> UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), 
> UDF('Interest Amount))
> {code}
>where the UDFs are very simple functions that basically call toString 
> and + as needed.
> 2) apply a pipeline that includes some transformers that was saved earlier. 
> Here are the steps of the pipeline (extracted from parquet)
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License
>  Type_CLEANED__","handleInvalid":"skip","outputCol":"License 
> Type_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing
>  Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation
>  Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0",
> "uid":"bucketizer_6f65ca9fa813",
>   "paramMap":{
> "outputCol":"Summons 
> Number_BINNED__","handleInvalid":"keep","splits":["-Inf",1.386630656E9,3.696078592E9,4.005258752E9,6.045063168E9,8.136507392E9,"Inf"],"inputCol":"Summons
>  Number_CLEANED__"
>}
>}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333202079,"sparkVersion":"2.1.0",
> "uid":"bucketizer_f5db4fb8120e",
> "paramMap":{
>  
> 

[jira] [Created] (SPARK-20253) Remove unnecessary nullchecks of a return value from Spark runtime routines in generated Java code

2017-04-07 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20253:


 Summary: Remove unnecessary nullchecks of a return value from 
Spark runtime routines in generated Java code
 Key: SPARK-20253
 URL: https://issues.apache.org/jira/browse/SPARK-20253
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


While we know that several Spark runtime routines never return null (e.g. 
{{UnsafeArrayData.toDoubleArray()}}), the code generated by Catalyst always 
checks whether the return value is null or not.
It would be good to remove these null checks to reduce both the Java bytecode 
size and the native code size.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-07 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960835#comment-15960835
 ] 

Liang-Chi Hsieh commented on SPARK-20226:
-

How many columns were added in the above runs? I didn't see the long running 
time (at least > 10 mins) that you reported in the JIRA description.

For big query plans, constraint propagation hits a combinatorial explosion and 
blocks the driver for a long time, so we have the flag 
"spark.sql.constraintPropagation.enabled" to disable it.

For relatively small query plans (which I suppose the above runs are, given the 
shorter running times), this flag doesn't make a significant difference.

Every time you cache the table after adding a column, the query plan is fully 
planned at that point, so you will not hit the constraint propagation issue.
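For reference, a minimal self-contained sketch of setting the flag mentioned above before caching (the table and column names are placeholders, not from this ticket):
{code}
from pyspark.sql import Row, SQLContext

sqlContext = SQLContext(sc)   # assumes an existing SparkContext `sc`, as in the shell

# Disable constraint propagation before planning/caching a very wide query plan
sqlContext.setConf("spark.sql.constraintPropagation.enabled", "false")

df = sqlContext.createDataFrame([Row(a=1, b="x")])
df.registerTempTable("demo_table")
sqlContext.cacheTable("demo_table")
{code}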

> Call to sqlContext.cacheTable takes an incredibly long time in some cases
> -
>
> Key: SPARK-20226
> URL: https://issues.apache.org/jira/browse/SPARK-20226
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: linux or windows
>Reporter: Barry Becker
>  Labels: cache
> Attachments: profile_indexer2.PNG, xyzzy.csv
>
>
> I have a case where the call to sqlContext.cacheTable can take an arbitrarily 
> long time depending on the number of columns that are referenced in a 
> withColumn expression applied to a dataframe.
> The dataset is small (20 columns 7861 rows). The sequence to reproduce is the 
> following:
> 1) add a new column that references 8 - 14 of the columns in the dataset. 
>- If I add 8 columns, then the call to cacheTable is fast - like *5 
> seconds*
>- If I add 11 columns, then it is slow - like *60 seconds*
>- and if I add 14 columns, then it basically *takes forever* - I gave up 
> after 10 minutes or so.
>   The Column expression that is added, is basically just concatenating 
> the columns together in a single string. If a number is concatenated on a 
> string (or vice versa) the number is first converted to a string.
>   The expression looks something like this:
> {code}
> `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + 
> `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + 
> `Penalty Amount` + `Interest Amount`
> {code}
> which we then convert to a Column expression that looks like this:
> {code}
> UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), 
> UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), 
> UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), 
> UDF('Interest Amount))
> {code}
>where the UDFs are very simple functions that basically call toString 
> and + as needed.
> 2) apply a pipeline that includes some transformers that was saved earlier. 
> Here are the steps of the pipeline (extracted from parquet)
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License
>  Type_CLEANED__","handleInvalid":"skip","outputCol":"License 
> Type_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing
>  Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation
>  Status_IDX__","inputCol":"Violation Status_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.Bucketizer","timestamp":1491333201987,"sparkVersion":"2.1.0",
> "uid":"bucketizer_6f65ca9fa813",
>   "paramMap":{
> "outputCol":"Summons 
> 

[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-04-07 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960806#comment-15960806
 ] 

Barry Becker commented on SPARK-20226:
--

OK, I set the flag using 
sqlContext.setConf("spark.sql.constraintPropagation.enabled", "false") to be 
sure it's set.
Here are my results. I averaged a few runs in each instance to try and get a 
more accurate reading. The variance was pretty high.
{code}
cacheTable   applyPipeline   total    cache after add column   "spark.sql.constraintPropagation.enabled"
----------   -------------   ------   ----------------------   -----------------------------------------
71s          8.5s            1m 30s   no caching               "true"
53s          8.5s            1m 10s   no caching               "false"
65s          8.5s            1m 23s   no caching               nothing set (true is default I assume)
1s           8.4             21s      caching                  nothing set
{code}  
As you can see, it gets a little better with constraintPropagation off, but not 
nearly as good as caching the dataframe before applying the pipeline. Why is 
caching before the pipeline such a big win, if the pipeline stages are applied 
linearly?

> Call to sqlContext.cacheTable takes an incredibly long time in some cases
> -
>
> Key: SPARK-20226
> URL: https://issues.apache.org/jira/browse/SPARK-20226
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: linux or windows
>Reporter: Barry Becker
>  Labels: cache
> Attachments: profile_indexer2.PNG, xyzzy.csv
>
>
> I have a case where the call to sqlContext.cacheTable can take an arbitrarily 
> long time depending on the number of columns that are referenced in a 
> withColumn expression applied to a dataframe.
> The dataset is small (20 columns 7861 rows). The sequence to reproduce is the 
> following:
> 1) add a new column that references 8 - 14 of the columns in the dataset. 
>- If I add 8 columns, then the call to cacheTable is fast - like *5 
> seconds*
>- If I add 11 columns, then it is slow - like *60 seconds*
>- and if I add 14 columns, then it basically *takes forever* - I gave up 
> after 10 minutes or so.
>   The Column expression that is added, is basically just concatenating 
> the columns together in a single string. If a number is concatenated on a 
> string (or vice versa) the number is first converted to a string.
>   The expression looks something like this:
> {code}
> `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + 
> `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + 
> `Penalty Amount` + `Interest Amount`
> {code}
> which we then convert to a Column expression that looks like this:
> {code}
> UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), 
> UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), 
> UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), 
> UDF('Interest Amount))
> {code}
>where the UDFs are very simple functions that basically call toString 
> and + as needed.
> 2) apply a pipeline that includes some transformers that was saved earlier. 
> Here are the steps of the pipeline (extracted from parquet)
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License
>  Type_CLEANED__","handleInvalid":"skip","outputCol":"License 
> Type_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing
>  Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759,"sparkVersion":"2.1.0","uid":"strIdx_5f2de9d9542d","paramMap":{"handleInvalid":"skip","outputCol":"Violation
>  

[jira] [Commented] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row

2017-04-07 Thread Peter Mead (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960762#comment-15960762
 ] 

Peter Mead commented on SPARK-20252:


I'm not sure how this explains why it works the first (and every) time if the 
spark context is not changed. There must be a discrepancy between the way DSE 
creates the spark context the first time through and the way I create it after 
sc.stop().

> java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
> ---
>
> Key: SPARK-20252
> URL: https://issues.apache.org/jira/browse/SPARK-20252
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.3
> Environment: Datastax DSE dual node SPARK cluster
> [cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native 
> protocol v4]
>Reporter: Peter Mead
>
> After starting a spark shell using DSE -u  -p x spark
> scala> case class movie_row (actor: String, character_name: String, video_id: 
> java.util.UUID, video_year: Int, title: String)
> defined class movie_row
> scala> val vids=sc.cassandraTable("hcl","videos_by_actor")
> vids: 
> com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow]
>  = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
> scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor")
> vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = 
> CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15
> scala> vids.count
> res0: Long = 114961
>  Works OK!!
> BUT if the spark context is stopped and recreated THEN:
> scala> sc.stop()
> scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, 
> org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val conf = new SparkConf(true)
> .set("spark.cassandra.connection.host", "redacted")
> .set("spark.cassandra.auth.username", "redacted")
> .set("spark.cassandra.auth.password", "redacted")
> // Exiting paste mode, now interpreting.
> conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342
> scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf)
> sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8
> scala> case class movie_row (actor: String, character_name: String, video_id: 
> java.util.UUID, video_year: Int, title: String)
> defined class movie_row
> scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor")
> vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = 
> CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
> scala> vids.count
> [Stage 0:>  (0 + 2) / 
> 2]WARN  2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: 
> Lost task 0.0 in stage 0.0 (TID 0, cassandra183): 
> java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
> FAILS!!
> I have been unable to get this to work from a remote SPARK shell!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row

2017-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20252.
---
Resolution: Duplicate

This is basically a limitation of how the shell and classloaders work. Simpler 
variants on this work, as do compiled apps. (Also, it's not necessarily going to 
work to stop and restart the context.)

> java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
> ---
>
> Key: SPARK-20252
> URL: https://issues.apache.org/jira/browse/SPARK-20252
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.3
> Environment: Datastax DSE dual node SPARK cluster
> [cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native 
> protocol v4]
>Reporter: Peter Mead
>
> After starting a spark shell using DSE -u  -p x spark
> scala> case class movie_row (actor: String, character_name: String, video_id: 
> java.util.UUID, video_year: Int, title: String)
> defined class movie_row
> scala> val vids=sc.cassandraTable("hcl","videos_by_actor")
> vids: 
> com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow]
>  = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
> scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor")
> vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = 
> CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15
> scala> vids.count
> res0: Long = 114961
>  Works OK!!
> BUT if the spark context is stopped and recreated THEN:
> scala> sc.stop()
> scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, 
> org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val conf = new SparkConf(true)
> .set("spark.cassandra.connection.host", "redacted")
> .set("spark.cassandra.auth.username", "redacted")
> .set("spark.cassandra.auth.password", "redacted")
> // Exiting paste mode, now interpreting.
> conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342
> scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf)
> sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8
> scala> case class movie_row (actor: String, character_name: String, video_id: 
> java.util.UUID, video_year: Int, title: String)
> defined class movie_row
> scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor")
> vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = 
> CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15
> scala> vids.count
> [Stage 0:>  (0 + 2) / 
> 2]WARN  2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: 
> Lost task 0.0 in stage 0.0 (TID 0, cassandra183): 
> java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
> FAILS!!
> I have been unable to get this to work from a remote SPARK shell!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20251) Spark streaming skips batches in a case of failure

2017-04-07 Thread Roman Studenikin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960703#comment-15960703
 ] 

Roman Studenikin commented on SPARK-20251:
--

We've spent quite a lot of time investigating this already, and I didn't find 
anything like that on Google.

This is an app without checkpoints, so the problem happens in the middle of a 
run. The job just fails while producing data to Kafka, and the batch is skipped.

Could you please refer to any specific cases where this behaviour is expected, 
or point to any docs I could read about it?

> Spark streaming skips batches in a case of failure
> --
>
> Key: SPARK-20251
> URL: https://issues.apache.org/jira/browse/SPARK-20251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Roman Studenikin
>
> We are experiencing strange behaviour of spark streaming application. 
> Sometimes it just skips batch in a case of job failure and starts working on 
> the next one.
> We expect it to attempt to reprocess batch, but not to skip it. Is it a bug 
> or we are missing any important configuration params?
> Screenshots from spark UI:
> http://pasteboard.co/1oRW0GDUX.png
> http://pasteboard.co/1oSjdFpbc.png



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19935) SparkSQL unsupports to create a hive table which is mapped for HBase table

2017-04-07 Thread sydt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960682#comment-15960682
 ] 

sydt commented on SPARK-19935:
--

Have you resolved the problem of creating a Hive table mapped to an HBase table in SparkSQL?

> SparkSQL unsupports to create a hive table which is mapped for HBase table
> --
>
> Key: SPARK-19935
> URL: https://issues.apache.org/jira/browse/SPARK-19935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark2.0.2
>Reporter: Xiaochen Ouyang
>
> SparkSQL unsupports the command as following:
>  CREATE TABLE spark_test(key int, value string)   
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")   
> TBLPROPERTIES ("hbase.table.name" = "xyz");



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20252) java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row

2017-04-07 Thread Peter Mead (JIRA)
Peter Mead created SPARK-20252:
--

 Summary: java.lang.ClassNotFoundException: 
$line22.$read$$iwC$$iwC$movie_row
 Key: SPARK-20252
 URL: https://issues.apache.org/jira/browse/SPARK-20252
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.6.3
 Environment: Datastax DSE dual node SPARK cluster
[cqlsh 5.0.1 | Cassandra 3.0.12.1586 | DSE 5.0.7 | CQL spec 3.4.0 | Native 
protocol v4]
Reporter: Peter Mead


After starting a spark shell using DSE -u  -p x spark

scala> case class movie_row (actor: String, character_name: String, video_id: 
java.util.UUID, video_year: Int, title: String)
defined class movie_row

scala> val vids=sc.cassandraTable("hcl","videos_by_actor")
vids: 
com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow]
 = CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15

scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor")
vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = 
CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15

scala> vids.count
res0: Long = 114961
 Works OK!!

BUT if the spark context is stopped and recreated THEN:
scala> sc.stop()

scala> import org.apache.spark.SparkContext, org.apache.spark.SparkContext._, 
org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

scala> :paste
// Entering paste mode (ctrl-D to finish)

val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "redacted")
.set("spark.cassandra.auth.username", "redacted")
.set("spark.cassandra.auth.password", "redacted")

// Exiting paste mode, now interpreting.

conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@e207342

scala> val sc = new SparkContext("spark://192.168.1.84:7077", "pjm", conf)
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@12b8b7c8

scala> case class movie_row (actor: String, character_name: String, video_id: 
java.util.UUID, video_year: Int, title: String)
defined class movie_row

scala> val vids=sc.cassandraTable[movie_row]("hcl","videos_by_actor")
vids: com.datastax.spark.connector.rdd.CassandraTableScanRDD[movie_row] = 
CassandraTableScanRDD[0] at RDD at CassandraRDD.scala:15

scala> vids.count
[Stage 0:>  (0 + 2) / 
2]WARN  2017-04-07 12:52:03,277 org.apache.spark.scheduler.TaskSetManager: 
Lost task 0.0 in stage 0.0 (TID 0, cassandra183): 
java.lang.ClassNotFoundException: $line22.$read$$iwC$$iwC$movie_row
FAILS!!

I have been unable to get this to work from a remote SPARK shell!





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19900.
---
Resolution: Cannot Reproduce

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
>
> I've found some problems when the node where the driver is running has an 
> unstable network. A situation is possible where two identical applications 
> end up running on a cluster.
> *Steps to Reproduce:*
> # prepare 3 node. One for the spark master and two for the spark workers.
> # submit an application with parameter spark.driver.supervise = true
> # go to the node where driver is running (for example spark-worker-1) and 
> close 7077 port
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more than 60 seconds
> # look at the spark master UI
> There are two spark applications and one driver. The new application is in 
> WAITING state and the second application is in RUNNING state. The driver is in 
> RUNNING or RELAUNCHING state (it depends on the resources available, as I 
> understand it) and is launched on another node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 

[jira] [Commented] (SPARK-20251) Spark streaming skips batches in a case of failure

2017-04-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960671#comment-15960671
 ] 

Sean Owen commented on SPARK-20251:
---

This depends on too many things, like how you've set up your app and what 
recovery semantics you've implemented. I'd close this as something that should 
start with more research and a question on the user@ list.
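(For what it's worth, one example of such recovery semantics is checkpoint-based driver recovery; a minimal sketch below, assuming an existing SparkContext {{sc}} and placeholder paths/intervals. Note this covers driver restarts; it does not by itself cause a batch whose jobs failed to be reprocessed.)
{code}
from pyspark.streaming import StreamingContext

def create_context():
    ssc = StreamingContext(sc, 10)                       # 10-second batches (placeholder)
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   # placeholder path
    # ... define DStream transformations and output operations here ...
    return ssc

# Recreate the context from the checkpoint if one exists, otherwise build it fresh
ssc = StreamingContext.getOrCreate("hdfs:///tmp/streaming-checkpoint", create_context)
ssc.start()
ssc.awaitTermination()
{code}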

> Spark streaming skips batches in a case of failure
> --
>
> Key: SPARK-20251
> URL: https://issues.apache.org/jira/browse/SPARK-20251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Roman Studenikin
>
> We are experiencing strange behaviour of a Spark Streaming application: 
> sometimes it just skips a batch after a job failure and starts working on the 
> next one.
> We expect it to attempt to reprocess the batch, not to skip it. Is this a bug, 
> or are we missing an important configuration parameter?
> Screenshots from the Spark UI:
> http://pasteboard.co/1oRW0GDUX.png
> http://pasteboard.co/1oSjdFpbc.png



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20218) '/applications/[app-id]/stages' in REST API,add description.

2017-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20218.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 17534
[https://github.com/apache/spark/pull/17534]

> '/applications/[app-id]/stages' in REST API,add description.
> 
>
> Key: SPARK-20218
> URL: https://issues.apache.org/jira/browse/SPARK-20218
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Trivial
> Fix For: 2.2.0, 2.1.2
>
>
> 1. The '/applications/[app-id]/stages' endpoint in the REST API should document 
> the '?status=[active|complete|pending|failed]' query parameter, which lists only 
> stages in the given state.
> Without this description, users of the API do not know that they can filter the 
> stage list by status.
> 2. For '/applications/[app-id]/stages/[stage-id]' in the REST API, remove the 
> redundant '?status=[active|complete|pending|failed]' description, because a 
> single stage is already determined by the stage-id.
> Code:
> @GET
> def stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
>   val listener = ui.jobProgressListener
>   val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
>   val adjStatuses = {
>     if (statuses.isEmpty()) {
>       Arrays.asList(StageStatus.values(): _*)
>     } else {
>       statuses
>     }
>   }
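
For illustration, one way to exercise the documented status filter against the
monitoring REST API; the host, port and application id below are placeholders,
not values from this issue:

{code}
import scala.io.Source

// Placeholder application id; use the id shown in the UI or the event logs.
val appId = "app-20170407000000-0000"
val url = s"http://localhost:4040/api/v1/applications/$appId/stages?status=active"

// With ?status=active only stages in that state are returned; omitting the
// parameter returns stages in all states.
println(Source.fromURL(url).mkString)
{code}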



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20218) '/applications/[app-id]/stages' in REST API,add description.

2017-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20218:
-

Assignee: guoxiaolongzte
Priority: Trivial  (was: Minor)

> '/applications/[app-id]/stages' in REST API,add description.
> 
>
> Key: SPARK-20218
> URL: https://issues.apache.org/jira/browse/SPARK-20218
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Assignee: guoxiaolongzte
>Priority: Trivial
>
> 1. The '/applications/[app-id]/stages' endpoint in the REST API should document 
> the '?status=[active|complete|pending|failed]' query parameter, which lists only 
> stages in the given state.
> Without this description, users of the API do not know that they can filter the 
> stage list by status.
> 2. For '/applications/[app-id]/stages/[stage-id]' in the REST API, remove the 
> redundant '?status=[active|complete|pending|failed]' description, because a 
> single stage is already determined by the stage-id.
> Code:
> @GET
> def stageList(@QueryParam("status") statuses: JList[StageStatus]): Seq[StageData] = {
>   val listener = ui.jobProgressListener
>   val stageAndStatus = AllStagesResource.stagesAndStatus(ui)
>   val adjStatuses = {
>     if (statuses.isEmpty()) {
>       Arrays.asList(StageStatus.values(): _*)
>     } else {
>       statuses
>     }
>   }



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20251) Spark streaming skips batches in a case of failure

2017-04-07 Thread Roman Studenikin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Studenikin updated SPARK-20251:
-
Description: 
We are experiencing strange behaviour of a Spark Streaming application: sometimes 
it just skips a batch after a job failure and starts working on the next one.
We expect it to attempt to reprocess the batch, not to skip it. Is this a bug, or 
are we missing an important configuration parameter?

Screenshots from the Spark UI:
http://pasteboard.co/1oRW0GDUX.png
http://pasteboard.co/1oSjdFpbc.png

  was:
We are experiencing strange behaviour of a Spark Streaming application: sometimes 
it just skips a batch after a job failure and starts working on the next one.
We expect it to attempt to reprocess the batch, not to skip it. Is this a bug, or 
are we missing an important configuration parameter?


> Spark streaming skips batches in a case of failure
> --
>
> Key: SPARK-20251
> URL: https://issues.apache.org/jira/browse/SPARK-20251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Roman Studenikin
>
> We are experiencing strange behaviour of a Spark Streaming application: 
> sometimes it just skips a batch after a job failure and starts working on the 
> next one.
> We expect it to attempt to reprocess the batch, not to skip it. Is this a bug, 
> or are we missing an important configuration parameter?
> Screenshots from the Spark UI:
> http://pasteboard.co/1oRW0GDUX.png
> http://pasteboard.co/1oSjdFpbc.png



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20251) Spark streaming skips batches in a case of failure

2017-04-07 Thread Roman Studenikin (JIRA)
Roman Studenikin created SPARK-20251:


 Summary: Spark streaming skips batches in a case of failure
 Key: SPARK-20251
 URL: https://issues.apache.org/jira/browse/SPARK-20251
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Roman Studenikin


We are experiencing strange behaviour of a Spark Streaming application: sometimes 
it just skips a batch after a job failure and starts working on the next one.
We expect it to attempt to reprocess the batch, not to skip it. Is this a bug, or 
are we missing an important configuration parameter?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20219) Schedule tasks based on size of input from ScheduledRDD

2017-04-07 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15960608#comment-15960608
 ] 

jin xing commented on SPARK-20219:
--

[~kayousterhout] [~irashid]
Thanks a lot for taking a look at this :) And sorry for the late reply.
The use cases are as below:
1. A Spark SQL job in my cluster. The SQL is quite long and I'm hesitant to post 
it here (I can post it later if people want to see it :)). There are 3 stages in 
the job: Stage-1, Stage-2 and Stage-3. Stage-3 shuffle-reads from Stage-1 and 
Stage-2 and has 2000 partitions (we set spark.sql.shuffle.partitions=2000). The 
distribution of the shuffle-read sizes is in the attached screenshot.
Running with the change in the PR, the total time cost of Stage-3 is 3654 
seconds; without the change it is 4934 seconds. I supplied 50 executors to 
Stage-3 (this is common in a data warehouse when the job fails to acquire enough 
containers from YARN). I think the improvement here is a good one.

2. I also did a small test in my local environment. The code is below:
{code}
// Count how many times each line (a name) occurs across 9 input partitions.
val rdd = sc.textFile("/tmp/data", 9)
rdd.map {
  case num =>
    (num, 1)
}.groupByKey.map {
  case (key, iter) =>
    (key, iter.size)  // number of occurrences of this key
}.collect.foreach(println)
{code}
There are 200m lines in the RDD; the content is people's names. In the 
ResultStage, the first 8 partitions are almost the same size and the 9th 
partition is 10 times the size of the first 8.
Running with the change, the result is:
17/04/07 11:50:52 INFO DAGScheduler: ResultStage 1 (collect at 
SparkArchetype.scala:26) finished in 23.027 s.
Running without the change, the result is:
17/04/07 11:54:27 INFO DAGScheduler: ResultStage 1 (collect at 
SparkArchetype.scala:26) finished in 34.546 s.

In my warehouse there are lots of cases like the first one described above, so I 
really hope this idea can be taken into consideration. I'm sorry to bring in the 
extra complexity, and I'd be very thankful for any advice on a better 
implementation.
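
To make the ordering idea concrete, a standalone sketch (the input sizes are
invented and this is not the actual TaskSetManager patch):

{code}
// Assumed per-partition shuffle-read sizes in bytes (illustrative values only).
val inputSizes: Array[Long] = Array(10L, 500L, 20L, 700L, 15L)
val numTasks = inputSizes.length

// Current behaviour: pending tasks are added purely by reversed index.
val defaultOrder = (0 until numTasks).reverse

// Proposed behaviour: add (and hence launch) the tasks with the largest input first.
val largestFirst = (0 until numTasks).sortBy(i => -inputSizes(i))

println(s"default order: ${defaultOrder.mkString(", ")}")
println(s"largest-first: ${largestFirst.mkString(", ")}")
{code}

With skewed sizes like these, the largest-first order starts the long-running
tasks at the beginning of the stage instead of leaving them for the tail.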

> Schedule tasks based on size of input from ScheduledRDD
> ---
>
> Key: SPARK-20219
> URL: https://issues.apache.org/jira/browse/SPARK-20219
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: screenshot-1.png
>
>
> When data is highly skewed on a ShuffledRDD, it makes sense to launch the 
> tasks that process much more input as soon as possible. The current 
> scheduling mechanism in *TaskSetManager* is quite simple:
> {code}
> for (i <- (0 until numTasks).reverse) {
>   addPendingTask(i)
> }
> {code}
> In the scenario where the "large tasks" sit in the bottom half of the tasks 
> array, launching the tasks with much more input early can significantly 
> reduce the time cost and save resources when *"dynamic allocation"* is 
> disabled.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19282) RandomForestRegressionModel summary should expose getMaxDepth

2017-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19282:
--
Fix Version/s: (was: 2.2.0)

> RandomForestRegressionModel summary should expose getMaxDepth
> -
>
> Key: SPARK-19282
> URL: https://issues.apache.org/jira/browse/SPARK-19282
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SparkR
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Assignee: Xin Ren
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision tree models allow
> {{regressor.getMaxDepth()}} 
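
For comparison, a small sketch against the Scala API, where the trained model
does expose the getter directly (the toy data and local session below are
illustrative; the request here is to surface the same getter through the
PySpark/SparkR wrappers):

{code}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("rf-maxdepth").getOrCreate()
import spark.implicits._

// Tiny toy dataset, just enough to fit a model.
val df = Seq((1.0, 2.0, 3.0), (2.0, 3.0, 5.0), (3.0, 4.0, 7.0), (4.0, 5.0, 9.0))
  .toDF("x1", "x2", "label")
val assembled = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
  .transform(df)

val model = new RandomForestRegressor().setMaxDepth(4).fit(assembled)
println(model.getMaxDepth)  // prints 4 on the Scala model
{code}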



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19522) --executor-memory flag doesn't work in local-cluster mode

2017-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19522:
--
Target Version/s: 2.0.3, 2.1.2, 2.2.0  (was: 2.0.3, 2.1.1)

> --executor-memory flag doesn't work in local-cluster mode
> -
>
> Key: SPARK-19522
> URL: https://issues.apache.org/jira/browse/SPARK-19522
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> bin/spark-shell --master local-cluster[2,1,2048]
> {code}
> doesn't do what you think it does. You'll end up getting executors with 1GB 
> each because that's the default. This is because the executor memory flag 
> isn't actually read in local-cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19035) rand() function in case when cause failed

2017-04-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19035:
--
Target Version/s: 2.0.3, 2.1.2, 2.2.0  (was: 2.0.3, 2.1.1, 2.2.0)

> rand() function in case when cause failed
> -
>
> Key: SPARK-19035
> URL: https://issues.apache.org/jira/browse/SPARK-19035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Feng Yuan
>
> *In this case:*
> select
>   case when a=1 then 1 else concat(a, cast(rand() as string)) end b,
>   count(1)
> from
>   yuanfeng1_a
> group by
>   case when a=1 then 1 else concat(a, cast(rand() as string)) end;
> *It throws the error:*
> Error in query: expression 'yuanfeng1_a.`a`' is neither present in the group 
> by, nor is it an aggregate function. Add to group by or wrap in first() (or 
> first_value) if you don't care which value you get.;;
> Aggregate [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE 
> concat(cast(a#2075 as string), cast(rand(519367429988179997) as string)) 
> END], [CASE WHEN (a#2075 = 1) THEN cast(1 as string) ELSE concat(cast(a#2075 
> as string), cast(rand(8090243936131101651) as string)) END AS b#2074]
> +- MetastoreRelation default, yuanfeng1_a
> Note that the two rand() calls are instantiated with different seeds, so the 
> grouping expression and the select expression no longer match.
> select case when a=1 then 1 else rand() end b, count(1) from yuanfeng1_a group 
> by case when a=1 then rand() end also outputs this error.
> *Notice*:
> If rand() is replaced with 1, it works.
> A simpler way to reproduce this bug: `SELECT a + rand() FROM t GROUP BY a + 
> rand()`.
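
A possible workaround sketch, not taken from the ticket: evaluate the
non-deterministic expression once in a subquery and group by the resulting
column, so that the grouping and select expressions stay identical (`spark` is
assumed to be an existing SparkSession and the alias names are made up):

{code}
val result = spark.sql("""
  SELECT b, count(1) AS cnt
  FROM (
    SELECT CASE WHEN a = 1 THEN '1'
                ELSE concat(cast(a AS string), cast(rand() AS string)) END AS b
    FROM yuanfeng1_a
  ) t
  GROUP BY b
""")
result.show()
{code}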



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20219) Schedule tasks based on size of input from ScheduledRDD

2017-04-07 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing updated SPARK-20219:
-
Attachment: screenshot-1.png

> Schedule tasks based on size of input from ScheduledRDD
> ---
>
> Key: SPARK-20219
> URL: https://issues.apache.org/jira/browse/SPARK-20219
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: screenshot-1.png
>
>
> When data is highly skewed on a ShuffledRDD, it makes sense to launch the 
> tasks that process much more input as soon as possible. The current 
> scheduling mechanism in *TaskSetManager* is quite simple:
> {code}
> for (i <- (0 until numTasks).reverse) {
>   addPendingTask(i)
> }
> {code}
> In the scenario where the "large tasks" sit in the bottom half of the tasks 
> array, launching the tasks with much more input early can significantly 
> reduce the time cost and save resources when *"dynamic allocation"* is 
> disabled.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20250) Improper OOM error when a task been killed while spilling data

2017-04-07 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-20250:
-
Description: 
When a task is calling spill() and receives a kill request from the driver 
(e.g., because it is a speculative task), the TaskMemoryManager will throw an 
OOM exception.
The executor then treats it as an UncaughtException, which is handled by 
SparkUncaughtExceptionHandler, and the executor is consequently shut down.
This error may in turn lead to failure of the whole application due to 
"max number of executor failures (30) reached".
In our production environment we have encountered a lot of such cases.
\\
{noformat}
17/04/05 06:41:27 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data 
of 928.0 MB to disk (1 time so far)
17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Spill 
file:/data/usercache/application_1482394966158_87487271/blockmgr-85c25fa8-06b4/32/temp_local_b731
17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
17/04/05 06:41:30 INFO executor.Executor: Executor is trying to kill task 16.0 
in stage 3.0 (TID 857)
17/04/05 06:41:30 ERROR memory.TaskMemoryManager: error while calling spill() 
on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
java.nio.channels.ClosedByInterruptException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
at 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
at 
org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:302)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:346)

17/04/05 06:41:30 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data 
of 928.0 MB to disk (2  times so far)
17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Spill 
file:/data/usercache/appcache/application_1482394966158_87487271/blockmgr-573312a3-bd46-4c5c-9293-1021cc34c77
17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
17/04/05 06:41:31 ERROR memory.TaskMemoryManager: error while calling spill() 
on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
java.nio.channels.ClosedByInterruptException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
at 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
at 
org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)

.
17/04/05 06:41:31 WARN memory.TaskMemoryManager: leak 32.0 KB memory from 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@513661a6
17/04/05 06:41:31 ERROR executor.Executor: Managed memory leak detected; size = 
26010016 bytes, TID = 857
17/04/05 06:41:31 ERROR executor.Executor: Exception in task 16.0 in stage 3.0 
(TID 857)
java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed : 
null
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:178)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
  

[jira] [Updated] (SPARK-20250) Improper OOM error when a task been killed while spilling data

2017-04-07 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-20250:
-
Description: 
When a task is calling spill() and receives a kill request from the driver 
(e.g., because it is a speculative task), the TaskMemoryManager will throw an 
OOM exception.
The executor then treats it as an UncaughtException, which is handled by 
SparkUncaughtExceptionHandler, and the executor is consequently shut down.
This error may in turn lead to failure of the whole application due to 
"max number of executor failures (30) reached".
In our production environment we have encountered a lot of such cases.
\\
{noformat}
17/04/05 06:41:27 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data 
of 928.0 MB to disk (1 time so far)
17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Spill 
file:/data/usercache/application_1482394966158_87487271/blockmgr-85c25fa8-06b4/32/temp_local_b731
17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
17/04/05 06:41:30 INFO executor.Executor: Executor is trying to kill task 16.0 
in stage 3.0 (TID 857)
17/04/05 06:41:30 ERROR memory.TaskMemoryManager: error while calling spill() 
on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
java.nio.channels.ClosedByInterruptException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
at 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
at 
org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:302)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:346)

17/04/05 06:41:30 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort data 
of 928.0 MB to disk (2  times so far)
17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Spill 
file:/data/usercache/appcache/application_1482394966158_87487271/blockmgr-573312a3-bd46-4c5c-9293-1021cc34c77
17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152
17/04/05 06:41:31 ERROR memory.TaskMemoryManager: error while calling spill() 
on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed
java.nio.channels.ClosedByInterruptException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
at 
org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228)
at 
org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)

.
17/04/05 06:41:31 WARN memory.TaskMemoryManager: leak 32.0 KB memory from 
org.apache.spark.shuffle.sort.ShuffleExternalSorter@513661a6
17/04/05 06:41:31 ERROR executor.Executor: Managed memory leak detected; size = 
26010016 bytes, TID = 857
17/04/05 06:41:31 ERROR executor.Executor: Exception in task 16.0 in stage 3.0 
(TID 857)
java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed : 
null
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:178)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83)
