[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-30 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 6:55 AM:


[~mgaido]
[~srowen]
I have now tried with the master branch.
The problem is still there. (Important: TextFile is the default value of the 
hive.default.fileformat parameter. If I set hive.default.fileformat=Parquet, the 
problem goes away!)
Steps:
1. Download, install, and run Hive SQL (hive-1.2.1; this shows that my Hive setup is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, install, and run spark-sql (spark-master, built from the latest master 
commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the spark-sql Thrift Server.
First run, spark-sql result: GOOD
Second run, spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!




was (Author: zhangxin0112zx):
[~mgaido]
[~srowen]
I have now tried with the master branch.
The problem is still there.
Steps:
1. Download, install, and run Hive SQL (hive-1.2.1; this shows that my Hive setup is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, install, and run spark-sql (spark-master, built from the latest master 
commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the spark-sql Thrift Server.
First run, spark-sql result: GOOD
Second run, spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!



> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation about Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.

[jira] [Comment Edited] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-30 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226337#comment-16226337
 ] 

xinzhang edited comment on SPARK-21725 at 10/31/17 6:43 AM:


[~mgaido]
[~srowen]
I have now tried with the master branch.
The problem is still there.
Steps:
1. Download, install, and run Hive SQL (hive-1.2.1; this shows that my Hive setup is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, install, and run spark-sql (spark-master, built from the latest master 
commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the spark-sql Thrift Server.
First run, spark-sql result: GOOD
Second run, spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!




was (Author: zhangxin0112zx):
I have now tried with the master branch.
The problem is still there.
Steps:
1. Download, install, and run Hive SQL (hive-1.2.1; this shows that my Hive setup is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, install, and run spark-sql (spark-master, built from the latest master 
commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the spark-sql Thrift Server.
First run, spark-sql result: GOOD
Second run, spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!



> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation about Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is OK with 
> non-partitioned tables. Does this mean Spark does not use its own Parquet support here?

[jira] [Commented] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-10-30 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226337#comment-16226337
 ] 

xinzhang commented on SPARK-21725:
--

I have now tried with the master branch.
The problem is still there.
Steps:
1. Download, install, and run Hive SQL (hive-1.2.1; this shows that my Hive setup is OK).
!https://user-images.githubusercontent.com/8244097/32210043-7554300e-be46-11e7-8ce0-f61bc0bfa998.png!

2. Download, install, and run spark-sql (spark-master, built from the latest master 
commit 44c4003155c1d243ffe0f73d5537b4c8b3f3b564).
First run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210200-5b02de20-be47-11e7-8eac-e0228a7cf7f5.png!

Second run, spark-sql result: GOOD
!https://user-images.githubusercontent.com/8244097/32210320-f518aa12-be47-11e7-9a86-a16819583748.png!

3. Use the spark-sql Thrift Server.
First run, spark-sql result: GOOD
Second run, spark-sql result: BAD
!https://user-images.githubusercontent.com/8244097/32210560-47d431da-be49-11e7-8279-7dd88dda42a6.png!



> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> Use the Thrift Server to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation about Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is OK with 
> non-partitioned tables. Does this mean Spark does not use its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-30 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226307#comment-16226307
 ] 

yuhao yang commented on SPARK-13030:


Sorry for jumping in so late. I can see there has already been a lot of effort here.

As far as I understand, making OneHotEncoder an Estimator is essentially meant to 
satisfy the requirement that OneHotEncoder produce a consistent dimension and mapping 
during training and prediction.

To achieve the same goal, could we just add an optional numCategory: IntParam (or call 
it size) as a parameter of OneHotEncoder? If it is set, then every output vector would 
have size numCategory, and any index out of the bounds of numCategory could be resolved 
by handleInvalid. IMO this is a comparably much simpler and more robust solution, and it 
is totally backwards compatible.
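A minimal sketch, assuming Spark ML's param API, of how such an optional size parameter 
could be declared; the trait and the numCategory name here are purely illustrative and 
are not part of the actual OneHotEncoder:

{code}
import org.apache.spark.ml.param.{IntParam, Params, ParamValidators}

// Hypothetical sketch only: an optional, user-supplied output size for the encoder.
trait HasNumCategory extends Params {
  // Fixed length of the one-hot output vector; must be positive when set.
  final val numCategory: IntParam = new IntParam(this, "numCategory",
    "fixed size of the one-hot output vector", ParamValidators.gt(0))

  final def getNumCategory: Int = $(numCategory)
}
{code}

An encoder mixing in such a trait could use the value when it is set, fall back to 
deriving the size from the data otherwise, and delegate out-of-range indices to 
handleInvalid.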


> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22327) R CRAN check fails on non-latest branches

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226252#comment-16226252
 ] 

Apache Spark commented on SPARK-22327:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/19620

> R CRAN check fails on non-latest branches
> -
>
> Key: SPARK-22327
> URL: https://issues.apache.org/jira/browse/SPARK-22327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.4, 2.0.3, 2.1.3, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> with this warning:
> * checking CRAN incoming feasibility ... WARNING
> Maintainer: 'Shivaram Venkataraman '
> Insufficient package version (submitted: 2.0.3, existing: 2.1.2)
> Unknown, possibly mis-spelled, fields in DESCRIPTION:
>   'RoxygenNote'
> WARNING: There was 1 warning.
> NOTE: There were 2 notes.
> We have seen this in branch-1.6 and branch-2.0, and it would be a problem for 
> branch-2.1 after we ship 2.2.1.
> The root cause is the package version check in the CRAN check. Once SparkR 
> package version 2.1.2 was first published, any older version fails the version 
> check. As far as we know, there is no way to skip this version check.
> Also, there was previously a NOTE about a new maintainer. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22327) R CRAN check fails on non-latest branches

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226249#comment-16226249
 ] 

Apache Spark commented on SPARK-22327:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/19619

> R CRAN check fails on non-latest branches
> -
>
> Key: SPARK-22327
> URL: https://issues.apache.org/jira/browse/SPARK-22327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.4, 2.0.3, 2.1.3, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>
> with this warning:
> * checking CRAN incoming feasibility ... WARNING
> Maintainer: 'Shivaram Venkataraman '
> Insufficient package version (submitted: 2.0.3, existing: 2.1.2)
> Unknown, possibly mis-spelled, fields in DESCRIPTION:
>   'RoxygenNote'
> WARNING: There was 1 warning.
> NOTE: There were 2 notes.
> We have seen this in branch-1.6 and branch-2.0, and it would be a problem for 
> branch-2.1 after we ship 2.2.1.
> The root cause is the package version check in the CRAN check. Once SparkR 
> package version 2.1.2 was first published, any older version fails the version 
> check. As far as we know, there is no way to skip this version check.
> Also, there was previously a NOTE about a new maintainer. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226248#comment-16226248
 ] 

Apache Spark commented on SPARK-5484:
-

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19618

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: dingding
> Fix For: 2.2.0, 2.3.0
>
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.
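For readers of the archive, a rough sketch of the general technique (periodic 
checkpointing inside an iterative loop to truncate lineage); this is illustrative only 
and is not the actual Pregel patch: the step function, the interval, and the assumption 
that a checkpoint directory has been configured are all placeholders.

{code}
import org.apache.spark.graphx.Graph

// Illustrative only: run `step` repeatedly and checkpoint the graph every
// `checkpointInterval` iterations so the lineage chain stays short.
// Assumes SparkContext.setCheckpointDir(...) has already been called.
def iterateWithCheckpoints[VD, ED](
    graph: Graph[VD, ED],
    maxIterations: Int,
    checkpointInterval: Int)(step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var g = graph
  var i = 0
  while (i < maxIterations) {
    g = step(g).cache()
    if ((i + 1) % checkpointInterval == 0) {
      g.checkpoint() // marks the graph for checkpointing, cutting lineage once materialized
    }
    i += 1
  }
  g
}
{code}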



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-30 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-22347:

Issue Type: Documentation  (was: Bug)

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Priority: Minor
>
> Here's a simple example of how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
>     def fn(value): return 10 / int(value)
>     return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.
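A hedged workaround sketch, shown in Scala since the short-circuiting guarantee in 
question is engine-level rather than PySpark-specific: until the behavior is 
documented, the usual defensive pattern is to make the UDF itself tolerate the values 
the `when` branch was meant to filter out. The names below are illustrative.

{code}
import org.apache.spark.sql.functions.{col, udf, when}

// Returns None (null) instead of throwing when x <= 0, so it stays safe even if
// Spark evaluates the UDF for rows where the `when` condition is false.
val divide10 = udf((x: Int) => if (x > 0) Some(10 / x) else None)

// Usage sketch, assuming a DataFrame `df` with an integer column "x":
// df.select(when(col("x") > 0, divide10(col("x"))))
{code}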



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226236#comment-16226236
 ] 

Apache Spark commented on SPARK-22347:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19617

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Priority: Minor
>
> Here's a simple example of how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
>     def fn(value): return 10 / int(value)
>     return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-30 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226197#comment-16226197
 ] 

Felix Cheung commented on SPARK-22344:
--

Yes, I think we should do just that. We might also need to delete the parent .cache 
directory, though.




> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package, it leaves behind files in /tmp, 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below:
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22401) Missing 2.1.2 tag in git

2017-10-30 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226178#comment-16226178
 ] 

Xin Lu commented on SPARK-22401:


[~holdenk] is this just a new process? 

> Missing 2.1.2 tag in git
> 
>
> Key: SPARK-22401
> URL: https://issues.apache.org/jira/browse/SPARK-22401
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy
>Affects Versions: 2.1.2
>Reporter: Brian Barker
>Priority: Minor
>
> We only saw a 2.1.2-rc4 tag in git, with no official release tag. The releases web 
> page shows that 2.1.2 was released on October 9.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11502) Word2VecSuite needs appropriate checks

2017-10-30 Thread Joseph Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226149#comment-16226149
 ] 

Joseph Peng commented on SPARK-11502:
-

I have looked into gensim's unit tests for Word2Vec. They basically do three 
things:
1. Check that everything is stored and loaded correctly.
2. Check that the vocabulary is built correctly. This is not easy for us to do 
because the related methods are all private.
3. They do not check that expected == model; instead, they check that 
expected.shape == model.shape. Is there any better test available? I am not very 
optimistic. 
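A hedged sketch of what a shape-style check could look like against the spark.ml 
Word2Vec API; the toy corpus, seed, and expected counts below are illustrative and are 
not taken from the actual Word2VecSuite:

{code}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("w2v-shape-check").getOrCreate()

// Two tiny documents with five distinct tokens in total.
val docDF = spark.createDataFrame(Seq(
  "spark graphx pregel".split(" "),
  "spark sql parquet".split(" ")
).map(Tuple1.apply)).toDF("text")

val model = new Word2Vec()
  .setInputCol("text").setOutputCol("result")
  .setVectorSize(3).setMinCount(0).setSeed(42L)
  .fit(docDF)

// Check shapes rather than exact values, which depend on the RNG and seed.
assert(model.getVectors.count() == 5)
assert(model.getVectors.select("vector").head().getAs[Vector](0).size == 3)
{code}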

> Word2VecSuite needs appropriate checks
> --
>
> Key: SPARK-11502
> URL: https://issues.apache.org/jira/browse/SPARK-11502
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Imran Rashid
>Priority: Minor
>
> When the random number generator was changed slightly in SPARK-10116, 
> {{ml.feature.Word2VecSuite}} started failing.  The tests were updated to have 
> "magic" values, to at least provide minimal characterization tests preventing 
> more behavior changes.  However, the tests really need to be improved to have 
> more sensible checks.
> Note that a brute-force search over seeds from 0 to 1000 failed to find one 
> that worked -- the input data seems to have been carefully chosen for the 
> previous random number generator / seed combo.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20077) Documentation for ml.stats.Correlation

2017-10-30 Thread Joseph Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226077#comment-16226077
 ] 

Joseph Peng commented on SPARK-20077:
-

I have read the documentation, and I believe this issue is resolved. 

> Documentation for ml.stats.Correlation
> --
>
> Key: SPARK-20077
> URL: https://issues.apache.org/jira/browse/SPARK-20077
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> Now that (Pearson) correlations are available in spark.ml, we need to write 
> some documentation to go along with this feature. For now, it can simply be 
> based on the examples in the unit tests.
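For reference, a minimal usage sketch of the spark.ml correlation API that such 
documentation would cover, assuming the org.apache.spark.ml.stat.Correlation class 
available in recent releases; the data and the choice of the Pearson method are 
illustrative:

{code}
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[2]").appName("corr-example").getOrCreate()

val data = Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

// Pearson correlation matrix of the columns of the "features" vectors.
val Row(pearson: Matrix) = Correlation.corr(df, "features", "pearson").head()
println(pearson)
{code}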



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22404) Provide an option to use unmanaged AM in yarn-client mode

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22404:


Assignee: Apache Spark

> Provide an option to use unmanaged AM in yarn-client mode
> -
>
> Key: SPARK-22404
> URL: https://issues.apache.org/jira/browse/SPARK-22404
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Assignee: Apache Spark
>
> There was an earlier issue, SPARK-1200, to provide such an option, but it was 
> closed without being fixed.
> Using an unmanaged AM in yarn-client mode would allow apps to start up 
> faster by not requiring the container-launcher AM to be launched on the 
> cluster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22404) Provide an option to use unmanaged AM in yarn-client mode

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226025#comment-16226025
 ] 

Apache Spark commented on SPARK-22404:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/19616

> Provide an option to use unmanaged AM in yarn-client mode
> -
>
> Key: SPARK-22404
> URL: https://issues.apache.org/jira/browse/SPARK-22404
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>
> There was an earlier issue, SPARK-1200, to provide such an option, but it was 
> closed without being fixed.
> Using an unmanaged AM in yarn-client mode would allow apps to start up 
> faster by not requiring the container-launcher AM to be launched on the 
> cluster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22404) Provide an option to use unmanaged AM in yarn-client mode

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22404:


Assignee: (was: Apache Spark)

> Provide an option to use unmanaged AM in yarn-client mode
> -
>
> Key: SPARK-22404
> URL: https://issues.apache.org/jira/browse/SPARK-22404
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>
> There was an earlier issue, SPARK-1200, to provide such an option, but it was 
> closed without being fixed.
> Using an unmanaged AM in yarn-client mode would allow apps to start up 
> faster by not requiring the container-launcher AM to be launched on the 
> cluster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22404) Provide an option to use unmanaged AM in yarn-client mode

2017-10-30 Thread Devaraj K (JIRA)
Devaraj K created SPARK-22404:
-

 Summary: Provide an option to use unmanaged AM in yarn-client mode
 Key: SPARK-22404
 URL: https://issues.apache.org/jira/browse/SPARK-22404
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.2.0
Reporter: Devaraj K


There was an earlier issue, SPARK-1200, to provide such an option, but it was closed 
without being fixed.

Using an unmanaged AM in yarn-client mode would allow apps to start up faster by not 
requiring the container-launcher AM to be launched on the cluster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22404) Provide an option to use unmanaged AM in yarn-client mode

2017-10-30 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226015#comment-16226015
 ] 

Devaraj K commented on SPARK-22404:
---

I am working on this and will update this JIRA with the proposed PR.

> Provide an option to use unmanaged AM in yarn-client mode
> -
>
> Key: SPARK-22404
> URL: https://issues.apache.org/jira/browse/SPARK-22404
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>
> There was an earlier issue, SPARK-1200, to provide such an option, but it was 
> closed without being fixed.
> Using an unmanaged AM in yarn-client mode would allow apps to start up 
> faster by not requiring the container-launcher AM to be launched on the 
> cluster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225953#comment-16225953
 ] 

Apache Spark commented on SPARK-19611:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19615

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
> Fix For: 2.1.1, 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization, as the underlying file statuses no longer need to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] 
> for more discussion around this issue and the proposed solution.
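A small sketch of how these modes would be selected at runtime, assuming the change 
exposes them through a SQL configuration named spark.sql.hive.caseSensitiveInferenceMode 
(the exact key should be verified against the linked PR) and an existing SparkSession 
`spark`:

{code}
// Hedged sketch: choosing the schema inference mode per session.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")

// Or, equivalently, through SQL:
spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode=INFER_ONLY")
{code}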



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-30 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225924#comment-16225924
 ] 

Shivaram Venkataraman commented on SPARK-22344:
---

To consider this narrow case of the CRAN checks: we explicitly call install.spark 
from run-all.R, and it returns the directory where it put Spark. So could we just 
delete that returned directory when we exit?

> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package, it leaves behind files in /tmp, 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below:
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-10-30 Thread Wing Yew Poon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225913#comment-16225913
 ] 

Wing Yew Poon commented on SPARK-22403:
---

The simplest change that will solve the problem in this particular scenario is 
to change the Utils.createTempDir(namePrefix = s"temporary") call in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L208
 to Utils.createTempDir(root = "/tmp", namePrefix = "temporary").
In my view, using "/tmp" is no worse, and in fact is better, than using 
System.getProperty("java.io.tmpdir").
However, others may know of better solutions.
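A tiny sketch of that suggestion as a before/after of the call site, assuming the 
current Utils.createTempDir(root, namePrefix) signature; this is the proposal above, 
not a merged fix:

{code}
// Before: the root defaults to java.io.tmpdir, which in YARN cluster mode is a
// container-local path that is then resolved against the default (HDFS)
// filesystem, producing the AccessControlException shown in the log above.
Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath

// Proposed: pin the temp checkpoint root to the local /tmp.
Utils.createTempDir(root = "/tmp", namePrefix = "temporary").getCanonicalPath
{code}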

> StructuredKafkaWordCount example fails in YARN cluster mode
> ---
>
> Key: SPARK-22403
> URL: https://issues.apache.org/jira/browse/SPARK-22403
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wing Yew Poon
>
> When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
> fine. However, when I run it in YARN cluster mode, the application errors 
> during initialization, and dies after the default number of YARN application 
> attempts. In the AM log, I see
> {noformat}
> 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value 
> AS STRING)
> 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream 
> metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
> /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
> ...
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
>   at 
> org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
>   at 
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
>   at 
> org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scal

[jira] [Created] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode

2017-10-30 Thread Wing Yew Poon (JIRA)
Wing Yew Poon created SPARK-22403:
-

 Summary: StructuredKafkaWordCount example fails in YARN cluster 
mode
 Key: SPARK-22403
 URL: https://issues.apache.org/jira/browse/SPARK-22403
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Wing Yew Poon


When I run the StructuredKafkaWordCount example in YARN client mode, it runs 
fine. However, when I run it in YARN cluster mode, the application errors 
during initialization, and dies after the default number of YARN application 
attempts. In the AM log, I see
{noformat}
17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value AS 
STRING)
17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream metadata 
StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to 
/data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257)
...
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960)
at 
org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
at 
org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79)
at 
org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala)
{noformat}
Looking at StreamingQueryManager#createQuery, we have
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198
{code}
val checkpointLocation = userSpecifiedCheckpointLocation.map { ...
  ...
}.orElse {
  ...
}.getOrElse {
  if (useTempCheckpointLocation) {
// Delete the temp checkpoint when a query is being stopped without 
errors.
deleteCheckpointOnStop = true
Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
  } else {
...
  }
}
{code}
And Utils.createTempDir has
{code}
  def createTempDir(
  root: String = Syst

[jira] [Assigned] (SPARK-22399) reference in mllib-clustering.html is out of date

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22399:


Assignee: (was: Apache Spark)

> reference in mllib-clustering.html is out of date
> -
>
> Key: SPARK-22399
> URL: https://issues.apache.org/jira/browse/SPARK-22399
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Under "Power Iteration Clustering", the reference to the paper describing the 
> method redirects to the web site for the conference at which it was 
> originally published.
> The correct reference, as taken from Lin's web page at CMU, should be:
> http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf
> (at least, I think that's the one - someone who knows the algorithm should 
> probably check it's the right one - I'm trying to follow it because I don't 
> know it, of course :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22399) reference in mllib-clustering.html is out of date

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225868#comment-16225868
 ] 

Apache Spark commented on SPARK-22399:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19614

> reference in mllib-clustering.html is out of date
> -
>
> Key: SPARK-22399
> URL: https://issues.apache.org/jira/browse/SPARK-22399
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Under "Power Iteration Clustering", the reference to the paper describing the 
> method redirects to the web site for the conference at which it was 
> originally published.
> The correct reference, as taken from Lin's web page at CMU, should be:
> http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf
> (at least, I think that's the one - someone who knows the algorithm should 
> probably check it's the right one - I'm trying to follow it because I don't 
> know it, of course :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22399) reference in mllib-clustering.html is out of date

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22399:


Assignee: Apache Spark

> reference in mllib-clustering.html is out of date
> -
>
> Key: SPARK-22399
> URL: https://issues.apache.org/jira/browse/SPARK-22399
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Under "Power Iteration Clustering", the reference to the paper describing the 
> method redirects to the web site for the conference at which it was 
> originally published.
> The correct reference, as taken from Lin's web page at CMU, should be:
> http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf
> (at least, I think that's the one - someone who knows the algorithm should 
> probably check it's the right one - I'm trying to follow it because I don't 
> know it, of course :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-10-30 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225816#comment-16225816
 ] 

Henry Robinson edited comment on SPARK-22211 at 10/30/17 9:59 PM:
--

Thinking about it a bit more, I think the optimization that's currently implemented 
works as long as a) the limit is pushed to the streaming side of the join and 
b) the physical join implementation guarantees that it will emit rows that have 
non-null RHSs from the streaming side before any that have a null RHS.

That is: say we've got a build-side of one row, (A,C), and a streaming-side of 
(A,B). If we do a full outer-join of these two inputs, the result should be 
some ordering of (A, A), (null, B), (C, null). If we do a FOJ with LIMIT 1 
pushed to the streaming side, imagine it returns (B). It's an error if the join 
operator sees that A on the build-side has no match, and emits that (A, null) 
before it sees that B has no match and emits (null, B). But if it emits (null, 
B) first, the limit above it should kick in and no further rows will be 
emitted. 

It seems a bit fragile to rely on this behaviour from all join implementations, 
and it has some implications for other transformations (e.g. it would not be 
safe to flip the join order for a FOJ with a pushed-down limit - but would be 
ok for a non-pushed-down one). However, it's also a bit concerning to remove an 
optimization that's probably a big win for some queries, even if it's 
incorrect. There are rewrites that would work, e.g.:

{code}x.join(y, x('bar') === y('bar'), "outer").limit(10) ==> 
x.sort.limit(10).join(y.sort.limit(10), x('bar') === y('bar'), 
"outer").sort.limit(10) {code}

seems like it would be correct. But for now, how about we disable the push-down 
optimization in {{LimitPushDown}} and see if there's a need to investigate more 
complicated optimizations after that?


was (Author: henryr):
Thinking about it a bit more, I think the optimization that's currently implemented 
works as long as a) the limit is pushed to the streaming side of the join and 
b) the physical join implementation guarantees that it will emit rows that have 
non-null RHSs from the streaming side before any that have a null RHS.

That is: say we've got a build-side of one row, (A,C), and a streaming-side of 
(A,B). If we do a full outer-join of these two inputs, the result should be 
some ordering of (A, A), (null, B), (C, null). If we do a FOJ with LIMIT 1 
pushed to the streaming side, imagine it returns (B). It's an error if the join 
operator sees that A on the build-side has no match, and emits that (A, null) 
before it sees that B has no match and emits (null, B). But if it emits (null, 
B) first, the limit above it should kick in and no further rows will be 
emitted. 

It seems a bit fragile to rely on this behaviour from all join implementations, 
and it has some implications for other transformations (e.g. it would not be 
safe to flip the join order for a FOJ with a pushed-down limit - but would be 
ok for a non-pushed-down one). However, it's also a bit concerning to remove an 
optimization that's probably a big win for some queries, even if it's 
incorrect. There are rewrites that would work, e.g.:

{{ x.join(y, x('bar') === y('bar'), "outer").limit(10) ==> 
x.sort.limit(10).join(y.sort.limit(10), x('bar') === y('bar'), 
"outer").sort.limit(10) }}

seems like it would be correct. But for now, how about we disable the push-down 
optimization in {{LimitPushDown}} and see if there's a need to investigate more 
complicated optimizations after that?

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and the LocalLimit is pushed to the left side, and id=999 
> is selected, but on the right side we have 100K rows including 999; the result 
> will be
> - one row of (999, 999)
> - the rest of the rows as (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being 
> selected by CollectLimit.
> The actual optimization might be:
> - push down the limit
> - but convert the join to a broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl =

[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results

2017-10-30 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225816#comment-16225816
 ] 

Henry Robinson commented on SPARK-22211:


Thinking about it a bit more, I think the optimization that's currently implemented 
works as long as a) the limit is pushed to the streaming side of the join and 
b) the physical join implementation guarantees that it will emit rows that have 
non-null RHSs from the streaming side before any that have a null RHS.

That is: say the build side has two rows, (A) and (C), and the streaming side has 
(A) and (B). If we do a full outer join of these two inputs, the result should be 
some ordering of (A, A), (null, B), (C, null). If we do a FOJ with LIMIT 1 
pushed to the streaming side, imagine it returns (B). It's an error if the join 
operator sees that A on the build side has no match and emits (A, null) 
before it sees that B has no match and emits (null, B). But if it emits (null, 
B) first, the limit above it should kick in and no further rows will be 
emitted. 

It seems a bit fragile to rely on this behaviour from all join implementations, 
and it has some implications for other transformations (e.g. it would not be 
safe to flip the join order for a FOJ with a pushed-down limit, but it would be 
OK for a non-pushed-down one). However, it's also a bit concerning to remove an 
optimization that's probably a big win for some queries, even though it's 
incorrect. There are rewrites that would work, e.g.:

{{ x.join(y, x('bar') === y('bar'), "outer").limit(10) ==> 
x.sort.limit(10).join(y.sort.limit(10), x('bar') === y('bar'), 
"outer").sort.limit(10) }}

which seems like it would be correct. But for now, how about we disable the push-down 
optimization in {{LimitPushDown}} and see if there's a need to investigate more 
complicated optimizations after that?
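
For illustration, here is a minimal, self-contained sketch of that rewrite in the 
DataFrame API. It is not the actual {{LimitPushDown}} rule; the inputs and the 
single join column {{bar}} are made-up placeholders, and sorting on the join key 
is assumed so that the per-side top-N is deterministic:

{code:scala}
import org.apache.spark.sql.SparkSession

object LimitRewriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")  // for local experimentation only
      .appName("limit-rewrite-sketch")
      .getOrCreate()
    import spark.implicits._

    // Placeholder inputs standing in for x and y from the example above.
    val x = Seq(1, 2, 3, 999).toDF("bar")
    val y = Seq(3, 999, 1000).toDF("bar")

    // Original query: a limit above a full outer join.
    val original = x.join(y, x("bar") === y("bar"), "outer").limit(10)

    // Sketch of the rewrite: take a sorted top-N from each side first,
    // join those, and apply the limit again on top of the join.
    val xTop = x.sort("bar").limit(10)
    val yTop = y.sort("bar").limit(10)
    val rewritten = xTop
      .join(yTop, xTop("bar") === yTop("bar"), "outer")
      .sort(xTop("bar"))
      .limit(10)

    original.show(false)
    rewritten.show(false)
    spark.stop()
  }
}
{code}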

> LimitPushDown optimization for FullOuterJoin generates wrong results
> 
>
> Key: SPARK-22211
> URL: https://issues.apache.org/jira/browse/SPARK-22211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: on community.cloude.databrick.com 
> Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11)
>Reporter: Benyi Wang
>
> LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may 
> generate a wrong result:
> Assume we use limit(1) and LocalLimit is pushed to the left side, where id=999 
> is selected, while the right side has 100K rows including 999. The result 
> will be
> - one row is (999, 999)
> - the remaining rows are (null, xxx)
> Once you call show(), the row (999, 999) has only a 1/10th chance of being 
> selected by CollectLimit.
> The actual optimization might be:
> - push down limit
> - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin.
> Here is my notebook:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html
> {code:java}
> import scala.util.Random._
> val dl = shuffle(1 to 10).toDF("id")
> val dr = shuffle(1 to 10).toDF("id")
> println("data frame dl:")
> dl.explain
> println("data frame dr:")
> dr.explain
> val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1)
> j.explain
> j.show(false)
> {code}
> {code}
> data frame dl:
> == Physical Plan ==
> LocalTableScan [id#10]
> data frame dr:
> == Physical Plan ==
> LocalTableScan [id#16]
> == Physical Plan ==
> CollectLimit 1
> +- SortMergeJoin [id#10], [id#16], FullOuter
>:- *Sort [id#10 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#10, 200)
>: +- *LocalLimit 1
>:+- LocalTableScan [id#10]
>+- *Sort [id#16 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(id#16, 200)
>  +- LocalTableScan [id#16]
> import scala.util.Random._
> dl: org.apache.spark.sql.DataFrame = [id: int]
> dr: org.apache.spark.sql.DataFrame = [id: int]
> j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int]
> ++---+
> |id  |id |
> ++---+
> |null|148|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22344) Prevent R CMD check from using /tmp

2017-10-30 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225770#comment-16225770
 ] 

Felix Cheung commented on SPARK-22344:
--

Kinda. We have some alternate code paths, and the packageName might not be the 
package version (e.g. when we have the URL override).



> Prevent R CMD check from using /tmp
> ---
>
> Key: SPARK-22344
> URL: https://issues.apache.org/jira/browse/SPARK-22344
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.3, 2.1.2, 2.2.0, 2.3.0
>Reporter: Shivaram Venkataraman
>
> When R CMD check is run on the SparkR package, it leaves behind files in /tmp, 
> which is a violation of CRAN policy. We should instead write to Rtmpdir. 
> Notes from CRAN are below
> {code}
> Checking this leaves behind dirs
>hive/$USER
>$USER
> and files named like
>b4f6459b-0624-4100-8358-7aa7afbda757_resources
> in /tmp, in violation of the CRAN Policy.
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22402) Allow fetcher URIs to be downloaded to specific locations relative to Mesos Sandbox

2017-10-30 Thread Arthur Rand (JIRA)
Arthur Rand created SPARK-22402:
---

 Summary: Allow fetcher URIs to be downloaded to specific locations 
relative to Mesos Sandbox
 Key: SPARK-22402
 URL: https://issues.apache.org/jira/browse/SPARK-22402
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 2.2.1
Reporter: Arthur Rand
Priority: Minor


Currently {{spark.mesos.uris}} will only place files in the sandbox, but some 
configuration files and applications may need to be in specific locations. The 
Mesos proto allows for this with the optional {{output_file}} field 
(https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L671). 
We can expose this through the command line with {{--conf 
spark.mesos.uris=:}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14650) Compile Spark REPL for Scala 2.12

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225628#comment-16225628
 ] 

Apache Spark commented on SPARK-14650:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19612

> Compile Spark REPL for Scala 2.12
> -
>
> Key: SPARK-14650
> URL: https://issues.apache.org/jira/browse/SPARK-14650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22401) Missing 2.1.2 tag in git

2017-10-30 Thread Brian Barker (JIRA)
Brian Barker created SPARK-22401:


 Summary: Missing 2.1.2 tag in git
 Key: SPARK-22401
 URL: https://issues.apache.org/jira/browse/SPARK-22401
 Project: Spark
  Issue Type: Bug
  Components: Build, Deploy
Affects Versions: 2.1.2
Reporter: Brian Barker
Priority: Minor


We only saw a 2.1.2-rc4 tag in git, but no tag for the official release. The releases 
web page shows 2.1.2 was released on October 9.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2017-10-30 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225412#comment-16225412
 ] 

Imran Rashid commented on SPARK-18886:
--

Hi [~willshen],

I haven't looked closely, but I would expect this to also exist in 1.6.0.  Btw, 
due to SPARK-18967 (which I assume is also in 1.6.0, but again haven't looked 
closely), I would advise setting the wait to {{1ms}}, not {{0}}.


> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".  *NOTE*: due to SPARK-18967, avoid 
> setting the {{spark.locality.wait=0}} -- instead, use 
> {{spark.locality.wait=1ms}}.
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}
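
As an illustration of workarounds (1) and (2a), a minimal sketch of setting these 
properties when building a session (the property names are the Spark configs 
mentioned above; the values and app name are examples only):
{code:scala}
import org.apache.spark.sql.SparkSession

// Shorten the process-level locality wait, zero out the node/rack waits,
// and disable locality preferences for shuffle data (workaround 2a).
val spark = SparkSession.builder()
  .master("local[*]")  // for local experimentation only
  .appName("delay-scheduling-workaround")
  .config("spark.locality.wait.process", "1s")
  .config("spark.locality.wait.node", "0")
  .config("spark.locality.wait.rack", "0")
  // Per SPARK-18967, prefer 1ms over 0 for the top-level wait.
  .config("spark.locality.wait", "1ms")
  .config("spark.shuffle.reduceLocality.enabled", "false")
  .getOrCreate()
{code}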



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2017-10-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225375#comment-16225375
 ] 

Juan Rodríguez Hortalá commented on SPARK-22148:


Hi [~irashid]. This looks like a different problem, because this issue is about 
a crash due to the job being aborted when there is no place to schedule a task, 
while SPARK-15815 is about a hang. But I have seen hangs similar to the one described 
in SPARK-15815 in the past, also related to dynamic allocation, so it looks 
like the root cause could be related.

My proposal is similar to some of the ideas you outline in SPARK-15815. The 
main difference is that I don't suggest killing an executor, but requesting 
more executors from the resource manager. The result is similar, but your 
approach would work even if no more capacity is available. On the other hand, my 
approach won't kill an executor that is making progress on other tasks. However, my 
approach won't work if 1) there are no more executors available in the cluster, 
and 2) the executor timeout is very long, or executors are caching RDDs and the 
default timeout is infinite, since I was expecting to cover the case of no more 
capacity available by assuming an executor will eventually become idle. Killing 
an executor has no terrible consequences, because with dynamic allocation we 
probably have the external shuffle service, so I think the approach you propose in 
SPARK-15815 is a better alternative. 

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors from the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 
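
As a rough, user-level illustration of this alternative (the real change would live 
inside the scheduler; the API call below exists, but the retry count and wait time 
are made-up stand-ins for the "configurable" knobs described above):
{code:scala}
import org.apache.spark.SparkContext

// Ask the cluster manager for one more executor and poll for a while
// before giving up, instead of aborting the job immediately.
def tryToGetOneMoreExecutor(sc: SparkContext,
                            maxAttempts: Int = 10,
                            waitMs: Long = 6000L): Boolean = {
  val before = sc.statusTracker.getExecutorInfos.length
  sc.requestExecutors(1)  // only honoured by coarse-grained cluster managers
  var attempt = 0
  while (attempt < maxAttempts && sc.statusTracker.getExecutorInfos.length <= before) {
    Thread.sleep(waitMs)
    attempt += 1
  }
  // false means no suitable executor showed up, in which case the job
  // would be aborted as it is today.
  sc.statusTracker.getExecutorInfos.length > before
}
{code}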



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22400) rename some APIs and classes to make their meaning clearer

2017-10-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22400.
-
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.3.0

> rename some APIs and classes to make their meaning clearer
> --
>
> Key: SPARK-22400
> URL: https://issues.apache.org/jira/browse/SPARK-22400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>
> Both `ReadSupport` and `ReadTask` have a method called `createReader`, but 
> they create different things. This could cause some confusion for data source 
> developers. The same issue exists between `WriteSupport` and 
> `DataWriterFactory`, both of which have a method called `createWriter`.
> Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
> because it actually converts `InternalRow`s to `Row`s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22396) Unresolved operator InsertIntoDir for Hive format when Hive Support is not enabled

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22396:


Assignee: Apache Spark  (was: Xiao Li)

> Unresolved operator InsertIntoDir for Hive format when Hive Support is not 
> enabled
> --
>
> Key: SPARK-22396
> URL: https://issues.apache.org/jira/browse/SPARK-22396
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
> Fix For: 2.3.0
>
>
> When Hive support is not enabled, users can hit an unresolved plan node when trying 
> to call `INSERT OVERWRITE DIRECTORY` using the Hive format. 
> ```
> "unresolved operator 'InsertIntoDir true, Storage(Location: 
> /private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw90gn/T/spark-b4227606-9311-46a8-8c02-56355bf0e2bc,
>  Serde Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde, InputFormat: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, OutputFormat: 
> org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat), hive, true;;
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22396) Unresolved operator InsertIntoDir for Hive format when Hive Support is not enabled

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22396:


Assignee: Xiao Li  (was: Apache Spark)

> Unresolved operator InsertIntoDir for Hive format when Hive Support is not 
> enabled
> --
>
> Key: SPARK-22396
> URL: https://issues.apache.org/jira/browse/SPARK-22396
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> When Hive support is not enabled, users can hit an unresolved plan node when trying 
> to call `INSERT OVERWRITE DIRECTORY` using the Hive format. 
> ```
> "unresolved operator 'InsertIntoDir true, Storage(Location: 
> /private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw90gn/T/spark-b4227606-9311-46a8-8c02-56355bf0e2bc,
>  Serde Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde, InputFormat: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, OutputFormat: 
> org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat), hive, true;;
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22396) Unresolved operator InsertIntoDir for Hive format when Hive Support is not enabled

2017-10-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22396.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Unresolved operator InsertIntoDir for Hive format when Hive Support is not 
> enabled
> --
>
> Key: SPARK-22396
> URL: https://issues.apache.org/jira/browse/SPARK-22396
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> When Hive support is not enabled, users can hit an unresolved plan node when trying 
> to call `INSERT OVERWRITE DIRECTORY` using the Hive format. 
> ```
> "unresolved operator 'InsertIntoDir true, Storage(Location: 
> /private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw90gn/T/spark-b4227606-9311-46a8-8c02-56355bf0e2bc,
>  Serde Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde, InputFormat: 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, OutputFormat: 
> org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat), hive, true;;
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2017-10-30 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225335#comment-16225335
 ] 

William Shen commented on SPARK-18886:
--

[~imranr],

I came across this issue because it is marked as duplicated by SPARK-11460. 
SPARK-11460 has affected versions of 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 
1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, and this issue has 
affected version of 2.1.0. Do we know if this is also an issue for 1.6.0? I am 
observing similar behavior for our application in 1.6.0. We are also able to 
achieve better utilization of the executors and performance through setting the 
wait to 0.

Thank you

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".  *NOTE*: due to SPARK-18967, avoid 
> setting the {{spark.locality.wait=0}} -- instead, use 
> {{spark.locality.wait=1ms}}.
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2017-10-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225328#comment-16225328
 ] 

Wenchen Fan commented on SPARK-21033:
-

[~clehene] I think you hit a different issue. Looking at the stacktrace, it 
seems your task failed because it ran out of memory. There are 2 
possible reasons: 1) you have too many tasks running in one executor and the 
executor doesn't have enough memory, or 2) a memory leak bug in Spark. Feel free 
to create a new ticket for it.

> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for 
> pointer, 1 `long` for key-prefix, and another 2 `long`s as the temporary 
> buffer for radix sort.
> In `UnsafeExternalSorter`, we set 
> `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to `1024 * 1024 * 1024 / 2`, 
> hoping the max size of the pointer array would be 8 GB. However this is wrong: 
> `1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the pointer 
> array before reaching this limit, we may hit the max-page-size error.
> Users may see an exception like this on large datasets:
> {code}
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
> ...
> {code}
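
For reference, a quick check of the arithmetic in the description above (plain 
Scala; the constant value is taken from the text):
{code:scala}
// Each record in UnsafeInMemorySorter needs ~32 bytes (pointer, key-prefix,
// and 2 longs of radix-sort buffer), so the spill threshold of 1024^3 / 2
// elements implies a ~16 GB pointer array, not 8 GB.
val thresholdElements: Long = 1024L * 1024L * 1024L / 2  // 536,870,912
val arrayBytes = thresholdElements * 32L                 // 17,179,869,184 (~16 GB)
// Just over the 17,179,869,176-byte page limit shown in the exception above.
println(arrayBytes)
{code}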



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2017-10-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21033.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for 
> pointer, 1 `long` for key-prefix, and another 2 `long`s as the temporary 
> buffer for radix sort.
> In `UnsafeExternalSorter`, we set 
> `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to `1024 * 1024 * 1024 / 2`, 
> hoping the max size of the pointer array would be 8 GB. However this is wrong: 
> `1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the pointer 
> array before reaching this limit, we may hit the max-page-size error.
> Users may see an exception like this on large datasets:
> {code}
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2017-10-30 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225316#comment-16225316
 ] 

Wenchen Fan commented on SPARK-17788:
-

Unfortunately I don't have a reproducible code snippet to prove it has been 
fixed, but I'm pretty confident my fix should work for it. cc 
[~babak.alip...@gmail.com] please reopen this ticket if you still hit this 
issue, thanks!

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records), other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2017-10-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17788.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.3.0

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
>  The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records), other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225273#comment-16225273
 ] 

Apache Spark commented on SPARK-22305:
--

User 'joseph-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/19611

> HDFSBackedStateStoreProvider fails with StackOverflowException when 
> attempting to recover state
> ---
>
> Key: SPARK-22305
> URL: https://issues.apache.org/jira/browse/SPARK-22305
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuval Itzchakov
>
> Environment:
> Spark: 2.2.0
> Java version: 1.8.0_112
> spark.sql.streaming.minBatchesToRetain: 100
> After an application failure due to OOM exceptions, restarting the 
> application with the existing state produces the following error:
> {code:java}
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.StackOverflowError
>   at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
>   at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrEl

[jira] [Assigned] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22305:


Assignee: (was: Apache Spark)

> HDFSBackedStateStoreProvider fails with StackOverflowException when 
> attempting to recover state
> ---
>
> Key: SPARK-22305
> URL: https://issues.apache.org/jira/browse/SPARK-22305
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuval Itzchakov
>
> Environment:
> Spark: 2.2.0
> Java version: 1.8.0_112
> spark.sql.streaming.minBatchesToRetain: 100
> After an application failure due to OOM exceptions, restarting the 
> application with the existing state produces the following error:
> {code:java}
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.StackOverflowError
>   at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
>   at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apach

[jira] [Assigned] (SPARK-22305) HDFSBackedStateStoreProvider fails with StackOverflowException when attempting to recover state

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22305:


Assignee: Apache Spark

> HDFSBackedStateStoreProvider fails with StackOverflowException when 
> attempting to recover state
> ---
>
> Key: SPARK-22305
> URL: https://issues.apache.org/jira/browse/SPARK-22305
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuval Itzchakov
>Assignee: Apache Spark
>
> Environment:
> Spark: 2.2.0
> Java version: 1.8.0_112
> spark.sql.streaming.minBatchesToRetain: 100
> After an application failure due to OOM exceptions, restarting the 
> application with the existing state produces the following error:
> {code:java}
> java.io.IOException: com.google.protobuf.ServiceException: 
> java.lang.StackOverflowError
>   at 
> org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:260)
>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1240)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
>   at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
>   at org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:261)
>   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:405)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:295)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:297)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$4.apply(HDFSBackedStateStoreProvider.scala:296)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:296)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedSta

[jira] [Updated] (SPARK-22400) rename some APIs and classes to make their meaning clearer

2017-10-30 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-22400:
-
Description: 
Both `ReadSupport` and `ReadTask` have a method called `createReader`, but they 
create different things. This could cause some confusion for data source 
developers. The same issue exists between `WriteSupport` and 
`DataWriterFactory`, both of which have a method called `createWriter`.

Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
because it actually converts `InternalRow`s to `Row`s.

  was:
Both `ReadSupport` and `ReadTask` have a method called `createReader`, but they 
create different things. This could cause some confusion for data source 
developers. The same problem exists between `WriteSupport` and 
`DataWriterFactory`, both of which have a method called `createWriter`.

Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
because it actually converts `InternalRow`s to `Row`s.


> rename some APIs and classes to make their meaning clearer
> --
>
> Key: SPARK-22400
> URL: https://issues.apache.org/jira/browse/SPARK-22400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Both `ReadSupport` and `ReadTask` have a method called `createReader`, but 
> they create different things. This could cause some confusion for data source 
> developers. The same issue exists between `WriteSupport` and 
> `DataWriterFactory`, both of which have a method called `createWriter`.
> Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
> because it actually converts `InternalRow`s to `Row`s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22400) rename some APIs and classes to make their meaning clearer

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22400:


Assignee: (was: Apache Spark)

> rename some APIs and classes to make their meaning clearer
> --
>
> Key: SPARK-22400
> URL: https://issues.apache.org/jira/browse/SPARK-22400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Both `ReadSupport` and `ReadTask` have a method called `createReader`, but 
> they create different things. This could cause some confusion for data source 
> developers. The same problem exists between `WriteSupport` and 
> `DataWriterFactory`, both of which have a method called `createWriter`.
> Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
> because it actually converts `InternalRow`s to `Row`s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22400) rename some APIs and classes to make their meaning clearer

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225247#comment-16225247
 ] 

Apache Spark commented on SPARK-22400:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19610

> rename some APIs and classes to make their meaning clearer
> --
>
> Key: SPARK-22400
> URL: https://issues.apache.org/jira/browse/SPARK-22400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Both `ReadSupport` and `ReadTask` have a method called `createReader`, but 
> they create different things. This could cause some confusion for data source 
> developers. The same problem exists between `WriteSupport` and 
> `DataWriterFactory`, both of which have a method called `createWriter`.
> Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
> because it actually converts `InternalRow`s to `Row`s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22400) rename some APIs and classes to make their meaning clearer

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22400:


Assignee: Apache Spark

> rename some APIs and classes to make their meaning clearer
> --
>
> Key: SPARK-22400
> URL: https://issues.apache.org/jira/browse/SPARK-22400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> Both `ReadSupport` and `ReadTask` have a method called `createReader`, but 
> they create different things. This could cause some confusion for data source 
> developers. The same problem exists between `WriteSupport` and 
> `DataWriterFactory`, both of which have a method called `createWriter`.
> Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
> because it actually converts `InternalRow`s to `Row`s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22400) rename some APIs and classes to make their meaning clearer

2017-10-30 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-22400:


 Summary: rename some APIs and classes to make their meaning clearer
 Key: SPARK-22400
 URL: https://issues.apache.org/jira/browse/SPARK-22400
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang


Both `ReadSupport` and `ReadTask` have a method called `createReader`, but they 
create different things. This could cause some confusion for data source 
developers. The same problem exists between `WriteSupport` and 
`DataWriterFactory`, both of which have a method called `createWriter`.

Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
because it actually converts `InternalRow`s to `Row`s.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22399) reference in mllib-clustering.html is out of date

2017-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22399:
--

Just open a PR for minor stuff like this. No need for a JIRA.

> reference in mllib-clustering.html is out of date
> -
>
> Key: SPARK-22399
> URL: https://issues.apache.org/jira/browse/SPARK-22399
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Under "Power Iteration Clustering", the reference to the paper describing the 
> method redirects to the web site for the conference at which it was 
> originally published.
> The correct reference, as taken from Lin's web page at CMU, should be:
> http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf
> (at least, I think that's the one - someone who knows the algorithm should 
> probably check it's the right one - I'm trying to follow it because I don't 
> know it, of course :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22399) reference in mllib-clustering.html is out of date

2017-10-30 Thread Nathan Kronenfeld (JIRA)
Nathan Kronenfeld created SPARK-22399:
-

 Summary: reference in mllib-clustering.html is out of date
 Key: SPARK-22399
 URL: https://issues.apache.org/jira/browse/SPARK-22399
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 2.2.0
Reporter: Nathan Kronenfeld
Priority: Minor


Under "Power Iteration Clustering", the reference to the paper describing the 
method redirects to the web site for the conference at which it was originally 
published.

The correct reference, as taken from Lin's web page at CMU, should be:

http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf

(at least, I think that's the one - someone who knows the algorithm should 
probably check it's the right one - I'm trying to follow it because I don't 
know it, of course :-)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19628) Duplicate Spark jobs in 2.1.0

2017-10-30 Thread Guilherme Lucas da Silva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225113#comment-16225113
 ] 

Guilherme Lucas da Silva commented on SPARK-19628:
--

Could you show us more of this code so we can analyse it better?

Thanks,

> Duplicate Spark jobs in 2.1.0
> -
>
> Key: SPARK-19628
> URL: https://issues.apache.org/jira/browse/SPARK-19628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jork Zijlstra
> Attachments: spark2.0.1.png, spark2.1.0-examplecode.png, 
> spark2.1.0.png
>
>
> After upgrading to Spark 2.1.0 we noticed that there are duplicate jobs being 
> executed. Going back to Spark 2.0.1, they are gone again.
> {code}
> import org.apache.spark.sql._
> object DoubleJobs {
>   def main(args: Array[String]) {
> System.setProperty("hadoop.home.dir", "/tmp");
> val sparkSession: SparkSession = SparkSession.builder
>   .master("local[4]")
>   .appName("spark session example")
>   .config("spark.driver.maxResultSize", "6G")
>   .config("spark.sql.orc.filterPushdown", true)
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .getOrCreate()
> sparkSession.sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val paths = Seq(
>   ""//some orc source
> )
> def dataFrame(path: String): DataFrame = {
>   sparkSession.read.orc(path)
> }
> paths.foreach(path => {
>   dataFrame(path).show(20)
> })
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11215) Add multiple columns support to StringIndexer

2017-10-30 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225093#comment-16225093
 ] 

Weichen Xu commented on SPARK-11215:


[~viirya] Sorry for the late reply! I already have the code partly written and 
will try to submit it tomorrow. Thank you for your attention!

> Add multiple columns support to StringIndexer
> -
>
> Key: SPARK-11215
> URL: https://issues.apache.org/jira/browse/SPARK-11215
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add multiple columns support to StringIndexer, then users can transform 
> multiple input columns to multiple output columns simultaneously. See 
> discussion SPARK-8418.
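
For context, a rough sketch of what this avoids today: each string column needs its 
own StringIndexer stage chained in a Pipeline (written from memory against the 2.x 
single-column API; the column names are assumptions for illustration).

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

// One single-column StringIndexer per input column today.
val cols = Seq("colA", "colB", "colC")  // assumed example column names
val indexers: Array[PipelineStage] = cols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"): PipelineStage
}.toArray
val pipeline = new Pipeline().setStages(indexers)
// val indexed = pipeline.fit(df).transform(df)  // df: DataFrame with the string columns above
{code}

With multi-column support, this would collapse into a single StringIndexer stage 
configured with arrays of input and output columns.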



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-10-30 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225081#comment-16225081
 ] 

Hyukjin Kwon commented on SPARK-18136:
--

Oh, no. Yes, there is still something to be done to detect the Spark home for pip 
installation on Windows. This was partly resolved in the first PR, and a second 
follow-up is in review. I will take action on this soon, e.g., by taking over the 
second PR if it goes inactive.

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
> Fix For: 2.1.3, 2.2.1, 2.3.0
>
>
> Make sure that pip installer for PySpark works on windows



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22308:


Assignee: Apache Spark  (was: Nathan Kronenfeld)

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Assignee: Apache Spark
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it using SharedSparkContext, 
> no matter how they write their ScalaTest suites - whether based on FunSuite, 
> FunSpec, FlatSpec, or WordSpec.
> SharedSQLContext, however, only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22308:


Assignee: Nathan Kronenfeld  (was: Apache Spark)

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Assignee: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it using SharedSparkContext, 
> no matter how they write their ScalaTest suites - whether based on FunSuite, 
> FunSpec, FlatSpec, or WordSpec.
> SharedSQLContext, however, only supports FunSuite.
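
For illustration, a sketch of the kind of non-FunSuite usage the description refers 
to, assuming the spark-core test-jar's SharedSparkContext trait is on the test 
classpath and can be mixed into any ScalaTest style:

{code}
import org.apache.spark.SharedSparkContext
import org.scalatest.FlatSpec

// A FlatSpec-style suite reusing the shared SparkContext provided by the trait.
class FlatSpecStyleSuite extends FlatSpec with SharedSparkContext {
  "parallelize" should "distribute a local collection" in {
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}
{code}

The request is for SharedSQLContext to allow the same flexibility instead of being 
tied to FunSuite.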



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22398) Partition directories with leading 0s cause wrong results

2017-10-30 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-22398:
---

 Summary: Partition directories with leading 0s cause wrong results
 Key: SPARK-22398
 URL: https://issues.apache.org/jira/browse/SPARK-22398
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Bogdan Raducanu


Repro case:
{code}
spark.range(8).selectExpr("'0' || cast(id as string) as id", "id as 
b").write.mode("overwrite").partitionBy("id").parquet("/tmp/bug1")
spark.read.parquet("/tmp/bug1").where("id in ('01')").show
+---+---+
|  b| id|
+---+---+
+---+---+
spark.read.parquet("/tmp/bug1").where("id = '01'").show
+---+---+
|  b| id|
+---+---+
|  1|  1|
+---+---+
{code}

I think there is some special handling of this case somewhere for the equality 
predicate, but not the same handling for IN.
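
If the root cause is partition column type inference (the directory value 01 being 
read back as the integer 1), one way to probe or work around it is to disable that 
inference. This is only a diagnostic sketch under that assumption, not a fix for the 
IN handling itself.

{code}
// With inference disabled, the partition column should stay a string, so '01'
// ought to behave the same under both predicates (assumption to verify).
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet("/tmp/bug1").where("id in ('01')").show
spark.read.parquet("/tmp/bug1").where("id = '01'").show
{code}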



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-10-30 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224979#comment-16224979
 ] 

Hyukjin Kwon commented on SPARK-21866:
--

I came here to bring up the data source aspect and just found this discussion. 
+1 for it from me.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".

[jira] [Commented] (SPARK-11215) Add multiple columns support to StringIndexer

2017-10-30 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224975#comment-16224975
 ] 

Liang-Chi Hsieh commented on SPARK-11215:
-

Hi [~WeichenXu123], I'd like to know if you are busy these days; if so, I could 
work on it. Please let me know. Thanks.

> Add multiple columns support to StringIndexer
> -
>
> Key: SPARK-11215
> URL: https://issues.apache.org/jira/browse/SPARK-11215
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add multiple columns support to StringIndexer, then users can transform 
> multiple input columns to multiple output columns simultaneously. See 
> discussion SPARK-8418.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15689) Data source API v2

2017-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15689:
--
Fix Version/s: (was: 2.3.0)

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18136) Make PySpark pip install works on windows

2017-10-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224845#comment-16224845
 ] 

Sean Owen commented on SPARK-18136:
---

Is this open because it's meant to be backported further (I imagine not), or 
because something else needs to be done?
If it's substantially working, let's call this one resolved.

> Make PySpark pip install works on windows
> -
>
> Key: SPARK-18136
> URL: https://issues.apache.org/jira/browse/SPARK-18136
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
> Fix For: 2.1.3, 2.2.1, 2.3.0
>
>
> Make sure that pip installer for PySpark works on windows



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite

2017-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22308:
--
Fix Version/s: (was: 2.3.0)

> Support unit tests of spark code using ScalaTest using suites other than 
> FunSuite
> -
>
> Key: SPARK-22308
> URL: https://issues.apache.org/jira/browse/SPARK-22308
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core, SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Nathan Kronenfeld
>Assignee: Nathan Kronenfeld
>Priority: Minor
>  Labels: scalatest, test-suite, test_issue
>
> External codebases that contain Spark code can test it using SharedSparkContext, 
> no matter how they write their ScalaTest suites - whether based on FunSuite, 
> FunSpec, FlatSpec, or WordSpec.
> SharedSQLContext, however, only supports FunSuite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22351) Support user-created custom Encoders for Datasets

2017-10-30 Thread Adamos Loizou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224715#comment-16224715
 ] 

Adamos Loizou commented on SPARK-22351:
---

Hi [~hyukjin.kwon], in Spark 1.6 I managed to add support for custom types by 
defining subclasses of {{org.apache.spark.sql.types.UserDefinedType}}.

e.g.

{code:java}
class JodaLocalDateType extends UserDefinedType[org.joda.time.LocalDate] {
  override def sqlType: DataType = TimestampType
  override def serialize(p: org.joda.time.LocalDate) = ???
  override def deserialize(datum: Any): org.joda.time.LocalDate = ??? 
  ...
}
{code}


This abstract class has been made private in Spark 2.x.
Unfortunately, there doesn't seem to be an easy alternative.
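
One workaround that is sometimes suggested is a Kryo-based binary encoder; a minimal 
sketch follows. Note it stores the value as opaque binary rather than mapping it to 
TimestampType, and it does not help for fields nested inside a case class handled by 
the product encoder, so it is not a real replacement for the old UserDefinedType 
approach.

{code}
import org.apache.spark.sql.{Encoder, Encoders}

// Opaque binary (Kryo) encoding for the unsupported type; loses the TimestampType mapping.
implicit val jodaLocalDateEncoder: Encoder[org.joda.time.LocalDate] =
  Encoders.kryo[org.joda.time.LocalDate]

// e.g. spark.createDataset(Seq(org.joda.time.LocalDate.now()))  // picks up the encoder above
{code}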

> Support user-created custom Encoders for Datasets
> -
>
> Key: SPARK-22351
> URL: https://issues.apache.org/jira/browse/SPARK-22351
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adamos Loizou
>Priority: Minor
>
> It would be very helpful if we could easily support creating custom encoders 
> for classes in Spark SQL.
> This is to allow a user to properly define a business model using types of 
> their choice. They can then map them to Spark SQL types without being forced 
> to pollute their model with the built-in mappable types (e.g. 
> {{java.sql.Timestamp}}).
> Specifically in our case, we tend to use either the Java 8 time API or the 
> joda time API for dates instead of {{java.sql.Timestamp}} whose API is quite 
> limited compared to the others.
> Ideally we would like to be able to have a dataset of such a class:
> {code:java}
> case class Person(name: String, dateOfBirth: org.joda.time.LocalDate)
> implicit def localDateTimeEncoder: Encoder[LocalDate] = ??? // we define 
> something that maps to Spark SQL TimestampType
> ...
> // read csv and map it to model
> val people:Dataset[Person] = spark.read.csv("/my/path/file.csv").as[Person]
> {code}
> While this was possible in Spark 1.6, it's no longer the case in Spark 2.x.
> It's also not straightforward to support that using an 
> {{ExpressionEncoder}} (any tips would be much appreciated).
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22385) MapObjects should not access list element by index

2017-10-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22385.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19603
[https://github.com/apache/spark/pull/19603]

> MapObjects should not access list element by index
> --
>
> Key: SPARK-22385
> URL: https://issues.apache.org/jira/browse/SPARK-22385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21708) use sbt 1.0.0

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21708:


Assignee: Apache Spark

> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Assignee: Apache Spark
>Priority: Minor
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21708) use sbt 1.0.0

2017-10-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224523#comment-16224523
 ] 

Apache Spark commented on SPARK-21708:
--

User 'pjfanning' has created a pull request for this issue:
https://github.com/apache/spark/pull/19609

> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21708) use sbt 1.0.0

2017-10-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21708:


Assignee: (was: Apache Spark)

> use sbt 1.0.0
> -
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22291) Postgresql UUID[] to Cassandra: Conversion Error

2017-10-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22291:

Fix Version/s: 2.2.1

> Postgresql UUID[] to Cassandra: Conversion Error
> 
>
> Key: SPARK-22291
> URL: https://issues.apache.org/jira/browse/SPARK-22291
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: Debian Linux, Scala 2.11, Spark 2.2.0, PostgreSQL 9.6, 
> Cassandra 3
>Reporter: Fabio J. Walter
>Assignee: Jen-Ming Chung
>  Labels: patch, postgresql, sql
> Fix For: 2.2.1, 2.3.0
>
> Attachments: 
> org_apache_spark_sql_execution_datasources_jdbc_JdbcUtil.png
>
>
> My job reads data from a PostgreSQL table that contains a user_ids column of 
> uuid[] type, and I'm getting the error above when I try to save the data to 
> Cassandra.
> However, creating the same table on Cassandra works fine (with user_ids as a 
> list).
> I can't change the type on the source table, because I'm reading data from a 
> legacy system.
> I've been looking at the point printed in the log, in the class 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.scala
> Stacktrace on Spark:
> {noformat}
> Caused by: java.lang.ClassCastException: [Ljava.util.UUID; cannot be cast to 
> [Ljava.lang.String;
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:443)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$14.apply(JdbcUtils.scala:442)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13$$anonfun$18.apply(JdbcUtils.scala:472)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$nullSafeConvert(JdbcUtils.scala:482)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:470)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$13.apply(JdbcUtils.scala:469)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:330)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:312)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:133)
> at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.

[jira] [Updated] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-22397:
---
Description: Once SPARK-20542 adds multi-column support to {{Bucketizer}}, 
we can add multi-column support to the {{QuantileDiscretizer}} too.
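
For reference, current single-column usage looks roughly like this (written from 
memory; the column name is an assumption), and the proposal is to accept arrays of 
input/output columns in a single stage instead:

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")           // assumed example column
  .setOutputCol("hour_bucket")
  .setNumBuckets(4)
// val bucketed = discretizer.fit(df).transform(df)
{code}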

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>
> Once SPARK-20542 adds multi-column support to {{Bucketizer}}, we can add 
> multi-column support to the {{QuantileDiscretizer}} too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21983) Fix ANTLR 4.7 deprecations

2017-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21983.
---
   Resolution: Fixed
 Assignee: Henry Robinson
Fix Version/s: 2.3.0

Resolved by https://github.com/apache/spark/pull/19578

> Fix ANTLR 4.7 deprecations
> --
>
> Key: SPARK-21983
> URL: https://issues.apache.org/jira/browse/SPARK-21983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Henry Robinson
>Priority: Minor
> Fix For: 2.3.0
>
>
> We are currently using a couple of deprecated ANTLR API's:
> # BufferedTokenStream.reset
> # ANTLRNoCaseStringStream
> # ParserRuleContext.addChild
> We should fix these, and replace them by non deprecated alternatives.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224459#comment-16224459
 ] 

Nick Pentreath commented on SPARK-22397:


[~huaxing] is working on this and will submit a PR shortly.

> Add multiple column support to QuantileDiscretizer
> --
>
> Key: SPARK-22397
> URL: https://issues.apache.org/jira/browse/SPARK-22397
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22397) Add multiple column support to QuantileDiscretizer

2017-10-30 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-22397:
--

 Summary: Add multiple column support to QuantileDiscretizer
 Key: SPARK-22397
 URL: https://issues.apache.org/jira/browse/SPARK-22397
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Nick Pentreath






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22365) Spark UI executors empty list with 500 error

2017-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22365:
--
Description: 
No data is loaded on the "executors" tab in the Spark UI; the stack trace is below. 
Apart from the exception I have nothing more, but if I can test something to make 
this easier to resolve I am happy to help.

{code}
java.lang.NullPointerException
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
at 
org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
at 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164)
at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:524)
at 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
at 
org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at 
org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
at 
org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
{code}

  was:
No data is loaded on the "executors" tab in the Spark UI; the stack trace is below. 
Apart from the exception I have nothing more, but if I can test something to make 
this easier to resolve I am happy to help.

{{java.lang.NullPointerException
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
at 
org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
at 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164)
at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:524)
at 
org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at 
org.spark_project.jetty.server.Http

[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

2017-10-30 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224454#comment-16224454
 ] 

Nick Pentreath commented on SPARK-8418:
---

Adding SPARK-13030, since the new version of {{OneHotEncoder}} will also 
support transforming multiple columns.

> Add single- and multi-value support to ML Transformers
> --
>
> Key: SPARK-8418
> URL: https://issues.apache.org/jira/browse/SPARK-8418
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication
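
For illustration, a sketch of the kind of schema check such a transformer would need 
in order to accept either form; using SQLDataTypes.VectorType as the public handle 
for the ML vector type is an assumption written from memory.

{code}
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{DoubleType, StructType}

// Accept an input column holding either a single Double or a Vector of values.
def validateInputType(schema: StructType, inputCol: String): Unit = {
  val dt = schema(inputCol).dataType
  require(dt == DoubleType || dt == SQLDataTypes.VectorType,
    s"Column $inputCol must be of type Double or Vector, but was $dt")
}
{code}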



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22365) Spark UI executors empty list with 500 error

2017-10-30 Thread Jakub Dubovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224452#comment-16224452
 ] 

Jakub Dubovsky commented on SPARK-22365:


[~srowen] I will look into the log for "caused by" entries. Also, I think this 
happens all the time in my setup, so I am able to reproduce it. If it were a 
one-time error I wouldn't have bothered creating a ticket.
[~guoxiaolongzte] I am willing to do that, but I have no idea what you mean 
by snapshot.

I will be slow on debugging this as it is not my top priority right now, but I 
want to get it resolved since it prevents me from checking logs on executors 
while the app is running.

> Spark UI executors empty list with 500 error
> 
>
> Key: SPARK-22365
> URL: https://issues.apache.org/jira/browse/SPARK-22365
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Jakub Dubovsky
>
> No data is loaded on the "executors" tab in the Spark UI; the stack trace is 
> below. Apart from the exception I have nothing more, but if I can test something 
> to make this easier to resolve I am happy to help.
> {{java.lang.NullPointerException
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
>   at 
> org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:164)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:748)}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20000) Spark Hive tests aborted due to lz4-java on ppc64le

2017-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2.
---
Resolution: Duplicate

> Spark Hive tests aborted due to lz4-java on ppc64le
> ---
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>Priority: Minor
>  Labels: ppc64le
> Attachments: hs_err_pid.log
>
>
> The tests are getting aborted in Spark Hive project with the following error :
> {code:borderStyle=solid}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x3fff94dbf114, pid=6160, tid=0x3fff6efef1a0
> #
> # JRE version: OpenJDK Runtime Environment (8.0_111-b14) (build 
> 1.8.0_111-8u111-b14-3~14.04.1-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.111-b14 mixed mode linux-ppc64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x56f114]
> {code}
> In the thread log file, I found the following trace:
> Event: 3669.042 Thread 0x3fff89976800 Exception <a 'java/lang/NoClassDefFoundError': Could not initialize class 
> net.jpountz.lz4.LZ4JNI> (0x00079fcda3b8) thrown at 
> [/build/openjdk-8-fVIxxI/openjdk-8-8u111-b14/src/hotspot/src/share/vm/oops/instanceKlass.cpp,
>  line 890]
> This error is due to lz4-java (version 1.3.0), which doesn't have support 
> for ppc64le. Please find the thread log file attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7019) Build docs on doc changes

2017-10-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7019.
--
Resolution: Duplicate

> Build docs on doc changes
> -
>
> Key: SPARK-7019
> URL: https://issues.apache.org/jira/browse/SPARK-7019
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Brennon York
>
> Currently when a pull request changes the {{docs/}} directory, the docs 
> aren't actually built. When a PR is submitted the {{git}} history should be 
> checked to see if any doc changes were made and, if so, properly build the 
> docs and report any issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org