[jira] [Assigned] (SPARK-25979) Window function: allow parentheses around window reference

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25979:


Assignee: Apache Spark

> Window function: allow parentheses around window reference
> --
>
> Key: SPARK-25979
> URL: https://issues.apache.org/jira/browse/SPARK-25979
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Very minor parser bug, but possibly problematic for code-generated queries:
> Consider the following two queries:
> {code}
> SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER 
> BY 1
> {code}
> and
> {code}
> SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 
> 1
> {code}
> The former, with parentheses around the window reference in the OVER clause, 
> fails to parse, while the latter, without parentheses, succeeds:
> {code}
> Error in SQL statement: ParseException: 
> mismatched input '(' expecting {, ',', 'FROM', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19)
> == SQL ==
> SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER 
> BY 1
> ---^^^
> {code}
> This was found when running the CockroachDB tests.
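
For reference, the behavior is easy to reproduce in spark-shell with made-up data; only the parentheses differ between the two statements below (a reproduction sketch, not taken from the report):

{code:scala}
// Reproduction sketch for spark-shell; table name and data are made up.
spark.range(10).selectExpr("id % 3 AS k", "id AS v").createOrReplaceTempView("kv")

// Without parentheses around the window reference: parses and runs.
spark.sql("SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY k) ORDER BY 1").show()

// With parentheses around the window reference: fails with a ParseException before the fix.
spark.sql("SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY k) ORDER BY 1").show()
{code}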



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-11-08 Thread David Vogelbacher (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679768#comment-16679768
 ] 

David Vogelbacher commented on SPARK-24437:
---

Thanks for the explanations! I will look into the best workaround for this 
use-case then.

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png, Screen Shot 2018-11-01 at 10.38.30 AM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS (Spark Thrift Server).
> With each query execution requiring a broadcast join, an UnsafeHashedRelation is 
> registered for cleanup in ContextCleaner. This reference to the 
> UnsafeHashedRelation is also held by some other collection and never becomes 
> eligible for GC, so ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25510) Create a new trait SqlBasedBenchmark

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679895#comment-16679895
 ] 

Apache Spark commented on SPARK-25510:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22985

>  Create a new trait SqlBasedBenchmark
> -
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680084#comment-16680084
 ] 

Apache Spark commented on SPARK-25981:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22954

> Arrow optimization for conversion from R DataFrame to Spark DataFrame
> -
>
> Key: SPARK-25981
> URL: https://issues.apache.org/jira/browse/SPARK-25981
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> PySpark introduced an optimization for toPandas and createDataFrame with 
> Pandas DataFrame, leveraging the PyArrow API.
> The R Arrow API is under development 
> (https://github.com/apache/arrow/tree/master/r) and about to be released via 
> CRAN (https://issues.apache.org/jira/browse/ARROW-3204).
> Once it's released, we can reuse PySpark's Arrow optimization code path and 
> leverage it with minimal code changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25979) Window function: allow parentheses around window reference

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679978#comment-16679978
 ] 

Apache Spark commented on SPARK-25979:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22987

> Window function: allow parentheses around window reference
> --
>
> Key: SPARK-25979
> URL: https://issues.apache.org/jira/browse/SPARK-25979
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Very minor parser bug, but possibly problematic for code-generated queries:
> Consider the following two queries:
> {code}
> SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER 
> BY 1
> {code}
> and
> {code}
> SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 
> 1
> {code}
> The former, with parentheses around the window reference in the OVER clause, 
> fails to parse, while the latter, without parentheses, succeeds:
> {code}
> Error in SQL statement: ParseException: 
> mismatched input '(' expecting {, ',', 'FROM', 'WHERE', 'GROUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 
> 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19)
> == SQL ==
> SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER 
> BY 1
> ---^^^
> {code}
> This was found when running the CockroachDB tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25981:


Assignee: (was: Apache Spark)

> Arrow optimization for conversion from R DataFrame to Spark DataFrame
> -
>
> Key: SPARK-25981
> URL: https://issues.apache.org/jira/browse/SPARK-25981
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> PySpark introduced an optimization for toPandas and createDataFrame with 
> Pandas DataFrame, leveraging the PyArrow API.
> The R Arrow API is under development 
> (https://github.com/apache/arrow/tree/master/r) and about to be released via 
> CRAN (https://issues.apache.org/jira/browse/ARROW-3204).
> Once it's released, we can reuse PySpark's Arrow optimization code path and 
> leverage it with minimal code changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25981:


Assignee: Apache Spark

> Arrow optimization for conversion from R DataFrame to Spark DataFrame
> -
>
> Key: SPARK-25981
> URL: https://issues.apache.org/jira/browse/SPARK-25981
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> PySpark introduced an optimization for toPandas and createDataFrame with 
> Pandas DataFrame, leveraging the PyArrow API.
> The R Arrow API is under development 
> (https://github.com/apache/arrow/tree/master/r) and about to be released via 
> CRAN (https://issues.apache.org/jira/browse/ARROW-3204).
> Once it's released, we can reuse PySpark's Arrow optimization code path and 
> leverage it with minimal code changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame

2018-11-08 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-25981:


 Summary: Arrow optimization for conversion from R DataFrame to 
Spark DataFrame
 Key: SPARK-25981
 URL: https://issues.apache.org/jira/browse/SPARK-25981
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


PySpark introduced an optimization for toPandas and createDataFrame with Pandas 
DataFrame, leveraging the PyArrow API.

The R Arrow API is under development 
(https://github.com/apache/arrow/tree/master/r) and about to be released via 
CRAN (https://issues.apache.org/jira/browse/ARROW-3204).

Once it's released, we can reuse PySpark's Arrow optimization code path and 
leverage it with minimal code changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680085#comment-16680085
 ] 

Apache Spark commented on SPARK-25981:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22954

> Arrow optimization for conversion from R DataFrame to Spark DataFrame
> -
>
> Key: SPARK-25981
> URL: https://issues.apache.org/jira/browse/SPARK-25981
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> PySpark introduced an optimization for toPandas and createDataFrame with 
> Pandas DataFrame, leveraging the PyArrow API.
> The R Arrow API is under development 
> (https://github.com/apache/arrow/tree/master/r) and about to be released via 
> CRAN (https://issues.apache.org/jira/browse/ARROW-3204).
> Once it's released, we can reuse PySpark's Arrow optimization code path and 
> leverage it with minimal code changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25510) Create a new trait SqlBasedBenchmark

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679896#comment-16679896
 ] 

Apache Spark commented on SPARK-25510:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22985

>  Create a new trait SqlBasedBenchmark
> -
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24421) sun.misc.Unsafe in JDK9+

2018-11-08 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679897#comment-16679897
 ] 

Sean Owen commented on SPARK-24421:
---

Using reflection to set the Cleaner works. We also have to use reflection to 
call Cleaner.clean() in StorageUtils. However, this causes an 
IllegalAccessException, not because the method is private, but because the 
methods are of course not exported from the java.base module. The access check 
can be bypassed with command-line flags, but that's not a great solution. I'll 
look at workarounds, but ideas are welcome!

The good news is that this appears to be the only change needed to get 
compilation to work, and tests are running pretty well other than this so far.
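
For illustration, the reflective call being discussed has roughly the shape below. This is a simplified standalone sketch, not the actual StorageUtils change; it assumes the buffer is a direct ByteBuffer whose implementation exposes a cleaner() method:

{code:scala}
import java.nio.ByteBuffer

// Sketch: look up and invoke the buffer's cleaner reflectively so the source compiles
// against both JDK 8 (sun.misc.Cleaner) and JDK 9+ (jdk.internal.ref.Cleaner).
// On JDK 9+ the setAccessible calls still fail at runtime unless the relevant
// java.base packages are opened via --add-opens flags.
def cleanDirectBuffer(buffer: ByteBuffer): Unit = {
  val cleanerMethod = buffer.getClass.getMethod("cleaner")
  cleanerMethod.setAccessible(true)
  val cleaner = cleanerMethod.invoke(buffer)
  if (cleaner != null) {
    val cleanMethod = cleaner.getClass.getMethod("clean")
    cleanMethod.setAccessible(true)
    cleanMethod.invoke(cleaner)
  }
}
{code}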

> sun.misc.Unsafe in JDK9+
> 
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25973:
---
Description: 
HistoryPage.scala counts applications (with a predicate depending on whether it is 
displaying incomplete or complete applications) to check if it must display the 
dataTable.

Since it only checks whether allAppsSize > 0, we could use the exists method on the 
iterator. This way we stop iterating at the first occurrence found.

 

 

  was:
HistoryPage.scala counts applications (with a predicate depending on if it is 
displaying incomplete or complete applications) to check if it must display the 
dataTable.

Since it only checks if allAppsSize > 0, we could use exists method on the 
iterator> This way we stop iterating at the first occurence found.
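
For illustration, the change amounts to something like the following self-contained sketch (toy data and hypothetical names, not the actual HistoryPage code):

{code:scala}
// Toy stand-in for the application list; names are hypothetical.
case class AppInfo(id: String, completed: Boolean)

val allApps = Iterator(AppInfo("app-1", completed = true), AppInfo("app-2", completed = false))
val requestedIncomplete = false

// Before: counting walks the whole iterator just to test emptiness.
//   val shouldDisplayDataTable = allApps.count(_.completed != requestedIncomplete) > 0
// After: exists short-circuits at the first matching application.
val shouldDisplayDataTable = allApps.exists(_.completed != requestedIncomplete)
println(shouldDisplayDataTable) // true
{code}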

 

 


> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether it is 
> displaying incomplete or complete applications) to check if it must display 
> the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25971) Ignore partition byte-size statistics in SQLQueryTestSuite

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679397#comment-16679397
 ] 

Apache Spark commented on SPARK-25971:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22972

> Ignore partition byte-size statistics in SQLQueryTestSuite
> --
>
> Key: SPARK-25971
> URL: https://issues.apache.org/jira/browse/SPARK-25971
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Currently, `SQLQueryTestSuite` is sensitive in terms of the bytes of parquet 
> files in table partitions. If we change the default file format (from Parquet 
> to ORC) or update the metadata of them, the test case should be changed 
> accordingly. This issue aims to make `SQLQueryTestSuite` more robust by 
> ignoring the partition byte statistics.
> {code}
> -Partition Statistics   1144 bytes, 2 rows
> +Partition Statistics   [not included in comparison] bytes, 2 rows
> {code}
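
For illustration, the kind of normalization described could look like the sketch below (a hedged guess at the shape of the change, not necessarily the exact SQLQueryTestSuite code):

{code:scala}
// Mask partition byte sizes before comparing test output, so golden files do not
// depend on the on-disk size of the data files.
val line = "Partition Statistics\t1144 bytes, 2 rows"
val normalized = line.replaceAll("\\d+ bytes", "[not included in comparison] bytes")
// normalized == "Partition Statistics\t[not included in comparison] bytes, 2 rows"
{code}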



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24529) Add spotbugs into maven build process

2018-11-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24529.
---
Resolution: Won't Fix

Looks like this slows down the build too much

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable a Java bytecode check tool 
> [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at 
> multiplication. Due to the tool limitation, some other checks will be enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-11-08 Thread Babulal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Babulal updated SPARK-25332:

Priority: Critical  (was: Major)

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Priority: Critical
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now restart spark-shell, or spark-submit, or restart the JDBC server, and 
> run the same select query again:
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> +----------------------------+---------------------------------------------------------------+-------+
> |col_name                    |data_type                                                      |comment|
> +----------------------------+---------------------------------------------------------------+-------+
> |name                        |string                                                         |null   |
> |age                         |int                                                            |null   |
> |                            |                                                               |       |
> |# Detailed Table Information|                                                               |       |
> |Database                    |default                                                        |       |
> |Table                       |x1                                                             |       |
> |Owner                       |Administrator                                                  |       |
> |Created Time                |Sun Aug 19 12:36:58 IST 2018                                   |       |
> |Last Access                 |Thu Jan 01 05:30:00 IST 1970                                   |       |
> |Created By                  |Spark 2.3.0                                                    |       |
> |Type                        |MANAGED                                                        |       |
> |Provider                    |hive                                                           |       |
> |Table Properties            |[transient_lastDdlTime=1534662418]                             |       |
> |Location                    |file:/D:/spark_release/spark/bin/spark-warehouse/x1            |       |
> |Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |       |
> |InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |       |
> |OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|       |
> |Storage Properties          |[serialization.format=1]                                       |       |
> |Partition Provider          |Catalog                                                        |       |
> +----------------------------+---------------------------------------------------------------+-------+
>  
> With a datasource table it works fine (create table ... using parquet instead of 
> stored as parquet).
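
One way to investigate the difference described above is to compare the statistics the planner sees in each session (a diagnostic sketch for spark-shell; whether statistics are the cause here is not established by this report):

{code:scala}
// sizeInBytes drives the broadcast decision via spark.sql.autoBroadcastJoinThreshold.
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
val plan = spark.sql("select * from x1 t1, x2 t2 where t1.name = t2.name").queryExecution.optimizedPlan
println(plan.stats)
{code}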



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679263#comment-16679263
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

I just removed the IPv6 reference ::1 in /etc/hosts, and your sample code stopped 
reporting "OSError: [Errno 97] Address family not supported by protocol".

Will try to rerun the job now. 

Thank you.

 

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> The following error happens on a heavy Spark job after 4 hours of runtime:
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: 
> [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
> item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
> line 53, in create_persistent_data
> single_obj.create_persistent_data()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 21, in create_persistent_data
> main_df = self.generate_dataframe_main()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 98, in generate_dataframe_main
> raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 16, in get_raw_data_with_metadata_and_aggregation
> main_df = 
> self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
> return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 136, in get_dataframe_from_binary_value_iteration
> combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
> binary_value=count)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 154, in get_dataframe_from_binary_value
> if len(results_of_filter_df.take(1)) == 0:
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 504, in take
> return self.limit(num).collect()
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 467, in collect
> return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
> 148, in _load_from_socket
> sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
> __init__
> _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
>     port, auth_secret = sock_info
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         sock = socket.socket(af, socktype, proto)
>         try:
>             sock.settimeout(15)
>             sock.connect(sa)
>         except socket.error:
>             sock.close()
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     # The RDD materialization time is unpredicable, if we set a timeout for socket reading
>     # operation, it will very possibly fail. See SPARK-18281.
>     sock.settimeout(None)
>     sockfile = sock.makefile("rwb", 65536)
>     do_server_auth(sockfile, auth_secret)
>     # The socket will be automatically closed when garbage-collected.
>     return serializer.load_stream(sockfile)
> {code}
> the culprit is this line in lib/spark2/python/pyspark/rdd.py:
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> so the error "error: [Errno 97] *Address family* not supported by protocol"
> seems to be caused by socket.AF_UNSPEC t

[jira] [Assigned] (SPARK-25971) Ignore partition byte-size statistics in SQLQueryTestSuite

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25971:


Assignee: (was: Apache Spark)

> Ignore partition byte-size statistics in SQLQueryTestSuite
> --
>
> Key: SPARK-25971
> URL: https://issues.apache.org/jira/browse/SPARK-25971
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Currently, `SQLQueryTestSuite` is sensitive in terms of the bytes of parquet 
> files in table partitions. If we change the default file format (from Parquet 
> to ORC) or update the metadata of them, the test case should be changed 
> accordingly. This issue aims to make `SQLQueryTestSuite` more robust by 
> ignoring the partition byte statistics.
> {code}
> -Partition Statistics   1144 bytes, 2 rows
> +Partition Statistics   [not included in comparison] bytes, 2 rows
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25980) dev list mail server is down

2018-11-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25980:
---

 Summary: dev list mail server is down
 Key: SPARK-25980
 URL: https://issues.apache.org/jira/browse/SPARK-25980
 Project: Spark
  Issue Type: IT Help
  Components: Project Infra
Affects Versions: 2.4.0
Reporter: Wenchen Fan


The 2.4.0 release announcement was sent to the dev list, but it hasn't shown up 
for over an hour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25974) Optimizes Generates bytecode for ordering based on the given order

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25974:


Assignee: (was: Apache Spark)

> Optimizes Generates bytecode for ordering based on the given order
> --
>
> Key: SPARK-25974
> URL: https://issues.apache.org/jira/browse/SPARK-25974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> Currently, when generating the code for ordering based on the given order, too 
> many variables and assignment statements are generated, which is not 
> necessary. This PR eliminates the redundant variables and optimizes the 
> bytecode generated for ordering based on the given order.
> The generated code looks like:
> spark.range(1).selectExpr(
>  "id as key",
>  "(id & 1023) as value1",
> "cast(id & 1023 as double) as value2",
> "cast(id & 1023 as int) as value3"
> ).select("value1", "value2", "value3").orderBy("value1", "value2").collect()
> before PR(codegen size: 178)
> Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, 
> double, false] ASC NULLS FIRST:
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */ this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */   public int compare(InternalRow a, InternalRow b) {
> /* 016 */
> /* 017 */ InternalRow i = null;
> /* 018 */
> /* 019 */ i = a;
> /* 020 */ boolean isNullA_0;
> /* 021 */ long primitiveA_0;
> /* 022 */ {
> /* 023 */   long value_0 = i.getLong(0);
> /* 024 */   isNullA_0 = false;
> /* 025 */   primitiveA_0 = value_0;
> /* 026 */ }
> /* 027 */ i = b;
> /* 028 */ boolean isNullB_0;
> /* 029 */ long primitiveB_0;
> /* 030 */ {
> /* 031 */   long value_0 = i.getLong(0);
> /* 032 */   isNullB_0 = false;
> /* 033 */   primitiveB_0 = value_0;
> /* 034 */ }
> /* 035 */ if (isNullA_0 && isNullB_0) {
> /* 036 */   // Nothing
> /* 037 */ } else if (isNullA_0) {
> /* 038 */   return -1;
> /* 039 */ } else if (isNullB_0) {
> /* 040 */   return 1;
> /* 041 */ } else {
> /* 042 */   int comp = (primitiveA_0 > primitiveB_0 ? 1 : primitiveA_0 < 
> primitiveB_0 ? -1 : 0);
> /* 043 */   if (comp != 0) {
> /* 044 */ return comp;
> /* 045 */   }
> /* 046 */ }
> /* 047 */
> /* 048 */ i = a;
> /* 049 */ boolean isNullA_1;
> /* 050 */ double primitiveA_1;
> /* 051 */ {
> /* 052 */   double value_1 = i.getDouble(1);
> /* 053 */   isNullA_1 = false;
> /* 054 */   primitiveA_1 = value_1;
> /* 055 */ }
> /* 056 */ i = b;
> /* 057 */ boolean isNullB_1;
> /* 058 */ double primitiveB_1;
> /* 059 */ {
> /* 060 */   double value_1 = i.getDouble(1);
> /* 061 */   isNullB_1 = false;
> /* 062 */   primitiveB_1 = value_1;
> /* 063 */ }
> /* 064 */ if (isNullA_1 && isNullB_1) {
> /* 065 */   // Nothing
> /* 066 */ } else if (isNullA_1) {
> /* 067 */   return -1;
> /* 068 */ } else if (isNullB_1) {
> /* 069 */   return 1;
> /* 070 */ } else {
> /* 071 */   int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA_1, primitiveB_1);
> /* 072 */   if (comp != 0) {
> /* 073 */ return comp;
> /* 074 */   }
> /* 075 */ }
> /* 076 */
> /* 077 */
> /* 078 */ return 0;
> /* 079 */   }
> /* 080 */
> /* 081 */
> /* 082 */ }
> After PR(codegen size: 89)
> Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, 
> double, false] ASC NULLS FIRST:
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */ this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */   public int compare(InternalRow a, InternalRow b) {
> /* 016 */
> /* 017 */
> /* 018 */ long value_0 = a.getLong(0);
> /* 019 */ long value_2 = b.getLong(0);
> /* 020 */ if (false && false) {
> /* 021 */   // Nothing
> /* 022 */ } else if (false) {
> /* 023 */   return -1;
> /* 024 */ } else if (fal

[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25973:
---
Summary: Spark History Main page performance improvement  (was: Spark 
History Main page performance improvment)

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Major
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether it is 
> displaying incomplete or complete applications) to check if it must display 
> the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25962) Specify minimum versions for both pydocstyle and flake8 in 'lint-python' script

2018-11-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25962:


Assignee: Hyukjin Kwon

> Specify minimum versions for both pydocstyle and flake8 in 'lint-python' 
> script
> ---
>
> Key: SPARK-25962
> URL: https://issues.apache.org/jira/browse/SPARK-25962
> Project: Spark
>  Issue Type: Test
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, the 'lint-python' script does not specify minimum versions for 
> either pydocstyle or flake8. It should set them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23831) Add org.apache.derby to IsolatedClientLoader

2018-11-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23831:
-
Fix Version/s: (was: 2.4.0)

> Add org.apache.derby to IsolatedClientLoader
> 
>
> Key: SPARK-23831
> URL: https://issues.apache.org/jira/browse/SPARK-23831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Add org.apache.derby to IsolatedClientLoader, otherwise it may throw an 
> exception:
> {noformat}
> [info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' 
> with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see 
> the next exception for details.
> [info] at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
> Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown Source)
> [info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
> {noformat}
> How to reproduce:
> {noformat}
> sed 's/HiveExternalCatalogSuite/HiveExternalCatalog2Suite/g' 
> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala
>  > 
> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalog2Suite.scala
> build/sbt -Phive "hive/test-only *.HiveExternalCatalogSuite 
> *.HiveExternalCatalog2Suite"
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-08 Thread Victor Sahin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679083#comment-16679083
 ] 

Victor Sahin commented on SPARK-25967:
--

I see. In that case I can close the issue.

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only trailing and leading whitespace. Removing 
> tabs as well would help use the function for the same use case, e.g. artifact 
> cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25974) Optimizes Generates bytecode for ordering based on the given order

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679580#comment-16679580
 ] 

Apache Spark commented on SPARK-25974:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22976

> Optimizes Generates bytecode for ordering based on the given order
> --
>
> Key: SPARK-25974
> URL: https://issues.apache.org/jira/browse/SPARK-25974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> Currently, when generating the code for ordering based on the given order, too 
> many variables and assignment statements are generated, which is not 
> necessary. This PR eliminates the redundant variables and optimizes the 
> bytecode generated for ordering based on the given order.
> The generated code looks like:
> spark.range(1).selectExpr(
>  "id as key",
>  "(id & 1023) as value1",
> "cast(id & 1023 as double) as value2",
> "cast(id & 1023 as int) as value3"
> ).select("value1", "value2", "value3").orderBy("value1", "value2").collect()
> before PR(codegen size: 178)
> Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, 
> double, false] ASC NULLS FIRST:
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */ this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */   public int compare(InternalRow a, InternalRow b) {
> /* 016 */
> /* 017 */ InternalRow i = null;
> /* 018 */
> /* 019 */ i = a;
> /* 020 */ boolean isNullA_0;
> /* 021 */ long primitiveA_0;
> /* 022 */ {
> /* 023 */   long value_0 = i.getLong(0);
> /* 024 */   isNullA_0 = false;
> /* 025 */   primitiveA_0 = value_0;
> /* 026 */ }
> /* 027 */ i = b;
> /* 028 */ boolean isNullB_0;
> /* 029 */ long primitiveB_0;
> /* 030 */ {
> /* 031 */   long value_0 = i.getLong(0);
> /* 032 */   isNullB_0 = false;
> /* 033 */   primitiveB_0 = value_0;
> /* 034 */ }
> /* 035 */ if (isNullA_0 && isNullB_0) {
> /* 036 */   // Nothing
> /* 037 */ } else if (isNullA_0) {
> /* 038 */   return -1;
> /* 039 */ } else if (isNullB_0) {
> /* 040 */   return 1;
> /* 041 */ } else {
> /* 042 */   int comp = (primitiveA_0 > primitiveB_0 ? 1 : primitiveA_0 < 
> primitiveB_0 ? -1 : 0);
> /* 043 */   if (comp != 0) {
> /* 044 */ return comp;
> /* 045 */   }
> /* 046 */ }
> /* 047 */
> /* 048 */ i = a;
> /* 049 */ boolean isNullA_1;
> /* 050 */ double primitiveA_1;
> /* 051 */ {
> /* 052 */   double value_1 = i.getDouble(1);
> /* 053 */   isNullA_1 = false;
> /* 054 */   primitiveA_1 = value_1;
> /* 055 */ }
> /* 056 */ i = b;
> /* 057 */ boolean isNullB_1;
> /* 058 */ double primitiveB_1;
> /* 059 */ {
> /* 060 */   double value_1 = i.getDouble(1);
> /* 061 */   isNullB_1 = false;
> /* 062 */   primitiveB_1 = value_1;
> /* 063 */ }
> /* 064 */ if (isNullA_1 && isNullB_1) {
> /* 065 */   // Nothing
> /* 066 */ } else if (isNullA_1) {
> /* 067 */   return -1;
> /* 068 */ } else if (isNullB_1) {
> /* 069 */   return 1;
> /* 070 */ } else {
> /* 071 */   int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA_1, primitiveB_1);
> /* 072 */   if (comp != 0) {
> /* 073 */ return comp;
> /* 074 */   }
> /* 075 */ }
> /* 076 */
> /* 077 */
> /* 078 */ return 0;
> /* 079 */   }
> /* 080 */
> /* 081 */
> /* 082 */ }
> After PR(codegen size: 89)
> Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, 
> double, false] ASC NULLS FIRST:
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */ this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */   public int compare(InternalRow a, InternalRow b) {
> /* 016 */
> /* 017 */
> /* 018 */ long value_0 = a.getLong(0);
> /* 019 */ long value_2 = b.getLong(0);
> /* 020 */ if (false && false) {
> /* 021

[jira] [Resolved] (SPARK-25908) Remove old deprecated items in Spark 3

2018-11-08 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25908.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22921
[https://github.com/apache/spark/pull/22921]

> Remove old deprecated items in Spark 3
> --
>
> Key: SPARK-25908
> URL: https://issues.apache.org/jira/browse/SPARK-25908
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0
>
>
> There are many deprecated methods and classes in Spark. They _can_ be removed 
> in Spark 3, and for those that have been deprecated a long time (i.e. since 
> Spark <= 2.0), we should probably do so. This addresses most of these cases, 
> the easiest ones: those that are old and straightforward to remove:
>  - Remove some AccumulableInfo .apply() methods
>  - Remove non-label-specific multiclass precision/recall/fScore in favor of 
> accuracy
>  - Remove toDegrees/toRadians in favor of degrees/radians (SparkR: only 
> deprecated)
>  - Remove approxCountDistinct in favor of approx_count_distinct (SparkR: only 
> deprecated)
>  - Remove unused Python StorageLevel constants
>  - Remove Dataset unionAll in favor of union
>  - Remove unused multiclass option in libsvm parsing
>  - Remove references to deprecated spark configs like spark.yarn.am.port
>  - Remove TaskContext.isRunningLocally
>  - Remove ShuffleMetrics.shuffle* methods
>  - Remove BaseReadWrite.context in favor of session
>  - Remove Column.!== in favor of =!=
>  - Remove Dataset.explode
>  - Remove Dataset.registerTempTable
>  - Remove SQLContext.getOrCreate, setActive, clearActive, constructors
> Not touched yet:
>  - everything else in MLLib
>  - HiveContext
>  - Anything deprecated more recently than 2.0.0, generally



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25976) Allow rdd.reduce on empty rdd by returning an Option[T]

2018-11-08 Thread Yuval Yaari (JIRA)
Yuval Yaari created SPARK-25976:
---

 Summary: Allow rdd.reduce on empty rdd by returning an Option[T]
 Key: SPARK-25976
 URL: https://issues.apache.org/jira/browse/SPARK-25976
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: Yuval Yaari


It is sometimes useful to let the user decide what value to return when 
reducing over an empty RDD.

Currently, if there is no data to reduce, an UnsupportedOperationException is 
thrown.

Although the user can catch that exception, it seems like a "shaky" solution, as 
an UnsupportedOperationException might be thrown from a different location.

Instead, we can overload the reduce method by adding a new method:

reduce(f: (T, T) => T, defaultIfEmpty: T): T

The existing reduce API will not be affected, as it can simply call the new 
overload with a default that throws UnsupportedOperationException.
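
A minimal sketch of the idea, written here as a standalone helper for spark-shell rather than as the actual RDD method:

{code:scala}
import org.apache.spark.rdd.RDD

// Hypothetical helper mirroring the proposed overload: fall back to a caller-supplied
// default instead of throwing when the RDD is empty.
def reduceOrDefault[T](rdd: RDD[T], f: (T, T) => T, defaultIfEmpty: T): T =
  if (rdd.isEmpty()) defaultIfEmpty else rdd.reduce(f)

// Usage, assuming an active SparkContext `sc`:
reduceOrDefault(sc.parallelize(Seq.empty[Int]), (a: Int, b: Int) => a + b, defaultIfEmpty = 0)  // => 0
{code}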

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25959:


Assignee: Apache Spark

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Assignee: Apache Spark
>Priority: Major
>
> I tried to implement GBT and found that the feature importances computed while 
> the model was fit are different once the same model has been saved to storage 
> and loaded back.
>  
> I also found that once the persisted model is loaded, saved back again, and 
> loaded once more, the feature importances remain the same.
>  
> Not sure if it's a bug while storing and reading the model the first time, or if 
> I am missing some parameter that needs to be set before saving the model (so 
> the model picks up some defaults, causing the feature importances to change).
>  
> *Below is the test code:*
> val testDF = Seq(
>   (1, 3, 2, 1, 1),
>   (3, 2, 1, 2, 0),
>   (2, 2, 1, 1, 0),
>   (3, 4, 2, 2, 0),
>   (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
>   .setLabelCol("e")
>   .setFeaturesCol("features")
>   .setMaxDepth(2)
>   .setMaxBins(5)
>   .setMaxIter(10)
>   .setSeed(10)
>   .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Feature importances of the in-memory model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> // Write out the model, then load it back
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-08 Thread Victor Sahin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Sahin resolved SPARK-25967.
--
Resolution: Feedback Received

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only trailing and leading whitespace. Removing 
> tabs as well would help use the function for the same use case, e.g. artifact 
> cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22450) Safely register class for mllib

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679468#comment-16679468
 ] 

Apache Spark commented on SPARK-22450:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/22974

> Safely register class for mllib
> ---
>
> Key: SPARK-22450
> URL: https://issues.apache.org/jira/browse/SPARK-22450
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>Assignee: Xianyang Liu
>Priority: Major
> Fix For: 2.3.0
>
>
> There are still some algorithms based on mllib, such as KMeans. For now, 
> many common mllib classes (such as Vector, DenseVector, SparseVector, Matrix, 
> DenseMatrix, SparseMatrix) are not registered with Kryo, so there are 
> performance issues when serializing or deserializing those objects.
> Previously discussed: https://github.com/apache/spark/pull/19586
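
For reference, a user can already register these classes manually, along the lines of the sketch below (a user-side illustration; the JIRA/PR is about having Spark register them safely by default):

{code:scala}
import org.apache.spark.SparkConf

// Register the common mllib linalg classes with Kryo explicitly.
val conf = new SparkConf()
  .setAppName("kryo-mllib-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    classOf[org.apache.spark.mllib.linalg.Vector],
    classOf[org.apache.spark.mllib.linalg.DenseVector],
    classOf[org.apache.spark.mllib.linalg.SparseVector],
    classOf[org.apache.spark.mllib.linalg.Matrix],
    classOf[org.apache.spark.mllib.linalg.DenseMatrix],
    classOf[org.apache.spark.mllib.linalg.SparseMatrix]
  ))
{code}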



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25974) Optimizes Generates bytecode for ordering based on the given order

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25974:


Assignee: Apache Spark

> Optimizes Generates bytecode for ordering based on the given order
> --
>
> Key: SPARK-25974
> URL: https://issues.apache.org/jira/browse/SPARK-25974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> Currently, when generating the code for ordering based on the given order, too 
> many variables and assignment statements are generated, which is not 
> necessary. This PR eliminates the redundant variables and optimizes the 
> bytecode generated for ordering based on the given order.
> The generated code looks like:
> spark.range(1).selectExpr(
>  "id as key",
>  "(id & 1023) as value1",
> "cast(id & 1023 as double) as value2",
> "cast(id & 1023 as int) as value3"
> ).select("value1", "value2", "value3").orderBy("value1", "value2").collect()
> before PR(codegen size: 178)
> Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, 
> double, false] ASC NULLS FIRST:
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */ this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */   public int compare(InternalRow a, InternalRow b) {
> /* 016 */
> /* 017 */ InternalRow i = null;
> /* 018 */
> /* 019 */ i = a;
> /* 020 */ boolean isNullA_0;
> /* 021 */ long primitiveA_0;
> /* 022 */ {
> /* 023 */   long value_0 = i.getLong(0);
> /* 024 */   isNullA_0 = false;
> /* 025 */   primitiveA_0 = value_0;
> /* 026 */ }
> /* 027 */ i = b;
> /* 028 */ boolean isNullB_0;
> /* 029 */ long primitiveB_0;
> /* 030 */ {
> /* 031 */   long value_0 = i.getLong(0);
> /* 032 */   isNullB_0 = false;
> /* 033 */   primitiveB_0 = value_0;
> /* 034 */ }
> /* 035 */ if (isNullA_0 && isNullB_0) {
> /* 036 */   // Nothing
> /* 037 */ } else if (isNullA_0) {
> /* 038 */   return -1;
> /* 039 */ } else if (isNullB_0) {
> /* 040 */   return 1;
> /* 041 */ } else {
> /* 042 */   int comp = (primitiveA_0 > primitiveB_0 ? 1 : primitiveA_0 < 
> primitiveB_0 ? -1 : 0);
> /* 043 */   if (comp != 0) {
> /* 044 */ return comp;
> /* 045 */   }
> /* 046 */ }
> /* 047 */
> /* 048 */ i = a;
> /* 049 */ boolean isNullA_1;
> /* 050 */ double primitiveA_1;
> /* 051 */ {
> /* 052 */   double value_1 = i.getDouble(1);
> /* 053 */   isNullA_1 = false;
> /* 054 */   primitiveA_1 = value_1;
> /* 055 */ }
> /* 056 */ i = b;
> /* 057 */ boolean isNullB_1;
> /* 058 */ double primitiveB_1;
> /* 059 */ {
> /* 060 */   double value_1 = i.getDouble(1);
> /* 061 */   isNullB_1 = false;
> /* 062 */   primitiveB_1 = value_1;
> /* 063 */ }
> /* 064 */ if (isNullA_1 && isNullB_1) {
> /* 065 */   // Nothing
> /* 066 */ } else if (isNullA_1) {
> /* 067 */   return -1;
> /* 068 */ } else if (isNullB_1) {
> /* 069 */   return 1;
> /* 070 */ } else {
> /* 071 */   int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA_1, primitiveB_1);
> /* 072 */   if (comp != 0) {
> /* 073 */ return comp;
> /* 074 */   }
> /* 075 */ }
> /* 076 */
> /* 077 */
> /* 078 */ return 0;
> /* 079 */   }
> /* 080 */
> /* 081 */
> /* 082 */ }
> After PR(codegen size: 89)
> Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, 
> double, false] ASC NULLS FIRST:
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */ this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */   public int compare(InternalRow a, InternalRow b) {
> /* 016 */
> /* 017 */
> /* 018 */ long value_0 = a.getLong(0);
> /* 019 */ long value_2 = b.getLong(0);
> /* 020 */ if (false && false) {
> /* 021 */   // Nothing
> /* 022 */ } else if (false) {
> /* 023 */   return -1;
> /* 

[jira] [Created] (SPARK-25969) pyspark deal with large data memory issues

2018-11-08 Thread zhao yufei (JIRA)
zhao yufei created SPARK-25969:
--

 Summary: pyspark deal with large data memory issues
 Key: SPARK-25969
 URL: https://issues.apache.org/jira/browse/SPARK-25969
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.2
Reporter: zhao yufei


I use PySpark to load a large CSV file with about 1.4 million lines; each line 
contains two fields: imageId and kws (image keywords separated by ',').

When I run the following code, an OutOfMemoryError occurs:
{code}
import numpy as np
from pyspark import StorageLevel

df_imageIdsKws = spark.read.format('com.databricks.spark.csv') \
    .options(delimiter="\t", header='true') \
    .schema(schema=schema) \
    .load(imagesKwsFilePath)

numClass = 1868

def mapRow(row):
    # Build a dense one-hot vector of size numClass from the image keywords.
    imageId = row.imageId
    hotVector = np.zeros((numClass,), dtype=float)
    for kw in row.kws.split(','):
        kwIndex = kwsIndexMap_broadcast.value.get(kw)
        hotVector[int(kwIndex)] = 1.0
    return (imageId, hotVector.tolist())

df_imageIdsKws = df_imageIdsKws.rdd.persist(storageLevel=StorageLevel.DISK_ONLY)
imageIdsKws_rdd_ = df_imageIdsKws.map(lambda row: mapRow(row)) \
    .persist(storageLevel=StorageLevel.DISK_ONLY)
{code}
Even though I use DISK_ONLY for all RDDs, it still runs out of memory, 
but when I change numClass=1 for testing, everything works fine.
The following error messages are from the executor log:

{code:java}
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:431)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:431)
at 
org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
at 
org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:351)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1381)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:436)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
at 
org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:439)
at 
org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:247)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
at 
org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
2018-11-08 10:53:06 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception 
in thread Thread[stdout writer for /data/anaconda3/bin/python3.5,5,main]
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25973) Spark History Main page performance improvment

2018-11-08 Thread William Montaz (JIRA)
William Montaz created SPARK-25973:
--

 Summary: Spark History Main page performance improvment
 Key: SPARK-25973
 URL: https://issues.apache.org/jira/browse/SPARK-25973
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.3.2
Reporter: William Montaz


HistoryPage.scala counts applications (with a predicate depending on whether it is 
displaying incomplete or complete applications) to check whether it must display the 
dataTable.

Since it only checks whether allAppsSize > 0, we could use the exists method on the 
iterator instead. This way we stop iterating at the first occurrence found, as sketched below.
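
A minimal sketch of the idea (my own illustration; the names below are placeholders, not 
the actual fields in HistoryPage.scala):

{code:scala}
// Placeholder model for an application summary as seen by the history page.
case class AppInfo(id: String, completed: Boolean)

// Before: count(...) walks every application just to decide whether the table is shown.
def shouldDisplayTableByCount(apps: Iterable[AppInfo], requestedIncomplete: Boolean): Boolean =
  apps.count(app => app.completed != requestedIncomplete) > 0

// After: exists(...) short-circuits at the first matching application.
def shouldDisplayTable(apps: Iterable[AppInfo], requestedIncomplete: Boolean): Boolean =
  apps.exists(app => app.completed != requestedIncomplete)
{code}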

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25676) Refactor BenchmarkWideTable to use main method

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679646#comment-16679646
 ] 

Apache Spark commented on SPARK-25676:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22978

> Refactor BenchmarkWideTable to use main method
> --
>
> Key: SPARK-25676
> URL: https://issues.apache.org/jira/browse/SPARK-25676
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: yucai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679169#comment-16679169
 ] 

Yuanjian Li commented on SPARK-25958:
-

We also hit this problem in our internal fork and fixed it by re-configuring `/etc/hosts`. 
If something is wrong with the server config, 
the `[Errno 97]` error can be reproduced simply with:
{code}
import socket
socket.create_connection(('localhost', 8000))
{code}
What does `/etc/hosts` currently look like on the problematic host?
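
For reference only, a minimal localhost mapping that usually avoids this error looks like 
the following (an illustration, not the content of the affected host):
{noformat}
127.0.0.1   localhost
::1         localhost
{noformat}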

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Following error happens on a heavy Spark job after 4 hours of runtime..
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: 
> [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
> item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
> line 53, in create_persistent_data
> single_obj.create_persistent_data()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 21, in create_persistent_data
> main_df = self.generate_dataframe_main()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 98, in generate_dataframe_main
> raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 16, in get_raw_data_with_metadata_and_aggregation
> main_df = 
> self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
> return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 136, in get_dataframe_from_binary_value_iteration
> combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
> binary_value=count)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 154, in get_dataframe_from_binary_value
> if len(results_of_filter_df.take(1)) == 0:
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 504, in take
> return self.limit(num).collect()
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 467, in collect
> return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
> 148, in _load_from_socket
> sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
> __init__
> _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
> port, auth_secret = sock_info
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
> af, socktype, proto, canonname, sa = res
> sock = socket.socket(af, socktype, proto)
> try:
> sock.settimeout(15)
> sock.connect(sa)
> except socket.error:
> sock.close()
> sock = None
> continue
> break
> if not sock:
> raise Exception("could not open socket")
> # The RDD materialization time is unpredicable, if we set a timeout for 
> socket reading
> # operation, it will very possibly fail. See SPARK-18281.
> sock.settimeout(None)
> sockfile = sock.makefile("rwb", 65536)
> do_server_auth(sockfile, auth_secret)
> # The socket will be automatically closed when garbage-collected.
> return serializer.load_stream(sockfile)
> {code}
> the culprint is in lib/spark2/python/pyspark/rdd.py in this line 
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}
> so the error "error: [Errno 97

[jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679911#comment-16679911
 ] 

Apache Spark commented on SPARK-25959:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22986

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
>
> I tried to implement GBT and found that the feature Importance computed while 
> the model was fit is different when the same model was saved into a storage 
> and loaded back. 
>  
> I also found that once the persistent model is loaded and saved back again 
> and loaded, the feature importance remains the same. 
>  
> Not sure if it's a bug while storing and reading the model the first time, or if I am 
> missing some parameter that needs to be set before saving the model (thus the 
> model is picking up some defaults, causing the feature importance to change).
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25969) pyspark deal with large data memory issues

2018-11-08 Thread zhao yufei (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhao yufei resolved SPARK-25969.

Resolution: Resolved

> pyspark deal with large data memory issues
> --
>
> Key: SPARK-25969
> URL: https://issues.apache.org/jira/browse/SPARK-25969
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: zhao yufei
>Priority: Major
>
> I use PySpark to load a large CSV file with about 1.4 million lines; each line 
> contains two fields: imageId and kws (image keywords separated by ',').
>  
> When I run the following code, an OutOfMemoryError occurs:
> {code}
> df_imageIdsKws = spark.read.format('com.databricks.spark.csv') \
>     .options(delimiter="\t", header='true') \
>     .schema(schema=schema) \
>     .load(imagesKwsFilePath)
> numClass = 1868
> def mapRow(row):
>     imageId = row.imageId
>     hotVector = np.zeros((numClass,), dtype=float)
>     for kw in row.kws.split(','):
>         kwIndex = kwsIndexMap_broadcast.value.get(kw)
>         hotVector[int(kwIndex)] = 1.0
>     return (imageId, hotVector.tolist())
> df_imageIdsKws = df_imageIdsKws.rdd.persist(storageLevel=StorageLevel.DISK_ONLY)
> imageIdsKws_rdd_ = df_imageIdsKws.map(lambda row: mapRow(row)) \
>     .persist(storageLevel=StorageLevel.DISK_ONLY)
> {code}
> Even though I use DISK_ONLY for all RDDs, it still runs out of memory, 
> but when I change numClass=1 for testing, everything works fine.
> The following error messages are from the executor log:
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57)
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:431)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$4.apply(ShuffleBlockFetcherIterator.scala:431)
> at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
> at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
> at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:351)
> at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336)
> at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:336)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1381)
> at org.apache.spark.util.Utils$.copyStream(Utils.scala:357)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:436)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:62)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:223)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:439)
> at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:247)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
> at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
> 2018-11-08 10:53:06 ERROR SparkUncaughtExceptionHandler:91 - Uncaught 
> exception in thread Thread[stdout writer for 
> /data/anaconda3/bin/python3.5,5,main]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25973) Spark History Main page performance improvment

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25973:
---
Attachment: fix.patch

> Spark History Main page performance improvment
> --
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Major
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether it is 
> displaying incomplete or complete applications) to check whether it must display 
> the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method on the 
> iterator instead. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-11-08 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679560#comment-16679560
 ] 

Eyal Farago commented on SPARK-24437:
-

[~dvogelbacher], what about the _checkpoint_ approach?

Another possibility: if the query results are actually rather small, can you force 
them into memory and then convert them into _DataSet_s and cache those? That way 
you get rid of the broadcasts and lineage completely, effectively storing only what 
you need. This still has a minor drawback: your DataSets are now built on top of a 
parallelized collection RDD, which still has a memory footprint in the driver's heap. 
(A rough sketch follows below.)

Regarding your question about why the broadcast is kept as part of the lineage: it 
would require a long trip down the rabbit hole to understand how the plan is 
transformed and represented once it is cached... As you wrote yourself, this is a 
rather unusual use case, so it might require unusual handling on your 
side...
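
A rough sketch of the second suggestion (my own illustration; `joinedDf` stands for the 
small query result and is assumed to fit in driver memory):

{code:scala}
// Materialize the small result on the driver, then rebuild a DataFrame from the
// collected rows so the cached data no longer carries the broadcast/lineage.
val rows = joinedDf.collect()                    // assumes the result is genuinely small
val detached = spark.createDataFrame(
  spark.sparkContext.parallelize(rows.toSeq), joinedDf.schema)
detached.cache()
{code}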

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png, Screen Shot 2018-11-01 at 10.38.30 AM.png
>
>
> There seems to memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation
> We have a long running instance of STS.
> With each query execution requiring Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference of 
> UnsafeHashedRelation is being held at some other Collection and not becoming 
> eligible for GC and because of this ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23831) Add org.apache.derby to IsolatedClientLoader

2018-11-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-23831:
--
  Assignee: (was: Yuming Wang)

> Add org.apache.derby to IsolatedClientLoader
> 
>
> Key: SPARK-23831
> URL: https://issues.apache.org/jira/browse/SPARK-23831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Add org.apache.derby to IsolatedClientLoader, otherwise it may throw an 
> exception:
> {noformat}
> [info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' 
> with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see 
> the next exception for details.
> [info] at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
> Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown Source)
> [info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
> {noformat}
> How to reproduce:
> {noformat}
> sed 's/HiveExternalCatalogSuite/HiveExternalCatalog2Suite/g' 
> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala
>  > 
> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalog2Suite.scala
> build/sbt -Phive "hive/test-only *.HiveExternalCatalogSuite 
> *.HiveExternalCatalog2Suite"
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25952) from_json returns wrong result if corrupt record column is in the middle of schema

2018-11-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25952.
--
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/22958

> from_json returns wrong result if corrupt record column is in the middle of 
> schema
> --
>
> Key: SPARK-25952
> URL: https://issues.apache.org/jira/browse/SPARK-25952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> If a user specifies a corrupt record column via 
> spark.sql.columnNameOfCorruptRecord or the JSON option 
> columnNameOfCorruptRecord, the schema with that column is propagated to the Jackson 
> parser. This breaks an assumption inside FailureSafeParser that a row 
> returned from the Jackson parser contains only actual data. As a consequence, 
> FailureSafeParser writes the bad record in the wrong position.
> For example:
> {code:scala}
> val schema = new StructType()
>   .add("a", IntegerType)
>   .add("_unparsed", StringType)
>   .add("b", IntegerType)
> val badRec = """{"a" 1, "b": 11}"""
> val df = Seq(badRec, """{"a": 2, "b": 12}""").toDS()
> {code}
> the collect() action below
> {code:scala}
> df.select(from_json($"value", schema, Map("columnNameOfCorruptRecord" -> 
> "_unparsed"))).collect()
> {code}
> loses 12:
> {code}
> Array(Row(Row(null, "{"a" 1, "b": 11}", null)), Row(Row(2, null, null)))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22450) Safely register class for mllib

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679467#comment-16679467
 ] 

Apache Spark commented on SPARK-22450:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/22974

> Safely register class for mllib
> ---
>
> Key: SPARK-22450
> URL: https://issues.apache.org/jira/browse/SPARK-22450
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Xianyang Liu
>Assignee: Xianyang Liu
>Priority: Major
> Fix For: 2.3.0
>
>
> There are still some algorithms based on mllib, such as KMeans.  For now, 
> many mllib common class (such as: Vector, DenseVector, SparseVector, Matrix, 
> DenseMatrix, SparseMatrix) are not registered in Kryo. So there are some 
> performance issues for those object serialization or deserialization.
> Previously dicussed: https://github.com/apache/spark/pull/19586



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25976) Allow rdd.reduce on empty rdd by returning an Option[T]

2018-11-08 Thread Yuval Yaari (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuval Yaari updated SPARK-25976:

Description: 
It is sometimes useful to let the user decide what value to return when 
reducing an empty RDD.

Currently, if there is no data to reduce, an UnsupportedOperationException is 
thrown.

Although the user can catch that exception, it seems like a "shaky" solution, as 
an UnsupportedOperationException might be thrown from a different location.

Instead, we can overload the reduce method by adding a new method:

reduce(f: (T, T) => T, defaultIfEmpty: () => T): T

The existing reduce API will not be affected, as it will simply call the new 
method with a default that throws an UnsupportedOperationException.

 

  was:
It is sometimes useful to let the user decide what value to return when 
reducing an empty RDD.

Currently, if there is no data to reduce, an UnsupportedOperationException is 
thrown.

Although the user can catch that exception, it seems like a "shaky" solution, as 
an UnsupportedOperationException might be thrown from a different location.

Instead, we can overload the reduce method by adding a new method:

reduce(f: (T, T) => T, defaultIfEmpty: T): T

The existing reduce API will not be affected, as it will simply call the new 
method with a default that throws an UnsupportedOperationException.

 


> Allow rdd.reduce on empty rdd by returning an Option[T]
> ---
>
> Key: SPARK-25976
> URL: https://issues.apache.org/jira/browse/SPARK-25976
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Yuval Yaari
>Priority: Minor
>
> It is sometimes useful to let the user decide what value to return when 
> reducing an empty RDD.
> Currently, if there is no data to reduce, an UnsupportedOperationException is 
> thrown.
> Although the user can catch that exception, it seems like a "shaky" solution, as 
> an UnsupportedOperationException might be thrown from a different location.
> Instead, we can overload the reduce method by adding a new method:
> reduce(f: (T, T) => T, defaultIfEmpty: () => T): T
> The existing reduce API will not be affected, as it will simply call the new 
> method with a default that throws an UnsupportedOperationException.
>  
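
A minimal sketch of the proposed overload, written here as a helper rather than an actual 
RDD method (my own illustration of the signature above):

{code:scala}
import org.apache.spark.rdd.RDD

// Falls back to defaultIfEmpty() instead of letting reduce throw
// UnsupportedOperationException on an empty RDD.
def reduceOrDefault[T](rdd: RDD[T])(f: (T, T) => T, defaultIfEmpty: () => T): T =
  if (rdd.isEmpty()) defaultIfEmpty() else rdd.reduce(f)

// Usage: yields 0 for an empty RDD[Int] instead of throwing.
// val total = reduceOrDefault(sc.emptyRDD[Int])(_ + _, () => 0)
{code}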



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25972) Missed JSON options in streaming.py

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25972:


Assignee: Apache Spark

> Missed JSON options in streaming.py 
> 
>
> Key: SPARK-25972
> URL: https://issues.apache.org/jira/browse/SPARK-25972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Trivial
>
> streaming.py is missing some JSON options compared to readwrite.py: 
> - dropFieldIfAllNull
> - encoding



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25974) Optimizes Generates bytecode for ordering based on the given order

2018-11-08 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25974:
-

 Summary: Optimizes Generates bytecode for ordering based on the 
given order
 Key: SPARK-25974
 URL: https://issues.apache.org/jira/browse/SPARK-25974
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: caoxuewen


Currently, when the code for ordering based on the given order is generated, too 
many unnecessary variables and assignment statements are produced. This PR 
eliminates the redundant variables and optimizes the generated ordering bytecode.
The generated code looks like:

spark.range(1).selectExpr(
 "id as key",
 "(id & 1023) as value1",
"cast(id & 1023 as double) as value2",
"cast(id & 1023 as int) as value3"
).select("value1", "value2", "value3").orderBy("value1", "value2").collect()

before PR(codegen size: 178)

Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, double, 
false] ASC NULLS FIRST:
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */   public int compare(InternalRow a, InternalRow b) {
/* 016 */
/* 017 */ InternalRow i = null;
/* 018 */
/* 019 */ i = a;
/* 020 */ boolean isNullA_0;
/* 021 */ long primitiveA_0;
/* 022 */ {
/* 023 */   long value_0 = i.getLong(0);
/* 024 */   isNullA_0 = false;
/* 025 */   primitiveA_0 = value_0;
/* 026 */ }
/* 027 */ i = b;
/* 028 */ boolean isNullB_0;
/* 029 */ long primitiveB_0;
/* 030 */ {
/* 031 */   long value_0 = i.getLong(0);
/* 032 */   isNullB_0 = false;
/* 033 */   primitiveB_0 = value_0;
/* 034 */ }
/* 035 */ if (isNullA_0 && isNullB_0) {
/* 036 */   // Nothing
/* 037 */ } else if (isNullA_0) {
/* 038 */   return -1;
/* 039 */ } else if (isNullB_0) {
/* 040 */   return 1;
/* 041 */ } else {
/* 042 */   int comp = (primitiveA_0 > primitiveB_0 ? 1 : primitiveA_0 < 
primitiveB_0 ? -1 : 0);
/* 043 */   if (comp != 0) {
/* 044 */ return comp;
/* 045 */   }
/* 046 */ }
/* 047 */
/* 048 */ i = a;
/* 049 */ boolean isNullA_1;
/* 050 */ double primitiveA_1;
/* 051 */ {
/* 052 */   double value_1 = i.getDouble(1);
/* 053 */   isNullA_1 = false;
/* 054 */   primitiveA_1 = value_1;
/* 055 */ }
/* 056 */ i = b;
/* 057 */ boolean isNullB_1;
/* 058 */ double primitiveB_1;
/* 059 */ {
/* 060 */   double value_1 = i.getDouble(1);
/* 061 */   isNullB_1 = false;
/* 062 */   primitiveB_1 = value_1;
/* 063 */ }
/* 064 */ if (isNullA_1 && isNullB_1) {
/* 065 */   // Nothing
/* 066 */ } else if (isNullA_1) {
/* 067 */   return -1;
/* 068 */ } else if (isNullB_1) {
/* 069 */   return 1;
/* 070 */ } else {
/* 071 */   int comp = 
org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA_1, primitiveB_1);
/* 072 */   if (comp != 0) {
/* 073 */ return comp;
/* 074 */   }
/* 075 */ }
/* 076 */
/* 077 */
/* 078 */ return 0;
/* 079 */   }
/* 080 */
/* 081 */
/* 082 */ }

After PR(codegen size: 89)
Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, double, 
false] ASC NULLS FIRST:
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */   public int compare(InternalRow a, InternalRow b) {
/* 016 */
/* 017 */
/* 018 */ long value_0 = a.getLong(0);
/* 019 */ long value_2 = b.getLong(0);
/* 020 */ if (false && false) {
/* 021 */   // Nothing
/* 022 */ } else if (false) {
/* 023 */   return -1;
/* 024 */ } else if (false) {
/* 025 */   return 1;
/* 026 */ } else {
/* 027 */   int comp = (value_0 > value_2 ? 1 : value_0 < value_2 ? -1 : 0);
/* 028 */   if (comp != 0) {
/* 029 */ return comp;
/* 030 */   }
/* 031 */ }
/* 032 */
/* 033 */ double value_1 = a.getDouble(1);
/* 034 */ double value_3 = b.getDouble(1);
/* 035 */ if (false && false) {
/* 036 */   // Nothing
/* 037 */ } else if (false) {
/* 038 */   return -1;
/* 039 */ } else if (false) {
/* 040 */   re

[jira] [Created] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested

2018-11-08 Thread William Montaz (JIRA)
William Montaz created SPARK-25975:
--

 Summary: Spark History does not display necessarily the incomplete 
applications when requested
 Key: SPARK-25975
 URL: https://issues.apache.org/jira/browse/SPARK-25975
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.3.2
Reporter: William Montaz
 Attachments: fix.patch

Filtering of incomplete applications is done in JavaScript against the response 
returned by the API. The problem is that if the returned result is not big 
enough (because of spark.history.ui.maxApplications), it might not contain the 
incomplete applications at all.

To fix this, we can call the API with status RUNNING or COMPLETED depending on 
the view being displayed, as illustrated below.
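
For illustration only (the status and limit query parameters exist in the History Server 
REST API; exactly how the page should build the request is my assumption):
{noformat}
GET /api/v1/applications?status=running      # "Show incomplete applications" view
GET /api/v1/applications?status=completed&limit=[spark.history.ui.maxApplications]
{noformat}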



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25955) Porting JSON test for CSV functions

2018-11-08 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25955:


Assignee: Maxim Gekk

> Porting JSON test for CSV functions
> ---
>
> Key: SPARK-25955
> URL: https://issues.apache.org/jira/browse/SPARK-25955
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> JsonFunctionsSuite contains test that are applicable and useful for CSV 
> functions - from_csv, to_csv and schema_of_csv:
> * uses DDL strings for defining a schema - java
> * roundtrip to_csv -> from_csv
> * roundtrip from_csv -> to_csv
> * infers schemas of a CSV string and pass to to from_csv
> * Support to_csv in SQL
> * Support from_csv in SQL



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25971) Ignore partition byte-size statistics in SQLQueryTestSuite

2018-11-08 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25971:
-

 Summary: Ignore partition byte-size statistics in SQLQueryTestSuite
 Key: SPARK-25971
 URL: https://issues.apache.org/jira/browse/SPARK-25971
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Currently, `SQLQueryTestSuite` is sensitive to the byte size of the Parquet 
files in table partitions. If we change the default file format (from Parquet 
to ORC) or update their metadata, the expected output has to be changed 
accordingly. This issue aims to make `SQLQueryTestSuite` more robust by 
ignoring the partition byte-size statistics.

{code}
-Partition Statistics   1144 bytes, 2 rows
+Partition Statistics   [not included in comparison] bytes, 2 rows
{code}
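
A minimal sketch of how the masking could look (my illustration, not the actual patch):

{code:scala}
// Replace concrete byte counts in the query output with a stable placeholder
// before comparing against the golden files.
val notIncludedMsg = "[not included in comparison]"

def maskByteSizes(answer: String): String =
  answer.replaceAll("""\d+ bytes""", s"$notIncludedMsg bytes")

// maskByteSizes("Partition Statistics\t1144 bytes, 2 rows")
//   == "Partition Statistics\t[not included in comparison] bytes, 2 rows"
{code}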



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679913#comment-16679913
 ] 

Apache Spark commented on SPARK-25959:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22986

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Priority: Major
>
> I tried to implement GBT and found that the feature Importance computed while 
> the model was fit is different when the same model was saved into a storage 
> and loaded back. 
>  
> I also found that once the persistent model is loaded and saved back again 
> and loaded, the feature importance remains the same. 
>  
> Not sure if it's a bug while storing and reading the model the first time, or if I am 
> missing some parameter that needs to be set before saving the model (thus the 
> model is picking up some defaults, causing the feature importance to change).
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25961) Random numbers are not supported when handling data skew

2018-11-08 Thread zengxl (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zengxl updated SPARK-25961:
---
Summary: Random numbers are not supported when handling data skew  (was: 
处理数据倾斜时使用随机数不支持)

> Random numbers are not supported when handling data skew
> 
>
> Key: SPARK-25961
> URL: https://issues.apache.org/jira/browse/SPARK-25961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: spark on yarn 2.3.1
>Reporter: zengxl
>Priority: Major
>
> My SQL query joins two tables; one table's join key contains null values, so I use a 
> random value instead of null, but I get an error. The error info is as follows:
> Error in query: nondeterministic expressions are only allowed in
> Project, Filter, Aggregate or Window, found
>  
>  
> Scanning the Spark source code, it is 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis that checks the SQL; because the 
> number of random values is uncertain, nondeterministic expressions are forbidden here:
> case o if o.expressions.exists(!_.deterministic) &&
>   !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>   !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] =>
>   // The rule above is used to check Aggregate operator.
>   failAnalysis(
>     s"""nondeterministic expressions are only allowed in
>        |Project, Filter, Aggregate or Window, found:
>        |${o.expressions.map(_.sql).mkString(",")}
>        |in operator ${operator.simpleString}
>      """.stripMargin)
>  
> Is it possible to add Join to this check? It is not tested yet, and I am not sure 
> whether there would be other side effects:
> case o if o.expressions.exists(!_.deterministic) &&
>   !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>   !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] +{color:#d04437}&& 
> !o.isInstanceOf[Join]{color}+ =>
>   // The rule above is used to check Aggregate operator.
>   failAnalysis(
>     s"""nondeterministic expressions are only allowed in
>        |Project, Filter, Aggregate or Window or Join, found:
>        |${o.expressions.map(_.sql).mkString(",")}
>        |in operator ${operator.simpleString}
>      """.stripMargin)
>  
> this is my sparksql:
> SELECT
>  T1.CUST_NO AS CUST_NO ,
>  T3.CON_LAST_NAME AS CUST_NAME ,
>  T3.CON_SEX_MF AS SEX_CODE ,
>  T3.X_POSITION AS POST_LV_CODE 
>  FROM tmp.ICT_CUST_RANGE_INFO T1
>  LEFT join tmp.F_CUST_BASE_INFO_ALL T3 ON CASE WHEN coalesce(T1.CUST_NO,'') 
> ='' THEN concat('cust_no',RAND()) ELSE T1.CUST_NO END = T3.BECIF and 
> T3.DATE='20181105'
>  WHERE T1.DATE='20181105'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25968) Non-codegen Floor and Ceil fail for FloatType

2018-11-08 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-25968:
-

 Summary: Non-codegen Floor and Ceil fail for FloatType
 Key: SPARK-25968
 URL: https://issues.apache.org/jira/browse/SPARK-25968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.2.2, 2.4.0
Reporter: Juliusz Sompolski


nullSafeEval of Floor and Ceil does not handle FloatType argument.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25977) Parsing decimals from CSV using locale

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679710#comment-16679710
 ] 

Apache Spark commented on SPARK-25977:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22979

> Parsing decimals from CSV using locale
> --
>
> Key: SPARK-25977
> URL: https://issues.apache.org/jira/browse/SPARK-25977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Support the locale option for parsing decimals from CSV input. Currently the CSV 
> parser can only handle decimals that use a dot ('.') as the decimal separator, which 
> is the wrong format for locales such as ru-RU, for example.
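
For illustration, locale-aware parsing with java.text.DecimalFormat looks like this (an 
assumption about the general approach, not necessarily the patch itself):

{code:scala}
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

// In ru-RU the decimal separator is a comma, so "1000,5" is a valid decimal.
val symbols = new DecimalFormatSymbols(Locale.forLanguageTag("ru-RU"))
val format = new DecimalFormat("", symbols)
format.setParseBigDecimal(true)

val parsed = format.parse("1000,5")   // java.math.BigDecimal("1000.5")
{code}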



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-11-08 Thread Babulal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679574#comment-16679574
 ] 

Babulal commented on SPARK-25332:
-

Since this issue causes a performance degradation, I am marking it as 'Critical'.

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Priority: Critical
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now Restart Spark-Shell or do spark-submit orrestart JDBCServer  again and 
> run same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> ++--+---+
> |col_name |data_type |comment|
> ++--+---+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> ++--+---+
>  
> With datasource table ,working fine ( create table using parquet instead of 
> stored by )



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25965) Add read benchmark for Avro

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25965:


Assignee: Apache Spark

> Add read benchmark for Avro
> ---
>
> Key: SPARK-25965
> URL: https://issues.apache.org/jira/browse/SPARK-25965
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Add a read benchmark for Avro, which has been missing for a while.
> The benchmark is similar to DataSourceReadBenchmark and OrcReadBenchmark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24421) sun.misc.Unsafe in JDK11

2018-11-08 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-24421:

Summary: sun.misc.Unsafe in JDK11  (was: sun.misc.Unsafe in JDK9+)

> sun.misc.Unsafe in JDK11
> 
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2018-11-08 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679796#comment-16679796
 ] 

Yuming Wang commented on SPARK-9686:


This is my fix:
Implement Spark's own GetSchemasOperation: 
[https://github.com/apache/spark/pull/22903]
Implement Spark's own GetTablesOperation: 
[https://github.com/apache/spark/pull/22794]
Implement Spark's own GetColumnsOperation: 
[https://github.com/wangyum/spark/blob/SPARK-24570-DBVisualizer/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkGetColumnsOperation.scala]
 

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start  start-thriftserver.sh
> 2. connect with beeline
> 3. create table
> 4.show tables, the new created table returned
> 5.
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>Properties info = new Properties();
> Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>null, null, null);
> Problem:
>No tables with returned this API, that work in spark1.3



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25982) Dataframe write is non blocking in fair scheduling mode

2018-11-08 Thread Ramandeep Singh (JIRA)
Ramandeep Singh created SPARK-25982:
---

 Summary: Dataframe write is non blocking in fair scheduling mode
 Key: SPARK-25982
 URL: https://issues.apache.org/jira/browse/SPARK-25982
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Ramandeep Singh


Hi,

I have noticed that the expected blocking behavior of the DataFrame write 
operation does not hold in fair scheduling mode.

Ideally, while a DataFrame write is occurring and a future is blocking on 
AwaitResult, no other job should be started, but this is not the case. I have 
noticed that other jobs are started while the partitions are being written.

 

Regards,

Ramandeep Singh

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24421) sun.misc.Unsafe in JDK11

2018-11-08 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680169#comment-16680169
 ] 

Sean Owen commented on SPARK-24421:
---

I've found that, actually, we can't even access clean() with reflection. See 
[https://stackoverflow.com/questions/41265266/how-to-solve-inaccessibleobjectexception-unable-to-make-member-accessible-m]
 for example. It works but only if the JVM is run with a flag like 
{{--add-opens java.base/java.lang=ALL-UNNAMED}}. 

We do indeed have to write code that works on Java 8 and 11. We will have to 
continue to compile with Java 8; the JVM won't run any code compiled for a later 
version (the old UnsupportedClassVersionError), so, no, we can't compile with 
Java 11 and run on Java 8.

But compiling with Java 8 should be fine as Java 11 can read it; we just can't 
access Java 9+ classes without reflection. It's easy enough to resolve the 
_compile_ problems here, and yes, it will still all work on Java 8 like today. 
The problem is running on Java 11 right now.

I'm going to go ahead and open a pull request that fixes the compile issues for 
Java 11 and gets this to the point where it should run on Java 11 _if_ you set 
the flag above. That's progress at least.

The single issue here is this code in StorageUtils:
{code:java}
/**
 * Attempt to clean up a ByteBuffer if it is direct or memory-mapped. This uses 
an *unsafe* Sun
 * API that will cause errors if one attempts to read from the disposed buffer. 
However, neither
 * the bytes allocated to direct buffers nor file descriptors opened for 
memory-mapped buffers put
 * pressure on the garbage collector. Waiting for garbage collection may lead 
to the depletion of
 * off-heap memory or huge numbers of open files. There's unfortunately no 
standard API to
 * manually dispose of these kinds of buffers.
 */
def dispose(buffer: ByteBuffer): Unit = {
  if (buffer != null && buffer.isInstanceOf[MappedByteBuffer]) {
logTrace(s"Disposing of $buffer")
cleanDirectBuffer(buffer.asInstanceOf[DirectBuffer])
  }
}

private def cleanDirectBuffer(buffer: DirectBuffer): Unit = {
  val cleaner: AnyRef = buffer.cleaner()
  if (cleaner != null) {
CLEAN_METHOD.invoke(cleaner)
  }
}
{code}

I wonder how bad it is if this simply isn't accessed? Sounds bad. Not strictly 
fatal but bad. This means it all still _runs_ in Java 11, or should, even if 
this method can't be invoked.

But is there any reason to think this kind of low-level intervention in the 
ByteBuffer wouldn't be needed in Java 11 anyway? I doubt it, but I wonder.

> sun.misc.Unsafe in JDK11
> 
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24834) Utils#nanSafeCompare{Double,Float} functions do not differ from normal java double/float comparison

2018-11-08 Thread Matt Cheah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680170#comment-16680170
 ] 

Matt Cheah commented on SPARK-24834:


[~srowen] - I know this is an old ticket but I wanted to propose re-opening 
this and addressing it for Spark 3.0. My understanding is that this behavior is 
also not consistent with other SQL systems like MySQL and Postgres. In a sense, 
even though this would be a behavioral change, one could argue that this is a 
correctness issue given what one should be expecting given behavior from other 
systems. Would it be reasonable to make the behavior change for Spark 3.0 and 
call it out in the release notes?

> Utils#nanSafeCompare{Double,Float} functions do not differ from normal java 
> double/float comparison
> ---
>
> Key: SPARK-24834
> URL: https://issues.apache.org/jira/browse/SPARK-24834
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Benjamin Duffield
>Priority: Minor
>
> Utils.scala contains two functions `nanSafeCompareDoubles` and 
> `nanSafeCompareFloats` which purport to have special handling of NaN values 
> in comparisons.
> The handling in these functions does not appear to differ from 
> java.lang.Double.compare and java.lang.Float.compare - they seem to produce 
> identical output to the built-in java comparison functions.
> I think it's clearer to not have these special Utils functions, and instead 
> just use the standard java comparison functions.
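For reference, a quick standalone check of the built-in comparators on the NaN cases mentioned above (a sketch, not the actual Utils code):

{code:java}
// java.lang.Double.compare / Float.compare already order NaN after every other
// value and treat two NaNs as equal, which matches the "NaN-safe" intent.
println(java.lang.Double.compare(Double.NaN, Double.NaN)) // 0
println(java.lang.Double.compare(Double.NaN, 1.0))        // > 0, NaN sorts last
println(java.lang.Double.compare(1.0, Double.NaN))        // < 0
println(java.lang.Float.compare(Float.NaN, Float.NaN))    // 0
println(java.lang.Float.compare(Float.NaN, 1.0f))         // > 0
{code}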



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24421) sun.misc.Unsafe in JDK11

2018-11-08 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680169#comment-16680169
 ] 

Sean Owen edited comment on SPARK-24421 at 11/8/18 6:44 PM:


I've found that, actually, we can't even access clean() with reflection. See 
[https://stackoverflow.com/questions/41265266/how-to-solve-inaccessibleobjectexception-unable-to-make-member-accessible-m]
 for example. It works but only if the JVM is run with a flag like 
\{{--add-opens java.base/java.lang=ALL-UNNAMED}} . 

We do indeed have to write code that works on Java 8 and 11. We will have to 
continue to compile with Java 8; JVM won't run any code compiled for a later 
version (the old UnsupportedClassVersionError), so, no, we can't compile with 
Java 11 and run on Java 8.

But compiling with Java 8 should be fine as Java 11 can read it; we just can't 
access Java 9+ classes without reflection. It's easy enough to resolve the 
_compile_ problems here, and yes, it will still all work on Java 8 like today. 
The problem is running on Java 11 right now.

I'm going to go ahead and open a pull request that fixes the compile issues for 
Java 11 and gets this to the point where it should run on Java 11 _if_ you set 
the flag above. That's progress at least.

The single issue here is this code in StorageUtils:
{code:java}
/**
 * Attempt to clean up a ByteBuffer if it is direct or memory-mapped. This uses 
an *unsafe* Sun
 * API that will cause errors if one attempts to read from the disposed buffer. 
However, neither
 * the bytes allocated to direct buffers nor file descriptors opened for 
memory-mapped buffers put
 * pressure on the garbage collector. Waiting for garbage collection may lead 
to the depletion of
 * off-heap memory or huge numbers of open files. There's unfortunately no 
standard API to
 * manually dispose of these kinds of buffers.
 */
def dispose(buffer: ByteBuffer): Unit = {
  if (buffer != null && buffer.isInstanceOf[MappedByteBuffer]) {
logTrace(s"Disposing of $buffer")
cleanDirectBuffer(buffer.asInstanceOf[DirectBuffer])
  }
}

private def cleanDirectBuffer(buffer: DirectBuffer): Unit = {
  val cleaner = buffer.cleaner()
  if (cleaner != null) {
cleaner.clean()
  }
}
{code}

I wonder how bad it is if this simply isn't accessed? Sounds bad. Not strictly 
fatal but bad. This means it all still _runs_ in Java 11, or should, even if 
this method can't be invoked.

But is there any reason to think this kind of low-level intervention in the 
ByteBuffer wouldn't be needed in Java 11 anyway? I doubt it, but I wonder.


was (Author: srowen):
I've found that, actually, we can't even access clean() with reflection. See 
[https://stackoverflow.com/questions/41265266/how-to-solve-inaccessibleobjectexception-unable-to-make-member-accessible-m]
 for example. It works but only if the JVM is run with a flag like 
\{{--add-opens java.base/java.lang=ALL-UNNAMED}} . 

We do indeed have to write code that works on Java 8 and 11. We will have to 
continue to compile with Java 8; JVM won't run any code compiled for a later 
version (the old UnsupportedClassVersionError), so, no, we can't compile with 
Java 11 and run on Java 8.

But compiling with Java 8 should be fine as Java 11 can read it; we just can't 
access Java 9+ classes without reflection. It's easy enough to resolve the 
_compile_ problems here, and yes, it will still all work on Java 8 like today. 
The problem is running on Java 11 right now.

I'm going to go ahead and open a pull request that fixes the compile issues for 
Java 11 and gets this to the point where it should run on Java 11 _if_ you set 
the flag above. That's progress at least.

The single issue here is this code in StorageUtils:
{code:java}
/**
 * Attempt to clean up a ByteBuffer if it is direct or memory-mapped. This uses 
an *unsafe* Sun
 * API that will cause errors if one attempts to read from the disposed buffer. 
However, neither
 * the bytes allocated to direct buffers nor file descriptors opened for 
memory-mapped buffers put
 * pressure on the garbage collector. Waiting for garbage collection may lead 
to the depletion of
 * off-heap memory or huge numbers of open files. There's unfortunately no 
standard API to
 * manually dispose of these kinds of buffers.
 */
def dispose(buffer: ByteBuffer): Unit = {
  if (buffer != null && buffer.isInstanceOf[MappedByteBuffer]) {
logTrace(s"Disposing of $buffer")
cleanDirectBuffer(buffer.asInstanceOf[DirectBuffer])
  }
}

private def cleanDirectBuffer(buffer: DirectBuffer): Unit = {
  val cleaner: AnyRef = buffer.cleaner()
  if (cleaner != null) {
CLEAN_METHOD.invoke(cleaner)
  }
}
{code}

I wonder how bad it is if this simply isn't accessed? Sounds bad. Not strictly 
fatal but bad. This means it all still _runs_ in Java 11, or should, even if 
this method can't be invoked.

But is there any reason to think this kind of low-level intervention in the 
ByteBuffer wouldn't be needed in Java 11 anyway? I doubt it, but I wonder.

[jira] [Commented] (SPARK-24421) sun.misc.Unsafe in JDK11

2018-11-08 Thread DB Tsai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680117#comment-16680117
 ] 

DB Tsai commented on SPARK-24421:
-

[~srowen] Great news! Thanks for looking at this. 

Out of curiosity, can the tests run on JVM 11 if we build Spark with JDK 8? 
Similarly, can we run the tests on JVM 8 if Spark is built with JDK 11 targeting 
Java 8?

For release, which version of the JDK should we use?

> sun.misc.Unsafe in JDK11
> 
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25964) Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution instructions

2018-11-08 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25964.
---
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/22965

> Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution 
> instructions
> -
>
> Key: SPARK-25964
> URL: https://issues.apache.org/jira/browse/SPARK-25964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Trivial
> Fix For: 3.0.0
>
>
> 1. OrcReadBenchmark is under the hive module, so the way to run it should be 
> ```
> build/sbt "hive/test:runMain "
> ```
> 2. The benchmark "String with Nulls Scan" should use the case name "String with 
> Nulls Scan(5%/50%/95%)", not "(0.05%/0.5%/0.95%)"
> 3. Add the null value percentages to the test case names of 
> DataSourceReadBenchmark for the benchmark "String with Nulls Scan".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25961) Random numbers are not supported when handling data skew

2018-11-08 Thread Kris Mok (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680152#comment-16680152
 ] 

Kris Mok commented on SPARK-25961:
--

It looks like the current restriction makes sense, because the expressions in 
the join condition may eventually be evaluated multiple times depending on which 
physical join operator is chosen. It doesn't make a lot of sense to allow 
non-deterministic expressions directly in the Join operator.

Instead, if we have to support non-deterministic expressions in the join 
condition and retain "evaluated-once" semantics, it'd be better to have a rule 
in the Analyzer that extracts the non-deterministic expressions from the join 
condition and puts them into a child Project operator on the appropriate side.

[~zengxl] does that make sense to you?
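To illustrate the extraction idea at the user level (a sketch using the table names from the report, untested; {{spark}} is an existing SparkSession): evaluate the non-deterministic CASE/RAND() expression in a subquery, where Project already allows it, and join on the precomputed column.

{code:java}
// Sketch only: the salted key is computed in the inner SELECT (a Project),
// so the join condition itself stays deterministic.
val query = """
  SELECT T1.CUST_NO AS CUST_NO,
         T3.CON_LAST_NAME AS CUST_NAME,
         T3.CON_SEX_MF AS SEX_CODE,
         T3.X_POSITION AS POST_LV_CODE
  FROM (
    SELECT *,
           CASE WHEN coalesce(CUST_NO, '') = ''
                THEN concat('cust_no', RAND())
                ELSE CUST_NO
           END AS JOIN_KEY
    FROM tmp.ICT_CUST_RANGE_INFO
    WHERE DATE = '20181105'
  ) T1
  LEFT JOIN tmp.F_CUST_BASE_INFO_ALL T3
    ON T1.JOIN_KEY = T3.BECIF AND T3.DATE = '20181105'
"""
spark.sql(query)
{code}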

> Random numbers are not supported when handling data skew
> 
>
> Key: SPARK-25961
> URL: https://issues.apache.org/jira/browse/SPARK-25961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: spark on yarn 2.3.1
>Reporter: zengxl
>Priority: Major
>
> my query sql use two table join,one table join key has null value,i use rand 
> value instead of null value,but has error,the error info as follows:
> Error in query: nondeterministic expressions are only allowed in
> Project, Filter, Aggregate or Window, found
>  
>  
> scan spark source code is 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis check sql, because the 
> number of random variables is uncertain, it is prohibited
> case o if o.expressions.exists(!_.deterministic) &&
>  !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>  !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] =>
>  // The rule above is used to check Aggregate operator.
>  failAnalysis(
>  s"""nondeterministic expressions are only allowed in
> |Project, Filter, Aggregate or Window, found:
> |${o.expressions.map(_.sql).mkString(",")}
> |in operator ${operator.simpleString}
>  """.stripMargin)
>  
> Is it possible to add Join to this code? It's not yet tested.And whether 
> there will be other effects
> case o if o.expressions.exists(!_.deterministic) &&
>  !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>  !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] +{color:#d04437}&& 
> !o.isInstanceOf[Join]{color}+ =>
>  // The rule above is used to check Aggregate operator.
>  failAnalysis(
>  s"""nondeterministic expressions are only allowed in
> |Project, Filter, Aggregate or Window or Join, found:
> |${o.expressions.map(_.sql).mkString(",")}
> |in operator ${operator.simpleString}
>  """.stripMargin)
>  
> this is my sparksql:
> SELECT
>  T1.CUST_NO AS CUST_NO ,
>  T3.CON_LAST_NAME AS CUST_NAME ,
>  T3.CON_SEX_MF AS SEX_CODE ,
>  T3.X_POSITION AS POST_LV_CODE 
>  FROM tmp.ICT_CUST_RANGE_INFO T1
>  LEFT join tmp.F_CUST_BASE_INFO_ALL T3 ON CASE WHEN coalesce(T1.CUST_NO,'') 
> ='' THEN concat('cust_no',RAND()) ELSE T1.CUST_NO END = T3.BECIF and 
> T3.DATE='20181105'
>  WHERE T1.DATE='20181105'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25965) Add read benchmark for Avro

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679333#comment-16679333
 ] 

Apache Spark commented on SPARK-25965:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22966

> Add read benchmark for Avro
> ---
>
> Key: SPARK-25965
> URL: https://issues.apache.org/jira/browse/SPARK-25965
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Add read benchmark for Avro, which is missing for a period.
> The benchmark is similar to DataSourceReadBenchmark and OrcReadBenchmark



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25977) Parsing decimals from CSV using locale

2018-11-08 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25977:
--

 Summary: Parsing decimals from CSV using locale
 Key: SPARK-25977
 URL: https://issues.apache.org/jira/browse/SPARK-25977
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Support the locale option to parse decimals from CSV input. Currently CSV 
parser can handle decimals that contain only dots - '.' which is incorrect 
format in locales like ru-RU, for example.
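For illustration, the kind of locale-aware parsing the JDK already provides and that the CSV parser could delegate to (a sketch; how it gets wired into the option handling is not specified here):

{code:java}
import java.text.{DecimalFormat, NumberFormat}
import java.util.Locale

// In ru-RU the decimal separator is a comma, so "1000,5" should parse as 1000.5.
val format = NumberFormat.getInstance(Locale.forLanguageTag("ru-RU"))
  .asInstanceOf[DecimalFormat]
format.setParseBigDecimal(true)        // keep full precision for DecimalType
val parsed = format.parse("1000,5")    // java.math.BigDecimal 1000.5
println(parsed)
{code}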



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25968) Non-codegen Floor and Ceil fail for FloatType

2018-11-08 Thread Juliusz Sompolski (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski resolved SPARK-25968.
---
Resolution: Won't Fix

OK, I see it's not supposed to handle it; the type gets promoted in the 
analyzer instead.
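For reference, the promotion is easy to see from a session (a sketch; {{spark}} is an existing SparkSession):

{code:java}
// The analyzer inserts a cast to DOUBLE before FLOOR/CEIL see the value,
// so the non-codegen eval never receives a FloatType input.
spark.sql("SELECT floor(CAST(1.7 AS FLOAT)), ceil(CAST(1.2 AS FLOAT))").show()
// The implicit cast shows up in the analyzed plan:
spark.sql("SELECT floor(CAST(1.7 AS FLOAT))").explain(true)
{code}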

> Non-codegen Floor and Ceil fail for FloatType
> -
>
> Key: SPARK-25968
> URL: https://issues.apache.org/jira/browse/SPARK-25968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> nullSafeEval of Floor and Ceil does not handle FloatType argument.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25980) dev list mail server is down

2018-11-08 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679997#comment-16679997
 ] 

Wenchen Fan commented on SPARK-25980:
-

sorry I opened the ticket at a wrong place. Will open a new one in the INFRA 
project.

> dev list mail server is down
> 
>
> Key: SPARK-25980
> URL: https://issues.apache.org/jira/browse/SPARK-25980
> Project: Spark
>  Issue Type: IT Help
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> After sending the 2.4.0 release announcement to the dev list, it doesn't show 
> up for over one hour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25971) Ignore partition byte-size statistics in SQLQueryTestSuite

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25971:


Assignee: Apache Spark

> Ignore partition byte-size statistics in SQLQueryTestSuite
> --
>
> Key: SPARK-25971
> URL: https://issues.apache.org/jira/browse/SPARK-25971
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, `SQLQueryTestSuite` is sensitive in terms of the bytes of parquet 
> files in table partitions. If we change the default file format (from Parquet 
> to ORC) or update the metadata of them, the test case should be changed 
> accordingly. This issue aims to make `SQLQueryTestSuite` more robust by 
> ignoring the partition byte statistics.
> {code}
> -Partition Statistics   1144 bytes, 2 rows
> +Partition Statistics   [not included in comparison] bytes, 2 rows
> {code}
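A minimal sketch of the kind of normalization meant here (illustrative only, not the actual test-suite change):

{code:java}
// Replace the concrete byte count in the DESCRIBE output with a placeholder so
// the golden files no longer depend on the on-disk size of the partition files.
def normalizePartitionStats(output: String): String =
  output.replaceAll(
    "Partition Statistics\\s+\\d+ bytes",
    "Partition Statistics\t[not included in comparison] bytes")
{code}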



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25966) "EOF Reached the end of stream with bytes left to read" while reading/writing to Parquets

2018-11-08 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679611#comment-16679611
 ] 

Steve Loughran commented on SPARK-25966:


bq.  It looks to me like a problem in closing the file or with an executor 
dying before finishing a file. If that happened and the data wasn't cleaned up, 
then it could lead to this problem.

though if S3 was the destination of the write, that'd be an atomic PUT, 
wouldn't it? Corruption would have to happen either in the spark/parquet writer 
or in the s3 client uploading the data (or RAM corruption, which I'm ignoring).

> "EOF Reached the end of stream with bytes left to read" while reading/writing 
> to Parquets
> -
>
> Key: SPARK-25966
> URL: https://issues.apache.org/jira/browse/SPARK-25966
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0 (built from RC5 tag) running Hadoop 3.1.1 on 
> top of a Mesos cluster. Both input and output Parquet files are on S3.
>Reporter: Alessandro Andrioni
>Priority: Major
>
> I was persistently getting the following exception while trying to run one 
> Spark job we have using Spark 2.4.0. It went away after I regenerated from 
> scratch all the input Parquet files (generated by another Spark job also 
> using Spark 2.4.0).
> Is there a chance that Spark is writing (quite rarely) corrupted Parquet 
> files?
> {code:java}
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:557)
>   (...)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 312 in stage 682.0 failed 4 times, most recent failure: Lost task 312.3 
> in stage 682.0 (TID 235229, 10.130.29.78, executor 77): java.io.EOFException: 
> Reached the end of stream with 996 bytes left to read
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
>   at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.V

[jira] [Commented] (SPARK-22827) Avoid throwing OutOfMemoryError in case of exception in spill

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679307#comment-16679307
 ] 

Apache Spark commented on SPARK-22827:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22969

> Avoid throwing OutOfMemoryError in case of exception in spill
> -
>
> Key: SPARK-22827
> URL: https://issues.apache.org/jira/browse/SPARK-22827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, the task memory manager throws an OutOfMemory error when an IO 
> exception happens in spill() - 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L194.
>  Similarly, there are many other places in the code where, if a task is not able 
> to acquire memory due to an exception, we throw an OutOfMemory error, which kills 
> the entire executor and hence fails all the tasks running on that executor 
> instead of just failing one single task. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25960) Support subpath mounting with Kubernetes

2018-11-08 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679242#comment-16679242
 ] 

Dongjoon Hyun commented on SPARK-25960:
---

Hi, [~tnachen]. I updated the field because the next version is 3.0.0.

> Support subpath mounting with Kubernetes
> 
>
> Key: SPARK-25960
> URL: https://issues.apache.org/jira/browse/SPARK-25960
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Timothy Chen
>Priority: Major
>
> Currently we support mounting volumes into executor and driver, but there is 
> no option to provide a subpath to be mounted from the volume. 
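For context, today's volume options only describe the mount path and the volume itself; a subpath would presumably need one more key alongside them. The {{.mount.subPath}} key below is hypothetical (only the {{.mount.path}}-style keys exist today); the rest follows the current 2.4 option format:

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // existing option style (Spark 2.4):
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/data")
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "data-pvc")
  // hypothetical key for the requested feature:
  .set("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.subPath", "app-logs")
{code}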



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22827) Avoid throwing OutOfMemoryError in case of exception in spill

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679306#comment-16679306
 ] 

Apache Spark commented on SPARK-22827:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22969

> Avoid throwing OutOfMemoryError in case of exception in spill
> -
>
> Key: SPARK-22827
> URL: https://issues.apache.org/jira/browse/SPARK-22827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Assignee: Sital Kedia
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, the task memory manager throws an OutOfMemory error when an IO 
> exception happens in spill() - 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L194.
>  Similarly, there are many other places in the code where, if a task is not able 
> to acquire memory due to an exception, we throw an OutOfMemory error, which kills 
> the entire executor and hence fails all the tasks running on that executor 
> instead of just failing one single task. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25965) Add read benchmark for Avro

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25965:


Assignee: (was: Apache Spark)

> Add read benchmark for Avro
> ---
>
> Key: SPARK-25965
> URL: https://issues.apache.org/jira/browse/SPARK-25965
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Add read benchmark for Avro, which is missing for a period.
> The benchmark is similar to DataSourceReadBenchmark and OrcReadBenchmark



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-11-08 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679745#comment-16679745
 ] 

Marco Gaido commented on SPARK-24437:
-

[~dvogelbacher] the point is: a broadcast is never destroyed/recomputed, for 
many reasons: in case you just re-execute a plan without caching it, for 
instance, the broadcast doesn't need to be recomputed, etc. This could 
definitely be changed by doing something like what I did in the PR in the related 
JIRA (which is not enough anyway, since it misses the recompute logic). Yes, I 
think your use-case is rather unusual and not well handled by Spark currently, 
but fixing it is not trivial either, since it is kind of a trade-off between 
recomputation cost and resource allocation.

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png, Screen Shot 2018-11-01 at 10.38.30 AM.png
>
>
> There seems to memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation
> We have a long running instance of STS.
> With each query execution requiring Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference of 
> UnsafeHashedRelation is being held at some other Collection and not becoming 
> eligible for GC and because of this ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25972) Missed JSON options in streaming.py

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679434#comment-16679434
 ] 

Apache Spark commented on SPARK-25972:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22973

> Missed JSON options in streaming.py 
> 
>
> Key: SPARK-25972
> URL: https://issues.apache.org/jira/browse/SPARK-25972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Trivial
>
> streaming.py misses JSON options comparing to readwrite.py:
> - dropFieldIfAllNull
> - encoding
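For reference, what the two options look like on the Scala batch reader, where they carry the same meaning (values are illustrative; the ticket is only about documenting them in streaming.py):

{code:java}
// Both options already exist on the JSON data source; streaming.py just
// doesn't list them in the docstring.
val df = spark.read
  .option("encoding", "UTF-16LE")        // explicit charset instead of auto-detection
  .option("dropFieldIfAllNull", "true")  // drop all-null columns during schema inference
  .json("/path/to/json")                 // path is illustrative
{code}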



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25980) dev list mail server is down

2018-11-08 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25980.
-
Resolution: Invalid

> dev list mail server is down
> 
>
> Key: SPARK-25980
> URL: https://issues.apache.org/jira/browse/SPARK-25980
> Project: Spark
>  Issue Type: IT Help
>  Components: Project Infra
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> After sending the 2.4.0 release announcement to the dev list, it doesn't show 
> up for over one hour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Montaz updated SPARK-25973:
---
Priority: Minor  (was: Major)

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on if it is 
> displaying incomplete or complete applications) to check if it must display 
> the dataTable.
> Since it only checks if allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  
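A minimal, self-contained sketch of the suggested change (the types and field names are illustrative, not the actual HistoryPage internals):

{code:java}
final case class AppAttempt(completed: Boolean)
final case class AppInfo(attempts: Seq[AppAttempt])

// exists short-circuits at the first matching application,
// whereas count(...) > 0 keeps iterating over the whole list.
def shouldDisplay(apps: Iterator[AppInfo], requestedIncomplete: Boolean): Boolean =
  apps.exists(_.attempts.exists(_.completed != requestedIncomplete))
{code}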



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25977) Parsing decimals from CSV using locale

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25977:


Assignee: (was: Apache Spark)

> Parsing decimals from CSV using locale
> --
>
> Key: SPARK-25977
> URL: https://issues.apache.org/jira/browse/SPARK-25977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Support the locale option to parse decimals from CSV input. Currently CSV 
> parser can handle decimals that contain only dots - '.' which is incorrect 
> format in locales like ru-RU, for example.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25978) Pyspark can only be used in spark-submit in spark-py docker image for kubernetes

2018-11-08 Thread Maxime Nannan (JIRA)
Maxime Nannan created SPARK-25978:
-

 Summary: Pyspark can only be used in spark-submit in spark-py 
docker image for kubernetes
 Key: SPARK-25978
 URL: https://issues.apache.org/jira/browse/SPARK-25978
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Maxime Nannan


Currently in spark-py docker image for kubernetes defined by the Dockerfile in 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile,
 the PYTHONPATH is defined as follows: 
{code:java}
ENV PYTHONPATH 
${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-*.zip{code}
I think the problem is that PYTHONPATH does not support wildcards, so py4j 
cannot be imported with the default PYTHONPATH, and pyspark cannot be imported 
either, as it needs py4j.
This does not impact spark-submit of python files because py4j is dynamically 
added to PYTHONPATH when running python process in 
core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala.

 

It's not really an issue, as the main purpose of that docker image is to be run 
as the driver or executors on k8s, but it's worth mentioning.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25983) spark-sql-kafka-0-10 no longer works with Kafka 0.10.0

2018-11-08 Thread Alexander Bessonov (JIRA)
Alexander Bessonov created SPARK-25983:
--

 Summary: spark-sql-kafka-0-10 no longer works with Kafka 0.10.0
 Key: SPARK-25983
 URL: https://issues.apache.org/jira/browse/SPARK-25983
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Alexander Bessonov


Package {{org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0}} is no longer 
compatible with {{org.apache.kafka:kafka_2.11:0.10.0.1}}.

When both packages are used in the same project, the following exception occurs:
{code:java}
java.lang.NoClassDefFoundError: 
org/apache/kafka/common/protocol/SecurityProtocol
 at kafka.server.Defaults$.(KafkaConfig.scala:125)
 at kafka.server.Defaults$.(KafkaConfig.scala)
 at kafka.log.Defaults$.(LogConfig.scala:33)
 at kafka.log.Defaults$.(LogConfig.scala)
 at kafka.log.LogConfig$.(LogConfig.scala:152)
 at kafka.log.LogConfig$.(LogConfig.scala)
 at kafka.server.KafkaConfig$.(KafkaConfig.scala:265)
 at kafka.server.KafkaConfig$.(KafkaConfig.scala)
 at kafka.server.KafkaConfig.(KafkaConfig.scala:759)
 at kafka.server.KafkaConfig.(KafkaConfig.scala:761)
{code}
 

This exception is caused by an incompatible dependency pulled in by Spark: 
{{org.apache.kafka:kafka-clients_2.11:2.0.0}}.
 

The following workaround could be used to resolve the problem in my project:
{code:java}
dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "0.10.0.1"
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24421) sun.misc.Unsafe in JDK11

2018-11-08 Thread Alan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680355#comment-16680355
 ] 

Alan commented on SPARK-24421:
--

The comment that sun.misc.Unsafe is private and not accessible in JDK 9 or 
newer releases is not correct. When compiling or running code on the class path 
then sun.misc.Unsafe works as it did in JDK 8 and older releases. Yes, you need 
to use core reflection to get the Unsafe instance but this is no different to 
JDK 8 and older. If developing a module then the module should `requires 
jdk.unsupported` as per the description. The jdk.unsupported module opens the 
sun.misc package so you can use reflection to get at the Unsafe instances in 
the same way as code on the class path.

As regards freeing the memory underlying a reachable direct buffer, this is 
always very dangerous, as further access to the buffer will lead to a crash and 
security issues. So anything doing this needs to be really careful and 
immediately discard all references to the Buffer object. There is no need to 
hack private fields to get at Cleaner objects with JDK 9 or newer; instead, look 
at the Unsafe invokeCleaner method, which will do what you want. The comment 
(from Kris Mo?) suggests that sun.misc.Cleaner still exists in JDK 9 - that 
isn't so, it was removed in JDK 9 as part of clearing out sun.misc. So I 
suspect Kris may be looking at a JDK 8 or older build instead. The Unsafe 
invokeCleaner API works the same in the OpenJDK builds from jdk.net as it does 
with Oracle JDK builds, there are no differences.
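For illustration, a minimal sketch of that approach (the reflective lookup of theUnsafe is the usual JDK 8 trick; invokeCleaner(ByteBuffer) is the JDK 9+ API mentioned above, so this only compiles against JDK 9 or newer):

{code:java}
import java.nio.ByteBuffer

// Grab the shared Unsafe instance reflectively, as on JDK 8.
val unsafeField = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
unsafeField.setAccessible(true)
val unsafe = unsafeField.get(null).asInstanceOf[sun.misc.Unsafe]

// Free the memory behind a direct or mapped buffer; the buffer must never be
// touched again afterwards.
def dispose(buffer: ByteBuffer): Unit =
  if (buffer != null && buffer.isDirect) {
    unsafe.invokeCleaner(buffer) // throws IllegalArgumentException for non-direct buffers
  }
{code}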

 

 

> sun.misc.Unsafe in JDK11
> 
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> Many internal APIs such as unsafe are encapsulated in JDK9+, see 
> http://openjdk.java.net/jeps/260 for detail.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24421) sun.misc.Unsafe in JDK11

2018-11-08 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680169#comment-16680169
 ] 

Sean Owen edited comment on SPARK-24421 at 11/8/18 7:56 PM:


I've found that, actually, we can't even access clean() with reflection. See 
[https://stackoverflow.com/questions/41265266/how-to-solve-inaccessibleobjectexception-unable-to-make-member-accessible-m]
 for example. It works but only if the JVM is run with a flag like 
\{{--add-opens java.base/java.lang=ALL-UNNAMED}} . 

We do indeed have to write code that works on Java 8 and 11. We will have to 
continue to compile with Java 8; JVM won't run any code compiled for a later 
version (the old UnsupportedClassVersionError), so, no, we can't compile with 
Java 11 and run on Java 8.

But compiling with Java 8 should be fine as Java 11 can read it; we just can't 
access Java 9+ classes without reflection. It's easy enough to resolve the 
_compile_ problems here, and yes, it will still all work on Java 8 like today. 
The problem is running on Java 11 right now.

I'm going to go ahead and open a pull request that fixes the compile issues for 
Java 11 and gets this to the point where it should run on Java 11 _if_ you set 
the flag above. That's progress at least.

The single issue here is this code in StorageUtils:
{code:java}
/**
 * Attempt to clean up a ByteBuffer if it is direct or memory-mapped. This uses 
an *unsafe* Sun
 * API that will cause errors if one attempts to read from the disposed buffer. 
However, neither
 * the bytes allocated to direct buffers nor file descriptors opened for 
memory-mapped buffers put
 * pressure on the garbage collector. Waiting for garbage collection may lead 
to the depletion of
 * off-heap memory or huge numbers of open files. There's unfortunately no 
standard API to
 * manually dispose of these kinds of buffers.
 */
def dispose(buffer: ByteBuffer): Unit = {
  if (buffer != null && buffer.isInstanceOf[MappedByteBuffer]) {
logTrace(s"Disposing of $buffer")
cleanDirectBuffer(buffer.asInstanceOf[DirectBuffer])
  }
}

private def cleanDirectBuffer(buffer: DirectBuffer): Unit = {
  val cleaner = buffer.cleaner()
  if (cleaner != null) {
cleaner.clean()
  }
}
{code}

I wonder how bad it is if this simply isn't accessed? Sounds bad. Not strictly 
fatal but bad. This means it all still _runs_ in Java 11, or should, even if 
this method can't be invoked.

But is there any reason to think this kind of low-level intervention in the 
ByteBuffer wouldn't be needed in Java 11 anyway? I doubt it, but I wonder.

EDIT: Kris Mo here notes that, at least, OpenJDK 9+ still has sun.misc.Cleaner 
and has a nice invokeCleaner() method which we could use. But this doesn't seem 
to work for Oracle JDKs.


was (Author: srowen):
I've found that, actually, we can't even access clean() with reflection. See 
[https://stackoverflow.com/questions/41265266/how-to-solve-inaccessibleobjectexception-unable-to-make-member-accessible-m]
 for example. It works but only if the JVM is run with a flag like 
\{{--add-opens java.base/java.lang=ALL-UNNAMED}} . 

We do indeed have to write code that works on Java 8 and 11. We will have to 
continue to compile with Java 8; JVM won't run any code compiled for a later 
version (the old UnsupportedClassVersionError), so, no, we can't compile with 
Java 11 and run on Java 8.

But compiling with Java 8 should be fine as Java 11 can read it; we just can't 
access Java 9+ classes without reflection. It's easy enough to resolve the 
_compile_ problems here, and yes, it will still all work on Java 8 like today. 
The problem is running on Java 11 right now.

I'm going to go ahead and open a pull request that fixes the compile issues for 
Java 11 and gets this to the point where it should run on Java 11 _if_ you set 
the flag above. That's progress at least.

The single issue here is this code in StorageUtils:
{code:java}
/**
 * Attempt to clean up a ByteBuffer if it is direct or memory-mapped. This uses 
an *unsafe* Sun
 * API that will cause errors if one attempts to read from the disposed buffer. 
However, neither
 * the bytes allocated to direct buffers nor file descriptors opened for 
memory-mapped buffers put
 * pressure on the garbage collector. Waiting for garbage collection may lead 
to the depletion of
 * off-heap memory or huge numbers of open files. There's unfortunately no 
standard API to
 * manually dispose of these kinds of buffers.
 */
def dispose(buffer: ByteBuffer): Unit = {
  if (buffer != null && buffer.isInstanceOf[MappedByteBuffer]) {
logTrace(s"Disposing of $buffer")
cleanDirectBuffer(buffer.asInstanceOf[DirectBuffer])
  }
}

private def cleanDirectBuffer(buffer: DirectBuffer): Unit = {
  val cleaner = buffer.cleaner()
  if (cleaner != null) {
cleaner.clean()
  }
}
{code}

I wonder how bad it is if this simply isn't accessed? Sounds bad. Not strictly 

[jira] [Commented] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679782#comment-16679782
 ] 

Apache Spark commented on SPARK-25975:
--

User 'Willymontaz' has created a pull request for this issue:
https://github.com/apache/spark/pull/22981

> Spark History does not display necessarily the incomplete applications when 
> requested
> -
>
> Key: SPARK-25975
> URL: https://issues.apache.org/jira/browse/SPARK-25975
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> Filtering of incomplete applications is made in javascript against the 
> response returned by the API. The problem is that if the returned result is 
> not big enough (because of spark.history.ui.maxApplications), it might not 
> contain incomplete applications. 
> We can call the API with status RUNNING or COMPLETED depending on the view we 
> want to fix this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25897) Cannot run k8s integration tests in sbt

2018-11-08 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25897:
--

Assignee: Marcelo Vanzin

> Cannot run k8s integration tests in sbt
> ---
>
> Key: SPARK-25897
> URL: https://issues.apache.org/jira/browse/SPARK-25897
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the k8s integration tests use maven, which makes it a little 
> awkward to run them if you use sbt for your day-to-day development. We should 
> hook them up to the sbt build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25975) Spark History does not display necessarily the incomplete applications when requested

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25975:


Assignee: Apache Spark

> Spark History does not display necessarily the incomplete applications when 
> requested
> -
>
> Key: SPARK-25975
> URL: https://issues.apache.org/jira/browse/SPARK-25975
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Assignee: Apache Spark
>Priority: Minor
> Attachments: fix.patch
>
>
> Filtering of incomplete applications is made in javascript against the 
> response returned by the API. The problem is that if the returned result is 
> not big enough (because of spark.history.ui.maxApplications), it might not 
> contain incomplete applications. 
> We can call the API with status RUNNING or COMPLETED depending on the view we 
> want to fix this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25961) Random numbers are not supported when handling data skew

2018-11-08 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679237#comment-16679237
 ] 

Dongjoon Hyun commented on SPARK-25961:
---

[~zengxl]. Please use English in Apache Spark JIRA.

> Random numbers are not supported when handling data skew
> ---
>
> Key: SPARK-25961
> URL: https://issues.apache.org/jira/browse/SPARK-25961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: spark on yarn 2.3.1
>Reporter: zengxl
>Priority: Major
>
> Two tables are joined and one of them has null values in the join key; I add a random number in place of the null join key, but it is rejected:
> Error in query: nondeterministic expressions are only allowed in
> Project, Filter, Aggregate or Window, found
> Looking at the source code, the SQL check happens in org.apache.spark.sql.catalyst.analysis.CheckAnalysis; because the random number is a non-deterministic value, it is prohibited.
> case o if o.expressions.exists(!_.deterministic) &&
>  !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>  !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] =>
>  // The rule above is used to check Aggregate operator.
>  failAnalysis(
>  s"""nondeterministic expressions are only allowed in
>  |Project, Filter, Aggregate or Window, found:
>  | ${o.expressions.map(_.sql).mkString(",")}
>  |in operator ${operator.simpleString}
>  """.stripMargin)
> Would it be enough to add the Join case to this code? It has not been tested yet.
> case o if o.expressions.exists(!_.deterministic) &&
>  !o.isInstanceOf[Project] && !o.isInstanceOf[Filter] &&
>  !o.isInstanceOf[Aggregate] && !o.isInstanceOf[Window] +{color:#d04437}&& 
> !o.isInstanceOf[Join]{color}+ =>
>  // The rule above is used to check Aggregate operator.
>  failAnalysis(
>  s"""nondeterministic expressions are only allowed in
>  |Project, Filter, Aggregate or Window or Join, found:
>  | ${o.expressions.map(_.sql).mkString(",")}
>  |in operator ${operator.simpleString}
>  """.stripMargin)
>  
> My SQL:
> SELECT
> T1.CUST_NO AS CUST_NO ,
> T3.CON_LAST_NAME AS CUST_NAME ,
> T3.CON_SEX_MF AS SEX_CODE ,
> T3.X_POSITION AS POST_LV_CODE 
> FROM tmp.ICT_CUST_RANGE_INFO T1
> LEFT join tmp.F_CUST_BASE_INFO_ALL T3 ON CASE WHEN coalesce(T1.CUST_NO,'') 
> ='' THEN concat('cust_no',RAND()) ELSE T1.CUST_NO END = T3.BECIF and 
> T3.DATE='20181105'
> WHERE T1.DATE='20181105'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25958) error: [Errno 97] Address family not supported by protocol in dataframe.take()

2018-11-08 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679261#comment-16679261
 ] 

Ruslan Dautkhanov commented on SPARK-25958:
---

[~XuanYuan] interesting.. here's our /etc/hosts:
{quote}127.0.0.1   localhost localhost.localdomain localhost4 
localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
{quote}

Notice we have ipv6 stuff there, but ipv6 is disabled for us.

I will comment out `::1` and try again. 

Was it the fix for you too? 

> error: [Errno 97] Address family not supported by protocol in dataframe.take()
> --
>
> Key: SPARK-25958
> URL: https://issues.apache.org/jira/browse/SPARK-25958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Following error happens on a heavy Spark job after 4 hours of runtime..
> {code}
> 2018-11-06 14:35:56,604 - data_vault.py - ERROR - Exited with exception: 
> [Errno 97] Address family not supported by protocol
> Traceback (most recent call last):
>   File "/home/mwincek/svn/data_vault/data_vault.py", line 64, in data_vault
> item.create_persistent_data()
>   File "/home/mwincek/svn/data_vault/src/table_recipe/amf_table_recipe.py", 
> line 53, in create_persistent_data
> single_obj.create_persistent_data()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 21, in create_persistent_data
> main_df = self.generate_dataframe_main()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/table_processing.py", line 
> 98, in generate_dataframe_main
> raw_disc_dv_df = self.get_raw_data_with_metadata_and_aggregation()
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 16, in get_raw_data_with_metadata_and_aggregation
> main_df = 
> self.get_dataframe_using_binary_date_aggregation_on_dataframe(input_df=raw_disc_dv_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 60, in get_dataframe_using_binary_date_aggregation_on_dataframe
> return_df = self.get_dataframe_from_binary_value_iteration(input_df)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 136, in get_dataframe_from_binary_value_iteration
> combine_df = self.get_dataframe_from_binary_value(input_df=input_df, 
> binary_value=count)
>   File 
> "/home/mwincek/svn/data_vault/src/table_processing/satellite_binary_dates_table_processing.py",
>  line 154, in get_dataframe_from_binary_value
> if len(results_of_filter_df.take(1)) == 0:
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 504, in take
> return self.limit(num).collect()
>   File 
> "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/sql/dataframe.py", 
> line 467, in collect
> return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
>   File "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/rdd.py", line 
> 148, in _load_from_socket
> sock = socket.socket(af, socktype, proto)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/socket.py", line 191, in 
> __init__
> _sock = _realsocket(family, type, proto)
> error: [Errno 97] Address family not supported by protocol
> {code}
> Looking at the failing line in lib/spark2/python/pyspark/rdd.py, line 148:
> {code}
> def _load_from_socket(sock_info, serializer):
> port, auth_secret = sock_info
> sock = None
> # Support for both IPv4 and IPv6.
> # On most of IPv6-ready systems, IPv6 will take precedence.
> for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, 
> socket.SOCK_STREAM):
> af, socktype, proto, canonname, sa = res
> sock = socket.socket(af, socktype, proto)
> try:
> sock.settimeout(15)
> sock.connect(sa)
> except socket.error:
> sock.close()
> sock = None
> continue
> break
> if not sock:
> raise Exception("could not open socket")
> # The RDD materialization time is unpredicable, if we set a timeout for 
> socket reading
> # operation, it will very possibly fail. See SPARK-18281.
> sock.settimeout(None)
> sockfile = sock.makefile("rwb", 65536)
> do_server_auth(sockfile, auth_secret)
> # The socket will be automatically closed when garbage-collected.
> return serializer.load_stream(sockfile)
> {code}
> the culprit is in lib/spark2/python/pyspark/rdd.py in this line 
> {code}
> socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM)
> {code}

[jira] [Commented] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679639#comment-16679639
 ] 

Yuming Wang commented on SPARK-25973:
-

Please create a pull request: https://github.com/apache/spark/pulls

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether it is 
> displaying incomplete or complete applications) to check if it must display 
> the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  
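For illustration, a minimal sketch of the suggested change; the class and predicate
names below are stand-ins, not the actual HistoryPage.scala code:

{code}
// Stand-in for the history server's application listing.
case class AppInfo(id: String, completed: Boolean)

def shouldShowDataTable(apps: Iterator[AppInfo], requestedIncomplete: Boolean): Boolean = {
  // Before: apps.count(_.completed != requestedIncomplete) > 0 walks the whole iterator.
  // After: exists() returns as soon as one matching application is found.
  apps.exists(_.completed != requestedIncomplete)
}
{code}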



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25971) Ignore partition byte-size statistics in SQLQueryTestSuite

2018-11-08 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25971:
--
Comment: was deleted

(was: User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22972)

> Ignore partition byte-size statistics in SQLQueryTestSuite
> --
>
> Key: SPARK-25971
> URL: https://issues.apache.org/jira/browse/SPARK-25971
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Currently, `SQLQueryTestSuite` is sensitive to the byte sizes of the Parquet 
> files in table partitions. If we change the default file format (from Parquet 
> to ORC) or update their metadata, the test cases have to be changed 
> accordingly. This issue aims to make `SQLQueryTestSuite` more robust by 
> ignoring the partition byte-size statistics.
> {code}
> -Partition Statistics   1144 bytes, 2 rows
> +Partition Statistics   [not included in comparison] bytes, 2 rows
> {code}
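A minimal sketch of the kind of normalization this implies; the helper name and
regex are illustrative, not the actual patch:

{code}
// Mask the volatile byte count in "Partition Statistics" lines before comparing
// against golden files, so a change of the default file format does not break them.
def maskPartitionStats(line: String): String =
  line.replaceAll("""\d+ bytes""", "[not included in comparison] bytes")

maskPartitionStats("Partition Statistics\t1144 bytes, 2 rows")
// => "Partition Statistics\t[not included in comparison] bytes, 2 rows"
{code}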



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25973) Spark History Main page performance improvement

2018-11-08 Thread William Montaz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679853#comment-16679853
 ] 

William Montaz commented on SPARK-25973:


New pull request on master branch https://github.com/apache/spark/pull/22982

> Spark History Main page performance improvement
> ---
>
> Key: SPARK-25973
> URL: https://issues.apache.org/jira/browse/SPARK-25973
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: William Montaz
>Priority: Minor
> Attachments: fix.patch
>
>
> HistoryPage.scala counts applications (with a predicate depending on whether it is 
> displaying incomplete or complete applications) to check if it must display 
> the dataTable.
> Since it only checks whether allAppsSize > 0, we could use the exists method on the 
> iterator. This way we stop iterating at the first occurrence found.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25970) Add Instrumentation to PrefixSpan

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25970:


Assignee: (was: Apache Spark)

> Add Instrumentation to PrefixSpan
> -
>
> Key: SPARK-25970
> URL: https://issues.apache.org/jira/browse/SPARK-25970
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> Add Instrumentation to PrefixSpan



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23831) Add org.apache.derby to IsolatedClientLoader

2018-11-08 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679444#comment-16679444
 ] 

Hyukjin Kwon commented on SPARK-23831:
--

This was reverted in 2.4.1 and 3.0.0

> Add org.apache.derby to IsolatedClientLoader
> 
>
> Key: SPARK-23831
> URL: https://issues.apache.org/jira/browse/SPARK-23831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Add org.apache.derby to IsolatedClientLoader, otherwise it may throw an 
> exception:
> {noformat}
> [info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' 
> with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see 
> the next exception for details.
> [info] at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
> Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
> [info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
> {noformat}
> How to reproduce:
> {noformat}
> sed 's/HiveExternalCatalogSuite/HiveExternalCatalog2Suite/g' 
> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala
>  > 
> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalog2Suite.scala
> build/sbt -Phive "hive/test-only *.HiveExternalCatalogSuite 
> *.HiveExternalCatalog2Suite"
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25970) Add Instrumentation to PrefixSpan

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679367#comment-16679367
 ] 

Apache Spark commented on SPARK-25970:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/22971

> Add Instrumentation to PrefixSpan
> -
>
> Key: SPARK-25970
> URL: https://issues.apache.org/jira/browse/SPARK-25970
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> Add Instrumentation to PrefixSpan



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25904) Avoid allocating arrays too large for JVMs

2018-11-08 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-25904:
-
Fix Version/s: 2.4.1

> Avoid allocating arrays too large for JVMs
> --
>
> Key: SPARK-25904
> URL: https://issues.apache.org/jira/browse/SPARK-25904
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> In a few places Spark can try to allocate arrays as big as {{Int.MaxValue}}, 
> but that's actually too big for the JVM. We should consistently use 
> {{ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}} instead.
> In some cases this means changing defaults for configs, in some cases the bounds 
> on a config, and in others just improving error messages for things that still 
> won't work.
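A rough sketch of the pattern; the helper is illustrative, but the constant is the
real one from org.apache.spark.unsafe.array.ByteArrayMethods:

{code}
import org.apache.spark.unsafe.array.ByteArrayMethods

// Cap a requested capacity at what the JVM can actually allocate, instead of
// letting it grow to Int.MaxValue and fail at allocation time.
def safeArraySize(requested: Long): Int = {
  require(requested >= 0, s"negative size: $requested")
  math.min(requested, ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH.toLong).toInt
}

safeArraySize(Int.MaxValue.toLong)  // slightly below Int.MaxValue, safe for the JVM
{code}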



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25972) Missed JSON options in streaming.py

2018-11-08 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25972:
--

 Summary: Missed JSON options in streaming.py 
 Key: SPARK-25972
 URL: https://issues.apache.org/jira/browse/SPARK-25972
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


streaming.py is missing the following JSON options compared to readwrite.py (a short sketch follows the list):
- dropFieldIfAllNull
- encoding
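
For reference, a minimal Scala sketch of passing these as generic options (the path
and schema are placeholders, and whether each option takes effect for a streaming
read with an explicit schema is for the reader to verify):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("json-stream-options").getOrCreate()
val schema = new StructType().add("id", LongType).add("name", StringType)

// streaming.py would expose these as named parameters of DataStreamReader.json;
// until then they can still be set via option().
val stream = spark.readStream
  .schema(schema)
  .option("dropFieldIfAllNull", "true")
  .option("encoding", "UTF-8")
  .json("/path/to/json/dir")
{code}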



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24834) Utils#nanSafeCompare{Double,Float} functions do not differ from normal java double/float comparison

2018-11-08 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680290#comment-16680290
 ] 

Sean Owen commented on SPARK-24834:
---

The goal is to match Hive semantics, if anything. And of course to have 
internally-consistent behavior. Matching previous Spark SQL behavior is of 
course important too, even across major releases, but can be broken where 
appropriate. 
Matching another DB's semantics is fine if Hive is 'silent' on behavior. Here I 
am not clear what Hive does, but if it is different, yes, it should probably be 
fixed in 3.0 unless it would really cause big pain for Spark SQL workloads 
today. If it isn't what Hive does, I think we have to leave it.

> Utils#nanSafeCompare{Double,Float} functions do not differ from normal java 
> double/float comparison
> ---
>
> Key: SPARK-24834
> URL: https://issues.apache.org/jira/browse/SPARK-24834
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Benjamin Duffield
>Priority: Minor
>
> Utils.scala contains two functions `nanSafeCompareDoubles` and 
> `nanSafeCompareFloats` which purport to have special handling of NaN values 
> in comparisons.
> The handling in these functions does not appear to differ from 
> java.lang.Double.compare and java.lang.Float.compare - they seem to produce 
> output identical to the built-in Java comparison functions.
> I think it's clearer not to have these special Utils functions, and instead 
> just use the standard Java comparison functions.
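A small check that makes the claim concrete; this is plain JDK behavior, not Spark code:

{code}
// java.lang.Double.compare / Float.compare already treat NaN as equal to itself
// and greater than every other value, which is the behavior the nanSafe* helpers
// describe.
assert(java.lang.Double.compare(Double.NaN, Double.NaN) == 0)
assert(java.lang.Double.compare(Double.NaN, Double.PositiveInfinity) > 0)
assert(java.lang.Double.compare(1.0, Double.NaN) < 0)
assert(java.lang.Float.compare(Float.NaN, 0.0f) > 0)
{code}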



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25977) Parsing decimals from CSV using locale

2018-11-08 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25977:


Assignee: Apache Spark

> Parsing decimals from CSV using locale
> --
>
> Key: SPARK-25977
> URL: https://issues.apache.org/jira/browse/SPARK-25977
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Support the locale option for parsing decimals from CSV input. Currently the CSV 
> parser can only handle decimals whose decimal separator is a dot ('.'), which is 
> the wrong format for locales such as ru-RU.
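A small illustration of the JDK behavior such an option would build on; this uses
plain java.text and is not the actual Spark change:

{code}
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

// In ru-RU the decimal separator is a comma, so "12,34" means twelve point three four.
val symbols = new DecimalFormatSymbols(new Locale("ru", "RU"))
val format  = new DecimalFormat("#,##0.#", symbols)
format.setParseBigDecimal(true)

format.parse("12,34")  // => 12.34 as a java.math.BigDecimal
{code}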



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25965) Add read benchmark for Avro

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679331#comment-16679331
 ] 

Apache Spark commented on SPARK-25965:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22966

> Add read benchmark for Avro
> ---
>
> Key: SPARK-25965
> URL: https://issues.apache.org/jira/browse/SPARK-25965
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Add a read benchmark for Avro, which has been missing for a while.
> The benchmark is similar to DataSourceReadBenchmark and OrcReadBenchmark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25974) Optimizes Generates bytecode for ordering based on the given order

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679581#comment-16679581
 ] 

Apache Spark commented on SPARK-25974:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22976

> Optimizes Generates bytecode for ordering based on the given order
> --
>
> Key: SPARK-25974
> URL: https://issues.apache.org/jira/browse/SPARK-25974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> Currently, when generating the code for ordering based on the given order, too 
> many variables and assignment statements are generated, which is not 
> necessary. This PR eliminates the redundant variables and optimizes the 
> generated ordering bytecode.
> The generated code looks like:
> spark.range(1).selectExpr(
>  "id as key",
>  "(id & 1023) as value1",
> "cast(id & 1023 as double) as value2",
> "cast(id & 1023 as int) as value3"
> ).select("value1", "value2", "value3").orderBy("value1", "value2").collect()
> before PR(codegen size: 178)
> Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, 
> double, false] ASC NULLS FIRST:
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */ this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */   public int compare(InternalRow a, InternalRow b) {
> /* 016 */
> /* 017 */ InternalRow i = null;
> /* 018 */
> /* 019 */ i = a;
> /* 020 */ boolean isNullA_0;
> /* 021 */ long primitiveA_0;
> /* 022 */ {
> /* 023 */   long value_0 = i.getLong(0);
> /* 024 */   isNullA_0 = false;
> /* 025 */   primitiveA_0 = value_0;
> /* 026 */ }
> /* 027 */ i = b;
> /* 028 */ boolean isNullB_0;
> /* 029 */ long primitiveB_0;
> /* 030 */ {
> /* 031 */   long value_0 = i.getLong(0);
> /* 032 */   isNullB_0 = false;
> /* 033 */   primitiveB_0 = value_0;
> /* 034 */ }
> /* 035 */ if (isNullA_0 && isNullB_0) {
> /* 036 */   // Nothing
> /* 037 */ } else if (isNullA_0) {
> /* 038 */   return -1;
> /* 039 */ } else if (isNullB_0) {
> /* 040 */   return 1;
> /* 041 */ } else {
> /* 042 */   int comp = (primitiveA_0 > primitiveB_0 ? 1 : primitiveA_0 < 
> primitiveB_0 ? -1 : 0);
> /* 043 */   if (comp != 0) {
> /* 044 */ return comp;
> /* 045 */   }
> /* 046 */ }
> /* 047 */
> /* 048 */ i = a;
> /* 049 */ boolean isNullA_1;
> /* 050 */ double primitiveA_1;
> /* 051 */ {
> /* 052 */   double value_1 = i.getDouble(1);
> /* 053 */   isNullA_1 = false;
> /* 054 */   primitiveA_1 = value_1;
> /* 055 */ }
> /* 056 */ i = b;
> /* 057 */ boolean isNullB_1;
> /* 058 */ double primitiveB_1;
> /* 059 */ {
> /* 060 */   double value_1 = i.getDouble(1);
> /* 061 */   isNullB_1 = false;
> /* 062 */   primitiveB_1 = value_1;
> /* 063 */ }
> /* 064 */ if (isNullA_1 && isNullB_1) {
> /* 065 */   // Nothing
> /* 066 */ } else if (isNullA_1) {
> /* 067 */   return -1;
> /* 068 */ } else if (isNullB_1) {
> /* 069 */   return 1;
> /* 070 */ } else {
> /* 071 */   int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA_1, primitiveB_1);
> /* 072 */   if (comp != 0) {
> /* 073 */ return comp;
> /* 074 */   }
> /* 075 */ }
> /* 076 */
> /* 077 */
> /* 078 */ return 0;
> /* 079 */   }
> /* 080 */
> /* 081 */
> /* 082 */ }
> After PR(codegen size: 89)
> Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, 
> double, false] ASC NULLS FIRST:
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */ this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */   public int compare(InternalRow a, InternalRow b) {
> /* 016 */
> /* 017 */
> /* 018 */ long value_0 = a.getLong(0);
> /* 019 */ long value_2 = b.getLong(0);
> /* 020 */ if (false && false) {
> /* 021

[jira] [Commented] (SPARK-20156) Java String toLowerCase "Turkish locale bug" causes Spark problems

2018-11-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679542#comment-16679542
 ] 

Apache Spark commented on SPARK-20156:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/22975

> Java String toLowerCase "Turkish locale bug" causes Spark problems
> --
>
> Key: SPARK-20156
> URL: https://issues.apache.org/jira/browse/SPARK-20156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.1.0
> Environment: Ubunutu 16.04
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
>Reporter: Serkan Taş
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.2.0
>
> Attachments: sprk_shell.txt
>
>
> If the regional setting of the operating system is Turkish, the famous Java 
> locale problem occurs (https://jira.atlassian.com/browse/CONF-5931 or 
> https://issues.apache.org/jira/browse/AVRO-1493). 
> e.g.: 
> "SERDEINFO" lowercases to "serdeınfo"
> "uniquetable" uppercases to "UNİQUETABLE"
> Workaround: 
> add -Duser.country=US -Duser.language=en to the end of the line 
> SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"
> in spark-shell.sh
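The underlying JDK behavior, shown directly; the workaround above pins an English
locale, while Locale.ROOT is the locale-independent choice for internal identifiers:

{code}
import java.util.Locale

val tr = new Locale("tr", "TR")

"SERDEINFO".toLowerCase(tr)           // "serdeınfo"  (dotless ı)
"uniquetable".toUpperCase(tr)         // "UNİQUETABLE" (dotted İ)
"SERDEINFO".toLowerCase(Locale.ROOT)  // "serdeinfo"
{code}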



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2018-11-08 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679870#comment-16679870
 ] 

Maxim Gekk commented on SPARK-24540:


The restriction has already been fixed in uniVocity, at least as of 2.8.0; see 
https://github.com/uniVocity/univocity-parsers/issues/275 and 
https://github.com/uniVocity/univocity-parsers/issues/209 . [~hyukjin.kwon] 
This is one of the examples where multiple values for an option would be useful.

> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data 
> only supports a single-character delimiter. If we try to provide a multi-character 
> delimiter, we observe the following error message.
> e.g.: Dataset df = spark.read().option("inferSchema", "true")
>                                 .option("header", "false")
>                                 .option("delimiter", ", ")
>                                 .csv("C:\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed contains multi-character delimiters, and 
> presently we need to do a manual clean-up of the source/input file, which doesn't 
> work well for large applications that consume numerous files.
> There is a workaround of reading the data as text and splitting it manually (a 
> sketch follows below), but in my opinion this defeats the purpose, advantage, and 
> efficiency of a direct read from a CSV file.
>  
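For completeness, a rough sketch of the text-plus-split workaround mentioned above;
the column names and the two-character delimiter ", " are made up for the example:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder().master("local[*]").appName("multi-char-delimiter").getOrCreate()

// Read each line as raw text, then split on the multi-character delimiter.
val raw   = spark.read.text("C:\\test.txt")                  // single column named "value"
val parts = raw.select(split(col("value"), ", ").as("cols")) // the pattern is a Java regex
val df    = parts.select(
  col("cols").getItem(0).as("c0"),
  col("cols").getItem(1).as("c1"),
  col("cols").getItem(2).as("c2"))
{code}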



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


