[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-15 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514696#comment-16514696
 ] 

Takeshi Yamamuro commented on SPARK-24498:
--

I'm also interested in this, so I'll look into it. If there is anything I can 
do, let me know.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, the JDK compiler can generate smaller bytecode and take less time 
> to compile than Janino. However, in other cases, Janino is better. We should 
> support both for our runtime codegen. Janino will still be our default runtime 
> codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696
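
For readers unfamiliar with the proposal: the JDK ships a compiler API in javax.tools that can
compile generated Java source at runtime, which is the alternative backend being discussed here.
Below is a minimal, self-contained sketch of that API in isolation; the class name GeneratedCode
and the temp-directory setup are illustrative only and are not Spark's actual codegen internals.

{code:scala}
import java.nio.file.Files
import javax.tools.ToolProvider

object JdkCompilerSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical generated source; Spark's real codegen emits very different classes.
    val source =
      """public class GeneratedCode {
        |  public static int addOne(int x) { return x + 1; }
        |}""".stripMargin

    // Write the source to a temp dir so javac can pick it up from disk.
    val dir = Files.createTempDirectory("codegen")
    val file = dir.resolve("GeneratedCode.java")
    Files.write(file, source.getBytes("UTF-8"))

    // ToolProvider returns null when running on a JRE that has no compiler.
    val compiler = ToolProvider.getSystemJavaCompiler
    require(compiler != null, "No system Java compiler available (running on a JRE?)")

    // Exit code 0 means GeneratedCode.class was produced next to the source file.
    val exitCode = compiler.run(null, null, null, file.toString)
    println(s"javac exit code: $exitCode")
  }
}
{code}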



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-06-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514687#comment-16514687
 ] 

Marco Gaido commented on SPARK-23901:
-

These functions can be used like any other function in Hive; they are not just 
there for the Hive authorizer. I think the use case for them is to anonymize 
data for privacy reasons (e.g. exposing/exporting data to other parties without 
disclosing sensitive values, while still being able to use the columns in joins).
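
As a rough illustration of that join-friendly anonymization idea -- using Spark's existing
sha2 built-in rather than the proposed mask_* functions, which follow Hive's semantics -- a
hashed column can be shared without exposing the raw value while still joining consistently.
Table and column names below are made up for the example.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sha2}

val spark = SparkSession.builder().appName("masking-use-case").getOrCreate()
import spark.implicits._

// Hypothetical tables: customer PII and orders keyed by the same email address.
val customers = Seq(("alice@example.com", "Alice"), ("bob@example.com", "Bob"))
  .toDF("email", "name")
val orders = Seq(("alice@example.com", 42.0), ("bob@example.com", 13.5))
  .toDF("email", "amount")

// Replace the sensitive key with a one-way hash before exporting; both sides hash
// the same way, so the join still works without revealing the raw emails.
val maskedCustomers = customers.withColumn("email_masked", sha2(col("email"), 256)).drop("email")
val maskedOrders = orders.withColumn("email_masked", sha2(col("email"), 256)).drop("email")

maskedCustomers.join(maskedOrders, "email_masked").show(false)
{code}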

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-06-15 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514685#comment-16514685
 ] 

Wenchen Fan commented on SPARK-23901:
-

According to the Hive JIRA, these are used for Hive authorization; they are not 
general functions and it seems they can't be applied to Spark.
[~smilegator] [~mgaido] do you know of any use case for these functions?

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2018-06-15 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514684#comment-16514684
 ] 

Takeshi Yamamuro edited comment on SPARK-24540 at 6/16/18 5:28 AM:
---

This is probably a restriction of the univocity parser.


was (Author: maropu):
This is probably a restriction of the univocity parser. cc: [~hyukjin.kwon]

btw, why did you mark this as blocked by SPARK-17967?

> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0+ to read and split CSV 
> files/data only supports a single-character delimiter. If we try to provide a 
> delimiter of more than one character, we observe the following error message.
> eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
>                                   .option("header", "false")
>                                   .option("delimiter", ", ")
>                                   .csv("C:\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed contains multi-character delimiters, and 
> presently we need to do a manual clean-up of the source/input file, which 
> doesn't work well in large applications that consume numerous files.
> There is a work-around of reading the data as text and using the split 
> option, but in my opinion this defeats the purpose, advantage, and efficiency 
> of a direct read from a CSV file.
>  
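
The work-around mentioned at the end of the description would look roughly like the sketch
below: read the file as plain text and split each line on the multi-character delimiter
manually. The path and column names are illustrative only.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().appName("multi-char-delimiter").getOrCreate()
import spark.implicits._

// Read each line as a single string column named "value".
val lines = spark.read.textFile("C:/test.txt")

// split() takes a regex, so a multi-character delimiter such as ", " is allowed here.
val cols = lines.select(split($"value", ", ").alias("parts"))

// Pick the fields out by position; a real job would also cast types and name columns.
val df = cols.select(
  $"parts".getItem(0).alias("c0"),
  $"parts".getItem(1).alias("c1"))

df.show(false)
{code}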



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2018-06-15 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514684#comment-16514684
 ] 

Takeshi Yamamuro edited comment on SPARK-24540 at 6/16/18 5:28 AM:
---

I think this is probably a restriction of the univocity parser.


was (Author: maropu):
This is probably a restriction of the univocity parser.

> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0+ to read and split CSV 
> files/data only supports a single-character delimiter. If we try to provide a 
> delimiter of more than one character, we observe the following error message.
> eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
>                                   .option("header", "false")
>                                   .option("delimiter", ", ")
>                                   .csv("C:\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed contains multi-character delimiters, and 
> presently we need to do a manual clean-up of the source/input file, which 
> doesn't work well in large applications that consume numerous files.
> There is a work-around of reading the data as text and using the split 
> option, but in my opinion this defeats the purpose, advantage, and efficiency 
> of a direct read from a CSV file.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2018-06-15 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514684#comment-16514684
 ] 

Takeshi Yamamuro commented on SPARK-24540:
--

This is probably a restriction of the univocity parser. cc: [~hyukjin.kwon]

btw, why did you mark this as blocked by SPARK-17967?

> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0+ to read and split CSV 
> files/data only supports a single-character delimiter. If we try to provide a 
> delimiter of more than one character, we observe the following error message.
> eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
>                                   .option("header", "false")
>                                   .option("delimiter", ", ")
>                                   .csv("C:\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed contains multi-character delimiters, and 
> presently we need to do a manual clean-up of the source/input file, which 
> doesn't work well in large applications that consume numerous files.
> There is a work-around of reading the data as text and using the split 
> option, but in my opinion this defeats the purpose, advantage, and efficiency 
> of a direct read from a CSV file.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24571:


Assignee: Apache Spark

> Support literals with values of the Char type
> -
>
> Key: SPARK-24571
> URL: https://issues.apache.org/jira/browse/SPARK-24571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, Spark doesn't support literals with the Char (java.lang.Character) 
> type. For example, the following code throws an exception:
> {code}
> val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
> df.where($"city".contains('o')).show(false)
> {code}
> It fails with the exception:
> {code:java}
> Unsupported literal type class java.lang.Character o
> java.lang.RuntimeException: Unsupported literal type class 
> java.lang.Character o
> at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> {code}
> One of the possible solutions can be automatic conversion of Char literal to 
> String literal of length 1.
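
Until such a conversion exists, the caller-side fix is simply to pass a length-1 String
instead of a Char. A minimal sketch of the reported snippet with that change:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("char-literal").getOrCreate()
import spark.implicits._

val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")

// Passing "o" (a String) instead of 'o' (a Char) avoids the unsupported-literal error.
df.where($"city".contains("o")).show(false)
{code}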



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514669#comment-16514669
 ] 

Apache Spark commented on SPARK-24571:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21578

> Support literals with values of the Char type
> -
>
> Key: SPARK-24571
> URL: https://issues.apache.org/jira/browse/SPARK-24571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, Spark doesn't support literals with the Char (java.lang.Character) 
> type. For example, the following code throws an exception:
> {code}
> val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
> df.where($"city".contains('o')).show(false)
> {code}
> It fails with the exception:
> {code:java}
> Unsupported literal type class java.lang.Character o
> java.lang.RuntimeException: Unsupported literal type class 
> java.lang.Character o
> at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> {code}
> One of the possible solutions can be automatic conversion of Char literal to 
> String literal of length 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24571:


Assignee: (was: Apache Spark)

> Support literals with values of the Char type
> -
>
> Key: SPARK-24571
> URL: https://issues.apache.org/jira/browse/SPARK-24571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, Spark doesn't support literals with the Char (java.lang.Character) 
> type. For example, the following code throws an exception:
> {code}
> val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
> df.where($"city".contains('o')).show(false)
> {code}
> It fails with the exception:
> {code:java}
> Unsupported literal type class java.lang.Character o
> java.lang.RuntimeException: Unsupported literal type class 
> java.lang.Character o
> at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> {code}
> One of the possible solutions can be automatic conversion of Char literal to 
> String literal of length 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-06-15 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514667#comment-16514667
 ] 

Reynold Xin commented on SPARK-23901:
-

Why are we adding 1200 lines of code for some functions that don't even apply 
to Spark?!
 

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-24571:
---
Description: 
Currently, Spark doesn't support literals with the Char (java.lang.Character) 
type. For example, the following code throws an exception:
{code}
val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o')).show(false)
{code}
It fails with the exception:
{code:java}
Unsupported literal type class java.lang.Character o
java.lang.RuntimeException: Unsupported literal type class java.lang.Character o
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
{code}
One of the possible solutions can be automatic conversion of Char literal to 
String literal of length 1.

  was:
Currently, Spark doesn't support literals with the Char (java.lang.Character) 
type. For example, the following code throws an exception:
{code:scala}
val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o')).show(false)
{code}
It fails with the exception:
{code}
Unsupported literal type class java.lang.Character o
java.lang.RuntimeException: Unsupported literal type class java.lang.Character p
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
{code}
One of the possible solutions can be automatic conversion of Char literal to 
String literal of length 1.


> Support literals with values of the Char type
> -
>
> Key: SPARK-24571
> URL: https://issues.apache.org/jira/browse/SPARK-24571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, Spark doesn't support literals with the Char (java.lang.Character) 
> type. For example, the following code throws an exception:
> {code}
> val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
> df.where($"city".contains('o')).show(false)
> {code}
> It fails with the exception:
> {code:java}
> Unsupported literal type class java.lang.Character o
> java.lang.RuntimeException: Unsupported literal type class 
> java.lang.Character o
> at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> {code}
> One of the possible solutions can be automatic conversion of Char literal to 
> String literal of length 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24571:
--

 Summary: Support literals with values of the Char type
 Key: SPARK-24571
 URL: https://issues.apache.org/jira/browse/SPARK-24571
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Currently, Spark doesn't support literals with the Char (java.lang.Character) 
type. For example, the following code throws an exception:
{code:scala}
val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains('o')).show(false)
{code}
It fails with the exception:
{code}
Unsupported literal type class java.lang.Character o
java.lang.RuntimeException: Unsupported literal type class java.lang.Character o
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
{code}
One of the possible solutions can be automatic conversion of Char literal to 
String literal of length 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24571) Support literals with values of the Char type

2018-06-15 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514665#comment-16514665
 ] 

Maxim Gekk commented on SPARK-24571:


I am working on the improvement.

> Support literals with values of the Char type
> -
>
> Key: SPARK-24571
> URL: https://issues.apache.org/jira/browse/SPARK-24571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, Spark doesn't support literals with the Char (java.lang.Character) 
> type. For example, the following code throws an exception:
> {code:scala}
> val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
> df.where($"city".contains('o')).show(false)
> {code}
> It fails with the exception:
> {code}
> Unsupported literal type class java.lang.Character o
> java.lang.RuntimeException: Unsupported literal type class 
> java.lang.Character o
> at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
> {code}
> One of the possible solutions can be automatic conversion of Char literal to 
> String literal of length 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala

2018-06-15 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24216:

Fix Version/s: 2.3.2

> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
> ---
>
> Key: SPARK-24216
> URL: https://issues.apache.org/jira/browse/SPARK-24216
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Fangshi Li
>Assignee: Fangshi Li
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> When a user creates an aggregator object in Scala and passes it to Spark 
> Dataset's agg() method, Spark initializes TypedAggregateExpression with the 
> nodeName field set to aggregator.getClass.getSimpleName. However, getSimpleName 
> is not safe in a Scala environment, depending on how the user creates the 
> aggregator object. For example, if the aggregator class's fully qualified name 
> is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw 
> java.lang.InternalError "Malformed class name". This has been reported in 
> scalatest 
> [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] 
> and discussed in several Scala upstream JIRAs such as SI-8110 and SI-5425.
> To fix this issue, we follow the solution in 
> [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] 
> to add a safer version of getSimpleName as a util method, and 
> TypedAggregateExpression invokes this util method rather than 
> getClass.getSimpleName.
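
A rough sketch, in the spirit of the description above, of what a "safer getSimpleName"
helper can look like; the fallback parsing here is illustrative and is not the exact
utility added to Spark.

{code:scala}
object NameUtils {
  /** Like Class.getSimpleName, but tolerant of Scala-generated names such as
    * "com.my.company.MyUtils$myAgg$2$" that make the JVM throw
    * java.lang.InternalError("Malformed class name"). */
  def safeSimpleName(cls: Class[_]): String =
    try {
      cls.getSimpleName
    } catch {
      case _: InternalError =>
        // Fall back to the binary name with the package prefix stripped; for the
        // example above this yields "MyUtils$myAgg$2$" instead of throwing.
        val binaryName = cls.getName
        binaryName.substring(binaryName.lastIndexOf('.') + 1)
    }
}
{code}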



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)

2018-06-15 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514629#comment-16514629
 ] 

Takeshi Yamamuro commented on SPARK-24570:
--

Is this an issue in Spark itself?

> SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel 
> SQL, DBVisualizer.etc)
> ---
>
> Key: SPARK-24570
> URL: https://issues.apache.org/jira/browse/SPARK-24570
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: t oo
>Priority: Major
> Attachments: connect-to-sql-db-ssms-locate-table.png
>
>
> An end-user SQL client tool (i.e. the one in the screenshot) can list tables 
> from HiveServer2 and major DBs (MySQL, PostgreSQL, Oracle, MSSQL, etc.), but 
> with SparkSQL it does not display any tables. This would be very convenient 
> for users.
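
For context, client tools like these usually populate their dropdowns through standard
JDBC metadata calls such as the one sketched below; the report is that the Spark Thrift
Server does not return anything useful for them. The connection URL and credentials are
placeholders, and the Hive JDBC driver is assumed to be on the classpath.

{code:scala}
import java.sql.DriverManager

// Connect to the Spark Thrift Server the same way a SQL client tool would.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")

// This is the call GUI tools typically make to fill their schema/table dropdowns.
val tables = conn.getMetaData.getTables(null, null, "%", Array("TABLE"))
while (tables.next()) {
  println(s"${tables.getString("TABLE_SCHEM")}.${tables.getString("TABLE_NAME")}")
}
conn.close()
{code}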



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)

2018-06-15 Thread t oo (JIRA)
t oo created SPARK-24570:


 Summary: SparkSQL - show schemas/tables in dropdowns of SQL client 
tools (ie Squirrel SQL, DBVisualizer.etc)
 Key: SPARK-24570
 URL: https://issues.apache.org/jira/browse/SPARK-24570
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.1
Reporter: t oo
 Attachments: connect-to-sql-db-ssms-locate-table.png

An end-user SQL client tool (i.e. the one in the screenshot) can list tables 
from HiveServer2 and major DBs (MySQL, PostgreSQL, Oracle, MSSQL, etc.), but 
with SparkSQL it does not display any tables. This would be very convenient 
for users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)

2018-06-15 Thread t oo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

t oo updated SPARK-24570:
-
Attachment: connect-to-sql-db-ssms-locate-table.png

> SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel 
> SQL, DBVisualizer.etc)
> ---
>
> Key: SPARK-24570
> URL: https://issues.apache.org/jira/browse/SPARK-24570
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: t oo
>Priority: Major
> Attachments: connect-to-sql-db-ssms-locate-table.png
>
>
> An end-user SQL client tool (i.e. the one in the screenshot) can list tables 
> from HiveServer2 and major DBs (MySQL, PostgreSQL, Oracle, MSSQL, etc.), but 
> with SparkSQL it does not display any tables. This would be very convenient 
> for users.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21569) Internal Spark class needs to be kryo-registered

2018-06-15 Thread YongGang Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514529#comment-16514529
 ] 

YongGang Cao edited comment on SPARK-21569 at 6/15/18 11:47 PM:


It seems this cannot be worked around, at least from the Java side, unless we turn 
off the required registration, which will harm performance as documented.

I tried to register both of the following in SparkConf; no luck, I still get the 
"not registered" error message.
{code:java}
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class,
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code}


was (Author: ygcao):
It seems this cannot be worked around, at least from the Java side.

I tried to register both of the following in SparkConf; no luck, I still get the 
"not registered" error message.


{code:java}
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class,
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code}
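
For reference, the registration attempt described above would look roughly like this in
Scala, using Class.forName to reach the Scala object's inner class by its runtime name.
Whether this actually clears the error is exactly what the report questions; the class
is Spark-internal, and the ticket argues Spark should register it by default.

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    // The runtime (binary) name of the nested class uses '$', not '.'.
    Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage"),
    // The array class named in the error can be registered via its JVM descriptor.
    Class.forName("[Lorg.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage;")
  ))
{code}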

> Internal Spark class needs to be kryo-registered
> 
>
> Key: SPARK-21569
> URL: https://issues.apache.org/jira/browse/SPARK-21569
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Major
>
> [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf]
> As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when 
> {{spark.kryo.registrationRequired=true}}) with:
> {code}
> java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
> Note: To register this class use: 
> kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This internal Spark class should be kryo-registered by Spark by default.
> This was not a problem in 2.1.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21569) Internal Spark class needs to be kryo-registered

2018-06-15 Thread YongGang Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514529#comment-16514529
 ] 

YongGang Cao edited comment on SPARK-21569 at 6/15/18 11:45 PM:


It seems this cannot be worked around, at least from the Java side.

I tried to register both of the following in SparkConf; no luck, I still get the 
"not registered" error message.


{code:java}
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class,
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code}


was (Author: ygcao):
It seems this cannot be worked around, at least from the Java side.

I tried to register both of the following in SparkConf; no luck, I still get the 
"not registered" error message.
{{}}
{code:java}
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class,
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code}

> Internal Spark class needs to be kryo-registered
> 
>
> Key: SPARK-21569
> URL: https://issues.apache.org/jira/browse/SPARK-21569
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Major
>
> [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf]
> As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when 
> {{spark.kryo.registrationRequired=true}}) with:
> {code}
> java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
> Note: To register this class use: 
> kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This internal Spark class should be kryo-registered by Spark by default.
> This was not a problem in 2.1.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21569) Internal Spark class needs to be kryo-registered

2018-06-15 Thread YongGang Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514529#comment-16514529
 ] 

YongGang Cao commented on SPARK-21569:
--

It seems this cannot be worked around, at least from the Java side.

I tried to register both of the following in SparkConf; no luck, I still get the 
"not registered" error message.
{{}}
{code:java}
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class,
org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code}

> Internal Spark class needs to be kryo-registered
> 
>
> Key: SPARK-21569
> URL: https://issues.apache.org/jira/browse/SPARK-21569
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ryan Williams
>Priority: Major
>
> [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf]
> As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when 
> {{spark.kryo.registrationRequired=true}}) with:
> {code}
> java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
> Note: To register this class use: 
> kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
>   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458)
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
>   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This internal Spark class should be kryo-registered by Spark by default.
> This was not a problem in 2.1.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514526#comment-16514526
 ] 

Apache Spark commented on SPARK-24552:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21577

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.
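
A tiny, purely illustrative sketch of why the reuse is dangerous for writers that derive
output paths from the (partition, attempt) pair; the naming scheme below is hypothetical,
standing in for the truncated template in the description.

{code:scala}
// Hypothetical naming scheme for a writer that creates files in place.
def outputFile(partition: Int, attempt: Int): String =
  s"part-$partition-$attempt.parquet"

// Stage attempt 1: the task for partition 3, task attempt 0 writes and commits this file.
val committed = outputFile(partition = 3, attempt = 0)

// The stage is retried after a shuffle failure; the new task gets the SAME numbers,
// so its file collides with the committed one. If this attempt is aborted and cleans
// up "its" file, it deletes the committed data too.
val retried = outputFile(partition = 3, attempt = 0)

assert(committed == retried) // same path => cleanup of one attempt destroys the other
{code}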



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24569) Spark Aggregator with output type Option[Boolean] creates column of type Row

2018-06-15 Thread John Conwell (JIRA)
John Conwell created SPARK-24569:


 Summary: Spark Aggregator with output type Option[Boolean] creates 
column of type Row
 Key: SPARK-24569
 URL: https://issues.apache.org/jira/browse/SPARK-24569
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
 Environment: OSX
Reporter: John Conwell


A Spark SQL Aggregator that returns an output column of Option[Boolean] creates a 
column of type 
StructField(,StructType(StructField(value,BooleanType,true)),true) 
instead of StructField(,BooleanType,true).

In other words, it puts a Row instance into the new column.

 

Reproduction

 
{code:scala}
import org.apache.spark.sql.{Encoder, Row, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.types.{BooleanType, StringType}

// BaseFreeSpec and getSparkSession come from the reporter's own test harness.
class OptionBooleanAggregatorTest extends BaseFreeSpec {

  val ss: SparkSession = getSparkSession

  "test option" in {
    import ss.implicits._

    val df = List(
      Thing("bob", Some(true)),
      Thing("bob", Some(false)),
      Thing("bob", None))
      .toDF()

    val group = df
      .groupBy("name")
      .agg(OptionBooleanAggregator("isGood").toColumn.alias("isGood"))
      .cache()

    assert(group.schema("name").dataType == StringType)

    // this will fail
    assert(group.schema("isGood").dataType == BooleanType)
  }
}

case class Thing(name: String, isGood: Option[Boolean])

case class OptionBooleanAggregator(colName: String)
  extends Aggregator[Row, Option[Boolean], Option[Boolean]] {

  override def zero: Option[Boolean] = Option.empty[Boolean]

  override def reduce(buffer: Option[Boolean], row: Row): Option[Boolean] = {
    val index = row.fieldIndex(colName)
    val value =
      if (row.isNullAt(index)) Option.empty[Boolean] else Some(row.getBoolean(index))
    merge(buffer, value)
  }

  override def merge(b1: Option[Boolean], b2: Option[Boolean]): Option[Boolean] = {
    if ((b1.isDefined && b1.get) || (b2.isDefined && b2.get)) {
      Some(true)
    } else if (b1.isDefined) {
      b1
    } else {
      b2
    }
  }

  override def finish(reduction: Option[Boolean]): Option[Boolean] = reduction
  override def bufferEncoder: Encoder[Option[Boolean]] = OptionalBoolEncoder
  override def outputEncoder: Encoder[Option[Boolean]] = OptionalBoolEncoder

  def OptionalBoolEncoder: Encoder[Option[Boolean]] =
    org.apache.spark.sql.catalyst.encoders.ExpressionEncoder()
}
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24452) long = int*int or long = int+int may cause overflow.

2018-06-15 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24452.
-
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 21481
[https://github.com/apache/spark/pull/21481]

> long = int*int or long = int+int may cause overflow.
> 
>
> Key: SPARK-24452
> URL: https://issues.apache.org/jira/browse/SPARK-24452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0, 2.3.2
>
>
> The following assignments cause an overflow on the right-hand side, so the 
> resulting value may be negative.
> {code:java}
> long = int*int
> long = int+int{code}
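
A minimal illustration of the problem: the arithmetic on the right-hand side is evaluated
in Int before the widening assignment, so the product has already wrapped around.

{code:scala}
val a: Int = 60000
val b: Int = 60000

val overflowed: Long = a * b        // Int multiplication wraps around: -694967296
val correct: Long    = a.toLong * b // widen an operand first:           3600000000L

println(s"overflowed = $overflowed, correct = $correct")
{code}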



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24452) long = int*int or long = int+int may cause overflow.

2018-06-15 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24452:
---

Assignee: Kazuaki Ishizaki

> long = int*int or long = int+int may cause overflow.
> 
>
> Key: SPARK-24452
> URL: https://issues.apache.org/jira/browse/SPARK-24452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.3.2, 2.4.0
>
>
> The following assignments cause an overflow on the right-hand side, so the 
> resulting value may be negative.
> {code:java}
> long = int*int
> long = int+int{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-15 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24552:
--
Affects Version/s: 2.2.0
   2.2.1
   2.3.0
   2.3.1

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-15 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24552:
--
Priority: Blocker  (was: Critical)

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1
>Reporter: Ryan Blue
>Priority: Blocker
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24525) Provide an option to limit MemorySink memory usage

2018-06-15 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-24525.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Resolved by [https://github.com/apache/spark/pull/21559]

> Provide an option to limit MemorySink memory usage
> --
>
> Key: SPARK-24525
> URL: https://issues.apache.org/jira/browse/SPARK-24525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
> Fix For: 2.4.0
>
>
> MemorySink stores stream results in memory and is mostly used for testing and 
> displaying streams, but for large streams, this can OOM the driver. We should 
> add an option to limit the number of rows and the total size of a memory sink 
> and not add any new data once either limit is hit. 
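
For context, the sketch below shows the standard (unbounded) memory-sink usage that the
proposed option would cap; the `rate` source and the query name are just placeholders.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("memory-sink").getOrCreate()

// Built-in test source that emits rows forever -- exactly the case that can OOM the
// driver, because the memory sink keeps every output row on the driver.
val stream = spark.readStream.format("rate").load()

val query = stream.writeStream
  .format("memory")          // results are collected into an in-memory table
  .queryName("rate_table")   // ...exposed as a temp view with this name
  .outputMode("append")
  .start()

// The table fills as micro-batches complete.
spark.sql("SELECT count(*) FROM rate_table").show()
{code}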



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24525) Provide an option to limit MemorySink memory usage

2018-06-15 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-24525:
---

Assignee: Mukul Murthy

> Provide an option to limit MemorySink memory usage
> --
>
> Key: SPARK-24525
> URL: https://issues.apache.org/jira/browse/SPARK-24525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
>
> MemorySink stores stream results in memory and is mostly used for testing and 
> displaying streams, but for large streams, this can OOM the driver. We should 
> add an option to limit the number of rows and the total size of a memory sink 
> and not add any new data once either limit is hit. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24396) Add Structured Streaming ForeachWriter for python

2018-06-15 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24396.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21477
[https://github.com/apache/spark/pull/21477]

> Add Structured Streaming ForeachWriter for python
> -
>
> Key: SPARK-24396
> URL: https://issues.apache.org/jira/browse/SPARK-24396
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>
> Users should be able to write ForeachWriter code in Python; that is, they 
> should be able to use the partition id and the version/batchId/epochId to 
> conditionally process rows.
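
For reference, the existing Scala ForeachWriter contract that the Python API mirrors looks
roughly like the sketch below. The writer class itself is a made-up example; in 2.3 the
second open() parameter is the version/batch id.

{code:scala}
import org.apache.spark.sql.ForeachWriter

// Example writer that just prints rows; the (partitionId, version) pair is what lets an
// implementation skip work it has already done (e.g. after a failure and retry).
class PrintlnWriter extends ForeachWriter[String] {
  override def open(partitionId: Long, version: Long): Boolean = {
    // Return false to skip processing this partition for this version/batch.
    true
  }
  override def process(value: String): Unit = println(s"row: $value")
  override def close(errorOrNull: Throwable): Unit = ()
}
{code}

On the Scala side such a writer is wired up with df.writeStream.foreach(new PrintlnWriter).start(); this ticket adds the Python equivalent.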



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24568) Code refactoring for DataType equalsXXX methods

2018-06-15 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-24568:
---

 Summary: Code refactoring for DataType equalsXXX methods
 Key: SPARK-24568
 URL: https://issues.apache.org/jira/browse/SPARK-24568
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maryann Xue
 Fix For: 2.4.0


Right now there is a lot of code duplication between the DataType equalsXXX 
methods: {{equalsIgnoreNullability}}, {{equalsIgnoreCompatibleNullability}}, 
{{equalsIgnoreCaseAndNullability}}, {{equalsStructurally}}. We can replace the 
duplicated code with a helper function.
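
One possible shape for such a helper, sketched below: a single recursive comparison
parameterized by how names and nullability are compared, which each equalsXXX variant can
delegate to. This is illustrative only, not the actual patch.

{code:scala}
import org.apache.spark.sql.types._

// Illustrative helper: compare two DataTypes with pluggable rules for field names and
// nullability; each equalsXXX variant becomes a one-line call into it.
def equalsWith(
    left: DataType,
    right: DataType,
    nameOk: (String, String) => Boolean,
    nullabilityOk: (Boolean, Boolean) => Boolean): Boolean = (left, right) match {
  case (ArrayType(le, ln), ArrayType(re, rn)) =>
    nullabilityOk(ln, rn) && equalsWith(le, re, nameOk, nullabilityOk)
  case (MapType(lk, lv, ln), MapType(rk, rv, rn)) =>
    nullabilityOk(ln, rn) &&
      equalsWith(lk, rk, nameOk, nullabilityOk) &&
      equalsWith(lv, rv, nameOk, nullabilityOk)
  case (StructType(lfs), StructType(rfs)) =>
    lfs.length == rfs.length && lfs.zip(rfs).forall { case (lf, rf) =>
      nameOk(lf.name, rf.name) && nullabilityOk(lf.nullable, rf.nullable) &&
        equalsWith(lf.dataType, rf.dataType, nameOk, nullabilityOk)
    }
  case _ => left == right
}

// e.g. "ignore case and nullability":
// equalsWith(a, b, (x, y) => x.equalsIgnoreCase(y), (_, _) => true)
{code}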



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23435) R tests should support latest testthat

2018-06-15 Thread Weiqiang Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514111#comment-16514111
 ] 

Weiqiang Zhuang edited comment on SPARK-23435 at 6/15/18 5:14 PM:
--

yes, quite similar

 

```
sp <- getNamespace("SparkR")
attach(sp)
test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests"))
```

 


was (Author: adrian555):
yes, quite similar

 

```
sp <- getNamespace("SparkR")
attach(sp)
test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests"))
```

 

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was 
> released in Dec 2017, and its methods have changed.
> In order for our tests to keep working, we need to detect the version and call 
> a different method.
> Jenkins is running 1.0.1, though; we need to check whether this is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-06-15 Thread Weiqiang Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514111#comment-16514111
 ] 

Weiqiang Zhuang commented on SPARK-23435:
-

yes, quite similar

 

```
sp <- getNamespace("SparkR")
attach(sp)
test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests"))
```

 

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was 
> released in Dec 2017, and its methods have changed.
> In order for our tests to keep working, we need to detect the version and call 
> a different method.
> Jenkins is running 1.0.1, though; we need to check whether this is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24490) Use WebUI.addStaticHandler in web UIs

2018-06-15 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24490.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21510
[https://github.com/apache/spark/pull/21510]

> Use WebUI.addStaticHandler in web UIs
> -
>
> Key: SPARK-24490
> URL: https://issues.apache.org/jira/browse/SPARK-24490
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.4.0
>
>
> {{WebUI}} defines {{addStaticHandler}}, which the web UIs don't use (and so 
> they simply introduce duplication). Let's clean them up and remove the duplication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24490) Use WebUI.addStaticHandler in web UIs

2018-06-15 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24490:
--

Assignee: Jacek Laskowski

> Use WebUI.addStaticHandler in web UIs
> -
>
> Key: SPARK-24490
> URL: https://issues.apache.org/jira/browse/SPARK-24490
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.4.0
>
>
> {{WebUI}} defines {{addStaticHandler}} that web UIs don't use (and simply 
> introduce duplication). Let's clean them up and remove duplications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-15 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24531:
---
Fix Version/s: 2.3.2

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.2.2, 2.3.2, 2.4.0
>
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-15 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24531:
---
Fix Version/s: 2.2.2

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.2.2, 2.3.2, 2.4.0
>
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-06-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514031#comment-16514031
 ] 

Hyukjin Kwon commented on SPARK-24535:
--

When I tried it before, I remember I faced some issues. Mind if I ask you to 
share, roughly, the steps you followed? They don't have to be polished; I just 
want to retry what you did.

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-06-15 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514020#comment-16514020
 ] 

Shivaram Venkataraman commented on SPARK-24535:
---

I was going to do it manually. If we can do it in the PR builder, that would be great!

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-06-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514016#comment-16514016
 ] 

Hyukjin Kwon commented on SPARK-24535:
--

[~shivaram], how did you run this on Windows? I think we had better run this 
in the PR builder.

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR

2018-06-15 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514012#comment-16514012
 ] 

Shivaram Venkataraman commented on SPARK-24535:
---

I'm not sure if it's only failing on Windows -- the Debian test on CRAN did not 
have the same Java version. I can try a test on Windows later today and see 
what I find.

> Fix java version parsing in SparkR
> --
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Priority: Major
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming

2018-06-15 Thread bharath kumar avusherla (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath kumar avusherla resolved SPARK-24476.
-
Resolution: Fixed

> java.net.SocketTimeoutException: Read timed out under jets3t while running 
> the Spark Structured Streaming
> -
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Minor
> Attachments: socket-timeout-exception
>
>
> We are working on a Spark streaming application using Spark Structured 
> Streaming with checkpointing in S3. When we start the application, it runs 
> just fine for some time, then it crashes with the error mentioned below. The 
> amount of time it runs successfully varies: sometimes it runs for 2 days 
> without any issues and then crashes, sometimes it crashes after 4 or 24 hours.
> Our streaming application joins (left and inner) multiple sources from Kafka 
> as well as S3 and an Aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the socket timeout?
> Here, I'm pasting a few lines of the complete exception log below. The 
> complete exception is also attached to the issue.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24534:


Assignee: Apache Spark

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Assignee: Apache Spark
>Priority: Minor
>
> As an improvement to the entrypoint.sh script, I'd like to propose that the 
> Spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc. without 
> compromising the previous method to configure the cluster inside a kubernetes 
> environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24534:


Assignee: (was: Apache Spark)

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement to the entrypoint.sh script, I'd like to propose that the 
> Spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc. without 
> compromising the previous method to configure the cluster inside a kubernetes 
> environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513945#comment-16513945
 ] 

Apache Spark commented on SPARK-24534:
--

User 'rimolive' has created a pull request for this issue:
https://github.com/apache/spark/pull/21572

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement to the entrypoint.sh script, I'd like to propose that the 
> Spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc. without 
> compromising the previous method to configure the cluster inside a kubernetes 
> environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming

2018-06-15 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513933#comment-16513933
 ] 

Steve Loughran commented on SPARK-24476:


* Use S3A, as S3n is unsupported and deleted from the recent versions of 
Hadoop. Nobody tests it either.
* Don't know about speculation. S3 in general isn't good here as the 
speculation code in (hadoop, spark, hive, ...) assumes that renames are fast 
and, for directories, atomic. You can get into serious trouble there. I haven't 
looked at what Spark Streaming's commit protocol is in any detail; still on my 
TODO list. 

My recommendation: stay with S3A, and close this as cannot-reproduce for now.
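
For anyone who lands here with the same symptom, a rough sketch of what an 
S3A-based setup with a larger socket timeout could look like (bucket, topic 
and broker names are placeholders; the fs.s3a.* settings come from hadoop-aws 
and the values are only illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

// Use the s3a:// connector instead of s3n/jets3t and tune its socket and
// connection timeouts through the Hadoop configuration (milliseconds).
val spark = SparkSession.builder()
  .appName("s3a-checkpoint-sketch")
  .config("spark.hadoop.fs.s3a.connection.timeout", "200000")
  .config("spark.hadoop.fs.s3a.attempts.maximum", "20")
  .getOrCreate()

// Placeholder Kafka source; the point here is only the s3a:// paths below.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

input.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output/")                    // hypothetical bucket
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/") // hypothetical bucket
  .start()
{code}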


> java.net.SocketTimeoutException: Read timed out under jets3t while running 
> the Spark Structured Streaming
> -
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Minor
> Attachments: socket-timeout-exception
>
>
> We are working on a Spark streaming application using Spark Structured 
> Streaming with checkpointing in S3. When we start the application, it runs 
> just fine for some time, then it crashes with the error mentioned below. The 
> amount of time it runs successfully varies: sometimes it runs for 2 days 
> without any issues and then crashes, sometimes it crashes after 4 or 24 hours.
> Our streaming application joins (left and inner) multiple sources from Kafka 
> as well as S3 and an Aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the socket timeout?
> Here, I'm pasting a few lines of the complete exception log below. The 
> complete exception is also attached to the issue.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",

2018-06-15 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513897#comment-16513897
 ] 

Mihaly Toth commented on SPARK-22918:
-

I managed to reproduce the problem in a unit test. When using a security 
manager (with derby) one needs to apply a security policy via 
{{Policy.setPolicy()}}. In its {{.getPermissions.implies}} one is tempted to 
use {{new SystemPermission("engine", "usederbyinternals")}}. This works fine, 
but when you run a Spark session it is seemingly ignored. This is caused by 
the IsolatedClassLoader: {{SystemPermission}} does not work across class 
loaders, i.e. the permission being checked must come from the same class 
loader as the one defined in the Policy. Otherwise the classes will not be 
equal and the call gets rejected.

One solution is to use another permission in the policy that only compares 
names and class names, standing in for the original {{SystemPermission}}, like:

{code:scala}
new Permission(delegate.getName) {
  override def getActions: String = delegate.getActions

  override def implies(permission: Permission): Boolean =
    delegate.getClass.getCanonicalName == permission.getClass.getCanonicalName &&
      delegate.getName == permission.getName

  override def hashCode(): Int = reflectionHashCode(this)

  override def equals(obj: scala.Any): Boolean = reflectionEquals(this, obj)
}
{code}

At least this approach worked for me. It also works with {{new AllPermission()}} 
in case one is not really into fine-grained access control.
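
For completeness, a rough sketch of how such a policy could be installed from 
test setup code. This only illustrates the idea above (it is not the exact 
code from the unit test), and the {{DerbyPolicySetup}} / {{LenientDerbyPolicy}} 
names are made up:

{code:scala}
import java.security.{Permission, Policy, ProtectionDomain}

object DerbyPolicySetup {
  // Keep the JVM's original policy so all non-derby checks behave as before.
  private val defaultPolicy = Policy.getPolicy

  private object LenientDerbyPolicy extends Policy {
    override def implies(domain: ProtectionDomain, permission: Permission): Boolean = {
      // Compare by class name and permission name rather than calling
      // SystemPermission.implies, so the check also succeeds when derby was
      // loaded by a different class loader (e.g. Spark's isolated Hive client).
      val derbyInternals =
        permission.getClass.getCanonicalName == "org.apache.derby.security.SystemPermission" &&
          permission.getName == "engine"
      derbyInternals || defaultPolicy.implies(domain, permission)
    }
  }

  // Call this before the SparkSession / Hive metastore is initialized.
  def install(): Unit = Policy.setPolicy(LenientDerbyPolicy)
}
{code}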

> sbt test (spark - local) fail after upgrading to 2.2.1 with: 
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
> 
>
> Key: SPARK-22918
> URL: https://issues.apache.org/jira/browse/SPARK-22918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Damian Momot
>Priority: Major
>
> After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started 
> to fail with following exception:
> {noformat}
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
>   at 
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>   at 
> java.security.AccessController.checkPermission(AccessController.java:884)
>   at 
> org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown
>  Source)
>   at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown 
> Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.<init>(Unknown Source)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>   at 
> org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.<init>(ConnectionFactoryImpl.java:85)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> 

[jira] [Assigned] (SPARK-21743) top-most limit should not cause memory leak

2018-06-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21743:


Assignee: Apache Spark  (was: Wenchen Fan)

> top-most limit should not cause memory leak
> ---
>
> Key: SPARK-21743
> URL: https://issues.apache.org/jira/browse/SPARK-21743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21743) top-most limit should not cause memory leak

2018-06-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21743:


Assignee: Wenchen Fan  (was: Apache Spark)

> top-most limit should not cause memory leak
> ---
>
> Key: SPARK-21743
> URL: https://issues.apache.org/jira/browse/SPARK-21743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21743) top-most limit should not cause memory leak

2018-06-15 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reopened SPARK-21743:
---

Reopening issue, this is causing a regression in the CSV reader.

> top-most limit should not cause memory leak
> ---
>
> Key: SPARK-21743
> URL: https://issues.apache.org/jira/browse/SPARK-21743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24567) nodeBlacklist does not get updated if a spark executor fails to launch on a mesos node

2018-06-15 Thread Igor Berman (JIRA)
Igor Berman created SPARK-24567:
---

 Summary: nodeBlacklist does not get updated if a spark executor 
fails to launch on a mesos node 
 Key: SPARK-24567
 URL: https://issues.apache.org/jira/browse/SPARK-24567
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Scheduler
Affects Versions: 2.4.0
Reporter: Igor Berman


As part of the fix for SPARK-19755 we removed the custom blacklisting mechanism 
in the Spark-Mesos integration, which had a hardcoded maximum of 2 failures 
before a node was marked as blacklisted.

From now on the usual blacklisting mechanism is used (when enabled). However, it 
has the downside of not counting failures to launch Mesos tasks (Spark 
executors), i.e. only failures in Spark tasks are counted.
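
For reference, the built-in mechanism mentioned above is driven purely by task 
failures and is enabled with configuration along these lines (a sketch; the 
values are illustrative, not tuned recommendations for Mesos):

{code:scala}
import org.apache.spark.SparkConf

// Standard task-failure-based blacklisting. Note that none of these settings
// count failures to *launch* an executor on a Mesos agent, which is exactly
// the gap described above.
val conf = new SparkConf()
  .setAppName("blacklist-config-sketch")
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
  .set("spark.blacklist.timeout", "1h")
{code}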

[~squito] [~felixcheung]  [~susanxhuynh] [~skonto] please add details as you 
see it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-15 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513649#comment-16513649
 ] 

Liang-Chi Hsieh edited comment on SPARK-24465 at 6/15/18 10:32 AM:
---

I'm not sure SPARK-12878 is a real issue. It seems to me that the nested UDT 
code just needs to be written in the working way. Please see my comment on 
SPARK-12878.


was (Author: viirya):
I'm not sure SPARK-12878 is a real issue. It seems to me that the nested UDT 
code just needs to be written in the working way.

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs; see 
> [SPARK-12878].
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-15 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513649#comment-16513649
 ] 

Liang-Chi Hsieh commented on SPARK-24465:
-

I'm not sure SPARK-12878 is a real issue. It seems to me that the nested UDT 
code just needs to be written in the working way.

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs; see 
> [SPARK-12878].
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12878) Dataframe fails with nested User Defined Types

2018-06-15 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513647#comment-16513647
 ] 

Liang-Chi Hsieh commented on SPARK-12878:
-


Is this a real issue? It seems to me that you simply can't write a nested UDT 
the way the example code in the description does.

The nested UDT example should look like the following; you need to serialize 
the nested UDT objects when you serialize the wrapper object:

{code:scala}
@SQLUserDefinedType(udt = classOf[WrapperUDT])
case class Wrapper(list: Seq[Element])

class WrapperUDT extends UserDefinedType[Wrapper] {
  override def sqlType: DataType = StructType(Seq(StructField("list",
ArrayType(new ElementUDT(), containsNull = false), nullable = true)))
  override def userClass: Class[Wrapper] = classOf[Wrapper]
  override def serialize(obj: Wrapper): Any = obj match {
case Wrapper(list) =>
  val row = new GenericInternalRow(1)
  val elementUDT = new ElementUDT()
  val serializedElements = list.map((e: Element) => elementUDT.serialize(e))
  row.update(0, new GenericArrayData(serializedElements.toArray))
  row
  }

  override def deserialize(datum: Any): Wrapper = datum match {
case row: InternalRow =>
  val elementUDF = new ElementUDT()
  Wrapper(row.getArray(0).toArray(elementUDF).map((e: Any) => 
elementUDF.deserialize(e)))
  }
}

@SQLUserDefinedType(udt = classOf[ElementUDT])
case class Element(num: Int)

class ElementUDT extends UserDefinedType[Element] {
  override def sqlType: DataType =
StructType(Seq(StructField("num", IntegerType, nullable = false)))
  override def userClass: Class[Element] = classOf[Element]
  override def serialize(obj: Element): Any = obj match {
case Element(num) =>
  val row = new GenericInternalRow(1)
  row.setInt(0, num)
  row
  }

  override def deserialize(datum: Any): Element = datum match {
case row: InternalRow => Element(row.getInt(0))
  }
}

val data = Seq(Wrapper(Seq(Element(1), Element(2))), Wrapper(Seq(Element(3), 
Element(4))))
val df = sparkContext.parallelize((1 to 2).zip(data)).toDF("id", "b")
df.collect().map(println(_))
{code}

{code}
[1,Wrapper(ArraySeq(Element(1), Element(2)))]
[2,Wrapper(ArraySeq(Element(3), Element(4)))]
{code}

> Dataframe fails with nested User Defined Types
> --
>
> Key: SPARK-12878
> URL: https://issues.apache.org/jira/browse/SPARK-12878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Joao Duarte
>Priority: Major
>
> Spark 1.6.0 crashes when using nested User Defined Types in a Dataframe. 
> In version 1.5.2 the code below worked just fine:
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
> import org.apache.spark.sql.types._
> @SQLUserDefinedType(udt = classOf[AUDT])
> case class A(list:Seq[B])
> class AUDT extends UserDefinedType[A] {
>   override def sqlType: DataType = StructType(Seq(StructField("list", 
> ArrayType(BUDT, containsNull = false), nullable = true)))
>   override def userClass: Class[A] = classOf[A]
>   override def serialize(obj: Any): Any = obj match {
> case A(list) =>
>   val row = new GenericMutableRow(1)
>   row.update(0, new 
> GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
>   row
>   }
>   override def deserialize(datum: Any): A = {
> datum match {
>   case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
> }
>   }
> }
> object AUDT extends AUDT
> @SQLUserDefinedType(udt = classOf[BUDT])
> case class B(text:Int)
> class BUDT extends UserDefinedType[B] {
>   override def sqlType: DataType = StructType(Seq(StructField("num", 
> IntegerType, nullable = false)))
>   override def userClass: Class[B] = classOf[B]
>   override def serialize(obj: Any): Any = obj match {
> case B(text) =>
>   val row = new GenericMutableRow(1)
>   row.setInt(0, text)
>   row
>   }
>   override def deserialize(datum: Any): B = {
> datum match {  case row: InternalRow => new B(row.getInt(0))  }
>   }
> }
> object BUDT extends BUDT
> object Test {
>   def main(args:Array[String]) = {
> val col = Seq(new A(Seq(new B(1), new B(2))),
>   new A(Seq(new B(3), new B(4))))
> val sc = new SparkContext(new 
> SparkConf().setMaster("local[1]").setAppName("TestSpark"))
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> val df = sc.parallelize(1 to 2 zip col).toDF("id","b")
> df.select("b").show()
> df.collect().foreach(println)
>   }
> }
> In the new version (1.6.0) I needed to include the following import:
> import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
> However, Spark crashes in runtime:
> 16/01/18 

[jira] [Commented] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2018-06-15 Thread Lucas Partridge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513642#comment-16513642
 ] 

Lucas Partridge commented on SPARK-19498:
-

How would you prefer people to provide their input on this? Via comments on 
this Jira issue, or somewhere else?

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> 
>
> Key: SPARK-19498
> URL: https://issues.apache.org/jira/browse/SPARK-19498
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would be terrible of 
> course in the long-term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results

2018-06-15 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-24548:

Component/s: (was: Spark Core)

> JavaPairRDD to Dataset in SPARK generates ambiguous results
> 
>
> Key: SPARK-24548
> URL: https://issues.apache.org/jira/browse/SPARK-24548
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.3.0
> Environment: Using Windows 10, on 64bit machine with 16G of ram.
>Reporter: Jackson
>Priority: Major
>
> I have data in the JavaPairRDD below:
> {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD;
> {quote}
> I tried using the code below:
> {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
> Encoders.tuple(Encoders.STRING(), 
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
> Dataset<Row> newDataSet = 
> spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");
> newDataSet.printSchema();
> {quote}
> {{root}}
> {{ |-- value1: string (nullable = true)}}
> {{ |-- value2: struct (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> But after creating a StackOverflow question 
> ("https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark"),
> I learned that the values in a tuple should have distinct field names, 
> whereas in this case the same name is generated. Because of this I cannot 
> select a specific column under value2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results

2018-06-15 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-24548:

Component/s: SQL

> JavaPairRDD to Dataset in SPARK generates ambiguous results
> 
>
> Key: SPARK-24548
> URL: https://issues.apache.org/jira/browse/SPARK-24548
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.3.0
> Environment: Using Windows 10, on 64bit machine with 16G of ram.
>Reporter: Jackson
>Priority: Major
>
> I have data in the JavaPairRDD below:
> {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD;
> {quote}
> I tried using the code below:
> {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
> Encoders.tuple(Encoders.STRING(), 
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
> Dataset<Row> newDataSet = 
> spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");
> newDataSet.printSchema();
> {quote}
> {{root}}
> {{ |-- value1: string (nullable = true)}}
> {{ |-- value2: struct (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> But after creating a StackOverflow question 
> ("https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark"),
> I learned that the values in a tuple should have distinct field names, 
> whereas in this case the same name is generated. Because of this I cannot 
> select a specific column under value2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results

2018-06-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24548:


Assignee: Apache Spark

> JavaPairRDD to Dataset in SPARK generates ambiguous results
> 
>
> Key: SPARK-24548
> URL: https://issues.apache.org/jira/browse/SPARK-24548
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.3.0
> Environment: Using Windows 10, on 64bit machine with 16G of ram.
>Reporter: Jackson
>Assignee: Apache Spark
>Priority: Major
>
> I have data in the JavaPairRDD below:
> {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD;
> {quote}
> I tried using the code below:
> {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
> Encoders.tuple(Encoders.STRING(), 
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
> Dataset<Row> newDataSet = 
> spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");
> newDataSet.printSchema();
> {quote}
> {{root}}
> {{ |-- value1: string (nullable = true)}}
> {{ |-- value2: struct (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> But after creating a StackOverflow question 
> ("https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark"),
> I learned that the values in a tuple should have distinct field names, 
> whereas in this case the same name is generated. Because of this I cannot 
> select a specific column under value2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results

2018-06-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513571#comment-16513571
 ] 

Apache Spark commented on SPARK-24548:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21576

> JavaPairRDD to Dataset in SPARK generates ambiguous results
> 
>
> Key: SPARK-24548
> URL: https://issues.apache.org/jira/browse/SPARK-24548
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.3.0
> Environment: Using Windows 10, on 64bit machine with 16G of ram.
>Reporter: Jackson
>Priority: Major
>
> I have data in the JavaPairRDD below:
> {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD;
> {quote}
> I tried using the code below:
> {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
> Encoders.tuple(Encoders.STRING(), 
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
> Dataset<Row> newDataSet = 
> spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");
> newDataSet.printSchema();
> {quote}
> {{root}}
> {{ |-- value1: string (nullable = true)}}
> {{ |-- value2: struct (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> But after creating a StackOverflow question 
> ("https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark"),
> I learned that the values in a tuple should have distinct field names, 
> whereas in this case the same name is generated. Because of this I cannot 
> select a specific column under value2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results

2018-06-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24548:


Assignee: (was: Apache Spark)

> JavaPairRDD to Dataset in SPARK generates ambiguous results
> 
>
> Key: SPARK-24548
> URL: https://issues.apache.org/jira/browse/SPARK-24548
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.3.0
> Environment: Using Windows 10, on 64bit machine with 16G of ram.
>Reporter: Jackson
>Priority: Major
>
> I have data in the JavaPairRDD below:
> {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD;
> {quote}
> I tried using the code below:
> {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
> Encoders.tuple(Encoders.STRING(), 
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
> Dataset<Row> newDataSet = 
> spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");
> newDataSet.printSchema();
> {quote}
> {{root}}
> {{ |-- value1: string (nullable = true)}}
> {{ |-- value2: struct (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> But after creating a StackOverflow question 
> ("https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark"),
> I learned that the values in a tuple should have distinct field names, 
> whereas in this case the same name is generated. Because of this I cannot 
> select a specific column under value2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-06-15 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513416#comment-16513416
 ] 

Felix Cheung commented on SPARK-23435:
--

sorry, I did try but couldn't get it to work, but something like this?

 
{code:java}
# for testthat after 1.0.2 call test_dir as run_tests is removed.
if (packageVersion("testthat") >= "2.0.0") {
  test_pkg_env <- list2env(as.list(getNamespace("SparkR"), all.names = TRUE),
  parent = parent.env(getNamespace("SparkR")))
  withr::local_options(list(topLevelEnvironment = test_pkg_env))
  test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests"),
env = test_pkg_env,
stop_on_failure = TRUE,
stop_on_warning = FALSE)
} else {
  testthat:::run_tests("SparkR",
file.path(sparkRDir, "pkg", "tests", "fulltests"),
NULL,
"summary")
}

{code}

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, we need to check if it is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org