[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514696#comment-16514696 ] Takeshi Yamamuro commented on SPARK-24498: -- I'm also interested in this, so I'll look into it. If there is anything I can do, let me know. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, the JDK compiler can generate smaller bytecode and take less time > to compile than Janino. However, in other cases, Janino is better. > We should support both for our runtime codegen. Janino will still be our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
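[Editor's note] For context on what "adding the JDK compiler" would involve, here is a minimal sketch, not Spark's actual codegen path, of compiling generated Java source at runtime via the standard javax.tools API; the generated class name and source are made up. ToolProvider.getSystemJavaCompiler returns null on a JRE-only runtime, which is one reason a pure-library compiler like Janino is the safer default.
{code:scala}
// Illustrative only: compile a generated Java class at runtime with the JDK
// compiler. Spark's real codegen pipeline is considerably more involved.
import java.nio.file.Files
import javax.tools.ToolProvider

object JdkCompilerSketch {
  def main(args: Array[String]): Unit = {
    val source =
      """public class GeneratedIterator {
        |  public int add(int a, int b) { return a + b; }
        |}""".stripMargin
    val dir = Files.createTempDirectory("codegen")
    val javaFile = dir.resolve("GeneratedIterator.java")
    Files.write(javaFile, source.getBytes("UTF-8"))
    // Returns null when running on a JRE without the compiler module/tools.jar.
    val compiler = ToolProvider.getSystemJavaCompiler
    val exitCode = compiler.run(null, null, null, javaFile.toString)
    println(s"javac exit code: $exitCode") // 0 means success
  }
}
{code}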
[jira] [Commented] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514687#comment-16514687 ] Marco Gaido commented on SPARK-23901: - These functions can be used like any other function in Hive; they are not just there for the Hive authorizer. I think the use case for them is to anonymize data for privacy reasons (e.g. exposing/exporting data to other parties without revealing sensitive data, while still being able to use it in joins). > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
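[Editor's note] To make the join use case above concrete, here is a sketch of how these functions behave per the Hive UDF docs referenced in the issue; table and column names are made up, and whether and how the functions land in Spark is exactly what is being discussed in this thread. mask() rewrites characters by class (upper case to 'X', lower case to 'x', digits to 'n') for display, while mask_hash() replaces the value with a hash, which preserves equality and hence keeps joins meaningful.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mask-demo").enableHiveSupport().getOrCreate()

// mask() for display, e.g. mask('abcd-EFGH-8765') => 'xxxx-XXXX-nnnn'
spark.sql("SELECT mask(ssn) FROM customers").show()

// mask_hash() preserves equality, so two masked datasets can still be joined
// without either side ever exposing the raw sensitive value.
val customers = spark.sql("SELECT id, mask_hash(ssn) AS ssn_key FROM customers")
val accounts  = spark.sql("SELECT mask_hash(ssn) AS ssn_key, account FROM accounts")
customers.join(accounts, "ssn_key").show()
{code}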
[jira] [Commented] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514685#comment-16514685 ] Wenchen Fan commented on SPARK-23901: - According to the Hive JIRA, it's used in Hive authorization; it's not a general function and it seems it can't be applied to Spark. [~smilegator] [~mgaido] do you know of any use case for these functions? > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24540) Support for multiple delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514684#comment-16514684 ] Takeshi Yamamuro edited comment on SPARK-24540 at 6/16/18 5:28 AM: --- Probably, this is a restriction of the univocity parser. was (Author: maropu): Probably, this is a restriction of the univocity parser. cc: [~hyukjin.kwon] btw, why did you set 'is this blocked by SPARK-17967'? > Support for multiple delimiter in Spark CSV read > > > Key: SPARK-24540 > URL: https://issues.apache.org/jira/browse/SPARK-24540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Ashwin K >Priority: Major > > Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data > only supports a single-character delimiter. If we try to provide multiple > delimiters, we observe the following error message. > e.g.: Dataset<Row> df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple delimiters, and > presently we need to do a manual data clean-up on the source/input file, > which doesn't work well in large applications that consume numerous files. > There seem to be workarounds, like reading the data as text and using the split > option, but in my opinion this defeats the purpose, advantages, and efficiency > of a direct read from a CSV file. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24540) Support for multiple delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514684#comment-16514684 ] Takeshi Yamamuro edited comment on SPARK-24540 at 6/16/18 5:28 AM: --- I think this is probably a restriction of the univocity parser. was (Author: maropu): Probably, this is a restriction of the univocity parser. > Support for multiple delimiter in Spark CSV read > > > Key: SPARK-24540 > URL: https://issues.apache.org/jira/browse/SPARK-24540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Ashwin K >Priority: Major > > Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data > only supports a single-character delimiter. If we try to provide multiple > delimiters, we observe the following error message. > e.g.: Dataset<Row> df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple delimiters, and > presently we need to do a manual data clean-up on the source/input file, > which doesn't work well in large applications that consume numerous files. > There seem to be workarounds, like reading the data as text and using the split > option, but in my opinion this defeats the purpose, advantages, and efficiency > of a direct read from a CSV file. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24540) Support for multiple delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514684#comment-16514684 ] Takeshi Yamamuro commented on SPARK-24540: -- Probably, this is a restriction of the univocity parser. cc: [~hyukjin.kwon] btw, why did you set 'is this blocked by SPARK-17967'? > Support for multiple delimiter in Spark CSV read > > > Key: SPARK-24540 > URL: https://issues.apache.org/jira/browse/SPARK-24540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Ashwin K >Priority: Major > > Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data > only supports a single-character delimiter. If we try to provide multiple > delimiters, we observe the following error message. > e.g.: Dataset<Row> df = spark.read().option("inferSchema", "true") > .option("header", > "false") > .option("delimiter", > ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple delimiters, and > presently we need to do a manual data clean-up on the source/input file, > which doesn't work well in large applications that consume numerous files. > There seem to be workarounds, like reading the data as text and using the split > option, but in my opinion this defeats the purpose, advantages, and efficiency > of a direct read from a CSV file. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
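[Editor's note] For readers hitting this limitation today, below is a sketch of the text-and-split workaround the description refers to, assuming a two-character delimiter ", ", a made-up input path, and exactly three columns per line.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-delim-workaround").getOrCreate()
import spark.implicits._

// Read as plain text, then split on the multi-character delimiter ourselves,
// since the csv reader only accepts a single-character delimiter.
val df = spark.read.textFile("C:/test.txt")
  .map(_.split(", ", -1))            // -1 keeps trailing empty fields
  .map(a => (a(0), a(1), a(2)))      // assumes exactly three columns per line
  .toDF("c0", "c1", "c2")
{code}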
[jira] [Assigned] (SPARK-24571) Support literals with values of the Char type
[ https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24571: Assignee: Apache Spark > Support literals with values of the Char type > - > > Key: SPARK-24571 > URL: https://issues.apache.org/jira/browse/SPARK-24571 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently, Spark doesn't support literals with the Char (java.lang.Character) > type. For example, the following code throws an exception: > {code} > val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") > df.where($"city".contains('o')).show(false) > {code} > It fails with the exception: > {code:java} > Unsupported literal type class java.lang.Character o > java.lang.RuntimeException: Unsupported literal type class > java.lang.Character o > at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > {code} > One of the possible solutions can be automatic conversion of Char literal to > String literal of length 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24571) Support literals with values of the Char type
[ https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514669#comment-16514669 ] Apache Spark commented on SPARK-24571: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21578 > Support literals with values of the Char type > - > > Key: SPARK-24571 > URL: https://issues.apache.org/jira/browse/SPARK-24571 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, Spark doesn't support literals with the Char (java.lang.Character) > type. For example, the following code throws an exception: > {code} > val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") > df.where($"city".contains('o')).show(false) > {code} > It fails with the exception: > {code:java} > Unsupported literal type class java.lang.Character o > java.lang.RuntimeException: Unsupported literal type class > java.lang.Character o > at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > {code} > One of the possible solutions can be automatic conversion of Char literal to > String literal of length 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24571) Support literals with values of the Char type
[ https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24571: Assignee: (was: Apache Spark) > Support literals with values of the Char type > - > > Key: SPARK-24571 > URL: https://issues.apache.org/jira/browse/SPARK-24571 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, Spark doesn't support literals with the Char (java.lang.Character) > type. For example, the following code throws an exception: > {code} > val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") > df.where($"city".contains('o')).show(false) > {code} > It fails with the exception: > {code:java} > Unsupported literal type class java.lang.Character o > java.lang.RuntimeException: Unsupported literal type class > java.lang.Character o > at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > {code} > One of the possible solutions can be automatic conversion of Char literal to > String literal of length 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23901) Data Masking Functions
[ https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514667#comment-16514667 ] Reynold Xin commented on SPARK-23901: - Why are we adding 1200 lines of code for some functions that don't even apply to Spark?! > Data Masking Functions > -- > > Key: SPARK-23901 > URL: https://issues.apache.org/jira/browse/SPARK-23901 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > - mask() > - mask_first_n() > - mask_last_n() > - mask_hash() > - mask_show_first_n() > - mask_show_last_n() > Reference: > [1] > [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions] > [2] https://issues.apache.org/jira/browse/HIVE-13568 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24571) Support literals with values of the Char type
[ https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-24571: --- Description: Currently, Spark doesn't support literals with the Char (java.lang.Character) type. For example, the following code throws an exception: {code} val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") df.where($"city".contains('o')).show(false) {code} It fails with the exception: {code:java} Unsupported literal type class java.lang.Character o java.lang.RuntimeException: Unsupported literal type class java.lang.Character o at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) {code} One of the possible solutions can be automatic conversion of Char literal to String literal of length 1. was: Currently, Spark doesn't support literals with the Char (java.lang.Character) type. For example, the following code throws an exception: {code:scala} val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") df.where($"city".contains('o')).show(false) {code} It fails with the exception: {code} Unsupported literal type class java.lang.Character o java.lang.RuntimeException: Unsupported literal type class java.lang.Character p at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) {code} One of the possible solutions can be automatic conversion of Char literal to String literal of length 1. > Support literals with values of the Char type > - > > Key: SPARK-24571 > URL: https://issues.apache.org/jira/browse/SPARK-24571 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, Spark doesn't support literals with the Char (java.lang.Character) > type. For example, the following code throws an exception: > {code} > val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") > df.where($"city".contains('o')).show(false) > {code} > It fails with the exception: > {code:java} > Unsupported literal type class java.lang.Character o > java.lang.RuntimeException: Unsupported literal type class > java.lang.Character o > at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > {code} > One of the possible solutions can be automatic conversion of Char literal to > String literal of length 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24571) Support literals with values of the Char type
Maxim Gekk created SPARK-24571: -- Summary: Support literals with values of the Char type Key: SPARK-24571 URL: https://issues.apache.org/jira/browse/SPARK-24571 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Currently, Spark doesn't support literals with the Char (java.lang.Character) type. For example, the following code throws an exception: {code:scala} val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") df.where($"city".contains('o')).show(false) {code} It fails with the exception: {code} Unsupported literal type class java.lang.Character o java.lang.RuntimeException: Unsupported literal type class java.lang.Character p at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) {code} One of the possible solutions can be automatic conversion of Char literal to String literal of length 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24571) Support literals with values of the Char type
[ https://issues.apache.org/jira/browse/SPARK-24571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514665#comment-16514665 ] Maxim Gekk commented on SPARK-24571: I am working on the improvement. > Support literals with values of the Char type > - > > Key: SPARK-24571 > URL: https://issues.apache.org/jira/browse/SPARK-24571 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently, Spark doesn't support literals with the Char (java.lang.Character) > type. For example, the following code throws an exception: > {code:scala} > val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") > df.where($"city".contains('o')).show(false) > {code} > It fails with the exception: > {code} > Unsupported literal type class java.lang.Character o > java.lang.RuntimeException: Unsupported literal type class > java.lang.Character p > at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) > {code} > One of the possible solutions can be automatic conversion of Char literal to > String literal of length 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
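[Editor's note] Until this improvement lands, a sketch of the obvious workaround for the repro above: pass a one-character String instead of a Char, since String literals are already supported by Literal.apply.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("char-literal-workaround").getOrCreate()
import spark.implicits._

val df = Seq("Amsterdam", "San Francisco", "London").toDF("city")
df.where($"city".contains("o")).show(false)   // "o" (String), not 'o' (Char)
{code}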
[jira] [Updated] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
[ https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24216: Fix Version/s: 2.3.2 > Spark TypedAggregateExpression uses getSimpleName that is not safe in scala > --- > > Key: SPARK-24216 > URL: https://issues.apache.org/jira/browse/SPARK-24216 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Fangshi Li >Assignee: Fangshi Li >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > When a user creates an aggregator object in Scala and passes the aggregator to > Spark Dataset's agg() method, Spark will initialize > TypedAggregateExpression with the nodeName field set to > aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a > Scala environment, depending on how the user creates the aggregator object. For > example, if the aggregator class's fully qualified name is > "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw > java.lang.InternalError "Malformed class name". This has been reported in > scalatest > [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] > and discussed in many Scala upstream JIRAs such as SI-8110, SI-5425. > To fix this issue, we follow the solution in > [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] > to add a safer version of getSimpleName as a util method, and > TypedAggregateExpression will invoke this util method rather than > getClass.getSimpleName. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
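[Editor's note] For reference, a sketch of the kind of fallback the fix describes, following the scalatest approach linked above; this is not Spark's exact util code. Try getSimpleName first, and on a "Malformed class name" InternalError derive a readable name from the fully qualified one.
{code:scala}
object SafeNames {
  // Sketch only: real implementations handle more corner cases.
  def safeSimpleName(cls: Class[_]): String =
    try cls.getSimpleName
    catch {
      case _: InternalError =>
        val fqn = cls.getName                        // e.g. com.my.company.MyUtils$myAgg$2$
        val local = fqn.substring(fqn.lastIndexOf('.') + 1)
        // Drop trailing "$<digits>$" / "$" segments produced by anonymous
        // and nested Scala classes.
        local.replaceAll("\\$[0-9]+\\$?$", "").stripSuffix("$")
    }
}
{code}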
[jira] [Commented] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)
[ https://issues.apache.org/jira/browse/SPARK-24570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514629#comment-16514629 ] Takeshi Yamamuro commented on SPARK-24570: -- Is this an issue in Spark itself? > SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel > SQL, DBVisualizer.etc) > --- > > Key: SPARK-24570 > URL: https://issues.apache.org/jira/browse/SPARK-24570 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: t oo >Priority: Major > Attachments: connect-to-sql-db-ssms-locate-table.png > > > An end-user SQL client tool (e.g. the one in the screenshot) can list tables from > HiveServer2 and major DBs (MySQL, Postgres, Oracle, MSSQL, etc.). But with > SparkSQL it does not display any tables. This would be very convenient for > users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)
t oo created SPARK-24570: Summary: SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc) Key: SPARK-24570 URL: https://issues.apache.org/jira/browse/SPARK-24570 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.1 Reporter: t oo Attachments: connect-to-sql-db-ssms-locate-table.png An end-user SQL client tool (e.g. the one in the screenshot) can list tables from HiveServer2 and major DBs (MySQL, Postgres, Oracle, MSSQL, etc.). But with SparkSQL it does not display any tables. This would be very convenient for users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)
[ https://issues.apache.org/jira/browse/SPARK-24570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] t oo updated SPARK-24570: - Attachment: connect-to-sql-db-ssms-locate-table.png > SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel > SQL, DBVisualizer.etc) > --- > > Key: SPARK-24570 > URL: https://issues.apache.org/jira/browse/SPARK-24570 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: t oo >Priority: Major > Attachments: connect-to-sql-db-ssms-locate-table.png > > > An end-user SQL client tool (e.g. the one in the screenshot) can list tables from > HiveServer2 and major DBs (MySQL, Postgres, Oracle, MSSQL, etc.). But with > SparkSQL it does not display any tables. This would be very convenient for > users. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21569) Internal Spark class needs to be kryo-registered
[ https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514529#comment-16514529 ] YongGang Cao edited comment on SPARK-21569 at 6/15/18 11:47 PM: It seems this is not workaround-able, at least from the Java side, unless we turn off required registration, which will harm performance as documented. I tried to register both of the following in SparkConf; no luck, I still get the "not registered" error message. {code:java} org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class, org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code} was (Author: ygcao): It seems this is not workaround-able, at least from the Java side. I tried to register both of the following in SparkConf; no luck, I still get the "not registered" error message. {code:java} org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class, org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code} > Internal Spark class needs to be kryo-registered > > > Key: SPARK-21569 > URL: https://issues.apache.org/jira/browse/SPARK-21569 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Major > > [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf] > As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when > {{spark.kryo.registrationRequired=true}}) with: > {code} > java.lang.IllegalArgumentException: Class is not registered: > org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage > Note: To register this class use: > kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class); > at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593) > at > org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > This internal Spark class should be kryo-registered by Spark by default. > This was not a problem in 2.1.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21569) Internal Spark class needs to be kryo-registered
[ https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514529#comment-16514529 ] YongGang Cao edited comment on SPARK-21569 at 6/15/18 11:45 PM: It seems this is not workaround-able, at least from the Java side. I tried to register both of the following in SparkConf; no luck, I still get the "not registered" error message. {code:java} org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class, org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code} was (Author: ygcao): It seems this is not workaround-able, at least from the Java side. I tried to register both of the following in SparkConf; no luck, I still get the "not registered" error message. {{}} {code:java} org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class, org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code} > Internal Spark class needs to be kryo-registered > > > Key: SPARK-21569 > URL: https://issues.apache.org/jira/browse/SPARK-21569 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Major > > [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf] > As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when > {{spark.kryo.registrationRequired=true}}) with: > {code} > java.lang.IllegalArgumentException: Class is not registered: > org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage > Note: To register this class use: > kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class); > at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593) > at > org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > This internal Spark class should be kryo-registered by Spark by default. > This was not a problem in 2.1.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21569) Internal Spark class needs to be kryo-registered
[ https://issues.apache.org/jira/browse/SPARK-21569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514529#comment-16514529 ] YongGang Cao commented on SPARK-21569: -- It seems this is not workaround-able, at least from the Java side. I tried to register both of the following in SparkConf; no luck, I still get the "not registered" error message. {{}} {code:java} org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage.class, org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage[].class{code} > Internal Spark class needs to be kryo-registered > > > Key: SPARK-21569 > URL: https://issues.apache.org/jira/browse/SPARK-21569 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ryan Williams >Priority: Major > > [Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf] > As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when > {{spark.kryo.registrationRequired=true}}) with: > {code} > java.lang.IllegalArgumentException: Class is not registered: > org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage > Note: To register this class use: > kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class); > at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79) > at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488) > at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593) > at > org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > This internal Spark class should be kryo-registered by Spark by default. > This was not a problem in 2.1.1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
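[Editor's note] One detail worth checking in the snippets above: the JVM binary name of an inner class uses '$', not '.', and since this class is private to Spark it cannot be referenced as a class literal from user code. Below is a hedged sketch of a registration attempt by binary name; the commenter reports registration did not work for them, so there is no guarantee this resolves the issue.
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    // Inner classes use '$' in their binary names; Class.forName avoids the
    // compile-time access restriction on this private[spark] class.
    Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage"),
    // JVM descriptor for the corresponding array type.
    Class.forName("[Lorg.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage;")))
{code}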
[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried
[ https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514526#comment-16514526 ] Apache Spark commented on SPARK-24552: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21577 > Task attempt numbers are reused when stages are retried > --- > > Key: SPARK-24552 > URL: https://issues.apache.org/jira/browse/SPARK-24552 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1 >Reporter: Ryan Blue >Priority: Blocker > > When stages are retried due to shuffle failures, task attempt numbers are > reused. This causes a correctness bug in the v2 data sources write path. > Data sources (both the original and v2) pass the task attempt to writers so > that writers can use the attempt number to track and clean up data from > failed or speculative attempts. In the v2 docs for DataWriterFactory, the > attempt number's javadoc states that "Implementations can use this attempt > number to distinguish writers of different task attempts." > When two attempts of a stage use the same (partition, attempt) pair, two > tasks can create the same data and attempt to commit. The commit coordinator > prevents both from committing and will abort the attempt that finishes last. > When using the (partition, attempt) pair to track data, the aborted task may > delete data associated with the (partition, attempt) pair. If that happens, > the data for the task that committed is deleted as well, which is a > correctness bug. > For a concrete example, I have a data source that creates files in place > named with {{part---.}}. Because these > files are written in place, both tasks create the same file and the one that > is aborted deletes the file, leading to data corruption when the file is > added to the table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24569) Spark Aggregator with output type Option[Boolean] creates column of type Row
John Conwell created SPARK-24569: Summary: Spark Aggregator with output type Option[Boolean] creates column of type Row Key: SPARK-24569 URL: https://issues.apache.org/jira/browse/SPARK-24569 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Environment: OSX Reporter: John Conwell A Spark SQL Aggregator that returns an output column of Option[Boolean] creates a column of type StructField(,StructType(StructField(value,BooleanType,true)),true) instead of StructField(,BooleanType,true). In other words, it puts a Row instance into the new column. Reproduction:
{code:scala}
class OptionBooleanAggregatorTest extends BaseFreeSpec {
  val ss: SparkSession = getSparkSession

  "test option" in {
    import ss.implicits._
    val df = List(
      Thing("bob", Some(true)),
      Thing("bob", Some(false)),
      Thing("bob", None))
      .toDF()

    val group = df
      .groupBy("name")
      .agg(OptionBooleanAggregator("isGood").toColumn.alias("isGood"))
      .cache()

    assert(group.schema("name").dataType == StringType)
    // this will fail
    assert(group.schema("isGood").dataType == BooleanType)
  }
}

case class Thing(name: String, isGood: Option[Boolean])

case class OptionBooleanAggregator(colName: String)
  extends Aggregator[Row, Option[Boolean], Option[Boolean]] {

  override def zero: Option[Boolean] = Option.empty[Boolean]

  override def reduce(buffer: Option[Boolean], row: Row): Option[Boolean] = {
    val index = row.fieldIndex(colName)
    val value = if (row.isNullAt(index)) Option.empty[Boolean] else Some(row.getBoolean(index))
    merge(buffer, value)
  }

  override def merge(b1: Option[Boolean], b2: Option[Boolean]): Option[Boolean] = {
    if ((b1.isDefined && b1.get) || (b2.isDefined && b2.get)) {
      Some(true)
    } else if (b1.isDefined) {
      b1
    } else b2
  }

  override def finish(reduction: Option[Boolean]): Option[Boolean] = reduction

  override def bufferEncoder: Encoder[Option[Boolean]] = OptionalBoolEncoder
  override def outputEncoder: Encoder[Option[Boolean]] = OptionalBoolEncoder

  def OptionalBoolEncoder: org.apache.spark.sql.Encoder[Option[Boolean]] =
    org.apache.spark.sql.catalyst.encoders.ExpressionEncoder()
}
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24452) long = int*int or long = int+int may cause overflow.
[ https://issues.apache.org/jira/browse/SPARK-24452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24452. - Resolution: Fixed Fix Version/s: 2.3.2 2.4.0 Issue resolved by pull request 21481 [https://github.com/apache/spark/pull/21481] > long = int*int or long = int+int may cause overflow. > > > Key: SPARK-24452 > URL: https://issues.apache.org/jira/browse/SPARK-24452 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.4.0, 2.3.2 > > > The following assignments cause overflow on the right-hand side. As a result, the > result may be negative. > {code:java} > long = int*int > long = int+int{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24452) long = int*int or long = int+int may cause overflow.
[ https://issues.apache.org/jira/browse/SPARK-24452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24452: --- Assignee: Kazuaki Ishizaki > long = int*int or long = int+int may cause overflow. > > > Key: SPARK-24452 > URL: https://issues.apache.org/jira/browse/SPARK-24452 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.3.2, 2.4.0 > > > The following assignments cause overflow on the right-hand side. As a result, the > result may be negative. > {code:java} > long = int*int > long = int+int{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
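[Editor's note] A minimal illustration of the bug class being fixed here: the arithmetic happens in Int before the widening assignment, so the overflow occurs on the right-hand side.
{code:scala}
val n = 100000
val wrong: Long = n * n           // Int * Int overflows first: 1410065408
val right: Long = n.toLong * n    // widen before multiplying: 10000000000L
{code}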
[jira] [Updated] (SPARK-24552) Task attempt numbers are reused when stages are retried
[ https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24552: -- Affects Version/s: 2.2.0 2.2.1 2.3.0 2.3.1 > Task attempt numbers are reused when stages are retried > --- > > Key: SPARK-24552 > URL: https://issues.apache.org/jira/browse/SPARK-24552 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1 >Reporter: Ryan Blue >Priority: Blocker > > When stages are retried due to shuffle failures, task attempt numbers are > reused. This causes a correctness bug in the v2 data sources write path. > Data sources (both the original and v2) pass the task attempt to writers so > that writers can use the attempt number to track and clean up data from > failed or speculative attempts. In the v2 docs for DataWriterFactory, the > attempt number's javadoc states that "Implementations can use this attempt > number to distinguish writers of different task attempts." > When two attempts of a stage use the same (partition, attempt) pair, two > tasks can create the same data and attempt to commit. The commit coordinator > prevents both from committing and will abort the attempt that finishes last. > When using the (partition, attempt) pair to track data, the aborted task may > delete data associated with the (partition, attempt) pair. If that happens, > the data for the task that committed is deleted as well, which is a > correctness bug. > For a concrete example, I have a data source that creates files in place > named with {{part---.}}. Because these > files are written in place, both tasks create the same file and the one that > is aborted deletes the file, leading to data corruption when the file is > added to the table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24552) Task attempt numbers are reused when stages are retried
[ https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-24552: -- Priority: Blocker (was: Critical) > Task attempt numbers are reused when stages are retried > --- > > Key: SPARK-24552 > URL: https://issues.apache.org/jira/browse/SPARK-24552 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.3.1 >Reporter: Ryan Blue >Priority: Blocker > > When stages are retried due to shuffle failures, task attempt numbers are > reused. This causes a correctness bug in the v2 data sources write path. > Data sources (both the original and v2) pass the task attempt to writers so > that writers can use the attempt number to track and clean up data from > failed or speculative attempts. In the v2 docs for DataWriterFactory, the > attempt number's javadoc states that "Implementations can use this attempt > number to distinguish writers of different task attempts." > When two attempts of a stage use the same (partition, attempt) pair, two > tasks can create the same data and attempt to commit. The commit coordinator > prevents both from committing and will abort the attempt that finishes last. > When using the (partition, attempt) pair to track data, the aborted task may > delete data associated with the (partition, attempt) pair. If that happens, > the data for the task that committed is deleted as well, which is a > correctness bug. > For a concrete example, I have a data source that creates files in place > named with {{part---.}}. Because these > files are written in place, both tasks create the same file and the one that > is aborted deletes the file, leading to data corruption when the file is > added to the table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
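[Editor's note] To make the collision concrete, a sketch of how a writer could sidestep it, assuming the fix direction of using a globally unique id: TaskContext.taskAttemptId() is unique across stage retries, unlike attemptNumber(), which restarts with each stage attempt. The file-name template below is made up.
{code:scala}
import org.apache.spark.TaskContext

def outputFileName(partitionId: Int, extension: String): String = {
  val ctx = TaskContext.get()
  // attemptNumber() can repeat across stage retries; taskAttemptId() cannot.
  f"part-$partitionId%05d-${ctx.taskAttemptId()}.$extension"
}
{code}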
[jira] [Resolved] (SPARK-24525) Provide an option to limit MemorySink memory usage
[ https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-24525. - Resolution: Fixed Fix Version/s: 2.4.0 Resolved by [https://github.com/apache/spark/pull/21559] > Provide an option to limit MemorySink memory usage > -- > > Key: SPARK-24525 > URL: https://issues.apache.org/jira/browse/SPARK-24525 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Assignee: Mukul Murthy >Priority: Major > Fix For: 2.4.0 > > > MemorySink stores stream results in memory and is mostly used for testing and > displaying streams, but for large streams, this can OOM the driver. We should > add an option to limit the number of rows and the total size of a memory sink > and not add any new data once either limit is hit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24525) Provide an option to limit MemorySink memory usage
[ https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-24525: --- Assignee: Mukul Murthy > Provide an option to limit MemorySink memory usage > -- > > Key: SPARK-24525 > URL: https://issues.apache.org/jira/browse/SPARK-24525 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mukul Murthy >Assignee: Mukul Murthy >Priority: Major > > MemorySink stores stream results in memory and is mostly used for testing and > displaying streams, but for large streams, this can OOM the driver. We should > add an option to limit the number of rows and the total size of a memory sink > and not add any new data once either limit is hit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
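[Editor's note] For illustration, a sketch of how such a cap might look from the user side; the option name "maxRows" below is hypothetical — see the linked PR for the option that was actually added.
{code:scala}
import org.apache.spark.sql.DataFrame

def startBoundedMemorySink(df: DataFrame) =
  df.writeStream
    .format("memory")
    .queryName("results")
    .option("maxRows", 1000000)   // hypothetical option name; check the PR
    .start()
{code}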
[jira] [Resolved] (SPARK-24396) Add Structured Streaming ForeachWriter for python
[ https://issues.apache.org/jira/browse/SPARK-24396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-24396. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21477 [https://github.com/apache/spark/pull/21477] > Add Structured Streaming ForeachWriter for python > - > > Key: SPARK-24396 > URL: https://issues.apache.org/jira/browse/SPARK-24396 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0 > > > Users should be able to write ForeachWriter code in Python; that is, they > should be able to use the partition id and the version/batchId/epochId to > conditionally process rows. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
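[Editor's note] For readers unfamiliar with the contract being ported, a sketch of the existing Scala ForeachWriter that the Python API mirrors: returning false from open(partitionId, epochId) lets a sink skip a (partition, epoch) pair it has already committed, e.g. after a retry. The dedup check and write here are stubs.
{code:scala}
import org.apache.spark.sql.ForeachWriter

class IdempotentWriter extends ForeachWriter[String] {
  private var active = false

  override def open(partitionId: Long, epochId: Long): Boolean = {
    // Return false to skip a (partitionId, epochId) pair that was already
    // processed, e.g. when a task is retried.
    active = !alreadyCommitted(partitionId, epochId)
    active
  }

  override def process(value: String): Unit = if (active) write(value)

  override def close(errorOrNull: Throwable): Unit = ()

  private def alreadyCommitted(p: Long, e: Long): Boolean = false  // stub
  private def write(v: String): Unit = println(v)                  // stub
}
{code}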
[jira] [Created] (SPARK-24568) Code refactoring for DataType equalsXXX methods
Maryann Xue created SPARK-24568: --- Summary: Code refactoring for DataType equalsXXX methods Key: SPARK-24568 URL: https://issues.apache.org/jira/browse/SPARK-24568 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Maryann Xue Fix For: 2.4.0 Right now there is a lot of code duplication between all DataType equalsXXX methods: {{equalsIgnoreNullability}}, {{equalsIgnoreCaseAndNullability}}, {{equalsIgnoreCompatibleNullability}}, {{equalsStructurally}}. We can replace the duplicated code with a helper function. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
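[Editor's note] A sketch of the shared-helper idea (names invented; this is not the actual refactoring): one recursive structural comparison parameterized by how two leaf fields are compared, which each equalsXXX variant then instantiates.
{code:scala}
import org.apache.spark.sql.types._

object DataTypeEqualitySketch {
  private def compare(
      left: DataType,
      right: DataType,
      fieldEq: (StructField, StructField) => Boolean): Boolean =
    (left, right) match {
      case (l: ArrayType, r: ArrayType) =>
        compare(l.elementType, r.elementType, fieldEq)
      case (l: MapType, r: MapType) =>
        compare(l.keyType, r.keyType, fieldEq) &&
          compare(l.valueType, r.valueType, fieldEq)
      case (l: StructType, r: StructType) =>
        l.fields.length == r.fields.length &&
          l.fields.zip(r.fields).forall { case (lf, rf) =>
            fieldEq(lf, rf) && compare(lf.dataType, rf.dataType, fieldEq)
          }
      case _ => left == right
    }

  // e.g. an "ignore nullability" variant compares field names case-sensitively
  // and never looks at the nullable flag.
  def equalsIgnoreNullability(l: DataType, r: DataType): Boolean =
    compare(l, r, (lf, rf) => lf.name == rf.name)
}
{code}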
[jira] [Comment Edited] (SPARK-23435) R tests should support latest testthat
[ https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514111#comment-16514111 ] Weiqiang Zhuang edited comment on SPARK-23435 at 6/15/18 5:14 PM: -- yes, quite similar
```
sp <- getNamespace("SparkR")
attach(sp)
test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests"))
```
was (Author: adrian555): yes, quite similar ``` sp <- getNamespace("SparkR") attach(sp) test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests") ) ``` > R tests should support latest testthat > -- > > Key: SPARK-23435 > URL: https://issues.apache.org/jira/browse/SPARK-23435 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Major > > To follow up on SPARK-22817, the latest version of testthat, 2.0.0, was > released in Dec 2017, and its method has been changed. > In order for our tests to keep working, we need to detect that and call a > different method. > Jenkins is running 1.0.1 though, we need to check if it is going to work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23435) R tests should support latest testthat
[ https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514111#comment-16514111 ] Weiqiang Zhuang commented on SPARK-23435: - yes, quite similar
```
sp <- getNamespace("SparkR")
attach(sp)
test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests"))
```
> R tests should support latest testthat > -- > > Key: SPARK-23435 > URL: https://issues.apache.org/jira/browse/SPARK-23435 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Major > > To follow up on SPARK-22817, the latest version of testthat, 2.0.0, was > released in Dec 2017, and its method has been changed. > In order for our tests to keep working, we need to detect that and call a > different method. > Jenkins is running 1.0.1 though, we need to check if it is going to work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24490) Use WebUI.addStaticHandler in web UIs
[ https://issues.apache.org/jira/browse/SPARK-24490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24490. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21510 [https://github.com/apache/spark/pull/21510] > Use WebUI.addStaticHandler in web UIs > - > > Key: SPARK-24490 > URL: https://issues.apache.org/jira/browse/SPARK-24490 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Trivial > Fix For: 2.4.0 > > > {{WebUI}} defines {{addStaticHandler}} that web UIs don't use (and simply > introduce duplication). Let's clean them up and remove duplications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24490) Use WebUI.addStaticHandler in web UIs
[ https://issues.apache.org/jira/browse/SPARK-24490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24490: -- Assignee: Jacek Laskowski > Use WebUI.addStaticHandler in web UIs > - > > Key: SPARK-24490 > URL: https://issues.apache.org/jira/browse/SPARK-24490 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Trivial > Fix For: 2.4.0 > > > {{WebUI}} defines {{addStaticHandler}} that web UIs don't use (and simply > introduce duplication). Let's clean them up and remove duplications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
[ https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-24531: --- Fix Version/s: 2.3.2 > HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version > - > > Key: SPARK-24531 > URL: https://issues.apache.org/jira/browse/SPARK-24531 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Blocker > Fix For: 2.2.2, 2.3.2, 2.4.0 > > > We have many build failures caused by HiveExternalCatalogVersionsSuite > failing because Spark 2.2.0 is not present anymore in the mirrors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
[ https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-24531: --- Fix Version/s: 2.2.2 > HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version > - > > Key: SPARK-24531 > URL: https://issues.apache.org/jira/browse/SPARK-24531 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Blocker > Fix For: 2.2.2, 2.3.2, 2.4.0 > > > We have many build failures caused by HiveExternalCatalogVersionsSuite > failing because Spark 2.2.0 is not present anymore in the mirrors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514031#comment-16514031 ] Hyukjin Kwon commented on SPARK-24535: -- When I tried it before, I remember I faced some issues. Mind if I ask you to share the steps you took, just roughly? It doesn't have to be perfect if it's messy, but I want to try what you did. > Fix java version parsing in SparkR > -- > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Priority: Major > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514020#comment-16514020 ] Shivaram Venkataraman commented on SPARK-24535: --- I was going to do it manually. If we can do it in the PR builder, that would be great! > Fix java version parsing in SparkR > -- > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Priority: Major > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514016#comment-16514016 ] Hyukjin Kwon commented on SPARK-24535: -- [~shivaram], how did you run this on Windows? I think we'd better run this in the PR builder. > Fix java version parsing in SparkR > -- > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Priority: Major > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514012#comment-16514012 ] Shivaram Venkataraman commented on SPARK-24535: --- I'm not sure if it's only failing on Windows -- the Debian test on CRAN did not have the same Java version. I can try a test on Windows later today to see what I find. > Fix java version parsing in SparkR > -- > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Priority: Major > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
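The failing call in the report assumes the quoted version string sits on the first line of {{java -version}} output, which breaks when lines such as {{Picked up _JAVA_OPTIONS: ...}} are printed first. The real fix belongs in SparkR's {{checkJavaVersion()}}; as a rough, hypothetical sketch of the more defensive parsing (names are illustrative, and it is shown in Scala rather than R purely for illustration):
{code:scala}
// Match the quoted version anywhere in the output instead of assuming
// it sits on the first line.
val versionPattern = """version "(\d+)\.(\d+)""".r

def parseJavaMajor(javaVersionOutput: String): Option[Int] =
  versionPattern.findFirstMatchIn(javaVersionOutput).map { m =>
    val first = m.group(1).toInt
    // "1.8.0_144" means Java 8; Java 9+ reports the major version directly.
    if (first == 1) m.group(2).toInt else first
  }

parseJavaMajor("Picked up _JAVA_OPTIONS: -XX:-UsePerfData\njava version \"1.8.0_144\"")
// => Some(8)
{code}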
[jira] [Resolved] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bharath kumar avusherla resolved SPARK-24476. - Resolution: Fixed > java.net.SocketTimeoutException: Read timed out under jets3t while running > the Spark Structured Streaming > - > > Key: SPARK-24476 > URL: https://issues.apache.org/jira/browse/SPARK-24476 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: bharath kumar avusherla >Priority: Minor > Attachments: socket-timeout-exception > > > We are working on a streaming application using Spark Structured > Streaming with checkpointing in S3. When we start the application, it runs > just fine for some time, then it crashes with the error mentioned below. How > long it runs successfully varies: sometimes it runs for 2 days without any > issues and then crashes, sometimes it crashes after 4 or 24 hours. > Our streaming application joins (left and inner) multiple sources from Kafka > as well as S3 and an Aurora database. > Can you please let us know how to solve this problem? > Is it possible to somehow tweak the socket timeout? > A few lines of the exception log are pasted below; the complete exception is > also attached to the issue. > *_Exception:_* > *_Caused by: java.net.SocketTimeoutException: Read timed out_* > _at java.net.SocketInputStream.socketRead0(Native Method)_ > _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ > _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ > _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ > _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ > _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ > _at > sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ > _at > sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ > _at > sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ > _at > org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ > _at > org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed
[ https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24534: Assignee: Apache Spark > Add a way to bypass entrypoint.sh script if no spark cmd is passed > -- > > Key: SPARK-24534 > URL: https://issues.apache.org/jira/browse/SPARK-24534 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Ricardo Martinelli de Oliveira >Assignee: Apache Spark >Priority: Minor > > As an improvement to the entrypoint.sh script, I'd like to propose that the > spark entrypoint do a passthrough if driver/executor/init is not the command > passed. Currently it raises an error. > To be more specific, I'm talking about these lines: > [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114] > This allows the openshift-spark image to continue to function as a Spark > Standalone component, with custom configuration support etc., without > compromising the previous method of configuring the cluster inside a > Kubernetes environment. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed
[ https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24534: Assignee: (was: Apache Spark) > Add a way to bypass entrypoint.sh script if no spark cmd is passed > -- > > Key: SPARK-24534 > URL: https://issues.apache.org/jira/browse/SPARK-24534 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Ricardo Martinelli de Oliveira >Priority: Minor > > As an improvement to the entrypoint.sh script, I'd like to propose that the > spark entrypoint do a passthrough if driver/executor/init is not the command > passed. Currently it raises an error. > To be more specific, I'm talking about these lines: > [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114] > This allows the openshift-spark image to continue to function as a Spark > Standalone component, with custom configuration support etc., without > compromising the previous method of configuring the cluster inside a > Kubernetes environment. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed
[ https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513945#comment-16513945 ] Apache Spark commented on SPARK-24534: -- User 'rimolive' has created a pull request for this issue: https://github.com/apache/spark/pull/21572 > Add a way to bypass entrypoint.sh script if no spark cmd is passed > -- > > Key: SPARK-24534 > URL: https://issues.apache.org/jira/browse/SPARK-24534 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Ricardo Martinelli de Oliveira >Priority: Minor > > As an improvement to the entrypoint.sh script, I'd like to propose that the > spark entrypoint do a passthrough if driver/executor/init is not the command > passed. Currently it raises an error. > To be more specific, I'm talking about these lines: > [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114] > This allows the openshift-spark image to continue to function as a Spark > Standalone component, with custom configuration support etc., without > compromising the previous method of configuring the cluster inside a > Kubernetes environment. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513933#comment-16513933 ] Steve Loughran commented on SPARK-24476: * Use S3A, as S3N is unsupported and has been deleted from recent versions of Hadoop. Nobody tests it either. * Don't know about speculation. S3 in general isn't good here, as the speculation code in (Hadoop, Spark, Hive, ...) assumes that renames are fast and, for directories, atomic. You can get into serious trouble there. I haven't looked at what Spark Streaming's commit protocol is in any detail; it's still on my TODO list. My recommendation: stay with S3A and close this as cannot-reproduce for now > java.net.SocketTimeoutException: Read timed out under jets3t while running > the Spark Structured Streaming > - > > Key: SPARK-24476 > URL: https://issues.apache.org/jira/browse/SPARK-24476 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: bharath kumar avusherla >Priority: Minor > Attachments: socket-timeout-exception > > > We are working on a streaming application using Spark Structured > Streaming with checkpointing in S3. When we start the application, it runs > just fine for some time, then it crashes with the error mentioned below. How > long it runs successfully varies: sometimes it runs for 2 days without any > issues and then crashes, sometimes it crashes after 4 or 24 hours. > Our streaming application joins (left and inner) multiple sources from Kafka > as well as S3 and an Aurora database. > Can you please let us know how to solve this problem? > Is it possible to somehow tweak the socket timeout? > A few lines of the exception log are pasted below; the complete exception is > also attached to the issue. > *_Exception:_* > *_Caused by: java.net.SocketTimeoutException: Read timed out_* > _at java.net.SocketInputStream.socketRead0(Native Method)_ > _at java.net.SocketInputStream.read(SocketInputStream.java:150)_ > _at java.net.SocketInputStream.read(SocketInputStream.java:121)_ > _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_ > _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_ > _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_ > _at > sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_ > _at > sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_ > _at > sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_ > _at > org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_ > _at > org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
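In practice the advice above amounts to pointing both the output and the checkpoint location at the s3a connector and, if needed, raising its socket timeout. A minimal sketch, assuming hadoop-aws is on the classpath; the bucket, topic, broker address, and tuning values are all placeholders, not taken from the report:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-streaming-s3a")
  // fs.s3a.* options come from hadoop-aws; the values here are illustrative.
  .config("spark.hadoop.fs.s3a.connection.timeout", "200000") // socket timeout, ms
  .config("spark.hadoop.fs.s3a.attempts.maximum", "20")       // retries inside the S3A client
  .getOrCreate()

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output/")                    // s3a://, not s3n://
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
  .start()
{code}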
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513897#comment-16513897 ] Mihaly Toth commented on SPARK-22918: - I managed to reproduce the problem in a unit test. When using a security manager (with Derby) one needs to apply a security policy using {{Policy.setPolicy()}}. In its {{.getPermissions.implies}} one is tempted to use {{new SystemPermission("engine", "usederbyinternals")}}. This works fine, but when you run a Spark session it is seemingly ignored. This is caused by the IsolatedClassLoader: {{SystemPermission}} does not work across class loaders, meaning the permission being checked must come from the same class loader as the one defined in the Policy. Otherwise their classes will not be equal and the call gets rejected. One solution is to use, in the policy, another permission that compares only the name and class name, wrapping the original {{SystemPermission}}, like:
{code:scala}
import java.security.Permission
// Assuming commons-lang3 for the reflection helpers; adjust if using 2.x.
import org.apache.commons.lang3.builder.EqualsBuilder.reflectionEquals
import org.apache.commons.lang3.builder.HashCodeBuilder.reflectionHashCode

// delegate is the original SystemPermission("engine", "usederbyinternals").
// implies() compares class *names*, so it also matches permissions loaded
// by a different class loader.
new Permission(delegate.getName) {
  override def getActions: String = delegate.getActions
  override def implies(permission: Permission): Boolean =
    delegate.getClass.getCanonicalName == permission.getClass.getCanonicalName &&
      delegate.getName == permission.getName
  override def hashCode(): Int = reflectionHashCode(this)
  override def equals(obj: scala.Any): Boolean =
    reflectionEquals(this, obj.asInstanceOf[AnyRef])
}
{code}
At least this one worked for me. It also works with {{new AllPermission()}} in case one does not really need fine-grained access control. > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot >Priority: Major > > After upgrading 2.2.0 -> 2.2.1 the sbt test command in one of my projects started > to fail with the following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at >
org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at >
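For completeness, a minimal sketch of installing such a policy before the SparkSession (and hence Derby) starts, using the coarse-grained {{AllPermission}} variant the comment above mentions; swap in the name-matching wrapper if finer control is wanted:
{code:scala}
import java.security.{AllPermission, CodeSource, PermissionCollection, Permissions, Policy}

// Grant everything to all code sources; sufficient for tests that only
// need to get past the Derby "usederbyinternals" check.
Policy.setPolicy(new Policy {
  override def getPermissions(codesource: CodeSource): PermissionCollection = {
    val perms = new Permissions()
    perms.add(new AllPermission())
    perms
  }
})
{code}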
[jira] [Assigned] (SPARK-21743) top-most limit should not cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21743: Assignee: Apache Spark (was: Wenchen Fan) > top-most limit should not cause memory leak > --- > > Key: SPARK-21743 > URL: https://issues.apache.org/jira/browse/SPARK-21743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21743) top-most limit should not cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21743: Assignee: Wenchen Fan (was: Apache Spark) > top-most limit should not cause memory leak > --- > > Key: SPARK-21743 > URL: https://issues.apache.org/jira/browse/SPARK-21743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21743) top-most limit should not cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reopened SPARK-21743: --- Reopening issue, this is causing a regression in the CSV reader. > top-most limit should not cause memory leak > --- > > Key: SPARK-21743 > URL: https://issues.apache.org/jira/browse/SPARK-21743 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24567) nodeBlacklist does not get updated if a spark executor fails to launch on a mesos node
Igor Berman created SPARK-24567: --- Summary: nodeBlacklist does not get updated if a spark executor fails to launch on a mesos node Key: SPARK-24567 URL: https://issues.apache.org/jira/browse/SPARK-24567 Project: Spark Issue Type: Bug Components: Mesos, Scheduler Affects Versions: 2.4.0 Reporter: Igor Berman As the fix for SPARK-19755 we removed the custom blacklisting mechanism in the spark-mesos integration, which had a hardcoded constant of at most 2 failures before a node was marked as blacklisted. From now on the usual blacklisting mechanism is in use (when enabled); however, it has the downside of not counting failures to launch Mesos tasks (Spark executors), i.e. only failures in Spark tasks will be counted. [~squito] [~felixcheung] [~susanxhuynh] [~skonto] please add details as you see fit -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513649#comment-16513649 ] Liang-Chi Hsieh edited comment on SPARK-24465 at 6/15/18 10:32 AM: --- I'm not sure SPARK-12878 is a real issue. It seems to me that one just needs to write the nested UDT code the right way. Please see my comment on SPARK-12878. was (Author: viirya): I'm not sure SPARK-12878 is a real issue. It seems to me that one just needs to write the nested UDT code the right way. > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming (and I believe > are the final Transformers which are not compatible). These do not work > because Spark SQL does not support nested types containing UDTs; see > [SPARK-12878]. > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24465) LSHModel should support Structured Streaming for transform
[ https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513649#comment-16513649 ] Liang-Chi Hsieh commented on SPARK-24465: - I'm not sure SPARK-12878 is a real issue. It seems to me that one just needs to write the nested UDT code the right way. > LSHModel should support Structured Streaming for transform > -- > > Key: SPARK-24465 > URL: https://issues.apache.org/jira/browse/SPARK-24465 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, > MinHashLSHModel) are not compatible with Structured Streaming (and I believe > are the final Transformers which are not compatible). These do not work > because Spark SQL does not support nested types containing UDTs; see > [SPARK-12878]. > This task is to add unit tests for streaming (as in [SPARK-22644]) for > LSHModels after [SPARK-12878] has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12878) Dataframe fails with nested User Defined Types
[ https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513647#comment-16513647 ] Liang-Chi Hsieh commented on SPARK-12878: - Is this a real issue? It seems to me that you can't write a nested UDT like the example code in the description. The nested UDT example should look like the following; you need to serialize the nested UDT objects when you serialize the wrapper object:
{code:scala}
@SQLUserDefinedType(udt = classOf[WrapperUDT])
case class Wrapper(list: Seq[Element])

class WrapperUDT extends UserDefinedType[Wrapper] {
  override def sqlType: DataType = StructType(Seq(StructField("list",
    ArrayType(new ElementUDT(), containsNull = false), nullable = true)))

  override def userClass: Class[Wrapper] = classOf[Wrapper]

  override def serialize(obj: Wrapper): Any = obj match {
    case Wrapper(list) =>
      val row = new GenericInternalRow(1)
      val elementUDT = new ElementUDT()
      val serializedElements = list.map((e: Element) => elementUDT.serialize(e))
      row.update(0, new GenericArrayData(serializedElements.toArray))
      row
  }

  override def deserialize(datum: Any): Wrapper = datum match {
    case row: InternalRow =>
      val elementUDT = new ElementUDT()
      Wrapper(row.getArray(0).toArray(elementUDT).map((e: Any) => elementUDT.deserialize(e)))
  }
}

@SQLUserDefinedType(udt = classOf[ElementUDT])
case class Element(num: Int)

class ElementUDT extends UserDefinedType[Element] {
  override def sqlType: DataType = StructType(Seq(StructField("num", IntegerType, nullable = false)))

  override def userClass: Class[Element] = classOf[Element]

  override def serialize(obj: Element): Any = obj match {
    case Element(num) =>
      val row = new GenericInternalRow(1)
      row.setInt(0, num)
      row
  }

  override def deserialize(datum: Any): Element = datum match {
    case row: InternalRow => Element(row.getInt(0))
  }
}

val data = Seq(Wrapper(Seq(Element(1), Element(2))), Wrapper(Seq(Element(3), Element(4))))
val df = sparkContext.parallelize((1 to 2).zip(data)).toDF("id", "b")
df.collect().map(println(_))
{code}
{code}
[1,Wrapper(ArraySeq(Element(1), Element(2)))]
[2,Wrapper(ArraySeq(Element(3), Element(4)))]
{code}
> Dataframe fails with nested User Defined Types > -- > > Key: SPARK-12878 > URL: https://issues.apache.org/jira/browse/SPARK-12878 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Joao Duarte >Priority: Major > > Spark 1.6.0 crashes when using nested User Defined Types in a Dataframe.
> In version 1.5.2 the code below worked just fine: > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.catalyst.expressions.GenericMutableRow > import org.apache.spark.sql.types._ > @SQLUserDefinedType(udt = classOf[AUDT]) > case class A(list:Seq[B]) > class AUDT extends UserDefinedType[A] { > override def sqlType: DataType = StructType(Seq(StructField("list", > ArrayType(BUDT, containsNull = false), nullable = true))) > override def userClass: Class[A] = classOf[A] > override def serialize(obj: Any): Any = obj match { > case A(list) => > val row = new GenericMutableRow(1) > row.update(0, new > GenericArrayData(list.map(_.asInstanceOf[Any]).toArray)) > row > } > override def deserialize(datum: Any): A = { > datum match { > case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq) > } > } > } > object AUDT extends AUDT > @SQLUserDefinedType(udt = classOf[BUDT]) > case class B(text:Int) > class BUDT extends UserDefinedType[B] { > override def sqlType: DataType = StructType(Seq(StructField("num", > IntegerType, nullable = false))) > override def userClass: Class[B] = classOf[B] > override def serialize(obj: Any): Any = obj match { > case B(text) => > val row = new GenericMutableRow(1) > row.setInt(0, text) > row > } > override def deserialize(datum: Any): B = { > datum match { case row: InternalRow => new B(row.getInt(0)) } > } > } > object BUDT extends BUDT > object Test { > def main(args:Array[String]) = { > val col = Seq(new A(Seq(new B(1), new B(2))), > new A(Seq(new B(3), new B(4 > val sc = new SparkContext(new > SparkConf().setMaster("local[1]").setAppName("TestSpark")) > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(1 to 2 zip col).toDF("id","b") > df.select("b").show() > df.collect().foreach(println) > } > } > In the new version (1.6.0) I needed to include the following import: > import org.apache.spark.sql.catalyst.expressions.GenericMutableRow > However, Spark crashes in runtime: > 16/01/18
[jira] [Commented] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
[ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513642#comment-16513642 ] Lucas Partridge commented on SPARK-19498: - How would you prefer people to provide their input on this? Via comments on this Jira issue, or somewhere else? > Discussion: Making MLlib APIs extensible for 3rd party libraries > > > Key: SPARK-19498 > URL: https://issues.apache.org/jira/browse/SPARK-19498 > Project: Spark > Issue Type: Brainstorming > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Critical > > Per the recent discussion on the dev list, this JIRA is for discussing how we > can make MLlib DataFrame-based APIs more extensible, especially for the > purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs > (for custom Transformers, Estimators, etc.). > * For people who have written such libraries, what issues have you run into? > * What APIs are not public or extensible enough? Do they require changes > before being made more public? > * Are APIs for non-Scala languages such as Java and Python friendly or > extensive enough? > The easy answer is to make everything public, but that would of course be > terrible in the long term. Let's discuss what is needed and how we can present > stable, sufficient, and easy-to-use APIs for 3rd-party developers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results
[ https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-24548: Component/s: (was: Spark Core) > JavaPairRDD to Dataset in SPARK generates ambiguous results > > > Key: SPARK-24548 > URL: https://issues.apache.org/jira/browse/SPARK-24548 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.3.0 > Environment: Using Windows 10, on a 64-bit machine with 16 GB of RAM. >Reporter: Jackson >Priority: Major > > I have data in the JavaPairRDD below: > {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD; > {quote} > I tried using the code below: > {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 = > Encoders.tuple(Encoders.STRING(), > Encoders.tuple(Encoders.STRING(),Encoders.STRING())); > Dataset<Row> newDataSet = > spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2"); > newDataSet.printSchema(); > {quote} > {{root}} > {{ |-- value1: string (nullable = true)}} > {{ |-- value2: struct (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > But after creating a StackOverflow question > (https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark), > I learned that the values in a tuple should have distinct field names, > whereas in this case the same name is generated for both. Because of this I > cannot select a specific column under value2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results
[ https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-24548: Component/s: SQL > JavaPairRDD to Dataset in SPARK generates ambiguous results > > > Key: SPARK-24548 > URL: https://issues.apache.org/jira/browse/SPARK-24548 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.3.0 > Environment: Using Windows 10, on a 64-bit machine with 16 GB of RAM. >Reporter: Jackson >Priority: Major > > I have data in the JavaPairRDD below: > {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD; > {quote} > I tried using the code below: > {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 = > Encoders.tuple(Encoders.STRING(), > Encoders.tuple(Encoders.STRING(),Encoders.STRING())); > Dataset<Row> newDataSet = > spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2"); > newDataSet.printSchema(); > {quote} > {{root}} > {{ |-- value1: string (nullable = true)}} > {{ |-- value2: struct (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > But after creating a StackOverflow question > (https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark), > I learned that the values in a tuple should have distinct field names, > whereas in this case the same name is generated for both. Because of this I > cannot select a specific column under value2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results
[ https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24548: Assignee: Apache Spark > JavaPairRDD to Dataset in SPARK generates ambiguous results > > > Key: SPARK-24548 > URL: https://issues.apache.org/jira/browse/SPARK-24548 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core >Affects Versions: 2.3.0 > Environment: Using Windows 10, on a 64-bit machine with 16 GB of RAM. >Reporter: Jackson >Assignee: Apache Spark >Priority: Major > > I have data in the JavaPairRDD below: > {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD; > {quote} > I tried using the code below: > {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 = > Encoders.tuple(Encoders.STRING(), > Encoders.tuple(Encoders.STRING(),Encoders.STRING())); > Dataset<Row> newDataSet = > spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2"); > newDataSet.printSchema(); > {quote} > {{root}} > {{ |-- value1: string (nullable = true)}} > {{ |-- value2: struct (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > But after creating a StackOverflow question > (https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark), > I learned that the values in a tuple should have distinct field names, > whereas in this case the same name is generated for both. Because of this I > cannot select a specific column under value2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results
[ https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513571#comment-16513571 ] Apache Spark commented on SPARK-24548: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/21576 > JavaPairRDD to Dataset in SPARK generates ambiguous results > > > Key: SPARK-24548 > URL: https://issues.apache.org/jira/browse/SPARK-24548 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core >Affects Versions: 2.3.0 > Environment: Using Windows 10, on a 64-bit machine with 16 GB of RAM. >Reporter: Jackson >Priority: Major > > I have data in the JavaPairRDD below: > {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD; > {quote} > I tried using the code below: > {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 = > Encoders.tuple(Encoders.STRING(), > Encoders.tuple(Encoders.STRING(),Encoders.STRING())); > Dataset<Row> newDataSet = > spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2"); > newDataSet.printSchema(); > {quote} > {{root}} > {{ |-- value1: string (nullable = true)}} > {{ |-- value2: struct (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > But after creating a StackOverflow question > (https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark), > I learned that the values in a tuple should have distinct field names, > whereas in this case the same name is generated for both. Because of this I > cannot select a specific column under value2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results
[ https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24548: Assignee: (was: Apache Spark) > JavaPairRDD to Dataset in SPARK generates ambiguous results > > > Key: SPARK-24548 > URL: https://issues.apache.org/jira/browse/SPARK-24548 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core >Affects Versions: 2.3.0 > Environment: Using Windows 10, on a 64-bit machine with 16 GB of RAM. >Reporter: Jackson >Priority: Major > > I have data in the JavaPairRDD below: > {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD; > {quote} > I tried using the code below: > {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 = > Encoders.tuple(Encoders.STRING(), > Encoders.tuple(Encoders.STRING(),Encoders.STRING())); > Dataset<Row> newDataSet = > spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2"); > newDataSet.printSchema(); > {quote} > {{root}} > {{ |-- value1: string (nullable = true)}} > {{ |-- value2: struct (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > {{ | |-- value: string (nullable = true)}} > But after creating a StackOverflow question > (https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark), > I learned that the values in a tuple should have distinct field names, > whereas in this case the same name is generated for both. Because of this I > cannot select a specific column under value2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
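Until the fix in the pull request above lands, one possible workaround is to map the pair RDD into case classes so every nested field gets a distinct name. A sketch with illustrative names ({{Inner}}, {{Outer}}, and {{myPairRdd}} are not from the report):
{code:scala}
import org.apache.spark.api.java.JavaPairRDD
import org.apache.spark.sql.SparkSession

case class Inner(first: String, second: String)
case class Outer(value1: String, value2: Inner)

val spark = SparkSession.builder().appName("distinct-nested-names").getOrCreate()
import spark.implicits._

// myPairRdd: JavaPairRDD[String, (String, String)] -- placeholder input
def toDataset(myPairRdd: JavaPairRDD[String, (String, String)]) =
  myPairRdd.rdd
    .map { case (k, (a, b)) => Outer(k, Inner(a, b)) }
    .toDS()

// The schema now has distinct nested names, so columns are selectable:
//   toDataset(pairs).select("value2.first")
{code}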
[jira] [Commented] (SPARK-23435) R tests should support latest testthat
[ https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513416#comment-16513416 ] Felix Cheung commented on SPARK-23435: -- sorry, I did try but couldn't get it to work, but something like this?
{code:r}
# For testthat after 1.0.2, call test_dir, as run_tests is removed.
if (packageVersion("testthat") >= "2.0.0") {
  test_pkg_env <- list2env(as.list(getNamespace("SparkR"), all.names = TRUE),
                           parent = parent.env(getNamespace("SparkR")))
  withr::local_options(list(topLevelEnvironment = test_pkg_env))
  test_dir(file.path(sparkRDir, "pkg", "tests", "fulltests"),
           env = test_pkg_env, stop_on_failure = TRUE, stop_on_warning = FALSE)
} else {
  testthat:::run_tests("SparkR", file.path(sparkRDir, "pkg", "tests", "fulltests"),
                       NULL, "summary")
}
{code}
> R tests should support latest testthat > -- > > Key: SPARK-23435 > URL: https://issues.apache.org/jira/browse/SPARK-23435 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Major > > To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was > released in Dec 2017, and its method has been changed. > In order for our tests to keep working, we need to detect that and call a > different method. > Jenkins is running 1.0.1 though, we need to check if it is going to work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org