[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425959#comment-17425959 ] Micah Kornfield commented on SPARK-34276: - Sorry for the late reply. PARQUET-2089 has been a long-standing bug in the C++ implementation where we were setting file_offset to the beginning of column_chunk metadata and not the actual data page. It's not clear to me if this was a problem before parquet-mr 1.12 in practice. [~gershinsky] Would the fix in PARQUET-2078 make parquet-mr resilient to this bug? > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Chao Sun >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should > upgrade/revert Parquet. At the same time, we should encourage the whole > community to do the compatibility and performance tests for their production > workloads, including both read and write code paths. > More details: > [https://github.com/apache/spark/pull/26804#issuecomment-768790620] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
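For readers who want to check whether a given file exhibits the PARQUET-2089 symptom, a minimal sketch in Python using pyarrow (the file name is hypothetical, and the check encodes the bug description above as an assumption):
{code:python}
import pyarrow.parquet as pq

# PARQUET-2089: affected parquet-cpp writers set a column chunk's file_offset
# to the start of the column chunk metadata instead of the first page.
meta = pq.ParquetFile("data.parquet").metadata  # hypothetical file
for rg in range(meta.num_row_groups):
    for i in range(meta.num_columns):
        cc = meta.row_group(rg).column(i)
        # The chunk's first page is the dictionary page when present,
        # otherwise the first data page.
        first_page = (cc.dictionary_page_offset
                      if cc.has_dictionary_page else cc.data_page_offset)
        if cc.file_offset != first_page:
            print(f"row group {rg}, column {i}: file_offset {cc.file_offset} "
                  f"!= first page offset {first_page}")
{code}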
[jira] [Updated] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-35531: --- Affects Version/s: 3.0.0 3.1.1 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Priority: Major > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425954#comment-17425954 ] Gengliang Wang commented on SPARK-35531: I can reproduce the issue on 3.0.0 and 3.1.1. It's a long-standing bug. > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Hongyi Zhang >Priority: Major > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
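Until a fix lands, one possible workaround is to declare the bucket and sort columns in lower case so they match the lower-cased field schema Hive stores. A sketch of the repro with that change (the table name is illustrative, and whether lower-casing reliably avoids the check is an assumption based on the error message):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Same table as the repro, but with lower-case column names so the bucket
# spec matches Hive's lower-cased FieldSchema entries.
spark.sql("""
    CREATE TABLE test1_lc (
      v1 BIGINT,
      s1 INT)
    PARTITIONED BY (pk BIGINT)
    CLUSTERED BY (v1)
    SORTED BY (s1)
    INTO 200 BUCKETS
    STORED AS PARQUET
""")
spark.sql("INSERT INTO test1_lc SELECT * FROM VALUES (1, 1, 1)")
{code}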
[jira] [Commented] (SPARK-36952) Inline type hints for python/pyspark/resource/information.py and python/pyspark/resource/profile.py
[ https://issues.apache.org/jira/browse/SPARK-36952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425949#comment-17425949 ] dch nguyen commented on SPARK-36952: working on this > Inline type hints for python/pyspark/resource/information.py and > python/pyspark/resource/profile.py > --- > > Key: SPARK-36952 > URL: https://issues.apache.org/jira/browse/SPARK-36952 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Priority: Major > > Inline type hints for python/pyspark/resource/information.py and > python/pyspark/resource/profile.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36952) Inline type hints for python/pyspark/resource/information.py and python/pyspark/resource/profile.py
[ https://issues.apache.org/jira/browse/SPARK-36952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425947#comment-17425947 ] dgd_contributor commented on SPARK-36952: - working on this > Inline type hints for python/pyspark/resource/information.py and > python/pyspark/resource/profile.py > --- > > Key: SPARK-36952 > URL: https://issues.apache.org/jira/browse/SPARK-36952 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Priority: Major > > Inline type hints for python/pyspark/resource/information.py and > python/pyspark/resource/profile.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-36952) Inline type hints for python/pyspark/resource/information.py and python/pyspark/resource/profile.py
[ https://issues.apache.org/jira/browse/SPARK-36952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dgd_contributor updated SPARK-36952: Comment: was deleted (was: working on this) > Inline type hints for python/pyspark/resource/information.py and > python/pyspark/resource/profile.py > --- > > Key: SPARK-36952 > URL: https://issues.apache.org/jira/browse/SPARK-36952 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Priority: Major > > Inline type hints for python/pyspark/resource/information.py and > python/pyspark/resource/profile.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36953) Expose SQL state and error class in PySpark exceptions
[ https://issues.apache.org/jira/browse/SPARK-36953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425945#comment-17425945 ] Apache Spark commented on SPARK-36953: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34219 > Expose SQL state and error class in PySpark exceptions > -- > > Key: SPARK-36953 > URL: https://issues.apache.org/jira/browse/SPARK-36953 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-34920 introduced error classes and states but they are not accessible in > PySpark. We should make both available in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36953) Expose SQL state and error class in PySpark exceptions
[ https://issues.apache.org/jira/browse/SPARK-36953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36953: Assignee: (was: Apache Spark) > Expose SQL state and error class in PySpark exceptions > -- > > Key: SPARK-36953 > URL: https://issues.apache.org/jira/browse/SPARK-36953 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-34920 introduced error classes and states but they are not accessible in > PySpark. We should make both available in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36953) Expose SQL state and error class in PySpark exceptions
[ https://issues.apache.org/jira/browse/SPARK-36953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36953: Assignee: Apache Spark > Expose SQL state and error class in PySpark exceptions > -- > > Key: SPARK-36953 > URL: https://issues.apache.org/jira/browse/SPARK-36953 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > SPARK-34920 introduced error classes and states but they are not accessible in > PySpark. We should make both available in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36953) Expose SQL state and error class in PySpark exceptions
Hyukjin Kwon created SPARK-36953: Summary: Expose SQL state and error class in PySpark exceptions Key: SPARK-36953 URL: https://issues.apache.org/jira/browse/SPARK-36953 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon SPARK-34920 introduced error classes and states but they are not accessible in PySpark. We should make both available in PySpark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
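A sketch of what the exposed API could look like from the Python side. The getErrorClass/getSqlState accessors mirror the JVM-side SparkThrowable API and are what the linked PR proposes; the error class and SQLSTATE values shown are illustrative:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.utils import CapturedException

spark = (SparkSession.builder
         .config("spark.sql.ansi.enabled", "true")
         .getOrCreate())

try:
    # Under ANSI mode this fails with the DIVIDE_BY_ZERO error class,
    # one of the classes introduced by SPARK-34920.
    spark.sql("SELECT 1 / 0").collect()
except CapturedException as e:
    # Proposed accessors; assumed to return None for errors without a class.
    print(e.getErrorClass())  # e.g. 'DIVIDE_BY_ZERO'
    print(e.getSqlState())    # e.g. '22012'
{code}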
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425932#comment-17425932 ] Apache Spark commented on SPARK-35531: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34218 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Hongyi Zhang >Priority: Major > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36952) Inline type hints for python/pyspark/resource/information.py and python/pyspark/resource/profile.py
dgd_contributor created SPARK-36952: --- Summary: Inline type hints for python/pyspark/resource/information.py and python/pyspark/resource/profile.py Key: SPARK-36952 URL: https://issues.apache.org/jira/browse/SPARK-36952 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: dgd_contributor Inline type hints for python/pyspark/resource/information.py and python/pyspark/resource/profile.py -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
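For context, "inlining" here means moving the annotations out of the .pyi stub files into the .py sources so that mypy also checks the function bodies. A simplified before/after sketch (the real ResourceInformation wraps a JVM object, so this signature is illustrative):
{code:python}
from typing import List

# Stub-based (before): information.py is unannotated and information.pyi
# carries the signatures, so mypy skips the function bodies entirely.

# Inlined (after): the hints live in information.py itself and the bodies
# are type-checked too.
class ResourceInformation:
    """A resource such as a GPU, with a name and a list of addresses."""

    def __init__(self, name: str, addresses: List[str]) -> None:
        self._name = name
        self._addresses = addresses

    @property
    def name(self) -> str:
        return self._name

    @property
    def addresses(self) -> List[str]:
        return self._addresses
{code}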
[jira] [Updated] (SPARK-36903) oom exception occurred during code generation due to a large number of case when branches
[ https://issues.apache.org/jira/browse/SPARK-36903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JacobZheng updated SPARK-36903: --- Description: I have a Spark task that contains many CASE WHEN branches. When I run it, the driver throws an OOM exception in the codegen phase. What I would like to know is whether it is possible to detect or limit this in the codegen phase so the OOM can be avoided. I see that Spark 2.2 had a configuration item spark.sql.codegen.maxCaseBranches. Would it help my situation if I tried to add this limit back? This is the stack information I see via jstack {code:java} "SparkJobEngine-akka.actor.default-dispatcher-9" #23010 prio=5 os_prio=0 cpu=197487.25ms elapsed=7213.71s tid=0x7fb08c019800 nid=0x5fb9 runnable [0x7fb072af2000] java.lang.Thread.State: RUNNABLE at scala.collection.immutable.StringLike$$Lambda$1790/0x000840ee4840.apply(Unknown Source) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.immutable.StringLike.stripMargin(StringLike.scala:187) at scala.collection.immutable.StringLike.stripMargin$(StringLike.scala:185) at scala.collection.immutable.StringOps.stripMargin(StringOps.scala:33) at org.apache.spark.sql.catalyst.expressions.codegen.Block.toString(javaCode.scala:142) at org.apache.spark.sql.catalyst.expressions.codegen.Block.toString$(javaCode.scala:141) at org.apache.spark.sql.catalyst.expressions.codegen.CodeBlock.toString(javaCode.scala:286) at org.apache.spark.sql.catalyst.expressions.codegen.Block.length(javaCode.scala:149) at org.apache.spark.sql.catalyst.expressions.codegen.Block.length$(javaCode.scala:149) at org.apache.spark.sql.catalyst.expressions.codegen.CodeBlock.length(javaCode.scala:286) at org.apache.spark.sql.catalyst.expressions.Expression.reduceCodeSize(Expression.scala:160) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:147) at org.apache.spark.sql.catalyst.expressions.Expression$$Lambda$2784/0x00084131b840.apply(Unknown Source) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:141) at org.apache.spark.sql.catalyst.expressions.And.doGenCode(predicates.scala:567) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146) at org.apache.spark.sql.catalyst.expressions.Expression$$Lambda$2784/0x00084131b840.apply(Unknown Source) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:141) at org.apache.spark.sql.catalyst.expressions.CaseWhen.$anonfun$multiBranchesCodegen$1(conditionalExpressions.scala:209) at org.apache.spark.sql.catalyst.expressions.CaseWhen$$Lambda$4626/0x0008415b8840.apply(Unknown Source) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.TraversableLike$$Lambda$83/0x0008401bc040.apply(Unknown Source) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.catalyst.expressions.CaseWhen.multiBranchesCodegen(conditionalExpressions.scala:208) at org.apache.spark.sql.catalyst.expressions.CaseWhen.doGenCode(conditionalExpressions.scala:291) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146) at org.apache.spark.sql.catalyst.expressions.Expression$$Lambda$2784/0x00084131b840.apply(Unknown Source) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:141) at org.apache.spark.sql.catalyst.expressions.Concat.$anonfun$doGenCode$22(collectionOperations.scala:2120) at org.apache.spark.sql.catalyst.expressions.Concat$$Lambda$5022/0x000841a60840.apply(Unknown Source) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.TraversableLike$$Lambda$83/0x0008401bc040.apply(Unknown Source) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.cata
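For anyone trying to reproduce the blow-up, a sketch that builds a single CASE WHEN expression with a large number of branches (the branch count of 5000 is arbitrary; the width at which codegen exhausts driver memory depends on heap size):
{code:python}
from functools import reduce

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Each chained when() appends one branch to the same CaseWhen expression,
# which codegen then expands into generated Java code.
expr = reduce(
    lambda acc, i: acc.when(F.col("id") == i, str(i)),
    range(5000),
    F.when(F.col("id") < 0, "neg"),
).otherwise("other")

df.select(expr.alias("bucketed")).collect()
{code}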
[jira] [Commented] (SPARK-36839) Add daily build with Hadoop 2 profile in GitHub Actions build
[ https://issues.apache.org/jira/browse/SPARK-36839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425879#comment-17425879 ] Apache Spark commented on SPARK-36839: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34217 > Add daily build with Hadoop 2 profile in GitHub Actions build > - > > Key: SPARK-36839 > URL: https://issues.apache.org/jira/browse/SPARK-36839 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > We have faced problems such as SPARK-36820 due to missing build with Hadoop 2 > profile. We should at least add a daily build in GitHub Actions for that. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36950) Normalize semi-structured data into a flat table.
[ https://issues.apache.org/jira/browse/SPARK-36950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425873#comment-17425873 ] Hyukjin Kwon commented on SPARK-36950: -- Thanks [~bjornjorgensen] > Normalize semi-structured data into a flat table. > - > > Key: SPARK-36950 > URL: https://issues.apache.org/jira/browse/SPARK-36950 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Hi, in pandas there is this json_normalize that flattens nested data. > https://github.com/pandas-dev/pandas/blob/v1.3.3/pandas/io/json/_normalize.py#L112-L353 > > I have opened a request for this function at Koalas. Now more people would > benefit from having this function in PySpark. > https://github.com/databricks/koalas/issues/2162 > This is also a function that geopandas uses. In the meantime I have > found a gist with code that flattens out the whole dataframe. > https://gist.github.com/nmukerje/e65cde41be85470e4b8dfd9a2d6aed50 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36947) Exception when trying to access Row field using getAs method
[ https://issues.apache.org/jira/browse/SPARK-36947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36947: - Priority: Major (was: Blocker) > Exception when trying to access Row field using getAs method > > > Key: SPARK-36947 > URL: https://issues.apache.org/jira/browse/SPARK-36947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Spark 3.1.2 (but this also may affect other versions as > well) >Reporter: Alexandros Mavrommatis >Priority: Major > Labels: catalyst, row, sql > > I have an input dataframe *df* with the following schema: > {code:java} > |-- origin: string (nullable = true) > |-- product: struct (nullable = true) > ||-- id: integer (nullable = true){code} > > when I try to select the first 20 rows of the id column I execute: > {code:java} > df.select("product.id").show(20, false) > {code} > > and I manage to get the result. But when I execute the following: > {code:java} > df.map(_.getAs[Int]("product.id")).show(20, false){code} > > I get the following error: > {code:java} > java.lang.IllegalArgumentException: Field "product.id" does not exist.{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36950) Normalize semi-structured data into a flat table.
[ https://issues.apache.org/jira/browse/SPARK-36950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36950: - Issue Type: Improvement (was: Wish) > Normalize semi-structured data into a flat table. > - > > Key: SPARK-36950 > URL: https://issues.apache.org/jira/browse/SPARK-36950 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Hi, in pandas there is this json_normalize that flattens nested data. > https://github.com/pandas-dev/pandas/blob/v1.3.3/pandas/io/json/_normalize.py#L112-L353 > > I have opened a request for this function at Koalas. Now more people would > benefit from having this function in PySpark. > https://github.com/databricks/koalas/issues/2162 > This is also a function that geopandas uses. In the meantime I have > found a gist with code that flattens out the whole dataframe. > https://gist.github.com/nmukerje/e65cde41be85470e4b8dfd9a2d6aed50 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29871) Flaky test: ImageFileFormatTest.test_read_images
[ https://issues.apache.org/jira/browse/SPARK-29871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29871. -- Fix Version/s: 3.3.0 Assignee: Hyukjin Kwon Resolution: Fixed > Flaky test: ImageFileFormatTest.test_read_images > > > Key: SPARK-29871 > URL: https://issues.apache.org/jira/browse/SPARK-29871 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 3.0.0, 3.1.2, 3.2.0 >Reporter: wuyi >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > Running tests... > -- > test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ... ERROR > (12.050s) > == > ERROR [12.050s]: test_read_images > (pyspark.ml.tests.test_image.ImageFileFormatTest) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py", > line 35, in test_read_images > self.assertEqual(df.count(), 4) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py", > line 507, in count > return int(self._jdf.count()) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1286, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", > line 98, in deco > return f(*a, **kw) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling o32.count. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 > (TID 1, amp-jenkins-worker-05.amp, executor driver): > javax.imageio.IIOException: Unsupported Image Type > at > com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079) > at > com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050) > at javax.imageio.ImageIO.read(ImageIO.java:1448) > at javax.imageio.ImageIO.read(ImageIO.java:1352) > at org.apache.spark.ml.image.ImageSchema$.decode(ImageSchema.scala:134) > at > org.apache.spark.ml.source.image.ImageFileFormat.$anonfun$buildReader$2(ImageFileFormat.scala:84) > at > org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147) > at > org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(generated.java:33) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:63) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726) > at 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAG
[jira] [Commented] (SPARK-29871) Flaky test: ImageFileFormatTest.test_read_images
[ https://issues.apache.org/jira/browse/SPARK-29871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425869#comment-17425869 ] Hyukjin Kwon commented on SPARK-29871: -- Fixed in https://github.com/apache/spark/pull/34187 > Flaky test: ImageFileFormatTest.test_read_images > > > Key: SPARK-29871 > URL: https://issues.apache.org/jira/browse/SPARK-29871 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 3.0.0, 3.1.2, 3.2.0 >Reporter: wuyi >Priority: Major > > Running tests... > -- > test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ... ERROR > (12.050s) > == > ERROR [12.050s]: test_read_images > (pyspark.ml.tests.test_image.ImageFileFormatTest) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py", > line 35, in test_read_images > self.assertEqual(df.count(), 4) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py", > line 507, in count > return int(self._jdf.count()) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1286, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", > line 98, in deco > return f(*a, **kw) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling o32.count. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 > (TID 1, amp-jenkins-worker-05.amp, executor driver): > javax.imageio.IIOException: Unsupported Image Type > at > com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079) > at > com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050) > at javax.imageio.ImageIO.read(ImageIO.java:1448) > at javax.imageio.ImageIO.read(ImageIO.java:1352) > at org.apache.spark.ml.image.ImageSchema$.decode(ImageSchema.scala:134) > at > org.apache.spark.ml.source.image.ImageFileFormat.$anonfun$buildReader$2(ImageFileFormat.scala:84) > at > org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147) > at > org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(generated.java:33) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:63) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726) > at 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) > at org.apache.spark.scheduler.Task.run(Task.scala:127) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAG
[jira] [Created] (SPARK-36951) Inline type hints for python/pyspark/sql/column.py
Xinrong Meng created SPARK-36951: Summary: Inline type hints for python/pyspark/sql/column.py Key: SPARK-36951 URL: https://issues.apache.org/jira/browse/SPARK-36951 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng Inline type hints for python/pyspark/sql/column.py for type checking of function bodies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36951) Inline type hints for python/pyspark/sql/column.py
[ https://issues.apache.org/jira/browse/SPARK-36951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425794#comment-17425794 ] Xinrong Meng commented on SPARK-36951: -- I am working on this. > Inline type hints for python/pyspark/sql/column.py > -- > > Key: SPARK-36951 > URL: https://issues.apache.org/jira/browse/SPARK-36951 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Inline type hints for python/pyspark/sql/column.py for type checking of function > bodies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36950) Normalize semi-structured data into a flat table.
Bjørn Jørgensen created SPARK-36950: --- Summary: Normalize semi-structured data into a flat table. Key: SPARK-36950 URL: https://issues.apache.org/jira/browse/SPARK-36950 Project: Spark Issue Type: Wish Components: PySpark Affects Versions: 3.3.0 Reporter: Bjørn Jørgensen Hi, in pandas there is this json_normalize that flattens nested data. https://github.com/pandas-dev/pandas/blob/v1.3.3/pandas/io/json/_normalize.py#L112-L353 I have opened a request for this function at Koalas. Now more people would benefit from having this function in PySpark. https://github.com/databricks/koalas/issues/2162 This is also a function that geopandas uses. In the meantime I have found a gist with code that flattens out the whole dataframe. https://gist.github.com/nmukerje/e65cde41be85470e4b8dfd9a2d6aed50 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
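Until something like json_normalize exists in PySpark, the core flattening idea can be sketched as recursively aliasing nested struct fields (simplified relative to the gist above; it does not handle arrays, and the separator is illustrative):
{code:python}
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
from pyspark.sql.types import StructType


def flatten(df: DataFrame, sep: str = "_") -> DataFrame:
    """Flatten struct columns into top-level columns, one level per pass."""
    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            for nested in field.dataType.fields:
                cols.append(col(f"{field.name}.{nested.name}")
                            .alias(f"{field.name}{sep}{nested.name}"))
        else:
            cols.append(col(field.name))
    flat = df.select(cols)
    # Recurse while structs remain, since structs can nest several levels.
    if any(isinstance(f.dataType, StructType) for f in flat.schema.fields):
        return flatten(flat, sep)
    return flat
{code}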
[jira] [Commented] (SPARK-36936) spark-hadoop-cloud broken on release and only published via 3rd party repositories
[ https://issues.apache.org/jira/browse/SPARK-36936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425764#comment-17425764 ] Colin Williams commented on SPARK-36936: [~csun] looking at SPARK-35844 I see version 3.2.0 for the jar, but it does not appear to be published. 2021.10.07 12:39:03 INFO [warn] Note: Unresolved dependencies path: 2021.10.07 12:39:03 INFO [error] sbt.librarymanagement.ResolveException: Error downloading org.apache.spark:spark-hadoop-cloud_2.12:3.2.0 2021.10.07 12:39:03 INFO [error] Not found 2021.10.07 12:39:03 INFO [error] Not found 2021.10.07 12:39:03 INFO [error] not found: /home/colin/.ivy2/local/org.apache.spark/spark-hadoop-cloud_2.12/3.2.0/ivys/ivy.xml 2021.10.07 12:39:03 INFO [error] not found: https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.12/3.2.0/spark-hadoop-cloud_2.12-3.2.0.pom 2021.10.07 12:39:03 INFO [error] not found: https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-hadoop-cloud_2.12/3.2.0/spark-hadoop-cloud_2.12-3.2.0.pom > spark-hadoop-cloud broken on release and only published via 3rd party > repositories > -- > > Key: SPARK-36936 > URL: https://issues.apache.org/jira/browse/SPARK-36936 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.1, 3.1.2 > Environment: name:=spark-demo > version := "0.0.1" > scalaVersion := "2.12.12" > lazy val app = (project in file("app")).settings( > assemblyPackageScala / assembleArtifact := false, > assembly / assemblyJarName := "uber.jar", > assembly / mainClass := Some("com.example.Main"), > // more settings here ... > ) > resolvers += "Cloudera" at > "https://repository.cloudera.com/artifactory/cloudera-repos/" > libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2" % > "provided" > libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % > "3.1.1.3.1.7270.0-253" > libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % > "3.1.1.7.2.7.0-184" > libraryDependencies += "com.amazonaws" % "aws-java-sdk-bundle" % "1.11.901" > libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test" > // test suite settings > fork in Test := true > javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", > "-XX:+CMSClassUnloadingEnabled") > // Show runtime of tests > testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-oD") > ___ > > import org.apache.spark.sql.SparkSession > object SparkApp { > def main(args: Array[String]){ > val spark = SparkSession.builder().master("local") > //.config("spark.jars.repositories", > "https://repository.cloudera.com/artifactory/cloudera-repos/") > //.config("spark.jars.packages", > "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253") > .appName("spark session").getOrCreate > val jsonDF = spark.read.json("s3a://path-to-bucket/compact.json") > val csvDF = spark.read.format("csv").load("s3a://path-to-bucket/some.csv") > jsonDF.show() > csvDF.show() > } > } >Reporter: Colin Williams >Priority: Major > > The Spark documentation suggests using `spark-hadoop-cloud` to read / write > from S3 in [https://spark.apache.org/docs/latest/cloud-integration.html] . > However artifacts are currently published only via 3rd party resolvers in > [https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud] > including Cloudera and Palantir. > > So the Apache Spark documentation is pointing users to a 3rd party solution for object > stores, including S3. 
Furthermore, if you follow the instructions and include > one of the 3rd party jars, i.e. the Cloudera jar, with the Spark 3.1.2 release > and try to access an object store, the following exception is returned. > > ``` > Exception in thread "main" java.lang.NoSuchMethodError: 'void > com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, > java.lang.Object, java.lang.Object)' > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:894) > at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:870) > at > org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1605) > at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:363) > at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124) > at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361) > at > org.apache
[jira] [Created] (SPARK-36949) Fix CREATE TABLE AS SELECT of ANSI intervals
Max Gekk created SPARK-36949: Summary: Fix CREATE TABLE AS SELECT of ANSI intervals Key: SPARK-36949 URL: https://issues.apache.org/jira/browse/SPARK-36949 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk The given SQL should work: {code:sql} spark-sql> CREATE TABLE tbl1 STORED AS PARQUET AS SELECT INTERVAL '1-1' YEAR TO MONTH AS YM; 21/10/07 21:35:59 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval year to month' but 'interval year to month' is found {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36940) Inline type hints for python/pyspark/sql/avro/functions.py
[ https://issues.apache.org/jira/browse/SPARK-36940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-36940. --- Fix Version/s: 3.3.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 34200 https://github.com/apache/spark/pull/34200 > Inline type hints for python/pyspark/sql/avro/functions.py > -- > > Key: SPARK-36940 > URL: https://issues.apache.org/jira/browse/SPARK-36940 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.3.0 > > > Inline type hints for python/pyspark/sql/avro/functions.py. > > Currently, we use stub files for type annotations, which don't support type > checks within function bodies. So we inline type hints to support that. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36942) Inline type hints for python/pyspark/sql/readwriter.py
[ https://issues.apache.org/jira/browse/SPARK-36942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425733#comment-17425733 ] Apache Spark commented on SPARK-36942: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/34216 > Inline type hints for python/pyspark/sql/readwriter.py > -- > > Key: SPARK-36942 > URL: https://issues.apache.org/jira/browse/SPARK-36942 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Inline type hints for python/pyspark/sql/readwriter.py. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36942) Inline type hints for python/pyspark/sql/readwriter.py
[ https://issues.apache.org/jira/browse/SPARK-36942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425732#comment-17425732 ] Apache Spark commented on SPARK-36942: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/34216 > Inline type hints for python/pyspark/sql/readwriter.py > -- > > Key: SPARK-36942 > URL: https://issues.apache.org/jira/browse/SPARK-36942 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Inline type hints for python/pyspark/sql/readwriter.py. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36942) Inline type hints for python/pyspark/sql/readwriter.py
[ https://issues.apache.org/jira/browse/SPARK-36942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36942: Assignee: Apache Spark > Inline type hints for python/pyspark/sql/readwriter.py > -- > > Key: SPARK-36942 > URL: https://issues.apache.org/jira/browse/SPARK-36942 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Inline type hints for python/pyspark/sql/readwriter.py. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36942) Inline type hints for python/pyspark/sql/readwriter.py
[ https://issues.apache.org/jira/browse/SPARK-36942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36942: Assignee: (was: Apache Spark) > Inline type hints for python/pyspark/sql/readwriter.py > -- > > Key: SPARK-36942 > URL: https://issues.apache.org/jira/browse/SPARK-36942 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Inline type hints for python/pyspark/sql/readwriter.py. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36948) Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet
[ https://issues.apache.org/jira/browse/SPARK-36948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36948: Assignee: Max Gekk (was: Apache Spark) > Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet > -- > > Key: SPARK-36948 > URL: https://issues.apache.org/jira/browse/SPARK-36948 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Add a test which checks: > 1. Creating a table with ANSI interval columns > 2. INSERT INTO the table > 3. Read inserted values back -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36948) Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet
[ https://issues.apache.org/jira/browse/SPARK-36948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425727#comment-17425727 ] Apache Spark commented on SPARK-36948: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/34215 > Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet > -- > > Key: SPARK-36948 > URL: https://issues.apache.org/jira/browse/SPARK-36948 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Add a test which checks: > 1. Creating a table with ANSI interval columns > 2. INSERT INTO the table > 3. Read inserted values back -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36948) Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet
[ https://issues.apache.org/jira/browse/SPARK-36948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36948: Assignee: Apache Spark (was: Max Gekk) > Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet > -- > > Key: SPARK-36948 > URL: https://issues.apache.org/jira/browse/SPARK-36948 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Add a test which checks: > 1. Creating a table with ANSI interval columns > 2. INSERT INTO the table > 3. Read inserted values back -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36948) Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet
[ https://issues.apache.org/jira/browse/SPARK-36948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425726#comment-17425726 ] Apache Spark commented on SPARK-36948: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/34215 > Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet > -- > > Key: SPARK-36948 > URL: https://issues.apache.org/jira/browse/SPARK-36948 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Add a test which checks: > 1. Creating a table with ANSI interval columns > 2. INSERT INTO the table > 3. Read inserted values back -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36948) Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet
Max Gekk created SPARK-36948: Summary: Check CREATE TABLE with ANSI intervals using Hive external catalog and Parquet Key: SPARK-36948 URL: https://issues.apache.org/jira/browse/SPARK-36948 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Assignee: Max Gekk Add a test which checks: 1. Creating a table with ANSI interval columns 2. INSERT INTO the table 3. Read inserted values back -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
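A sketch of that checklist driven through SQL from Python (the table name is illustrative, and whether the CREATE currently succeeds against the Hive catalog is exactly what the test is meant to pin down, cf. SPARK-36949):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Create a table with an ANSI interval column in the Hive external catalog.
spark.sql("CREATE TABLE tbl_ym (ym INTERVAL YEAR TO MONTH) STORED AS PARQUET")
# 2. INSERT INTO the table.
spark.sql("INSERT INTO tbl_ym SELECT INTERVAL '1-1' YEAR TO MONTH")
# 3. Read the inserted values back.
spark.sql("SELECT ym FROM tbl_ym").show()
{code}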
[jira] [Updated] (SPARK-36947) Exception when trying to access Row field using getAs method
[ https://issues.apache.org/jira/browse/SPARK-36947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexandros Mavrommatis updated SPARK-36947: --- Description: I have an input dataframe *df* with the following schema: {code:java} |-- origin: string (nullable = true) |-- product: struct (nullable = true) ||-- id: integer (nullable = true){code} When I try to select the first 20 rows of the id column, I execute: {code:java} df.select("product.id").show(20, false) {code} and I manage to get the result. But when I execute the following: {code:java} df.map(_.getAs[Int]("product.id")).show(20, false){code} I get the following error: {code:java} java.lang.IllegalArgumentException: Field "product.id" does not exist.{code} was: I have an input dataframe *df* with the following schema: {code:java} |-- origin: string (nullable = true) |-- product: struct (nullable = true) ||-- id: integer (nullable = true){code} when I try to select the first 20 rows of the id column I execute: {code:java} df.select("product.id").show(20, false) {code} **and I manage to get the result. But when I execute the following: {code:java} df.map(_.getAs[Int]("product.id")).show(20, false){code} I get the following error: {code:java} java.lang.IllegalArgumentException: Field "product.id" does not exist.{code} > Exception when trying to access Row field using getAs method > > > Key: SPARK-36947 > URL: https://issues.apache.org/jira/browse/SPARK-36947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: Spark 3.1.2 (this may also affect other versions) >Reporter: Alexandros Mavrommatis >Priority: Blocker > Labels: catalyst, row, sql > > I have an input dataframe *df* with the following schema: > {code:java} > |-- origin: string (nullable = true) > |-- product: struct (nullable = true) > ||-- id: integer (nullable = true){code} > > When I try to select the first 20 rows of the id column, I execute: > {code:java} > df.select("product.id").show(20, false) > {code} > > and I manage to get the result. But when I execute the following: > {code:java} > df.map(_.getAs[Int]("product.id")).show(20, false){code} > > I get the following error: > {code:java} > java.lang.IllegalArgumentException: Field "product.id" does not exist.{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36947) Exception when trying to access Row field using getAs method
Alexandros Mavrommatis created SPARK-36947: -- Summary: Exception when trying to access Row field using getAs method Key: SPARK-36947 URL: https://issues.apache.org/jira/browse/SPARK-36947 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Environment: Spark 3.1.2 (this may also affect other versions) Reporter: Alexandros Mavrommatis I have an input dataframe *df* with the following schema: {code:java} |-- origin: string (nullable = true) |-- product: struct (nullable = true) ||-- id: integer (nullable = true){code} When I try to select the first 20 rows of the id column, I execute: {code:java} df.select("product.id").show(20, false) {code} and I manage to get the result. But when I execute the following: {code:java} df.map(_.getAs[Int]("product.id")).show(20, false){code} I get the following error: {code:java} java.lang.IllegalArgumentException: Field "product.id" does not exist.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
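The failure above comes down to Row.getAs resolving only exact top-level field names: "product.id" is treated as a literal column name, whereas select() resolves it as a nested path. Below is a hedged workaround sketch reusing df and the schema from the report; it assumes a stable `spark` identifier with spark.implicits._ in scope for the Int encoder, and ignores the nullable product struct for brevity.
{code:java}
// Workaround sketch for the report above (assumes the df schema shown).
// Row.getAs matches exact top-level field names, so fetch the struct first,
// then read the nested field from it. The nullable `product` case is
// ignored here for brevity.
import org.apache.spark.sql.Row
import spark.implicits._  // encoder for the resulting Dataset[Int]

df.map(_.getAs[Row]("product").getAs[Int]("id")).show(20, false)

// Equivalent without leaving the DataFrame API -- select() does resolve
// nested paths, so this works directly:
df.select("product.id").show(20, false)
{code}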
[jira] [Commented] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425570#comment-17425570 ] Apache Spark commented on SPARK-36900: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/34214 > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36900: Assignee: Apache Spark > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36900: Assignee: (was: Apache Spark) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-36900: - Priority: Minor (was: Major) > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Minor > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
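For reference, the aborted test exercises roughly this pattern: stream a little over 2 GiB through a ChunkedByteBufferOutputStream and assert that size reports a correct positive Long instead of overflowing. The sketch below is an assumption-laden illustration, not the suite's code: it presumes the constructor ChunkedByteBufferOutputStream(chunkSize: Int, allocator: Int => ByteBuffer), the class is Spark-internal, and the run needs several GB of heap, which is what the JDK 17 run exhausts.
{code:java}
// Hedged sketch of the test shape behind SPARK-36464 -- not the suite's code.
// ChunkedByteBufferOutputStream is Spark-internal (private[spark]), so this
// only compiles inside Spark's own tree, and it needs a multi-GB heap.
import java.nio.ByteBuffer
import org.apache.spark.util.io.ChunkedByteBufferOutputStream

val chunkSize = 1024 * 1024  // 1 MiB chunks
val out = new ChunkedByteBufferOutputStream(chunkSize, ByteBuffer.allocate)
val data = new Array[Byte](chunkSize)
for (_ <- 0 until 2049) out.write(data)  // ~2 GiB + 1 MiB, past Int.MaxValue
out.close()
assert(out.size > 0L)  // would go negative if size were tracked as an Int
{code}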
[jira] [Assigned] (SPARK-36798) When SparkContext is stopped, metrics system should be flushed after listeners have finished processing
[ https://issues.apache.org/jira/browse/SPARK-36798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-36798: --- Assignee: Harsh Panchal > When SparkContext is stopped, metrics system should be flushed after > listeners have finished processing > --- > > Key: SPARK-36798 > URL: https://issues.apache.org/jira/browse/SPARK-36798 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Harsh Panchal >Assignee: Harsh Panchal >Priority: Minor > > In the current implementation, when {{SparkContext.stop()}} is called, > {{metricsSystem.report()}} is called before {{listenerBus.stop()}}. In this > case, if a listener produces metrics at that point, they never reach the sink. > Background: > We have ingestion jobs in Spark Structured Streaming. To monitor them, we > collect metrics such as the number of input rows and trigger time from the > {{QueryProgressEvent}} received via {{StreamingQueryListener}}. These metrics > are then pushed to a database by custom sinks registered in {{MetricsSystem}}. We > noticed that these metrics are occasionally lost for the last batch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36798) When SparkContext is stopped, metrics system should be flushed after listeners have finished processing
[ https://issues.apache.org/jira/browse/SPARK-36798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-36798. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34039 [https://github.com/apache/spark/pull/34039] > When SparkContext is stopped, metrics system should be flushed after > listeners have finished processing > --- > > Key: SPARK-36798 > URL: https://issues.apache.org/jira/browse/SPARK-36798 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Harsh Panchal >Assignee: Harsh Panchal >Priority: Minor > Fix For: 3.3.0 > > > In the current implementation, when {{SparkContext.stop()}} is called, > {{metricsSystem.report()}} is called before {{listenerBus.stop()}}. In this > case, if a listener produces metrics at that point, they never reach the sink. > Background: > We have ingestion jobs in Spark Structured Streaming. To monitor them, we > collect metrics such as the number of input rows and trigger time from the > {{QueryProgressEvent}} received via {{StreamingQueryListener}}. These metrics > are then pushed to a database by custom sinks registered in {{MetricsSystem}}. We > noticed that these metrics are occasionally lost for the last batch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
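The fix described above amounts to reordering shutdown so that the listener bus drains before the metrics system takes its final snapshot. A hedged sketch of that ordering follows; it is not the actual SparkContext.stop() body (which does much more), and the member names are only the ones used informally in the description.
{code:java}
// Hedged sketch of the shutdown ordering described above; the real
// SparkContext.stop() does much more, and these member names are illustrative.
def stop(): Unit = {
  // Drain and stop the listener bus first, so listeners (e.g. a
  // StreamingQueryListener handling the final QueryProgressEvent) can still
  // record metrics for the last batch.
  listenerBus.stop()

  // Only then take the final metrics snapshot and shut the system down,
  // so the flush sees everything the listeners produced.
  metricsSystem.report()
  metricsSystem.stop()
}
{code}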
[jira] [Commented] (SPARK-36396) Implement DataFrame.cov
[ https://issues.apache.org/jira/browse/SPARK-36396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425510#comment-17425510 ] Apache Spark commented on SPARK-36396: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/34213 > Implement DataFrame.cov > --- > > Key: SPARK-36396 > URL: https://issues.apache.org/jira/browse/SPARK-36396 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36402) Implement Series.combine
[ https://issues.apache.org/jira/browse/SPARK-36402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425493#comment-17425493 ] Apache Spark commented on SPARK-36402: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/34212 > Implement Series.combine > > > Key: SPARK-36402 > URL: https://issues.apache.org/jira/browse/SPARK-36402 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36946) Support time for ps.to_datetime
[ https://issues.apache.org/jira/browse/SPARK-36946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425473#comment-17425473 ] Apache Spark commented on SPARK-36946: -- User 'dchvn' has created a pull request for this issue: https://github.com/apache/spark/pull/34211 > Support time for ps.to_datetime > --- > > Key: SPARK-36946 > URL: https://issues.apache.org/jira/browse/SPARK-36946 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36946) Support time for ps.to_datetime
[ https://issues.apache.org/jira/browse/SPARK-36946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36946: Assignee: Apache Spark > Support time for ps.to_datetime > --- > > Key: SPARK-36946 > URL: https://issues.apache.org/jira/browse/SPARK-36946 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36946) Support time for ps.to_datetime
[ https://issues.apache.org/jira/browse/SPARK-36946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36946: Assignee: (was: Apache Spark) > Support time for ps.to_datetime > --- > > Key: SPARK-36946 > URL: https://issues.apache.org/jira/browse/SPARK-36946 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36946) Support time for ps.to_datetime
dgd_contributor created SPARK-36946: --- Summary: Support time for ps.to_datetime Key: SPARK-36946 URL: https://issues.apache.org/jira/browse/SPARK-36946 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: dgd_contributor -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36707) Support to specify index type and name in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-36707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36707. -- Fix Version/s: 3.3.0 Assignee: Hyukjin Kwon Resolution: Done > Support to specify index type and name in pandas API on Spark > - > > Key: SPARK-36707 > URL: https://issues.apache.org/jira/browse/SPARK-36707 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > See https://koalas.readthedocs.io/en/latest/user_guide/typehints.html. > In pandas API on Spark, there is currently no way to specify the index type and > name in the output when you apply an arbitrary function, which forces creation > of the default index: > {code} > >>> def transform(pdf) -> pd.DataFrame["id": int, "A": int]: > ... pdf['A'] = pdf.id + 1 > ... return pdf > ... > >>> ps.range(5).koalas.apply_batch(transform) > {code} > {code} >id A > 0 0 1 > 1 1 2 > 2 2 3 > 3 3 4 > 4 4 5 > {code} > We should have a way to specify the index. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36713) Document new syntax for specifying index type
[ https://issues.apache.org/jira/browse/SPARK-36713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36713. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34210 [https://github.com/apache/spark/pull/34210] > Document new syntax for specifying index type > - > > Key: SPARK-36713 > URL: https://issues.apache.org/jira/browse/SPARK-36713 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36713) Document new syntax for specifying index type
[ https://issues.apache.org/jira/browse/SPARK-36713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36713: Assignee: Hyukjin Kwon > Document new syntax for specifying index type > - > > Key: SPARK-36713 > URL: https://issues.apache.org/jira/browse/SPARK-36713 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org