[jira] [Assigned] (SPARK-24886) Increase Jenkins build time
[ https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24886: Assignee: Apache Spark > Increase Jenkins build time > --- > > Key: SPARK-24886 > URL: https://issues.apache.org/jira/browse/SPARK-24886 > Project: Spark > Issue Type: Test > Components: Project Infra > Affects Versions: 2.4.0 > Reporter: Hyukjin Kwon > Assignee: Apache Spark > Priority: Major > > Currently, we seem to hit the time limit from time to time. It would be better to increase > the time a bit. > For instance, please see https://github.com/apache/spark/pull/21822 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24886) Increase Jenkins build time
[ https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552347#comment-16552347 ] Apache Spark commented on SPARK-24886: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21845
[jira] [Assigned] (SPARK-24886) Increase Jenkins build time
[ https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24886: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-24886) Increase Jenkins build time
Hyukjin Kwon created SPARK-24886: Summary: Increase Jenkins build time Key: SPARK-24886 URL: https://issues.apache.org/jira/browse/SPARK-24886 Project: Spark Issue Type: Test Components: Project Infra Affects Versions: 2.4.0 Reporter: Hyukjin Kwon Currently, we seem to hit the time limit from time to time. It would be better to increase the time a bit. For instance, please see https://github.com/apache/spark/pull/21822
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552257#comment-16552257 ] Takeshi Yamamuro commented on SPARK-18492: -- [~MDS Tang] It would be better to describe more about your environment, e.g., Spark version, a query to reproduce, ... > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) > Reporter: Norris Merritt > Priority: Major > Attachments: Screenshot from 2018-03-02 12-57-51.png > > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > The error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext), each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. 
For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... 
> Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData project_value252 = null;
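The failure mode quoted above (one giant init()/processNext() body) points at the standard remedy for the JVM's 64 KB per-method bytecode limit: chunk the generated statements into many small helper methods and call them in sequence, which is roughly the direction Spark's code generator later took with its code-splitting support. The sketch below is illustrative Python, not Spark's actual CodeGenerator; the chunk size and method names are invented for the example.

```python
def split_into_methods(statements, max_per_method=100):
    """Group generated Java statements into numbered helper methods so that
    no single method body grows past a fixed statement budget."""
    methods = []
    for i in range(0, len(statements), max_per_method):
        chunk = statements[i:i + max_per_method]
        name = f"init_{i // max_per_method}"
        body = "\n".join("  " + s for s in chunk)
        methods.append(f"private void {name}() {{\n{body}\n}}")
    return methods

# The 1,454 field groups would previously all land in one init(); chunked,
# they become 15 helpers of at most 100 statements each.
stmts = [f"this.project_udf{n} = (scala.Function1) references[{n}];"
         for n in range(1454)]
helpers = split_into_methods(stmts)
assert len(helpers) == 15
```

Each helper stays far below the 64 KB limit, and the real init() shrinks to a short run of helper calls.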
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552242#comment-16552242 ] Tang Yu Jie commented on SPARK-18492: - Here I also encountered this problem, have you resolved it?
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552233#comment-16552233 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves], what you mentioned above is also what we have been thinking about while trying to figure out a way to solve it (this problem also exists in barrier execution). From the user's point of view, specifying resources through the RDD is currently the only feasible way I can think of, though a resource is bound to a stage/task, not a particular RDD. This means users could specify resources for different RDDs in a single stage, while Spark can only use one resource setting within that stage. This brings out several problems, as you mentioned: *Specify resources on which RDD* For example, {{rddA.withResource.mapPartition \{ xxx \}.collect()}} is no different from {{rddA.mapPartition \{ xxx \}.withResource.collect}}, since all the RDDs are executed in the same stage. So in the current design, no matter whether the resource is specified on {{rddA}} or the mapped RDD, the result is the same. *One-to-one dependency RDDs with different resources* For example, {{rddA.withResource.mapPartition \{ xxx \}.withResource.collect()}}; here, assuming the resource requests for {{rddA}} and the mapped RDD are different, since they run in a single stage we have to resolve such a conflict. *Multiple-dependency RDDs with different resources* For example: {code} val rddA = rdd.withResources.mapPartitions() val rddB = rdd.withResources.mapPartitions() val rddC = rddA.join(rddB) {code} If the resources in {{rddA}} are different from those in {{rddB}}, then we should also resolve such conflicts. Previously I proposed using the largest resource requirement to satisfy all the needs, but that may also waste resources; [~mengxr] mentioned setting/merging resources per partition to avoid the waste. 
Meanwhile, if there were an API exposed to set resources at the stage level, this problem would not exist, but Spark doesn't expose such APIs to users; the only thing a user can specify is at the RDD level. I'm still thinking of a good way to fix it. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.4.0 > Reporter: Saisai Shao > Assignee: Saisai Shao > Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors where > accelerators are equipped. > Spark's current scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators. > # There are usually more CPU cores than accelerators on one node; using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. > # In one cluster, we always assume that a CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. > So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in a Google doc. It doesn't cover all the implementation > details, just highlights the parts that should be changed. > > CC [~yanboliang] [~merlintang]
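One of the conflict-resolution policies discussed above, taking the largest requirement when several RDDs in one stage request different resources, can be sketched in a few lines. This is illustrative Python, not a Spark API; the function name and dict shapes are invented for the example.

```python
def merge_stage_resources(requests):
    """Merge per-RDD resource requests into a single per-stage request,
    keeping the largest amount asked for each resource type."""
    merged = {}
    for req in requests:
        for resource, amount in req.items():
            merged[resource] = max(merged.get(resource, 0), amount)
    return merged

# rddA and rddB end up in the same stage but carry different requests:
rdd_a = {"gpu": 2, "cpu": 4}
rdd_b = {"gpu": 1, "fpga": 1}
assert merge_stage_resources([rdd_a, rdd_b]) == {"gpu": 2, "cpu": 4, "fpga": 1}
```

As noted in the comment, the downside of max-merging is waste: the stage holds 2 GPUs even while executing the part of the pipeline that asked for 1, which is why a per-partition merge was floated as an alternative.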
[jira] [Commented] (SPARK-24841) Memory leak in converting spark dataframe to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-24841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552231#comment-16552231 ] Kazuaki Ishizaki commented on SPARK-24841: -- Thank you for reporting an issue with heap profiling. Would it be possible to post a standalone program that can reproduce this problem? > Memory leak in converting spark dataframe to pandas dataframe > - > > Key: SPARK-24841 > URL: https://issues.apache.org/jira/browse/SPARK-24841 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Running PySpark in standalone mode >Reporter: Piyush Seth >Priority: Minor > > I am running a continuous running application using PySpark. In one of the > operations I have to convert PySpark data frame to Pandas data frame using > toPandas API on pyspark driver. After running for a while I am getting > "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. > I tried running this in a loop and could see that the heap memory is > increasing continuously. 
When I ran jmap for the first time I had the following top rows: > num #instances #bytes class name > ---------------------------------------------------------- > 1: 1757 411477568 [J > *2: 124188 266323152 [C* > 3: 167219 46821320 org.apache.spark.status.TaskDataWrapper > 4: 69683 27159536 [B > 5: 359278 8622672 java.lang.Long > 6: 221808 7097856 java.util.concurrent.ConcurrentHashMap$Node > 7: 283771 6810504 scala.collection.immutable.$colon$colon > After running several iterations I had the following: > num #instances #bytes class name > ---------------------------------------------------------- > *1: 110760 3439887928 [C* > 2: 698 411429088 [J > 3: 238096 6880 org.apache.spark.status.TaskDataWrapper > 4: 68819 24050520 [B > 5: 498308 11959392 java.lang.Long > 6: 292741 9367712 java.util.concurrent.ConcurrentHashMap$Node > 7: 282878 6789072 scala.collection.immutable.$colon$colon
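Independent of jmap, growth like this can be confirmed from the driver side by running the suspect operation in a loop and checking that traced allocations climb instead of plateauing. The sketch below uses Python's stdlib tracemalloc; work() is a hypothetical stand-in for the real df.toPandas() call (which needs a live SparkSession) and simply retains memory the way the accumulating char arrays above do.

```python
import tracemalloc

leaked = []  # stands in for driver-side state that is never released

def work():
    # placeholder for: pdf = df.toPandas()
    leaked.append(bytearray(1_000_000))  # retains ~1 MB per iteration

tracemalloc.start()
work()  # warm up so one-time allocations don't skew the baseline
baseline, _ = tracemalloc.get_traced_memory()
for _ in range(10):
    work()
current, _ = tracemalloc.get_traced_memory()
# a leak shows up as near-linear growth; a healthy loop would stay flat
assert current - baseline > 9_000_000
```

The same loop-and-measure pattern, with jmap histograms taken between iterations, is effectively what the report above already does by hand.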
[jira] [Resolved] (SPARK-24859) Predicates pushdown on outer joins
[ https://issues.apache.org/jira/browse/SPARK-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24859. -- Resolution: Cannot Reproduce Seems fixed in the master branch. Please let me know if you have a JIRA fixing this. Let me leave this resolved for now. > Predicates pushdown on outer joins > -- > > Key: SPARK-24859 > URL: https://issues.apache.org/jira/browse/SPARK-24859 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL > Affects Versions: 2.2.0 > Environment: Cloudera CDH 5.13.1 > Reporter: Johannes Mayer > Priority: Major > > I have two AVRO tables in Hive called FAct and DIm. Both are partitioned by a > common column called part_col. Now I want to join both tables on their id, but > only for some of the partitions. > If I use an inner join, everything works well: > > {code:java} > select * > from FA f > join DI d > on(f.id = d.id and f.part_col = d.part_col) > where f.part_col = 'xyz' > {code} > > In the SQL explain plan I can see that the predicate part_col = 'xyz' is > also used in the DIm HiveTableScan. > > When I execute the same query using a left join, the full dim table is > scanned. There are some workarounds for this issue, but I wanted to report > this as a bug, since it works on an inner join, and I think the behaviour > should be the same for an outer join. 
> Here is a self contained example (created in Zeppelin): > > {code:java} > val fact = Seq((1, 100), (2, 200), (3,100), (4,200)).toDF("id", "part_col") > val dim = Seq((1, 100), (2, 200)).toDF("id", "part_col") > fact.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/fact") > dim.repartition($"part_col").write.mode("overwrite").partitionBy("part_col").format("com.databricks.spark.avro").save("/tmp/jira/dim") > > spark.sqlContext.sql("create table if not exists fact(id int) partitioned by > (part_col int) stored as avro location '/tmp/jira/fact'") > spark.sqlContext.sql("msck repair table fact") spark.sqlContext.sql("create > table if not exists dim(id int) partitioned by (part_col int) stored as avro > location '/tmp/jira/dim'") > spark.sqlContext.sql("msck repair table dim"){code} > > > > *Inner join example:* > {code:java} > select * from fact f > join dim d > on (f.id = d.id > and f.part_col = d.part_col) > where f.part_col = 100{code} > Excerpt from Spark-SQL physical explain plan: > {code:java} > HiveTableScan [id#411, part_col#412], CatalogRelation `default`.`fact`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#411], [part_col#412], > [isnotnull(part_col#412), (part_col#412 = 100)] > HiveTableScan [id#413, part_col#414], CatalogRelation `default`.`dim`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#413], [part_col#414], > [isnotnull(part_col#414), (part_col#414 = 100)]{code} > > *Outer join example:* > {code:java} > select * from fact f > left join dim d > on (f.id = d.id > and f.part_col = d.part_col) > where f.part_col = 100{code} > > Excerpt from Spark-SQL physical explain plan: > > {code:java} > HiveTableScan [id#426, part_col#427], CatalogRelation `default`.`fact`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#426], [part_col#427], > [isnotnull(part_col#427), (part_col#427 = 100)] > HiveTableScan [id#428, part_col#429], CatalogRelation `default`.`dim`, > 
org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#428], [part_col#429] {code} > > > As you can see, the predicate is not pushed down to the HiveTableScan of the > dim table on the outer join. >
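The safety of the missing pushdown can be sanity-checked without Spark: in a left join, a WHERE predicate on a join key of the preserved (fact) side makes it safe to filter the dim side by the same value before joining, since no surviving fact row can match a dim row with a different part_col. A minimal pure-Python model of the example above (the join implementation and row shapes are illustrative):

```python
def left_join(fact, dim):
    """Naive left join on (id, part_col); unmatched fact rows pair with None."""
    index = {(d["id"], d["part_col"]): d for d in dim}
    return [(f, index.get((f["id"], f["part_col"]))) for f in fact]

fact = [{"id": i, "part_col": p} for i, p in [(1, 100), (2, 200), (3, 100), (4, 200)]]
dim = [{"id": i, "part_col": p} for i, p in [(1, 100), (2, 200)]]

# the query: left join, then WHERE f.part_col = 100
baseline = [r for r in left_join(fact, dim) if r[0]["part_col"] == 100]
# pushed down: scan only the part_col = 100 partitions of BOTH tables
pushed = left_join([f for f in fact if f["part_col"] == 100],
                   [d for d in dim if d["part_col"] == 100])
assert baseline == pushed  # identical results, far less dim data scanned
```

This is exactly the manual workaround hinted at in the report: filter (or sub-select) the dim side on the partition value yourself until the optimizer does it for you.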
[jira] [Resolved] (SPARK-24847) ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type alias
[ https://issues.apache.org/jira/browse/SPARK-24847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24847. -- Resolution: Cannot Reproduce > ScalaReflection#schemaFor occasionally fails to detect schema for Seq of type > alias > --- > > Key: SPARK-24847 > URL: https://issues.apache.org/jira/browse/SPARK-24847 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ahmed Mahran >Priority: Major > > org.apache.spark.sql.catalyst.ScalaReflection#schemaFor occasionally fails to > detect schema for Seq of type alias (and it occasionally succeeds). > > {code:java} > object Types { > type Alias1 = Long > type Alias2 = Int > type Alias3 = Int > } > case class B(b1: Alias1, b2: Seq[Alias2], b3: Option[Alias3]) > case class A(a1: B, a2: Int) > {code} > > {code} > import sparkSession.implicits._ > val seq = Seq( > A(B(2L, Seq(3), Some(1)), 1), > A(B(3L, Seq(2), Some(2)), 2) > ) > val ds = sparkSession.createDataset(seq) > {code} > > {code:java} > java.lang.UnsupportedOperationException: Schema for type Seq[Types.Alias2] is > not supported at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:780) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:715) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:714) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:381) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:391) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150) > at > org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:138) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72) > at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) at > 
org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:248) > at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:34) > {code}
[jira] [Resolved] (SPARK-24853) Support Column type for withColumn and withColumnRenamed apis
[ https://issues.apache.org/jira/browse/SPARK-24853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24853. -- Resolution: Won't Fix > Support Column type for withColumn and withColumnRenamed apis > - > > Key: SPARK-24853 > URL: https://issues.apache.org/jira/browse/SPARK-24853 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.2.2 > Reporter: nirav patel > Priority: Major > > Can we add an overloaded version of withColumn or withColumnRenamed that accepts > a Column type instead of a String? That way I can specify the fully qualified name > in cases where there are duplicate column names, e.g., if I have 2 columns with the > same name as a result of a join and I want to rename one of the fields, I can do it > with this new API. > > This would be similar to the drop API, which supports both String and Column types. > > def > withColumn(colName: Column, col: Column): DataFrame > Returns a new Dataset by adding a column or replacing the existing column > that has the same name. > > def > withColumnRenamed(existingName: Column, newName: Column): DataFrame > Returns a new Dataset with a column renamed. >
[jira] [Updated] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24884: - Component/s: (was: Spark Core) SQL > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.1 > Reporter: Nick Nicolini > Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? >
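For readers hitting the same need before a native function lands, the requested semantics (every match of a capture group, as in Presto's regexp_extract_all) are what Python's re.findall already provides, which is also what a PySpark UDF for this would wrap. A small standalone illustration using the text block from the description:

```python
import re

# the sample block from the report, flattened; values sit between "MSG:" and "|"
text = "AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|"

# regexp_extract returns only one occurrence; findall returns every capture
msgs = re.findall(r"MSG:([^|]*)\|", text)
assert msgs == ["ASDF", "QWER", "ZXCV"]
```

Wrapping this in a UDF is the interim workaround the reporter mentions; the ask here is simply to get the same behavior as a built-in SQL function.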
[jira] [Resolved] (SPARK-24885) Initialize random seeds for Rand and Randn expression during analysis
[ https://issues.apache.org/jira/browse/SPARK-24885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-24885. - Resolution: Won't Fix > Initialize random seeds for Rand and Randn expression during analysis > - > > Key: SPARK-24885 > URL: https://issues.apache.org/jira/browse/SPARK-24885 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.0 > Reporter: Liang-Chi Hsieh > Priority: Major > > Random expressions such as Rand and Randn should have the same behavior as > Uuid, in that their random seeds should be initialized at analysis.
[jira] [Updated] (SPARK-24885) Initialize random seeds for Rand and Randn expression during analysis
[ https://issues.apache.org/jira/browse/SPARK-24885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-24885: Description: Random expressions such as Rand and Randn should have the same behavior as Uuid, in that their random seeds should be initialized at analysis. was: Random expressions such as Rand and Randn should have the same behavior as Uuid that > Initialize random seeds for Rand and Randn expression during analysis > - > > Key: SPARK-24885 > URL: https://issues.apache.org/jira/browse/SPARK-24885 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.0 > Reporter: Liang-Chi Hsieh > Priority: Major > > Random expressions such as Rand and Randn should have the same behavior as > Uuid, in that their random seeds should be initialized at analysis.
[jira] [Updated] (SPARK-24885) Initialize random seeds for Rand and Randn expression during analysis
[ https://issues.apache.org/jira/browse/SPARK-24885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-24885: Summary: Initialize random seeds for Rand and Randn expression during analysis (was: Rand and Randn expression should produce same result at DataFrame on retries) > Initialize random seeds for Rand and Randn expression during analysis > - > > Key: SPARK-24885 > URL: https://issues.apache.org/jira/browse/SPARK-24885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Random expressions such as Rand and Randn should have the same behavior as > Uuid that -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24885) Rand and Randn expression should produce same result at DataFrame on retries
[ https://issues.apache.org/jira/browse/SPARK-24885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-24885: Description: Random expressions such as Rand and Randn should have the same behavior as Uuid that was: Random expressions such as Rand and Randn should have the same behavior as Uuid, which produces the same results for the same DataFrame on retries. > Rand and Randn expression should produce same result at DataFrame on retries > > > Key: SPARK-24885 > URL: https://issues.apache.org/jira/browse/SPARK-24885 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > Random expressions such as Rand and Randn should have the same behavior as > Uuid that -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24885) Rand and Randn expression should produce same result at DataFrame on retries
Liang-Chi Hsieh created SPARK-24885: --- Summary: Rand and Randn expression should produce same result at DataFrame on retries Key: SPARK-24885 URL: https://issues.apache.org/jira/browse/SPARK-24885 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Liang-Chi Hsieh Random expressions such as Rand and Randn should have the same behavior as Uuid, which produces the same results for the same DataFrame on retries. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
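The behavior the ticket asks for can be sketched outside Spark: an expression carries an optional seed, the analyzer assigns a concrete seed once and freezes it in the plan, and every (re)execution replays the same sequence. All class and method names below are illustrative stand-ins, not Spark internals.

```python
import random

class Rand:
    """Illustrative stand-in for Spark's Rand expression (not Spark code)."""

    def __init__(self, seed=None):
        self.seed = seed  # None until analysis assigns one

    def initialize_seed(self):
        # Analysis-time step: pick the seed once and freeze it in the plan.
        if self.seed is None:
            self.seed = random.randrange(2**31)
        return self

    def evaluate(self, num_rows):
        # Execution-time step: every (re)execution restarts from the frozen
        # seed, so a retried task over the same plan sees the same values.
        rng = random.Random(self.seed)
        return [rng.random() for _ in range(num_rows)]

expr = Rand().initialize_seed()
first = expr.evaluate(3)
second = expr.evaluate(3)  # simulated retry of the same plan
assert first == second
```

If the seed were instead drawn at execution time, the retried evaluation would diverge from the first, which is exactly the inconsistency the ticket describes.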
[jira] [Resolved] (SPARK-22228) Add support for Array so from_json can parse
[ https://issues.apache.org/jira/browse/SPARK-22228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22228. -- Resolution: Duplicate > Add support for Array so from_json can parse > > > Key: SPARK-22228 > URL: https://issues.apache.org/jira/browse/SPARK-22228 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.2.0 >Reporter: kant kodali >Priority: Major > > {code:java} > val inputDS = Seq("""["foo", "bar"]""").toDF > {code} > {code:java} > inputDS.printSchema() > root > |-- value: string (nullable = true) > {code} > Input Dataset inputDS > {code:java} > inputDS.show(false) > value > - > ["foo", "bar"] > {code} > Expected output dataset outputDS > {code:java} > value > --- > "foo" | > "bar" | > {code} > Tried explode function like below but it doesn't quite work > {code:java} > inputDS.select(explode(from_json(col("value"), ArrayType(StringType > {code} > and got the following error > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve > 'jsontostructs(`value`)' due to data type mismatch: Input schema string must > be a struct or an array of structs > {code} > Also tried the following > {code:java} > inputDS.select(explode(col("value"))) > {code} > And got the following error > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`value`)' due > to data type mismatch: input to function explode should be array or map type, > not StringType > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
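What the reporter wants — parse a JSON array out of a single string column, then flatten it to one row per element — can be expressed as a small pure-Python analogue. The helper names (`from_json_array`, `explode_rows`) are hypothetical, chosen only to mirror the Spark functions being discussed.

```python
import json

def from_json_array(value, element_cast=str):
    """Pure-Python analogue of from_json(col, ArrayType(StringType())):
    parse one string cell into a list of typed elements."""
    parsed = json.loads(value)
    if not isinstance(parsed, list):
        raise ValueError("schema mismatch: expected a JSON array")
    return [element_cast(x) for x in parsed]

def explode_rows(rows):
    """Analogue of explode(): emit one output row per array element."""
    return [elem for arr in rows for elem in arr]

input_rows = ['["foo", "bar"]']
output = explode_rows([from_json_array(v) for v in input_rows])
assert output == ["foo", "bar"]
```

This is the two-step shape the reporter attempted (`from_json` to get an array, `explode` to flatten it); the ticket was resolved as a duplicate of the work that adds this capability to `from_json` itself.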
[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552155#comment-16552155 ] Nick Nicolini commented on SPARK-16203: --- Cool, added ticket here: https://issues.apache.org/jira/browse/SPARK-24884 I think the above is the same feature that [~mmoroz] was asking for, so IMO we should close this ticket in favor of the newer one. > regexp_extract to return an ArrayType(StringType()) > --- > > Key: SPARK-16203 > URL: https://issues.apache.org/jira/browse/SPARK-16203 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Max Moroz >Priority: Minor > > regexp_extract only returns a single matched group. If (as is often the case > - e.g., web log parsing) we need to parse the entire line and get all the > groups, we'll need to call it as many times as there are groups. > It's only a minor annoyance syntactically. > But unless I misunderstand something, it would be very inefficient. (How > would Spark know not to do multiple pattern matching operations, when only > one is needed? Or does the optimizer actually check whether the patterns are > identical, and if they are, avoid the repeated regex matching operations??) > Would it be possible to have it return an array when the index is not > specified (defaulting to None)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Nicolini updated SPARK-24884: -- Description: I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing {{regexp_extract}} method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like {{regexp_extract_all}} as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? was: I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing _regexp_extract_ method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like _regexp_extract_all_ as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? 
{noformat} *no* further _formatting_ is done here{noformat} > Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > {{regexp_extract}} method since the number of occurrences is always > arbitrary, and while I can write a UDF to handle this it'd be great if this > was supported natively in Spark. > Perhaps we can implement something like {{regexp_extract_all}} as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24884) Implement regexp_extract_all
Nick Nicolini created SPARK-24884: - Summary: Implement regexp_extract_all Key: SPARK-24884 URL: https://issues.apache.org/jira/browse/SPARK-24884 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Nick Nicolini I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing _regexp_extract_ method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like _regexp_extract_all_ as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24884) Implement regexp_extract_all
[ https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Nicolini updated SPARK-24884: -- Description: I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing _regexp_extract_ method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like _regexp_extract_all_ as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? {noformat} *no* further _formatting_ is done here{noformat} was: I've recently hit many cases of regexp parsing where we need to match on something that is always arbitrary in length; for example, a text block that looks something like: {code:java} AAA:WORDS| BBB:TEXT| MSG:ASDF| MSG:QWER| ... MSG:ZXCV|{code} Where I need to pull out all values between "MSG:" and "|", which can occur in each instance between 1 and n times. I cannot reliably use the existing _regexp_extract_ method since the number of occurrences is always arbitrary, and while I can write a UDF to handle this it'd be great if this was supported natively in Spark. Perhaps we can implement something like _regexp_extract_all_ as [Presto|https://prestodb.io/docs/current/functions/regexp.html] and [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] have? 
> Implement regexp_extract_all > > > Key: SPARK-24884 > URL: https://issues.apache.org/jira/browse/SPARK-24884 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Nick Nicolini >Priority: Major > > I've recently hit many cases of regexp parsing where we need to match on > something that is always arbitrary in length; for example, a text block that > looks something like: > {code:java} > AAA:WORDS| > BBB:TEXT| > MSG:ASDF| > MSG:QWER| > ... > MSG:ZXCV|{code} > Where I need to pull out all values between "MSG:" and "|", which can occur > in each instance between 1 and n times. I cannot reliably use the existing > _regexp_extract_ method since the number of occurrences is always arbitrary, > and while I can write a UDF to handle this it'd be great if this was > supported natively in Spark. > Perhaps we can implement something like _regexp_extract_all_ as > [Presto|https://prestodb.io/docs/current/functions/regexp.html] and > [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html] > have? > > {noformat} > *no* further _formatting_ is done here{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
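The proposed semantics — return every match of a capture group rather than just the first, as Presto's and Pig's `REGEX_EXTRACT_ALL` do — can be sketched in a few lines of pure Python. This is only an illustration of the requested behavior, not Spark's eventual implementation.

```python
import re

def regexp_extract_all(value, pattern, idx=1):
    """Sketch of the proposed regexp_extract_all: return capture group `idx`
    from every match, where regexp_extract would return only the first."""
    return [m.group(idx) for m in re.finditer(pattern, value)]

# The reporter's text block: "MSG:" values occur an arbitrary number of times.
text = "AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|"
assert regexp_extract_all(text, r"MSG:([^|]*)\|") == ["ASDF", "QWER", "ZXCV"]
```

A single `regexp_extract` call cannot express this, because the number of occurrences is unknown ahead of time, which is why the reporter currently falls back to a UDF.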
[jira] [Updated] (SPARK-16483) Unifying struct fields and columns
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simeon Simeonov updated SPARK-16483: Description: This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. DataFrame provides a full set of manipulation operations for top-level columns. They have be added, removed, modified and renamed. The same is not true about fields inside structs yet, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs. Common use-cases include: - Remove and/or rename struct field(s) to adjust the schema - Fix a data quality issue with a struct field (update/rewrite) To do this with the existing API by hand requires manually calling {{named_struct}} and listing all fields, including ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution. It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct. Purely for discussion purposes (overloaded methods are not shown): {code:java} class Column(val expr: Expression) extends Logging { // ... // matches Dataset.schema semantics def schema: StructType // matches Dataset.select() semantics // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls def select(cols: Column*): Column // matches Dataset.withColumn() semantics of add or replace def withColumn(colName: String, col: Column): Column // matches Dataset.drop() semantics def drop(colName: String): Column } class Dataset[T] ... { // ... 
// Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) def cast(newShema: StructType): DataFrame } {code} The benefit of the above API is that it unifies manipulating top-level & nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That's also the reason to add {{cast}} to {{Dataset}}: it improves consistency and readability (with method chaining). Another way to think of {{Dataset.cast}} is as the Spark schema equivalent of {{Dataset.as}}. {{as}} is to {{cast}} as a Scala encodable type is to a {{StructType}} instance. was: This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. DataFrame provides a full set of manipulation operations for top-level columns. They have be added, removed, modified and renamed. The same is not true about fields inside structs yet, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs. Common use-cases include: - Remove and/or rename struct field(s) to adjust the schema - Fix a data quality issue with a struct field (update/rewrite) To do this with the existing API by hand requires manually calling {{named_struct}} and listing all fields, including ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution. It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct. Purely for discussion purposes (overloaded methods are not shown): {code:java} class Column(val expr: Expression) extends Logging { // ... 
// matches Dataset.schema semantics def schema: StructType // matches Dataset.select() semantics // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls def select(cols: Column*): Column // matches Dataset.withColumn() semantics of add or replace def withColumn(colName: String, col: Column): Column // matches Dataset.drop() semantics def drop(colName: String): Column } class Dataset[T] ... { // ... // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) def cast(newSchema: StructType): DataFrame } {code} The benefit of the above API is that it unifies manipulating top-level & nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That's also the reason to add {{cast}} to {{Data
[jira] [Commented] (SPARK-24768) Have a built-in AVRO data source implementation
[ https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552125#comment-16552125 ] Apache Spark commented on SPARK-24768: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21841 > Have a built-in AVRO data source implementation > --- > > Key: SPARK-24768 > URL: https://issues.apache.org/jira/browse/SPARK-24768 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: Built-in AVRO Data Source In Spark 2.4.pdf > > > Apache Avro (https://avro.apache.org) is a popular data serialization format. > It is widely used in the Spark and Hadoop ecosystem, especially for > Kafka-based data pipelines. Using the external package > [https://github.com/databricks/spark-avro], Spark SQL can read and write the > avro data. Making spark-Avro built-in can provide a better experience for > first-time users of Spark SQL and structured streaming. We expect the > built-in Avro data source can further improve the adoption of structured > streaming. The proposal is to inline code from spark-avro package > ([https://github.com/databricks/spark-avro]). The target release is Spark > 2.4. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24883) Remove implicit class AvroDataFrameWriter/AvroDataFrameReader
Gengliang Wang created SPARK-24883: -- Summary: Remove implicit class AvroDataFrameWriter/AvroDataFrameReader Key: SPARK-24883 URL: https://issues.apache.org/jira/browse/SPARK-24883 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang As per Reynold's comment: [https://github.com/apache/spark/pull/21742#discussion_r203496489] It makes sense to remove the implicit class AvroDataFrameWriter/AvroDataFrameReader, since the Avro package is an external module. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data
[ https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552103#comment-16552103 ] Xiao Li commented on SPARK-24869: - [~maropu] Just updated the test case in the JIRA. Forgot to cache it when I copy and paste the example. I think the fix in your branch is not in a right direction. Do you know why the cached data is not used? > SaveIntoDataSourceCommand's input Dataset does not use Cached Data > -- > > Key: SPARK-24869 > URL: https://issues.apache.org/jira/browse/SPARK-24869 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > {code} > withTable("t") { > withTempPath { path => > var numTotalCachedHit = 0 > val listener = new QueryExecutionListener { > override def onFailure(f: String, qe: QueryExecution, e: > Exception):Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, > duration: Long): Unit = { > qe.withCachedData match { > case c: SaveIntoDataSourceCommand > if c.query.isInstanceOf[InMemoryRelation] => > numTotalCachedHit += 1 > case _ => > println(qe.withCachedData) > } > } > } > spark.listenerManager.register(listener) > val udf1 = udf({ (x: Int, y: Int) => x + y }) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1(col("a"), lit(10))) > df.cache() > df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", > properties) > assert(numTotalCachedHit == 1, "expected to be cached in jdbc") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data
[ https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24869: Description: {code} withTable("t") { withTempPath { path => var numTotalCachedHit = 0 val listener = new QueryExecutionListener { override def onFailure(f: String, qe: QueryExecution, e: Exception):Unit = {} override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = { qe.withCachedData match { case c: SaveIntoDataSourceCommand if c.query.isInstanceOf[InMemoryRelation] => numTotalCachedHit += 1 case _ => println(qe.withCachedData) } } } spark.listenerManager.register(listener) val udf1 = udf({ (x: Int, y: Int) => x + y }) val df = spark.range(0, 3).toDF("a") .withColumn("b", udf1(col("a"), lit(10))) df.cache() df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", properties) assert(numTotalCachedHit == 1, "expected to be cached in jdbc") } } {code} was: {code} withTable("t") { withTempPath { path => var numTotalCachedHit = 0 val listener = new QueryExecutionListener { override def onFailure(f: String, qe: QueryExecution, e: Exception):Unit = {} override def onSuccess(funcName: String, qe: QueryExecution, duration: Long): Unit = { qe.withCachedData match { case c: SaveIntoDataSourceCommand if c.query.isInstanceOf[InMemoryRelation] => numTotalCachedHit += 1 case _ => println(qe.withCachedData) } } } spark.listenerManager.register(listener) val udf1 = udf({ (x: Int, y: Int) => x + y }) val df = spark.range(0, 3).toDF("a") .withColumn("b", udf1(col("a"), lit(10))) df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", properties) assert(numTotalCachedHit == 1, "expected to be cached in jdbc") } } {code} > SaveIntoDataSourceCommand's input Dataset does not use Cached Data > -- > > Key: SPARK-24869 > URL: https://issues.apache.org/jira/browse/SPARK-24869 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > {code} > 
withTable("t") { > withTempPath { path => > var numTotalCachedHit = 0 > val listener = new QueryExecutionListener { > override def onFailure(f: String, qe: QueryExecution, e: > Exception):Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, > duration: Long): Unit = { > qe.withCachedData match { > case c: SaveIntoDataSourceCommand > if c.query.isInstanceOf[InMemoryRelation] => > numTotalCachedHit += 1 > case _ => > println(qe.withCachedData) > } > } > } > spark.listenerManager.register(listener) > val udf1 = udf({ (x: Int, y: Int) => x + y }) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1(col("a"), lit(10))) > df.cache() > df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", > properties) > assert(numTotalCachedHit == 1, "expected to be cached in jdbc") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552087#comment-16552087 ] James commented on SPARK-23206: --- Hi [~elu] If I want to know the CPU metrics of executor level, what kind of API could I use? Recently I am doing a project which needs the CPU metrics. Thx > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
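The wastage figure in the description follows directly from the stated formula: number of executors * (spark.executor.memory - max JVM used memory). A small sketch with hypothetical inputs shows the arithmetic; the real inputs come from cluster telemetry and are not part of this ticket.

```python
def unused_memory_gb(num_executors, executor_memory_gb, peak_jvm_used_gb):
    """Per-application unused memory, per the formula in the description:
    number of executors * (spark.executor.memory - max JVM used memory)."""
    return num_executors * (executor_memory_gb - peak_jvm_used_gb)

# Hypothetical application: 100 executors of 9 GB each, peaking at the
# ~35% average utilization the description reports for one cluster.
waste = unused_memory_gb(100, 9.0, 9.0 * 0.35)
assert round(waste, 1) == 585.0  # same order as the 591 GB average cited
```

The proposed peak-per-stage-per-executor metrics would let users plug their own measured peaks into this calculation instead of guessing.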
[jira] [Updated] (SPARK-16483) Unifying struct fields and columns
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simeon Simeonov updated SPARK-16483: Affects Version/s: 2.3.1 Description: This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. DataFrame provides a full set of manipulation operations for top-level columns. They have be added, removed, modified and renamed. The same is not true about fields inside structs yet, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs. Common use-cases include: - Remove and/or rename struct field(s) to adjust the schema - Fix a data quality issue with a struct field (update/rewrite) To do this with the existing API by hand requires manually calling {{named_struct}} and listing all fields, including ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution. It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct. Purely for discussion purposes (overloaded methods are not shown): {code:java} class Column(val expr: Expression) extends Logging { // ... // matches Dataset.schema semantics def schema: StructType // matches Dataset.select() semantics // '* support allows multiple new fields to be added easily, saving cumbersome repeated withColumn() calls def select(cols: Column*): Column // matches Dataset.withColumn() semantics of add or replace def withColumn(colName: String, col: Column): Column // matches Dataset.drop() semantics def drop(colName: String): Column } class Dataset[T] ... { // ... 
// Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) def cast(newShema: StructType): DataFrame } {code} The benefit of the above API is that it unifies manipulating top-level & nested columns. The addition of {{schema}} and {{select()}} to {{Column}} allows for nested field reordering, casting, etc., which is important in data exchange scenarios where field position matters. That's also the reason to add {{cast}} to {{Dataset}}: it improves consistency and readability (with method chaining). was: This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev list channels. DataFrame provides a full set of manipulation operations for top-level columns. They have be added, removed, modified and renamed. The same is not true about fields inside structs yet, from a logical standpoint, Spark users may very well want to perform the same operations on struct fields, especially since automatic schema discovery from JSON input tends to create deeply nested structs. Common use-cases include: - Remove and/or rename struct field(s) to adjust the schema - Fix a data quality issue with a struct field (update/rewrite) To do this with the existing API by hand requires manually calling {{named_struct}} and listing all fields, including ones we don't want to manipulate. This leads to complex, fragile code that cannot survive schema evolution. It would be far better if the various APIs that can now manipulate top-level columns were extended to handle struct fields at arbitrary locations or, alternatively, if we introduced new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested inside a struct. Purely for discussion purposes, here is the skeleton implementation of an update() implicit that we've use to modify any existing field in a dataframe. (Note that it depends on various other utilities and implicits that are not included). 
https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66 > Unifying struct fields and columns > -- > > Key: SPARK-16483 > URL: https://issues.apache.org/jira/browse/SPARK-16483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Simeon Simeonov >Priority: Major > Labels: sql > > This issue comes as a result of an exchange with Michael Armbrust outside of > the usual JIRA/dev list channels. > DataFrame provides a full set of manipulation operations for top-level > columns. They have be added, removed, modified and renamed. The same is not > true about fields inside structs yet, from a logical standpoint, Spark users > may very well want to perform the same operations on struct fields, > especially since automatic schema discovery from JSON input tends to create > d
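The fragility of the hand-rolled {{named_struct}} approach, and the generic field update the proposed API would provide instead, can be sketched outside Spark with plain nested records. This is a minimal illustrative analogue in Python, not Spark code; the with_field helper and the sample record are hypothetical:

```python
def with_field(row, path, value):
    # Return a copy of a nested record with the field at dotted `path`
    # replaced, leaving every sibling field untouched. This is the generic
    # operation the proposal asks for: callers never enumerate the fields
    # they do not want to change, so the code survives schema evolution.
    head, *rest = path.split(".")
    updated = dict(row)  # shallow copy; untouched siblings carry over
    updated[head] = with_field(row[head], ".".join(rest), value) if rest else value
    return updated

event = {"user": {"id": 1, "name": "ann"}, "ts": 100}
fixed = with_field(event, "user.name", "anne")
assert fixed == {"user": {"id": 1, "name": "anne"}, "ts": 100}
assert event["user"]["name"] == "ann"  # the original record is unchanged
```

A named_struct-style rewrite of the same record would instead have to list id, name, and ts explicitly, and would silently drop any field added to the schema later.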
[jira] [Resolved] (SPARK-22562) CachedKafkaConsumer unsafe eviction from cache
[ https://issues.apache.org/jira/browse/SPARK-22562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22562. --- Resolution: Not A Problem > CachedKafkaConsumer unsafe eviction from cache > -- > > Key: SPARK-22562 > URL: https://issues.apache.org/jira/browse/SPARK-22562 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Dariusz Szablinski >Priority: Major > > From time to time a job fails because of > "java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access" (full stacktrace below). I think it happens when one > thread wants to add a new consumer into fully packed cache and another one > still uses an instance of cached consumer which is marked for eviction. > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1361) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer$$anon$1.removeEldestEntry(CachedKafkaConsumer.scala:128) > at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:299) > at java.util.HashMap.putVal(HashMap.java:663) > at java.util.HashMap.put(HashMap.java:611) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer$.get(CachedKafkaConsumer.scala:158) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.(KafkaRDD.scala:206) > at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:181) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
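The stack trace above shows the race: removeEldestEntry closes the eldest consumer while another thread may still be polling it. One possible mitigation, sketched here in Python with hypothetical names (ConsumerCache, FakeConsumer) rather than the actual Spark classes, is to defer closing an evicted consumer until its current borrower releases it:

```python
import threading
from collections import OrderedDict

class ConsumerCache:
    """Bounded cache that never closes an entry still held by another
    thread; eviction of an in-use entry is deferred until release()."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lock = threading.Lock()
        self.entries = OrderedDict()   # key -> consumer, oldest first
        self.in_use = set()            # keys currently borrowed
        self.pending_close = {}        # key -> consumer evicted while borrowed

    def acquire(self, key, factory):
        with self.lock:
            consumer = self.entries.pop(key, None) or factory()
            self.entries[key] = consumer      # re-insert as most recent
            self.in_use.add(key)
            while len(self.entries) > self.capacity:
                eldest = next(iter(self.entries))
                if eldest == key:
                    break                     # never evict what we just took
                victim = self.entries.pop(eldest)
                if eldest in self.in_use:
                    self.pending_close[eldest] = victim  # defer the close
                else:
                    victim.close()            # safe: nobody holds it
            return consumer

    def release(self, key):
        with self.lock:
            self.in_use.discard(key)
            victim = self.pending_close.pop(key, None)
            if victim is not None:
                victim.close()                # now safe to close

class FakeConsumer:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

cache = ConsumerCache(capacity=1)
a = cache.acquire("topic-0", FakeConsumer)
b = cache.acquire("topic-1", FakeConsumer)  # evicts topic-0, still borrowed
assert a.closed is False                    # not closed under the borrower
cache.release("topic-0")
assert a.closed is True                     # closed once released
```

This is only a sketch of the idea; the actual fix in Spark may look quite different.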
[jira] [Commented] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552041#comment-16552041 ] Wenchen Fan commented on SPARK-24882: - cc people we may be interested: [~rxin] [~rdblue] [~marmbrus] [~LI,Xiao] [~Gengliang.Wang] > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552041#comment-16552041 ] Wenchen Fan edited comment on SPARK-24882 at 7/22/18 2:49 PM: -- cc people who may be interested: [~rxin] [~rdblue] [~marmbrus] [~LI,Xiao] [~Gengliang.Wang] was (Author: cloud_fan): cc people we may be interested: [~rxin] [~rdblue] [~marmbrus] [~LI,Xiao] [~Gengliang.Wang] > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24339) spark sql can not prune column in transform/map/reduce query
[ https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552040#comment-16552040 ] Apache Spark commented on SPARK-24339: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/21839 > spark sql can not prune column in transform/map/reduce query > > > Key: SPARK-24339 > URL: https://issues.apache.org/jira/browse/SPARK-24339 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: xdcjie >Priority: Minor > Labels: map, reduce, sql, transform > > I was using {{TRANSFORM USING}} with branch-2.1/2.2, and noticed that it will > scan all column of data, query like: > {code:java} > SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM > test_table;{code} > it's physical plan like: > {code:java} > == Physical Plan == > ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, > u2#786, c2#787], > HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, > )),List((field.delim, > )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false) > +- FileScan parquet > [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,... > 367 more fields] Batched: false, Format: Parquet, Location: > InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct {code} > In our scenario, parquet has 400 column, this query will take more time. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: External issue URL: (was: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing) > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: Description: Data source V2 is out for a while, see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 read API. Details please see the attached google doc: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing was: Data source V2 is out for a while, see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 read API. Details please see the attached google doc. > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. 
During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) separate responsibilities of the data source v2 read API
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: External issue URL: https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing > separate responsibilities of the data source v2 read API > > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 read API. Details please see the attached google doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24882) separate responsibilities of the data source v2 read API
Wenchen Fan created SPARK-24882: --- Summary: separate responsibilities of the data source v2 read API Key: SPARK-24882 URL: https://issues.apache.org/jira/browse/SPARK-24882 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan Data source V2 has been out for a while; see the SPIP [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. We have already migrated most of the built-in streaming data sources to the V2 API, and the file source migration is in progress. During the migration, we found several problems and want to address them before we stabilize the V2 API. To solve these problems, we need to separate responsibilities in the data source v2 read API. For details, please see the attached Google doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24811) Add function `from_avro` and `to_avro`
[ https://issues.apache.org/jira/browse/SPARK-24811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551997#comment-16551997 ] Apache Spark commented on SPARK-24811: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/21838 > Add function `from_avro` and `to_avro` > -- > > Key: SPARK-24811 > URL: https://issues.apache.org/jira/browse/SPARK-24811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > Add a new function from_avro for parsing a binary column of avro format and > converting it into its corresponding catalyst value. > Add a new function to_avro for converting a column into binary of avro format > with the specified schema. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551968#comment-16551968 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] Thanks a lot for the info. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
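For reference, the multiset semantics that EXCEPT ALL and INTERSECT ALL are meant to implement (each row kept max(l - r, 0) or min(l, r) times, where l and r are the row's counts on each side) can be modeled with counters. This is an illustrative Python model of the semantics, not the proposed Spark implementation:

```python
from collections import Counter

def except_all(left, right):
    # Multiset difference: a row appearing l times on the left and r times
    # on the right survives max(l - r, 0) times.
    counts = Counter(left)
    counts.subtract(Counter(right))
    return sorted(row for row, n in counts.items() for _ in range(max(n, 0)))

def intersect_all(left, right):
    # Multiset intersection: each row kept min(l, r) times.
    return sorted((Counter(left) & Counter(right)).elements())

t1 = [("a", 1), ("a", 1), ("b", 2)]
t2 = [("a", 1)]
assert except_all(t1, t2) == [("a", 1), ("b", 2)]   # one ("a", 1) survives
assert intersect_all(t1, t2) == [("a", 1)]
```

Note that this is exactly where a plain join rewrite must be careful: it has to preserve duplicate counts, not just membership.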
[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())
[ https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551966#comment-16551966 ] Herman van Hovell commented on SPARK-16203: --- [~nnicolini] adding {{regexp_extract_all}} makes sense. Can you file a new ticket for this? BTW there might already be one. > regexp_extract to return an ArrayType(StringType()) > --- > > Key: SPARK-16203 > URL: https://issues.apache.org/jira/browse/SPARK-16203 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Max Moroz >Priority: Minor > > regexp_extract only returns a single matched group. If (as if often the case > - e.g., web log parsing) we need to parse the entire line and get all the > groups, we'll need to call it as many times as there are groups. > It's only a minor annoyance syntactically. > But unless I misunderstand something, it would be very inefficient. (How > would Spark know not to do multiple pattern matching operations, when only > one is needed? Or does the optimizer actually check whether the patterns are > identical, and if they are, avoid the repeated regex matching operations??) > Would it be possible to have it return an array when the index is not > specified (defaulting to None)? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
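A minimal sketch of the behavior being requested (a single pass that returns every match instead of repeated regexp_extract calls), written in Python with the standard re module; the function name and signature are hypothetical, not an existing Spark API:

```python
import re

def regexp_extract_all(s, pattern, idx=0):
    # Scan the string once and return the chosen capture group from every
    # match; idx mirrors regexp_extract's group index (0 = whole match).
    return [m.group(idx) for m in re.finditer(pattern, s)]

log = "GET /a 200 GET /b 404"
assert regexp_extract_all(log, r"GET (\S+) (\d+)", 1) == ["/a", "/b"]
assert regexp_extract_all(log, r"\d+") == ["200", "404"]
```

Because the pattern is scanned once, this also sidesteps the inefficiency the reporter worries about: no repeated matching of identical patterns.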
[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551965#comment-16551965 ] Takeshi Yamamuro edited comment on SPARK-21274 at 7/22/18 9:35 AM: --- ok, thanks! Since the feature freeze will come soon, I think this feature will appear in v2.5 or v3.0. So, we don't need to be in a hurry for this. was (Author: maropu): ok, thanks! > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551965#comment-16551965 ] Takeshi Yamamuro commented on SPARK-21274: -- ok, thanks! > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551954#comment-16551954 ] Dilip Biswal commented on SPARK-21274: -- [~maropu] Hi Takeshi, yeah.. So the code that does the rewrite is already in place. I am looking into a alternate way to ReplicateRows and will update in a few days. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24881) New options - compression and compressionLevel
[ https://issues.apache.org/jira/browse/SPARK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24881: Assignee: Apache Spark > New options - compression and compressionLevel > -- > > Key: SPARK-24881 > URL: https://issues.apache.org/jira/browse/SPARK-24881 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently Avro datasource takes the compression codec name from SQL config > (config key is hard coded in AvroFileFormat): > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 > . The obvious cons of it is modification of the global config can impact of > multiple writes. > A purpose of the ticket is to add new Avro option - "compression" the same as > we already have for other datasource like JSON, CSV and etc. If new option is > not set by an user, we take settings from SQL config > spark.sql.avro.compression.codec. If the former one is not set too, default > compression codec will be snappy (this is current behavior in the master). > Besides of the compression option, need to add another option - > compressionLevel which should reflect another SQL config in Avro: > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24881) New options - compression and compressionLevel
[ https://issues.apache.org/jira/browse/SPARK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551952#comment-16551952 ] Apache Spark commented on SPARK-24881: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21837 > New options - compression and compressionLevel > -- > > Key: SPARK-24881 > URL: https://issues.apache.org/jira/browse/SPARK-24881 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently Avro datasource takes the compression codec name from SQL config > (config key is hard coded in AvroFileFormat): > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 > . The obvious cons of it is modification of the global config can impact of > multiple writes. > A purpose of the ticket is to add new Avro option - "compression" the same as > we already have for other datasource like JSON, CSV and etc. If new option is > not set by an user, we take settings from SQL config > spark.sql.avro.compression.codec. If the former one is not set too, default > compression codec will be snappy (this is current behavior in the master). > Besides of the compression option, need to add another option - > compressionLevel which should reflect another SQL config in Avro: > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24881) New options - compression and compressionLevel
[ https://issues.apache.org/jira/browse/SPARK-24881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24881: Assignee: (was: Apache Spark) > New options - compression and compressionLevel > -- > > Key: SPARK-24881 > URL: https://issues.apache.org/jira/browse/SPARK-24881 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > Currently Avro datasource takes the compression codec name from SQL config > (config key is hard coded in AvroFileFormat): > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 > . The obvious cons of it is modification of the global config can impact of > multiple writes. > A purpose of the ticket is to add new Avro option - "compression" the same as > we already have for other datasource like JSON, CSV and etc. If new option is > not set by an user, we take settings from SQL config > spark.sql.avro.compression.codec. If the former one is not set too, default > compression codec will be snappy (this is current behavior in the master). > Besides of the compression option, need to add another option - > compressionLevel which should reflect another SQL config in Avro: > https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24881) New options - compression and compressionLevel
Maxim Gekk created SPARK-24881: -- Summary: New options - compression and compressionLevel Key: SPARK-24881 URL: https://issues.apache.org/jira/browse/SPARK-24881 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk Currently the Avro datasource takes the compression codec name from a SQL config (the config key is hard coded in AvroFileFormat): https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 . The obvious con is that modifying the global config can impact multiple writes. The purpose of this ticket is to add a new Avro option, "compression", the same as we already have for other datasources like JSON, CSV, etc. If the new option is not set by the user, we take the setting from the SQL config spark.sql.avro.compression.codec. If that config is not set either, the default compression codec will be snappy (the current behavior in master). Besides the compression option, we need to add another option, compressionLevel, which should reflect another SQL config in Avro: https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
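The proposed precedence (per-write option first, then the session config, then the snappy default) amounts to a simple fallback chain. A hedged sketch in Python with a hypothetical resolve_compression helper, not the actual Spark code:

```python
def resolve_compression(write_options, sql_conf):
    # Per-write "compression" option wins; otherwise fall back to the
    # session-level spark.sql.avro.compression.codec; otherwise snappy.
    return (write_options.get("compression")
            or sql_conf.get("spark.sql.avro.compression.codec")
            or "snappy")

conf = {"spark.sql.avro.compression.codec": "bzip2"}
assert resolve_compression({"compression": "deflate"}, conf) == "deflate"
assert resolve_compression({}, conf) == "bzip2"
assert resolve_compression({}, {}) == "snappy"
```

With this precedence, one writer asking for deflate no longer requires mutating a global config that other concurrent writes also read.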
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551921#comment-16551921 ] Takeshi Yamamuro commented on SPARK-21274: -- Anybody still working on this? > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following outer join: > {code} > SELECT a,b,c > FROMtab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register as a temp.view this second query under "*t1_except_t2_df*" name > that can be also used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROMtab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROMt1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org