[jira] [Resolved] (SPARK-28332) SQLMetric wrong initValue
[ https://issues.apache.org/jira/browse/SPARK-28332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28332. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26899 [https://github.com/apache/spark/pull/26899] > SQLMetric wrong initValue > -- > > Key: SPARK-28332 > URL: https://issues.apache.org/jira/browse/SPARK-28332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Song Jun >Priority: Minor > Fix For: 3.0.0 > > > Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set > to -1. > If there is a ShuffleMapStage with many tasks that read 0 bytes of data, > these tasks send the metric (whose value is still the initValue of -1) to the > Driver; the Driver then merges the metrics for this stage in > DAGScheduler.updateAccumulators, which makes the merged metric value of > this stage negative. > This is incorrect; we should set the initValue to 0. > The same issue exists in SQLMetrics.createTimingMetric. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
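A minimal, self-contained Scala sketch of the arithmetic described above (illustrative only; it does not use Spark's actual SQLMetric class, and the names are hypothetical): with an initValue of -1, every task that reads no data contributes -1 to the driver-side sum, so the merged stage metric goes negative, while an initValue of 0 keeps the merged value correct.
{code:java}
object InitValueDemo extends App {
  // A stage whose tasks all read 0 bytes, as in the report above.
  val taskBytesRead = Seq(0L, 0L, 0L)

  // Each task reports either its real value or the metric's initial value.
  def mergedMetric(initValue: Long): Long =
    taskBytesRead.map(b => if (b > 0) b else initValue).sum

  println(mergedMetric(-1L)) // -3: the merged "size" of the stage is negative
  println(mergedMetric(0L))  //  0: the value the driver should report
}
{code}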
[jira] [Created] (SPARK-30329) add iterator/foreach methods for Vectors
zhengruifeng created SPARK-30329: Summary: add iterator/foreach methods for Vectors Key: SPARK-30329 URL: https://issues.apache.org/jira/browse/SPARK-30329 Project: Spark Issue Type: Wish Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng 1, foreach: there are a lot of places where we need to traverse all the elements; currently we implement it like this: {code:java} var i = 0; while (i < vec.size) { val v = vec(i); ...; i += 1 } {code} This method is for both convenience and performance: for a SparseVector, the total complexity is O(size * log(nnz)), since an apply call has log(nnz) complexity due to the use of binary search; however, this can be optimized by operating on cursors. 2, foreachNonZero: the usage of foreachActive is mostly combined with a value != 0 filter, like {code:java} vec.foreachActive { case (i, v) => if (v != 0.0) { ... } } {code} We can add this method for convenience. 3, iterator/activeIterator/nonZeroIterator: add these three iterators, so that we can later add/change some implementations based on them on both the ml and mllib sides, to avoid vector conversions. For example, I want to optimize PCA by using ml.stat.Summarizer instead of Statistics.colStats/mllib.MultivariateStatisticalSummary, to avoid computation of unused variables. After having these iterators, I can do it without vector conversions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
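A hedged sketch of what the proposed helpers could look like, written as standalone functions over the existing org.apache.spark.ml.linalg.Vector API (foreachActive already exists; the names foreachNonZero and nonZeroIterator come from the proposal above, not from a released API):
{code:java}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import scala.collection.mutable.ArrayBuffer

// Visit only the non-zero entries, built on top of the existing foreachActive.
def foreachNonZero(vec: Vector)(f: (Int, Double) => Unit): Unit =
  vec.foreachActive((i, v) => if (v != 0.0) f(i, v))

// Materialize the non-zero entries as an iterator (simple, non-optimized sketch).
def nonZeroIterator(vec: Vector): Iterator[(Int, Double)] = {
  val buf = ArrayBuffer.empty[(Int, Double)]
  vec.foreachActive((i, v) => if (v != 0.0) buf += ((i, v)))
  buf.iterator
}

// Example: the explicitly stored zero at index 3 is skipped.
val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 0.0))
foreachNonZero(v)((i, x) => println(s"$i -> $x"))  // prints: 1 -> 2.0
{code}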
[jira] [Assigned] (SPARK-30329) add iterator/foreach methods for Vectors
[ https://issues.apache.org/jira/browse/SPARK-30329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30329: Assignee: zhengruifeng > add iterator/foreach methods for Vectors > > > Key: SPARK-30329 > URL: https://issues.apache.org/jira/browse/SPARK-30329 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > 1, foreach: there are a lot of places where we need to traverse all the > elements; currently we implement it like this: > {code:java} > var i = 0; while (i < vec.size) { val v = vec(i); ...; i += 1 } {code} > This method is for both convenience and performance: > for a SparseVector, the total complexity is O(size * log(nnz)), since an > apply call has log(nnz) complexity due to the use of binary search; > however, this can be optimized by operating on cursors. > > 2, foreachNonZero: the usage of foreachActive is mostly combined with a > value != 0 filter, like > {code:java} > vec.foreachActive { case (i, v) => > if (v != 0.0) { > ... > } > } > {code} > We can add this method for convenience. > > 3, iterator/activeIterator/nonZeroIterator: add these three iterators, so > that we can later add/change some implementations based on them on both the > ml and mllib sides, to avoid vector conversions. > For example, I want to optimize PCA by using ml.stat.Summarizer instead of > Statistics.colStats/mllib.MultivariateStatisticalSummary, to avoid > computation of unused variables. > After having these iterators, I can do it without vector conversions. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files
chendihao created SPARK-30328: - Summary: Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files Key: SPARK-30328 URL: https://issues.apache.org/jira/browse/SPARK-30328 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: chendihao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26618) Make typed Timestamp/Date literals consistent to casting
[ https://issues.apache.org/jira/browse/SPARK-26618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-26618: Labels: (was: correctness) > Make typed Timestamp/Date literals consistent to casting > > > Key: SPARK-26618 > URL: https://issues.apache.org/jira/browse/SPARK-26618 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently, the values of the typed literals TIMESTAMP and DATE are parsed to the > desired values by Timestamp.valueOf and Date.valueOf. This restricts the accepted > date and timestamp patterns, and is inconsistent with casting to > TimestampType/DateType. Also, using Timestamp.valueOf and Date.valueOf assumes > the hybrid calendar while parsing the textual representation of timestamps/dates. > This should be fixed by re-using the cast functionality. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26618) Make typed Timestamp/Date literals consistent to casting
[ https://issues.apache.org/jira/browse/SPARK-26618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-26618: Labels: correctness (was: ) > Make typed Timestamp/Date literals consistent to casting > > > Key: SPARK-26618 > URL: https://issues.apache.org/jira/browse/SPARK-26618 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Labels: correctness > Fix For: 3.0.0 > > > Currently, the values of the typed literals TIMESTAMP and DATE are parsed to the > desired values by Timestamp.valueOf and Date.valueOf. This restricts the accepted > date and timestamp patterns, and is inconsistent with casting to > TimestampType/DateType. Also, using Timestamp.valueOf and Date.valueOf assumes > the hybrid calendar while parsing the textual representation of timestamps/dates. > This should be fixed by re-using the cast functionality. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
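To make the inconsistency concrete, a hedged Scala illustration (assumes a SparkSession named spark is in scope): java.sql.Timestamp.valueOf only accepts the JDBC escape format yyyy-[m]m-[d]d hh:mm:ss[.f...], while Spark's CAST also accepts ISO-8601-style strings, so a typed literal parsed with valueOf and a cast can disagree on what counts as a valid timestamp string.
{code:java}
import java.sql.Timestamp

// Accepted: the JDBC escape format required by Timestamp.valueOf.
Timestamp.valueOf("2019-01-01 00:00:00")

// Rejected: the 'T' separator makes valueOf throw IllegalArgumentException.
// Timestamp.valueOf("2019-01-01T00:00:00")

// The same string casts fine in Spark SQL.
spark.sql("SELECT CAST('2019-01-01T00:00:00' AS TIMESTAMP)").show()
{code}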
[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002122#comment-17002122 ] Takeshi Yamamuro commented on SPARK-29699: -- A root cause is the NULL ordering difference in order-by clauses; in Spark, NULLS FIRST is the default for an ascending order, but in PostgreSQL/Oracle, NULLS LAST is the default. So, if we explicitly set the order in the example query above, we can get the same answer as PostgreSQL;
{code:java}
sql("""select a, b, sum(c), sum(sum(c)) over (order by a asc nulls last, b asc nulls last) as rsum
       from gstest2 group by rollup (a,b)
       order by rsum asc nulls last, a asc nulls last, b asc nulls last""").show()
+----+----+------+----+
|   a|   b|sum(c)|rsum|
+----+----+------+----+
|   1|   1|     8|   8|
|   1|   2|     2|  10|
|   1|null|    10|  20|
|   2|   2|     2|  22|
|   2|null|     2|  24|
|null|null|    12|  36|
+----+----+------+----+
{code}
Currently, it seems we follow the MySQL/SQL Server behaviour, and they use NULLS FIRST by default. Any historical reason for our default NULL order? cc: [~smilegator] [~cloud_fan] [~viirya] Changing the default behaviour in Spark has some impact on test output in SQLQueryTestSuite: [https://github.com/apache/spark/compare/master...maropu:NullLastByDefault] References: Cited from the PostgreSQL doc: [https://www.postgresql.org/docs/current/queries-order.html] {code:java} By default, null values sort as if larger than any non-null value; that is, NULLS FIRST is the default for DESC order, and NULLS LAST otherwise. {code} Cited from the OracleDB doc: [https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#SQLRF01702] {code:java} NULLS LAST is the default for ascending order, and NULLS FIRST is the default for descending order. {code} Cited from the SQL Server doc: [https://docs.microsoft.com/en-us/sql/t-sql/queries/select-order-by-clause-transact-sql?redirectedfrom=MSDN&view=sql-server-ver15] {code:java} ASC is the default sort order. Null values are treated as the lowest possible values. {code} Cited from the MySQL doc: [https://dev.mysql.com/doc/refman/5.7/en/working-with-null.html] {code:java} When doing an ORDER BY, NULL values are presented first if you do ORDER BY ... ASC and last if you do ORDER BY ... DESC.
{code} > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > Labels: correctness > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
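A small, hedged DataFrame illustration of the default described in the comment above (assumes a SparkSession named spark; the column name is made up):
{code:java}
import org.apache.spark.sql.functions.col

// A single nullable int column with one NULL row.
val df = spark.sql("SELECT * FROM VALUES (1), (NULL), (2) AS t(x)")

// Spark's default for an ascending sort is NULLS FIRST ...
df.orderBy(col("x").asc).show()             // the NULL row comes first

// ... while asc_nulls_last reproduces the PostgreSQL/Oracle default.
df.orderBy(col("x").asc_nulls_last).show()  // the NULL row comes last
{code}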
[jira] [Updated] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-29699: - Description: A nested aggregate below with a window function seems to have different answers in the `rsum` column between PgSQL and Spark; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; a | b | sum | rsum ---+---+-+-- 1 | 1 | 8 |8 1 | 2 | 2 | 10 1 | | 10 | 20 2 | 2 | 2 | 22 2 | | 2 | 24 | | 12 | 36 (6 rows) {code} {code:java} scala> sql(""" | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum | from gstest2 group by rollup (a,b) order by rsum, a, b | """).show() +++--++ | a| b|sum(c)|rsum| +++--++ |null|null|12| 12| | 1|null|10| 22| | 1| 1| 8| 30| | 1| 2| 2| 32| | 2|null| 2| 34| | 2| 2| 2| 36| +++--++ {code} was: A nested aggregate below with a window function seems to have different answers in the `rsum` column between PgSQL and Spark; {code:java} postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e integer, f integer, g integer, h integer); postgres=# insert into gstest2 values postgres-# (1, 1, 1, 1, 1, 1, 1, 1), postgres-# (1, 1, 1, 1, 1, 1, 1, 2), postgres-# (1, 1, 1, 1, 1, 1, 2, 2), postgres-# (1, 1, 1, 1, 1, 2, 2, 2), postgres-# (1, 1, 1, 1, 2, 2, 2, 2), postgres-# (1, 1, 1, 2, 2, 2, 2, 2), postgres-# (1, 1, 2, 2, 2, 2, 2, 2), postgres-# (1, 2, 2, 2, 2, 2, 2, 2), postgres-# (2, 2, 2, 2, 2, 2, 2, 2); INSERT 0 9 postgres=# postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; a | b | sum | rsum ---+---+-+-- 1 | 1 | 16 | 16 1 | 2 | 4 | 20 1 | | 20 | 40 2 | 2 | 4 | 44 2 | | 4 | 48 | | 24 | 72 (6 rows) {code} {code:java} scala> sql(""" | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum | from gstest2 group by rollup (a,b) order by rsum, a, b | """).show() +++--++ | a| b|sum(c)|rsum| +++--++ |null|null|12| 12| | 1|null|10| 22| | 1| 1| 8| 30| | 1| 2| 2| 32| | 2|null| 2| 34| | 2| 2| 2| 36| +++--++ {code} > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Major > Labels: correctness > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 
2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| >
[jira] [Created] (SPARK-30327) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
lujun created SPARK-30327: - Summary: Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 Key: SPARK-30327 URL: https://issues.apache.org/jira/browse/SPARK-30327 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.3.2 Reporter: lujun val edgeRdd: RDD[Edge[Int]] = rdd.map(rec => { Edge(rec._2._1.getOldcid, rec._2._1.getNewcid, 0) }) val vertexRdd: RDD[(Long, String)] = rdd.map(rec =>{ (rec._2._1.getOldcid, rec._2._1.getCustomer_id)} ) val returnRdd = Graph(vertexRdd, edgeRdd).connectedComponents().vertices. join(vertexRdd) .map \{ case (cid, (groupid, cus)) => (cus, groupid)} For the same batch of data, sometimes it succeeds, and the following errors are reported! Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 24374.0 failed 4 times, most recent failure: Lost task 2.3 in stage 24374.0 (TID 133352, lx-es-04, executor 0): java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64) at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:71) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2131) at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1035) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.reduce(RDD.scala:1017) at org.apache.spark.graphx.impl.VertexRDDImpl.count(VertexRDDImpl.scala:90) at org.apache.spark.graphx.Pregel$.apply(Pregel.scala:140) at org.apache.spark.graphx.lib.ConnectedComponents$.run(ConnectedComponents.scala:54) at org.apache.spark.graphx.lib.ConnectedComponents$.
[jira] [Comment Edited] (SPARK-6221) SparkSQL should support auto merging output files
[ https://issues.apache.org/jira/browse/SPARK-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002085#comment-17002085 ] Tian Tian edited comment on SPARK-6221 at 12/23/19 3:24 AM: I encountered this problem and found [issue-24940](https://issues.apache.org/jira/browse/SPARK-24940). Using {quote}/*+ COALESCE(numPartitions) */{quote} or {quote}/*+ REPARTITION(numPartitions) */{quote} in a Spark SQL query will control the number of output files. In my practice I recommend the second hint for users, because it generates a new stage to do this job, while the first one does not, which may cause the job to die because of fewer tasks in the last stage. was (Author: tian tian): I encountered this problem and found [issue-24940](https://issues.apache.org/jira/browse/SPARK-24940). Using `/*+ COALESCE(numPartitions) */` or `/*+ REPARTITION(numPartitions) */` in a Spark SQL query will control the number of output files. In my practice I recommend the second hint for users, because it generates a new stage to do this job, while the first one does not, which may cause the job to die because of fewer tasks in the last stage. > SparkSQL should support auto merging output files > - > > Key: SPARK-6221 > URL: https://issues.apache.org/jira/browse/SPARK-6221 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Tianyi Wang >Priority: Major > > Hive has a feature that can automatically merge small files in an HQL job's output > path. > This feature is quite useful for cases where people use {{insert into}} > to move minute-level data from the input path into a daily table. > In that case, if the SQL includes a {{group by}} or {{join}} operation, we > always set the {{reduce number}} to at least 200 to avoid a possible OOM on the > reduce side. > That makes this SQL output at least 200 files at the end of the > execution, so the daily table will finally contain more than 5 files. > If we could provide the same feature in SparkSQL, it would greatly reduce > HDFS operations and Spark tasks when we run other SQL on this table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6221) SparkSQL should support auto merging output files
[ https://issues.apache.org/jira/browse/SPARK-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002085#comment-17002085 ] Tian Tian edited comment on SPARK-6221 at 12/23/19 3:21 AM: I encountered this problem and found [issue-24940](https://issues.apache.org/jira/browse/SPARK-24940). Using `/*+ COALESCE(numPartitions) */` or `/*+ REPARTITION(numPartitions) */` in a Spark SQL query will control the number of output files. In my practice I recommend the second hint for users, because it generates a new stage to do this job, while the first one does not, which may cause the job to die because of fewer tasks in the last stage. was (Author: tian tian): I encountered this problem and found [issue-24940](https://issues.apache.org/jira/browse/SPARK-24940). Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in a Spark SQL query will control the number of output files. In my practice I recommend the second hint for users, because it generates a new stage to do this job, while the first one does not, which may cause the job to die because of fewer tasks in the last stage. > SparkSQL should support auto merging output files > - > > Key: SPARK-6221 > URL: https://issues.apache.org/jira/browse/SPARK-6221 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Tianyi Wang >Priority: Major > > Hive has a feature that can automatically merge small files in an HQL job's output > path. > This feature is quite useful for cases where people use {{insert into}} > to move minute-level data from the input path into a daily table. > In that case, if the SQL includes a {{group by}} or {{join}} operation, we > always set the {{reduce number}} to at least 200 to avoid a possible OOM on the > reduce side. > That makes this SQL output at least 200 files at the end of the > execution, so the daily table will finally contain more than 5 files. > If we could provide the same feature in SparkSQL, it would greatly reduce > HDFS operations and Spark tasks when we run other SQL on this table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6221) SparkSQL should support auto merging output files
[ https://issues.apache.org/jira/browse/SPARK-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002085#comment-17002085 ] Tian Tian commented on SPARK-6221: -- I encountered this problem and found [issue-24940](https://issues.apache.org/jira/browse/SPARK-24940). Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in a Spark SQL query will control the number of output files. In my practice I recommend the second hint for users, because it generates a new stage to do this job, while the first one does not, which may cause the job to die because of fewer tasks in the last stage. > SparkSQL should support auto merging output files > - > > Key: SPARK-6221 > URL: https://issues.apache.org/jira/browse/SPARK-6221 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Tianyi Wang >Priority: Major > > Hive has a feature that can automatically merge small files in an HQL job's output > path. > This feature is quite useful for cases where people use {{insert into}} > to move minute-level data from the input path into a daily table. > In that case, if the SQL includes a {{group by}} or {{join}} operation, we > always set the {{reduce number}} to at least 200 to avoid a possible OOM on the > reduce side. > That makes this SQL output at least 200 files at the end of the > execution, so the daily table will finally contain more than 5 files. > If we could provide the same feature in SparkSQL, it would greatly reduce > HDFS operations and Spark tasks when we run other SQL on this table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
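A hedged usage sketch of the two hints referenced in the comments above (the table names and partition count are made-up examples; assumes a SparkSession named spark):
{code:java}
// COALESCE only narrows the existing partitions: no shuffle, no extra stage,
// so the final stage runs with few tasks.
spark.sql("INSERT INTO daily_table SELECT /*+ COALESCE(10) */ * FROM staging_table")

// REPARTITION adds a shuffle, so the upstream work keeps its parallelism and
// only the writing stage uses the reduced partition count.
spark.sql("INSERT INTO daily_table SELECT /*+ REPARTITION(10) */ * FROM staging_table")
{code}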
[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet
[ https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002049#comment-17002049 ] Cesc commented on SPARK-30316: --- However, the rows of the two results are the same. > data size boom after shuffle writing dataframe save as parquet > -- > > Key: SPARK-30316 > URL: https://issues.apache.org/jira/browse/SPARK-30316 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Affects Versions: 2.4.4 >Reporter: Cesc >Priority: Blocker > > When I read the same parquet file and then save it in two ways, with shuffle > and without shuffle, I find the sizes of the output parquet files are quite > different. For example, for an original parquet file of 800 MB: if I save it > without a shuffle, the size is still 800 MB, whereas if I use repartition > and then save it in parquet format, the data size increases to 2.5 GB. The row > counts, column counts and content of the two output files are all the same. > I wonder: > firstly, why does the data size increase after repartition/shuffle? > secondly, if I need to shuffle the input dataframe, how can I save it as a parquet > file efficiently to avoid the data size blow-up? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
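One plausible explanation, offered here as an assumption rather than something stated in the thread: a plain repartition(n) does a round-robin shuffle that scatters similar values across row groups, so Parquet's dictionary and run-length encodings compress much worse. A hedged sketch of a common mitigation that restores value locality before the write (df, the column name and the output path are hypothetical):
{code:java}
import org.apache.spark.sql.functions.col

// Shuffle to the desired parallelism, but keep similar key values together
// inside each partition so Parquet's encodings stay effective.
val reshaped = df
  .repartition(200, col("some_key"))   // hash-partition by a column instead of round-robin
  .sortWithinPartitions("some_key")    // cluster similar values within each file

reshaped.write.mode("overwrite").parquet("/tmp/output_path")
{code}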
[jira] [Updated] (SPARK-30326) Raise exception if analyzer exceed max iterations
[ https://issues.apache.org/jira/browse/SPARK-30326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-30326: --- Description: Currently, both the analyzer and the optimizer just log a warning message if rule execution exceeds the max iterations. They should have different behavior: the analyzer should raise an exception to indicate that the plan has not reached a fixed point after the max iterations, while the optimizer should just log a warning and keep the current plan. This is more feasible after SPARK-30138 was introduced. (was: Currently, both the analyzer and the optimizer just log a warning message if rule execution exceeds the max iterations. They should have different behavior: the analyzer should raise an exception to indicate that logical plan resolution failed, while the optimizer should just log a warning and keep the current plan. This is more feasible after SPARK-30138 was introduced.) > Raise exception if analyzer exceed max iterations > - > > Key: SPARK-30326 > URL: https://issues.apache.org/jira/browse/SPARK-30326 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xin Wu >Priority: Major > > Currently, both the analyzer and the optimizer just log a warning message if rule > execution exceeds the max iterations. They should have different behavior: > the analyzer should raise an exception to indicate that the plan has not reached a > fixed point after the max iterations, while the optimizer should just log a warning > and keep the current plan. This is more feasible after SPARK-30138 was introduced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
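A self-contained, hedged sketch of the proposed split in behaviour (generic code, not Spark's actual RuleExecutor): run a batch of rules to a fixed point and, on hitting the iteration limit, either throw (analyzer) or warn and keep the current plan (optimizer).
{code:java}
// `Plan` stands in for Spark's logical plan; everything here is illustrative.
def runToFixedPoint[Plan](plan: Plan,
                          rules: Seq[Plan => Plan],
                          maxIterations: Int,
                          failOnNonConvergence: Boolean): Plan = {
  var current = plan
  var converged = false
  var i = 0
  while (!converged && i < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    converged = next == current
    current = next
    i += 1
  }
  if (!converged) {
    val msg = s"plan did not reach a fixed point after $maxIterations iterations"
    if (failOnNonConvergence) throw new IllegalStateException(msg) // analyzer: fail loudly
    else println(s"WARN: $msg; keeping the current plan")          // optimizer: best effort
  }
  current
}
{code}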
[jira] [Resolved] (SPARK-30269) Should use old partition stats to decide whether to update stats when analyzing partition
[ https://issues.apache.org/jira/browse/SPARK-30269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30269. -- Fix Version/s: (was: 3.0.0) Resolution: Fixed Issue resolved by pull request 26963 [https://github.com/apache/spark/pull/26963] > Should use old partition stats to decide whether to update stats when > analyzing partition > - > > Key: SPARK-30269 > URL: https://issues.apache.org/jira/browse/SPARK-30269 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > Fix For: 2.4.5 > > > It's an obvious bug: currently when analyzing partition stats, we use old > table stats to compare with newly computed stats to decide whether it should > update stats or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30269) Should use old partition stats to decide whether to update stats when analyzing partition
[ https://issues.apache.org/jira/browse/SPARK-30269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30269: Assignee: Zhenhua Wang > Should use old partition stats to decide whether to update stats when > analyzing partition > - > > Key: SPARK-30269 > URL: https://issues.apache.org/jira/browse/SPARK-30269 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > It's an obvious bug: currently when analyzing partition stats, we use old > table stats to compare with newly computed stats to decide whether it should > update stats or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
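A hedged sketch of the decision described above (plain Scala, not the actual Spark code; the Stats case class and the values are made up): the freshly computed partition stats should be compared against the previous stats of the same partition, not against the table-level stats.
{code:java}
// Minimal stand-in for catalog statistics.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

// Update only when the freshly computed stats differ from the previous ones.
def shouldUpdate(newStats: Stats, oldStats: Option[Stats]): Boolean =
  !oldStats.contains(newStats)

// The bug: when analyzing a partition, the old *table* stats were passed here
// instead of the old stats of that partition.
val newPartStats  = Stats(sizeInBytes = 1024, rowCount = Some(10))
val oldPartStats  = Some(Stats(sizeInBytes = 1024, rowCount = Some(10)))
val oldTableStats = Some(Stats(sizeInBytes = 999999, rowCount = Some(5000)))

println(shouldUpdate(newPartStats, oldTableStats)) // true  -> needless metastore update
println(shouldUpdate(newPartStats, oldPartStats))  // false -> correctly skips the update
{code}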
[jira] [Assigned] (SPARK-30128) Promote remaining "hidden" PySpark DataFrameReader options to load APIs
[ https://issues.apache.org/jira/browse/SPARK-30128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30128: Assignee: Hyukjin Kwon > Promote remaining "hidden" PySpark DataFrameReader options to load APIs > --- > > Key: SPARK-30128 > URL: https://issues.apache.org/jira/browse/SPARK-30128 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Nicholas Chammas >Assignee: Hyukjin Kwon >Priority: Minor > > Following on to SPARK-29903 and similar issues (linked), there are options > available to the DataFrameReader for certain source formats, but which are > not exposed properly in the relevant APIs. > These options include `timeZone` and `pathGlobFilter`. Instead of being noted > under [the option() > method|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.option], > they should be implemented directly into load APIs that support them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30128) Promote remaining "hidden" PySpark DataFrameReader options to load APIs
[ https://issues.apache.org/jira/browse/SPARK-30128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30128. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26958 [https://github.com/apache/spark/pull/26958] > Promote remaining "hidden" PySpark DataFrameReader options to load APIs > --- > > Key: SPARK-30128 > URL: https://issues.apache.org/jira/browse/SPARK-30128 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Nicholas Chammas >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > Following on to SPARK-29903 and similar issues (linked), there are options > available to the DataFrameReader for certain source formats, but which are > not exposed properly in the relevant APIs. > These options include `timeZone` and `pathGlobFilter`. Instead of being noted > under [the option() > method|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.option], > they should be implemented directly into load APIs that support them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27762) Support user provided avro schema for writing fields with different ordering
[ https://issues.apache.org/jira/browse/SPARK-27762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-27762: Issue Type: Bug (was: New Feature) > Support user provided avro schema for writing fields with different ordering > > > Key: SPARK-27762 > URL: https://issues.apache.org/jira/browse/SPARK-27762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: DB Tsai >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org