[jira] [Resolved] (SPARK-28332) SQLMetric wrong initValue

2019-12-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28332.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26899
[https://github.com/apache/spark/pull/26899]

> SQLMetric wrong initValue 
> --
>
> Key: SPARK-28332
> URL: https://issues.apache.org/jira/browse/SPARK-28332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set 
> to -1.
> If a ShuffleMapStage has many tasks that read 0 bytes of data, each of these 
> tasks sends the metric back to the driver with its value still at the 
> initValue of -1. When the driver then merges the metrics for this stage in 
> DAGScheduler.updateAccumulators, the merged metric value of the stage ends 
> up negative.
> This is incorrect; the initValue should be 0.
> The same problem exists in SQLMetrics.createTimingMetric.
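
A minimal, self-contained sketch (not the actual SQLMetric code; the class and merge logic below are only illustrative) of how summing per-task values that are still at the init value of -1 produces a negative stage total:
{code:scala}
// Hypothetical stand-in for a sum-based SQL metric; not Spark's real class.
final case class SketchMetric(var value: Long) {
  def merge(other: SketchMetric): Unit = value += other.value
}

object MetricMergeSketch extends App {
  // 100 tasks read 0 bytes, so each reports the untouched init value of -1.
  val taskMetrics = Seq.fill(100)(SketchMetric(-1L))

  val stageTotal = SketchMetric(-1L)   // driver-side metric, also seeded with -1
  taskMetrics.foreach(stageTotal.merge)
  println(stageTotal.value)            // -101: a negative "size" for the stage

  // Seeding with 0 instead keeps the merged value at 0 for zero-byte tasks.
}
{code}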



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30329) add iterator/foreach methods for Vectors

2019-12-22 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30329:


 Summary: add iterator/foreach methods for Vectors
 Key: SPARK-30329
 URL: https://issues.apache.org/jira/browse/SPARK-30329
 Project: Spark
  Issue Type: Wish
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


1, foreach: there are many places where we need to traverse all the elements; 
currently we implement it like this:
{code:java}
var i = 0; while (i < vec.size) { val v = vec(i); ...; i += 1 } {code}
Adding foreach helps both convenience and performance:

For a SparseVector, the loop above costs O(size * log(nnz)) in total, since 
each apply call is O(log(nnz)) due to the binary search over the stored 
indices;

A cursor-based traversal over the stored values avoids the repeated binary 
searches.

 

2, foreachNonZero: the usage of foreachActive is almost always combined with a 
value != 0 filter, like
{code:java}
vec.foreachActive { case (i, v) =>
  if (v != 0.0) {
    ...
  }
}
{code}
Adding this method would be a convenience (see the sketch below).

 

3, iterator/activeIterator/nonZeroIterator: add these three iterators so that 
we can later add or change implementations on top of them on both the ml and 
mllib sides, avoiding vector conversions.

For example, I want to optimize PCA by using ml.stat.Summarizer instead of 
Statistics.colStats/mllib.MultivariateStatisticalSummary, to avoid computing 
unused variables.

With these iterators in place, I can do that without vector conversions.
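
A rough, illustrative sketch of what foreachNonZero and nonZeroIterator could look like on top of the existing ml.linalg API (the names and placement are assumptions, not the final design):
{code:scala}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}

object VectorIterSketch {
  // Skip explicit and implicit zeros; built on the existing foreachActive.
  def foreachNonZero(vec: Vector)(f: (Int, Double) => Unit): Unit =
    vec.foreachActive((i, v) => if (v != 0.0) f(i, v))

  // Iterate the stored entries directly, without per-element binary searches.
  def nonZeroIterator(vec: Vector): Iterator[(Int, Double)] = vec match {
    case sv: SparseVector =>
      sv.indices.iterator.zip(sv.values.iterator).filter(_._2 != 0.0)
    case dv: DenseVector =>
      dv.values.iterator.zipWithIndex.map { case (v, i) => (i, v) }.filter(_._2 != 0.0)
  }
}

// Usage in the spark-shell: sum the non-zero entries of a sparse vector.
val v = Vectors.sparse(6, Array(1, 4), Array(3.0, 0.0))
var sum = 0.0
VectorIterSketch.foreachNonZero(v)((_, value) => sum += value)   // sum == 3.0
val nnz = VectorIterSketch.nonZeroIterator(v).size               // 1
{code}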

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30329) add iterator/foreach methods for Vectors

2019-12-22 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30329:


Assignee: zhengruifeng

> add iterator/foreach methods for Vectors
> 
>
> Key: SPARK-30329
> URL: https://issues.apache.org/jira/browse/SPARK-30329
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> 1, foreach: there are many places where we need to traverse all the 
> elements; currently we implement it like this:
> {code:java}
> var i = 0; while (i < vec.size) { val v = vec(i); ...; i += 1 } {code}
> Adding foreach helps both convenience and performance:
> For a SparseVector, the loop above costs O(size * log(nnz)) in total, since 
> each apply call is O(log(nnz)) due to the binary search over the stored 
> indices;
> A cursor-based traversal over the stored values avoids the repeated binary 
> searches.
>  
> 2, foreachNonZero: the usage of foreachActive is almost always combined with 
> a value != 0 filter, like
> {code:java}
> vec.foreachActive { case (i, v) =>
>   if (v != 0.0) {
>     ...
>   }
> }
>  {code}
> Adding this method would be a convenience.
>  
> 3, iterator/activeIterator/nonZeroIterator: add these three iterators so 
> that we can later add or change implementations on top of them on both the 
> ml and mllib sides, avoiding vector conversions.
> For example, I want to optimize PCA by using ml.stat.Summarizer instead of 
> Statistics.colStats/mllib.MultivariateStatisticalSummary, to avoid computing 
> unused variables.
> With these iterators in place, I can do that without vector conversions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2019-12-22 Thread chendihao (Jira)
chendihao created SPARK-30328:
-

 Summary: Fail to write local files with RDD.saveTextFile when 
setting the incorrect Hadoop configuration files
 Key: SPARK-30328
 URL: https://issues.apache.org/jira/browse/SPARK-30328
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: chendihao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26618) Make typed Timestamp/Date literals consistent to casting

2019-12-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-26618:

Labels:   (was: correctness)

> Make typed Timestamp/Date literals consistent to casting
> 
>
> Key: SPARK-26618
> URL: https://issues.apache.org/jira/browse/SPARK-26618
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently the values of the typed literals TIMESTAMP and DATE are parsed to 
> the desired values by Timestamp.valueOf and Date.valueOf. This restricts the 
> accepted date and timestamp patterns and is inconsistent with casting to 
> TimestampType/DateType. Also, Timestamp.valueOf and Date.valueOf assume the 
> hybrid calendar while parsing the textual representation of 
> timestamps/dates. This should be fixed by reusing the cast functionality.
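
For reference, a minimal spark-shell comparison of the two paths discussed above, a typed literal versus an explicit cast (the output depends on the session time zone and Spark version):
{code:scala}
// Typed literals, currently parsed via Timestamp.valueOf / Date.valueOf.
spark.sql("SELECT TIMESTAMP '2019-01-01 00:00:00' AS ts_literal, DATE '2019-01-01' AS d_literal").show()

// The cast path that the literals should be consistent with after this change.
spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP) AS ts_cast, CAST('2019-01-01' AS DATE) AS d_cast").show()
{code}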



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26618) Make typed Timestamp/Date literals consistent to casting

2019-12-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-26618:

Labels: correctness  (was: )

> Make typed Timestamp/Date literals consistent to casting
> 
>
> Key: SPARK-26618
> URL: https://issues.apache.org/jira/browse/SPARK-26618
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Currently the values of the typed literals TIMESTAMP and DATE are parsed to 
> the desired values by Timestamp.valueOf and Date.valueOf. This restricts the 
> accepted date and timestamp patterns and is inconsistent with casting to 
> TimestampType/DateType. Also, Timestamp.valueOf and Date.valueOf assume the 
> hybrid calendar while parsing the textual representation of 
> timestamps/dates. This should be fixed by reusing the cast functionality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions

2019-12-22 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002122#comment-17002122
 ] 

Takeshi Yamamuro commented on SPARK-29699:
--

The root cause is the difference in the default NULL ordering of ORDER BY 
clauses: in Spark, NULLS FIRST is the default for ascending order, whereas in 
PostgreSQL/Oracle, NULLS LAST is the default. So, if we set the NULL order 
explicitly in the example query above, we get the same answer as PostgreSQL:
{code:java}
sql("""select a, b, sum(c), sum(sum(c)) over (order by a asc nulls last, b asc 
nulls last) as rsum
  from gstest2 group by rollup (a,b) order by rsum asc nulls last, a asc nulls 
last, b asc nulls last""").show()
+----+----+------+----+
|   a|   b|sum(c)|rsum|
+----+----+------+----+
|   1|   1|     8|   8|
|   1|   2|     2|  10|
|   1|null|    10|  20|
|   2|   2|     2|  22|
|   2|null|     2|  24|
|null|null|    12|  36|
+----+----+------+----+
{code}
Currently, it seems we follow the MySQL/SQL Server behaviour, where NULLs are 
treated as the lowest values, i.e. NULLS FIRST for ascending order.
 Is there any historical reason for our default NULL order? cc: [~smilegator] 
[~cloud_fan] [~viirya]

Changing the default behaviour in Spark has some impact on the test output in 
SQLQueryTestSuite:
 [https://github.com/apache/spark/compare/master...maropu:NullLastByDefault]

 

References:
 Cited from the PostgreSQL doc: 
[https://www.postgresql.org/docs/current/queries-order.html]
{code:java}
By default, null values sort as if larger than any non-null value; that is, 
NULLS FIRST is the default for DESC order, and NULLS LAST otherwise.
{code}
Cited from the OracleDB doc: 
[https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#SQLRF01702]
{code:java}
NULLS LAST is the default for ascending order, and NULLS FIRST is the default 
for descending order.
{code}
Cited from the SQL server: 
[https://docs.microsoft.com/en-us/sql/t-sql/queries/select-order-by-clause-transact-sql?redirectedfrom=MSDN=sql-server-ver15]
{code:java}
 ASC is the default sort order. Null values are treated as the lowest possible 
values.
{code}
Cited from the MySQL: 
[https://dev.mysql.com/doc/refman/5.7/en/working-with-null.html]
{code:java}
When doing an ORDER BY, NULL values are presented first if you do ORDER BY ... 
ASC and last if you do ORDER BY ... DESC.
{code}

> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>  Labels: correctness
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-----+------
>  1 | 1 |   8 |    8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>    |   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +----+----+------+----+
> |   a|   b|sum(c)|rsum|
> +----+----+------+----+
> |null|null|    12|  12|
> |   1|null|    10|  22|
> |   1|   1|     8|  30|
> |   1|   2|     2|  32|
> |   2|null|     2|  34|
> |   2|   2|     2|  36|
> +----+----+------+----+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29699) Different answers in nested aggregates with window functions

2019-12-22 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-29699:
-
Description: 
A nested aggregate below with a window function seems to have different answers 
in the `rsum` column  between PgSQL and Spark;
{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e 
integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# 
postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
 a | b | sum | rsum 
---+---+-----+------
 1 | 1 |   8 |    8
 1 | 2 |   2 |   10
 1 |   |  10 |   20
 2 | 2 |   2 |   22
 2 |   |   2 |   24
   |   |  12 |   36
(6 rows)
{code}
{code:java}
scala> sql("""
 | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
 |   from gstest2 group by rollup (a,b) order by rsum, a, b
 | """).show()
+----+----+------+----+
|   a|   b|sum(c)|rsum|
+----+----+------+----+
|null|null|    12|  12|
|   1|null|    10|  22|
|   1|   1|     8|  30|
|   1|   2|     2|  32|
|   2|null|     2|  34|
|   2|   2|     2|  36|
+----+----+------+----+
{code}

  was:
A nested aggregate below with a window function seems to have different answers 
in the `rsum` column  between PgSQL and Spark;
{code:java}
postgres=# create table gstest2 (a integer, b integer, c integer, d integer, e 
integer, f integer, g integer, h integer);
postgres=# insert into gstest2 values
postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
INSERT 0 9
postgres=# 
postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
 a | b | sum | rsum 
---+---+-----+------
 1 | 1 |  16 |   16
 1 | 2 |   4 |   20
 1 |   |  20 |   40
 2 | 2 |   4 |   44
 2 |   |   4 |   48
   |   |  24 |   72
(6 rows)
{code}
{code:java}
scala> sql("""
 | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
 |   from gstest2 group by rollup (a,b) order by rsum, a, b
 | """).show()
+----+----+------+----+
|   a|   b|sum(c)|rsum|
+----+----+------+----+
|null|null|    12|  12|
|   1|null|    10|  22|
|   1|   1|     8|  30|
|   1|   2|     2|  32|
|   2|null|     2|  34|
|   2|   2|     2|  36|
+----+----+------+----+
{code}


> Different answers in nested aggregates with window functions
> 
>
> Key: SPARK-29699
> URL: https://issues.apache.org/jira/browse/SPARK-29699
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>  Labels: correctness
>
> A nested aggregate below with a window function seems to have different 
> answers in the `rsum` column  between PgSQL and Spark;
> {code:java}
> postgres=# create table gstest2 (a integer, b integer, c integer, d integer, 
> e integer, f integer, g integer, h integer);
> postgres=# insert into gstest2 values
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 1),
> postgres-#   (1, 1, 1, 1, 1, 1, 1, 2),
> postgres-#   (1, 1, 1, 1, 1, 1, 2, 2),
> postgres-#   (1, 1, 1, 1, 1, 2, 2, 2),
> postgres-#   (1, 1, 1, 1, 2, 2, 2, 2),
> postgres-#   (1, 1, 1, 2, 2, 2, 2, 2),
> postgres-#   (1, 1, 2, 2, 2, 2, 2, 2),
> postgres-#   (1, 2, 2, 2, 2, 2, 2, 2),
> postgres-#   (2, 2, 2, 2, 2, 2, 2, 2);
> INSERT 0 9
> postgres=# 
> postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
> postgres-#   from gstest2 group by rollup (a,b) order by rsum, a, b;
>  a | b | sum | rsum 
> ---+---+-----+------
>  1 | 1 |   8 |    8
>  1 | 2 |   2 |   10
>  1 |   |  10 |   20
>  2 | 2 |   2 |   22
>  2 |   |   2 |   24
>    |   |  12 |   36
> (6 rows)
> {code}
> {code:java}
> scala> sql("""
>  | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum
>  |   from gstest2 group by rollup (a,b) order by rsum, a, b
>  | """).show()
> +----+----+------+----+
> |   a|   b|sum(c)|rsum|

[jira] [Created] (SPARK-30327) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1

2019-12-22 Thread lujun (Jira)
lujun created SPARK-30327:
-

 Summary: Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
 Key: SPARK-30327
 URL: https://issues.apache.org/jira/browse/SPARK-30327
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.3.2
Reporter: lujun


val edgeRdd: RDD[Edge[Int]] = rdd.map { rec =>
  Edge(rec._2._1.getOldcid, rec._2._1.getNewcid, 0)
}
val vertexRdd: RDD[(Long, String)] = rdd.map { rec =>
  (rec._2._1.getOldcid, rec._2._1.getCustomer_id)
}
val returnRdd = Graph(vertexRdd, edgeRdd).connectedComponents().vertices
  .join(vertexRdd)
  .map { case (cid, (groupid, cus)) => (cus, groupid) }

 

For the same batch of data, the job sometimes succeeds and sometimes fails 
with the error below:

 

Exception in thread "main" java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
 at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 2 in stage 24374.0 failed 4 times, most recent failure: Lost task 2.3 in 
stage 24374.0 (TID 133352, lx-es-04, executor 0): 
java.lang.ArrayIndexOutOfBoundsException: -1
 at 
org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
 at 
org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
 at 
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
 at 
org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:71)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
 at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
 at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
 at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
 at scala.Option.foreach(Option.scala:257)
 at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
 at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2131)
 at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1035)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:1017)
 at org.apache.spark.graphx.impl.VertexRDDImpl.count(VertexRDDImpl.scala:90)
 at org.apache.spark.graphx.Pregel$.apply(Pregel.scala:140)
 at 
org.apache.spark.graphx.lib.ConnectedComponents$.run(ConnectedComponents.scala:54)
 at 

[jira] [Comment Edited] (SPARK-6221) SparkSQL should support auto merging output files

2019-12-22 Thread Tian Tian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002085#comment-17002085
 ] 

Tian Tian edited comment on SPARK-6221 at 12/23/19 3:24 AM:


I encountered this problem and found 
[issue-24940](https://issues.apache.org/jira/browse/SPARK-24940).

Using {quote}/*+ COALESCE(numPartitions) */{quote} or {quote}/*+ 
REPARTITION(numPartitions) */{quote} in a Spark SQL query controls the number 
of output files.

In practice I recommend the second hint, because it adds a new stage to do 
this work, while the first one does not, which can leave the job stuck with 
too few tasks in the last stage.


was (Author: tian tian):
I encountered this problem and found 
[issue-24940](https://issues.apache.org/jira/browse/SPARK-24940).

Using `/*+ COALESCE(numPartitions) */` or `/*+ REPARTITION(numPartitions) */` 
in a Spark SQL query controls the number of output files.

In practice I recommend the second hint, because it adds a new stage to do 
this work, while the first one does not, which can leave the job stuck with 
too few tasks in the last stage.

> SparkSQL should support auto merging output files
> -
>
> Key: SPARK-6221
> URL: https://issues.apache.org/jira/browse/SPARK-6221
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tianyi Wang
>Priority: Major
>
> Hive has a feature that automatically merges small files in an HQL query's 
> output path.
> This feature is quite useful in cases where people use {{insert into}} to 
> load per-minute data from the input path into a daily table.
> In that case, if the SQL includes a {{group by}} or {{join}} operation, we 
> always set the {{reduce number}} to at least 200 to avoid possible OOM on 
> the reduce side.
> That makes the SQL output at least 200 files at the end of the execution, so 
> the daily table will finally contain more than 5 files.
> If we could provide the same feature in SparkSQL, it would greatly reduce 
> HDFS operations and Spark tasks when we run other SQL on this table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6221) SparkSQL should support auto merging output files

2019-12-22 Thread Tian Tian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002085#comment-17002085
 ] 

Tian Tian edited comment on SPARK-6221 at 12/23/19 3:21 AM:


I encountered this problem and found 
[issue-24940](https://issues.apache.org/jira/browse/SPARK-24940).

Using `/*+ COALESCE(numPartitions) */` or `/*+ REPARTITION(numPartitions) */` 
in a Spark SQL query controls the number of output files.

In practice I recommend the second hint, because it adds a new stage to do 
this work, while the first one does not, which can leave the job stuck with 
too few tasks in the last stage.


was (Author: tian tian):
I encountered this problem and found 
[issue-24940](https://issues.apache.org/jira/browse/SPARK-24940).

Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in a 
Spark SQL query controls the number of output files.

In practice I recommend the second hint, because it adds a new stage to do 
this work, while the first one does not, which can leave the job stuck with 
too few tasks in the last stage.

> SparkSQL should support auto merging output files
> -
>
> Key: SPARK-6221
> URL: https://issues.apache.org/jira/browse/SPARK-6221
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tianyi Wang
>Priority: Major
>
> Hive has a feature that automatically merges small files in an HQL query's 
> output path.
> This feature is quite useful in cases where people use {{insert into}} to 
> load per-minute data from the input path into a daily table.
> In that case, if the SQL includes a {{group by}} or {{join}} operation, we 
> always set the {{reduce number}} to at least 200 to avoid possible OOM on 
> the reduce side.
> That makes the SQL output at least 200 files at the end of the execution, so 
> the daily table will finally contain more than 5 files.
> If we could provide the same feature in SparkSQL, it would greatly reduce 
> HDFS operations and Spark tasks when we run other SQL on this table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6221) SparkSQL should support auto merging output files

2019-12-22 Thread Tian Tian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002085#comment-17002085
 ] 

Tian Tian commented on SPARK-6221:
--

I encountered this problem and found 
[issue-24940](https://issues.apache.org/jira/browse/SPARK-24940).

Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in a 
Spark SQL query controls the number of output files.

In practice I recommend the second hint, because it adds a new stage to do 
this work, while the first one does not, which can leave the job stuck with 
too few tasks in the last stage.
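
For example (the table names below are made up, just to show the hint placement):
{code:scala}
// REPARTITION(n) adds a shuffle stage, so the final stage still has n tasks;
// COALESCE(n) avoids the shuffle but can leave too few tasks in the last stage.
spark.sql("""
  INSERT INTO daily_table
  SELECT /*+ REPARTITION(200) */ *
  FROM minute_table
""")
{code}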

> SparkSQL should support auto merging output files
> -
>
> Key: SPARK-6221
> URL: https://issues.apache.org/jira/browse/SPARK-6221
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Tianyi Wang
>Priority: Major
>
> Hive has a feature that automatically merges small files in an HQL query's 
> output path.
> This feature is quite useful in cases where people use {{insert into}} to 
> load per-minute data from the input path into a daily table.
> In that case, if the SQL includes a {{group by}} or {{join}} operation, we 
> always set the {{reduce number}} to at least 200 to avoid possible OOM on 
> the reduce side.
> That makes the SQL output at least 200 files at the end of the execution, so 
> the daily table will finally contain more than 5 files.
> If we could provide the same feature in SparkSQL, it would greatly reduce 
> HDFS operations and Spark tasks when we run other SQL on this table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30316) data size boom after shuffle writing dataframe save as parquet

2019-12-22 Thread Cesc (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002049#comment-17002049
 ] 

Cesc  commented on SPARK-30316:
---

However, the rows of the two results are the same.

> data size boom after shuffle writing dataframe save as parquet
> --
>
> Key: SPARK-30316
> URL: https://issues.apache.org/jira/browse/SPARK-30316
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.4.4
>Reporter: Cesc 
>Priority: Blocker
>
> When I read the same parquet file and then save it in two ways, with and 
> without a shuffle, I find that the sizes of the output parquet files are 
> quite different. For example, for an original parquet file of 800 MB, saving 
> it without a shuffle keeps the size at 800 MB, whereas if I call repartition 
> and then save it in parquet format, the data size increases to 2.5 GB. The 
> row counts, column counts and contents of the two output files are all the 
> same.
> I wonder:
> firstly, why does the data size increase after a repartition/shuffle?
> secondly, if I need to shuffle the input dataframe, how can I save it as a 
> parquet file efficiently and avoid the data size blow-up?
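
A minimal way to reproduce the comparison described above (the paths and the partition count are placeholders):
{code:scala}
val df = spark.read.parquet("/path/to/input")      // ~800 MB of parquet on disk

// Save without a shuffle: the output stays around the original size.
df.write.parquet("/path/to/no_shuffle")

// Save after a round-robin repartition: rows lose their original clustering,
// which can hurt parquet's encoding/compression and inflate the output size.
df.repartition(200).write.parquet("/path/to/with_shuffle")
{code}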



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30326) Raise exception if analyzer exceed max iterations

2019-12-22 Thread Xin Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-30326:
---
Description: Currently, both the analyzer and the optimizer just log a 
warning message if rule execution exceeds the max iterations. They should 
behave differently: the analyzer should raise an exception to indicate that 
the plan did not reach a fixed point within the max iterations, while the 
optimizer should just log a warning and keep the current plan. This is more 
feasible after SPARK-30138 was introduced.  (was: Currently, both the analyzer 
and the optimizer just log a warning message if rule execution exceeds the max 
iterations. They should behave differently: the analyzer should raise an 
exception to indicate that resolving the logical plan failed, while the 
optimizer should just log a warning and keep the current plan. This is more 
feasible after SPARK-30138 was introduced.)

> Raise exception if analyzer exceed max iterations
> -
>
> Key: SPARK-30326
> URL: https://issues.apache.org/jira/browse/SPARK-30326
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xin Wu
>Priority: Major
>
> Currently, both the analyzer and the optimizer just log a warning message if 
> rule execution exceeds the max iterations. They should behave differently: 
> the analyzer should raise an exception to indicate that the plan did not 
> reach a fixed point within the max iterations, while the optimizer should 
> just log a warning and keep the current plan. This is more feasible after 
> SPARK-30138 was introduced.
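
A rough sketch of the intended difference in behavior (hypothetical names and structure, not Spark's actual RuleExecutor code):
{code:scala}
// Generic fixed-point loop; failOnExceed = true models the analyzer,
// failOnExceed = false models the optimizer.
def runToFixedPoint[P](plan: P, applyRules: P => P,
                       maxIterations: Int, failOnExceed: Boolean): P = {
  var current = plan
  var converged = false
  var i = 0
  while (!converged && i < maxIterations) {
    val next = applyRules(current)
    converged = next == current
    current = next
    i += 1
  }
  if (!converged) {
    val msg = s"Plan did not reach a fixed point after $maxIterations iterations"
    if (failOnExceed) throw new IllegalStateException(msg)  // analyzer: fail
    else println(s"WARN: $msg, keeping the current plan")   // optimizer: warn
  }
  current
}
{code}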



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30269) Should use old partition stats to decide whether to update stats when analyzing partition

2019-12-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30269.
--
Fix Version/s: (was: 3.0.0)
   Resolution: Fixed

Issue resolved by pull request 26963
[https://github.com/apache/spark/pull/26963]

> Should use old partition stats to decide whether to update stats when 
> analyzing partition
> -
>
> Key: SPARK-30269
> URL: https://issues.apache.org/jira/browse/SPARK-30269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.4.5
>
>
> It's an obvious bug: currently, when analyzing partition stats, we compare 
> the newly computed stats against the old table stats (rather than the old 
> partition stats) to decide whether the stats should be updated or not.
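
A schematic of the comparison being fixed (hypothetical names, not the actual Spark code):
{code:scala}
// Stand-in for catalog statistics; only the fields needed for the comparison.
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

// Decide per partition: compare the newly computed stats against that
// partition's own previous stats, not against the table-level stats.
def shouldUpdate(newStats: Stats, oldPartitionStats: Option[Stats]): Boolean =
  oldPartitionStats.forall(_ != newStats)
{code}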



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30269) Should use old partition stats to decide whether to update stats when analyzing partition

2019-12-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30269:


Assignee: Zhenhua Wang

> Should use old partition stats to decide whether to update stats when 
> analyzing partition
> -
>
> Key: SPARK-30269
> URL: https://issues.apache.org/jira/browse/SPARK-30269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> It's an obvious bug: currently, when analyzing partition stats, we compare 
> the newly computed stats against the old table stats (rather than the old 
> partition stats) to decide whether the stats should be updated or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30128) Promote remaining "hidden" PySpark DataFrameReader options to load APIs

2019-12-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30128:


Assignee: Hyukjin Kwon

> Promote remaining "hidden" PySpark DataFrameReader options to load APIs
> ---
>
> Key: SPARK-30128
> URL: https://issues.apache.org/jira/browse/SPARK-30128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> Following on from SPARK-29903 and similar issues (linked), there are options 
> available on the DataFrameReader for certain source formats which are not 
> exposed properly in the relevant APIs.
> These options include `timeZone` and `pathGlobFilter`. Instead of only being 
> noted under [the option() 
> method|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.option],
>  they should be exposed directly in the load APIs that support them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30128) Promote remaining "hidden" PySpark DataFrameReader options to load APIs

2019-12-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30128.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26958
[https://github.com/apache/spark/pull/26958]

> Promote remaining "hidden" PySpark DataFrameReader options to load APIs
> ---
>
> Key: SPARK-30128
> URL: https://issues.apache.org/jira/browse/SPARK-30128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> Following on from SPARK-29903 and similar issues (linked), there are options 
> available on the DataFrameReader for certain source formats which are not 
> exposed properly in the relevant APIs.
> These options include `timeZone` and `pathGlobFilter`. Instead of only being 
> noted under [the option() 
> method|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.option],
>  they should be exposed directly in the load APIs that support them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27762) Support user provided avro schema for writing fields with different ordering

2019-12-22 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27762:

Issue Type: Bug  (was: New Feature)

> Support user provided avro schema for writing fields with different ordering
> 
>
> Key: SPARK-27762
> URL: https://issues.apache.org/jira/browse/SPARK-27762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org