[jira] [Created] (SPARK-31391) Add AdaptiveTestUtils to ease the test of AQE

2020-04-08 Thread wuyi (Jira)
wuyi created SPARK-31391:


 Summary: Add AdaptiveTestUtils to ease the test of AQE
 Key: SPARK-31391
 URL: https://issues.apache.org/jira/browse/SPARK-31391
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: wuyi


Tests related to AQE currently contain a lot of duplicated code; we can use some 
utility functions to make these tests simpler.
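As a rough illustration only (the object and method below are hypothetical, not the 
API this ticket will add), such a utility could centralize the boilerplate of 
extracting the final plan chosen by AQE:

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec

// Hypothetical sketch of a shared test helper; the real AdaptiveTestUtils may differ.
object AdaptiveTestUtils {
  // Materializes the query so AQE finalizes its plan, then returns the
  // final physical plan (or the plan itself if AQE did not kick in).
  def finalAdaptivePlan(df: DataFrame): SparkPlan = {
    df.collect()
    df.queryExecution.executedPlan match {
      case a: AdaptiveSparkPlanExec => a.executedPlan
      case plan => plan
    }
  }
}
{code}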






[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat

2020-04-08 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078958#comment-17078958
 ] 

zhengruifeng commented on SPARK-31301:
--

[~srowen] There are two methods now:
{code:java}
@Since("2.2.0")
def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame = {
  val spark = dataset.sparkSession
  import spark.implicits._

  SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
  SchemaUtils.checkNumericType(dataset.schema, labelCol)
  val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)]
    .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) }
  val testResults = OldStatistics.chiSqTest(rdd)
  val pValues = Vectors.dense(testResults.map(_.pValue))
  val degreesOfFreedom = testResults.map(_.degreesOfFreedom)
  val statistics = Vectors.dense(testResults.map(_.statistic))
  spark.createDataFrame(Seq(ChiSquareResult(pValues, degreesOfFreedom, statistics)))
}

@Since("3.1.0")
def testChiSquare(
    dataset: Dataset[_],
    featuresCol: String,
    labelCol: String): Array[SelectionTestResult] = {

  SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT)
  SchemaUtils.checkNumericType(dataset.schema, labelCol)
  val input = dataset.select(col(labelCol).cast(DoubleType), col(featuresCol)).rdd
    .map { case Row(label: Double, features: Vector) =>
      OldLabeledPoint(label, OldVectors.fromML(features))
    }
  val chiTestResult = OldStatistics.chiSqTest(input)
  chiTestResult.map(r => new ChiSqTestResult(r.pValue, r.degreesOfFreedom, r.statistic))
}
{code}
 

The newly added method is targeted at 3.1.0, so we can modify its return type 
without breaking the API.
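For illustration, a minimal sketch of the flattened layout being proposed (column 
names are only an assumption, not a decided API), reusing {{spark}} and 
{{testResults}} from the first snippet above:

{code:java}
import org.apache.spark.sql.functions.col

// One row per feature instead of a single row of vectors, so callers can
// filter the result directly; the column names are illustrative only.
val flattened = spark.createDataFrame(
  testResults.zipWithIndex.toSeq.map { case (r, i) =>
    (i, r.pValue, r.degreesOfFreedom, r.statistic)
  }
).toDF("featureIndex", "pValue", "degreesOfFreedom", "statistic")

// e.g. keep only the features whose p-value exceeds 0.1
flattened.filter(col("pValue") > 0.1).show()
{code}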

 

 

> flatten the result dataframe of tests in stat
> -
>
> Key: SPARK-31301
> URL: https://issues.apache.org/jira/browse/SPARK-31301
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]),
> (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array ... 1 more field]
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>  
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, {{Correlation}} 
> all return a df containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000, the 
> only operation we can perform on the test result is to collect it with {{head()}} 
> or {{first()}} and then use it in the driver.
> What I really want to do is filter the df, e.g. {{pValue > 0.1}} or {{corr < 0.5}}, 
> *so I suggest flattening the output df in those tests.*
>  
> note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but 
> {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
>  
>  
>  
>  






[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat

2020-04-08 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078902#comment-17078902
 ] 

Sean R. Owen commented on SPARK-31301:
--

I guess so, but doesn't it become inconsistent with existing similar methods?

> flatten the result dataframe of tests in stat
> -
>
> Key: SPARK-31301
> URL: https://issues.apache.org/jira/browse/SPARK-31301
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]),
> (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array ... 1 more field]
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>  
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, {{Correlation}} 
> all return a df containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000, the 
> only operation we can perform on the test result is to collect it with {{head()}} 
> or {{first()}} and then use it in the driver.
> What I really want to do is filter the df, e.g. {{pValue > 0.1}} or {{corr < 0.5}}, 
> *so I suggest flattening the output df in those tests.*
>  
> note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but 
> {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
>  
>  
>  
>  






[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat

2020-04-08 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078898#comment-17078898
 ] 

zhengruifeng commented on SPARK-31301:
--

[~srowen] What do you think about changing the return type of the newly added 
method "testChiSquare" so that it returns flattened rows?

This method currently returns "Array[SelectionTestResult]", which is similar to 
the old method "test", which returns a single row.

> flatten the result dataframe of tests in stat
> -
>
> Key: SPARK-31301
> URL: https://issues.apache.org/jira/browse/SPARK-31301
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]),
> (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array ... 1 more field]
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>  
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, {{Correlation}} 
> all return a df containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000, the 
> only operation we can perform on the test result is to collect it with {{head()}} 
> or {{first()}} and then use it in the driver.
> What I really want to do is filter the df, e.g. {{pValue > 0.1}} or {{corr < 0.5}}, 
> *so I suggest flattening the output df in those tests.*
>  
> note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but 
> {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
>  
>  
>  
>  






[jira] [Updated] (SPARK-31368) The query with the where condition fails when the partition field is null

2020-04-08 Thread tanweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanweihua updated SPARK-31368:
--
Component/s: (was: Spark Shell)
 (was: Spark Core)
 (was: PySpark)

> The query with the where condition fails when the partition field is null
> --
>
> Key: SPARK-31368
> URL: https://issues.apache.org/jira/browse/SPARK-31368
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5
> Environment: 1、Linux environment:  CentOS Linux release 7.3.1611 or 
> CentOS Linux release 7.5.1804
> 2、Spark Client  environment: Spark-2.4.4-bin-hadoop2.6 or 
> Spark-2.4.5-bin-hadoop2.6
> 3、Hadoop environment: hadoop-2.6.0-cdh5.8.4
> 4、Hive environment: hive-1.1.0-cdh5.8.4
> 5、Java environment: jdk1.8.0_181
> 6、Python environment: python 2.7.5
>Reporter: tanweihua
>Priority: Major
>
> h3. Steps to reproduce:
>  # create table test_1(id int, name string) partitioned by(profile string)
>  # insert into test_1 values(1, null)
>  # select * from test_1 where profile is null
> After the above steps, the query returns nothing. But if we add the condition 
> profile='__HIVE_DEFAULT_PARTITION__', the result is correct.
> h3. Temporary workaround:
> select * from test_1 where profile is null or 
> profile='__HIVE_DEFAULT_PARTITION__'
> returns the expected result.
> h3. Notes:
> 1. This only happens when the partition field type is string.
> 2. The same operations work fine in Hive.
> h3. Where the problem seems to be:
> As far as I can tell, the problem is in 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils and 
> org.apache.spark.sql.catalyst.catalog.CatalogTablePartition, especially the 
> toRow function in CatalogTablePartition.
>  
>  
>  






[jira] [Resolved] (SPARK-30818) Add LinearRegression wrapper to SparkR

2020-04-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30818.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27593
[https://github.com/apache/spark/pull/27593]

> Add LinearRegression wrapper to SparkR
> --
>
> Key: SPARK-30818
> URL: https://issues.apache.org/jira/browse/SPARK-30818
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Spark should provide a wrapper for {{o.a.s.ml.regression.LinearRegression}}.






[jira] [Assigned] (SPARK-30818) Add LinearRegression wrapper to SparkR

2020-04-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30818:


Assignee: Maciej Szymkiewicz

> Add LinearRegression wrapper to SparkR
> --
>
> Key: SPARK-30818
> URL: https://issues.apache.org/jira/browse/SPARK-30818
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Spark should provide a wrapper for {{o.a.s.ml.regression.LinearRegression}}.






[jira] [Resolved] (SPARK-31309) Migrate the ChiSquareTest from MLlib to ML

2020-04-08 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31309.
--
Resolution: Not A Problem

> Migrate the ChiSquareTest from MLlib to ML
> --
>
> Key: SPARK-31309
> URL: https://issues.apache.org/jira/browse/SPARK-31309
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> Move the impl of ChiSq from .mllib to the .ml side, and make .mllib.ChiSq a 
> wrapper of the .ml side.






[jira] [Assigned] (SPARK-31382) Show a better error message for different python and pip installation mistake

2020-04-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31382:


Assignee: Hyukjin Kwon

> Show a better error message for different python and pip installation mistake
> -
>
> Key: SPARK-31382
> URL: https://issues.apache.org/jira/browse/SPARK-31382
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> See 
> https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560






[jira] [Resolved] (SPARK-31382) Show a better error message for different python and pip installation mistake

2020-04-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31382.
--
Fix Version/s: 3.0.0
   2.4.6
   Resolution: Fixed

Issue resolved by pull request 28152
[https://github.com/apache/spark/pull/28152]

> Show a better error message for different python and pip installation mistake
> -
>
> Key: SPARK-31382
> URL: https://issues.apache.org/jira/browse/SPARK-31382
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> See 
> https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560






[jira] [Created] (SPARK-31390) Document Window Function

2020-04-08 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31390:
--

 Summary: Document Window Function
 Key: SPARK-31390
 URL: https://issues.apache.org/jira/browse/SPARK-31390
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Document Window Function






[jira] [Resolved] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data

2020-04-08 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-29314.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/25987]

> ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 
> 0 when it actually runs a batch even with no data
> --
>
> Key: SPARK-29314
> URL: https://issues.apache.org/jira/browse/SPARK-29314
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-24156 brought the ability to run a batch without actual data, to enable 
> fast state cleanup as well as to emit evicted outputs without waiting for 
> actual data to arrive.
> This breaks an assumption in 
> `ProgressReporter.extractStateOperatorMetrics`. See the comment in the source code:
> {code:java}
> // lastExecution could belong to one of the previous triggers if `!hasNewData`.
> // Walking the plan again should be inexpensive.
> {code}
> and newNumRowsUpdated is replaced with 0 if hasNewData is false. That makes 
> sense when we copy the progress from the previous execution (which means no 
> batch was run this time), but after SPARK-24156 that precondition is broken. 
> Spark should still replace the value of newNumRowsUpdated with 0 if no batch 
> was run and it needs to copy the old value from the previous execution, but it 
> shouldn't touch the value if it runs a batch even with no data.
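A compressed sketch of the intended rule (the variable names here just mirror the 
description above; the actual code in ProgressReporter differs):

{code:java}
// Hedged sketch: only zero out the "updated" count when the progress is copied
// from a previous execution; keep the real value when a (possibly empty) batch ran.
// `ranBatchThisTrigger` is a hypothetical name for that condition.
val numRowsUpdated: Long =
  if (!ranBatchThisTrigger) 0L      // metrics copied from the last execution
  else newNumRowsUpdated            // a batch ran, even with no new data
{code}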






[jira] [Assigned] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data

2020-04-08 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-29314:
---

Assignee: Jungtaek Lim

> ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 
> 0 when it actually runs a batch even with no data
> --
>
> Key: SPARK-29314
> URL: https://issues.apache.org/jira/browse/SPARK-29314
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> SPARK-24156 brought the ability to run a batch without actual data, to enable 
> fast state cleanup as well as to emit evicted outputs without waiting for 
> actual data to arrive.
> This breaks an assumption in 
> `ProgressReporter.extractStateOperatorMetrics`. See the comment in the source code:
> {code:java}
> // lastExecution could belong to one of the previous triggers if `!hasNewData`.
> // Walking the plan again should be inexpensive.
> {code}
> and newNumRowsUpdated is replaced with 0 if hasNewData is false. That makes 
> sense when we copy the progress from the previous execution (which means no 
> batch was run this time), but after SPARK-24156 that precondition is broken. 
> Spark should still replace the value of newNumRowsUpdated with 0 if no batch 
> was run and it needs to copy the old value from the previous execution, but it 
> shouldn't touch the value if it runs a batch even with no data.






[jira] [Commented] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off

2020-04-08 Thread Srinivas Rishindra Pothireddi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078803#comment-17078803
 ] 

Srinivas Rishindra Pothireddi commented on SPARK-31389:
---

I am working on this.

> Ensure all tests in SQLMetricsSuite run with both codegen on and off
> 
>
> Key: SPARK-31389
> URL: https://issues.apache.org/jira/browse/SPARK-31389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Minor
>
> Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
> code paths (for example, generated code in "SortMergeJoin metrics") aren't 
> exercised at all. The generated code should be tested as well.
> *List of tests that run with codegen off*
> Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
> BroadcastHashJoin metrics,  ShuffledHashJoin metrics, 
> BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, 
> BroadcastLeftSemiJoinHash metrics
>  






[jira] [Updated] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off

2020-04-08 Thread Srinivas Rishindra Pothireddi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivas Rishindra Pothireddi updated SPARK-31389:
--
Description: 
Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
code paths (for example, generated code in "SortMergeJoin metrics") aren't 
exercised at all. The generated code should be tested as well.

*List of tests that run with codegen off*

Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
BroadcastHashJoin metrics,  ShuffledHashJoin metrics, BroadcastHashJoin(outer) 
metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics

 

  was:
Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
code paths (for example, generated code in "SortMergeJoin metrics") aren't 
exercised at all.

*List of tests that run with codegen off*

Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
BroadcastHashJoin metrics,  ShuffledHashJoin metrics, BroadcastHashJoin(outer) 
metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics

The generated code should be tested as well.


> Ensure all tests in SQLMetricsSuite run with both codegen on and off
> 
>
> Key: SPARK-31389
> URL: https://issues.apache.org/jira/browse/SPARK-31389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Minor
>
> Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
> code paths (for example, generated code in "SortMergeJoin metrics") aren't 
> exercised at all. The generated code should be tested as well.
> *List of tests that run with codegen off*
> Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
> BroadcastHashJoin metrics,  ShuffledHashJoin metrics, 
> BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, 
> BroadcastLeftSemiJoinHash metrics
>  






[jira] [Created] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off

2020-04-08 Thread Srinivas Rishindra Pothireddi (Jira)
Srinivas Rishindra Pothireddi created SPARK-31389:
-

 Summary: Ensure all tests in SQLMetricsSuite run with both codegen 
on and off
 Key: SPARK-31389
 URL: https://issues.apache.org/jira/browse/SPARK-31389
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.1.0
Reporter: Srinivas Rishindra Pothireddi


Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
code paths (for example, generated code in "SortMergeJoin metrics") aren't 
exercised at all.

*List of tests that run with codegen off*

Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
BroadcastHashJoin metrics,  ShuffledHashJoin metrics, BroadcastHashJoin(outer) 
metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics

The generated code should be tested as well.
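A rough sketch of the usual pattern for covering both modes (assuming the standard 
test helpers available to SQLMetricsSuite; the exact refactoring may differ):

{code:java}
import org.apache.spark.sql.internal.SQLConf

// Sketch: register each affected test twice, once per codegen setting.
// withSQLConf comes from SQLTestUtils / SharedSparkSession and must be used
// inside a test suite; the body below is a placeholder for the existing assertions.
Seq(true, false).foreach { enableCodegen =>
  test(s"SortMergeJoin metrics (codegen=$enableCodegen)") {
    withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> enableCodegen.toString) {
      // ... existing "number of output rows" / metric assertions go here ...
    }
  }
}
{code}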






[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer

2020-04-08 Thread Nick Afshartous (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078751#comment-17078751
 ] 

Nick Afshartous commented on SPARK-27249:
-

[~enrush] Hi Everett, checking back on my question in the last comment.  

> Developers API for Transformers beyond UnaryTransformer
> ---
>
> Key: SPARK-27249
> URL: https://issues.apache.org/jira/browse/SPARK-27249
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Everett Rush
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that 
> need more than one column from a row (i.e. UnaryTransformer inputs one column 
> and outputs one column) or that contain objects too expensive to initialize 
> repeatedly in a UDF, such as a database connection.
>  
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row]
> NB: This parallels the UnaryTransformer createTransformFunc method
>  
> When developers subclass this transformer, they can provide their own schema 
> for the output Row in which case the PartitionTransformer creates a row 
> encoder and executes the transformation. Alternatively the developer can set 
> output Datatype and output col name. Then the PartitionTransformer class will 
> create a new schema, a row encoder, and execute the transformation.
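A hedged sketch of what such an abstract class could look like (the names and 
details below only illustrate the description above; they are not an agreed design):

{code:java}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Illustrative sketch of the proposed developer API.
abstract class PartitionTransformer(override val uid: String) extends Transformer {

  // Subclasses provide the per-partition transformation, mirroring
  // UnaryTransformer.createTransformFunc but over whole rows.
  protected def createPartitionFunc: Iterator[Row] => Iterator[Row]

  // Subclasses declare the schema of the rows they emit.
  protected def outputSchema(inputSchema: StructType): StructType

  override def transform(dataset: Dataset[_]): DataFrame = {
    val schema = outputSchema(dataset.schema)
    val encoder = RowEncoder(schema)     // row encoder for the declared output schema
    dataset.toDF().mapPartitions(createPartitionFunc)(encoder)
  }

  override def transformSchema(schema: StructType): StructType = outputSchema(schema)

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
{code}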






[jira] [Updated] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set

2020-04-08 Thread Viacheslav Krot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viacheslav Krot updated SPARK-31386:

Description: 
The following code with a UDF causes a MemoryError when 
`spark.executor.pyspark.memory` is set:

```

from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

df = spark.createDataFrame([
    ('Alice', 10),
    ('Bob', 12)
], ['name', 'cnt'])

broadcast = spark.sparkContext.broadcast([1, 2, 3])

@udf(BooleanType())
def f(cnt):
    return cnt < len(broadcast.value)

df.filter(f(df.cnt)).count()

```

The same code works well when spark.executor.pyspark.memory is not set.

The code by itself does not make much sense; it is just the simplest code that 
reproduces the bug.

 

Error:

```

20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, 
ip-172-31-32-201.us-east-2.compute.internal, executor 2): 
org.apache.spark.api.python.PythonException: Traceback (most recent call 
last):20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, 
ip-172-31-32-201.us-east-2.compute.internal, executor 2): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
 File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py",
 line 377, in main    process()  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py",
 line 372, in process    serializer.dump_stream(func(split_index, iterator), 
outfile)  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py",
 line 345, in dump_stream    
self.serializer.dump_stream(self._batched(iterator), stream)  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py",
 line 141, in dump_stream    for obj in iterator:  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py",
 line 334, in _batched    for item in iterator:  File "", line 1, in 
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py",
 line 85, in     return lambda *a: f(*a)  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/util.py",
 line 113, in wrapper    return f(*args, **kwargs)  File "", line 3, in 
f  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py",
 line 148, in value    self._value = self.load_from_path(self._path)  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py",
 line 124, in load_from_path    with open(path, 'rb', 1 << 20) as f:MemoryError
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
 at 
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
 at 
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
 at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) 
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
 Source) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) 
at org.apache.spark.scheduler.Task.run(Task.scala:123) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
java.util.concurrent.ThreadPoolExecutor.

[jira] [Resolved] (SPARK-31009) Support json_object_keys function

2020-04-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31009.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27836
[https://github.com/apache/spark/pull/27836]

> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Assignee: Rakesh Raushan
>Priority: Major
> Fix For: 3.1.0
>
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> Mysql -> 
> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]
> MariaDB -> [https://mariadb.com/kb/en/json-functions/]
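For reference, an assumed usage sketch (based on the description and the linked 
references, not on the merged patch itself):

{code:java}
// Assumed behavior: returns the keys of the outermost JSON object only;
// nested keys (here "c") should not be listed.
spark.sql("""SELECT json_object_keys('{"a": 1, "b": {"c": 2}}')""").show(false)
// expected something like: [a, b]
{code}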






[jira] [Assigned] (SPARK-31009) Support json_object_keys function

2020-04-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31009:
-

Assignee: Rakesh Raushan

> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Assignee: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> Mysql -> 
> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]
> MariaDB -> [https://mariadb.com/kb/en/json-functions/]






[jira] [Created] (SPARK-31388) org.apache.spark.sql.hive.thriftserver.CliSuite result matching is flaky

2020-04-08 Thread Juliusz Sompolski (Jira)
Juliusz Sompolski created SPARK-31388:
-

 Summary: org.apache.spark.sql.hive.thriftserver.CliSuite result 
matching is flaky
 Key: SPARK-31388
 URL: https://issues.apache.org/jira/browse/SPARK-31388
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Juliusz Sompolski


CliSuite.runCliWithin result matching has issues. Will describe in PR.






[jira] [Commented] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite

2020-04-08 Thread Srinivas Rishindra Pothireddi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078524#comment-17078524
 ] 

Srinivas Rishindra Pothireddi commented on SPARK-31377:
---

I am working on this

> Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
> --
>
> Key: SPARK-31377
> URL: https://issues.apache.org/jira/browse/SPARK-31377
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Minor
>
> For some combinations of join algorithm and join type there are no unit 
> tests for the "number of output rows" metric.
> The missing unit tests include the following:
>  * SortMergeJoin: ExistenceJoin
>  * ShuffledHashJoin: OuterJoin, LeftOuter, RightOuter, LeftAnti, LeftSemi, 
> ExistenceJoin
>  * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin
>  * BroadcastHashJoin: LeftAnti, ExistenceJoin
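A rough sketch of one such test, for instance the BroadcastHashJoin/LeftAnti case 
listed above (assuming the usual approach of inspecting the executed plan's 
metrics; data and expected values are made up for illustration):

{code:java}
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

// Left side has ids 0-9, right side 0-4, so left_anti should output 5 rows.
val df = spark.range(10).toDF("id")
  .join(spark.range(5).toDF("id"), Seq("id"), "left_anti")
df.collect()  // run the query so the metric is populated

val join = df.queryExecution.executedPlan.collect {
  case j: BroadcastHashJoinExec => j
}.head
assert(join.metrics("numOutputRows").value == 5)
{code}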






[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2020-04-08 Thread Venkata krishnan Sowrirajan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078520#comment-17078520
 ] 

Venkata krishnan Sowrirajan commented on SPARK-22148:
-

Thanks for your comments, [~tgraves]. Makes sense; I will think about it more, 
create a new JIRA, and share a new proposal based on how we think about it 
internally.

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
>Assignee: Dhruve Ashar
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors to the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 






[jira] [Updated] (SPARK-31387) HiveThriftServer2Listener update methods fail with unknown operation/session id

2020-04-08 Thread Ali Smesseim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Smesseim updated SPARK-31387:
-
Description: HiveThriftServer2Listener update methods, such as 
onSessionClosed and onOperationError throw a NullPointerException (in Spark 3) 
or a NoSuchElementException (in Spark 2) when the input session/operation id is 
unknown. In Spark 2, this can cause control flow issues with the caller of the 
listener. In Spark 3, the listener is called by a ListenerBus which catches the 
exception, but it would still be nicer if an invalid update is logged and does 
not throw an exception.  (was: HiveThriftServer2Listener update methods, such 
as onSessionClosed and onOperationError throw a NullPointerException (in Spark 
3) or a NoSuchElementException (in Spark 2) when the input session/operation id 
is unknown. In Spark 2, this can cause control flow issues with the caller of 
the listener. In Spark 3, the listener is called by a ListenerBus which catches 
the exception, but it would still be nicer if an invalid update is logged and 
not throw an exception.)

> HiveThriftServer2Listener update methods fail with unknown operation/session 
> id
> ---
>
> Key: SPARK-31387
> URL: https://issues.apache.org/jira/browse/SPARK-31387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.5, 3
>Reporter: Ali Smesseim
>Priority: Major
>
> HiveThriftServer2Listener update methods, such as onSessionClosed and 
> onOperationError throw a NullPointerException (in Spark 3) or a 
> NoSuchElementException (in Spark 2) when the input session/operation id is 
> unknown. In Spark 2, this can cause control flow issues with the caller of 
> the listener. In Spark 3, the listener is called by a ListenerBus which 
> catches the exception, but it would still be nicer if an invalid update is 
> logged and does not throw an exception.






[jira] [Created] (SPARK-31387) HiveThriftServer2Listener update methods fail with unknown operation/session id

2020-04-08 Thread Ali Smesseim (Jira)
Ali Smesseim created SPARK-31387:


 Summary: HiveThriftServer2Listener update methods fail with 
unknown operation/session id
 Key: SPARK-31387
 URL: https://issues.apache.org/jira/browse/SPARK-31387
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 2.3.4, 3
Reporter: Ali Smesseim


HiveThriftServer2Listener update methods, such as onSessionClosed and 
onOperationError throw a NullPointerException (in Spark 3) or a 
NoSuchElementException (in Spark 2) when the input session/operation id is 
unknown. In Spark 2, this can cause control flow issues with the caller of the 
listener. In Spark 3, the listener is called by a ListenerBus which catches the 
exception, but it would still be nicer if an invalid update is logged and does 
not throw an exception.
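A hedged sketch of the guard this suggests (the class and field names below are 
illustrative, not the listener's actual internals):

{code:java}
import org.slf4j.LoggerFactory

// Illustrative guard pattern only; the real listener's state and types differ.
class SafeListenerSketch {
  private val logger = LoggerFactory.getLogger(getClass)
  private val sessions = scala.collection.concurrent.TrieMap.empty[String, Long]

  def onSessionClosed(sessionId: String, finishTime: Long): Unit = {
    sessions.get(sessionId) match {
      case Some(_) =>
        sessions.update(sessionId, finishTime)   // normal update path
      case None =>
        // log and return instead of throwing NPE / NoSuchElementException
        logger.warn(s"onSessionClosed called with unknown session id: $sessionId")
    }
  }
}
{code}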






[jira] [Resolved] (SPARK-31362) Document Set Operators in SQL Reference

2020-04-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31362.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28139
[https://github.com/apache/spark/pull/28139]

> Document Set Operators in SQL Reference
> ---
>
> Key: SPARK-31362
> URL: https://issues.apache.org/jira/browse/SPARK-31362
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Document Set Operators






[jira] [Assigned] (SPARK-31362) Document Set Operators in SQL Reference

2020-04-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-31362:


Assignee: Huaxin Gao

> Document Set Operators in SQL Reference
> ---
>
> Key: SPARK-31362
> URL: https://issues.apache.org/jira/browse/SPARK-31362
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Document Set Operators






[jira] [Commented] (SPARK-31327) write spark version to avro file metadata

2020-04-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078384#comment-17078384
 ] 

Dongjoon Hyun commented on SPARK-31327:
---

This is backported to `branch-2.4` via 
[https://github.com/apache/spark/pull/28150] to help Apache Spark 3.0.0 handle 
old files more elegantly.

> write spark version to avro file metadata
> -
>
> Key: SPARK-31327
> URL: https://issues.apache.org/jira/browse/SPARK-31327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>







[jira] [Updated] (SPARK-31327) write spark version to avro file metadata

2020-04-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31327:
--
Fix Version/s: 2.4.6

> write spark version to avro file metadata
> -
>
> Key: SPARK-31327
> URL: https://issues.apache.org/jira/browse/SPARK-31327
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>







[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2020-04-08 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078283#comment-17078283
 ] 

Wenchen Fan commented on SPARK-23128:
-

Yes, they are.
https://issues.apache.org/jira/browse/SPARK-28177
https://issues.apache.org/jira/browse/SPARK-29544

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]






[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2020-04-08 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078278#comment-17078278
 ] 

Thomas Graves commented on SPARK-22148:
---

So, off the top of my head, I think the main issue with just requesting more is 
that the dynamic allocation manager isn't tied very tightly to the scheduler or 
the blacklist tracker, so getting the information required to properly track 
why we have more executors than needed took quite a bit more work and code 
refactoring. If you are still seeing issues regularly, though, we could revisit 
this to see whether we could either request more executors or perhaps kill 
blacklisted executors that aren't completely idle. But I would have to re-read 
through these and think about it more. If you have ideas feel free to propose 
them, though we should do it under a new Jira and link them.

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
>Assignee: Dhruve Ashar
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors to the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 






[jira] [Created] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set

2020-04-08 Thread Viacheslav Krot (Jira)
Viacheslav Krot created SPARK-31386:
---

 Summary: Reading broadcast in UDF raises MemoryError when 
spark.executor.pyspark.memory is set
 Key: SPARK-31386
 URL: https://issues.apache.org/jira/browse/SPARK-31386
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
 Environment: Spark 2.4.4 or AWS EMR

`pyspark --conf spark.executor.pyspark.memory=500m`
Reporter: Viacheslav Krot


The following code with a UDF causes a MemoryError when 
`spark.executor.pyspark.memory` is set:

```

from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

df = spark.createDataFrame([
    ('Alice', 10),
    ('Bob', 12)
], ['name', 'cnt'])

broadcast = spark.sparkContext.broadcast([1, 2, 3])

@udf(BooleanType())
def f(cnt):
    return cnt < len(broadcast.value)

df.filter(f(df.cnt)).count()

```

The same code works well when spark.executor.pyspark.memory is not set.

The code by itself does not make much sense; it is just the simplest code that 
reproduces the bug.

 

Error:

```

20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, ip-172-31-32-201.us-east-2.compute.internal, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 345, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 334, in _batched
    for item in iterator:
  File "", line 1, in 
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 85, in 
    return lambda *a: f(*a)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/util.py", line 113, in wrapper
    return f(*args, **kwargs)
  File "", line 3, in f
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", line 148, in value
    self._value = self.load_from_path(self._path)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", line 124, in load_from_path
    with open(path, 'rb', 1 << 20) as f:
MemoryError
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
  at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
  at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.sp
```

[jira] [Comment Edited] (SPARK-31376) Non-global sort support for structured streaming

2020-04-08 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078219#comment-17078219
 ] 

Adam Binford edited comment on SPARK-31376 at 4/8/20, 12:40 PM:


I tried multiple times to add myself to the dev@ mailing list but was 
unsuccessful, which is why I ended up just posting a Jira ticket. It looks like 
it finally worked using the subscribe link on the spark community page 
(subscribing from the mailing list page doesn't seem to work).

Taking the discussion there now that it finally actually worked.


was (Author: kimahriman):
I tried multiple times to add myself to the dev@ mailing list but was 
unsuccessful, which is why I ended up just posting a Jira ticket. It looks like 
it finally worked using the subscribe link on the spark community page 
(subscribing from the mailing list page doesn't seem to work).

> Non-global sort support for structured streaming
> 
>
> Key: SPARK-31376
> URL: https://issues.apache.org/jira/browse/SPARK-31376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Adam Binford
>Priority: Minor
>
> Currently, all sorting is disallowed in structured streaming queries. Not 
> allowing global sorting makes sense, but could non-global sorting (i.e. 
> sortWithinPartitions) be allowed? I'm running into this with an external 
> source I'm using, but I'm not sure whether this would also be useful for file 
> sources. Today I have to use foreachBatch so that I can do a 
> sortWithinPartitions (a sketch of that workaround follows this description).
> Two main questions:
>  * Does a local sort cause issues with any of the exactly-once guarantees 
> streaming queries provide? I can't say I know or understand how these 
> semantics work. Or are there other issues this would cause that I can't think 
> of?
>  * Is the change as simple as changing the unsupported operations check to 
> only look for global sorts instead of all sorts?
> I have built a version that simply changes the unsupported check to only 
> disallow global sorts, and it seems to work. Is there anything I'm missing, 
> or is it this simple?
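
A minimal sketch of the foreachBatch workaround mentioned above, assuming the 
existing `spark` session, a rate source, and illustrative column/path names:

{code:python}
# Hedged sketch of the current workaround: sort each micro-batch locally in
# foreachBatch, since sortWithinPartitions is accepted on the static DataFrame
# handed to the callback even though it is rejected inside a streaming plan.
def sort_and_write(batch_df, batch_id):
    (batch_df
        .sortWithinPartitions("timestamp")   # per-partition (non-global) sort
        .write.mode("append")
        .parquet("/tmp/sorted-output"))      # illustrative sink path

query = (
    spark.readStream.format("rate").load()   # any streaming source works here
    .writeStream
    .foreachBatch(sort_and_write)
    .option("checkpointLocation", "/tmp/sorted-output-ckpt")
    .start()
)
{code}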



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31376) Non-global sort support for structured streaming

2020-04-08 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078219#comment-17078219
 ] 

Adam Binford commented on SPARK-31376:
--

I tried multiple times to add myself to the dev@ mailing list but was 
unsuccessful, which is why I ended up just posting a Jira ticket. It looks like 
it finally worked using the subscribe link on the spark community page 
(subscribing from the mailing list page doesn't seem to work).

> Non-global sort support for structured streaming
> 
>
> Key: SPARK-31376
> URL: https://issues.apache.org/jira/browse/SPARK-31376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Adam Binford
>Priority: Minor
>
> Currently, all sorting is disallowed in structured streaming queries. Not 
> allowing global sorting makes sense, but could non-global sorting (i.e. 
> sortWithinPartitions) be allowed? I'm running into this with an external 
> source I'm using, but I'm not sure whether this would also be useful for file 
> sources. Today I have to use foreachBatch so that I can do a 
> sortWithinPartitions.
> Two main questions:
>  * Does a local sort cause issues with any of the exactly-once guarantees 
> streaming queries provide? I can't say I know or understand how these 
> semantics work. Or are there other issues this would cause that I can't think 
> of?
>  * Is the change as simple as changing the unsupported operations check to 
> only look for global sorts instead of all sorts?
> I have built a version that simply changes the unsupported check to only 
> disallow global sorts, and it seems to work. Is there anything I'm missing, 
> or is it this simple?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31385) Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing

2020-04-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31385:
--

 Summary: Results of Julian-Gregorian rebasing don't match to 
Gregorian-Julian rebasing
 Key: SPARK-31385
 URL: https://issues.apache.org/jira/browse/SPARK-31385
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Microseconds rebasing from the hybrid calendar (Julian + Gregorian) to the 
Proleptic Gregorian calendar is not symmetric to the opposite conversion for the 
following time zones:
 #  Asia/Tehran
 # Iran
 # Africa/Casablanca
 # Africa/El_Aaiun

Here are the results from https://github.com/apache/spark/pull/28119:
Julian -> Gregorian:
{code:json}
, {
  "tz" : "Asia/Tehran",
  "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, 
-46383420600, -43227660600, -40071900600, -33760380600, -30604620600, 
-27448860600, -21137340600, -17981580600, -14825820600, -12219305400, 
-2208988800, 2547315000, 2547401400 ],
  "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, 
-518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ]
}, {
  "tz" : "Iran",
  "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, 
-46383420600, -43227660600, -40071900600, -33760380600, -30604620600, 
-27448860600, -21137340600, -17981580600, -14825820600, -12219305400, 
-2208988800, 2547315000, 2547401400 ],
  "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, 
-518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ]
}, {
  "tz" : "Africa/Casablanca",
  "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, 
-46383408000, -43227648000, -40071888000, -33760368000, -30604608000, 
-27448848000, -21137328000, -17981568000, -14825808000, -12219292800, 
-2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 
2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 
2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 
2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 
2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 
2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 
2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 
2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 
2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 
2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 
3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 
3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 
3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 
3332714400, 3335742000, 3362954400, 3366586800, 3393799200, 3396826800, 
3424644000, 3427671600, 3454884000, 3457911600, 3485728800, 3488756400, 
3515968800, 3519601200, 3546813600, 3549841200, 3577658400, 3580686000, 
3607898400, 3611530800, 3638743200, 3641770800, 3669588000, 3672615600, 
3699828000, 3702855600 ],
  "diffs" : [ 174620, 88220, 1820, -84580, -170980, -257380, -343780, -430180, 
-516580, -602980, -689380, -775780, -862180, 1820, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600 ]
}, {
  "tz" : "Africa/El_Aaiun",
  "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, 
-46383408000, -43227648000, -40071888000, -33760368000, -30604608000, 
-27448848000, -21137328000, -17981568000, -14825808000, -12219292800, 
-2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 
2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 
2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 
2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 
2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 
2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 
2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 
2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 
2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 
2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 
3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 
3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 
3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 
3332714400, 333

[jira] [Updated] (SPARK-31384) NPE in OptimizeSkewedJoin when a plan's inputRDD has 0 partitions

2020-04-08 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31384:
-
Summary: NPE in OptimizeSkewedJoin when a plan's inputRDD has 0 
partitions  (was: Fix NPE in OptimizeSkewedJoin)

> NPE in OptimizeSkewedJoin when a plan's inputRDD has 0 partitions
> -
>
> Key: SPARK-31384
> URL: https://issues.apache.org/jira/browse/SPARK-31384
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> When an inputRDD of a plan has 0 partitions, the OptimizeSkewedJoin rule 
> can hit an NPE.
> The issue can be reproduced by the test below:
> {code:java}
> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   withTempView("t2") {
> // create DataFrame with 0 partition
> spark.createDataFrame(sparkContext.emptyRDD[Row], new StructType().add("b", IntegerType))
>   .createOrReplaceTempView("t2")
> // should run successfully without NPE
> runAdaptiveAndVerifyResult("SELECT * FROM testData2 t1 left semi join t2 ON t1.a=t2.b")
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31384) Fix NPE in OptimizeSkewedJoin

2020-04-08 Thread wuyi (Jira)
wuyi created SPARK-31384:


 Summary: Fix NPE in OptimizeSkewedJoin
 Key: SPARK-31384
 URL: https://issues.apache.org/jira/browse/SPARK-31384
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: wuyi


When an inputRDD of a plan has 0 partitions, the OptimizeSkewedJoin rule can 
hit an NPE.

The issue can be reproduced by the test below:
{code:java}
withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
  SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
  withTempView("t2") {
// create DataFrame with 0 partition
spark.createDataFrame(sparkContext.emptyRDD[Row], new StructType().add("b", IntegerType))
  .createOrReplaceTempView("t2")
// should run successfully without NPE
runAdaptiveAndVerifyResult("SELECT * FROM testData2 t1 left semi join t2 ON t1.a=t2.b")
  }
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31379) Fix flaky test: o.a.s.scheduler.CoarseGrainedSchedulerBackendSuite.extra resources from executor

2020-04-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31379.
--
Fix Version/s: 3.0.0
 Assignee: wuyi
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28145

> Fix flaky test: o.a.s.scheduler.CoarseGrainedSchedulerBackendSuite.extra 
> resources from executor
> 
>
> Key: SPARK-31379
> URL: https://issues.apache.org/jira/browse/SPARK-31379
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> see 
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120786/testReport/org.apache.spark.scheduler/CoarseGrainedSchedulerBackendSuite/extra_resources_from_executor/]
>  for details.
> {code:java}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 325 times over 5.01070979 
> seconds. Last failure message: ArrayBuffer("1", "3") did not equal Array("0", 
> "1", "3").
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:45)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:45)
>   at 
> org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.$anonfun$new$12(CoarseGrainedSchedulerBackendSuite.scala:264)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2020-04-08 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077939#comment-17077939
 ] 

Sandeep Katta commented on SPARK-23128:
---

[~cloud_fan] [~carsonwang] Any updates on dynamic parallelism and skew 
handling? Are they fixed in 3.0.0?

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However, there are 
> some limitations. First, it may cause additional shuffles that can decrease 
> performance. We can see this in the EnsureRequirements rule when it adds 
> ExchangeCoordinator. Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don't have a 
> global picture of all shuffle dependencies of a post-shuffle stage. For 
> example, for a join of 3 tables in a single stage, the same 
> ExchangeCoordinator should be used in the three Exchanges, but currently two 
> separate ExchangeCoordinators will be added. Thirdly, with the current 
> framework it is not easy to flexibly implement other adaptive-execution 
> features, such as changing the execution plan and handling skewed joins at 
> runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
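
For reference, a minimal sketch (assuming the Spark 3.0.x configuration names) 
of enabling the adaptive execution framework described above, including the 
dynamic parallelism and skew handling asked about in the comment:

{code:python}
# Hedged sketch, assuming Spark 3.0.x config names: turn on the new AQE
# framework together with its runtime features.
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # enable AQE
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # dynamic parallelism
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # skew join handling
{code}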



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31383) Clean up the SQL documents in docs/sql-ref*

2020-04-08 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-31383:


 Summary: Clean up the SQL documents in docs/sql-ref*
 Key: SPARK-31383
 URL: https://issues.apache.org/jira/browse/SPARK-31383
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


This ticket intends to clean up the SQL documents in `docs/sql-ref*`.
The main changes are as follows:

- Fixes wrong syntax and capitalizes sub-titles
 - Adds some DDL queries in `Examples` so that users can run the examples there
 - Makes query output in `Examples` follow the `Dataset.showString` 
(right-aligned) format
 - Adds/removes spaces, indents, or blank lines to follow the format below:

{code}
---
license...
---

### Description

Writes what the syntax is.

### Syntax

{% highlight sql %}
SELECT...
    WHERE... // 4 indents after the second line
    ...
{% endhighlight %}

### Parameters


 
Param Name
    Param Description
...


### Examples

{% highlight sql %}
-- It is better that users are able to execute example queries here.
-- So, we prepare test data in the first section if possible.
CREATE TABLE t (key STRING, value DOUBLE);
INSERT INTO t VALUES
    ('a', 1.0), ('a', 2.0), ('b', 3.0), ('c', 4.0);

-- query output has 2 indents and it follows the `Dataset.showString`
-- format (right-aligned).
SELECT * FROM t;
  +---+-----+
  |key|value|
  +---+-----+
  |  a|  1.0|
  |  a|  2.0|
  |  b|  3.0|
  |  c|  4.0|
  +---+-----+

-- Query statements after the second line have 4 indents.
SELECT key, SUM(value)
    FROM t
    GROUP BY key;
  +---+----------+
  |key|sum(value)|
  +---+----------+
  |  c|       4.0|
  |  b|       3.0|
  |  a|       3.0|
  +---+----------+
...
{% endhighlight %}

### Related Statements

* [XXX](xxx.html)
 * ...
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31382) Show a better error message for different python and pip installation mistake

2020-04-08 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31382:


 Summary: Show a better error message for different python and pip 
installation mistake
 Key: SPARK-31382
 URL: https://issues.apache.org/jira/browse/SPARK-31382
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.5, 3.0.0
Reporter: Hyukjin Kwon


See 
https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560
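
For illustration, a hedged diagnostic sketch of the mismatch behind the linked 
question: check which interpreter is running and whether it can import the 
pip-installed pyspark at all. The printed messages are illustrative only.

{code:python}
# Hedged diagnostic sketch: a pip-installed pyspark is only visible to the
# interpreter whose pip installed it, so print both sides of the mismatch.
import sys

print("running interpreter:", sys.executable)
try:
    import pyspark
    print("pyspark importable from:", pyspark.__file__)
except ImportError:
    print("pyspark is not importable from this interpreter; it was probably "
          "installed with a pip that belongs to a different Python.")
{code}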



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org