[jira] [Created] (SPARK-31391) Add AdaptiveTestUtils to ease the test of AQE
wuyi created SPARK-31391: Summary: Add AdaptiveTestUtils to ease the test of AQE Key: SPARK-31391 URL: https://issues.apache.org/jira/browse/SPARK-31391 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Tests related to AQE currently contain a lot of duplicated code; we can use some utility functions to make the tests simpler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
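As a rough illustration of the kind of utility the ticket describes, here is a minimal sketch (assumed names and structure, not the actual AdaptiveTestUtils implementation): it runs the same query with adaptive execution off and on, compares the results, and checks that the adaptive plan was actually used.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.internal.SQLConf

// Sketch only: a shared helper so each AQE test does not repeat this boilerplate.
object AdaptiveTestUtils {
  // Runs `query` with AQE disabled and enabled and checks that the results agree
  // (assumes the query is deterministic and its result ordering is stable).
  def checkAdaptiveMatchesNonAdaptive(spark: SparkSession, query: String): Unit = {
    def run(enabled: Boolean): DataFrame = {
      spark.conf.set(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key, enabled.toString)
      spark.sql(query)
    }
    val expected = run(enabled = false).collect()
    val adaptive = run(enabled = true)
    assert(adaptive.collect().sameElements(expected))
    // With AQE on, the executed plan should be wrapped in AdaptiveSparkPlanExec.
    assert(adaptive.queryExecution.executedPlan.toString.contains("AdaptiveSparkPlan"))
  }
}
{code}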
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078958#comment-17078958 ] zhengruifeng commented on SPARK-31301: -- [~srowen] There are two methods now: {code:java} @Since("2.2.0") def test(dataset: DataFrame, featuresCol: String, labelCol: String): DataFrame = { val spark = dataset.sparkSession import spark.implicits._ SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT) SchemaUtils.checkNumericType(dataset.schema, labelCol) val rdd = dataset.select(col(labelCol).cast("double"), col(featuresCol)).as[(Double, Vector)] .rdd.map { case (label, features) => OldLabeledPoint(label, OldVectors.fromML(features)) } val testResults = OldStatistics.chiSqTest(rdd) val pValues = Vectors.dense(testResults.map(_.pValue)) val degreesOfFreedom = testResults.map(_.degreesOfFreedom) val statistics = Vectors.dense(testResults.map(_.statistic)) spark.createDataFrame(Seq(ChiSquareResult(pValues, degreesOfFreedom, statistics))) } @Since("3.1.0") def testChiSquare( dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult] = { SchemaUtils.checkColumnType(dataset.schema, featuresCol, new VectorUDT) SchemaUtils.checkNumericType(dataset.schema, labelCol) val input = dataset.select(col(labelCol).cast(DoubleType), col(featuresCol)).rdd .map { case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } val chiTestResult = OldStatistics.chiSqTest(input) chiTestResult.map(r => new ChiSqTestResult(r.pValue, r.degreesOfFreedom, r.statistic)) } {code} The newly added one is targeted to 3.1.0, so we can modify its return type without breaking api > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> >val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +++--+ > | pValues|degreesOfFreedom|statistics| > +++--+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +++--+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df only containing one row. 
> I think this is quite hard to use: suppose we have a dataset with dim=1000, the only thing we can do with the test result is collect it with {{head()}} or {{first()}} and then use it on the driver. > What I really want is to filter the df, e.g. pValue > 0.1 or corr < 0.5. *So I suggest flattening the output df in those tests.* > note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but ChiSquareTest and Correlation have been around for a long time. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
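To make the motivation concrete, here is a hedged sketch (requires Spark 3.0+ for {{vector_to_array}}; the column names follow the current output, while the flattened layout itself is only illustrative, not the proposed API) that turns today's one-row result into one row per feature so it can be filtered without collecting to the driver:

{code:scala}
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.sql.functions.{col, posexplode}

// `df` is the (label, features) DataFrame from the example above.
val chi = ChiSquareTest.test(df, "features", "label")

// Explode the pValues vector into one row per feature index.
val flat = chi.select(
  posexplode(vector_to_array(col("pValues"))).as(Seq("featureIndex", "pValue")))

// Ordinary DataFrame operations now work directly, e.g. the pValue > 0.1 filter.
flat.filter(col("pValue") > 0.1).show()
{code}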
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078902#comment-17078902 ] Sean R. Owen commented on SPARK-31301: -- I guess so, but doesn't it become inconsistent with existing similar methods? > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> >val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +++--+ > | pValues|degreesOfFreedom|statistics| > +++--+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +++--+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df only containing one row. > I think this is quite hard to use, suppose we have a dataset with dim=1000, > the only operation we can deal with the test result is to collect it by > {{head()}} or {{first(), and then use it in the driver.}} > {{While what I really want to do is filtering the df like pValue>0.1}} or > {{corr<0.5}}, *So I suggest to flatten the output df in those tests.* > > {{note: {{ANOVATest}}{{ and\{{FValueTest}} are newly added in 3.1.0, but > ChiSquareTest and Correlation were here for a long time. > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078898#comment-17078898 ] zhengruifeng commented on SPARK-31301: -- [~srowen] How do you think about changing the return type of newly added method "testChiSquare" to flatten rows? This method currently returns "Array[SelectionTestResult]", which is similar to the old one "test" which returns single row. > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> >val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +++--+ > | pValues|degreesOfFreedom|statistics| > +++--+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +++--+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df only containing one row. > I think this is quite hard to use, suppose we have a dataset with dim=1000, > the only operation we can deal with the test result is to collect it by > {{head()}} or {{first(), and then use it in the driver.}} > {{While what I really want to do is filtering the df like pValue>0.1}} or > {{corr<0.5}}, *So I suggest to flatten the output df in those tests.* > > {{note: {{ANOVATest}}{{ and\{{FValueTest}} are newly added in 3.1.0, but > ChiSquareTest and Correlation were here for a long time. > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31368) The query with the where condition failed, when the partition field is null
[ https://issues.apache.org/jira/browse/SPARK-31368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tanweihua updated SPARK-31368: -- Component/s: (was: Spark Shell) (was: Spark Core) (was: PySpark) > The query with the where condition failed, when the partition field is null > -- > > Key: SPARK-31368 > URL: https://issues.apache.org/jira/browse/SPARK-31368 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 2.4.5 > Environment: 1、Linux environment: CentOS Linux release 7.3.1611 or > CentOS Linux release 7.5.1804 > 2、Spark Client environment: Spark-2.4.4-bin-hadoop2.6 or > Spark-2.4.5-bin-hadoop2.6 > 3、Hadoop environment: hadoop-2.6.0-cdh5.8.4 > 4、Hive environment: hive-1.1.0-cdh5.8.4 > 5、Java environment: jdk1.8.0_181 > 6、Python environment: python 2.7.5 >Reporter: tanweihua >Priority: Major > > h3. The problem recurs as follows: > # create table test_1(id int,name string) partitioned by(profile string) > # insert into test_1 values(1,null) > # select * from test_1 where profile is null > After going through the above steps, the result is empty. But if we add the condition > profile='__HIVE_DEFAULT_PARTITION__', the result is OK. > h3. The temporary solution: > select * from test_1 where profile is null or > profile='__HIVE_DEFAULT_PARTITION__' > The result is OK > h3. Special instructions: > 1、The above phenomenon happens only when the partition field type is string > 2、The same operations work fine in Hive > h3. Problem orientation: > As far as I can tell, the problem is in > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils and > org.apache.spark.sql.catalyst.catalog.CatalogTablePartition, especially the > toRow function in CatalogTablePartition. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30818) Add LinearRegression wrapper to SparkR
[ https://issues.apache.org/jira/browse/SPARK-30818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30818. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27593 [https://github.com/apache/spark/pull/27593] > Add LinearRegression wrapper to SparkR > -- > > Key: SPARK-30818 > URL: https://issues.apache.org/jira/browse/SPARK-30818 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Spark should provide a wrapper for {{o.a.s.ml.regression. LinearRegression}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30818) Add LinearRegression wrapper to SparkR
[ https://issues.apache.org/jira/browse/SPARK-30818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30818: Assignee: Maciej Szymkiewicz > Add LinearRegression wrapper to SparkR > -- > > Key: SPARK-30818 > URL: https://issues.apache.org/jira/browse/SPARK-30818 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Spark should provide a wrapper for {{o.a.s.ml.regression. LinearRegression}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31309) Migrate the ChiSquareTest from MLlib to ML
[ https://issues.apache.org/jira/browse/SPARK-31309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31309. -- Resolution: Not A Problem > Migrate the ChiSquareTest from MLlib to ML > -- > > Key: SPARK-31309 > URL: https://issues.apache.org/jira/browse/SPARK-31309 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > Move the impl of ChiSq from .mllib to the .ml side, and make .mllib.ChiSq a > wrapper of the .ml side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31382) Show a better error message for different python and pip installation mistake
[ https://issues.apache.org/jira/browse/SPARK-31382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31382: Assignee: Hyukjin Kwon > Show a better error message for different python and pip installation mistake > - > > Key: SPARK-31382 > URL: https://issues.apache.org/jira/browse/SPARK-31382 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > See > https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31382) Show a better error message for different python and pip installation mistake
[ https://issues.apache.org/jira/browse/SPARK-31382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31382. -- Fix Version/s: 3.0.0 2.4.6 Resolution: Fixed Issue resolved by pull request 28152 [https://github.com/apache/spark/pull/28152] > Show a better error message for different python and pip installation mistake > - > > Key: SPARK-31382 > URL: https://issues.apache.org/jira/browse/SPARK-31382 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.6, 3.0.0 > > > See > https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31390) Document Window Function
Huaxin Gao created SPARK-31390: -- Summary: Document Window Function Key: SPARK-31390 URL: https://issues.apache.org/jira/browse/SPARK-31390 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao Document Window Function -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data
[ https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29314. - Fix Version/s: 3.0.0 Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/25987] > ProgressReporter.extractStateOperatorMetrics should not overwrite updated as > 0 when it actually runs a batch even with no data > -- > > Key: SPARK-29314 > URL: https://issues.apache.org/jira/browse/SPARK-29314 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > SPARK-24156 brought the ability to run a batch without actual data to enable > fast state cleanup as well as emit evicted outputs without waiting actual > data to come. > This breaks some assumption on > `ProgressReporter.extractStateOperatorMetrics`. See comment in source code: > {code:java} > // lastExecution could belong to one of the previous triggers if > `!hasNewData`. > // Walking the plan again should be inexpensive. > {code} > and newNumRowsUpdated is replaced to 0 if hasNewData is false. It makes sense > if we copy progress from previous execution (which means no batch is run for > this time), but after SPARK-24156 the precondition is broken. > Spark should still replace the value of newNumRowsUpdated with 0 if there's > no batch being run and it needs to copy the old value from previous > execution, but it shouldn't touch the value if it runs a batch for no data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29314) ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data
[ https://issues.apache.org/jira/browse/SPARK-29314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-29314: --- Assignee: Jungtaek Lim > ProgressReporter.extractStateOperatorMetrics should not overwrite updated as > 0 when it actually runs a batch even with no data > -- > > Key: SPARK-29314 > URL: https://issues.apache.org/jira/browse/SPARK-29314 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > SPARK-24156 brought the ability to run a batch without actual data to enable > fast state cleanup as well as emit evicted outputs without waiting actual > data to come. > This breaks some assumption on > `ProgressReporter.extractStateOperatorMetrics`. See comment in source code: > {code:java} > // lastExecution could belong to one of the previous triggers if > `!hasNewData`. > // Walking the plan again should be inexpensive. > {code} > and newNumRowsUpdated is replaced to 0 if hasNewData is false. It makes sense > if we copy progress from previous execution (which means no batch is run for > this time), but after SPARK-24156 the precondition is broken. > Spark should still replace the value of newNumRowsUpdated with 0 if there's > no batch being run and it needs to copy the old value from previous > execution, but it shouldn't touch the value if it runs a batch for no data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078803#comment-17078803 ] Srinivas Rishindra Pothireddi commented on SPARK-31389: --- I am working on this. > Ensure all tests in SQLMetricsSuite run with both codegen on and off > > > Key: SPARK-31389 > URL: https://issues.apache.org/jira/browse/SPARK-31389 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > Many tests in SQLMetricsSuite run only with codegen turned off. Some complex > code paths (for example, generated code in "SortMergeJoin metrics") aren't > exercised at all. The generated code should be tested as well. > *List of tests that run with codegen off* > Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, > BroadcastHashJoin metrics, ShuffledHashJoin metrics, > BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, > BroadcastLeftSemiJoinHash metrics > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-31389: -- Description: Many tests in SQLMetricsSuite run only with codegen turned off. Some complex code paths (for example, generated code in "SortMergeJoin metrics") aren't exercised at all. The generated code should be tested as well. *List of tests that run with codegen off* Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics was: Many tests in SQLMetricsSuite run only with codegen turned off. Some complex code paths (for example, generated code in "SortMergeJoin metrics") aren't exercised at all. *List of tests that run with codegen off* Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics The generated code should be tested as well. > Ensure all tests in SQLMetricsSuite run with both codegen on and off > > > Key: SPARK-31389 > URL: https://issues.apache.org/jira/browse/SPARK-31389 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > Many tests in SQLMetricsSuite run only with codegen turned off. Some complex > code paths (for example, generated code in "SortMergeJoin metrics") aren't > exercised at all. The generated code should be tested as well. > *List of tests that run with codegen off* > Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, > BroadcastHashJoin metrics, ShuffledHashJoin metrics, > BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, > BroadcastLeftSemiJoinHash metrics > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off
Srinivas Rishindra Pothireddi created SPARK-31389: - Summary: Ensure all tests in SQLMetricsSuite run with both codegen on and off Key: SPARK-31389 URL: https://issues.apache.org/jira/browse/SPARK-31389 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.1.0 Reporter: Srinivas Rishindra Pothireddi Many tests in SQLMetricsSuite run only with codegen turned off. Some complex code paths (for example, generated code in "SortMergeJoin metrics") aren't exercised at all. *List of tests that run with codegen off* Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, BroadcastHashJoin metrics, ShuffledHashJoin metrics, BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, BroadcastLeftSemiJoinHash metrics The generated code should be tested as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
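A minimal sketch of the pattern the ticket asks for (the helper name is illustrative; the real suites would typically use the test framework's withSQLConf): run the same metrics assertion once with whole-stage codegen enabled and once with it disabled.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

// Sketch only: execute the given check under both codegen settings so the
// generated-code paths (e.g. SortMergeJoin metrics) are exercised as well.
def withBothCodegenModes(spark: SparkSession)(check: => Unit): Unit = {
  val original = spark.conf.get(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key)
  try {
    Seq("true", "false").foreach { enabled =>
      spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, enabled)
      check
    }
  } finally {
    // Restore the caller's setting regardless of assertion failures.
    spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, original)
  }
}
{code}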
[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078751#comment-17078751 ] Nick Afshartous commented on SPARK-27249: - [~enrush] Hi Everett, checking back on my question in the last comment. > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.1.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (ie UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF such as a database connection. > > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > output Datatype and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
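A hedged sketch of the shape of the proposed API (the class name and createTransformFunc come from the description above; the encoder wiring and everything else is illustrative, not a worked-out design):

{code:scala}
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Sketch only: a transformer that maps whole partitions, so inputs spanning several
// columns can be used and expensive resources (e.g. a DB connection) can be set up
// once per partition inside the returned function.
abstract class PartitionTransformer extends Serializable {
  // Schema of the rows produced by the transformation, supplied by the subclass.
  def outputSchema: StructType

  // Parallels UnaryTransformer.createTransformFunc, but over an entire partition.
  def createTransformFunc: Iterator[Row] => Iterator[Row]

  def transform(dataset: Dataset[_]): DataFrame = {
    dataset.toDF().mapPartitions(createTransformFunc)(RowEncoder(outputSchema))
  }
}
{code}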
[jira] [Updated] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
[ https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viacheslav Krot updated SPARK-31386: Description: Following code with udf causes MemoryError when `spark.executor.pyspark.memory` is set ``` from pyspark.sql.types import BooleanType from pyspark.sql.functions import udf df = spark.createDataFrame([ ('Alice', 10), ('Bob', 12) ], ['name', 'cnt']) broadcast = spark.sparkContext.broadcast([1,2,3]) @udf(BooleanType()) def f(cnt): return cnt < len(broadcast.value) df.filter(f(df.cnt)).count() ``` Same code work well when spark.executor.pyspark.memory is not set. The code by itself does not make any sense, just simplest code to reproduce the bug. Error: ``` 20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, ip-172-31-32-201.us-east-2.compute.internal, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, ip-172-31-32-201.us-east-2.compute.internal, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 377, in main process() File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 372, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 345, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 334, in _batched for item in iterator: File "", line 1, in File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 85, in return lambda *a: f(*a) File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/util.py", line 113, in wrapper return f(*args, **kwargs) File "", line 3, in f File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", line 148, in value self._value = self.load_from_path(self._path) File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", line 124, in load_from_path with open(path, 'rb', 1 << 20) as f:MemoryError at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.
[jira] [Resolved] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31009. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27836 [https://github.com/apache/spark/pull/27836] > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Major > Fix For: 3.1.0 > > > This function will return all the keys from outer json object. > > PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > Mysql -> > [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] > MariaDB -> [https://mariadb.com/kb/en/json-functions/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31009) Support json_object_keys function
[ https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31009: - Assignee: Rakesh Raushan > Support json_object_keys function > - > > Key: SPARK-31009 > URL: https://issues.apache.org/jira/browse/SPARK-31009 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Major > > This function will return all the keys from outer json object. > > PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html] > Mysql -> > [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html] > MariaDB -> [https://mariadb.com/kb/en/json-functions/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31388) org.apache.spark.sql.hive.thriftserver.CliSuite result matching is flaky
Juliusz Sompolski created SPARK-31388: - Summary: org.apache.spark.sql.hive.thriftserver.CliSuite result matching is flaky Key: SPARK-31388 URL: https://issues.apache.org/jira/browse/SPARK-31388 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Juliusz Sompolski CliSuite.runCliWithin result matching has issues. Will describe in PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078524#comment-17078524 ] Srinivas Rishindra Pothireddi commented on SPARK-31377: --- I am working on this > Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite > -- > > Key: SPARK-31377 > URL: https://issues.apache.org/jira/browse/SPARK-31377 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > For some combinations of join algorithm and join types there are no unit > tests for the "number of output rows" metric. > A list of missing unit tests include the following. > * SortMergeJoin: ExistenceJoin > * ShuffledHashJoin: OuterJoin, ReftOuter, RightOuter, LeftAnti, LeftSemi, > ExistenseJoin > * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin > * BroadcastHashJoin: LeftAnti, ExistenceJoin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078520#comment-17078520 ] Venkata krishnan Sowrirajan commented on SPARK-22148: - Thanks for your comments [~tgraves] Makes sense, I will think about it more, create a new JIRA and share a new proposal based on how we think about it internally. > TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current > executors are blacklisted but dynamic allocation is enabled > - > > Key: SPARK-22148 > URL: https://issues.apache.org/jira/browse/SPARK-22148 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Dhruve Ashar >Priority: Major > Fix For: 2.4.1, 3.0.0 > > Attachments: SPARK-22148_WIP.diff > > > Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and > the whole Spark job with `task X (partition Y) cannot run anywhere due to > node and executor blacklist. Blacklisting behavior can be configured via > spark.blacklist.*.` when all the available executors are blacklisted for a > pending Task or TaskSet. This makes sense for static allocation, where the > set of executors is fixed for the duration of the application, but this might > lead to unnecessary job failures when dynamic allocation is enabled. For > example, in a Spark application with a single job at a time, when a node > fails at the end of a stage attempt, all other executors will complete their > tasks, but the tasks running in the executors of the failing node will be > pending. Spark will keep waiting for those tasks for 2 minutes by default > (spark.network.timeout) until the heartbeat timeout is triggered, and then it > will blacklist those executors for that stage. At that point in time, other > executors would had been released after being idle for 1 minute by default > (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't > started yet and so there are no more tasks available (assuming the default of > spark.speculation = false). So Spark will fail because the only executors > available are blacklisted for that stage. > An alternative is requesting more executors to the cluster manager in this > situation. This could be retried a configurable number of times after a > configurable wait time between request attempts, so if the cluster manager > fails to provide a suitable executor then the job is aborted like in the > previous case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31387) HiveThriftServer2Listener update methods fail with unknown operation/session id
[ https://issues.apache.org/jira/browse/SPARK-31387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Smesseim updated SPARK-31387: - Description: HiveThriftServer2Listener update methods, such as onSessionClosed and onOperationError throw a NullPointerException (in Spark 3) or a NoSuchElementException (in Spark 2) when the input session/operation id is unknown. In Spark 2, this can cause control flow issues with the caller of the listener. In Spark 3, the listener is called by a ListenerBus which catches the exception, but it would still be nicer if an invalid update is logged and does not throw an exception. (was: HiveThriftServer2Listener update methods, such as onSessionClosed and onOperationError throw a NullPointerException (in Spark 3) or a NoSuchElementException (in Spark 2) when the input session/operation id is unknown. In Spark 2, this can cause control flow issues with the caller of the listener. In Spark 3, the listener is called by a ListenerBus which catches the exception, but it would still be nicer if an invalid update is logged and not throw an exception.) > HiveThriftServer2Listener update methods fail with unknown operation/session > id > --- > > Key: SPARK-31387 > URL: https://issues.apache.org/jira/browse/SPARK-31387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3 >Reporter: Ali Smesseim >Priority: Major > > HiveThriftServer2Listener update methods, such as onSessionClosed and > onOperationError throw a NullPointerException (in Spark 3) or a > NoSuchElementException (in Spark 2) when the input session/operation id is > unknown. In Spark 2, this can cause control flow issues with the caller of > the listener. In Spark 3, the listener is called by a ListenerBus which > catches the exception, but it would still be nicer if an invalid update is > logged and does not throw an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31387) HiveThriftServer2Listener update methods fail with unknown operation/session id
Ali Smesseim created SPARK-31387: Summary: HiveThriftServer2Listener update methods fail with unknown operation/session id Key: SPARK-31387 URL: https://issues.apache.org/jira/browse/SPARK-31387 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 2.3.4, 3 Reporter: Ali Smesseim HiveThriftServer2Listener update methods, such as onSessionClosed and onOperationError throw a NullPointerException (in Spark 3) or a NoSuchElementException (in Spark 2) when the input session/operation id is unknown. In Spark 2, this can cause control flow issues with the caller of the listener. In Spark 3, the listener is called by a ListenerBus which catches the exception, but it would still be nicer if an invalid update is logged and not throw an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
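A small sketch of the defensive behaviour suggested here (class and field names are illustrative, not the actual HiveThriftServer2Listener internals): look the id up first and log a warning for unknown ids instead of letting the update throw.

{code:scala}
import org.slf4j.LoggerFactory
import scala.collection.concurrent.TrieMap

// Sketch only: session updates ignore (but log) ids that were never registered.
class SessionStore {
  private val log = LoggerFactory.getLogger(getClass)
  private val sessions = TrieMap.empty[String, Long]

  def onSessionCreated(id: String): Unit = {
    sessions.put(id, System.currentTimeMillis())
  }

  def onSessionClosed(id: String): Unit =
    if (sessions.remove(id).isEmpty) {
      log.warn(s"onSessionClosed called with unknown session id: $id")
    }
}
{code}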
[jira] [Resolved] (SPARK-31362) Document Set Operators in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31362. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28139 [https://github.com/apache/spark/pull/28139] > Document Set Operators in SQL Reference > --- > > Key: SPARK-31362 > URL: https://issues.apache.org/jira/browse/SPARK-31362 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Document Set Operators -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31362) Document Set Operators in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31362: Assignee: Huaxin Gao > Document Set Operators in SQL Reference > --- > > Key: SPARK-31362 > URL: https://issues.apache.org/jira/browse/SPARK-31362 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Document Set Operators -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31327) write spark version to avro file metadata
[ https://issues.apache.org/jira/browse/SPARK-31327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078384#comment-17078384 ] Dongjoon Hyun commented on SPARK-31327: --- This is backported to `branch-2.4` via [https://github.com/apache/spark/pull/28150] to help Apache Spark 3.0.0 handle old files more elegantly. > write spark version to avro file metadata > - > > Key: SPARK-31327 > URL: https://issues.apache.org/jira/browse/SPARK-31327 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0, 2.4.6 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31327) write spark version to avro file metadata
[ https://issues.apache.org/jira/browse/SPARK-31327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31327: -- Fix Version/s: 2.4.6 > write spark version to avro file metadata > - > > Key: SPARK-31327 > URL: https://issues.apache.org/jira/browse/SPARK-31327 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0, 2.4.6 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078283#comment-17078283 ] Wenchen Fan commented on SPARK-23128: - Yes, they are. https://issues.apache.org/jira/browse/SPARK-28177 https://issues.apache.org/jira/browse/SPARK-29544 > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078278#comment-17078278 ] Thomas Graves commented on SPARK-22148: --- so off the top of my head, I think the main issue with just requesting more is that the dynamic allocation manager isn't tied very tightly to the scheduler or the blacklist tracker, so getting the information required to properly track why we have more executors then needed took quite a bit more work and code refactoring. If you are still seeing issues regularly though we could revisit to see if we could either request more or perhaps kill executors that are blacklisted that aren't completely idle. But I would have to re-read through these and think about it more. If you have ideas feel free to propose, though we should do it under a new Jira and link them > TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current > executors are blacklisted but dynamic allocation is enabled > - > > Key: SPARK-22148 > URL: https://issues.apache.org/jira/browse/SPARK-22148 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Dhruve Ashar >Priority: Major > Fix For: 2.4.1, 3.0.0 > > Attachments: SPARK-22148_WIP.diff > > > Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and > the whole Spark job with `task X (partition Y) cannot run anywhere due to > node and executor blacklist. Blacklisting behavior can be configured via > spark.blacklist.*.` when all the available executors are blacklisted for a > pending Task or TaskSet. This makes sense for static allocation, where the > set of executors is fixed for the duration of the application, but this might > lead to unnecessary job failures when dynamic allocation is enabled. For > example, in a Spark application with a single job at a time, when a node > fails at the end of a stage attempt, all other executors will complete their > tasks, but the tasks running in the executors of the failing node will be > pending. Spark will keep waiting for those tasks for 2 minutes by default > (spark.network.timeout) until the heartbeat timeout is triggered, and then it > will blacklist those executors for that stage. At that point in time, other > executors would had been released after being idle for 1 minute by default > (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't > started yet and so there are no more tasks available (assuming the default of > spark.speculation = false). So Spark will fail because the only executors > available are blacklisted for that stage. > An alternative is requesting more executors to the cluster manager in this > situation. This could be retried a configurable number of times after a > configurable wait time between request attempts, so if the cluster manager > fails to provide a suitable executor then the job is aborted like in the > previous case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set
Viacheslav Krot created SPARK-31386: --- Summary: Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set Key: SPARK-31386 URL: https://issues.apache.org/jira/browse/SPARK-31386 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Environment: Spark 2.4.4 or AWS EMR `pyspark --conf spark.executor.pyspark.memory=500m` Reporter: Viacheslav Krot Following code with udf causes MemoryError when `spark.executor.pyspark.memory` is set ``` from pyspark.sql.types import BooleanType from pyspark.sql.functions import udf df = spark.createDataFrame([ ('Alice', 10), ('Bob', 12) ], ['name', 'cnt']) broadcast = spark.sparkContext.broadcast([1,2,3]) @udf(BooleanType()) def f(cnt): return cnt < len(broadcast.value) df.filter(f(df.cnt)).count() ``` Same code work well when spark.executor.pyspark.memory is not set. The code by itself does not make any sense, just simplest code to reproduce the bug. Error: ``` 20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, ip-172-31-32-201.us-east-2.compute.internal, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, ip-172-31-32-201.us-east-2.compute.internal, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 377, in main process() File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 372, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 345, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py", line 334, in _batched for item in iterator: File "", line 1, in File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py", line 85, in return lambda *a: f(*a) File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/util.py", line 113, in wrapper return f(*args, **kwargs) File "", line 3, in f File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", line 148, in value self._value = self.load_from_path(self._path) File "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py", line 124, in load_from_path with open(path, 'rb', 1 << 20) as f:MemoryError at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81) at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64) at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at org.apache.sp
[jira] [Comment Edited] (SPARK-31376) Non-global sort support for structured streaming
[ https://issues.apache.org/jira/browse/SPARK-31376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078219#comment-17078219 ] Adam Binford edited comment on SPARK-31376 at 4/8/20, 12:40 PM: I tried multiple times to add myself to the dev@ mailing list but was unsuccessful, which is why I ended up just posting a Jira ticket. It looks like it finally worked using the subscribe link on the spark community page (subscribing from the mailing list page doesn't seem to work). Taking the discussion there now that it finally actually worked. was (Author: kimahriman): I tried multiple times to add myself to the dev@ mailing list but was unsuccessful, which is why I ended up just posting a Jira ticket. It looks like it finally worked using the subscribe link on the spark community page (subscribing from the mailing list page doesn't seem to work). > Non-global sort support for structured streaming > > > Key: SPARK-31376 > URL: https://issues.apache.org/jira/browse/SPARK-31376 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Adam Binford >Priority: Minor > > Currently, all sorting is disallowed with structured streaming queries. Not > allowing global sorting makes sense, but could non-global sorting (i.e. > sortWithinPartitions) be allowed? I'm running into this with an external > source I'm using, but not sure if this would be useful to file sources as > well. I have to foreachBatch so that I can do a sortWithinPartitions. > Two main questions: > * Does a local sort cause issues with any exactly-once guarantees streaming > queries provides? I can't say I know or understand how these semantics work. > Or are there other issues I can't think of this would cause? > * Is the change as simple as changing the unsupported operations check to > only look for global sorts instead of all sorts? > I have built a version that simply changes the unsupported check to only > disallow global sorts and it seems to be working. Anything I'm missing or is > it this simple? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31376) Non-global sort support for structured streaming
[ https://issues.apache.org/jira/browse/SPARK-31376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078219#comment-17078219 ] Adam Binford commented on SPARK-31376: -- I tried multiple times to add myself to the dev@ mailing list but was unsuccessful, which is why I ended up just posting a Jira ticket. It looks like it finally worked using the subscribe link on the spark community page (subscribing from the mailing list page doesn't seem to work). > Non-global sort support for structured streaming > > > Key: SPARK-31376 > URL: https://issues.apache.org/jira/browse/SPARK-31376 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Adam Binford >Priority: Minor > > Currently, all sorting is disallowed with structured streaming queries. Not > allowing global sorting makes sense, but could non-global sorting (i.e. > sortWithinPartitions) be allowed? I'm running into this with an external > source I'm using, but not sure if this would be useful to file sources as > well. I have to foreachBatch so that I can do a sortWithinPartitions. > Two main questions: > * Does a local sort cause issues with any exactly-once guarantees streaming > queries provides? I can't say I know or understand how these semantics work. > Or are there other issues I can't think of this would cause? > * Is the change as simple as changing the unsupported operations check to > only look for global sorts instead of all sorts? > I have built a version that simply changes the unsupported check to only > disallow global sorts and it seems to be working. Anything I'm missing or is > it this simple? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
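For context on the workaround mentioned above (falling back to foreachBatch so that a per-partition sort can still be applied), a minimal sketch is shown below. The rate source, the "timestamp" column, and the parquet sink path are stand-ins chosen purely for illustration; the point is only the foreachBatch plus sortWithinPartitions pattern, since sortWithinPartitions is allowed on the plain DataFrame handed to each micro-batch even though it is currently rejected on the streaming Dataset itself.
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("sort-within-partitions-workaround").getOrCreate()

// Illustrative streaming source; the built-in rate source exposes "timestamp" and "value".
val events: DataFrame = spark.readStream.format("rate").load()

// Each micro-batch arrives as a regular (non-streaming) DataFrame, so a
// non-global sort is permitted here.
def writeBatch(batch: DataFrame, batchId: Long): Unit = {
  batch
    .sortWithinPartitions("timestamp")
    .write
    .mode("append")
    .parquet("/tmp/sorted-output") // hypothetical sink path
}

val query = events.writeStream
  .foreachBatch(writeBatch _)
  .start()
{code}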
[jira] [Created] (SPARK-31385) Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing
Maxim Gekk created SPARK-31385: -- Summary: Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing Key: SPARK-31385 URL: https://issues.apache.org/jira/browse/SPARK-31385 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Microseconds rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar is not symmetric to opposite conversion for the following time zones: # Asia/Tehran # Iran # Africa/Casablanca # Africa/El_Aaiun Here is the results from the https://github.com/apache/spark/pull/28119: Julian -> Gregorian: {code:json} , { "tz" : "Asia/Tehran", "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, -2208988800, 2547315000, 2547401400 ], "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] }, { "tz" : "Iran", "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, -2208988800, 2547315000, 2547401400 ], "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ] }, { "tz" : "Africa/Casablanca", "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 3332714400, 3335742000, 3362954400, 3366586800, 3393799200, 3396826800, 3424644000, 3427671600, 3454884000, 3457911600, 3485728800, 3488756400, 3515968800, 3519601200, 3546813600, 3549841200, 3577658400, 3580686000, 3607898400, 3611530800, 3638743200, 3641770800, 3669588000, 3672615600, 3699828000, 3702855600 ], "diffs" : [ 174620, 88220, 1820, -84580, -170980, -257380, -343780, -430180, -516580, -602980, -689380, -775780, -862180, 1820, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
-3600, 0, -3600, 0, -3600 ] }, { "tz" : "Africa/El_Aaiun", "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 3332714400, 333
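Setting the switch/diff tables aside, the reported problem is a failed round-trip property: rebasing microseconds from the hybrid Julian+Gregorian calendar to Proleptic Gregorian and back should return the original value for every instant and time zone. A minimal sketch of such a check follows; the two conversion functions are taken as parameters rather than named Spark APIs, since the exact helpers live in the linked PR and are not assumed here.
{code:scala}
// Round-trip symmetry check for calendar rebasing: for every probed
// microsecond value, hybrid -> Proleptic Gregorian -> hybrid should be the
// identity. Values that survive the filter are the asymmetric ones.
def findAsymmetries(
    julianToGregorian: Long => Long,
    gregorianToJulian: Long => Long,
    microsToCheck: Seq[Long]): Seq[Long] = {
  microsToCheck.filterNot { us =>
    gregorianToJulian(julianToGregorian(us)) == us
  }
}

// Usage idea (function names hypothetical, zone taken from the report):
//   java.util.TimeZone.setDefault(java.util.TimeZone.getTimeZone("Asia/Tehran"))
//   val broken = findAsymmetries(rebaseJulianToGregorianMicros,
//     rebaseGregorianToJulianMicros, microsAroundTheReportedSwitchPoints)
{code}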
[jira] [Updated] (SPARK-31384) NPE in OptimizeSkewedJoin when an input RDD of a plan has 0 partitions
[ https://issues.apache.org/jira/browse/SPARK-31384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31384: - Summary: NPE in OptimizeSkewedJoin when an input RDD of a plan has 0 partitions (was: Fix NPE in OptimizeSkewedJoin) > NPE in OptimizeSkewedJoin when an input RDD of a plan has 0 partitions > - > > Key: SPARK-31384 > URL: https://issues.apache.org/jira/browse/SPARK-31384 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When an input RDD of a plan has 0 partitions, the OptimizeSkewedJoin rule > can hit an NPE. > The issue can be reproduced by the test below: > {code:java} > withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true", > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { > withTempView("t2") { > // create a DataFrame with 0 partitions > spark.createDataFrame(sparkContext.emptyRDD[Row], new > StructType().add("b", IntegerType)) > .createOrReplaceTempView("t2") > // should run successfully without an NPE > runAdaptiveAndVerifyResult("SELECT * FROM testData2 t1 left semi join t2 > ON t1.a=t2.b") > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31384) Fix NPE in OptimizeSkewedJoin
wuyi created SPARK-31384: Summary: Fix NPE in OptimizeSkewedJoin Key: SPARK-31384 URL: https://issues.apache.org/jira/browse/SPARK-31384 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: wuyi When an input RDD of a plan has 0 partitions, the OptimizeSkewedJoin rule can hit an NPE. The issue can be reproduced by the test below: {code:java} withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true", SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") { withTempView("t2") { // create a DataFrame with 0 partitions spark.createDataFrame(sparkContext.emptyRDD[Row], new StructType().add("b", IntegerType)) .createOrReplaceTempView("t2") // should run successfully without an NPE runAdaptiveAndVerifyResult("SELECT * FROM testData2 t1 left semi join t2 ON t1.a=t2.b") } } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
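In the repro above one join child reports zero shuffle partitions, so per-partition size statistics for that side simply do not exist, and anything derived from them (a median used as a skew threshold, for example) needs a guard. The sketch below only illustrates that failure mode and a defensive check; it is not the actual OptimizeSkewedJoin patch.
{code:scala}
// Illustration only: deriving a skew statistic from per-partition sizes.
// With a 0-partition child the naive version has nothing to index into,
// while the guarded version degrades to "no skew information".
def medianPartitionSize(sizes: Array[Long]): Option[Long] = {
  if (sizes == null || sizes.isEmpty) {
    None // an empty child contributes no skew information
  } else {
    val sorted = sizes.sorted
    Some(sorted(sorted.length / 2))
  }
}

// A rule-like caller would skip the optimization instead of failing:
def canOptimize(leftSizes: Array[Long], rightSizes: Array[Long]): Boolean =
  medianPartitionSize(leftSizes).isDefined && medianPartitionSize(rightSizes).isDefined
{code}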
[jira] [Resolved] (SPARK-31379) Fix flaky test: o.a.s.scheduler.CoarseGrainedSchedulerBackendSuite.extra resources from executor
[ https://issues.apache.org/jira/browse/SPARK-31379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31379. -- Fix Version/s: 3.0.0 Assignee: wuyi Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28145 > Fix flaky test: o.a.s.scheduler.CoarseGrainedSchedulerBackendSuite.extra > resources from executor > > > Key: SPARK-31379 > URL: https://issues.apache.org/jira/browse/SPARK-31379 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > see > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120786/testReport/org.apache.spark.scheduler/CoarseGrainedSchedulerBackendSuite/extra_resources_from_executor/] > for details. > {code:java} > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 325 times over 5.01070979 > seconds. Last failure message: ArrayBuffer("1", "3") did not equal Array("0", > "1", "3"). > at > org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391) > at > org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:45) > at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337) > at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336) > at > org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.eventually(CoarseGrainedSchedulerBackendSuite.scala:45) > at > org.apache.spark.scheduler.CoarseGrainedSchedulerBackendSuite.$anonfun$new$12(CoarseGrainedSchedulerBackendSuite.scala:264) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077939#comment-17077939 ] Sandeep Katta commented on SPARK-23128: --- [~cloud_fan] [~carsonwang] any updates on dynamic parallelism and skew Handling. Is it fixed in 3.0.0 > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
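On the question above: the Fix Version on this ticket is 3.0.0, and both runtime reducer-count tuning and skewed-join handling ship as part of that adaptive execution framework, gated behind configuration. A minimal way to try them is sketched below; the configuration key names are the ones documented for Spark 3.0.0 and should be verified against the version actually in use.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("aqe-demo").getOrCreate()

// Enable adaptive query execution (off by default in 3.0.0).
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Runtime tuning of post-shuffle parallelism (the "dynamic parallelism" part).
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
// Runtime skewed-join handling (see SPARK-31384 above for a 0-partition edge case).
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
{code}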
[jira] [Created] (SPARK-31383) Clean up the SQL documents in docs/sql-ref*
Takeshi Yamamuro created SPARK-31383: Summary: Clean up the SQL documents in docs/sql-ref* Key: SPARK-31383 URL: https://issues.apache.org/jira/browse/SPARK-31383 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro This ticket intends to clean up the SQL documents in `doc/sql-ref*`. Main changes are as follows; - Fixes wrong syntaxes and capitalize sub-titles - Adds some DDL queries in `Examples` so that users can run examples there - Makes query output in `Examples` follows the `Dataset.showString` (right-aligned) format - Adds/Removes spaces, Indents, or blank lines to follow the format below; {code} --- license... --- ### Description Writes what's the syntax is. ### Syntax {% highlight sql %} SELECT... WHERE... // 4 indents after the second line ... {% endhighlight %} ### Parameters Param Name Param Description ... ### Examples {% highlight sql %} -- It is better that users are able to execute example queries here. -- So, we prepare test data in the first section if possible. CREATE TABLE t (key STRING, value DOUBLE); INSERT INTO t VALUES ('a', 1.0), ('a', 2.0), ('b', 3.0), ('c', 4.0); -- query output has 2 indents and it follows the `Dataset.showString` -- format (right-aligned). SELECT * FROM t; +---+-+ |key|value| +---+-+ | a| 1.0| | a| 2.0| | b| 3.0| | c| 4.0| +---+-+ -- Query statements after the second line have 4 indents. SELECT key, SUM(value) FROM t GROUP BY key; +---+--+ |key|sum(value)| +---+--+ | c| 4.0| | b| 3.0| | a| 3.0| +---+--+ ... {% endhighlight %} ### Related Statements * [XXX](xxx.html) * ... {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31382) Show a better error message for mismatched python and pip installations
Hyukjin Kwon created SPARK-31382: Summary: Show a better error message for mismatched python and pip installations Key: SPARK-31382 URL: https://issues.apache.org/jira/browse/SPARK-31382 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.5, 3.0.0 Reporter: Hyukjin Kwon See https://stackoverflow.com/questions/46286436/running-pyspark-after-pip-install-pyspark/49587560 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org