[jira] [Created] (SPARK-32814) Metaclasses are broken for a few classes in Python 3
Maciej Szymkiewicz created SPARK-32814: -- Summary: Metaclasses are broken for a few classes in Python 3 Key: SPARK-32814 URL: https://issues.apache.org/jira/browse/SPARK-32814 Project: Spark Issue Type: Bug Components: ML, PySpark, SQL Affects Versions: 2.4.0, 3.0.0, 3.1.0 Reporter: Maciej Szymkiewicz As of Python 3, {{__metaclass__}} is no longer supported (https://www.python.org/dev/peps/pep-3115/). However, we have multiple classes which were never migrated to Python 3 compatible syntax: - A number of ML {{Params}} with {{__metaclass__ = ABCMeta}} - Some of the SQL {{types}} with {{__metaclass__ = DataTypeSingleton}} As a result, some functionality is broken in Python 3. For example: {code:python} >>> from pyspark.sql.types import BooleanType >>> BooleanType() is BooleanType() False {code} or {code:python} >>> import inspect >>> from pyspark.ml import Estimator >>> inspect.isabstract(Estimator) False {code} where in both cases we expect to see {{True}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
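The fix is to move from the Python 2 `__metaclass__` class attribute to the Python 3 `metaclass=` keyword in the class header. A minimal standalone sketch (toy classes standing in for the actual PySpark `Params` and `DataTypeSingleton` cases, not the real code) showing why the old syntax silently does nothing in Python 3:

```python
from abc import ABCMeta, abstractmethod
import inspect

# Python 2 style: in Python 3 this is just an ordinary class attribute,
# so ABCMeta is never applied and the class is not actually abstract.
class OldStyleEstimator(object):
    __metaclass__ = ABCMeta

    @abstractmethod
    def fit(self, dataset):
        raise NotImplementedError

# Python 3 style: the metaclass is passed in the class header.
class NewStyleEstimator(metaclass=ABCMeta):
    @abstractmethod
    def fit(self, dataset):
        raise NotImplementedError

# Same pattern for the singleton case: a toy stand-in for DataTypeSingleton.
class Singleton(type):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

class ToyBooleanType(metaclass=Singleton):
    pass

print(inspect.isabstract(OldStyleEstimator))  # False: __metaclass__ ignored
print(inspect.isabstract(NewStyleEstimator))  # True
print(ToyBooleanType() is ToyBooleanType())   # True
```

With the header syntax, `inspect.isabstract` and identity checks behave as the ticket expects.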
[jira] [Commented] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe
[ https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191954#comment-17191954 ] Apache Spark commented on SPARK-31511: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/29669 > Make BytesToBytesMap iterator() thread-safe > --- > > Key: SPARK-31511 > URL: https://issues.apache.org/jira/browse/SPARK-31511 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, > 3.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: correctness > Fix For: 2.4.7, 3.1.0, 3.0.2 > > > BytesToBytesMap currently has a thread safe and unsafe iterator method. This > is somewhat confusing. We should make iterator() thread safe and remove the > safeIterator() function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
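The general technique the ticket describes (one `iterator()` that is always safe, instead of parallel safe/unsafe variants) can be sketched outside Spark; this is a hypothetical Python analogue of snapshotting under a lock, not the Java `BytesToBytesMap` implementation:

```python
import threading

# Toy analogue: a map whose iterator() is made thread-safe by snapshotting
# entries under a lock, in the spirit of folding safeIterator() behaviour
# into iterator() itself.
class SnapshotMap:
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def iterator(self):
        # Copy under the lock so concurrent put() calls cannot corrupt the
        # iteration; the returned iterator can then be consumed without
        # holding the lock.
        with self._lock:
            snapshot = list(self._data.items())
        return iter(snapshot)

m = SnapshotMap()
for i in range(5):
    m.put(i, i * i)
items = list(m.iterator())
print(items)
```

Callers never need to choose the "safe" variant, which removes the confusion the ticket calls out.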
[jira] [Resolved] (SPARK-32736) Avoid caching the removed decommissioned executors in TaskSchedulerImpl
[ https://issues.apache.org/jira/browse/SPARK-32736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32736. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29579 [https://github.com/apache/spark/pull/29579] > Avoid caching the removed decommissioned executors in TaskSchedulerImpl > --- > > Key: SPARK-32736 > URL: https://issues.apache.org/jira/browse/SPARK-32736 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0 > > > We can save the host directly in the ExecutorDecommissionState. Therefore, > when the executor lost, we could unregister the shuffle map status on the > host. Thus, we don't need to hold the cache to wait for FetchFailureException > to do the unregister. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32764: -- Fix Version/s: (was: 3.0.1) 3.0.2 > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > Fix For: 3.1.0, 3.0.2 > > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32764. --- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 29647 [https://github.com/apache/spark/pull/29647] > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > Fix For: 3.0.1, 3.1.0 > > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32764) compare of -0.0 < 0.0 return true
[ https://issues.apache.org/jira/browse/SPARK-32764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32764: - Assignee: Wenchen Fan > compare of -0.0 < 0.0 return true > - > > Key: SPARK-32764 > URL: https://issues.apache.org/jira/browse/SPARK-32764 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Izek Greenfield >Assignee: Wenchen Fan >Priority: Major > Labels: correctness > Attachments: 2.4_codegen.txt, 3.0_Codegen.txt > > > {code:scala} > val spark: SparkSession = SparkSession > .builder() > .master("local") > .appName("SparkByExamples.com") > .getOrCreate() > spark.sparkContext.setLogLevel("ERROR") > import spark.sqlContext.implicits._ > val df = Seq((-0.0, 0.0)).toDF("neg", "pos") > .withColumn("comp", col("neg") < col("pos")) > df.show(false) > == > ++---++ > |neg |pos|comp| > ++---++ > |-0.0|0.0|true| > ++---++{code} > I think that result should be false. > **Apache Spark 2.4.6 RESULT** > {code} > scala> spark.version > res0: String = 2.4.6 > scala> Seq((-0.0, 0.0)).toDF("neg", "pos").withColumn("comp", col("neg") < > col("pos")).show > ++---+-+ > | neg|pos| comp| > ++---+-+ > |-0.0|0.0|false| > ++---+-+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
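For reference, IEEE 754 (and plain Python) treats -0.0 and 0.0 as equal in comparisons even though their bit patterns differ, which is the Spark 2.4.6 behaviour the fix restores; a quick standalone check:

```python
import math
import struct

# Comparisons: negative zero equals positive zero under IEEE 754,
# so -0.0 < 0.0 must be false (the expected Spark result).
print(-0.0 < 0.0)   # False
print(-0.0 == 0.0)  # True

# The two values are nonetheless distinguishable by their sign bit,
# which is why naive byte-wise comparisons of doubles can go wrong.
print(math.copysign(1.0, -0.0))                           # -1.0
print(struct.pack(">d", -0.0) == struct.pack(">d", 0.0))  # False
```

The distinct bit patterns are the usual source of such bugs: comparing the raw encoded bytes orders -0.0 before 0.0 even though the values are equal.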
[jira] [Resolved] (SPARK-32796) Make withField API support nested struct in array
[ https://issues.apache.org/jira/browse/SPARK-32796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-32796. - Resolution: Won't Fix > Make withField API support nested struct in array > - > > Key: SPARK-32796 > URL: https://issues.apache.org/jira/browse/SPARK-32796 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Currently {{Column.withField}} only supports {{StructType}}. For nested > struct in {{ArrayType}}, it doesn't support. We can support {{ArrayType}} to > make the API more general and useful. > It could significantly simplify how we add/replace deeply nested fields. > For example, > {code} > private lazy val arrayType = ArrayType( > StructType(Seq( > StructField("a", IntegerType, nullable = false), > StructField("b", IntegerType, nullable = true), > StructField("c", IntegerType, nullable = false))), > containsNull = true) > private lazy val arrayStructArrayLevel1: DataFrame = spark.createDataFrame( > sparkContext.parallelize(Row(Array(Row(Array(Row(1, null, 3)), null, 3))) > :: Nil), > StructType( > Seq(StructField("a", ArrayType( > StructType(Seq( > StructField("a", arrayType, nullable = false), > StructField("b", IntegerType, nullable = true), > StructField("c", IntegerType, nullable = false))), > containsNull = false) > {code} > The data looks like: > {code} > +---+ > |a | > +---+ > |[{[{1, null, 3}], null, 3}]| > +---+ > {code} > In order to replace deeply nested b column, like: > {code} > ++ > |a | > ++ > |[{[{1, 2, 3}], null, 3}]| > ++ > {code} > Currently by using transform + withField, we probably need: > {code} > arrayStructArrayLevel1.withColumn("a", > transform($"a", _.withField("a", > flatten(transform($"a.a", transform(_, _.withField("b", lit(2 > {code} > Using modified withField, we can do it like: > {code} > arrayStructArrayLevel1.withColumn("a", $"a".withField("a.b", lit(2))) > {code} -- This 
message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
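The convenience the ticket asks for can be illustrated outside Spark with a pure-Python analogue. The helper below (`with_field`, a hypothetical name, not the PySpark API) applies a dotted field path through nested lists of dicts, mirroring how `$"a".withField("a.b", lit(2))` would reach through `ArrayType` levels:

```python
# Hypothetical pure-Python analogue of withField on nested arrays of structs:
# lists play the role of ArrayType, dicts the role of StructType rows.
def with_field(value, path, new_value):
    if isinstance(value, list):
        # Descend into every array element, like transform() on ArrayType.
        return [with_field(v, path, new_value) for v in value]
    head, _, rest = path.partition(".")
    out = dict(value)  # copy, leaving the original row untouched
    out[head] = new_value if not rest else with_field(value[head], rest, new_value)
    return out

# The ticket's example row: [{[{1, null, 3}], null, 3}]
row = [{"a": [{"a": 1, "b": None, "c": 3}], "b": None, "c": 3}]
print(with_field(row, "a.b", 2))
```

One dotted path replaces the explicit `transform` + `flatten` nesting shown in the ticket.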
[jira] [Comment Edited] (SPARK-32354) Fix & re-enable failing R K8s tests
[ https://issues.apache.org/jira/browse/SPARK-32354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191929#comment-17191929 ] Dongjoon Hyun edited comment on SPARK-32354 at 9/8/20, 3:29 AM: As I reported during 3.0.1 RC, `dev/make-distribution.sh` doesn't work for me on my environment. It failed when I enabled R. So, I couldn't even run R IT test, [~hyukjin.kwon]. was (Author: dongjoon): As I reported during 3.0.1 RC, `dev/make-distribution.sh` doesn't work for me on my environment. It failed when I enabled R. So, I couldn't even run R IT test. > Fix & re-enable failing R K8s tests > --- > > Key: SPARK-32354 > URL: https://issues.apache.org/jira/browse/SPARK-32354 > Project: Spark > Issue Type: Bug > Components: Kubernetes, R >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Blocker > > The R K8s integration tests have been failing for extended period of time. > Some of it was waiting for the R version to upgrade, but now that it has been > upgraded it is still failing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32354) Fix & re-enable failing R K8s tests
[ https://issues.apache.org/jira/browse/SPARK-32354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191929#comment-17191929 ] Dongjoon Hyun commented on SPARK-32354: --- As I reported during 3.0.1 RC, `dev/make-distribution.sh` doesn't work for me on my environment. It failed when I enabled R. So, I couldn't even run R IT test. > Fix & re-enable failing R K8s tests > --- > > Key: SPARK-32354 > URL: https://issues.apache.org/jira/browse/SPARK-32354 > Project: Spark > Issue Type: Bug > Components: Kubernetes, R >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Blocker > > The R K8s integration tests have been failing for extended period of time. > Some of it was waiting for the R version to upgrade, but now that it has been > upgraded it is still failing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe
[ https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31511: Fix Version/s: 2.4.7 > Make BytesToBytesMap iterator() thread-safe > --- > > Key: SPARK-31511 > URL: https://issues.apache.org/jira/browse/SPARK-31511 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, > 3.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Labels: correctness > Fix For: 2.4.7, 3.1.0, 3.0.2 > > > BytesToBytesMap currently has a thread safe and unsafe iterator method. This > is somewhat confusing. We should make iterator() thread safe and remove the > safeIterator() function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32812) Run tests script for Python fails in certain environments
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32812: Assignee: Haejoon Lee > Run tests script for Python fails in certain environments > - > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32812) Run tests script for Python fails in certain environments
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32812. -- Fix Version/s: 2.4.8 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 29666 [https://github.com/apache/spark/pull/29666] > Run tests script for Python fails in certain environments > - > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.1.0, 3.0.2, 2.4.8 > > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
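The quoted error is the standard CPython complaint when the "spawn" start method re-imports the main module and hits unguarded process creation; the usual fix is the `if __name__ == '__main__'` idiom. A generic sketch of that idiom (unrelated to Spark's own run-tests script):

```python
import multiprocessing

def square(x):
    return x * x

def run_pool():
    # Creating worker processes must only happen under the guard below:
    # with the "spawn" start method each child re-imports this module, and
    # unguarded process creation would recurse, raising the RuntimeError
    # quoted in the ticket.
    with multiprocessing.Pool(2) as pool:
        return pool.map(square, range(5))

if __name__ == "__main__":
    multiprocessing.freeze_support()  # no-op unless frozen to an executable
    print(run_pool())
```

Module-level code runs again in every spawned child, so anything that starts processes belongs inside the guard.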
[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32813: Affects Version/s: 3.1.0 > Reading parquet rdd in non columnar mode fails in multithreaded environment > --- > > Key: SPARK-32813 > URL: https://issues.apache.org/jira/browse/SPARK-32813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 > Environment: Spark 3.0.0, Scala 2.12.12 >Reporter: Vladimir Klyushnikov >Priority: Major > > Reading a parquet rdd in non-columnar mode (i.e. with list fields) fails if the Spark session was created in one thread and the rdd is read in another, because the InheritableThreadLocal holding the active session is not propagated. The code below worked perfectly in Spark 2.X, but fails in Spark 3: > {code:scala} > import java.util.concurrent.Executors > import org.apache.spark.sql.SparkSession > import scala.concurrent.{Await, ExecutionContext, Future} > import scala.concurrent.duration._ > object Main { > final case class Data(list: List[Int]) > def main(args: Array[String]): Unit = { > val executor1 = Executors.newSingleThreadExecutor() > val executor2 = Executors.newSingleThreadExecutor() > try { > val ds = Await.result(Future { > val session = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import session.implicits._ > val path = "test.parquet" > session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path) > session.read.parquet(path).as[Data] > }(ExecutionContext.fromExecutorService(executor1)), 1.minute) > Await.result(Future { > ds.rdd.collect().foreach(println(_)) > }(ExecutionContext.fromExecutorService(executor2)), 1.minute) > } finally { > executor1.shutdown() > executor2.shutdown() > } > } > } > {code} > This code fails with the following exception: > {code} > Exception in thread "main" java.util.NoSuchElementException: > None.getException in thread "main" java.util.NoSuchElementException: None.get > at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178) > at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) > at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at > org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
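The root cause generalizes beyond Spark: Java's InheritableThreadLocal is copied only at thread *creation* time, so a value set afterwards (or set on a different thread) is invisible to pre-existing executor threads. A small Python illustration of the same pitfall using `threading.local`, which does not propagate at all:

```python
import threading

# Stand-in for the "active session" thread-local in the ticket.
active_session = threading.local()

def read_session(results):
    # Runs on a different thread: the attribute set on the main thread's
    # local() is not visible here, mirroring the None.get failure when
    # FileSourceScanExec looks up the active session.
    results.append(getattr(active_session, "value", None))

active_session.value = "session-from-main-thread"

results = []
t = threading.Thread(target=read_session, args=(results,))
t.start()
t.join()

print(results)  # [None]: the "session" did not propagate to the worker thread
```

This is why the fix must capture the session explicitly rather than rely on thread-local lookup from whichever thread happens to run the collect.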
[jira] [Assigned] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32813: Assignee: Apache Spark > Reading parquet rdd in non columnar mode fails in multithreaded environment > --- > > Key: SPARK-32813 > URL: https://issues.apache.org/jira/browse/SPARK-32813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0.0, Scala 2.12.12 >Reporter: Vladimir Klyushnikov >Assignee: Apache Spark >Priority: Major > > Reading parquet rdd in non columnar mode (i.e. with list fields) if Spark > session was created in one thread and rdd is being read in another - so > InheritableThreadLocal with active session is not propagated. Code below was > working perfectly in Spark 2.X, but fails in Spark 3 > {code:scala} > import java.util.concurrent.Executors > import org.apache.spark.sql.SparkSession > import scala.concurrent.{Await, ExecutionContext, Future} > import scala.concurrent.duration._ > object Main { > final case class Data(list: List[Int]) > def main(args: Array[String]): Unit = { > val executor1 = Executors.newSingleThreadExecutor() > val executor2 = Executors.newSingleThreadExecutor() > try { > val ds = Await.result(Future { > val session = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import session.implicits._ > val path = "test.parquet" > session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path) > session.read.parquet(path).as[Data] > }(ExecutionContext.fromExecutorService(executor1)), 1.minute) > Await.result(Future { > ds.rdd.collect().foreach(println(_)) > }(ExecutionContext.fromExecutorService(executor2)), 1.minute) > } finally { > executor1.shutdown() > executor2.shutdown() > } > } > } > {code} > This code fails with following exception: > {code} > Exception in thread "main" java.util.NoSuchElementException: > None.getException in thread "main" java.util.NoSuchElementException: None.get > at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178) > at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) > at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at > org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191904#comment-17191904 ] Apache Spark commented on SPARK-32813: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/29667 > Reading parquet rdd in non columnar mode fails in multithreaded environment > --- > > Key: SPARK-32813 > URL: https://issues.apache.org/jira/browse/SPARK-32813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0.0, Scala 2.12.12 >Reporter: Vladimir Klyushnikov >Priority: Major > > Reading parquet rdd in non columnar mode (i.e. with list fields) if Spark > session was created in one thread and rdd is being read in another - so > InheritableThreadLocal with active session is not propagated. Code below was > working perfectly in Spark 2.X, but fails in Spark 3 > {code:scala} > import java.util.concurrent.Executors > import org.apache.spark.sql.SparkSession > import scala.concurrent.{Await, ExecutionContext, Future} > import scala.concurrent.duration._ > object Main { > final case class Data(list: List[Int]) > def main(args: Array[String]): Unit = { > val executor1 = Executors.newSingleThreadExecutor() > val executor2 = Executors.newSingleThreadExecutor() > try { > val ds = Await.result(Future { > val session = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import session.implicits._ > val path = "test.parquet" > session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path) > session.read.parquet(path).as[Data] > }(ExecutionContext.fromExecutorService(executor1)), 1.minute) > Await.result(Future { > ds.rdd.collect().foreach(println(_)) > }(ExecutionContext.fromExecutorService(executor2)), 1.minute) > } finally { > executor1.shutdown() > executor2.shutdown() > } > } > } > {code} > This code fails with following exception: > {code} > Exception in thread "main" java.util.NoSuchElementException: > 
None.getException in thread "main" java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178) > at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) > at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at > org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32813: Assignee: (was: Apache Spark) > Reading parquet rdd in non columnar mode fails in multithreaded environment > --- > > Key: SPARK-32813 > URL: https://issues.apache.org/jira/browse/SPARK-32813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0.0, Scala 2.12.12 >Reporter: Vladimir Klyushnikov >Priority: Major > > Reading parquet rdd in non columnar mode (i.e. with list fields) if Spark > session was created in one thread and rdd is being read in another - so > InheritableThreadLocal with active session is not propagated. Code below was > working perfectly in Spark 2.X, but fails in Spark 3 > {code:scala} > import java.util.concurrent.Executors > import org.apache.spark.sql.SparkSession > import scala.concurrent.{Await, ExecutionContext, Future} > import scala.concurrent.duration._ > object Main { > final case class Data(list: List[Int]) > def main(args: Array[String]): Unit = { > val executor1 = Executors.newSingleThreadExecutor() > val executor2 = Executors.newSingleThreadExecutor() > try { > val ds = Await.result(Future { > val session = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import session.implicits._ > val path = "test.parquet" > session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path) > session.read.parquet(path).as[Data] > }(ExecutionContext.fromExecutorService(executor1)), 1.minute) > Await.result(Future { > ds.rdd.collect().foreach(println(_)) > }(ExecutionContext.fromExecutorService(executor2)), 1.minute) > } finally { > executor1.shutdown() > executor2.shutdown() > } > } > } > {code} > This code fails with following exception: > {code} > Exception in thread "main" java.util.NoSuchElementException: > None.getException in thread "main" java.util.NoSuchElementException: None.get > at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178) > at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) > at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at > org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
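Editor's note: the failure mode described above — an active session registered in one thread being invisible to another — can be illustrated with a minimal, hypothetical plain-Python sketch using `threading.local`, which behaves analogously to the InheritableThreadLocal holding the active SparkSession (both threads here are created from a parent that never set the value, matching the two-executor scenario in the report). The names `active_session`, `create_session`, and `read_rdd` are illustrative only, not Spark APIs.

```python
import threading

# Per-thread "active session" slot, analogous to the thread-local that
# FileSourceScanExec consults before hitting None.get.
active_session = threading.local()

def create_session():
    # Set only in this thread; other threads never see it.
    active_session.value = "session"

def read_rdd(results):
    # A second thread looks up the active session and finds nothing,
    # mirroring the NoSuchElementException in the stack trace above.
    results.append(getattr(active_session, "value", None))

t1 = threading.Thread(target=create_session)
t1.start(); t1.join()

results = []
t2 = threading.Thread(target=read_rdd, args=(results,))
t2.start(); t2.join()

assert results == [None]  # the reader thread never saw the session
```

This is why the Scala reproducer works in a single thread but fails when `ds.rdd.collect()` runs on a different executor's thread.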
[jira] [Resolved] (SPARK-32186) Development - Debugging
[ https://issues.apache.org/jira/browse/SPARK-32186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32186. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29639 [https://github.com/apache/spark/pull/29639] > Development - Debugging > --- > > Key: SPARK-32186 > URL: https://issues.apache.org/jira/browse/SPARK-32186 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > 1. Python Profiler: > https://spark.apache.org/docs/2.3.0/api/python/_modules/pyspark/profiler.html, > > {code} > >>> sc._conf.set("spark.python.profile", "true") > >>> rdd = sc.parallelize(range(100)).map(str) > >>> rdd.count() > 100 > >>> sc.show_profiles() > {code} > 2. Python Debugger, e.g.) > https://www.jetbrains.com/help/pycharm/remote-debugging-with-product.html#remote-interpreter > \(?\) > 3. Monitoring Python Workers: {{top}}, `{{ls}}`, etc. \(?\) > 4. PyCharm setup \(?\) (at https://issues.apache.org/jira/browse/SPARK-32189) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
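Editor's note: the `spark.python.profile` mechanism referenced in item 1 above is built on the standard-library `cProfile`/`pstats` modules. A minimal standalone sketch of the same idea (the `work` function is a hypothetical stand-in for a profiled `rdd.map(str)` stage, not PySpark's actual implementation):

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for the rdd.map(str) stage being profiled.
    return [str(i) for i in range(100)]

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# sc.show_profiles() prints aggregated per-function stats much like this:
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
assert "function calls" in report
```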
[jira] [Updated] (SPARK-32812) Run tests script for Python fails in certain environments
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-32812: Summary: Run tests script for Python fails in certain environments (was: Run tests script for Python fails on local environment) > Run tests script for Python fails in certain environments > - > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Priority: Major > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
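Editor's note: the RuntimeError quoted above is raised by Python's `multiprocessing` module when the "spawn" start method re-imports the main module and immediately re-runs process-creating code. A minimal sketch of the guard idiom the error message asks for (the names `square` and `run_pool` are illustrative, not part of the run-tests script):

```python
import multiprocessing as mp

def square(x):
    # Worker functions must live at module top level so spawned child
    # processes can locate them after re-importing __main__.
    return x * x

def run_pool():
    with mp.Pool(processes=2) as pool:
        return pool.map(square, range(5))

if __name__ == "__main__":
    # All process-spawning code stays behind this guard; without it,
    # each spawned child re-executes the spawning code on import and
    # raises the RuntimeError quoted above.
    print(run_pool())
```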
[jira] [Commented] (SPARK-32812) Run tests script for Python fails on local environment
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191899#comment-17191899 ] Apache Spark commented on SPARK-32812: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/29666 > Run tests script for Python fails on local environment > -- > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Priority: Major > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32812) Run tests script for Python fails on local environment
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32812: Assignee: (was: Apache Spark) > Run tests script for Python fails on local environment > -- > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Priority: Major > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32812) Run tests script for Python fails on local environment
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191898#comment-17191898 ] Apache Spark commented on SPARK-32812: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/29666 > Run tests script for Python fails on local environment > -- > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Priority: Major > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32812) Run tests script for Python fails on local environment
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32812: Assignee: Apache Spark > Run tests script for Python fails on local environment > -- > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32810. -- Fix Version/s: 2.4.8 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 29659 [https://github.com/apache/spark/pull/29659] > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.1.0, 3.0.2, 2.4.8 > > > The problem is that when the user doesn't specify the schema when reading a > CSV table, The CSV file format and data source needs to infer schema, and it > does so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelationtries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
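Editor's note: the mismatch described above — a resolved path being fed back into a layer that applies glob semantics — can be illustrated outside Spark with the standard-library `fnmatch` and `glob` modules (this is a standalone illustration of the double-globbing hazard, not Spark's actual code path):

```python
import fnmatch
import glob

# "[abc].csv" is a glob character class: it matches a.csv, b.csv, or
# c.csv, not a literal file named "[abc].csv".
assert fnmatch.fnmatch("a.csv", "[abc].csv")
assert not fnmatch.fnmatch("[abc].csv", "[abc].csv")

# Re-interpreting an already-resolved path as a pattern is the bug; a
# resolved path must be passed verbatim, e.g. escaped before any layer
# that insists on glob semantics.
escaped = glob.escape("[abc].csv")
assert escaped == "[[]abc].csv"
assert fnmatch.fnmatch("[abc].csv", escaped)
```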
[jira] [Resolved] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32798. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29657 [https://github.com/apache/spark/pull/29657] > Make unionByName optionally fill missing columns with nulls in PySpark > -- > > Key: SPARK-32798 > URL: https://issues.apache.org/jira/browse/SPARK-32798 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Haejoon Lee >Priority: Major > Labels: starter > Fix For: 3.1.0 > > > It would be better to expose {{unionByName}} parameter in Python APIs as > well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32798: Assignee: Haejoon Lee > Make unionByName optionally fill missing columns with nulls in PySpark > -- > > Key: SPARK-32798 > URL: https://issues.apache.org/jira/browse/SPARK-32798 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Haejoon Lee >Priority: Major > Labels: starter > > It would be better to expose {{unionByName}} parameter in Python APIs as > well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
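Editor's note: the semantics being exposed to Python (the `allowMissingColumns` flag added to `unionByName` in the Scala/Java APIs by SPARK-29358) can be sketched in plain Python, with rows as dicts and missing columns filled with nulls. This is an illustration of the intended behavior only, not the PySpark implementation:

```python
def union_by_name(rows1, rows2, allow_missing_columns=False):
    # Rows are dicts; when the flag is set, columns absent from one side
    # are filled with None instead of raising.
    cols1 = set().union(*(r.keys() for r in rows1))
    cols2 = set().union(*(r.keys() for r in rows2))
    if cols1 != cols2 and not allow_missing_columns:
        raise ValueError("column mismatch")
    all_cols = sorted(cols1 | cols2)
    return [{c: r.get(c) for c in all_cols} for r in rows1 + rows2]

df1 = [{"a": 1, "b": 2}]
df2 = [{"b": 3, "c": 4}]
merged = union_by_name(df1, df2, allow_missing_columns=True)
assert merged == [{"a": 1, "b": 2, "c": None},
                  {"a": None, "b": 3, "c": 4}]
```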
[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32813: Priority: Major (was: Blocker) > Reading parquet rdd in non columnar mode fails in multithreaded environment > --- > > Key: SPARK-32813 > URL: https://issues.apache.org/jira/browse/SPARK-32813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0.0, Scala 2.12.12 >Reporter: Vladimir Klyushnikov >Priority: Major > > Reading parquet rdd in non columnar mode (i.e. with list fields) if Spark > session was created in one thread and rdd is being read in another - so > InheritableThreadLocal with active session is not propagated. Code below was > working perfectly in Spark 2.X, but fails in Spark 3 > {code:scala} > import java.util.concurrent.Executors > import org.apache.spark.sql.SparkSession > import scala.concurrent.{Await, ExecutionContext, Future} > import scala.concurrent.duration._ > object Main { > final case class Data(list: List[Int]) > def main(args: Array[String]): Unit = { > val executor1 = Executors.newSingleThreadExecutor() > val executor2 = Executors.newSingleThreadExecutor() > try { > val ds = Await.result(Future { > val session = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import session.implicits._ > val path = "test.parquet" > session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path) > session.read.parquet(path).as[Data] > }(ExecutionContext.fromExecutorService(executor1)), 1.minute) > Await.result(Future { > ds.rdd.collect().foreach(println(_)) > }(ExecutionContext.fromExecutorService(executor2)), 1.minute) > } finally { > executor1.shutdown() > executor2.shutdown() > } > } > } > {code} > This code fails with following exception: > {code} > Exception in thread "main" java.util.NoSuchElementException: > None.getException in thread "main" java.util.NoSuchElementException: None.get > at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178) > at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) > at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at > org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191882#comment-17191882 ] Apache Spark commented on SPARK-32753: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/29665 > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Major > Fix For: 3.1.0 > > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32138) Drop Python 2, 3.4 and 3.5 in codes and documentation
[ https://issues.apache.org/jira/browse/SPARK-32138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191876#comment-17191876 ] Apache Spark commented on SPARK-32138: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/29664 > Drop Python 2, 3.4 and 3.5 in codes and documentation > - > > Key: SPARK-32138 > URL: https://issues.apache.org/jira/browse/SPARK-32138 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Klyushnikov updated SPARK-32813: - Description: Reading parquet rdd in non columnar mode (i.e. with list fields) if Spark session was created in one thread and rdd is being read in another - so InheritableThreadLocal with active session is not propagated. Code below was working perfectly in Spark 2.X, but fails in Spark 3 {code:scala} import java.util.concurrent.Executors import org.apache.spark.sql.SparkSession import scala.concurrent.{Await, ExecutionContext, Future} import scala.concurrent.duration._ object Main { final case class Data(list: List[Int]) def main(args: Array[String]): Unit = { val executor1 = Executors.newSingleThreadExecutor() val executor2 = Executors.newSingleThreadExecutor() try { val ds = Await.result(Future { val session = SparkSession.builder().appName("test").master("local[*]").getOrCreate() import session.implicits._ val path = "test.parquet" session.createDataset(Data(1 :: Nil) :: Nil).write.parquet(path) session.read.parquet(path).as[Data] }(ExecutionContext.fromExecutorService(executor1)), 1.minute) Await.result(Future { ds.rdd.collect().foreach(println(_)) }(ExecutionContext.fromExecutorService(executor2)), 1.minute) } finally { executor1.shutdown() executor2.shutdown() } } } {code} This code fails with following exception: {code} Exception in thread "main" java.util.NoSuchElementException: None.getException in thread "main" java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178) at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176) at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196) {code} was: Reading parquet rdd in non columnar mode (i.e. with list fields) if Spark session was created in one thread and rdd is being read in another - so InheritableThreadLocal with active session is not propagated. 
Code below was working perfectly in Spark 2.X, but fails in Spark 3 {code:scala} import java.util.concurrent.Executors import org.apache.spark.sql.SparkSession import scala.concurrent.{Await, ExecutionContext, Future} import scala.concurrent.duration._ object Main { final case class Data(list: List[Int]) def main(args: Array[String]): Unit = { val executor1 = Executors.newSingleThreadExecutor() val executor2 = Executors.newSingleThreadExecutor() try { val ds = Await.result(Future { val session = SparkSession.builder().appName("test").master("local[*]").getOrCreate() import session.implicits._ session.createDataset(Data(1 :: Nil) :: Nil).write.parquet("test.parquet") session.read.parquet("test.parquet").as[Data] }(ExecutionContext.fromExecutorService(executor1)), 1.minute) Await.result(Future { ds.rdd.collect().foreach(println(_)) }(ExecutionContext.fromExecutorService(executor2)), 1.minute) } finally { executor1.shutdown() executor2.shutdown() } } } {code} This code fails with following exception: {code} Exception in thread "main" java.util.NoSuchElementException: None.getException in thread "main" java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at
[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191822#comment-17191822 ] Apache Spark commented on SPARK-32810: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29663 > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The problem is that when the user doesn't specify the schema when reading a > CSV table, The CSV file format and data source needs to infer schema, and it > does so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelationtries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32603) CREATE/REPLACE TABLE AS SELECT not support multi-part identifiers
[ https://issues.apache.org/jira/browse/SPARK-32603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191813#comment-17191813 ] Huaxin Gao commented on SPARK-32603: I thought CREATE TABLE syntax will be unified for hive and datasource? Anyway, I opened this ticket to track the problems I hit when testing V2 JDBC. If in the future we decide not to support this syntax, I will close this jira. > CREATE/REPLACE TABLE AS SELECT not support multi-part identifiers > - > > Key: SPARK-32603 > URL: https://issues.apache.org/jira/browse/SPARK-32603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > {code:java} > == SQL == > CREATE TABLE h2.test.abc AS SELECT * FROM h2.test.people > ^^^ > org.apache.spark.sql.catalyst.parser.ParseException: > Operation not allowed: CREATE TABLE ... STORED AS ... does not support > multi-part identifiers > {code} > {code:java} > == SQL == > CREATE OR REPLACE TABLE h2.test.abc AS SELECT 1 as col > ^^^ > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'AS' expecting {'(', 'USING'} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
[ https://issues.apache.org/jira/browse/SPARK-32813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Klyushnikov updated SPARK-32813: - Environment: Spark 3.0.0, Scala 2.12.12 > Reading parquet rdd in non columnar mode fails in multithreaded environment > --- > > Key: SPARK-32813 > URL: https://issues.apache.org/jira/browse/SPARK-32813 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0.0, Scala 2.12.12 >Reporter: Vladimir Klyushnikov >Priority: Blocker > > Reading parquet rdd in non columnar mode (i.e. with list fields) if Spark > session was created in one thread and rdd is being read in another - so > InheritableThreadLocal with active session is not propagated. Code below was > working perfectly in Spark 2.X, but fails in Spark 3 > {code:scala} > import java.util.concurrent.Executors > import org.apache.spark.sql.SparkSession > import scala.concurrent.{Await, ExecutionContext, Future} > import scala.concurrent.duration._ > object Main { > final case class Data(list: List[Int]) > def main(args: Array[String]): Unit = { > val executor1 = Executors.newSingleThreadExecutor() > val executor2 = Executors.newSingleThreadExecutor() > try { > val ds = Await.result(Future { > val session = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import session.implicits._ > session.createDataset(Data(1 :: Nil) :: > Nil).write.parquet("test.parquet") > session.read.parquet("test.parquet").as[Data] > }(ExecutionContext.fromExecutorService(executor1)), 1.minute) > Await.result(Future { > ds.rdd.collect().foreach(println(_)) > }(ExecutionContext.fromExecutorService(executor2)), 1.minute) > } finally { > executor1.shutdown() > executor2.shutdown() > } > } > } > {code} > This code fails with following exception: > {code} > Exception in thread "main" java.util.NoSuchElementException: > None.getException in thread "main" java.util.NoSuchElementException: None.get > at 
scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178) > at > org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) > at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at > org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32813) Reading parquet rdd in non columnar mode fails in multithreaded environment
Vladimir Klyushnikov created SPARK-32813: Summary: Reading parquet rdd in non columnar mode fails in multithreaded environment Key: SPARK-32813 URL: https://issues.apache.org/jira/browse/SPARK-32813 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Vladimir Klyushnikov Reading a parquet RDD in non-columnar mode (i.e. with list fields) fails if the Spark session was created in one thread and the RDD is read in another, because the InheritableThreadLocal holding the active session is not propagated. The code below worked in Spark 2.X, but fails in Spark 3: {code:scala}
import java.util.concurrent.Executors

import org.apache.spark.sql.SparkSession

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object Main {
  final case class Data(list: List[Int])

  def main(args: Array[String]): Unit = {
    val executor1 = Executors.newSingleThreadExecutor()
    val executor2 = Executors.newSingleThreadExecutor()
    try {
      val ds = Await.result(Future {
        val session = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
        import session.implicits._
        session.createDataset(Data(1 :: Nil) :: Nil).write.parquet("test.parquet")
        session.read.parquet("test.parquet").as[Data]
      }(ExecutionContext.fromExecutorService(executor1)), 1.minute)
      Await.result(Future {
        ds.rdd.collect().foreach(println(_))
      }(ExecutionContext.fromExecutorService(executor2)), 1.minute)
    } finally {
      executor1.shutdown()
      executor2.shutdown()
    }
  }
}
{code} This code fails with the following exception: {noformat} Exception in thread "main" java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion$lzycompute(DataSourceScanExec.scala:178) at org.apache.spark.sql.execution.FileSourceScanExec.needsUnsafeRowConversion(DataSourceScanExec.scala:176) at 
org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:462) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:96) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121) at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3198) at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3196){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32753: --- Assignee: Manu Zhang > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Major > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
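The intended semantics of the query above ("group by id distribute by id") are: deduplicate first, then hash-partition the distinct keys, which cannot introduce duplicates. A small Python sketch of that contract follows; plain lists stand in for shuffle partitions, and this is an illustration of the expected result, not Spark's implementation.

```python
def dedupe_then_partition(rows, num_partitions=10):
    # "group by id" with no aggregate functions is just distinct
    distinct = set(rows)
    # "distribute by id" hash-partitions on the key
    partitions = [[] for _ in range(num_partitions)]
    for row in distinct:
        partitions[hash(row) % num_partitions].append(row)
    return partitions

# spark.range(10).union(spark.range(10)): every id appears twice in the input
rows = list(range(10)) + list(range(10))
collected = sorted(x for part in dedupe_then_partition(rows) for x in part)
print(collected)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] -- each id exactly once
```

The AQE output in the report shows each id twice, i.e. the deduplicating aggregate effectively ran only before the shuffle and was not re-applied after it, which is what the fix addresses.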
[jira] [Resolved] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32753. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29593 [https://github.com/apache/spark/pull/29593] > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Major > Fix For: 3.1.0 > > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191742#comment-17191742 ] Apache Spark commented on SPARK-32810: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29662 > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The problem is that when the user doesn't specify the schema when reading a > CSV table, the CSV file format and data source need to infer the schema, and > they do so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelation tries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
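The double-globbing problem is easy to reproduce with Python's standard library, used here purely as an analogy for the DataSource.paths contract: once a glob has resolved to a file literally named `[abc].csv`, feeding that name back into a glob-expecting API matches the wrong files unless it is escaped first.

```python
import fnmatch
import glob

# A file literally named "[abc].csv", e.g. produced by an earlier glob expansion.
literal_path = "[abc].csv"

# Re-interpreted as a glob pattern, it matches one-character names, not itself:
print(fnmatch.fnmatch("a.csv", literal_path))        # True  -- wrong file matched
print(fnmatch.fnmatch(literal_path, literal_path))   # False -- the real file is missed

# An already-resolved path must be escaped before being handed to an API
# that treats its input as a glob pattern:
escaped = glob.escape(literal_path)                  # "[[]abc].csv"
print(fnmatch.fnmatch(literal_path, escaped))        # True
```

This mirrors the diagram in the report: the second pass through DataSource.resolveRelation treats an already-resolved path as a fresh pattern.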
[jira] [Updated] (SPARK-32812) Run tests script for Python fails on local environment
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32812: - Component/s: PySpark > Run tests script for Python fails on local environment > -- > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Priority: Major > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32812) Run tests script for Python fails on local environment
[ https://issues.apache.org/jira/browse/SPARK-32812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32812: - Component/s: (was: PySpark) Tests > Run tests script for Python fails on local environment > -- > > Key: SPARK-32812 > URL: https://issues.apache.org/jira/browse/SPARK-32812 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.1.0 >Reporter: Haejoon Lee >Priority: Major > > When running PySpark test in the local environment with "python/run-tests" > command, the following error could occur. > {code} > Traceback (most recent call last): > File "", line 1, in > ... > raise RuntimeError(''' > RuntimeError: > An attempt has been made to start a new process before the > current process has finished its bootstrapping phase. > This probably means that you are not using fork to start your > child processes and you have forgotten to use the proper idiom > in the main module: > if __name__ == '__main__': > freeze_support() > ... > The "freeze_support()" line can be omitted if the program > is not going to be frozen to produce an executable. > Traceback (most recent call last): > ... > raise EOFError > EOFError > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32812) Run tests script for Python fails on local environment
Haejoon Lee created SPARK-32812: --- Summary: Run tests script for Python fails on local environment Key: SPARK-32812 URL: https://issues.apache.org/jira/browse/SPARK-32812 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: Haejoon Lee When running the PySpark tests in a local environment with the "python/run-tests" command, the following error can occur. {code} Traceback (most recent call last): File "", line 1, in ... raise RuntimeError(''' RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module: if __name__ == '__main__': freeze_support() ... The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable. Traceback (most recent call last): ... raise EOFError EOFError {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
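This RuntimeError comes from Python's multiprocessing when worker processes are started with the "spawn" method: each child re-imports the main module, so any unguarded process creation at module top level recurses. The standard idiom the error message points to is sketched below as a standalone script; this is an illustration of the guard, not the run-tests code itself.

```python
import multiprocessing as mp

def square(x):
    return x * x

def run_pool():
    # "spawn" re-imports this module in every worker, so process creation
    # must only happen under the __main__ guard below.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(square, [1, 2, 3])

if __name__ == "__main__":
    # Without this guard, each spawned child would re-execute run_pool()
    # on import and raise the bootstrapping RuntimeError shown above.
    print(run_pool())  # [1, 4, 9]
```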
[jira] [Commented] (SPARK-32811) Replace IN predicate of continuous range with boundary checks
[ https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191645#comment-17191645 ] Apache Spark commented on SPARK-32811: -- User 'hotienvu' has created a pull request for this issue: https://github.com/apache/spark/pull/29661 > Replace IN predicate of continuous range with boundary checks > - > > Key: SPARK-32811 > URL: https://issues.apache.org/jira/browse/SPARK-32811 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Vu Ho >Priority: Major > > This expression > {code:java} > select a from t where a in (1, 2, 3, 3, 4){code} > should be translated to > {code:java} > select a from t where a >= 1 and a <= 4 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
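The proposed rewrite is only an equivalence when the distinct values in the IN-list form a continuous range. A sketch of the eligibility check and rewrite for integer literals follows; this is illustrative and not the Catalyst rule from the linked pull request.

```python
def rewrite_in_to_bounds(values):
    """Return (lo, hi) if `a IN (values)` can become `a >= lo AND a <= hi`,
    i.e. the distinct integer values form a continuous range; else None."""
    distinct = sorted(set(values))
    lo, hi = distinct[0], distinct[-1]
    if hi - lo + 1 == len(distinct):
        return lo, hi
    return None  # gaps: the bounds would admit values not in the list

print(rewrite_in_to_bounds([1, 2, 3, 3, 4]))  # (1, 4) -> a >= 1 AND a <= 4
print(rewrite_in_to_bounds([1, 2, 4]))        # None: 3 is missing, keep the IN
```

The gap check is what makes the rewrite safe: for `[1, 2, 4]`, the bounds `a >= 1 AND a <= 4` would wrongly admit 3.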
[jira] [Assigned] (SPARK-32811) Replace IN predicate of continuous range with boundary checks
[ https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32811: Assignee: (was: Apache Spark) > Replace IN predicate of continuous range with boundary checks > - > > Key: SPARK-32811 > URL: https://issues.apache.org/jira/browse/SPARK-32811 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Vu Ho >Priority: Major > > This expression > {code:java} > select a from t where a in (1, 2, 3, 3, 4){code} > should be translated to > {code:java} > select a from t where a >= 1 and a <= 4 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32811) Replace IN predicate of continuous range with boundary checks
[ https://issues.apache.org/jira/browse/SPARK-32811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32811: Assignee: Apache Spark > Replace IN predicate of continuous range with boundary checks > - > > Key: SPARK-32811 > URL: https://issues.apache.org/jira/browse/SPARK-32811 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Vu Ho >Assignee: Apache Spark >Priority: Major > > This expression > {code:java} > select a from t where a in (1, 2, 3, 3, 4){code} > should be translated to > {code:java} > select a from t where a >= 1 and a <= 4 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191611#comment-17191611 ] Apache Spark commented on SPARK-32808: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/29660 > Pass all `sql/core` module UTs in Scala 2.13 > > > Key: SPARK-32808 > URL: https://issues.apache.org/jira/browse/SPARK-32808 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > Now there are 319 TESTS FAILED based on commit > `f5360e761ef161f7e04526b59a4baf53f1cf8cd5` > {code:java} > Run completed in 1 hour, 20 minutes, 25 seconds. > Total number of tests run: 8485 > Suites: completed 357, aborted 0 > Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 > *** 319 TESTS FAILED *** > {code} > > There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and > TPCDS_XXX_PlanStabilityWithStatsSuite: > * TPCDSV2_7_PlanStabilitySuite(33 FAILED) > * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED) > * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED) > * TPCDSV1_4_PlanStabilitySuite(92 FAILED) > * TPCDSModifiedPlanStabilitySuite(21 FAILED) > * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED) > > Other 26 FAILED cases as follow: > * StreamingAggregationSuite > ** count distinct - state format version 1 > ** count distinct - state format version 2 > * GeneratorFunctionSuite > ** explode and other columns > ** explode_outer and other columns > * UDFSuite > ** SPARK-26308: udf with complex types of decimal > ** SPARK-32459: UDF should not fail on WrappedArray > * SQLQueryTestSuite > ** decimalArithmeticOperations.sql > ** postgreSQL/aggregates_part2.sql > ** ansi/decimalArithmeticOperations.sql > ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF > ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF > * WholeStageCodegenSuite > ** SPARK-26680: Stream in groupBy does not 
cause StackOverflowError > * DataFrameSuite: > ** explode > ** SPARK-28067: Aggregate sum should not return wrong results for decimal > overflow > ** Star Expansion - ds.explode should fail with a meaningful message if it > takes a star > * DataStreamReaderWriterSuite > ** SPARK-18510: use user specified types for partition columns in file > sources > * OrcV1QuerySuite\OrcV2QuerySuite > ** Simple selection form ORC table * 2 > * ExpressionsSchemaSuite > ** Check schemas for expression examples > * DataFrameStatSuite > ** SPARK-28818: Respect original column nullability in `freqItems` > * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite > ** SPARK-4228 DataFrame to JSON * 3 > ** backward compatibility * 3 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32808: Assignee: Apache Spark > Pass all `sql/core` module UTs in Scala 2.13 > > > Key: SPARK-32808 > URL: https://issues.apache.org/jira/browse/SPARK-32808 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > Now there are 319 TESTS FAILED based on commit > `f5360e761ef161f7e04526b59a4baf53f1cf8cd5` > {code:java} > Run completed in 1 hour, 20 minutes, 25 seconds. > Total number of tests run: 8485 > Suites: completed 357, aborted 0 > Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 > *** 319 TESTS FAILED *** > {code} > > There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and > TPCDS_XXX_PlanStabilityWithStatsSuite: > * TPCDSV2_7_PlanStabilitySuite(33 FAILED) > * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED) > * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED) > * TPCDSV1_4_PlanStabilitySuite(92 FAILED) > * TPCDSModifiedPlanStabilitySuite(21 FAILED) > * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED) > > Other 26 FAILED cases as follow: > * StreamingAggregationSuite > ** count distinct - state format version 1 > ** count distinct - state format version 2 > * GeneratorFunctionSuite > ** explode and other columns > ** explode_outer and other columns > * UDFSuite > ** SPARK-26308: udf with complex types of decimal > ** SPARK-32459: UDF should not fail on WrappedArray > * SQLQueryTestSuite > ** decimalArithmeticOperations.sql > ** postgreSQL/aggregates_part2.sql > ** ansi/decimalArithmeticOperations.sql > ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF > ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF > * WholeStageCodegenSuite > ** SPARK-26680: Stream in groupBy does not cause StackOverflowError > * DataFrameSuite: > ** explode > ** SPARK-28067: Aggregate sum 
should not return wrong results for decimal > overflow > ** Star Expansion - ds.explode should fail with a meaningful message if it > takes a star > * DataStreamReaderWriterSuite > ** SPARK-18510: use user specified types for partition columns in file > sources > * OrcV1QuerySuite\OrcV2QuerySuite > ** Simple selection form ORC table * 2 > * ExpressionsSchemaSuite > ** Check schemas for expression examples > * DataFrameStatSuite > ** SPARK-28818: Respect original column nullability in `freqItems` > * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite > ** SPARK-4228 DataFrame to JSON * 3 > ** backward compatibility * 3 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32808: Assignee: (was: Apache Spark) > Pass all `sql/core` module UTs in Scala 2.13 > > > Key: SPARK-32808 > URL: https://issues.apache.org/jira/browse/SPARK-32808 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > Now there are 319 TESTS FAILED based on commit > `f5360e761ef161f7e04526b59a4baf53f1cf8cd5` > {code:java} > Run completed in 1 hour, 20 minutes, 25 seconds. > Total number of tests run: 8485 > Suites: completed 357, aborted 0 > Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 > *** 319 TESTS FAILED *** > {code} > > There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and > TPCDS_XXX_PlanStabilityWithStatsSuite: > * TPCDSV2_7_PlanStabilitySuite(33 FAILED) > * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED) > * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED) > * TPCDSV1_4_PlanStabilitySuite(92 FAILED) > * TPCDSModifiedPlanStabilitySuite(21 FAILED) > * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED) > > Other 26 FAILED cases as follow: > * StreamingAggregationSuite > ** count distinct - state format version 1 > ** count distinct - state format version 2 > * GeneratorFunctionSuite > ** explode and other columns > ** explode_outer and other columns > * UDFSuite > ** SPARK-26308: udf with complex types of decimal > ** SPARK-32459: UDF should not fail on WrappedArray > * SQLQueryTestSuite > ** decimalArithmeticOperations.sql > ** postgreSQL/aggregates_part2.sql > ** ansi/decimalArithmeticOperations.sql > ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF > ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF > * WholeStageCodegenSuite > ** SPARK-26680: Stream in groupBy does not cause StackOverflowError > * DataFrameSuite: > ** explode > ** SPARK-28067: Aggregate sum should not return 
wrong results for decimal > overflow > ** Star Expansion - ds.explode should fail with a meaningful message if it > takes a star > * DataStreamReaderWriterSuite > ** SPARK-18510: use user specified types for partition columns in file > sources > * OrcV1QuerySuite\OrcV2QuerySuite > ** Simple selection form ORC table * 2 > * ExpressionsSchemaSuite > ** Check schemas for expression examples > * DataFrameStatSuite > ** SPARK-28818: Respect original column nullability in `freqItems` > * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite > ** SPARK-4228 DataFrame to JSON * 3 > ** backward compatibility * 3 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32811) Replace IN predicate of continuous range with boundary checks
Vu Ho created SPARK-32811: - Summary: Replace IN predicate of continuous range with boundary checks Key: SPARK-32811 URL: https://issues.apache.org/jira/browse/SPARK-32811 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Vu Ho This expression {code:java} select a from t where a in (1, 2, 3, 3, 4){code} should be translated to {code:java} select a from t where a >= 1 and a <= 4 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
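The proposed rewrite can be sketched as a small rule: when the distinct integer literals in the IN-list form a contiguous range, replace the membership test with two boundary comparisons. The helper below is a hypothetical illustration operating on SQL strings; Spark's actual optimizer rules work on Catalyst expression trees.

```python
def rewrite_in_predicate(column, values):
    # Hypothetical sketch of the proposed rule, not Spark optimizer code.
    distinct = sorted(set(values))
    is_contiguous_ints = (
        len(distinct) > 0
        and all(isinstance(v, int) for v in distinct)
        and distinct == list(range(distinct[0], distinct[-1] + 1))
    )
    if is_contiguous_ints:
        # Duplicates and ordering in the IN-list do not matter:
        # membership in {1, 2, 3, 3, 4} is exactly 1 <= a <= 4.
        return f"{column} >= {distinct[0]} AND {column} <= {distinct[-1]}"
    return f"{column} IN ({', '.join(map(str, distinct))})"
```

A non-contiguous list such as `(1, 2, 5)` is left as an IN predicate, since boundary checks alone cannot express it.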
[jira] [Assigned] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32810: Assignee: (was: Apache Spark) > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The problem is that when the user doesn't specify the schema when reading a > CSV table, The CSV file format and data source needs to infer schema, and it > does so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelationtries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191578#comment-17191578 ] Apache Spark commented on SPARK-32810: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/29659 > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The problem is that when the user doesn't specify the schema when reading a > CSV table, The CSV file format and data source needs to infer schema, and it > does so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelationtries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32810: Assignee: Apache Spark > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > The problem is that when the user doesn't specify the schema when reading a > CSV table, The CSV file format and data source needs to infer schema, and it > does so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelationtries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32785) interval with dangling part should not result in null
[ https://issues.apache.org/jira/browse/SPARK-32785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191513#comment-17191513 ] Apache Spark commented on SPARK-32785: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29658 > interval with dangling part should not results null > --- > > Key: SPARK-32785 > URL: https://issues.apache.org/jira/browse/SPARK-32785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'" > NULL NULLNULL > we should fail these cases correctly -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
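The fix direction is to make parsing strict: a dangling sign or a bare number with no unit should raise an error rather than silently yield NULL. A rough stand-alone sketch of that behavior (not Spark's Catalyst interval parser, which works quite differently):

```python
import re

UNIT = r"(?:year|month|week|day|hour|minute|second)"
TERM = rf"[+-]?\d+\s+{UNIT}s?"

def parse_interval_strict(s):
    # Require the whole string to be one or more "<signed int> <unit>"
    # terms; a dangling '+' / '-' or a bare number then fails loudly
    # instead of becoming NULL.
    body = s.strip()
    if not re.fullmatch(rf"{TERM}(\s+{TERM})*", body, re.IGNORECASE):
        raise ValueError(f"Cannot parse interval: {s!r}")
    out = {}
    for num, unit in re.findall(rf"([+-]?\d+)\s+({UNIT})s?", body, re.IGNORECASE):
        out[unit.lower()] = out.get(unit.lower(), 0) + int(num)
    return out
```

Under this sketch all three inputs from the report, `'1'`, `'+'`, and `'1 day -'`, are rejected with an error.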
[jira] [Commented] (SPARK-32785) interval with dangling part should not result in null
[ https://issues.apache.org/jira/browse/SPARK-32785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191511#comment-17191511 ] Apache Spark commented on SPARK-32785: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/29658 > interval with dangling part should not results null > --- > > Key: SPARK-32785 > URL: https://issues.apache.org/jira/browse/SPARK-32785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'" > NULL NULLNULL > we should fail these cases correctly -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
[ https://issues.apache.org/jira/browse/SPARK-32810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191510#comment-17191510 ] Maxim Gekk commented on SPARK-32810: I am working on this. > CSV/JSON data sources should avoid globbing paths when inferring schema > --- > > Key: SPARK-32810 > URL: https://issues.apache.org/jira/browse/SPARK-32810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > The problem is that when the user doesn't specify the schema when reading a > CSV table, The CSV file format and data source needs to infer schema, and it > does so by creating a base DataSource relation, and there's a mismatch: > *FileFormat.inferSchema* expects actual file paths without glob patterns, but > *DataSource.paths* expects file paths in glob patterns. > An example is demonstrated below: > {code:java} > ^ > | DataSource.resolveRelationtries to glob again (incorrectly) on > glob pattern """[abc].csv""" > | DataSource.apply ^ > | CSVDataSource.inferSchema | > | CSVFileFormat.inferSchema | > | ... | > | DataSource.resolveRelation globbed into """[abc].csv""", should > be treated as verbatim path, not as glob pattern > | DataSource.apply^ > | DataFrameReader.load | > | input """\[abc\].csv""" > {code} > The same problem exists in the JSON data source as well. Ditto for MLlib's > LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32810) CSV/JSON data sources should avoid globbing paths when inferring schema
Maxim Gekk created SPARK-32810: -- Summary: CSV/JSON data sources should avoid globbing paths when inferring schema Key: SPARK-32810 URL: https://issues.apache.org/jira/browse/SPARK-32810 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk The problem is that when the user doesn't specify the schema when reading a CSV table, The CSV file format and data source needs to infer schema, and it does so by creating a base DataSource relation, and there's a mismatch: *FileFormat.inferSchema* expects actual file paths without glob patterns, but *DataSource.paths* expects file paths in glob patterns. An example is demonstrated below: {code:java} ^ | DataSource.resolveRelationtries to glob again (incorrectly) on glob pattern """[abc].csv""" | DataSource.apply ^ | CSVDataSource.inferSchema | | CSVFileFormat.inferSchema | | ... | | DataSource.resolveRelation globbed into """[abc].csv""", should be treated as verbatim path, not as glob pattern | DataSource.apply^ | DataFrameReader.load | | input """\[abc\].csv""" {code} The same problem exists in the JSON data source as well. Ditto for MLlib's LibSVM data source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
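The core of the bug is double interpretation: a path that one globbing pass already resolved is fed back in as if it were still a pattern. Python's glob module exhibits the same hazard and the standard remedy, escaping a resolved path before any further glob call; the fix for the data sources is analogous (pass the resolved file paths verbatim to schema inference), though Spark uses Hadoop's globbing, not Python's.

```python
import glob
import os
import tempfile

# A real file whose name contains glob metacharacters, like "[abc].csv".
d = tempfile.mkdtemp()
path = os.path.join(d, "[abc].csv")
open(path, "w").close()

# Treating the resolved path as a pattern re-interprets [abc] as a
# character class (matching a.csv, b.csv, or c.csv) and finds nothing.
assert glob.glob(path) == []

# Escaping marks the path as verbatim, so a second pass is harmless.
assert glob.glob(glob.escape(path)) == [path]
```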
[jira] [Assigned] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32798: Assignee: Apache Spark > Make unionByName optionally fill missing columns with nulls in PySpark > -- > > Key: SPARK-32798 > URL: https://issues.apache.org/jira/browse/SPARK-32798 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > Labels: starter > > It would be better to expose {{unionByName}} parameter in Python APIs as > well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
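The Scala/Java API added in SPARK-29358 takes an `allowMissingColumns` flag; the PySpark change is to expose the same parameter. Its semantics can be sketched with plain dicts standing in for Rows (an illustration of the behavior, not PySpark's implementation):

```python
def union_by_name(rows_a, rows_b, allow_missing_columns=False):
    # Rows are dicts keyed by column name. Columns are aligned by name,
    # not by position; with allow_missing_columns, a column absent from
    # one side is filled with None (null) for that side's rows.
    cols_a, cols_b = list(rows_a[0]), list(rows_b[0])
    only_one_side = set(cols_a) ^ set(cols_b)
    if only_one_side and not allow_missing_columns:
        raise ValueError(f"Columns differ: {sorted(only_one_side)}")
    all_cols = cols_a + [c for c in cols_b if c not in cols_a]
    return [{c: r.get(c) for c in all_cols} for r in rows_a + rows_b]
```

With real DataFrames the call would be `df1.unionByName(df2, allowMissingColumns=True)` once the parameter is exposed in Python.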
[jira] [Commented] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191507#comment-17191507 ] Apache Spark commented on SPARK-32798: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/29657 > Make unionByName optionally fill missing columns with nulls in PySpark > -- > > Key: SPARK-32798 > URL: https://issues.apache.org/jira/browse/SPARK-32798 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: starter > > It would be better to expose {{unionByName}} parameter in Python APIs as > well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32798) Make unionByName optionally fill missing columns with nulls in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32798: Assignee: (was: Apache Spark) > Make unionByName optionally fill missing columns with nulls in PySpark > -- > > Key: SPARK-32798 > URL: https://issues.apache.org/jira/browse/SPARK-32798 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Labels: starter > > It would be better to expose {{unionByName}} parameter in Python APIs as > well. Currently this is only exposed in Scala/Java APIs (at SPARK-29358) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-32809) RDD different partitions cause different results
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-32809: -- > RDD different partitions cause didderent results > -- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0 >Reporter: zhangchenglong >Priority: Major > Original Estimate: 12h > Remaining Estimate: 12h > > {code} > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** >* get total number by key >* in this project desired results are ("apple",25) ("huwei",20) >* but in fact i get ("apple",150) ("huawei",20) >* when i change it to local[3] the result is correct >* i want to know which cause it and how to slove it >*/ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), > ("huawei", 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg).collect().foreach(println(_)) > context.stop() > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32809) RDD different partitions cause different results
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32809. -- Resolution: Invalid > RDD different partitions cause didderent results > -- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0 >Reporter: zhangchenglong >Priority: Major > Original Estimate: 12h > Remaining Estimate: 12h > > {code} > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** >* get total number by key >* in this project desired results are ("apple",25) ("huwei",20) >* but in fact i get ("apple",150) ("huawei",20) >* when i change it to local[3] the result is correct >* i want to know which cause it and how to slove it >*/ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), > ("huawei", 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg).collect().foreach(println(_)) > context.stop() > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32809) RDD different partitions cause different results
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191485#comment-17191485 ] Hyukjin Kwon commented on SPARK-32809: -- The results is correct. For {{local[1]}}, there's a single partition. So {{seqOp}} will be used for {{("apple", 10), ("apple", 15)}}. When {{local[3]}} the partitions are three. {{combOp}} will be used to combine each partition. > RDD different partitions cause didderent results > -- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0 >Reporter: zhangchenglong >Priority: Major > Original Estimate: 12h > Remaining Estimate: 12h > > {code} > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** >* get total number by key >* in this project desired results are ("apple",25) ("huwei",20) >* but in fact i get ("apple",150) ("huawei",20) >* when i change it to local[3] the result is correct >* i want to know which cause it and how to slove it >*/ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), > ("huawei", 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg).collect().foreach(println(_)) > context.stop() > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
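The behavior described above follows directly from aggregateByKey's contract: the zero value is applied once per key per partition by seqOp, and combOp only merges the per-partition results. A plain-Python simulation of those semantics (not Spark's implementation) reproduces both numbers from the report:

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # partitions: list of partitions, each a list of (key, value) pairs.
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            # seq_op folds each value into the running result for the
            # key, starting from `zero` once per key per partition.
            acc[k] = seq_op(acc.get(k, zero), v)
        per_partition.append(acc)
    merged = {}
    for acc in per_partition:
        for k, v in acc.items():
            # comb_op only merges partition-level results.
            merged[k] = comb_op(merged[k], v) if k in merged else v
    return merged

data = [("apple", 10), ("apple", 15), ("huawei", 20)]
seq = lambda zero, price: price * zero
comb = lambda curr, agg: curr + agg

# local[1]: one partition, so seq folds both apple values: 1.0 * 10 * 15.
one = aggregate_by_key([data], 1.0, seq, comb)
# local[3]: one value per partition, then comb sums: 10.0 + 15.0.
three = aggregate_by_key([[row] for row in data], 1.0, seq, comb)
```

With a multiplicative seqOp and an additive combOp, the result necessarily depends on how values are partitioned, which is why the issue was resolved as Invalid rather than as a bug.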
[jira] [Updated] (SPARK-32809) RDD different partitions cause different results
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32809: - Description: {code} class Exec3 { private val exec: SparkConf = new SparkConf().setMaster("local[1]").setAppName("exec3") private val context = new SparkContext(exec) context.setCheckpointDir("checkPoint") /** * get total number by key * in this project desired results are ("apple",25) ("huwei",20) * but in fact i get ("apple",150) ("huawei",20) * when i change it to local[3] the result is correct * i want to know which cause it and how to slove it */ @Test def testError(): Unit ={ val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 20))) rdd.aggregateByKey(1.0)( seqOp = (zero, price) => price * zero, combOp = (curr, agg) => curr + agg).collect().foreach(println(_)) context.stop() } } {code} was: class Exec3 { private val exec: SparkConf = new SparkConf().setMaster("local[1]").setAppName("exec3") private val context = new SparkContext(exec) context.setCheckpointDir("checkPoint") /** * get total number by key * in this project desired results are ("apple",25) ("huwei",20) * but in fact i get ("apple",150) ("huawei",20) * when i change it to local[3] the result is correct * i want to know which cause it and how to slove it */ @Test def testError(): Unit ={ val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 20))) rdd.aggregateByKey(1.0)( seqOp = (zero, price) => price * zero, combOp = (curr, agg) => curr + agg ).collect().foreach(println(_)) context.stop() } } > RDD different partitions cause didderent results > -- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: spark2.2.0 ,scala 2.11.8 , hadoop-client2.6.0 >Reporter: zhangchenglong >Priority: Major > Original Estimate: 12h > Remaining Estimate: 12h > > {code} > class Exec3 { > private val 
exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** >* get total number by key >* in this project desired results are ("apple",25) ("huwei",20) >* but in fact i get ("apple",150) ("huawei",20) >* when i change it to local[3] the result is correct >* i want to know which cause it and how to slove it >*/ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), > ("huawei", 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg).collect().foreach(println(_)) > context.stop() > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32808) Pass all `sql/core` module UTs in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-32808: - Description: Now there are 319 TESTS FAILED based on commit `f5360e761ef161f7e04526b59a4baf53f1cf8cd5` {code:java} Run completed in 1 hour, 20 minutes, 25 seconds. Total number of tests run: 8485 Suites: completed 357, aborted 0 Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 *** 319 TESTS FAILED *** {code} There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and TPCDS_XXX_PlanStabilityWithStatsSuite: * TPCDSV2_7_PlanStabilitySuite(33 FAILED) * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED) * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED) * TPCDSV1_4_PlanStabilitySuite(92 FAILED) * TPCDSModifiedPlanStabilitySuite(21 FAILED) * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED) Other 26 FAILED cases as follow: * StreamingAggregationSuite ** count distinct - state format version 1 ** count distinct - state format version 2 * GeneratorFunctionSuite ** explode and other columns ** explode_outer and other columns * UDFSuite ** SPARK-26308: udf with complex types of decimal ** SPARK-32459: UDF should not fail on WrappedArray * SQLQueryTestSuite ** decimalArithmeticOperations.sql ** postgreSQL/aggregates_part2.sql ** ansi/decimalArithmeticOperations.sql ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF * WholeStageCodegenSuite ** SPARK-26680: Stream in groupBy does not cause StackOverflowError * DataFrameSuite: ** explode ** SPARK-28067: Aggregate sum should not return wrong results for decimal overflow ** Star Expansion - ds.explode should fail with a meaningful message if it takes a star * DataStreamReaderWriterSuite ** SPARK-18510: use user specified types for partition columns in file sources * OrcV1QuerySuite\OrcV2QuerySuite ** Simple selection form ORC table * 2 * ExpressionsSchemaSuite ** Check schemas for 
expression examples * DataFrameStatSuite ** SPARK-28818: Respect original column nullability in `freqItems` * JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite ** SPARK-4228 DataFrame to JSON * 3 ** backward compatibility * 3 was: Now there are 319 TESTS FAILED based on commit `f5360e761ef161f7e04526b59a4baf53f1cf8cd5` {code:java} Run completed in 1 hour, 20 minutes, 25 seconds. Total number of tests run: 8485 Suites: completed 357, aborted 0 Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 *** 319 TESTS FAILED *** {code} > Pass all `sql/core` module UTs in Scala 2.13 > > > Key: SPARK-32808 > URL: https://issues.apache.org/jira/browse/SPARK-32808 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > Now there are 319 TESTS FAILED based on commit > `f5360e761ef161f7e04526b59a4baf53f1cf8cd5` > {code:java} > Run completed in 1 hour, 20 minutes, 25 seconds. > Total number of tests run: 8485 > Suites: completed 357, aborted 0 > Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 > *** 319 TESTS FAILED *** > {code} > > There are 293 failures associated with TPCDS_XXX_PlanStabilitySuite and > TPCDS_XXX_PlanStabilityWithStatsSuite: > * TPCDSV2_7_PlanStabilitySuite(33 FAILED) > * TPCDSV1_4_PlanStabilityWithStatsSuite(94 FAILED) > * TPCDSModifiedPlanStabilityWithStatsSuite(21 FAILED) > * TPCDSV1_4_PlanStabilitySuite(92 FAILED) > * TPCDSModifiedPlanStabilitySuite(21 FAILED) > * TPCDSV2_7_PlanStabilityWithStatsSuite(32 FAILED) > > Other 26 FAILED cases as follow: > * StreamingAggregationSuite > ** count distinct - state format version 1 > ** count distinct - state format version 2 > * GeneratorFunctionSuite > ** explode and other columns > ** explode_outer and other columns > * UDFSuite > ** SPARK-26308: udf with complex types of decimal > ** SPARK-32459: UDF should not fail on WrappedArray > * SQLQueryTestSuite > ** decimalArithmeticOperations.sql > ** 
postgreSQL/aggregates_part2.sql > ** ansi/decimalArithmeticOperations.sql > ** udf/postgreSQL/udf-aggregates_part2.sql - Scala UDF > ** udf/postgreSQL/udf-aggregates_part2.sql - Regular Python UDF > * WholeStageCodegenSuite > ** SPARK-26680: Stream in groupBy does not cause StackOverflowError > * DataFrameSuite: > ** explode > ** SPARK-28067: Aggregate sum should not return wrong results for decimal > overflow > ** Star Expansion - ds.explode should fail with a meaningful message if it > takes a star > * DataStreamReaderWriterSuite > ** SPARK-18510: use user specified types for partition columns in file > sources > * OrcV1QuerySuite\OrcV2QuerySuite > ** Simple selection form ORC table * 2 > * ExpressionsSchemaSuite > ** Check
[jira] [Updated] (SPARK-32809) RDD different partitions cause different results
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32809: - Priority: Major (was: Blocker) > RDD different partitions cause different results > -- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.2.0 > Environment: spark 2.2.0, scala 2.11.8, hadoop-client 2.6.0 >Reporter: zhangchenglong >Priority: Major > Original Estimate: 12h > Remaining Estimate: 12h > > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** > * get total number by key > * in this project the desired results are ("apple",25) ("huawei",20) > * but in fact I get ("apple",150) ("huawei",20) > * when I change it to local[3] the result is correct > * I want to know what causes it and how to solve it > */ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", > 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg > ).collect().foreach(println(_)) > context.stop() > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
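The behavior reported above is expected: `aggregateByKey` applies the zero value once per partition, so a multiplicative `seqOp` with zero `1.0` folds all of a key's values into a single product when they land in one partition (`local[1]`), but produces per-partition products that are then summed by `combOp` when the values are spread out (`local[3]`). The plain-Python sketch below (no Spark required; `aggregate_by_key` is our own stand-in, not a Spark API) reproduces both results:

```python
# Plain-Python simulation of how aggregateByKey applies the zero value
# once per partition, which explains the report above.
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # seq_op folds values within each partition, starting from `zero`;
    # comb_op merges the per-partition results across partitions.
    merged = {}
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = seq_op(local.get(k, zero), v)
        for k, v in local.items():
            merged[k] = comb_op(merged[k], v) if k in merged else v
    return merged

# One partition (local[1]): zero=1.0 joins the running product once,
# so apple = 1.0 * 10 * 15 = 150.0.
one_part = [[("apple", 10), ("apple", 15), ("huawei", 20)]]
print(aggregate_by_key(one_part, 1.0, lambda z, p: z * p, lambda a, b: a + b))

# Three partitions (local[3]): each value is folded against zero separately
# (10.0 and 15.0), then combOp sums them: apple = 25.0.
three_parts = [[("apple", 10)], [("apple", 15)], [("huawei", 20)]]
print(aggregate_by_key(three_parts, 1.0, lambda z, p: z * p, lambda a, b: a + b))
```

For the stated goal ("get total number by key"), a `seqOp` of `(zero, price) => zero + price` with zero `0.0` yields `("apple",25)` regardless of partitioning, since addition with a zero of `0.0` is insensitive to how many times the zero is applied.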
[jira] [Updated] (SPARK-32809) RDD different partitions cause different results
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32809: - Issue Type: Bug (was: Wish) > RDD different partitions cause different results > -- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: spark 2.2.0, scala 2.11.8, hadoop-client 2.6.0 >Reporter: zhangchenglong >Priority: Major > Original Estimate: 12h > Remaining Estimate: 12h > > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** > * get total number by key > * in this project the desired results are ("apple",25) ("huawei",20) > * but in fact I get ("apple",150) ("huawei",20) > * when I change it to local[3] the result is correct > * I want to know what causes it and how to solve it > */ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", > 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg > ).collect().foreach(println(_)) > context.stop() > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32809) RDD different partitions cause different results
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangchenglong updated SPARK-32809: --- Flags: Important Environment: spark 2.2.0, scala 2.11.8, hadoop-client 2.6.0 Issue Type: Wish (was: Bug) Priority: Blocker (was: Major) Summary: RDD different partitions cause different results (was: RDD分区数对于计算结果的影响 [How the number of RDD partitions affects the result]) Remaining Estimate: 12h Original Estimate: 12h The desired result is ("apple",25) ("huawei",20) but I get ("apple",150) ("huawei",20) > RDD different partitions cause different results > -- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.2.0 > Environment: spark 2.2.0, scala 2.11.8, hadoop-client 2.6.0 >Reporter: zhangchenglong >Priority: Blocker > Original Estimate: 12h > Remaining Estimate: 12h > > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** > * get total number by key > * in this project the desired results are ("apple",25) ("huawei",20) > * but in fact I get ("apple",150) ("huawei",20) > * when I change it to local[3] the result is correct > * I want to know what causes it and how to solve it > */ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", > 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg > ).collect().foreach(println(_)) > context.stop() > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32809) RDD分区数对于计算结果的影响 (How the number of RDD partitions affects the result)
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191475#comment-17191475 ] Hyukjin Kwon commented on SPARK-32809: -- Please also fix the title. What are the expected and actual results? Also, Spark 2.2 is EOL and doesn't have maintenance releases anymore. It would be great if it could still be reproduced in Spark 2.4+. > RDD分区数对于计算结果的影响 (How the number of RDD partitions affects the result) > --- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: zhangchenglong >Priority: Major > > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** > * get total number by key > * in this project the desired results are ("apple",25) ("huawei",20) > * but in fact I get ("apple",150) ("huawei",20) > * when I change it to local[3] the result is correct > * I want to know what causes it and how to solve it > */ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", > 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg > ).collect().foreach(println(_)) > context.stop() > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32809) RDD分区数对于计算结果的影响 (How the number of RDD partitions affects the result)
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangchenglong updated SPARK-32809: --- Description: class Exec3 { private val exec: SparkConf = new SparkConf().setMaster("local[1]").setAppName("exec3") private val context = new SparkContext(exec) context.setCheckpointDir("checkPoint") /** * get total number by key * in this project the desired results are ("apple",25) ("huawei",20) * but in fact I get ("apple",150) ("huawei",20) * when I change it to local[3] the result is correct * I want to know what causes it and how to solve it */ @Test def testError(): Unit ={ val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", 20))) rdd.aggregateByKey(1.0)( seqOp = (zero, price) => price * zero, combOp = (curr, agg) => curr + agg ).collect().foreach(println(_)) context.stop() } } was: class Exec3 { private val exec: SparkConf = new SparkConf().setMaster("local[1]").setAppName("exec3") private val context = new SparkContext(exec) context.setCheckpointDir("checkPoint") /** * get total number by key * in this project the desired results are ("苹果",25) ("华为",20) * but in fact I get ("苹果",150) ("华为",20) * when I change it to local[3] the result is correct * I want to know what causes it and how to solve it */ @Test def testError(): Unit ={ val rdd = context.parallelize(Seq(("苹果", 10), ("苹果", 15), ("华为", 20))) rdd.aggregateByKey(1.0)( seqOp = (zero, price) => price * zero, combOp = (curr, agg) => curr + agg ).collect().foreach(println(_)) context.stop() } } > RDD分区数对于计算结果的影响 (How the number of RDD partitions affects the result) > --- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: zhangchenglong >Priority: Major > > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** > * get total number by key > * in this project the desired results are ("apple",25) ("huawei",20) > * but in fact I get ("apple",150) ("huawei",20) > * when I change it to local[3] the result is correct > * I want to know what causes it and how to solve it > */ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("apple", 10), ("apple", 15), ("huawei", > 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg > ).collect().foreach(println(_)) > context.stop() > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32748: --- Assignee: Zhenhua Wang > Support local property propagation in SubqueryBroadcastExec > --- > > Key: SPARK-32748 > URL: https://issues.apache.org/jira/browse/SPARK-32748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > > Since SPARK-22590, local property propagation is supported through > `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and > `SubqueryExec` when computing `relationFuture`. > The propagation is missed in `SubqueryBroadcastExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32748. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29589 [https://github.com/apache/spark/pull/29589] > Support local property propagation in SubqueryBroadcastExec > --- > > Key: SPARK-32748 > URL: https://issues.apache.org/jira/browse/SPARK-32748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > Fix For: 3.1.0 > > > Since SPARK-22590, local property propagation is supported through > `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and > `SubqueryExec` when computing `relationFuture`. > The propagation is missed in `SubqueryBroadcastExec`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
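The mechanism being extended by this fix — capturing the caller's thread-local properties and re-installing them on the thread that computes the future — can be sketched in a few lines. This is an illustration of the pattern only, not Spark's actual `SQLExecution.withThreadLocalCaptured` implementation; all names below are our own:

```python
import threading

# Thread-local "properties", standing in for Spark's per-thread local
# properties (e.g. a job group set on the submitting thread).
local_props = threading.local()

def with_captured(fn):
    # Snapshot the caller's properties now, on the submitting thread...
    captured = dict(getattr(local_props, "props", {}))
    def wrapper():
        # ...and re-install them on whichever thread runs the work,
        # so the work observes the submitter's properties.
        local_props.props = captured
        return fn()
    return wrapper

local_props.props = {"jobGroup": "g1"}
out = []
worker = threading.Thread(
    target=with_captured(lambda: out.append(local_props.props["jobGroup"])))
worker.start()
worker.join()
print(out)  # the worker thread sees the caller's jobGroup
```

Without the `with_captured` wrapper, the worker thread's `local_props` would be empty — which is the symptom this issue fixes for `SubqueryBroadcastExec`'s `relationFuture`.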
[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191954#comment-17191954 ] Aman Rastogi commented on SPARK-32778: -- [~hyukjin.kwon] Sure, will reproduce with 2.4 > Accidental Data Deletion on calling saveAsTable > --- > > Key: SPARK-32778 > URL: https://issues.apache.org/jira/browse/SPARK-32778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Aman Rastogi >Priority: Major > > {code:java} > df.write.option("path", > "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table) > {code} > The above code deleted the data present in the path "/already/existing/path". This > happened because the table was not yet in the Hive metastore, while the given path > already contained data. If the table is not present in the Hive metastore, the > SaveMode is internally changed to SaveMode.Overwrite irrespective of what the user > provided, which leads to data deletion. This change was introduced as part of > https://issues.apache.org/jira/browse/SPARK-19583. > Now suppose the user is not using an external Hive metastore (the metastore is > associated with a cluster) and the cluster goes down, or for some reason the user > has to migrate to a new cluster. Once the user tries to save data using the above > code on the new cluster, it will first delete the data. It could be production > data, and the user is completely unaware of the deletion because they provided > SaveMode.Append or ErrorIfExists. This is an accidental data deletion. > > Repro steps: > > # Save data through a Hive table as in the above code > # Create another cluster and save data into a new table on the new cluster, giving > the same path > > Proposed fix: > Instead of modifying the SaveMode to Overwrite, we should modify it to > ErrorIfExists in the class CreateDataSourceTableAsSelectCommand. 
> Change (line 154) > > {code:java} > val result = saveDataIntoTable( > sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = > false) > > {code} > to > > {code:java} > val result = saveDataIntoTable( > sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, > tableExists = false){code} > This should not break CTAS. Even in the case of CTAS, users may not want to delete > data that already exists, as the deletion could be accidental. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
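The mode rewrite at the heart of this report reduces to a small decision function. The sketch below models the behavior described above and the proposed fix; both function names are hypothetical illustrations, not Spark code:

```python
# Current behavior per the report: when the table is absent from the
# metastore, the save mode is rewritten to "overwrite" regardless of what
# the user asked for -- so pre-existing data at the path is deleted.
def effective_mode(user_mode, table_in_metastore):
    if not table_in_metastore:
        return "overwrite"
    return user_mode

# Proposed fix: fall back to "error_if_exists" instead, so data already
# sitting at the path is never silently overwritten.
def proposed_mode(user_mode, table_in_metastore):
    if not table_in_metastore:
        return "error_if_exists"
    return user_mode

# The migration scenario: user asked for "append", table not in metastore.
print(effective_mode("append", table_in_metastore=False))  # overwrite
print(proposed_mode("append", table_in_metastore=False))   # error_if_exists
```

The comparison makes the risk concrete: under the current behavior a user-supplied `Append` silently becomes `Overwrite` on a fresh metastore, while the proposed behavior surfaces an error instead of deleting data.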
[jira] [Resolved] (SPARK-32809) RDD分区数对于计算结果的影响 (How the number of RDD partitions affects the result)
[ https://issues.apache.org/jira/browse/SPARK-32809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32809. -- Resolution: Incomplete Please write in English, which the community uses to communicate. > RDD分区数对于计算结果的影响 (How the number of RDD partitions affects the result) > --- > > Key: SPARK-32809 > URL: https://issues.apache.org/jira/browse/SPARK-32809 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: zhangchenglong >Priority: Major > > class Exec3 { > private val exec: SparkConf = new > SparkConf().setMaster("local[1]").setAppName("exec3") > private val context = new SparkContext(exec) > context.setCheckpointDir("checkPoint") > > /** > * get total number by key > * in this project the desired results are ("苹果",25) ("华为",20) > * but in fact I get ("苹果",150) ("华为",20) > * when I change it to local[3] the result is correct > * I want to know what causes it and how to solve it > */ > @Test > def testError(): Unit ={ > val rdd = context.parallelize(Seq(("苹果", 10), ("苹果", 15), ("华为", 20))) > rdd.aggregateByKey(1.0)( > seqOp = (zero, price) => price * zero, > combOp = (curr, agg) => curr + agg > ).collect().foreach(println(_)) > context.stop() > } > } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32778) Accidental Data Deletion on calling saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191468#comment-17191468 ] Hyukjin Kwon commented on SPARK-32778: -- Spark 2.3 is EOL too... can you reproduce it in Spark 2.4 or 3.0? > Accidental Data Deletion on calling saveAsTable > --- > > Key: SPARK-32778 > URL: https://issues.apache.org/jira/browse/SPARK-32778 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Aman Rastogi >Priority: Major > > {code:java} > df.write.option("path", > "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table) > {code} > The above code deleted the data present in the path "/already/existing/path". This > happened because the table was not yet in the Hive metastore, while the given path > already contained data. If the table is not present in the Hive metastore, the > SaveMode is internally changed to SaveMode.Overwrite irrespective of what the user > provided, which leads to data deletion. This change was introduced as part of > https://issues.apache.org/jira/browse/SPARK-19583. > Now suppose the user is not using an external Hive metastore (the metastore is > associated with a cluster) and the cluster goes down, or for some reason the user > has to migrate to a new cluster. Once the user tries to save data using the above > code on the new cluster, it will first delete the data. It could be production > data, and the user is completely unaware of the deletion because they provided > SaveMode.Append or ErrorIfExists. This is an accidental data deletion. > > Repro steps: > > # Save data through a Hive table as in the above code > # Create another cluster and save data into a new table on the new cluster, giving > the same path > > Proposed fix: > Instead of modifying the SaveMode to Overwrite, we should modify it to > ErrorIfExists in the class CreateDataSourceTableAsSelectCommand. 
> Change (line 154) > > {code:java} > val result = saveDataIntoTable( > sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = > false) > > {code} > to > > {code:java} > val result = saveDataIntoTable( > sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, > tableExists = false){code} > This should not break CTAS. Even in the case of CTAS, users may not want to delete > data that already exists, as the deletion could be accidental. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32807) Spark ThriftServer multisession mode set DB use direct API
[ https://issues.apache.org/jira/browse/SPARK-32807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32807: - Issue Type: Improvement (was: Bug) > Spark ThriftServer multisession mode set DB use direct API > -- > > Key: SPARK-32807 > URL: https://issues.apache.org/jira/browse/SPARK-32807 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > When the Spark Thrift Server runs in multi-session mode and an initial default DB is > defined, it calls `use db` to set the default DB; under high concurrency this is slow -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32807) Spark ThriftServer multisession mode set DB use direct API
[ https://issues.apache.org/jira/browse/SPARK-32807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32807: - Priority: Trivial (was: Major) > Spark ThriftServer multisession mode set DB use direct API > -- > > Key: SPARK-32807 > URL: https://issues.apache.org/jira/browse/SPARK-32807 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Trivial > > When the Spark Thrift Server runs in multi-session mode and an initial default DB is > defined, it calls `use db` to set the default DB; under high concurrency this is slow -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
[ https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32779. -- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 29649 [https://github.com/apache/spark/pull/29649] > Spark/Hive3 interaction potentially causes deadlock > --- > > Key: SPARK-32779 > URL: https://issues.apache.org/jira/browse/SPARK-32779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > This is an issue for applications that share a Spark Session across multiple > threads. > sessionCatalog.loadPartition (after checking that the table exists) grabs > locks in this order: > - HiveExternalCatalog > - HiveSessionCatalog (in Shim_v3_0) > Other operations (e.g., sessionCatalog.tableExists), grab locks in this order: > - HiveSessionCatalog > - HiveExternalCatalog > [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] > appears to be the culprit. Maybe db name should be defaulted _before_ the > call to HiveClient so that Shim_v3_0 doesn't have to call back into > SessionCatalog. Or possibly this is not needed at all, since loadPartition in > Shim_v2_1 doesn't worry about the default db name, but that might be because > of differences between Hive client libraries. 
> Reproduction case: > - You need to have a running Hive 3.x HMS instance and the appropriate > hive-site.xml for your Spark instance > - Adjust your spark.sql.hive.metastore.version accordingly > - It might take more than one try to hit the deadlock > Launch Spark: > {noformat} > bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" > --conf spark.sql.hive.metastore.version=3.1 > {noformat} > Then use the following code: > {noformat} > import scala.collection.mutable.ArrayBuffer > import scala.util.Random > val tableCount = 4 > for (i <- 0 until tableCount) { > val tableName = s"partitioned${i+1}" > sql(s"drop table if exists $tableName") > sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored > as orc") > } > val threads = new ArrayBuffer[Thread] > for (i <- 0 until tableCount) { > threads.append(new Thread( new Runnable { > override def run: Unit = { > val tableName = s"partitioned${i + 1}" > val rand = Random > val df = spark.range(0, 2).toDF("a") > val location = s"/tmp/${rand.nextLong.abs}" > df.write.mode("overwrite").orc(location) > sql( > s""" > LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition > (b=$i)""") > } > }, s"worker$i")) > threads(i).start() > } > for (i <- 0 until tableCount) { > println(s"Joining with thread $i") > threads(i).join() > } > println("All done") > {noformat} > The job often gets stuck after one or two "Joining..." lines. 
> {{kill -3}} shows something like this: > {noformat} > Found one Java-level deadlock: > = > "worker3": > waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a > org.apache.spark.sql.hive.HiveSessionCatalog), > which is held by "worker0" > "worker0": > waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a > org.apache.spark.sql.hive.HiveExternalCatalog), > which is held by "worker3" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
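The deadlock above is the classic inconsistent-lock-ordering hazard: one code path locks HiveExternalCatalog then HiveSessionCatalog, the other path the reverse, and two threads interleave. A minimal sketch of the fix — every path acquiring the two locks in one global order — is below (plain Python locks standing in for the two catalogs; this is an illustration, not Spark code):

```python
import threading

# Two plain locks modeling the two monitors from the thread dump above.
external_catalog = threading.Lock()   # stands in for HiveExternalCatalog
session_catalog = threading.Lock()    # stands in for HiveSessionCatalog

def load_partition_fixed(results):
    # Deadlock-free version: this path takes external_catalog first.
    with external_catalog:
        with session_catalog:
            results.append("loadPartition")

def table_exists_fixed(results):
    # In the buggy ordering this path locked session_catalog first (the
    # opposite order), which is what allowed the deadlock. With a single
    # global order, neither thread can hold one lock while waiting on the
    # other's second lock.
    with external_catalog:
        with session_catalog:
            results.append("tableExists")

results = []
threads = [threading.Thread(target=load_partition_fixed, args=(results,)),
           threading.Thread(target=table_exists_fixed, args=(results,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # both threads complete: no deadlock
```

With the inconsistent ordering, thread A can hold `external_catalog` while waiting on `session_catalog`, and thread B the reverse — exactly the `worker0`/`worker3` cycle in the `kill -3` dump.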
[jira] [Assigned] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
[ https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32779: Assignee: Sandeep Katta > Spark/Hive3 interaction potentially causes deadlock > --- > > Key: SPARK-32779 > URL: https://issues.apache.org/jira/browse/SPARK-32779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Assignee: Sandeep Katta >Priority: Major > > This is an issue for applications that share a Spark Session across multiple > threads. > sessionCatalog.loadPartition (after checking that the table exists) grabs > locks in this order: > - HiveExternalCatalog > - HiveSessionCatalog (in Shim_v3_0) > Other operations (e.g., sessionCatalog.tableExists), grab locks in this order: > - HiveSessionCatalog > - HiveExternalCatalog > [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] > appears to be the culprit. Maybe db name should be defaulted _before_ the > call to HiveClient so that Shim_v3_0 doesn't have to call back into > SessionCatalog. Or possibly this is not needed at all, since loadPartition in > Shim_v2_1 doesn't worry about the default db name, but that might be because > of differences between Hive client libraries. 
> Reproduction case: > - You need to have a running Hive 3.x HMS instance and the appropriate > hive-site.xml for your Spark instance > - Adjust your spark.sql.hive.metastore.version accordingly > - It might take more than one try to hit the deadlock > Launch Spark: > {noformat} > bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" > --conf spark.sql.hive.metastore.version=3.1 > {noformat} > Then use the following code: > {noformat} > import scala.collection.mutable.ArrayBuffer > import scala.util.Random > val tableCount = 4 > for (i <- 0 until tableCount) { > val tableName = s"partitioned${i+1}" > sql(s"drop table if exists $tableName") > sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored > as orc") > } > val threads = new ArrayBuffer[Thread] > for (i <- 0 until tableCount) { > threads.append(new Thread( new Runnable { > override def run: Unit = { > val tableName = s"partitioned${i + 1}" > val rand = Random > val df = spark.range(0, 2).toDF("a") > val location = s"/tmp/${rand.nextLong.abs}" > df.write.mode("overwrite").orc(location) > sql( > s""" > LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition > (b=$i)""") > } > }, s"worker$i")) > threads(i).start() > } > for (i <- 0 until tableCount) { > println(s"Joining with thread $i") > threads(i).join() > } > println("All done") > {noformat} > The job often gets stuck after one or two "Joining..." lines. 
> {{kill -3}} shows something like this: > {noformat} > Found one Java-level deadlock: > = > "worker3": > waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a > org.apache.spark.sql.hive.HiveSessionCatalog), > which is held by "worker0" > "worker0": > waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a > org.apache.spark.sql.hive.HiveExternalCatalog), > which is held by "worker3" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32779) Spark/Hive3 interaction potentially causes deadlock
[ https://issues.apache.org/jira/browse/SPARK-32779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32779: - Fix Version/s: (was: 3.0.1) 3.0.2 > Spark/Hive3 interaction potentially causes deadlock > --- > > Key: SPARK-32779 > URL: https://issues.apache.org/jira/browse/SPARK-32779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Bruce Robbins >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.1.0, 3.0.2 > > > This is an issue for applications that share a Spark Session across multiple > threads. > sessionCatalog.loadPartition (after checking that the table exists) grabs > locks in this order: > - HiveExternalCatalog > - HiveSessionCatalog (in Shim_v3_0) > Other operations (e.g., sessionCatalog.tableExists), grab locks in this order: > - HiveSessionCatalog > - HiveExternalCatalog > [This|https://github.com/apache/spark/blob/ad6b887541bf90cc3ea830a1a3322b71ccdd80ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1332] > appears to be the culprit. Maybe db name should be defaulted _before_ the > call to HiveClient so that Shim_v3_0 doesn't have to call back into > SessionCatalog. Or possibly this is not needed at all, since loadPartition in > Shim_v2_1 doesn't worry about the default db name, but that might be because > of differences between Hive client libraries. 
> Reproduction case: > - You need to have a running Hive 3.x HMS instance and the appropriate > hive-site.xml for your Spark instance > - Adjust your spark.sql.hive.metastore.version accordingly > - It might take more than one try to hit the deadlock > Launch Spark: > {noformat} > bin/spark-shell --conf "spark.sql.hive.metastore.jars=${HIVE_HOME}/lib/*" > --conf spark.sql.hive.metastore.version=3.1 > {noformat} > Then use the following code: > {noformat} > import scala.collection.mutable.ArrayBuffer > import scala.util.Random > val tableCount = 4 > for (i <- 0 until tableCount) { > val tableName = s"partitioned${i+1}" > sql(s"drop table if exists $tableName") > sql(s"create table $tableName (a bigint) partitioned by (b bigint) stored > as orc") > } > val threads = new ArrayBuffer[Thread] > for (i <- 0 until tableCount) { > threads.append(new Thread( new Runnable { > override def run: Unit = { > val tableName = s"partitioned${i + 1}" > val rand = Random > val df = spark.range(0, 2).toDF("a") > val location = s"/tmp/${rand.nextLong.abs}" > df.write.mode("overwrite").orc(location) > sql( > s""" > LOAD DATA LOCAL INPATH '$location' INTO TABLE $tableName partition > (b=$i)""") > } > }, s"worker$i")) > threads(i).start() > } > for (i <- 0 until tableCount) { > println(s"Joining with thread $i") > threads(i).join() > } > println("All done") > {noformat} > The job often gets stuck after one or two "Joining..." lines. 
> {{kill -3}} shows something like this: > {noformat} > Found one Java-level deadlock: > = > "worker3": > waiting to lock monitor 0x7fdc3cde6798 (object 0x000784d98ac8, a > org.apache.spark.sql.hive.HiveSessionCatalog), > which is held by "worker0" > "worker0": > waiting to lock monitor 0x7fdc441d1b88 (object 0x0007861d1208, a > org.apache.spark.sql.hive.HiveExternalCatalog), > which is held by "worker3" > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32809) RDD分区数对于计算结果的影响
zhangchenglong created SPARK-32809: -- Summary: RDD分区数对于计算结果的影响 Key: SPARK-32809 URL: https://issues.apache.org/jira/browse/SPARK-32809 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: zhangchenglong class Exec3 { private val exec: SparkConf = new SparkConf().setMaster("local[1]").setAppName("exec3") private val context = new SparkContext(exec) context.setCheckpointDir("checkPoint") /** * get total number by key * in this project desired results are ("苹果",25) ("华为",20) * but in fact i get ("苹果",150) ("华为",20) * when i change it to local[3] the result is correct * i want to know which cause it and how to slove it */ @Test def testError(): Unit ={ val rdd = context.parallelize(Seq(("苹果", 10), ("苹果", 15), ("华为", 20))) rdd.aggregateByKey(1.0)( seqOp = (zero, price) => price * zero, combOp = (curr, agg) => curr + agg ).collect().foreach(println(_)) context.stop() } } -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32793) Expose assert_true in Python/Scala APIs and add error message parameter
[ https://issues.apache.org/jira/browse/SPARK-32793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191457#comment-17191457 ] Takeshi Yamamuro commented on SPARK-32793: -- You don't add it into SQL? I think BigQuery has a similar feature: [https://cloud.google.com/bigquery/docs/reference/standard-sql/debugging-statements] > Expose assert_true in Python/Scala APIs and add error message parameter > --- > > Key: SPARK-32793 > URL: https://issues.apache.org/jira/browse/SPARK-32793 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karen Feng >Priority: Minor > > # Add RAISEERROR() (or RAISE_ERROR()) to the API > # Add Scala/Python/R version of API for ASSERT_TRUE() > # Add an extra parameter to ASSERT_TRUE() as (cond, message), and for which > the `message` parameter is only lazily evaluated when the condition is not > true > # Change the implementation of ASSERT_TRUE() to be rewritten during > optimization to IF() instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
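Point 3 in the list above — evaluating `message` only when the condition is false — can be sketched with a callable message. This is a hypothetical illustration of the lazy-evaluation semantics, not the final Spark API:

```python
# Sketch of a lazily-evaluated error message for assert_true: the message
# is supplied as a callable and only materialized when the check fails.
def assert_true(cond, message=lambda: "assertion failed"):
    if not cond:
        raise RuntimeError(message())  # message built only on failure
    return None

calls = []
def expensive_message():
    calls.append(1)                    # record that the message was built
    return "value out of range"

assert_true(True, expensive_message)   # passes: message never constructed
print(len(calls))                      # 0

try:
    assert_true(False, expensive_message)
except RuntimeError as e:
    print(e)                           # value out of range
print(len(calls))                      # 1
```

The same idea carries over to the SQL rewrite in point 4: rewriting `assert_true(cond, msg)` to an `IF(cond, NULL, raise_error(msg))` form naturally defers the message expression to the failing branch.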
[jira] [Assigned] (SPARK-32677) Load function resource before create
[ https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32677: --- Assignee: ulysses you > Load function resource before create > > > Key: SPARK-32677 > URL: https://issues.apache.org/jira/browse/SPARK-32677 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > Change `CreateFunctionCommand` so that it checks the function class before creating > the function and adding it to the function registry. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32677) Load function resource before create
[ https://issues.apache.org/jira/browse/SPARK-32677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32677. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29502 [https://github.com/apache/spark/pull/29502] > Load function resource before create > > > Key: SPARK-32677 > URL: https://issues.apache.org/jira/browse/SPARK-32677 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.1.0 > > > Change `CreateFunctionCommand` so that it checks the function class before creating > the function and adding it to the function registry. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org