[jira] [Updated] (SPARK-28050) DataFrameWriter support insertInto a specific table partition
[ https://issues.apache.org/jira/browse/SPARK-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-28050: Description: {code:java} // Some comments here val ptTableName = "mc_test_pt_table" sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)") val df = spark.sparkContext.parallelize(0 to 99, 2) .map(f => { (s"name-$f", f) }) .toDF("name", "num") // if I want to insert df into a specific partition, // say pt1='2018', pt2='0601', the current API does not support it; // the only workaround is the following df.createOrReplaceTempView(s"${ptTableName}_tmp_view") sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view") {code} Propose to add another API to DataFrameWriter that can do something like: {code:java} df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") {code} We have a lot of scenarios like this in our production environment; providing an API like this would make them much less painful. was: ``` val ptTableName = "mc_test_pt_table" sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)") val df = spark.sparkContext.parallelize(0 to 99, 2) .map(f => { (s"name-$f", f) }) .toDF("name", "num") // if I want to insert df into a specific partition, // say pt1='2018', pt2='0601', the current API does not support it; // the only workaround is the following df.createOrReplaceTempView(s"${ptTableName}_tmp_view") sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view") ``` Propose to add another API to DataFrameWriter that can do something like: ``` df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") ``` We have a lot of scenarios like this in our production environment; providing an API like this would make them much less painful. > DataFrameWriter support insertInto a specific table partition > - > > Key: SPARK-28050 > URL: https://issues.apache.org/jira/browse/SPARK-28050 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.3, 2.4.3 >Reporter: Leanken.Lin >Priority: Minor > Fix For: 2.3.3, 2.4.3 > > > {code:java} > // Some comments here > val ptTableName = "mc_test_pt_table" > sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY > (pt1 STRING, pt2 STRING)") > val df = spark.sparkContext.parallelize(0 to 99, 2) > .map(f => > { > (s"name-$f", f) > }) > .toDF("name", "num") > // if I want to insert df into a specific partition, > // say pt1='2018', pt2='0601', the current API does not support it; > // the only workaround is the following > df.createOrReplaceTempView(s"${ptTableName}_tmp_view") > sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') > select * from ${ptTableName}_tmp_view") > {code} > Propose to add another API to DataFrameWriter that can do something like: > {code:java} > df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") > {code} > We have a lot of scenarios like this in our production environment; providing an API > like this would make them much less painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28051) Exposing JIRA issue component types at GitHub PRs
Dongjoon Hyun created SPARK-28051: - Summary: Exposing JIRA issue component types at GitHub PRs Key: SPARK-28051 URL: https://issues.apache.org/jira/browse/SPARK-28051 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue aims to expose JIRA issue component types at GitHub PRs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28050) DataFrameWriter support insertInto a specific table partition
[ https://issues.apache.org/jira/browse/SPARK-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-28050: Description: ``` val ptTableName = "mc_test_pt_table" sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)") val df = spark.sparkContext.parallelize(0 to 99, 2) .map(f => { (s"name-$f", f) }) .toDF("name", "num") // if I want to insert df into a specific partition, // say pt1='2018', pt2='0601', the current API does not support it; // the only workaround is the following df.createOrReplaceTempView(s"${ptTableName}_tmp_view") sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view") ``` Propose to add another API to DataFrameWriter that can do something like: ``` df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") ``` We have a lot of scenarios like this in our production environment; providing an API like this would make them much less painful. was: val ptTableName = "mc_test_pt_table" sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)") val df = spark.sparkContext.parallelize(0 to 99, 2) .map(f => { (s"name-$f", f) }) .toDF("name", "num") // if I want to insert df into a specific partition, // say pt1='2018', pt2='0601', the current API does not support it; // the only workaround is the following df.createOrReplaceTempView(s"${ptTableName}_tmp_view") sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view") Propose to add another API to DataFrameWriter that can do something like: df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") We have a lot of scenarios like this in our production environment; providing an API like this would make them much less painful. > DataFrameWriter support insertInto a specific table partition > - > > Key: SPARK-28050 > URL: https://issues.apache.org/jira/browse/SPARK-28050 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.3, 2.4.3 >Reporter: Leanken.Lin >Priority: Minor > Fix For: 2.3.3, 2.4.3 > > > ``` > val ptTableName = "mc_test_pt_table" > sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY > (pt1 STRING, pt2 STRING)") > val df = spark.sparkContext.parallelize(0 to 99, 2) > .map(f => > { > (s"name-$f", f) > }) > .toDF("name", "num") > // if I want to insert df into a specific partition, > // say pt1='2018', pt2='0601', the current API does not support it; > // the only workaround is the following > df.createOrReplaceTempView(s"${ptTableName}_tmp_view") > sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') > select * from ${ptTableName}_tmp_view") > ``` > Propose to add another API to DataFrameWriter that can do something like: > ``` > df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") > ``` > We have a lot of scenarios like this in our production environment; providing an API > like this would make them much less painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28050) DataFrameWriter support insertInto a specific table partition
Leanken.Lin created SPARK-28050: --- Summary: DataFrameWriter support insertInto a specific table partition Key: SPARK-28050 URL: https://issues.apache.org/jira/browse/SPARK-28050 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.3, 2.3.3 Reporter: Leanken.Lin Fix For: 2.4.3, 2.3.3 val ptTableName = "mc_test_pt_table" sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)") val df = spark.sparkContext.parallelize(0 to 99, 2) .map(f => { (s"name-$f", f) }) .toDF("name", "num") // if I want to insert df into a specific partition, // say pt1='2018', pt2='0601', the current API does not support it; // the only workaround is the following df.createOrReplaceTempView(s"${ptTableName}_tmp_view") sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view") Propose to add another API to DataFrameWriter that can do something like: df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") We have a lot of scenarios like this in our production environment; providing an API like this would make them much less painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
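Until such an API exists, the temp-view detour in the report can usually be avoided by appending the partition values as literal columns and letting a dynamic partition insert route the rows. A minimal sketch, assuming the Hive-backed table created above and that dynamic partition inserts are allowed in the session:
{code:scala}
import org.apache.spark.sql.functions.lit

// Allow dynamic partition inserts for Hive tables (assumes Hive support is enabled).
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// insertInto matches columns by position, so the partition columns (pt1, pt2)
// must come last, in the same order as in the CREATE TABLE statement.
df.withColumn("pt1", lit("2018"))
  .withColumn("pt2", lit("0601"))
  .write
  .insertInto(ptTableName)
{code}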
[jira] [Updated] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27100: -- Component/s: (was: MLlib) SQL > dag-scheduler-event-loop" java.lang.StackOverflowError > -- > > Key: SPARK-27100 > URL: https://issues.apache.org/jira/browse/SPARK-27100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.3.3 >Reporter: KaiXu >Priority: Major > Attachments: SPARK-27100-Overflow.txt, stderr > > > ALS in Spark MLlib causes StackOverflow: > /opt/sparkml/spark213/bin/spark-submit --properties-file > /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class > com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client > --num-executors 3 --executor-memory 322g > /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 > --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 > --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input > > Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
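The StackOverflowError above typically comes from serializing a very long RDD lineage after many ALS iterations. A common mitigation, sketched here under the assumption that a reliable checkpoint directory is available, is to set a checkpoint directory so the periodic checkpointing can actually truncate the lineage:
{code:scala}
import org.apache.spark.ml.recommendation.ALS

// Checkpointing truncates the lineage that otherwise grows with every iteration.
// The HDFS path is a placeholder; any reliable filesystem will do.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/als-checkpoints")

val als = new ALS()
  .setRank(100)
  .setMaxIter(100)
  .setImplicitPrefs(true)
  .setCheckpointInterval(5)  // only takes effect once a checkpoint dir is set

// `ratings` stands in for the benchmark's input DataFrame (user/item/rating columns).
val model = als.fit(ratings)
{code}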
[jira] [Resolved] (SPARK-28049) I want to create a first ticket in JIRA
[ https://issues.apache.org/jira/browse/SPARK-28049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sanjeet resolved SPARK-28049. - Resolution: Fixed Code change has been delivered > I want to create a first ticket in JIRA > > > Key: SPARK-28049 > URL: https://issues.apache.org/jira/browse/SPARK-28049 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 2.2.2, 2.4.3 > Environment: I just want to test in JIRA >Reporter: sanjeet >Priority: Minor > Labels: test > Fix For: 2.4.4, 2.4.3 > > > I just want to test in JIRA -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28049) I want to create a first ticket in JIRA
[ https://issues.apache.org/jira/browse/SPARK-28049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863742#comment-16863742 ] sanjeet commented on SPARK-28049: - 2nd comment > I want to create a first ticket in JIRA > > > Key: SPARK-28049 > URL: https://issues.apache.org/jira/browse/SPARK-28049 > Project: Spark > Issue Type: Test > Components: Build >Affects Versions: 2.2.2, 2.4.3 > Environment: I just want to test in JIRA >Reporter: sanjeet >Priority: Minor > Labels: test > Fix For: 2.4.4, 2.4.3 > > > I just want to test in JIRA -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28049) I want to create a first ticket in JIRA
sanjeet created SPARK-28049: --- Summary: I want to create a first ticket in JIRA Key: SPARK-28049 URL: https://issues.apache.org/jira/browse/SPARK-28049 Project: Spark Issue Type: Test Components: Build Affects Versions: 2.4.3, 2.2.2 Environment: I just want to test in JIRA Reporter: sanjeet Fix For: 2.4.4, 2.4.3 I just want to test in JIRA -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28047) [UI] Little bug in executorspage.js
[ https://issues.apache.org/jira/browse/SPARK-28047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang resolved SPARK-28047. - Resolution: Not A Problem > [UI] Little bug in executorspage.js > > > Key: SPARK-28047 > URL: https://issues.apache.org/jira/browse/SPARK-28047 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.3 >Reporter: feiwang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28048) pyspark.sql.functions.explode will abandon the row which has an empty list column when applied to the column
[ https://issues.apache.org/jira/browse/SPARK-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ma Xinmin updated SPARK-28048: -- Shepherd: Dongjoon Hyun > pyspark.sql.functions.explode will abandon the row which has an empty list > column when applied to the column > --- > > Key: SPARK-28048 > URL: https://issues.apache.org/jira/browse/SPARK-28048 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.1 >Reporter: Ma Xinmin >Priority: Major > > {code} > from pyspark.sql import Row > from pyspark.sql.functions import explode > eDF = spark.createDataFrame([Row(a=1, intlist=[1,2], mapfield={"a": "b"}), > Row(a=2, intlist=[], mapfield={"a": "b"})]) > eDF = eDF.withColumn('another', explode(eDF.intlist)).collect() > eDF > {code} > The `a=2` row is missing in the output -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28048) pyspark.sql.functions.explode will abandon the row which has an empty list column when applied to the column
Ma Xinmin created SPARK-28048: - Summary: pyspark.sql.functions.explode will abandon the row which has an empty list column when applied to the column Key: SPARK-28048 URL: https://issues.apache.org/jira/browse/SPARK-28048 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.1 Reporter: Ma Xinmin {code} from pyspark.sql import Row from pyspark.sql.functions import explode eDF = spark.createDataFrame([Row(a=1, intlist=[1,2], mapfield={"a": "b"}), Row(a=2, intlist=[], mapfield={"a": "b"})]) eDF = eDF.withColumn('another', explode(eDF.intlist)).collect() eDF {code} The `a=2` row is missing in the output -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
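A way to keep the rows with empty arrays today, shown below as a Scala sketch of the same repro, is explode_outer, which emits a null element instead of dropping the row (it was added in releases newer than the 2.1.1 reported above):
{code:scala}
import org.apache.spark.sql.functions.explode_outer
import spark.implicits._

val eDF = Seq(
  (1, Seq(1, 2), Map("a" -> "b")),
  (2, Seq.empty[Int], Map("a" -> "b"))
).toDF("a", "intlist", "mapfield")

// explode drops the a=2 row; explode_outer keeps it with another = null.
eDF.withColumn("another", explode_outer($"intlist")).show()
{code}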
[jira] [Commented] (SPARK-28043) Reading json with duplicate columns drops the first column value
[ https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863680#comment-16863680 ] Liang-Chi Hsieh commented on SPARK-28043: - I tried to look around that, like https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object. So JSON doesn't disallow duplicate keys. Spark SQL doesn't disallow duplicate field names, although it can impose some difficulties when using a DataFrame with duplicate field names. To clarify, just because Spark SQL allows duplicate field names doesn't mean that we should use such a feature. But I think that, to some extent, the current behavior isn't consistent. {code} scala> val jsonRDD = spark.sparkContext.parallelize(Seq("[{ \"a\": \"blah\", \"a\": \"blah2\"} ]")) jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at :23 scala> val df = spark.read.json(jsonRDD) df: org.apache.spark.sql.DataFrame = [a: string, a: string] scala> df.show ++-+ | a|a| ++-+ |null|blah2| ++-+ {code} > Reading json with duplicate columns drops the first column value > > > Key: SPARK-28043 > URL: https://issues.apache.org/jira/browse/SPARK-28043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Priority: Major > > When reading a JSON blob with duplicate fields, Spark appears to ignore the > value of the first one. JSON recommends unique names but does not require it; > since JSON and Spark SQL both allow duplicate field names, we should fix the > bug where the first column value is getting dropped. > > I'm guessing somewhere when parsing JSON, we're turning it into a Map which > is causing the first value to be overridden. > > Repro (Python, 2.4): > >>> jsonRDD = spark.sparkContext.parallelize(["\\{ \"a\": \"blah\", \"a\": > >>> \"blah2\"}"]) > >>> df = spark.read.json(jsonRDD) > >>> df.show() > +-++ > |a|a| > +-++ > |null|blah2| > +-++ > > The expected response would be: > +-++ > |a|a| > +-++ > |blah|blah2| > +-++ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
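The "turning it into a Map" hypothesis from the report is easy to model in plain Scala; this only illustrates the suspected mechanism and does not point at where the JSON parser actually does it:
{code:scala}
// When duplicate keys are collected into a Map, the last value wins and the
// first one is silently lost, which matches the null/blah2 output above.
val fields = Seq("a" -> "blah", "a" -> "blah2")
val asMap  = fields.toMap   // Map(a -> blah2)
asMap.get("a")              // Some(blah2): "blah" is gone
{code}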
[jira] [Created] (SPARK-28047) [UI] Little bug in executorspage.js
feiwang created SPARK-28047: --- Summary: [UI] Little bug in executorspage.js Key: SPARK-28047 URL: https://issues.apache.org/jira/browse/SPARK-28047 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.3 Reporter: feiwang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27018: Assignee: (was: Apache Spark) > Checkpointed RDD deleted prematurely when using GBTClassifier > - > > Key: SPARK-27018 > URL: https://issues.apache.org/jira/browse/SPARK-27018 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core >Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0 > Environment: OS: Ubuntu Linux 18.10 > Java: java version "1.8.0_201" > Java(TM) SE Runtime Environment (build 1.8.0_201-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode) > Reproducible with a single-node Spark in standalone mode. > Reproducible with Zepellin or Spark shell. > >Reporter: Piotr Kołaczkowski >Priority: Major > Attachments: > Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch > > > Steps to reproduce: > {noformat} > import org.apache.spark.ml.linalg.Vectors > import org.apache.spark.ml.classification.GBTClassifier > case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int) > sc.setCheckpointDir("/checkpoints") > val trainingData = sc.parallelize(1 to 2426874, 256).map(x => > Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF > val classifier = new GBTClassifier() > .setLabelCol("label") > .setFeaturesCol("features") > .setProbabilityCol("probability") > .setMaxIter(100) > .setMaxDepth(10) > .setCheckpointInterval(2) > classifier.fit(trainingData){noformat} > > The last line fails with: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 > (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: > /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51 > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39) > at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(R
[jira] [Assigned] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27018: Assignee: Apache Spark > Checkpointed RDD deleted prematurely when using GBTClassifier > - > > Key: SPARK-27018 > URL: https://issues.apache.org/jira/browse/SPARK-27018 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core >Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0 > Environment: OS: Ubuntu Linux 18.10 > Java: java version "1.8.0_201" > Java(TM) SE Runtime Environment (build 1.8.0_201-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode) > Reproducible with a single-node Spark in standalone mode. > Reproducible with Zepellin or Spark shell. > >Reporter: Piotr Kołaczkowski >Assignee: Apache Spark >Priority: Major > Attachments: > Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch > > > Steps to reproduce: > {noformat} > import org.apache.spark.ml.linalg.Vectors > import org.apache.spark.ml.classification.GBTClassifier > case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int) > sc.setCheckpointDir("/checkpoints") > val trainingData = sc.parallelize(1 to 2426874, 256).map(x => > Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF > val classifier = new GBTClassifier() > .setLabelCol("label") > .setFeaturesCol("features") > .setProbabilityCol("probability") > .setMaxIter(100) > .setMaxDepth(10) > .setCheckpointInterval(2) > classifier.fit(trainingData){noformat} > > The last line fails with: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 > (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: > /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51 > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39) > at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spa
[jira] [Updated] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-27018: - Component/s: Spark Core > Checkpointed RDD deleted prematurely when using GBTClassifier > - > > Key: SPARK-27018 > URL: https://issues.apache.org/jira/browse/SPARK-27018 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core >Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0 > Environment: OS: Ubuntu Linux 18.10 > Java: java version "1.8.0_201" > Java(TM) SE Runtime Environment (build 1.8.0_201-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode) > Reproducible with a single-node Spark in standalone mode. > Reproducible with Zepellin or Spark shell. > >Reporter: Piotr Kołaczkowski >Priority: Major > Attachments: > Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch > > > Steps to reproduce: > {noformat} > import org.apache.spark.ml.linalg.Vectors > import org.apache.spark.ml.classification.GBTClassifier > case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int) > sc.setCheckpointDir("/checkpoints") > val trainingData = sc.parallelize(1 to 2426874, 256).map(x => > Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF > val classifier = new GBTClassifier() > .setLabelCol("label") > .setFeaturesCol("features") > .setProbabilityCol("probability") > .setMaxIter(100) > .setMaxDepth(10) > .setCheckpointInterval(2) > classifier.fit(trainingData){noformat} > > The last line fails with: > {noformat} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 > (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: > /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51 > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31) > at > com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39) > at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > a
[jira] [Updated] (SPARK-28022) k8s pod affinity to achieve cloud native friendly autoscaling
[ https://issues.apache.org/jira/browse/SPARK-28022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henry Yu updated SPARK-28022: - Summary: k8s pod affinity to achieve cloud native friendly autoscaling (was: k8s pod affinity achieve cloud native friendly autoscaling ) > k8s pod affinity to achieve cloud native friendly autoscaling > -- > > Key: SPARK-28022 > URL: https://issues.apache.org/jira/browse/SPARK-28022 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Henry Yu >Priority: Major > > Hi, in order to achieve cloud-native friendly autoscaling, I propose to add > a pod affinity feature. > Traditionally, when we use Spark in a fixed-size YARN cluster, it makes sense to > spread containers across every node. > With cloud-native resource management, we want to release a node when we no longer > need it. > The pod affinity feature aims to place all pods of a given application on a subset of > nodes instead of across all nodes. > By the way, using a pod template is not a good choice here; adding the application id > to the pod affinity term at submit time is more robust. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-28021) An inappropriate exception in StaticMemoryManager.getMaxExecutionMemory
[ https://issues.apache.org/jira/browse/SPARK-28021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] child2d closed SPARK-28021. --- > An inappropriate exception in StaticMemoryManager.getMaxExecutionMemory > -- > > Key: SPARK-28021 > URL: https://issues.apache.org/jira/browse/SPARK-28021 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: child2d >Priority: Minor > > When I reviewed StaticMemoryManager.scala, a question came to me. > {code:java} > private def getMaxExecutionMemory(conf: SparkConf): Long = { > val systemMaxMemory = conf.getLong("spark.testing.memory", > Runtime.getRuntime.maxMemory) > if (systemMaxMemory < MIN_MEMORY_BYTES) { > throw new IllegalArgumentException(s"System memory $systemMaxMemory must > " + > s"be at least $MIN_MEMORY_BYTES. Please increase heap size using the > --driver-memory " + > s"option or spark.driver.memory in Spark configuration.") > } > if (conf.contains("spark.executor.memory")) { > val executorMemory = conf.getSizeAsBytes("spark.executor.memory") > if (executorMemory < MIN_MEMORY_BYTES) { > throw new IllegalArgumentException(s"Executor memory $executorMemory > must be at least " + > s"$MIN_MEMORY_BYTES. Please increase executor memory using the " + > s"--executor-memory option or spark.executor.memory in Spark > configuration.") > } > } > val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2) > val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8) > (systemMaxMemory * memoryFraction * safetyFraction).toLong > } > {code} > When an executor calls getMaxExecutionMemory, it first sets systemMaxMemory > from Runtime.getRuntime.maxMemory, then compares systemMaxMemory with > MIN_MEMORY_BYTES. > If systemMaxMemory is smaller, the program throws an exception reminding the user to > increase the heap size using --driver-memory. > I wonder if this is wrong, because the heap size of executors is set by > --executor-memory? > Although there is another exception below about adjusting the executor's memory, > I just think that the first exception may not be appropriate. > Thanks for answering my question! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28021) An inappropriate exception in StaticMemoryManager.getMaxExecutionMemory
[ https://issues.apache.org/jira/browse/SPARK-28021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863648#comment-16863648 ] child2d commented on SPARK-28021: - Thanks for the reminder. I will close the issue. > An inappropriate exception in StaticMemoryManager.getMaxExecutionMemory > -- > > Key: SPARK-28021 > URL: https://issues.apache.org/jira/browse/SPARK-28021 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: child2d >Priority: Minor > > When I reviewed StaticMemoryManager.scala, a question came to me. > {code:java} > private def getMaxExecutionMemory(conf: SparkConf): Long = { > val systemMaxMemory = conf.getLong("spark.testing.memory", > Runtime.getRuntime.maxMemory) > if (systemMaxMemory < MIN_MEMORY_BYTES) { > throw new IllegalArgumentException(s"System memory $systemMaxMemory must > " + > s"be at least $MIN_MEMORY_BYTES. Please increase heap size using the > --driver-memory " + > s"option or spark.driver.memory in Spark configuration.") > } > if (conf.contains("spark.executor.memory")) { > val executorMemory = conf.getSizeAsBytes("spark.executor.memory") > if (executorMemory < MIN_MEMORY_BYTES) { > throw new IllegalArgumentException(s"Executor memory $executorMemory > must be at least " + > s"$MIN_MEMORY_BYTES. Please increase executor memory using the " + > s"--executor-memory option or spark.executor.memory in Spark > configuration.") > } > } > val memoryFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2) > val safetyFraction = conf.getDouble("spark.shuffle.safetyFraction", 0.8) > (systemMaxMemory * memoryFraction * safetyFraction).toLong > } > {code} > When an executor calls getMaxExecutionMemory, it first sets systemMaxMemory > from Runtime.getRuntime.maxMemory, then compares systemMaxMemory with > MIN_MEMORY_BYTES. > If systemMaxMemory is smaller, the program throws an exception reminding the user to > increase the heap size using --driver-memory. > I wonder if this is wrong, because the heap size of executors is set by > --executor-memory? > Although there is another exception below about adjusting the executor's memory, > I just think that the first exception may not be appropriate. > Thanks for answering my question! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
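For context on when getMaxExecutionMemory is reached at all: StaticMemoryManager is only used when legacy memory management is turned on; otherwise UnifiedMemoryManager applies. A minimal sketch, assuming a Spark 2.x build where the legacy flag still exists:
{code:scala}
import org.apache.spark.SparkConf

// Opt in to the legacy StaticMemoryManager; the two fractions below are the
// values multiplied together at the end of getMaxExecutionMemory.
val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")
  .set("spark.shuffle.memoryFraction", "0.2")
  .set("spark.shuffle.safetyFraction", "0.8")
{code}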
[jira] [Commented] (SPARK-28023) Trim the string when casting string type to other types
[ https://issues.apache.org/jira/browse/SPARK-28023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863643#comment-16863643 ] Yuming Wang commented on SPARK-28023: - I'm working on it. > Trim the string when casting string type to other types > > > Key: SPARK-28023 > URL: https://issues.apache.org/jira/browse/SPARK-28023 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > For example: > {code:sql} > SELECT bool ' f '; > select int2 ' 21234 '; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
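For comparison with the PostgreSQL examples above, the equivalent probes in Spark SQL look like the sketch below; it only shows how to reproduce the cases, without asserting what the current output is:
{code:scala}
// Spark SQL counterparts of the PostgreSQL-style literals in the description.
spark.sql("SELECT CAST(' f ' AS BOOLEAN)").show()
spark.sql("SELECT CAST(' 21234 ' AS SMALLINT)").show()
{code}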
[jira] [Resolved] (SPARK-27925) Better control numBins of curves in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-27925. -- Resolution: Not A Problem > Better control numBins of curves in BinaryClassificationMetrics > --- > > Key: SPARK-27925 > URL: https://issues.apache.org/jira/browse/SPARK-27925 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > In the case of large datasets with tens of thousands of partitions, the current curve > down-sampling method tends to generate many more bins than the value set by > numBins. > In the current implementation, grouping is done within partitions; that is to say, > each partition contributes at least one bin. > A more reasonable way is to move the grouping forward into the sort step; > then we can directly set the number of bins to the number of partitions, and regard one > partition as one bin. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
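For reference, the down-sampling knob being discussed is the numBins constructor argument of the RDD-based metrics class. A small sketch; the score generation is made up purely for illustration:
{code:scala}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Many partitions relative to numBins is the situation described above:
// grouping happens within partitions, so each partition yields at least one bin.
val scoreAndLabels = sc.parallelize(
  Seq.fill(100000)(math.random).map(s => (s, if (s > 0.5) 1.0 else 0.0)),
  numSlices = 1000)

val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 100)
val rocPoints = metrics.roc().count()  // can exceed numBins when partitions >> numBins
{code}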
[jira] [Assigned] (SPARK-28045) add missing RankingEvaluator
[ https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28045: Assignee: (was: Apache Spark) > add missing RankingEvaluator > > > Key: SPARK-28045 > URL: https://issues.apache.org/jira/browse/SPARK-28045 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > expose RankingEvaluator -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28045) add missing RankingEvaluator
[ https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28045: Assignee: Apache Spark > add missing RankingEvaluator > > > Key: SPARK-28045 > URL: https://issues.apache.org/jira/browse/SPARK-28045 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > > expose RankingEvaluator -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28046) OOM caused by building the hash table when the compression ratio of the small table is normal
[ https://issues.apache.org/jira/browse/SPARK-28046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ke Jia updated SPARK-28046: --- Attachment: image-2019-06-14-10-34-53-379.png > OOM caused by building the hash table when the compression ratio of the small table is > normal > > > Key: SPARK-28046 > URL: https://issues.apache.org/jira/browse/SPARK-28046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Ke Jia >Priority: Major > Attachments: image-2019-06-14-10-34-53-379.png > > > Currently, Spark converts a sort merge join to a broadcast hash join when > the small table's compressed size is <= the broadcast threshold. Like > Spark, AE also converts the SMJ to BHJ based on the compressed size at > runtime. In our test, with AE enabled and a 32M broadcast threshold, one SMJ > with a 16M compressed small table is converted to BHJ. However, when building the hash > table, the 16M small table decompresses to 2GB and has 134485048 > rows, with a large amount of contiguous and repeated values. Therefore, > the following OOM exception occurs when building the hash table: > !image-2019-06-14-10-29-00-499.png! > Based on this finding, it may not be reasonable to decide whether an SMJ > is converted to BHJ based only on the compressed size. In AE, we add a condition > on the estimated decompressed size based on the row count. In Spark, > we may also need to consider the decompressed size or row count, not only > the compressed size, when converting an SMJ to BHJ. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28046) OOM caused by building the hash table when the compression ratio of the small table is normal
[ https://issues.apache.org/jira/browse/SPARK-28046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ke Jia updated SPARK-28046: --- Description: Currently, Spark converts a sort merge join to a broadcast hash join when the small table's compressed size is <= the broadcast threshold. Like Spark, AE also converts the SMJ to BHJ based on the compressed size at runtime. In our test, with AE enabled and a 32M broadcast threshold, one SMJ with a 16M compressed small table is converted to BHJ. However, when building the hash table, the 16M small table decompresses to 2GB and has 134485048 rows, with a large amount of contiguous and repeated values. Therefore, the following OOM exception occurs when building the hash table: !image-2019-06-14-10-34-53-379.png! Based on this finding, it may not be reasonable to decide whether an SMJ is converted to BHJ based only on the compressed size. In AE, we add a condition on the estimated decompressed size based on the row count. In Spark, we may also need to consider the decompressed size or row count, not only the compressed size, when converting an SMJ to BHJ. was: Currently, Spark converts a sort merge join to a broadcast hash join when the small table's compressed size is <= the broadcast threshold. Like Spark, AE also converts the SMJ to BHJ based on the compressed size at runtime. In our test, with AE enabled and a 32M broadcast threshold, one SMJ with a 16M compressed small table is converted to BHJ. However, when building the hash table, the 16M small table decompresses to 2GB and has 134485048 rows, with a large amount of contiguous and repeated values. Therefore, the following OOM exception occurs when building the hash table: !image-2019-06-14-10-29-00-499.png! Based on this finding, it may not be reasonable to decide whether an SMJ is converted to BHJ based only on the compressed size. In AE, we add a condition on the estimated decompressed size based on the row count. In Spark, we may also need to consider the decompressed size or row count, not only the compressed size, when converting an SMJ to BHJ. > OOM caused by building the hash table when the compression ratio of the small table is > normal > > > Key: SPARK-28046 > URL: https://issues.apache.org/jira/browse/SPARK-28046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Ke Jia >Priority: Major > Attachments: image-2019-06-14-10-34-53-379.png > > > Currently, Spark converts a sort merge join to a broadcast hash join when > the small table's compressed size is <= the broadcast threshold. Like > Spark, AE also converts the SMJ to BHJ based on the compressed size at > runtime. In our test, with AE enabled and a 32M broadcast threshold, one SMJ > with a 16M compressed small table is converted to BHJ. However, when building the hash > table, the 16M small table decompresses to 2GB and has 134485048 > rows, with a large amount of contiguous and repeated values. Therefore, > the following OOM exception occurs when building the hash table: > !image-2019-06-14-10-34-53-379.png! > Based on this finding, it may not be reasonable to decide whether an SMJ > is converted to BHJ based only on the compressed size. In AE, we add a condition > on the estimated decompressed size based on the row count. In Spark, > we may also need to consider the decompressed size or row count, not only > the compressed size, when converting an SMJ to BHJ. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28046) OOM caused by building the hash table when the compression ratio of the small table is normal
Ke Jia created SPARK-28046: -- Summary: OOM caused by building the hash table when the compression ratio of the small table is normal Key: SPARK-28046 URL: https://issues.apache.org/jira/browse/SPARK-28046 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: Ke Jia Currently, Spark converts a sort merge join to a broadcast hash join when the small table's compressed size is <= the broadcast threshold. Like Spark, AE also converts the SMJ to BHJ based on the compressed size at runtime. In our test, with AE enabled and a 32M broadcast threshold, one SMJ with a 16M compressed small table is converted to BHJ. However, when building the hash table, the 16M small table decompresses to 2GB and has 134485048 rows, with a large amount of contiguous and repeated values. Therefore, the following OOM exception occurs when building the hash table: !image-2019-06-14-10-29-00-499.png! Based on this finding, it may not be reasonable to decide whether an SMJ is converted to BHJ based only on the compressed size. In AE, we add a condition on the estimated decompressed size based on the row count. In Spark, we may also need to consider the decompressed size or row count, not only the compressed size, when converting an SMJ to BHJ. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
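The threshold referred to throughout this ticket is spark.sql.autoBroadcastJoinThreshold, which, as the description notes, is compared against the small side's compressed (on-disk) size estimate. A sketch of the two obvious knobs available today, pending a smarter row-count-based condition:
{code:scala}
// 32 MB threshold, as in the test above.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 32L * 1024 * 1024)

// Blunt workaround when the decompressed build side is too large to hash:
// disable the SMJ -> BHJ conversion entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
{code}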
[jira] [Created] (SPARK-28045) add missing RankingEvaluator
zhengruifeng created SPARK-28045: Summary: add missing RankingEvaluator Key: SPARK-28045 URL: https://issues.apache.org/jira/browse/SPARK-28045 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng expose RankingEvaluator -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
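What exists today is the RDD-based RankingMetrics; the ticket asks for an ML-side Evaluator wrapping the same computations. A small sketch of the existing class, with made-up rankings:
{code:scala}
import org.apache.spark.mllib.evaluation.RankingMetrics

// Each row pairs a predicted ranking with the ground-truth relevant items.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 2, 3, 4, 5), Array(1, 2, 5)),
  (Array(4, 1, 3), Array(1, 2, 3))
))

val metrics = new RankingMetrics(predictionAndLabels)
metrics.precisionAt(3)
metrics.meanAveragePrecision
metrics.ndcgAt(5)
{code}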
[jira] [Assigned] (SPARK-28044) MulticlassClassificationEvaluator support more metrics
[ https://issues.apache.org/jira/browse/SPARK-28044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28044: Assignee: Apache Spark > MulticlassClassificationEvaluator support more metrics > -- > > Key: SPARK-28044 > URL: https://issues.apache.org/jira/browse/SPARK-28044 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > expose more metrics in evaluator: > weightedTruePositiveRate > weightedFalsePositiveRate > weightedFMeasure > truePositiveRateByLabel > falsePositiveRateByLabel > precisionByLabel > recallByLabel > fMeasureByLabel -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28044) MulticlassClassificationEvaluator support more metrics
[ https://issues.apache.org/jira/browse/SPARK-28044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28044: Assignee: (was: Apache Spark) > MulticlassClassificationEvaluator support more metrics > -- > > Key: SPARK-28044 > URL: https://issues.apache.org/jira/browse/SPARK-28044 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > expose more metrics in evaluator: > weightedTruePositiveRate > weightedFalsePositiveRate > weightedFMeasure > truePositiveRateByLabel > falsePositiveRateByLabel > precisionByLabel > recallByLabel > fMeasureByLabel -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28044) MulticlassClassificationEvaluator support more metrics
zhengruifeng created SPARK-28044: Summary: MulticlassClassificationEvaluator support more metrics Key: SPARK-28044 URL: https://issues.apache.org/jira/browse/SPARK-28044 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng expose more metrics in evaluator: weightedTruePositiveRate weightedFalsePositiveRate weightedFMeasure truePositiveRateByLabel falsePositiveRateByLabel precisionByLabel recallByLabel fMeasureByLabel -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
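For contrast with the proposal, the evaluator currently exposes only a handful of metric names. A minimal usage sketch; the `predictions` DataFrame with label and prediction columns is assumed to exist:
{code:scala}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Currently supported metric names: f1, weightedPrecision, weightedRecall, accuracy.
// The weighted-rate and per-label metrics listed above would extend this set.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("weightedPrecision")

val score = evaluator.evaluate(predictions)
{code}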
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863598#comment-16863598 ] HonglunChen commented on SPARK-18112: - [~dongjoon] Thank you, I get it. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
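Since the fix that went into 2.2.0, pointing Spark at a Hive 2.x metastore is a configuration matter rather than a code change. A sketch; the jar location is a placeholder path:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive2-metastore")
  // Version of the Hive metastore being contacted.
  .config("spark.sql.hive.metastore.version", "2.1.0")
  // Jars matching that metastore version; "/opt/hive-2.1.0/lib/*" is a placeholder.
  .config("spark.sql.hive.metastore.jars", "/opt/hive-2.1.0/lib/*")
  .enableHiveSupport()
  .getOrCreate()
{code}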
[jira] [Commented] (SPARK-28041) Increase the minimum pandas version to 0.23.2
[ https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863570#comment-16863570 ] Bryan Cutler commented on SPARK-28041: -- Yes, definitely. I made a quick PR, but we should run it through the mailing list. I'll send it now. > Increase the minimum pandas version to 0.23.2 > - > > Key: SPARK-28041 > URL: https://issues.apache.org/jira/browse/SPARK-28041 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, the minimum supported Pandas version is 0.19.2. We bumped up > testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 > years ago, it is not always compatible with other Python libraries. > Increasing the version to 0.23.2 will also allow some workarounds to be > removed and make maintenance easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863567#comment-16863567 ] Hyukjin Kwon commented on SPARK-27463: -- Yeah, I think it'd be easier to discuss this with a PR > Support Dataframe Cogroup via Pandas UDFs > -- > > Key: SPARK-27463 > URL: https://issues.apache.org/jira/browse/SPARK-27463 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Chris Martin >Priority: Major > > Recent work on Pandas UDFs in Spark has allowed for improved > interoperability between Pandas and Spark. This proposal aims to extend this > by introducing a new Pandas UDF type which would allow for a cogroup > operation to be applied to two PySpark DataFrames. > Full details are in the google document linked below. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
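For context, the typed Scala Dataset API already has a cogroup operation on KeyValueGroupedDataset; the proposal is essentially a pandas-friendly analogue of it for PySpark. A small sketch of the existing Scala operation (the case classes, key, and data below are made up for illustration):
{code:scala}
import org.apache.spark.sql.SparkSession

case class Sale(id: Int, amount: Double)
case class Target(id: Int, target: Double)

object CogroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cogroup-sketch").getOrCreate()
    import spark.implicits._

    val sales   = Seq(Sale(1, 10.0), Sale(1, 5.0), Sale(2, 7.0)).toDS()
    val targets = Seq(Target(1, 12.0), Target(2, 8.0)).toDS()

    // Both sides are grouped by key; the function sees all rows for a key from
    // each side at once and returns any number of output rows.
    val diffs = sales.groupByKey(_.id).cogroup(targets.groupByKey(_.id)) {
      (id, saleRows, targetRows) =>
        Iterator((id, saleRows.map(_.amount).sum - targetRows.map(_.target).sum))
    }

    diffs.toDF("id", "diff").show()
    spark.stop()
  }
}
{code}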
[jira] [Assigned] (SPARK-28041) Increase the minimum pandas version to 0.23.2
[ https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28041: Assignee: Apache Spark > Increase the minimum pandas version to 0.23.2 > - > > Key: SPARK-28041 > URL: https://issues.apache.org/jira/browse/SPARK-28041 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Major > > Currently, the minimum supported Pandas version is 0.19.2. We bumped up > testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 > years ago, it is not always compatible with other Python libraries. > Increasing the version to 0.23.2 will also allow some workarounds to be > removed and make maintenance easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28041) Increase the minimum pandas version to 0.23.2
[ https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28041: Assignee: (was: Apache Spark) > Increase the minimum pandas version to 0.23.2 > - > > Key: SPARK-28041 > URL: https://issues.apache.org/jira/browse/SPARK-28041 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, the minimum supported Pandas version is 0.19.2. We bumped up > testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 > years ago, it is not always compatible with other Python libraries. > Increasing the version to 0.23.2 will also allow some workarounds to be > removed and make maintenance easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28041) Increase the minimum pandas version to 0.23.2
[ https://issues.apache.org/jira/browse/SPARK-28041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863565#comment-16863565 ] Hyukjin Kwon commented on SPARK-28041: -- [~bryanc], BTW, can we quickly discuss this one in dev mailing list? > Increase the minimum pandas version to 0.23.2 > - > > Key: SPARK-28041 > URL: https://issues.apache.org/jira/browse/SPARK-28041 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently, the minimum supported Pandas version is 0.19.2. We bumped up > testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 > years ago, it is not always compatible with other Python libraries. > Increasing the version to 0.23.2 will also allow some workarounds to be > removed and make maintenance easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28043) Reading json with duplicate columns drops the first column value
[ https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mukul Murthy updated SPARK-28043: - Description: When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. JSON recommends unique names but does not require it; since JSON and Spark SQL both allow duplicate field names, we should fix the bug where the first column value is getting dropped. I'm guessing somewhere when parsing JSON, we're turning it into a Map which is causing the first value to be overridden. Repro (Python, 2.4): >>> jsonRDD = spark.sparkContext.parallelize(["\\{ \"a\": \"blah\", \"a\": >>> \"blah2\"}"]) >>> df = spark.read.json(jsonRDD) >>> df.show() +-++ |a|a| +-++ |null|blah2| +-++ The expected response would be: +-++ |a|a| +-++ |blah|blah2| +-++ was: When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. JSON recommends unique names but does not require it; since JSON and Spark SQL both allow duplicate field names, we should fix the bug where the first column value is getting dropped. Repro (Python, 2.4): >>> jsonRDD = spark.sparkContext.parallelize(["\{ \"a\": \"blah\", \"a\": >>> \"blah2\"}"]) >>> df = spark.read.json(jsonRDD) >>> df.show() ++-+ | a| a| ++-+ |null|blah2| ++-+ The expected response would be: ++-+ | a| a| ++-+ |blah|blah2| ++-+ > Reading json with duplicate columns drops the first column value > > > Key: SPARK-28043 > URL: https://issues.apache.org/jira/browse/SPARK-28043 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Mukul Murthy >Priority: Major > > When reading a JSON blob with duplicate fields, Spark appears to ignore the > value of the first one. JSON recommends unique names but does not require it; > since JSON and Spark SQL both allow duplicate field names, we should fix the > bug where the first column value is getting dropped. > > I'm guessing somewhere when parsing JSON, we're turning it into a Map which > is causing the first value to be overridden. > > Repro (Python, 2.4): > >>> jsonRDD = spark.sparkContext.parallelize(["\\{ \"a\": \"blah\", \"a\": > >>> \"blah2\"}"]) > >>> df = spark.read.json(jsonRDD) > >>> df.show() > +-++ > |a|a| > +-++ > |null|blah2| > +-++ > > The expected response would be: > +-++ > |a|a| > +-++ > |blah|blah2| > +-++ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28043) Reading json with duplicate columns drops the first column value
Mukul Murthy created SPARK-28043: Summary: Reading json with duplicate columns drops the first column value Key: SPARK-28043 URL: https://issues.apache.org/jira/browse/SPARK-28043 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Mukul Murthy When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. JSON recommends unique names but does not require it; since JSON and Spark SQL both allow duplicate field names, we should fix the bug where the first column value is getting dropped. Repro (Python, 2.4):
>>> jsonRDD = spark.sparkContext.parallelize(["{ \"a\": \"blah\", \"a\": \"blah2\"}"])
>>> df = spark.read.json(jsonRDD)
>>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+
The expected response would be:
+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
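The "turning it into a Map" guess in the description is consistent with how map-like structures behave: inserting a duplicate key silently keeps only the last value. A tiny standalone Scala illustration of that behaviour (this is not the actual JSON parser code):
{code:scala}
object DuplicateKeySketch extends App {
  // A field list with a duplicated name, analogous to {"a": "blah", "a": "blah2"}.
  val fields = Seq("a" -> "blah", "a" -> "blah2")

  // Collapsing to a Map keeps only the last value for "a"...
  println(fields.toMap)   // Map(a -> blah2)

  // ...while the raw sequence still carries both values.
  println(fields)         // List((a,blah), (a,blah2))
}
{code}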
[jira] [Created] (SPARK-28042) Support mapping spark.local.dir to hostPath volume
Dongjoon Hyun created SPARK-28042: - Summary: Support mapping spark.local.dir to hostPath volume Key: SPARK-28042 URL: https://issues.apache.org/jira/browse/SPARK-28042 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.0 Reporter: Junjie Chen Currently, the Kubernetes executor builder mounts spark.local.dir as an emptyDir or in-memory volume. That is enough for small workloads, but for heavy workloads such as TPC-DS both options can cause problems: pods are evicted due to disk pressure when using emptyDir, and executors hit OOM when using tmpfs. In particular, in cloud environments users may allocate a cluster with a minimal configuration and attach cloud storage only when running the workload. In that case, being able to specify multiple elastic storage volumes as spark.local.dir would accelerate spilling. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
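As a rough sketch of the direction, Spark on Kubernetes already lets you declare executor volumes through spark.kubernetes.executor.volumes.* configuration, so one way to get at this is to back spark.local.dir with such a volume. The volume name, host path, and mount point below are placeholders, and in practice these settings would usually be passed with spark-submit --conf rather than set in code.
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch (placeholder names and paths): mount a hostPath volume into executor pods
// and point spark.local.dir at the mount, so shuffle/spill data lands on that disk
// instead of an emptyDir or tmpfs.
val spark = SparkSession.builder()
  .appName("local-dir-hostpath-sketch")
  .config("spark.kubernetes.executor.volumes.hostPath.spark-local-1.mount.path", "/data/spark-local")
  .config("spark.kubernetes.executor.volumes.hostPath.spark-local-1.options.path", "/mnt/fast-disk")
  .config("spark.local.dir", "/data/spark-local")
  .getOrCreate()
{code}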
[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863381#comment-16863381 ] Parth Chandra commented on SPARK-27100: --- Opened a PR with a fix and a test to reproduce the issue. https://github.com/apache/spark/pull/24865. Thanks to [~dbtsai] [~dongjoon] for offline help with this one. > dag-scheduler-event-loop" java.lang.StackOverflowError > -- > > Key: SPARK-27100 > URL: https://issues.apache.org/jira/browse/SPARK-27100 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.3, 2.3.3 >Reporter: KaiXu >Priority: Major > Attachments: SPARK-27100-Overflow.txt, stderr > > > ALS in Spark MLlib causes StackOverflow: > /opt/sparkml/spark213/bin/spark-submit --properties-file > /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class > com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client > --num-executors 3 --executor-memory 322g > /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 > --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 > --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input > > Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spar
[jira] [Assigned] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27100: Assignee: Apache Spark > dag-scheduler-event-loop" java.lang.StackOverflowError > -- > > Key: SPARK-27100 > URL: https://issues.apache.org/jira/browse/SPARK-27100 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.3, 2.3.3 >Reporter: KaiXu >Assignee: Apache Spark >Priority: Major > Attachments: SPARK-27100-Overflow.txt, stderr > > > ALS in Spark MLlib causes StackOverflow: > /opt/sparkml/spark213/bin/spark-submit --properties-file > /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class > com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client > --num-executors 3 --executor-memory 322g > /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 > --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 > --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input > > Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27100: Assignee: (was: Apache Spark) > dag-scheduler-event-loop" java.lang.StackOverflowError > -- > > Key: SPARK-27100 > URL: https://issues.apache.org/jira/browse/SPARK-27100 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.3, 2.3.3 >Reporter: KaiXu >Priority: Major > Attachments: SPARK-27100-Overflow.txt, stderr > > > ALS in Spark MLlib causes StackOverflow: > /opt/sparkml/spark213/bin/spark-submit --properties-file > /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class > com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client > --num-executors 3 --executor-memory 322g > /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 > --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 > --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input > > Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Parth Chandra updated SPARK-27100: -- Attachment: SPARK-27100-Overflow.txt > dag-scheduler-event-loop" java.lang.StackOverflowError > -- > > Key: SPARK-27100 > URL: https://issues.apache.org/jira/browse/SPARK-27100 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.3, 2.3.3 >Reporter: KaiXu >Priority: Major > Attachments: SPARK-27100-Overflow.txt, stderr > > > ALS in Spark MLlib causes StackOverflow: > /opt/sparkml/spark213/bin/spark-submit --properties-file > /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class > com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client > --num-executors 3 --executor-memory 322g > /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 > --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 > --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input > > Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468) > at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863375#comment-16863375 ] Parth Chandra commented on SPARK-27100: --- The stack overflow is due to serialization of a ShuffleMapTask (see attached file with complete stack) [^SPARK-27100-Overflow.txt] . ShuffleMapTask.partition is a FilePartition and FilePartition.files is a Stream which is essentially a linked list. It is therefore serialized recursively. If the number of files in each partition is, say, 1 files, recursing into a linked list of length 1 causes a stack overflow. This is a general problem with serialization of Scala streams (and other collections that are lazily initialized) that is fixed in 2.13 (https://github.com/scala/scala/pull/6676). The problem is only in Bucketed partitions. The corresponding implementation for non Bucketed partitions uses a StreamBuffer. Partial expansion of ShuffleMapTask just before the stack overflow - {code:java} obj = \{ShuffleMapTask@16639} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.scheduler.ShuffleMapTask.toString() taskBinary = \{TorrentBroadcast@17216} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.broadcast.TorrentBroadcast.toString() partition = \{FilePartition@17217} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.sql.execution.datasources.FilePartition.toString() index = 0 files = \{Stream$Cons@17244} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString() hd = \{PartitionedFile@17246} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.sql.execution.datasources.PartitionedFile.toString() partitionValues = \{GenericInternalRow@17259} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.sql.catalyst.expressions.GenericInternalRow.toString() filePath = "hdfs://path/a.db/master/version=1/rangeid=000/part-00039-295e3ac1-760c-482e-8640-5e5d1539c2c9_0.c000.gz.parquet" start = 0 length = 225781388 locations = \{String[3]@17261} tlVal = \{Stream$Cons@16687} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString() hd = \{PartitionedFile@17249} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.sql.execution.datasources.PartitionedFile.toString() partitionValues = \{GenericInternalRow@17255} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.sql.catalyst.expressions.GenericInternalRow.toString() filePath = "hdfs://path/a.db/master/version=1/rangeid=001/part-00061-0346437e-7f8f-44ac-8739-94d1ee285c0b_0.c000.gz.parquet" start = 0 length = 431239612 locations = \{String[3]@17257} tlVal = \{Stream$Cons@16812} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString() hd = \{PartitionedFile@17264} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.sql.execution.datasources.PartitionedFile.toString() partitionValues = \{GenericInternalRow@17268} Method threw 'java.lang.StackOverflowError' exception. 
Cannot evaluate org.apache.spark.sql.catalyst.expressions.GenericInternalRow.toString() filePath = "hdfs://path/a.db/master/version=1/rangeid=002/part-00058-b3a99b18-140e-43ed-838e-276eaa45a5f3_0.c000.gz.parquet" start = 0 length = 219930113 locations = \{String[3]@17270} tlVal = \{Stream$Cons@17265} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString() hd = \{PartitionedFile@17273} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.sql.execution.datasources.PartitionedFile.toString() partitionValues = \{GenericInternalRow@17277} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate org.apache.spark.sql.catalyst.expressions.GenericInternalRow.toString() filePath = "hdfs://path/a.db/master/version=1/rangeid=003/part-00051-58be3faa-0611-49de-8546-1656b6086934_0.c000.gz.parquet" start = 0 length = 219503110 locations = \{String[3]@17279} tlVal = \{Stream$Cons@17274} Method threw 'java.lang.StackOverflowError' exception. Cannot evaluate scala.collection.immutable.Stream$Cons.toString() tlGen = null tlGen = null tlGen = null tlGen = null {code} > dag-scheduler-event-loop" java.lang.StackOverflowError > -- > > Key: SPARK-27100 > URL: https://issues.apache.org/jira/browse/SPARK-27100 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.3,
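The failure mode described above can be reproduced outside Spark: Java serialization of a long Scala 2.12 Stream recurses once per cons cell, while a strict, flat collection such as an Array is written without deep recursion. A minimal standalone sketch (the element count is arbitrary, just large enough to exhaust a default-sized stack):
{code:scala}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object StreamSerializationSketch {
  private def serialize(obj: AnyRef): Unit = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(obj) finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000

    // Even a fully materialized Stream is a chain of cons cells, so default Java
    // serialization walks it recursively, a few stack frames per element.
    val asStream = Stream.range(0, n).force
    try serialize(asStream)
    catch { case _: StackOverflowError => println("Stream: StackOverflowError") }

    // A strict, flat collection serializes iteratively and is fine.
    serialize(Array.range(0, n))
    println("Array: ok")
  }
}
{code}
This matches the reading of the explanation above that the fix is to carry the partition's file list in a strict collection rather than a lazy Stream before it is serialized for the task.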
[jira] [Resolved] (SPARK-27578) Support INTERVAL ... HOUR TO SECOND syntax
[ https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27578. --- Resolution: Fixed Assignee: Zhu, Lipeng Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24472 > Support INTERVAL ... HOUR TO SECOND syntax > -- > > Key: SPARK-27578 > URL: https://issues.apache.org/jira/browse/SPARK-27578 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Assignee: Zhu, Lipeng >Priority: Major > Fix For: 3.0.0 > > > Currently, SparkSQL can support interval format like this. > > {code:java} > select interval '5 23:59:59.155' day to second.{code} > > Can SparkSQL support grammar like below, as Presto/Teradata can support it > well now. > {code:java} > select interval '23:59:59.155' hour to second > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28041) Increase the minimum pandas version to 0.23.2
Bryan Cutler created SPARK-28041: Summary: Increase the minimum pandas version to 0.23.2 Key: SPARK-28041 URL: https://issues.apache.org/jira/browse/SPARK-28041 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.0.0 Reporter: Bryan Cutler Currently, the minimum supported Pandas version is 0.19.2. We bumped up testing in the Jenkins env to 0.23.2 and since 0.19.2 was released nearly 3 years ago, it is not always compatible with other Python libraries. Increasing the version to 0.23.2 will also allow some workarounds to be removed and make maintenance easier. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14864) [MLLIB] Implement Doc2Vec
[ https://issues.apache.org/jira/browse/SPARK-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863246#comment-16863246 ] Ayush Singh commented on SPARK-14864: - [~michelle] any reason why this issue has been marked resolved? I can work on it. > [MLLIB] Implement Doc2Vec > - > > Key: SPARK-14864 > URL: https://issues.apache.org/jira/browse/SPARK-14864 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Peter Mountanos >Priority: Minor > Labels: bulk-closed > > It would be useful to implement Doc2Vec, as described in the paper > [Distributed Representations of Sentences and > Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has > an implementation [Deep learning with > paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. > Le & Mikolov show that when aggregating Word2Vec vector representations for a > paragraph/document, it does not perform well for prediction tasks. Instead, > they propose the Paragraph Vector implementation, which provides > state-of-the-art results on several text classification and sentiment > analysis tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28040) sql() fails to process output of glue::glue_data()
Michael Chirico created SPARK-28040: --- Summary: sql() fails to process output of glue::glue_data() Key: SPARK-28040 URL: https://issues.apache.org/jira/browse/SPARK-28040 Project: Spark Issue Type: Bug Components: R Affects Versions: 2.4.3 Reporter: Michael Chirico The {{glue}} package is a natural way to send parameterized queries to Spark from R, very similar to Python's {{format}} for strings. The error is reproduced by as little as {code:java} library(glue) library(SparkR) sparkR.session() query = glue_data(list(val = 4), 'select {val}') sql(query){code} which fails with "Error in writeType(con, serdeType) : Unsupported type for serialization glue". {{sql(as.character(query))}} works as expected, but this workaround is a bit awkward / post-hoc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
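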
[jira] [Comment Edited] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863186#comment-16863186 ] Li Jin edited comment on SPARK-28006 at 6/13/19 3:36 PM: - Hi [~viirya] good questions! >> Can we use pandas agg udfs as window function? pandas agg udfs as window function is supported. With both unbounded and bounded window. >> Because the proposed GROUPED_XFORM udf calculates output values for all rows >> in the group, looks like the proposed GROUPED_XFORM udf can only use window >> frame (UnboundedPreceding, UnboundedFollowing) This is correct. It is really using unbounded window as groups here (because there is no groupby transform API in Spark sql). was (Author: icexelloss): Hi [~viirya] good questions: >> Can we use pandas agg udfs as window function? pandas agg udfs as window function is supported. With both unbounded and bounded window. >> Because the proposed GROUPED_XFORM udf calculates output values for all rows >> in the group, looks like the proposed GROUPED_XFORM udf can only use window >> frame (UnboundedPreceding, UnboundedFollowing) This is correct. It is really using unbounded window as groups here (because there is no groupby transform API in Spark sql). > User-defined grouped transform pandas_udf for window operations > --- > > Key: SPARK-28006 > URL: https://issues.apache.org/jira/browse/SPARK-28006 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Li Jin >Priority: Major > > Currently, pandas_udf supports "grouped aggregate" type that can be used with > unbounded and unbounded windows. There is another set of use cases that can > benefit from a "grouped transform" type pandas_udf. > Grouped transform is defined as a N -> N mapping over a group. For example, > "compute zscore for values in the group using the grouped mean and grouped > stdev", or "rank the values in the group". > Currently, in order to do this, user needs to use "grouped apply", for > example: > {code:java} > @pandas_udf(schema, GROUPED_MAP) > def subtract_mean(pdf) > v = pdf['v'] > pdf['v'] = v - v.mean() > return pdf > df.groupby('id').apply(subtract_mean) > # +---++ > # | id| v| > # +---++ > # | 1|-0.5| > # | 1| 0.5| > # | 2|-3.0| > # | 2|-1.0| > # | 2| 4.0| > # +---++{code} > This approach has a few downside: > * Specifying the full return schema is complicated for the user although the > function only changes one column. > * The column name 'v' inside as part of the udf, makes the udf less reusable. > * The entire dataframe is serialized to pass to Python although only one > column is needed. > Here we propose a new type of pandas_udf to work with these types of use > cases: > {code:java} > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf('double', GROUPED_XFORM) > def subtract_mean(v): > return v - v.mean() > w = Window.partitionBy('id') > df = df.withColumn('v', subtract_mean(df['v']).over(w)) > # +---++ > # | id| v| > # +---++ > # | 1|-0.5| > # | 1| 0.5| > # | 2|-3.0| > # | 2|-1.0| > # | 2| 4.0| > # +---++{code} > Which addresses the above downsides. > * The user only needs to specify the output type of a single column. 
> * The column being zscored is decoupled from the udf implementation > * We only need to send one column to Python worker and concat the result > with the original dataframe (this is what grouped aggregate is doing already) > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863186#comment-16863186 ] Li Jin commented on SPARK-28006: Hi [~viirya] good questions: >> Can we use pandas agg udfs as window function? pandas agg udfs as window function is supported. With both unbounded and bounded window. >> Because the proposed GROUPED_XFORM udf calculates output values for all rows >> in the group, looks like the proposed GROUPED_XFORM udf can only use window >> frame (UnboundedPreceding, UnboundedFollowing) This is correct. It is really using unbounded window as groups here (because there is no groupby transform API in Spark sql). > User-defined grouped transform pandas_udf for window operations > --- > > Key: SPARK-28006 > URL: https://issues.apache.org/jira/browse/SPARK-28006 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Li Jin >Priority: Major > > Currently, pandas_udf supports "grouped aggregate" type that can be used with > unbounded and unbounded windows. There is another set of use cases that can > benefit from a "grouped transform" type pandas_udf. > Grouped transform is defined as a N -> N mapping over a group. For example, > "compute zscore for values in the group using the grouped mean and grouped > stdev", or "rank the values in the group". > Currently, in order to do this, user needs to use "grouped apply", for > example: > {code:java} > @pandas_udf(schema, GROUPED_MAP) > def subtract_mean(pdf) > v = pdf['v'] > pdf['v'] = v - v.mean() > return pdf > df.groupby('id').apply(subtract_mean) > # +---++ > # | id| v| > # +---++ > # | 1|-0.5| > # | 1| 0.5| > # | 2|-3.0| > # | 2|-1.0| > # | 2| 4.0| > # +---++{code} > This approach has a few downside: > * Specifying the full return schema is complicated for the user although the > function only changes one column. > * The column name 'v' inside as part of the udf, makes the udf less reusable. > * The entire dataframe is serialized to pass to Python although only one > column is needed. > Here we propose a new type of pandas_udf to work with these types of use > cases: > {code:java} > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf('double', GROUPED_XFORM) > def subtract_mean(v): > return v - v.mean() > w = Window.partitionBy('id') > df = df.withColumn('v', subtract_mean(df['v']).over(w)) > # +---++ > # | id| v| > # +---++ > # | 1|-0.5| > # | 1| 0.5| > # | 2|-3.0| > # | 2|-1.0| > # | 2| 4.0| > # +---++{code} > Which addresses the above downsides. > * The user only needs to specify the output type of a single column. > * The column being zscored is decoupled from the udf implementation > * We only need to send one column to Python worker and concat the result > with the original dataframe (this is what grouped aggregate is doing already) > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28006) User-defined grouped transform pandas_udf for window operations
[ https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863180#comment-16863180 ] Liang-Chi Hsieh commented on SPARK-28006: - I'm curious about two questions: Can we use pandas agg udfs as window function? Because the proposed GROUPED_XFORM udf calculates output values for all rows in the group, looks like the proposed GROUPED_XFORM udf can only use window frame (UnboundedPreceding, UnboundedFollowing)? > User-defined grouped transform pandas_udf for window operations > --- > > Key: SPARK-28006 > URL: https://issues.apache.org/jira/browse/SPARK-28006 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Li Jin >Priority: Major > > Currently, pandas_udf supports "grouped aggregate" type that can be used with > unbounded and unbounded windows. There is another set of use cases that can > benefit from a "grouped transform" type pandas_udf. > Grouped transform is defined as a N -> N mapping over a group. For example, > "compute zscore for values in the group using the grouped mean and grouped > stdev", or "rank the values in the group". > Currently, in order to do this, user needs to use "grouped apply", for > example: > {code:java} > @pandas_udf(schema, GROUPED_MAP) > def subtract_mean(pdf) > v = pdf['v'] > pdf['v'] = v - v.mean() > return pdf > df.groupby('id').apply(subtract_mean) > # +---++ > # | id| v| > # +---++ > # | 1|-0.5| > # | 1| 0.5| > # | 2|-3.0| > # | 2|-1.0| > # | 2| 4.0| > # +---++{code} > This approach has a few downside: > * Specifying the full return schema is complicated for the user although the > function only changes one column. > * The column name 'v' inside as part of the udf, makes the udf less reusable. > * The entire dataframe is serialized to pass to Python although only one > column is needed. > Here we propose a new type of pandas_udf to work with these types of use > cases: > {code:java} > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf('double', GROUPED_XFORM) > def subtract_mean(v): > return v - v.mean() > w = Window.partitionBy('id') > df = df.withColumn('v', subtract_mean(df['v']).over(w)) > # +---++ > # | id| v| > # +---++ > # | 1|-0.5| > # | 1| 0.5| > # | 2|-3.0| > # | 2|-1.0| > # | 2| 4.0| > # +---++{code} > Which addresses the above downsides. > * The user only needs to specify the output type of a single column. > * The column being zscored is decoupled from the udf implementation > * We only need to send one column to Python worker and concat the result > with the original dataframe (this is what grouped aggregate is doing already) > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863177#comment-16863177 ] Li Jin commented on SPARK-27463: Yeah I think the exact spelling of the API can go either way. I think the current options are pretty close that we don't need to commit to either one at this moment and they shouldn't affect the implementation too much. [~d80tb7] [~hyukjin.kwon] How about we start working towards a PR that implements one of the proposed APIs and go from there? > Support Dataframe Cogroup via Pandas UDFs > -- > > Key: SPARK-27463 > URL: https://issues.apache.org/jira/browse/SPARK-27463 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Chris Martin >Priority: Major > > Recent work on Pandas UDFs in Spark, has allowed for improved > interoperability between Pandas and Spark. This proposal aims to extend this > by introducing a new Pandas UDF type which would allow for a cogroup > operation to be applied to two PySpark DataFrames. > Full details are in the google document linked below. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28039) Add float4.sql
Yuming Wang created SPARK-28039: --- Summary: Add float4.sql Key: SPARK-28039 URL: https://issues.apache.org/jira/browse/SPARK-28039 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang In this ticket, we plan to add the regression test cases of https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/float4.sql. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16692) multilabel classification to DataFrame, ML
[ https://issues.apache.org/jira/browse/SPARK-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-16692: - Assignee: zhengruifeng > multilabel classification to DataFrame, ML > --- > > Key: SPARK-16692 > URL: https://issues.apache.org/jira/browse/SPARK-16692 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weizhi Li >Assignee: zhengruifeng >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > For the multi labels evaluations. There is a method to in MLlib named > MultilabelMetrics: A multilabel classification problem involves mapping each > sample in a dataset to a set of class labels. In this type of classification > problem, the labels are not mutually exclusive. For example, when classifying > a set of news articles into topics, a single article might be both science > and politics. > Added this method to support DataFrame in ML. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16692) multilabel classification to DataFrame, ML
[ https://issues.apache.org/jira/browse/SPARK-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16692. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24777 [https://github.com/apache/spark/pull/24777] > multilabel classification to DataFrame, ML > --- > > Key: SPARK-16692 > URL: https://issues.apache.org/jira/browse/SPARK-16692 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weizhi Li >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > For the multi labels evaluations. There is a method to in MLlib named > MultilabelMetrics: A multilabel classification problem involves mapping each > sample in a dataset to a set of class labels. In this type of classification > problem, the labels are not mutually exclusive. For example, when classifying > a set of news articles into topics, a single article might be both science > and politics. > Added this method to support DataFrame in ML. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
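For context, the RDD-based metrics class mentioned in the description already exists as org.apache.spark.mllib.evaluation.MultilabelMetrics; this change brings the same evaluations to the DataFrame-based ML API. A small sketch of the existing RDD API (the toy label sets are made up):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.evaluation.MultilabelMetrics

object MultilabelMetricsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("multilabel-sketch").getOrCreate()

    // Each element is (predicted label set, true label set); labels are not mutually exclusive.
    val predictionAndLabels = spark.sparkContext.parallelize(Seq(
      (Array(0.0, 1.0), Array(0.0, 2.0)),
      (Array(0.0, 2.0), Array(0.0, 1.0)),
      (Array.empty[Double], Array(0.0))
    ))

    val metrics = new MultilabelMetrics(predictionAndLabels)
    println(s"F1 measure:   ${metrics.f1Measure}")
    println(s"Precision:    ${metrics.precision}")
    println(s"Recall:       ${metrics.recall}")
    println(s"Hamming loss: ${metrics.hammingLoss}")

    spark.stop()
  }
}
{code}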
[jira] [Updated] (SPARK-28033) String concatenation should low priority than other operators
[ https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28033: Summary: String concatenation should low priority than other operators (was: String concatenation low priority than other operators) > String concatenation should low priority than other operators > - > > Key: SPARK-28033 > URL: https://issues.apache.org/jira/browse/SPARK-28033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> explain select 'four: ' || 2 + 2; > == Physical Plan == > *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + > CAST(2 AS DOUBLE))#2] > +- Scan OneRowRelation[] > spark-sql> select 'four: ' || 2 + 2; > NULL > {code} > Hive: > {code:sql} > hive> select 'four: ' || 2 + 2; > OK > four: 4 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28038) Add text.sql
[ https://issues.apache.org/jira/browse/SPARK-28038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28038: Assignee: (was: Apache Spark) > Add text.sql > > > Key: SPARK-28038 > URL: https://issues.apache.org/jira/browse/SPARK-28038 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/text.sql]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28038) Add text.sql
[ https://issues.apache.org/jira/browse/SPARK-28038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28038: Assignee: Apache Spark > Add text.sql > > > Key: SPARK-28038 > URL: https://issues.apache.org/jira/browse/SPARK-28038 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/text.sql]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863030#comment-16863030 ] Liang-Chi Hsieh commented on SPARK-27966: - I can't see where input_file_name is, from the truncated output. If it is just in the Project, I can't tell why it doesn't work. If there is no good reproducer, I agree with [~hyukjin.kwon] that we may resolve this JIRA. > input_file_name empty when listing files in parallel > > > Key: SPARK-27966 > URL: https://issues.apache.org/jira/browse/SPARK-27966 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 > Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11) > > Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 > Workers: 3 > Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 >Reporter: Christian Homberg >Priority: Minor > Attachments: input_file_name_bug > > > I ran into an issue similar and probably related to SPARK-26128. The > _org.apache.spark.sql.functions.input_file_name_ is sometimes empty. > > {code:java} > df.select(input_file_name()).show(5,false) > {code} > > {code:java} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > +-+ > {code} > My environment is databricks and debugging the Log4j output showed me that > the issue occurred when the files are being listed in parallel, e.g. when > {code:java} > 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 127; threshold: 32 > 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories > in parallel under:{code} > > Everything's fine as long as > {code:java} > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 6; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > {code} > > Setting spark.sql.sources.parallelPartitionDiscovery.threshold to > resolves the issue for me. > > *edit: the problem is not exclusively linked to listing files in parallel. > I've setup a larger cluster for which after parallel file listing the > input_file_name did return the correct filename. After inspecting the log4j > again, I assume that it's linked to some kind of MetaStore being full. I've > attached a section of the log4j output that I think should indicate why it's > failing. If you need more, please let me know.* > ** > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28038) Add text.sql
Yuming Wang created SPARK-28038: --- Summary: Add text.sql Key: SPARK-28038 URL: https://issues.apache.org/jira/browse/SPARK-28038 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang In this ticket, we plan to add the regression test cases of [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/text.sql]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28037) Add built-in String Functions: quote_literal
Yuming Wang created SPARK-28037: --- Summary: Add built-in String Functions: quote_literal Key: SPARK-28037 URL: https://issues.apache.org/jira/browse/SPARK-28037 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang ||Function||Return Type||Description||Example||Result|| |{{quote_literal(_{{string}}_ }}{{text}}{{)}}|{{text}}|Return the given string suitably quoted to be used as a string literal in an SQL statement string. Embedded single-quotes and backslashes are properly doubled. Note that {{quote_literal}} returns null on null input; if the argument might be null, {{quote_nullable}} is often more suitable. See also [Example 43.1|https://www.postgresql.org/docs/11/plpgsql-statements.html#PLPGSQL-QUOTE-LITERAL-EXAMPLE].|{{quote_literal(E'O\'Reilly')}}|{{'O''Reilly'}}| |{{quote_literal(_{{value}}_ }}{{anyelement}}{{)}}|{{text}}|Coerce the given value to text and then quote it as a literal. Embedded single-quotes and backslashes are properly doubled.|{{quote_literal(42.5)}}|{{'42.5'}}| https://www.postgresql.org/docs/11/functions-string.html https://docs.aws.amazon.com/redshift/latest/dg/r_QUOTE_LITERAL.html https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/QUOTE_LITERAL.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CString%20Functions%7C_38 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
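A rough sketch of the quoting rule described above, assuming only the documented behaviour (wrap the value in single quotes, double embedded single quotes and backslashes, return null on null input); this is an illustration of the semantics, not the proposed Spark implementation:
{code:scala}
// Illustrative only: mirrors the documented quote_literal semantics for strings.
def quoteLiteral(s: String): String =
  if (s == null) null
  else "'" + s.replace("\\", "\\\\").replace("'", "''") + "'"

quoteLiteral("O'Reilly") // 'O''Reilly'
quoteLiteral("42.5")     // '42.5'
{code}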
[jira] [Issue Comment Deleted] (SPARK-28033) String concatenation low priority than other operators
[ https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28033: Comment: was deleted (was: I'm working on.) > String concatenation low priority than other operators > -- > > Key: SPARK-28033 > URL: https://issues.apache.org/jira/browse/SPARK-28033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> explain select 'four: ' || 2 + 2; > == Physical Plan == > *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + > CAST(2 AS DOUBLE))#2] > +- Scan OneRowRelation[] > spark-sql> select 'four: ' || 2 + 2; > NULL > {code} > Hive: > {code:sql} > hive> select 'four: ' || 2 + 2; > OK > four: 4 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28033) String concatenation low priority than other operators
[ https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28033: Summary: String concatenation low priority than other operators (was: String concatenation low priority than other arithmeticBinary) > String concatenation low priority than other operators > -- > > Key: SPARK-28033 > URL: https://issues.apache.org/jira/browse/SPARK-28033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> explain select 'four: ' || 2 + 2; > == Physical Plan == > *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + > CAST(2 AS DOUBLE))#2] > +- Scan OneRowRelation[] > spark-sql> select 'four: ' || 2 + 2; > NULL > {code} > Hive: > {code:sql} > hive> select 'four: ' || 2 + 2; > OK > four: 4 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
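Until the precedence is aligned, a hedged workaround is to make the grouping explicit with parentheses, which should reproduce the Hive result shown above (run in spark-shell, where `spark` is the active SparkSession):
{code:scala}
// Force the addition to happen before the concatenation instead of relying on
// operator precedence; expected output is "four: 4".
spark.sql("select 'four: ' || (2 + 2)").show()
{code}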
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862908#comment-16862908 ] Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:53 AM: --- I just found out that the following config options would suffice to avoid creating crcs for the given case: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So no need to make a PR to disable crc emission if user wants to for this case. I could make one to cover all cases to be able to enable/disable crcs if needed. However, the filesystem hierarchy in hadoop is a bit inconsistent. So the LocalFileSystem will use a a FilterFileSystem that has the flag setWriteChecksum but the DistributedFileSystem does not have it and it is controlled by property [https://github.com/apache/hadoop/blob/533138718cc05b78e0afe583d7a9bd30e8a48fdc/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java#L620] `dfs.checksum.type` Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. Looking at the old PR, I am not sure it would work with the LocalFs as it seems that the crc file will be renamed by default as the underlying system supports checksums by default. was (Author: skonto): I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So need to make a PR to disable crc emission if user wants to. Unfortunately the filesystem hierarchy in hadoop is a bit inconsistent. So the LocalFileSystem will use a a FilterFileSystem that has the flag setWriteChecksum but the DistributedFileSystem does not have it and it is controlled by property [https://github.com/apache/hadoop/blob/533138718cc05b78e0afe583d7a9bd30e8a48fdc/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java#L620] `dfs.checksum.type`. Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default as the underlying system supports checksums by default. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Priority: Major > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. 
> Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
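For reference, a minimal sketch of applying the two settings quoted in the comment above through the SparkSession builder rather than spark-submit --conf flags (the application name is a placeholder):
{code:scala}
import org.apache.spark.sql.SparkSession

// Same two settings as in the comment above, passed as Spark configuration;
// the spark.hadoop. prefix forwards them into the Hadoop configuration.
val spark = SparkSession.builder()
  .appName("streaming-without-crc-files") // placeholder name
  .config("spark.hadoop.spark.sql.streaming.checkpointFileManagerClass",
    "org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager")
  .config("spark.hadoop.fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem")
  .getOrCreate()
{code}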
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862908#comment-16862908 ] Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:51 AM: --- I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So need to make a PR to disable crc emission if user wants to. Unfortunately the filesystem hierarchy in hadoop is a bit inconsistent. So the LocalFileSystem will use a a FilterFileSystem that has the flag setWriteChecksum but the DistributedFileSystem does not have it and it is controlled by property [https://github.com/apache/hadoop/blob/533138718cc05b78e0afe583d7a9bd30e8a48fdc/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/client/impl/TestBlockReaderLocal.java#L620] `dfs.checksum.type`. Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default as the underlying system supports checksums by default. was (Author: skonto): I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So need to make a PR to disable crc emission if user wants to in this case but could make one for the generic case if the filesystem supports the flag mentioned above. Will do that. Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default as the underlying system supports checksums by default. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Priority: Major > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. 
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28036) Built-in udf left/right has inconsistent behavior
Yuming Wang created SPARK-28036: --- Summary: Built-in udf left/right has inconsistent behavior Key: SPARK-28036 URL: https://issues.apache.org/jira/browse/SPARK-28036 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang PostgreSQL: {code:sql} postgres=# select left('ahoj', -2), right('ahoj', -2); left | right --+--- ah | oj (1 row) {code} Spark SQL: {code:sql} spark-sql> select left('ahoj', -2), right('ahoj', -2); spark-sql> {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
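A hedged illustration of how the PostgreSQL behaviour above can be emulated today with substring arithmetic; this only sketches the expected semantics, it is not a proposed fix (run in spark-shell, where `spark` is the active SparkSession):
{code:scala}
// PostgreSQL: left(s, -n) drops the last n characters, right(s, -n) drops the first n.
spark.sql("select substring('ahoj', 1, length('ahoj') - 2)").show() // ah
spark.sql("select substring('ahoj', 2 + 1)").show()                 // oj
{code}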
[jira] [Commented] (SPARK-28035) Test JoinSuite."equi-join is hash-join" is incompatible with its title.
[ https://issues.apache.org/jira/browse/SPARK-28035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862928#comment-16862928 ] Jiatao Tao commented on SPARK-28035: the test "multiple-key equi-join is hash-join" has the same issue. > Test JoinSuite."equi-join is hash-join" is incompatible with its title. > --- > > Key: SPARK-28035 > URL: https://issues.apache.org/jira/browse/SPARK-28035 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.3 >Reporter: Jiatao Tao >Priority: Trivial > Attachments: image-2019-06-13-10-32-06-759.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862908#comment-16862908 ] Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:38 AM: --- I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So need to make a PR to disable crc emission if user wants to in this case but could make one for the generic case if the filesystem supports the flag mentioned above. Will do that. Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default as the underlying system supports checksums by default. was (Author: skonto): I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So need to make a PR to disable crc emission if user wants to in this case but could make one for the generic case if the filesystem supports the flag mentioned above . Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Priority: Major > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28035) Test JoinSuite."equi-join is hash-join" is incompatible with its title.
[ https://issues.apache.org/jira/browse/SPARK-28035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862927#comment-16862927 ] Jiatao Tao commented on SPARK-28035: The title says hash-join, but when I debugged it I found it is actually a merge join. I also cannot see what this test is meant to verify: why does it only assert the result size? If this is a problem, I would like to modify the test; I would be glad to hear your opinion, thanks. !image-2019-06-13-10-32-06-759.png! > Test JoinSuite."equi-join is hash-join" is incompatible with its title. > --- > > Key: SPARK-28035 > URL: https://issues.apache.org/jira/browse/SPARK-28035 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.3 >Reporter: Jiatao Tao >Priority: Trivial > Attachments: image-2019-06-13-10-32-06-759.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
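For context, a sketch of how a test could assert on the physical join operator itself instead of the result size; the data is hypothetical and the operator class names assume Spark 2.4:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, SortMergeJoinExec}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data; the point is to inspect the planned join operator.
val df1 = Seq((1, "a"), (2, "b")).toDF("key", "v1")
val df2 = Seq((1, "x"), (2, "y")).toDF("key", "v2")

val plan = df1.join(df2, "key").queryExecution.executedPlan
val hashJoins  = plan.collect { case j: BroadcastHashJoinExec => j }
val mergeJoins = plan.collect { case j: SortMergeJoinExec => j }
// A title like "equi-join is hash-join" would call for asserting hashJoins.nonEmpty
// rather than only checking the number of result rows.
assert(hashJoins.nonEmpty || mergeJoins.nonEmpty)
{code}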
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862908#comment-16862908 ] Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:36 AM: --- I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So need to make a PR to disable crc emission if user wants to in this case but could make one for the generic case if the filesystem supports the flag mentioned above . Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default. was (Author: skonto): I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So need to make a PR to disable crc emission if user wants to in this case but could make one for the generic case. Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Priority: Major > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862908#comment-16862908 ] Stavros Kontopoulos edited comment on SPARK-28025 at 6/13/19 10:35 AM: --- I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So need to make a PR to disable crc emission if user wants to in this case but could make one for the generic case. Btw there is a nice summary on the topic in chapter 5 of the book: hadoop the definitive guide. So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default. was (Author: skonto): I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Priority: Major > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28035) Test JoinSuite."equi-join is hash-join" is incompatible with its title.
[ https://issues.apache.org/jira/browse/SPARK-28035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiatao Tao updated SPARK-28035: --- Attachment: image-2019-06-13-10-32-06-759.png > Test JoinSuite."equi-join is hash-join" is incompatible with its title. > --- > > Key: SPARK-28035 > URL: https://issues.apache.org/jira/browse/SPARK-28035 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.4.3 >Reporter: Jiatao Tao >Priority: Trivial > Attachments: image-2019-06-13-10-32-06-759.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28035) Test JoinSuite."equi-join is hash-join" is incompatible with its title.
Jiatao Tao created SPARK-28035: -- Summary: Test JoinSuite."equi-join is hash-join" is incompatible with its title. Key: SPARK-28035 URL: https://issues.apache.org/jira/browse/SPARK-28035 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.4.3 Reporter: Jiatao Tao -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-27546) Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone
[ https://issues.apache.org/jira/browse/SPARK-27546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiatao Tao reopened SPARK-27546: Hi, I updated the comment, could someone take a look? > Should repalce DateTimeUtils#defaultTimeZoneuse with sessionLocalTimeZone > - > > Key: SPARK-27546 > URL: https://issues.apache.org/jira/browse/SPARK-27546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jiatao Tao >Priority: Minor > Attachments: image-2019-04-23-08-10-00-475.png, > image-2019-04-23-08-10-50-247.png > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files
[ https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862908#comment-16862908 ] Stavros Kontopoulos commented on SPARK-28025: - I just found out that the following config options would suffice to avoid creating crcs: --conf spark.hadoop.spark.sql.streaming.checkpointFileManagerClass=org.apache.spark.sql.execution.streaming.FileSystemBasedCheckpointFileManager --conf spark.hadoop.fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem So should we clean up things like in the old PR? I am not sure if the old PR would work with the LocalFs as it seems that the crc file will be renamed by default. > HDFSBackedStateStoreProvider should not leak .crc files > > > Key: SPARK-28025 > URL: https://issues.apache.org/jira/browse/SPARK-28025 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Spark 2.4.3 > Kubernetes 1.11(?) (OpenShift) > StateStore storage on a mounted PVC. Viewed as a local filesystem by the > `FileContextBasedCheckpointFileManager` : > {noformat} > scala> glusterfm.isLocal > res17: Boolean = true{noformat} >Reporter: Gerard Maas >Priority: Major > > The HDFSBackedStateStoreProvider when using the default CheckpointFileManager > is leaving '.crc' files behind. There's a .crc file created for each > `atomicFile` operation of the CheckpointFileManager. > Over time, the number of files becomes very large. It makes the state store > file system constantly increase in size and, in our case, deteriorates the > file system performance. > Here's a sample of one of our spark storage volumes after 2 days of execution > (4 stateful streaming jobs, each on a different sub-dir): > # > {noformat} > Total files in PVC (used for checkpoints and state store) > $find . | wc -l > 431796 > # .crc files > $find . -name "*.crc" | wc -l > 418053{noformat} > With each .crc file taking one storage block, the used storage runs into the > GBs of data. > These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, > shows serious performance deterioration with this large number of files: > {noformat} > DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27930) List all built-in UDFs have different names
[ https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27930: Description: This ticket list all built-in UDFs have different names: |PostgreSQL|Spark SQL|Note| |random|rand| | |format|format_string|Spark's {{format_string}} is based on the implementation of {{java.util.Formatter}}. Which makes some formats of PostgreSQL can not supported, such as: {{format_string('>>%-s<<', 'Hello')}}| was: This ticket list all built-in UDFs have different names: |PostgreSQL|Spark SQL|Note| |random|rand| | |format|format_string|Spark's {{format_string}} is based on the implementation of {{java.util.Formatter}}, which makes some formats of PostgreSQL not supported, such as: {{format_string('>>%-s<<', 'Hello')}}| > List all built-in UDFs have different names > --- > > Key: SPARK-27930 > URL: https://issues.apache.org/jira/browse/SPARK-27930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > This ticket list all built-in UDFs have different names: > |PostgreSQL|Spark SQL|Note| > |random|rand| | > |format|format_string|Spark's {{format_string}} is based on the > implementation of {{java.util.Formatter}}. > Which makes some formats of PostgreSQL can not supported, such as: > {{format_string('>>%-s<<', 'Hello')}}| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
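As a hedged illustration of the incompatibility noted above: java.util.Formatter accepts the '-' flag only together with an explicit width, so the width-qualified form works in Spark while the bare '%-s' form from PostgreSQL is expected to fail (run in spark-shell, where `spark` is the active SparkSession):
{code:scala}
// Works: '-' with an explicit width is valid for java.util.Formatter.
spark.sql("select format_string('>>%-10s<<', 'Hello')").show(false) // >>Hello     <<

// Expected to fail at runtime: '%-s' has no width, which java.util.Formatter rejects;
// this is the incompatibility the table above records.
// spark.sql("select format_string('>>%-s<<', 'Hello')").show(false)
{code}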
[jira] [Updated] (SPARK-27930) List all built-in UDFs have different names
[ https://issues.apache.org/jira/browse/SPARK-27930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27930: Description: This ticket list all built-in UDFs have different names: |PostgreSQL|Spark SQL|Note| |random|rand| | |format|format_string|Spark's {{format_string}} is based on the implementation of {{java.util.Formatter}}, which makes some formats of PostgreSQL not supported, such as: {{format_string('>>%-s<<', 'Hello')}}| was: This ticket list all built-in UDFs have different names: |PostgreSQL|Spark SQL| |random|rand| |format|format_string| > List all built-in UDFs have different names > --- > > Key: SPARK-27930 > URL: https://issues.apache.org/jira/browse/SPARK-27930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > This ticket list all built-in UDFs have different names: > |PostgreSQL|Spark SQL|Note| > |random|rand| | > |format|format_string|Spark's {{format_string}} is based on the > implementation of {{java.util.Formatter}}, which makes some formats of > PostgreSQL not supported, such as: {{format_string('>>%-s<<', 'Hello')}}| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28016) Spark hangs when an execution plan has many projections on nested structs
[ https://issues.apache.org/jira/browse/SPARK-28016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862898#comment-16862898 ] Ruslan Yushchenko commented on SPARK-28016: --- Attached a self-contained example ([^SparkApp1IssueSelfContained.scala]). Slightly simplified the example, only numeric fields are used now. Also, the number of transformations is now easily changeable by changing the upper bound of the `Range()`. This example also reproduces the issue in the current master. On master it does not produce a timeout error, just freezes. > Spark hangs when an execution plan has many projections on nested structs > - > > Key: SPARK-28016 > URL: https://issues.apache.org/jira/browse/SPARK-28016 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.4.3 > Environment: Tried in > * Spark 2.2.1, Spark 2.4.3 in local mode on Linux, MasOS and Windows > * Spark 2.4.3 / Yarn on a Linux cluster >Reporter: Ruslan Yushchenko >Priority: Major > Attachments: NestedOps.scala, SparkApp1Issue.scala, > SparkApp1IssueSelfContained.scala, SparkApp2Workaround.scala, > spark-app-nested.tgz > > > Spark applications freeze on execution plan optimization stage (Catalyst) > when a logical execution plan contains a lot of projections that operate on > nested struct fields. > 2 Spark Applications are attached. One demonstrates the issue, the other > demonstrates a workaround. Also, an archive is attached where these jobs are > packages as a Maven Project. > To reproduce the attached Spark App does the following: > * A small dataframe is created from a JSON example. > * A nested withColumn map transformation is used to apply a transformation > on a struct field and create a new struct field. The code for this > transformation is also attached. > * Once more than 11 such transformations are applied the Catalyst optimizer > freezes on optimizing the execution plan > {code:scala} > package za.co.absa.spark.app > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > object SparkApp1Issue { > // A sample data for a dataframe with nested structs > val sample = > """ > |{ > | "strings": { > |"simple": "Culpa repellat nesciunt accusantium", > |"all_random": "DESebo8d%fL9sX@AzVin", > |"whitespaces": "qbbl" > | }, > | "numerics": { > |"small_positive": 722, > |"small_negative": -660, > |"big_positive": 669223368251997, > |"big_negative": -161176863305841, > |"zero": 0 > | } > |} > """.stripMargin :: > """{ > | "strings": { > |"simple": "Accusamus quia vel deleniti", > |"all_random": "rY&n9UnVcD*KS]jPBpa[", > |"whitespaces": " t e t rp z p" > | }, > | "numerics": { > |"small_positive": 268, > |"small_negative": -134, > |"big_positive": 768990048149640, > |"big_negative": -684718954884696, > |"zero": 0 > | } > |} > |""".stripMargin :: > """{ > | "strings": { > |"simple": "Quia numquam deserunt delectus rem est", > |"all_random": "GmRdQlE4Avn1hSlVPAH", > |"whitespaces": " c sayv drf " > | }, > | "numerics": { > |"small_positive": 909, > |"small_negative": -363, > |"big_positive": 592517494751902, > |"big_negative": -703224505589638, > |"zero": 0 > | } > |} > |""".stripMargin :: Nil > /** > * This Spark Job demonstrates an issue of execution plan freezing when > there are a lot of projections > * involving nested structs in an execution plan. 
> * > * The example works as follows: > * - A small dataframe is created from a JSON example above > * - A nested withColumn map transformation is used to apply a > transformation on a struct field and create > * a new struct field. > * - Once more than 11 such transformations are applied the Catalyst > optimizer freezes on optimizing > * the execution plan > */ > def main(args: Array[String]): Unit = { > val sparkBuilder = SparkSession.builder().appName("Nested Projections > Issue") > val spark = sparkBuilder > .master("local[4]") > .getOrCreate() > import spark.implicits._ > import za.co.absa.spark.utils.NestedOps.DataSetWrapper > val df = spark.read.json(sample.toDS) > // Apply several uppercase and negation transformations > val dfOutput = df > .nestedWithColumnMap("strings.simple", "strings.uppercase1", c => > upp
[jira] [Assigned] (SPARK-28034) Add with.sql
[ https://issues.apache.org/jira/browse/SPARK-28034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28034: Assignee: (was: Apache Spark) > Add with.sql > > > Key: SPARK-28034 > URL: https://issues.apache.org/jira/browse/SPARK-28034 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Major > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/with.sql] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28034) Add with.sql
[ https://issues.apache.org/jira/browse/SPARK-28034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28034: Assignee: Apache Spark > Add with.sql > > > Key: SPARK-28034 > URL: https://issues.apache.org/jira/browse/SPARK-28034 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Assignee: Apache Spark >Priority: Major > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/with.sql] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28034) Add with.sql
Peter Toth created SPARK-28034: -- Summary: Add with.sql Key: SPARK-28034 URL: https://issues.apache.org/jira/browse/SPARK-28034 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth In this ticket, we plan to add the regression test cases of [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/with.sql] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28033) String concatenation low priority than other arithmeticBinary
[ https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28033: Assignee: (was: Apache Spark) > String concatenation low priority than other arithmeticBinary > - > > Key: SPARK-28033 > URL: https://issues.apache.org/jira/browse/SPARK-28033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> explain select 'four: ' || 2 + 2; > == Physical Plan == > *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + > CAST(2 AS DOUBLE))#2] > +- Scan OneRowRelation[] > spark-sql> select 'four: ' || 2 + 2; > NULL > {code} > Hive: > {code:sql} > hive> select 'four: ' || 2 + 2; > OK > four: 4 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28033) String concatenation low priority than other arithmeticBinary
[ https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28033: Assignee: Apache Spark > String concatenation low priority than other arithmeticBinary > - > > Key: SPARK-28033 > URL: https://issues.apache.org/jira/browse/SPARK-28033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> explain select 'four: ' || 2 + 2; > == Physical Plan == > *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + > CAST(2 AS DOUBLE))#2] > +- Scan OneRowRelation[] > spark-sql> select 'four: ' || 2 + 2; > NULL > {code} > Hive: > {code:sql} > hive> select 'four: ' || 2 + 2; > OK > four: 4 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28033) String concatenation low priority than other arithmeticBinary
[ https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28033: Affects Version/s: (was: 3.0.0) 2.3.3 > String concatenation low priority than other arithmeticBinary > - > > Key: SPARK-28033 > URL: https://issues.apache.org/jira/browse/SPARK-28033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> explain select 'four: ' || 2 + 2; > == Physical Plan == > *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + > CAST(2 AS DOUBLE))#2] > +- Scan OneRowRelation[] > spark-sql> select 'four: ' || 2 + 2; > NULL > {code} > Hive: > {code:sql} > hive> select 'four: ' || 2 + 2; > OK > four: 4 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28033) String concatenation low priority than other arithmeticBinary
[ https://issues.apache.org/jira/browse/SPARK-28033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862844#comment-16862844 ] Yuming Wang commented on SPARK-28033: - I'm working on. > String concatenation low priority than other arithmeticBinary > - > > Key: SPARK-28033 > URL: https://issues.apache.org/jira/browse/SPARK-28033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Spark SQL: > {code:sql} > spark-sql> explain select 'four: ' || 2 + 2; > == Physical Plan == > *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + > CAST(2 AS DOUBLE))#2] > +- Scan OneRowRelation[] > spark-sql> select 'four: ' || 2 + 2; > NULL > {code} > Hive: > {code:sql} > hive> select 'four: ' || 2 + 2; > OK > four: 4 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28033) String concatenation low priority than other arithmeticBinary
Yuming Wang created SPARK-28033: --- Summary: String concatenation low priority than other arithmeticBinary Key: SPARK-28033 URL: https://issues.apache.org/jira/browse/SPARK-28033 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang Spark SQL: {code:sql} spark-sql> explain select 'four: ' || 2 + 2; == Physical Plan == *(1) Project [null AS (CAST(concat(four: , CAST(2 AS STRING)) AS DOUBLE) + CAST(2 AS DOUBLE))#2] +- Scan OneRowRelation[] spark-sql> select 'four: ' || 2 + 2; NULL {code} Hive: {code:sql} hive> select 'four: ' || 2 + 2; OK four: 4 {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862828#comment-16862828 ] Chris Martin commented on SPARK-27463: -- Hi [~hyukjin.kwon] Ah I see your concern now. It think it’s fair to say that the cogrouping functionality proposed has no analogous API in Pandas. In my opinion that’s understandable as Pandas is fundamentally a library for manipulating local data so the problems of colocating multiple DatafFrames don’t apply as they do in Spark. That said, the inspiration behind the proposed API is clearly that of the Pandas groupby().apply() so I’d argue it is not without precedent. I think the more direct comparison here is with the existing Dataset cogroup, where high level functionality is almost exactly the same (partition two distinct DatafFrames such that partitions are cogroup and apply a flatmap operation over them) with the differences being in the cogroup key definition (typed for datasets, untyped for pandas-udf), Input (Iterables for Datasets, Pandas DataFrames for pandas-udf) and Output (Iterable for Datasets, pandas Dataframe for pandas-udf). Now at this point one might observe that we have two different language-specific implementations of the same high level functionality. This is true, however it’s been the case since the introduction of Pandas Udfs (see groupBy().apply() vs groupByKey().flatmapgroups()) and is imho a good thing; it allows us to provide functionality that plays to the strength of each individual language given that what is simple and idiomatic in Python is not in Scala and vice versa. If, considering this, we agree that this cogroup functionality both useful and suitable as exposing via a Pandas UDF (and I hope we do, but please say if you disagree), the question now comes as to what we would like the api to be. At this point let’s consider the API as currently proposed in the design doc. {code:java} result = df1.cogroup(df2, on='id').apply(my_pandas_udf) {code} This API is concise and consistent with existing groupby.apply(). The disadvantage is that it isn’t consistent with Dataset’s cogroup and, as this API doesn’t exist in Pandas, it can’t be consistent with that (although I would argue that if Pandas did introduce such an API it would look a lot like this). The alternative would be to implement something on RelationalGroupedData as described by Li in the post above (I think we can discount something based on KeyValueGroupedDataset as if my reading of the code is correct this would only apply for typed APIs which this isn’t). The big advantage here is that this is much more consistent with the existing Dataset cogroup. On the flip side it comes at the cost of a little more verbosity and IMHO is a little less pythonic/in the style of Pandas. That being the case, I’m slightly in favour of the the API as currently proposed in the design doc, but am happy to be swayed to something else if the majority have a different opinion. > Support Dataframe Cogroup via Pandas UDFs > -- > > Key: SPARK-27463 > URL: https://issues.apache.org/jira/browse/SPARK-27463 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Chris Martin >Priority: Major > > Recent work on Pandas UDFs in Spark, has allowed for improved > interoperability between Pandas and Spark. This proposal aims to extend this > by introducing a new Pandas UDF type which would allow for a cogroup > operation to be applied to two PySpark DataFrames. 
> Full details are in the google document linked below. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
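For readers less familiar with the existing typed API the comment above compares against, a minimal Scala sketch of Dataset cogroup; the case classes and data are made up for illustration, and this is the current API, not the proposed Pandas UDF one:
{code:scala}
import org.apache.spark.sql.SparkSession

case class LeftRow(id: Int, v: Double)   // hypothetical schemas
case class RightRow(id: Int, w: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds1 = Seq(LeftRow(1, 1.0), LeftRow(2, 2.0)).toDS()
val ds2 = Seq(RightRow(1, "a"), RightRow(1, "b")).toDS()

// Both Datasets are grouped by key; the function sees the two co-located
// groups as iterators and may emit any number of output records.
val counts = ds1.groupByKey(_.id).cogroup(ds2.groupByKey(_.id)) {
  (key, lefts, rights) => Iterator((key, lefts.size + rights.size))
}
counts.show()
{code}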
[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862822#comment-16862822 ] Christian Homberg commented on SPARK-27966: --- This is the truncated output I get from df.explain. {code:java} == Physical Plan == CollectLimit 21 +- *(1) Project [{COLUMNDEFINITIONS} ... 21 more fields] +- *(1) FileScan csv [{COLUMNDEFINITIONS}... 21 more fields] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[dbfs:/mnt{LOCATIONTRUNCATED}, dbfs:/m..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct{COLUMNDEFINITIONS}... {code} Is there anything else I can provide? > input_file_name empty when listing files in parallel > > > Key: SPARK-27966 > URL: https://issues.apache.org/jira/browse/SPARK-27966 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 > Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11) > > Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 > Workers: 3 > Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 >Reporter: Christian Homberg >Priority: Minor > Attachments: input_file_name_bug > > > I ran into an issue similar and probably related to SPARK-26128. The > _org.apache.spark.sql.functions.input_file_name_ is sometimes empty. > > {code:java} > df.select(input_file_name()).show(5,false) > {code} > > {code:java} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > +-+ > {code} > My environment is databricks and debugging the Log4j output showed me that > the issue occurred when the files are being listed in parallel, e.g. when > {code:java} > 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 127; threshold: 32 > 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories > in parallel under:{code} > > Everything's fine as long as > {code:java} > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 6; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > {code} > > Setting spark.sql.sources.parallelPartitionDiscovery.threshold to > resolves the issue for me. > > *edit: the problem is not exclusively linked to listing files in parallel. > I've setup a larger cluster for which after parallel file listing the > input_file_name did return the correct filename. After inspecting the log4j > again, I assume that it's linked to some kind of MetaStore being full. I've > attached a section of the log4j output that I think should indicate why it's > failing. If you need more, please let me know.* > ** > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13882) Remove org.apache.spark.sql.execution.local
[ https://issues.apache.org/jira/browse/SPARK-13882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862641#comment-16862641 ] Lai Zhou edited comment on SPARK-13882 at 6/13/19 8:07 AM: --- hi,[~rxin], is this iterator-based local mode will be re-introduced in the future ? I think a direct iterator-based local mode will be high-efficiency , that can help people to do real-time queries. was (Author: hhlai1990): hi,[~rxin], is this iterator-based local mode will be re-introduced in the future ? I think a direct iterator-based local mode will be high-efficiency , than can help people to do real-time queries. > Remove org.apache.spark.sql.execution.local > --- > > Key: SPARK-13882 > URL: https://issues.apache.org/jira/browse/SPARK-13882 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > Fix For: 2.0.0 > > > We introduced some local operators in org.apache.spark.sql.execution.local > package but never fully wired the engine to actually use these. We still plan > to implement a full local mode, but it's probably going to be fairly > different from what the current iterator-based local mode would look like. > Let's just remove them for now, and we can always re-introduced them in the > future by looking at branch-1.6. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28024: Priority: Critical (was: Major) > Incorrect numeric values when out of range > -- > > Key: SPARK-28024 > URL: https://issues.apache.org/jira/browse/SPARK-28024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Critical > Labels: correctness > Attachments: SPARK-28024.png > > > For example: > {code:sql} > select tinyint(128) * tinyint(2); -- 0 > select smallint(2147483647) * smallint(2); -- -2 > select int(2147483647) * int(2); -- -2 > SELECT smallint((-32768)) * smallint(-1); -- -32768 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28024: Labels: correctness (was: ) > Incorrect numeric values when out of range > -- > > Key: SPARK-28024 > URL: https://issues.apache.org/jira/browse/SPARK-28024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Labels: correctness > Attachments: SPARK-28024.png > > > For example: > {code:sql} > select tinyint(128) * tinyint(2); -- 0 > select smallint(2147483647) * smallint(2); -- -2 > select int(2147483647) * int(2); -- -2 > SELECT smallint((-32768)) * smallint(-1); -- -32768 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28024: Target Version/s: 3.0.0 > Incorrect numeric values when out of range > -- > > Key: SPARK-28024 > URL: https://issues.apache.org/jira/browse/SPARK-28024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Critical > Labels: correctness > Attachments: SPARK-28024.png > > > For example: > {code:sql} > select tinyint(128) * tinyint(2); -- 0 > select smallint(2147483647) * smallint(2); -- -2 > select int(2147483647) * int(2); -- -2 > SELECT smallint((-32768)) * smallint(-1); -- -32768 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
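A hedged note on working around the wrap-around shown above while the behaviour is being discussed: widening the operands before the multiplication gives the mathematically expected results (illustrative only; it does not change the reported behaviour, and assumes a spark-shell session):
{code:scala}
// Promote the narrow operands to a wider type before multiplying.
spark.sql("select cast(2147483647 as bigint) * 2").show() // 4294967294
spark.sql("select cast(-32768 as int) * -1").show()       // 32768
{code}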
[jira] [Updated] (SPARK-28016) Spark hangs when an execution plan has many projections on nested structs
[ https://issues.apache.org/jira/browse/SPARK-28016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Yushchenko updated SPARK-28016: -- Attachment: SparkApp1IssueSelfContained.scala > Spark hangs when an execution plan has many projections on nested structs > - > > Key: SPARK-28016 > URL: https://issues.apache.org/jira/browse/SPARK-28016 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.4.3 > Environment: Tried in > * Spark 2.2.1, Spark 2.4.3 in local mode on Linux, MasOS and Windows > * Spark 2.4.3 / Yarn on a Linux cluster >Reporter: Ruslan Yushchenko >Priority: Major > Attachments: NestedOps.scala, SparkApp1Issue.scala, > SparkApp1IssueSelfContained.scala, SparkApp2Workaround.scala, > spark-app-nested.tgz > > > Spark applications freeze on execution plan optimization stage (Catalyst) > when a logical execution plan contains a lot of projections that operate on > nested struct fields. > 2 Spark Applications are attached. One demonstrates the issue, the other > demonstrates a workaround. Also, an archive is attached where these jobs are > packages as a Maven Project. > To reproduce the attached Spark App does the following: > * A small dataframe is created from a JSON example. > * A nested withColumn map transformation is used to apply a transformation > on a struct field and create a new struct field. The code for this > transformation is also attached. > * Once more than 11 such transformations are applied the Catalyst optimizer > freezes on optimizing the execution plan > {code:scala} > package za.co.absa.spark.app > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > object SparkApp1Issue { > // A sample data for a dataframe with nested structs > val sample = > """ > |{ > | "strings": { > |"simple": "Culpa repellat nesciunt accusantium", > |"all_random": "DESebo8d%fL9sX@AzVin", > |"whitespaces": "qbbl" > | }, > | "numerics": { > |"small_positive": 722, > |"small_negative": -660, > |"big_positive": 669223368251997, > |"big_negative": -161176863305841, > |"zero": 0 > | } > |} > """.stripMargin :: > """{ > | "strings": { > |"simple": "Accusamus quia vel deleniti", > |"all_random": "rY&n9UnVcD*KS]jPBpa[", > |"whitespaces": " t e t rp z p" > | }, > | "numerics": { > |"small_positive": 268, > |"small_negative": -134, > |"big_positive": 768990048149640, > |"big_negative": -684718954884696, > |"zero": 0 > | } > |} > |""".stripMargin :: > """{ > | "strings": { > |"simple": "Quia numquam deserunt delectus rem est", > |"all_random": "GmRdQlE4Avn1hSlVPAH", > |"whitespaces": " c sayv drf " > | }, > | "numerics": { > |"small_positive": 909, > |"small_negative": -363, > |"big_positive": 592517494751902, > |"big_negative": -703224505589638, > |"zero": 0 > | } > |} > |""".stripMargin :: Nil > /** > * This Spark Job demonstrates an issue of execution plan freezing when > there are a lot of projections > * involving nested structs in an execution plan. > * > * The example works as follows: > * - A small dataframe is created from a JSON example above > * - A nested withColumn map transformation is used to apply a > transformation on a struct field and create > * a new struct field. 
> * - Once more than 11 such transformations are applied the Catalyst > optimizer freezes on optimizing > * the execution plan > */ > def main(args: Array[String]): Unit = { > val sparkBuilder = SparkSession.builder().appName("Nested Projections > Issue") > val spark = sparkBuilder > .master("local[4]") > .getOrCreate() > import spark.implicits._ > import za.co.absa.spark.utils.NestedOps.DataSetWrapper > val df = spark.read.json(sample.toDS) > // Apply several uppercase and negation transformations > val dfOutput = df > .nestedWithColumnMap("strings.simple", "strings.uppercase1", c => > upper(c)) > .nestedWithColumnMap("strings.all_random", "strings.uppercase2", c => > upper(c)) > .nestedWithColumnMap("strings.whitespaces", "strings.uppercase3", c => > upper(c)) > .nestedWithColumnMap("numerics.small_positive", "numerics.num1", c => > -c) > .nestedWithColumnMap("numerics.small_negative", "numerics.num2", c => > -c) > .nestedWithColum
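[editor's note] The attached reproducer above is truncated in this archive, and the NestedOps helper it relies on is not reproduced here. As a rough, hypothetical illustration only (column and field names invented, built-in struct/withColumn API instead of the attached helper), the pattern that makes the plan grow is repeatedly rebuilding a nested struct with one extra derived field:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object NestedProjectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("nested-projection-sketch")
      .getOrCreate()
    import spark.implicits._

    // A tiny frame with a single nested struct column.
    val df = Seq(("Culpa repellat", "DESebo8d")).toDF("simple", "all_random")
      .select(struct($"simple", $"all_random").as("strings"))

    // Each step re-projects the whole struct plus one derived field, so the
    // projection tree grows with every iteration; the ticket reports that
    // around a dozen such steps Catalyst spends an extremely long time optimizing.
    val out = (1 to 12).foldLeft(df) { (acc, i) =>
      acc.withColumn("strings",
        struct($"strings.*", upper($"strings.simple").as(s"uppercase$i")))
    }

    out.explain(extended = true)
    spark.stop()
  }
}
{code}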
[jira] [Resolved] (SPARK-27322) DataSourceV2 table relation
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27322. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24741 [https://github.com/apache/spark/pull/24741] > DataSourceV2 table relation > --- > > Key: SPARK-27322 > URL: https://issues.apache.org/jira/browse/SPARK-27322 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > Fix For: 3.0.0 > > > Support multi-catalog in the following SELECT code paths: > * SELECT * FROM catalog.db.tbl > * TABLE catalog.db.tbl > * JOIN or UNION tables from different catalogs > * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
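[editor's note] Written out as a hedged sketch of the code paths listed in SPARK-27322: the catalog name testcat, the tables db.tbl and spark_catalog.default.other, and the id join column are all hypothetical, and a TableCatalog plugin is assumed to be registered under spark.sql.catalog.testcat.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// SELECT * FROM catalog.db.tbl
spark.sql("SELECT * FROM testcat.db.tbl").show()

// TABLE catalog.db.tbl
spark.sql("TABLE testcat.db.tbl").show()

// JOIN tables from different catalogs
spark.sql(
  """SELECT t1.*, t2.*
    |FROM testcat.db.tbl t1
    |JOIN spark_catalog.default.other t2 ON t1.id = t2.id""".stripMargin).show()

// SparkSession.table("catalog.db.tbl")
spark.table("testcat.db.tbl").show()
{code}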