[jira] [Resolved] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai resolved SPARK-27100.
-----------------------------
    Resolution: Fixed
    Fix Version/s: 2.4.4

Issue resolved by pull request 24957
[https://github.com/apache/spark/pull/24957]

> "dag-scheduler-event-loop" java.lang.StackOverflowError
> -------------------------------------------------------
>
>                 Key: SPARK-27100
>                 URL: https://issues.apache.org/jira/browse/SPARK-27100
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.3, 2.3.3
>            Reporter: KaiXu
>            Priority: Major
>             Fix For: 2.4.4
>
>         Attachments: SPARK-27100-Overflow.txt, stderr
>
> ALS in Spark MLlib causes a StackOverflowError:
> {code}
> /opt/sparkml/spark213/bin/spark-submit --properties-file /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client --num-executors 3 --executor-memory 322g /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
> {code}
> {noformat}
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
> at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
> at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> {noformat}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
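The trace above shows `ObjectOutputStream` recursing through nested `List$SerializationProxy` frames: each ALS iteration extends the RDD lineage, and Java serialization walks that dependency chain recursively until the stack is exhausted. The usual mitigation is periodic checkpointing (ALS exposes a `checkpointInterval` parameter), which truncates the lineage. The failure mode itself can be sketched in plain Python, where pickling a deeply nested structure hits the recursion limit; this is only an analogy for the Java-serialization behavior, not Spark code:

```python
import pickle

# Build a deeply nested linked structure, standing in for a long RDD
# lineage where each iteration wraps the previous dependency graph.
node = None
for i in range(100_000):
    node = (i, node)

# Recursive serialization of the nested structure exhausts the stack,
# analogous to java.io.ObjectOutputStream on a very long Spark lineage.
try:
    pickle.dumps(node)
    overflowed = False
except RecursionError:
    overflowed = True

print(overflowed)
```

Checkpointing every N iterations bounds the depth of the structure the serializer has to walk, which is why it avoids this class of error.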
[jira] [Created] (SPARK-28169) Spark can’t push down predicate for OR expression
angerszhu created SPARK-28169: - Summary: Spark can’t push down predicate for OR expression Key: SPARK-28169 URL: https://issues.apache.org/jira/browse/SPARK-28169 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: angerszhu
[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873041#comment-16873041 ] Luca Canali commented on SPARK-28091: - Indeed toString() is quite useful, thanks. + good proposal, I think it could be useful. I'll take the occasion to address a related topic: are there any plans to add read and write time/latency metrics to the S3A instrumentation? This would be useful for (Spark) performance troubleshooting, for example by comparing CPU time and I/O time. I think the question of I/O time instrumentation has come up a few times already over the years. I'll add that in the context discussed here, that is Spark monitoring instrumentation with the Dropwizard libraries, we would "only" need latency metrics aggregated over the executor JVM, rather than task/query-level statistics (BTW, I see the latter has been discussed for HDFS in HADOOP-11873). > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Luca Canali > Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark executor plugin metrics to the Spark metrics system implemented with the Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics (see also SPARK-26890) useful to monitor and troubleshoot Spark workloads. A typical workflow is to sink the metrics to a storage system and build dashboards on top of it. > Improvement: The original goal of this work was to add instrumentation for S3 filesystem access metrics by Spark job. Currently, [[ExecutorSource]] instruments HDFS and local filesystem metrics. Rather than extending the code there, we propose to add a metrics plugin system, which is more flexible and of more general use. > Advantages: > * The metrics plugin system makes it easy to implement instrumentation for S3 access by Spark jobs. > * The metrics plugin system allows for easy extension of how Spark collects HDFS-related workload metrics. This is currently done using the Hadoop Filesystem GetAllStatistics method, which is deprecated in recent versions of Hadoop. Recent versions of Hadoop Filesystem recommend using the method GetGlobalStorageStatistics, which also provides several additional metrics. GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was introduced in Hadoop 2.8). Using a metrics plugin for Spark would give those deploying suitable Hadoop versions an easy way to "opt in" to such new API calls. > * We also have the use case of adding Hadoop filesystem monitoring for a custom Hadoop-compliant filesystem in use in our organization (EOS, using the XRootD protocol). The metrics plugin infrastructure makes this easy to do. Others may have similar use cases. > * More generally, this method makes it straightforward to plug filesystem and other metrics into the Spark monitoring system. Future work on plugin implementations can address extending monitoring to measure usage of external resources (OS, filesystem, network, accelerator cards, etc.) that would perhaps not normally be considered general enough for inclusion in the Apache Spark code, but can nevertheless be useful for specialized use cases, tests or troubleshooting. > Implementation: The proposed implementation is currently a WIP, open for comments and improvements. It is based on the work on the Executor Plugin of SPARK-24918 and builds on recent work on extending Spark executor metrics, such as SPARK-25228. > Tests and examples: This has so far been manually tested running Spark on YARN and K8S clusters, in particular for monitoring S3 and for extending HDFS instrumentation with the Hadoop Filesystem "GetGlobalStorageStatistics" metrics. An executor metrics plugin example and the code used for testing are available.
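The plugin idea in the proposal can be illustrated with a toy sketch: sources register named gauges with a central registry, and a sink reads them all, so supporting a new filesystem (S3, XRootD, ...) only requires contributing a source. This is a minimal Python analogy of the pattern, not Spark's or Dropwizard's actual API; `MetricsRegistry` and `S3SourcePlugin` are hypothetical names:

```python
# Toy plugin-style metrics registry: sources register zero-argument
# callables ("gauges"); a sink takes a snapshot of all registered values.
class MetricsRegistry:
    def __init__(self):
        self._gauges = {}

    def register_gauge(self, name, fn):
        # fn is a zero-argument callable returning the current value
        self._gauges[name] = fn

    def snapshot(self):
        return {name: fn() for name, fn in self._gauges.items()}


class S3SourcePlugin:
    """Hypothetical executor plugin contributing an S3 read metric."""
    def __init__(self):
        self.bytes_read = 0

    def attach(self, registry):
        registry.register_gauge("s3.bytesRead", lambda: self.bytes_read)


registry = MetricsRegistry()
plugin = S3SourcePlugin()
plugin.attach(registry)
plugin.bytes_read += 4096          # simulate some S3 reads
print(registry.snapshot())         # {'s3.bytesRead': 4096}
```

The point of the indirection is that the registry and sink never change when a new source is plugged in, which matches the "opt in" argument made above for newer Hadoop statistics APIs.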
[jira] [Updated] (SPARK-28169) Spark can’t push down predicate for OR expression
[ https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-28169: -- Description: Spark can't push down a filter condition that contains OR. For example, given a table {color:#d04437}default.test{color} whose partition column is "{color:#d04437}dt{color}", the query: {code:java} select * from default.test where dt=20190625 or (dt = 20190626 and id in (1,2,3) ) {code} In this case, Spark resolves the OR condition as a single expression, and since this expression references "{color:#FF}id{color}" (a non-partition column), it cannot be pushed down.
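A standard rewrite that makes this case pushable is to distribute the OR over the AND (toward CNF): `dt=20190625 OR (dt=20190626 AND id IN (1,2,3))` is equivalent to `(dt=20190625 OR dt=20190626) AND (dt=20190625 OR id IN (1,2,3))`, and the first conjunct references only the partition column. This is the textbook trick, not necessarily the fix Spark adopted; the expression encoding and helper names below are invented for illustration, nothing here is Catalyst API:

```python
# Expressions are tuples: ("or", a, b), ("and", a, b), or a leaf of the
# form ("leaf", {referenced column names}).
def distribute_or(expr):
    """Rewrite OR-over-AND one level: a OR (b AND c) -> (a OR b) AND (a OR c)."""
    if expr[0] == "or":
        a, rhs = expr[1], expr[2]
        if rhs[0] == "and":
            return ("and", ("or", a, rhs[1]), ("or", a, rhs[2]))
    return expr

def references(expr):
    if expr[0] in ("and", "or"):
        return references(expr[1]) | references(expr[2])
    return expr[1]  # leaf: the set of referenced columns

# dt=20190625 OR (dt=20190626 AND id IN (1,2,3))
pred = ("or", ("leaf", {"dt"}), ("and", ("leaf", {"dt"}), ("leaf", {"id"})))
cnf = distribute_or(pred)

# Conjuncts that touch only partition columns can be pushed down.
pushable = [c for c in (cnf[1], cnf[2]) if references(c) <= {"dt"}]
print(len(pushable))  # 1: the (dt=... OR dt=...) conjunct is pushable
```

The trade-off is that full CNF conversion can blow up exponentially, so an optimizer typically extracts only the partition-only conjuncts rather than converting the whole predicate.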
[jira] [Assigned] (SPARK-28169) Spark can’t push down predicate for OR expression
[ https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28169: Assignee: Apache Spark
[jira] [Assigned] (SPARK-28169) Spark can’t push down predicate for OR expression
[ https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28169: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873161#comment-16873161 ] Steve Loughran commented on SPARK-28091: - IO metrics. hmmm. Maybe, but there's a lot going on underneath, including retries happening in the AWS code. And things like the duration of a rename are O(bytes), whereas for a getFileStatus it's that of a pair of HEAD calls. We've discussed hooking up to the AWS SDK metrics, if that interests you: HADOOP-13551. Getting the throttle and retry counts, which are often handled transparently in the AWS code, would be the ones to go for. Sometimes we see throttling in our own code too, where again the throttle count would be good. It's better to say "your job was slow because you overloaded an S3 shard" than "your job was slow". There are some quantiles to measure DynamoDB duration in S3Guard, which led to HADOOP-16278. Now that DDB supports pay-on-demand, IO throttling is less common and these counters are less important.
[jira] [Updated] (SPARK-26683) Incorrect value of "internal.metrics.input.recordsRead" when reading from temp hive table backed by HDFS file
[ https://issues.apache.org/jira/browse/SPARK-26683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amar Agrawal updated SPARK-26683: - Fix Version/s: 3.0.0 > Incorrect value of "internal.metrics.input.recordsRead" when reading from > temp hive table backed by HDFS file > - > > Key: SPARK-26683 > URL: https://issues.apache.org/jira/browse/SPARK-26683 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.3.1 > Reporter: Amar Agrawal > Priority: Major > Fix For: 3.0.0 > > Attachments: asyncfactory.scala, input1.txt, input2.txt > > > *Issue description* > The summary of the issue is - when a persisted DataFrame is used in two > different concurrent threads, we get a wrong value of > *internal.metrics.input.recordsRead* in the SparkListenerStageCompleted event. > > *Issue Details* > The Spark code I have written has 2 source temp hive tables. When the first > temp table is read, its dataframe is persisted, whereas for the other temp > table, its source dataframe is not persisted. After that, we have 2 pipelines > which we run in async fashion. In the 1st pipeline, the persisted dataframe > is written to a hive target table. In the 2nd pipeline, we perform > a UNION of the persisted dataframe with the non-persisted dataframe, which > is then written to a separate hive table. > Our expectation is that, since the first dataframe is persisted, its metric for > recordsRead should be computed exactly once. But in our case, we are seeing > an increased value of the metric. > Example - if my persisted dataframe has 2 rows, the above-mentioned metric > consistently reports it as 3 rows. > > *Steps to reproduce Issue:* > # Create directory /tmp/INFA_UNION1 and copy input1.txt to this directory. > # Create directory /tmp/INFA_UNION2 and copy input2.txt to this directory.
> # Run the following in spark-shell: > scala> :load asyncfactory.scala > scala> : paste -raw > > {code:java} > package org.apache.spark > import org.apache.spark.scheduler._ > import org.apache.spark.util.JsonProtocol > import org.json4s.jackson.JsonMethods._ > class InfaListener(mode:String="ACCUMULATOR") extends > org.apache.spark.scheduler.SparkListener { > def onEvent(event: SparkListenerEvent): Unit = { > val jv = JsonProtocol.sparkEventToJson(event) > println(compact(jv)) > } > override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): > Unit = { onEvent(stageCompleted)} > override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): > Unit = { onEvent(stageSubmitted)} > } > {code} > > scala> :paste > {code:java} > import org.apache.spark.InfaListener > implicit def df2idf(d:DataFrame):InfaDataFrame = new InfaDataFrame(d); > val sqlc = spark.sqlContext > val sc = spark.sparkContext > val lis = new InfaListener("TAG") > sc.addSparkListener(lis) > sqlc.sql("DROP TABLE IF EXISTS `default`.`read1`") > sqlc.sql("CREATE TABLE `default`.`read1` (`col0` STRING) LOCATION > '/tmp/INFA_UNION1'") > sqlc.sql("DROP TABLE IF EXISTS `default`.`read2`") > sqlc.sql("CREATE TABLE `default`.`read2` (`col0` STRING) LOCATION > '/tmp/INFA_UNION2'") > sqlc.sql("DROP TABLE IF EXISTS `default`.`write1`") > sqlc.sql("CREATE TABLE `default`.`write1` (`col0` STRING)") > sqlc.sql("DROP TABLE IF EXISTS `default`.`write2`") > sqlc.sql("CREATE TABLE `default`.`write2` (`col0` STRING)") > val v0 = sqlc.sql("SELECT `read1`.`col0` as a0 FROM > `default`.`read1`").itoDF.persist(MEMORY_AND_DISK).where(lit(true)); > async(asBlock(sqlc.sql("INSERT OVERWRITE TABLE `default`.`write1` SELECT > tbl0.c0 as a0 FROM tbl0"), v0.unionAll(sqlc.sql("SELECT `read2`.`col0` as a0 > FROM > `default`.`read2`").itoDF).itoDF("TGT_").itoDF("c").createOrReplaceTempView("tbl0"))); > async(asBlock(sqlc.sql("INSERT OVERWRITE TABLE `default`.`write2` SELECT > tbl1.c0 as a0 FROM tbl1"), > 
v0.itoDF("TGT_").itoDF("c").createOrReplaceTempView("tbl1"))); > stop; > {code} > *NOTE* - The above code refers to 2 file directories /tmp/INFA_UNION1 and > /tmp/INFA_UNION2. We have attached the files which need to be copied to the > above locations after these directories are created. >
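The expectation in the report can be stated as a toy sketch: if the scan metric is incremented only when source rows are actually read, a persisted (memoized) read should bump it once even when two downstream consumers use the result; any re-read of the source inflates it, which is the behavior the issue describes. The helper names below are invented for illustration and this is not Spark's metrics code:

```python
records_read = 0  # stands in for internal.metrics.input.recordsRead

def read_source():
    # Stands in for the HDFS scan that feeds the recordsRead metric.
    global records_read
    rows = ["r1", "r2"]
    records_read += len(rows)
    return rows

_cache = {}
def read_persisted():
    # A persisted DataFrame: the underlying scan runs at most once.
    if "src" not in _cache:
        _cache["src"] = read_source()
    return _cache["src"]

# Two concurrent-style consumers of the persisted data.
a = read_persisted()                 # first pipeline: write persisted data
b = read_persisted() + ["extra"]     # second pipeline: union with other data

print(records_read)  # 2, not 3: the cached scan is not recounted
```

The reported bug corresponds to some fraction of the persisted rows being scanned (and counted) a second time, yielding 3 instead of 2 in the report's example.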
[jira] [Created] (SPARK-28170) DenseVector .toArray() and .values documentation do not specify they are aliases
Sivam Pasupathipillai created SPARK-28170: - Summary: DenseVector .toArray() and .values documentation do not specify they are aliases Key: SPARK-28170 URL: https://issues.apache.org/jira/browse/SPARK-28170 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.4.3 Reporter: Sivam Pasupathipillai The documentation of the *toArray()* method and the *values* property in pyspark.ml.linalg.DenseVector is confusing. *toArray()*: Returns a numpy.ndarray *values*: Returns a list of values However, they are actually aliases and they both return a numpy.ndarray. FIX: either change the documentation or change the *values* property to return a Python list.
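Of the two fixes, changing the documentation is the safer option, since making `values` return a list would break callers relying on ndarray semantics. A minimal sketch of documenting the alias explicitly (a mock class, not the real pyspark `DenseVector`; `array.array` stands in for numpy to keep it dependency-free):

```python
# Mock illustrating the documentation fix: `values` is documented as an
# alias of toArray() rather than as returning a list.
import array

class DenseVector:
    def __init__(self, values):
        self._arr = array.array("d", values)  # stand-in for numpy.ndarray

    def toArray(self):
        """Return the underlying array of values."""
        return self._arr

    @property
    def values(self):
        """Alias for :meth:`toArray`; returns the same array, not a list."""
        return self.toArray()

v = DenseVector([1.0, 2.0, 3.0])
print(v.values is v.toArray())  # True: both names yield the same object
```

With the docstrings cross-referencing each other, the confusion the report describes disappears without any behavior change.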
[jira] [Assigned] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28164: Assignee: (was: Apache Spark) > usage description does not match with shell scripts > --- > > Key: SPARK-28164 > URL: https://issues.apache.org/jira/browse/SPARK-28164 > Project: Spark > Issue Type: Bug > Components: Project Infra > Affects Versions: 2.4.3 > Reporter: Hanna Kan > Priority: Major > > I found that "spark/sbin/start-slave.sh" may have an error. > Line 43 gives --- echo "Usage: ./sbin/start-slave.sh [options] " > but later in this script, line 59 has MASTER=$1 > Is this a conflict?
[jira] [Commented] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873272#comment-16873272 ] Apache Spark commented on SPARK-28164: -- User 'shivusondur' has created a pull request for this issue: https://github.com/apache/spark/pull/24974
[jira] [Assigned] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28164: Assignee: Apache Spark
[jira] [Commented] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873275#comment-16873275 ] Shivu Sondur commented on SPARK-28164: -- [~hannankan] I updated the usage message and raised the pull request
[jira] [Created] (SPARK-28171) Correct interval type conversion behavior
Zhu, Lipeng created SPARK-28171: --- Summary: Correct interval type conversion behavior Key: SPARK-28171 URL: https://issues.apache.org/jira/browse/SPARK-28171 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Zhu, Lipeng When calculating with timestamps and intervals: {code:sql} select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second {code} * PostgreSQL returns *2019-01-02 02:03:04* *2018-12-31 02:03:04* * SparkSQL returns *2019-01-02 02:03:04* *2018-12-30 21:56:56* {code:sql} select timestamp '2019-01-01 00:00:00' + interval '1 -2:03:04' day to second, timestamp '2019-01-01 00:00:00' + interval '-1 -2:03:04' day to second {code} * PostgreSQL returns *2019-01-01 21:56:56* *2018-12-30 21:56:56* * SparkSQL returns _*Interval string does not match day-time format of 'd h:m:s.n': '1 -2:03:04'(line 1, pos 50)*_
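The divergence comes down to how the sign in `'-1 2:03:04'` is scoped: PostgreSQL applies the sign per field (minus one day, then plus 2:03:04), while Spark negates the whole day-time interval. Both readings can be reproduced with plain `datetime` arithmetic; this is a sketch of the two semantics, not of either engine's parser:

```python
from datetime import datetime, timedelta

base = datetime(2019, 1, 1, 0, 0, 0)

# PostgreSQL reading of '-1 2:03:04': the sign binds to the day field
# only, i.e. minus 1 day plus 2:03:04.
pg = base + timedelta(days=-1) + timedelta(hours=2, minutes=3, seconds=4)

# Spark reading: the leading sign negates the whole day-time interval.
spark = base - timedelta(days=1, hours=2, minutes=3, seconds=4)

print(pg)     # 2018-12-31 02:03:04
print(spark)  # 2018-12-30 21:56:56
```

The per-field reading also explains why PostgreSQL accepts `'1 -2:03:04'` (plus 1 day, minus 2:03:04) while Spark's single-sign day-time format rejects it.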
[jira] [Commented] (SPARK-28171) Correct interval type conversion behavior
[ https://issues.apache.org/jira/browse/SPARK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873318#comment-16873318 ] Yuming Wang commented on SPARK-28171: -
{code:sql}
postgres=# select substr(version(), 0, 16), timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second;
     substr      |      ?column?       |      ?column?
-----------------+---------------------+---------------------
 PostgreSQL 11.3 | 2019-01-02 02:03:04 | 2018-12-31 02:03:04
(1 row)
{code}
{code:sql}
dbadmin=> select version(), timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second;
              version               |      ?column?       |      ?column?
------------------------------------+---------------------+---------------------
 Vertica Analytic Database v9.1.1-0 | 2019-01-02 02:03:04 | 2018-12-30 21:56:56
(1 row)
{code}
{code:sql}
presto> select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second;
          _col0          |          _col1
-------------------------+-------------------------
 2019-01-02 02:03:04.000 | 2018-12-30 21:56:56.000
(1 row)
{code}
{code:sql}
SQL> -- Oracle
SQL> select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second from dual;

TIMESTAMP'2019-01-0100:00:00'+INTERVAL'12:03:04'DAYTOSECOND
---------------------------------------------------------------------------
TIMESTAMP'2019-01-0100:00:00'+INTERVAL'-12:03:04'DAYTOSECOND
---------------------------------------------------------------------------
02-JAN-19 02.03.04.0 AM
30-DEC-18 09.56.56.0 PM
{code}
[jira] [Created] (SPARK-28172) pyspark DataFrame equality operator
Hugo created SPARK-28172: Summary: pyspark DataFrame equality operator Key: SPARK-28172 URL: https://issues.apache.org/jira/browse/SPARK-28172 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.3 Reporter: Hugo *Motivation:* Facilitating testing equality between DataFrames. Many approaches to applying TDD practices in data science require checking function output against expected output (similar to API snapshots). Having an equality operator or something alike would make this easier. A basic example:
{code}
from pyspark.ml.feature import Imputer

def create_mock_missing_df():
    return spark.createDataFrame([("a", 1.0), ("b", 2.0), ("c", float('nan'))], ['COL1', 'COL2'])

def test_imputation():
    """Test mean value imputation"""
    df1 = create_mock_missing_df()
    # load snapshot
    pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect()
    df2 = spark.createDataFrame(pickled_snapshot)
    # >>> df2.show()
    # +----+------------+
    # |COL1|COL2_imputed|
    # +----+------------+
    # |   a|         1.0|
    # |   b|         2.0|
    # |   c|         1.5|
    # +----+------------+
    imputer = Imputer(inputCols=['COL2'], outputCols=['COL2_imputed'])
    df1 = imputer.fit(df1).transform(df1)
    df1 = df1.drop('COL2')
    assert df1 == df2
{code}
Suggested change:
{code}
class DataFrame(object):
    ...
    def __eq__(self, other):
        """Returns ``True`` if DataFrame content is equal to other.

        >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
        >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)])
        >>> df1 == df2
        True
        """
        return self.unionAll(other) \
            .subtract(self.intersect(other)) \
            .count() == 0
{code}
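One caveat with the suggested `__eq__`: `subtract` and `intersect` are set-based, so the check treats frames with different duplicate-row counts as equal. Comparing row multisets is more faithful for testing. A plain-Python sketch of that semantics over collected rows (illustrative only, not a pyspark patch — in practice it corresponds to comparing `Counter(df.collect())` on both sides):

```python
# Order-insensitive but duplicate-sensitive equality on collected rows,
# unlike set-based subtract/intersect, which ignores multiplicities.
from collections import Counter

def frames_equal(rows_a, rows_b):
    """Compare two row lists as multisets."""
    return Counter(rows_a) == Counter(rows_b)

df1 = [("a", 1), ("b", 2), ("c", 3)]
df2 = [("b", 2), ("a", 1), ("c", 3)]
df3 = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]

print(frames_equal(df1, df2))  # True: row order doesn't matter
print(frames_equal(df1, df3))  # False: duplicate counts differ
```

The set-based proposal would report `df1 == df3` as equal, which is usually not what a snapshot test wants; collecting both sides is fine for the small frames typical of tests.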
[jira] [Resolved] (SPARK-28005) SparkRackResolver should not log for resolving empty list
[ https://issues.apache.org/jira/browse/SPARK-28005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-28005. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24935 [https://github.com/apache/spark/pull/24935] > SparkRackResolver should not log for resolving empty list > - > > Key: SPARK-28005 > URL: https://issues.apache.org/jira/browse/SPARK-28005 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > Fix For: 3.0.0 > > > After SPARK-13704, {{SparkRackResolver}} generates an INFO message every time it > is called with 0 arguments: > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/SparkRackResolver.scala#L73-L76 > That actually happens every 1s when there are no active executors, because of > the repeated offers that happen as part of delay scheduling: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L134-L139 > While this is relatively benign, it's a pretty annoying thing to be logging at > INFO level every 1 second. > This is easy to reproduce -- in spark-shell, with dynamic allocation, set log > level to info, see the logs appear every 1 second. Then run something, see > the msgs stop. After the executors time out, see the msgs reappear. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > sc.setLogLevel("info") > Thread.sleep(5000) > sc.parallelize(1 to 10).count() > // Exiting paste mode, now interpreting. > 19/06/11 12:43:40 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:43:41 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:43:42 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. 
Falling back to /default-rack for all > 19/06/11 12:43:43 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:43:44 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:43:45 INFO spark.SparkContext: Starting job: count at :28 > 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Got job 0 (count at > :28) with 2 output partitions > 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 > (count at :28) > ... > 19/06/11 12:43:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose > tasks have all completed, from pool > 19/06/11 12:43:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at > :28) finished in 9.548 s > 19/06/11 12:43:54 INFO scheduler.DAGScheduler: Job 0 finished: count at > :28, took 9.613049 s > res2: Long = 10 > > scala> > ... > 19/06/11 12:44:56 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:44:57 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:44:58 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:44:59 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
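The fix amounts to short-circuiting before the lookup, and before the log line, when there is nothing to resolve. A hedged Python sketch of that shape (the names here are illustrative stand-ins, not the actual Scala resolver):

```python
import logging

logger = logging.getLogger("SparkRackResolver")

def lookup_racks(host_names):
    # stand-in for the real topology lookup; assume it fails here,
    # as it does when no topology script is configured
    raise RuntimeError("no topology script configured")

def resolve(host_names):
    if not host_names:
        return []  # empty offer: no work to do, and no log noise
    try:
        return lookup_racks(host_names)
    except Exception:
        logger.info("Got an error when resolving hostNames. "
                    "Falling back to /default-rack for all")
        return ["/default-rack"] * len(host_names)
```

With the guard in place, the once-per-second empty offers from delay scheduling return immediately without touching the logger.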
[jira] [Assigned] (SPARK-28005) SparkRackResolver should not log for resolving empty list
[ https://issues.apache.org/jira/browse/SPARK-28005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-28005: Assignee: Gabor Somogyi > SparkRackResolver should not log for resolving empty list > - > > Key: SPARK-28005 > URL: https://issues.apache.org/jira/browse/SPARK-28005 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > After SPARK-13704, {{SparkRackResolver}} generates an INFO message everytime > is called with 0 arguments: > https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/SparkRackResolver.scala#L73-L76 > That actually happens every 1s when there are no active executors, because of > the repeated offers that happen as part of delay scheduling: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L134-L139 > while this is relatively benign, its a pretty annoying thing to be logging at > INFO level every 1 second. > This is easy to reproduce -- in spark-shell, with dynamic allocation, set log > level to info, see the logs appear every 1 second. Then run something, see > the msgs stop. After the executors timeout, see the msgs reappear. > {noformat} > scala> :paste > // Entering paste mode (ctrl-D to finish) > sc.setLogLevel("info") > Thread.sleep(5000) > sc.parallelize(1 to 10).count() > // Exiting paste mode, now interpreting. > 19/06/11 12:43:40 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:43:41 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:43:42 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. 
Falling back to /default-rack for all > 19/06/11 12:43:43 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:43:44 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:43:45 INFO spark.SparkContext: Starting job: count at :28 > 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Got job 0 (count at > :28) with 2 output partitions > 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 > (count at :28) > ... > 19/06/11 12:43:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose > tasks have all completed, from pool > 19/06/11 12:43:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at > :28) finished in 9.548 s > 19/06/11 12:43:54 INFO scheduler.DAGScheduler: Job 0 finished: count at > :28, took 9.613049 s > res2: Long = 10 > > scala> > ... > 19/06/11 12:44:56 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:44:57 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:44:58 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > 19/06/11 12:44:59 INFO yarn.SparkRackResolver: Got an error when resolving > hostNames. Falling back to /default-rack for all > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28173) Add Kafka delegation token proxy user support
Gabor Somogyi created SPARK-28173: - Summary: Add Kafka delegation token proxy user support Key: SPARK-28173 URL: https://issues.apache.org/jira/browse/SPARK-28173 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi In SPARK-26592 I've turned off proxy user usage because https://issues.apache.org/jira/browse/KAFKA-6945 is not yet implemented. Since the KIP will be under discussion and hopefully implemented, here is this jira to track the Spark-side effort. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28173) Add Kafka delegation token proxy user support
[ https://issues.apache.org/jira/browse/SPARK-28173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873420#comment-16873420 ] Gabor Somogyi commented on SPARK-28173: --- cc [~viktorsomogyi] > Add Kafka delegation token proxy user support > - > > Key: SPARK-28173 > URL: https://issues.apache.org/jira/browse/SPARK-28173 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > In SPARK-26592 I've turned off proxy user usage because > https://issues.apache.org/jira/browse/KAFKA-6945 is not yet implemented. > Since the KIP will be under discussion and hopefully implemented here is this > jira to track the Spark side effort. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28173) Add Kafka delegation token proxy user support
[ https://issues.apache.org/jira/browse/SPARK-28173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873422#comment-16873422 ] Gabor Somogyi commented on SPARK-28173: --- Will file the appropriate PR when Kafka part is ready. I'm closely working on this with [~viktorsomogyi]. > Add Kafka delegation token proxy user support > - > > Key: SPARK-28173 > URL: https://issues.apache.org/jira/browse/SPARK-28173 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > In SPARK-26592 I've turned off proxy user usage because > https://issues.apache.org/jira/browse/KAFKA-6945 is not yet implemented. > Since the KIP will be under discussion and hopefully implemented here is this > jira to track the Spark side effort. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28164. --- Resolution: Fixed Fix Version/s: 2.4.4 2.3.4 3.0.0 Issue resolved by pull request 24974 [https://github.com/apache/spark/pull/24974] > usage description does not match with shell scripts > --- > > Key: SPARK-28164 > URL: https://issues.apache.org/jira/browse/SPARK-28164 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.3 >Reporter: Hanna Kan >Assignee: Shivu Sondur >Priority: Major > Fix For: 3.0.0, 2.3.4, 2.4.4 > > > I found that "spark/sbin/start-slave.sh" may have an error. > line 43 gives--- echo "Usage: ./sbin/start-slave.sh [options] " > but later in this script, I found line 59: MASTER=$1 > Is this a conflict? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-28164: - Assignee: Shivu Sondur > usage description does not match with shell scripts > --- > > Key: SPARK-28164 > URL: https://issues.apache.org/jira/browse/SPARK-28164 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.3 >Reporter: Hanna Kan >Assignee: Shivu Sondur >Priority: Major > > I found that "spark/sbin/start-slave.sh" may have some error. > line 43 gives--- echo "Usage: ./sbin/start-slave.sh [options] " > but later this script, I found line 59 MASTER=$1 > Is this a conflict? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-28164: -- Priority: Minor (was: Major) > usage description does not match with shell scripts > --- > > Key: SPARK-28164 > URL: https://issues.apache.org/jira/browse/SPARK-28164 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 2.4.3 >Reporter: Hanna Kan >Assignee: Shivu Sondur >Priority: Minor > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > I found that "spark/sbin/start-slave.sh" may have some error. > line 43 gives--- echo "Usage: ./sbin/start-slave.sh [options] " > but later this script, I found line 59 MASTER=$1 > Is this a conflict? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-28157. - Resolution: Resolved Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 > Make SHS clear KVStore LogInfo for the blacklisted entries > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a blacklist for all event log files that failed > once at reading. The blacklisted log files are released back after > CLEAN_INTERVAL_S. > However, files whose size doesn't change are ignored forever because > shouldReloadLog always returns false when the size is the same as the value > in KVStore. This is recovered only via an SHS restart. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
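A toy model of the interaction the ticket describes (names and behavior are simplified assumptions, not the real FsHistoryProvider): if a blacklisted file's KVStore entry is kept, an unchanged file size means the log is never re-read, while clearing the entry on blacklisting lets the size check pass again once the blacklist entry expires after CLEAN_INTERVAL_S:

```python
class HistoryListing:
    """Toy model of the SHS listing logic; not the real implementation."""

    def __init__(self):
        self.kvstore = {}      # log path -> last recorded file size
        self.blacklist = set()

    def should_reload(self, path, size):
        # reload only when the size differs from what KVStore recorded
        return self.kvstore.get(path) != size

    def record(self, path, size):
        self.kvstore[path] = size

    def blacklist_log(self, path):
        self.blacklist.add(path)
        # the fix: drop the KVStore entry so an unchanged size no
        # longer suppresses the reload after the blacklist clears
        self.kvstore.pop(path, None)
```

Without the `pop`, a log blacklisted once (e.g. on a transient permission failure) with a stable size would be skipped until an SHS restart.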
[jira] [Updated] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28157: -- Affects Version/s: 2.3.2 2.3.3 > Make SHS clear KVStore LogInfo for the blacklisted entries > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a blacklist for all event log files failed > once at reading. The blacklisted log files are released back after > CLEAN_INTERVAL_S . > However, the files whose size don't changes are ignored forever because > shouldReloadLog return false always when the size is the same with the value > in KVStore. This is recovered only via SHS restart. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28174) Upgrade to Kafka 2.3.0
Dongjoon Hyun created SPARK-28174: - Summary: Upgrade to Kafka 2.3.0 Key: SPARK-28174 URL: https://issues.apache.org/jira/browse/SPARK-28174 Project: Spark Issue Type: Improvement Components: Build, Structured Streaming Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue updates Kafka dependency to 2.3.0 to bring the following 10 client-side patches at least. https://issues.apache.org/jira/browse/KAFKA-8379?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28165) SHS does not delete old inprogress files until cleaner.maxAge after SHS start time
[ https://issues.apache.org/jira/browse/SPARK-28165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873614#comment-16873614 ] Imran Rashid commented on SPARK-28165: -- btw if anybody wants to investigate this more, here's a simple test case (though as discussed above, we can't just use the modtime, as it's not totally trustworthy): {code}
test("log cleaner for inprogress files before SHS startup") {
  val firstFileModifiedTime = TimeUnit.SECONDS.toMillis(10)
  val secondFileModifiedTime = TimeUnit.SECONDS.toMillis(100)
  val maxAge = TimeUnit.SECONDS.toMillis(40)
  val clock = new ManualClock(0)
  val log1 = newLogFile("inProgressApp1", None, inProgress = true)
  writeFile(log1, true, None,
    SparkListenerApplicationStart(
      "inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1"))
  )
  log1.setLastModified(firstFileModifiedTime)
  val log2 = newLogFile("inProgressApp2", None, inProgress = true)
  writeFile(log2, true, None,
    SparkListenerApplicationStart(
      "inProgressApp2", Some("inProgressApp2"), 23L, "test2", Some("attempt2"))
  )
  log2.setLastModified(secondFileModifiedTime)
  // advance the clock so the first log is expired, but second log is still recent
  clock.setTime(secondFileModifiedTime)
  assert(clock.getTimeMillis() > firstFileModifiedTime + maxAge)
  // start up the SHS
  val provider = new FsHistoryProvider(
    createTestConf().set("spark.history.fs.cleaner.maxAge", s"${maxAge}ms"), clock)
  provider.checkForLogs()
  // We should cleanup one log immediately
  updateAndCheck(provider) { list =>
    assert(list.size === 1)
  }
  assert(!log1.exists())
  assert(log2.exists())
}
{code} > SHS does not delete old inprogress files until cleaner.maxAge after SHS start > time > -- > > Key: SPARK-28165 > URL: https://issues.apache.org/jira/browse/SPARK-28165 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.4.3 >Reporter: Imran Rashid >Priority: Major > > The SHS will not delete inprogress files until > 
{{spark.history.fs.cleaner.maxAge}} time after it has started (7 days by > default), regardless of when the last modification to the file was. This is > particularly problematic if the SHS gets restarted regularly, as then you'll > end up never deleting old files. > There might not be much we can do about this -- we can't really trust the > modification time of the file, as that isn't always updated reliably. > We could take the last time of any event from the file, but then we'd have to > turn off the optimization of SPARK-6951, to avoid reading the entire file > just for the listing. > *WORKAROUND*: have the SHS save state across restarts to local disk by > specifying a path in {{spark.history.store.path}}. It'll still take 7 days > from when you add that config for the cleaning to happen, but from then on > the cleaning should happen reliably. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28174) Upgrade to Kafka 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28174: Assignee: (was: Apache Spark) > Upgrade to Kafka 2.3.0 > -- > > Key: SPARK-28174 > URL: https://issues.apache.org/jira/browse/SPARK-28174 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue updates Kafka dependency to 2.3.0 to bring the following 10 > client-side patches at least. > https://issues.apache.org/jira/browse/KAFKA-8379?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28174) Upgrade to Kafka 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28174: Assignee: Apache Spark > Upgrade to Kafka 2.3.0 > -- > > Key: SPARK-28174 > URL: https://issues.apache.org/jira/browse/SPARK-28174 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > This issue updates Kafka dependency to 2.3.0 to bring the following 10 > client-side patches at least. > https://issues.apache.org/jira/browse/KAFKA-8379?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28174) Upgrade to Kafka 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28174: -- Description: This issue updates Kafka dependency to 2.3.0 to bring the following 10 client-side patches at least. - https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients The following is a full release note. - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html was: This issue updates Kafka dependency to 2.3.0 to bring the following 10 client-side patches at least. https://issues.apache.org/jira/browse/KAFKA-8379?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients > Upgrade to Kafka 2.3.0 > -- > > Key: SPARK-28174 > URL: https://issues.apache.org/jira/browse/SPARK-28174 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue updates Kafka dependency to 2.3.0 to bring the following 10 > client-side patches at least. > - > https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients > The following is a full release note. > - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28174) Upgrade to Kafka 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28174: -- Description: This issue updates Kafka dependency to 2.3.0 to bring the following 9 client-side patches at least. - https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients The following is a full release note. - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html was: This issue updates Kafka dependency to 2.3.0 to bring the following 10 client-side patches at least. - https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients The following is a full release note. - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html > Upgrade to Kafka 2.3.0 > -- > > Key: SPARK-28174 > URL: https://issues.apache.org/jira/browse/SPARK-28174 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue updates Kafka dependency to 2.3.0 to bring the following 9 > client-side patches at least. > - > https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients > The following is a full release note. > - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28174) Upgrade to Kafka 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28174: -- Description: This issue updates Kafka dependency to 2.3.0 to bring the following 9 client-side patches at least. - https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20fixVersion%20NOT%20IN%20(2.2.0%2C%202.2.1)%20AND%20component%20%3D%20clients The following is a full release note. - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html was: This issue updates Kafka dependency to 2.3.0 to bring the following 9 client-side patches at least. - https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients The following is a full release note. - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html > Upgrade to Kafka 2.3.0 > -- > > Key: SPARK-28174 > URL: https://issues.apache.org/jira/browse/SPARK-28174 > Project: Spark > Issue Type: Improvement > Components: Build, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue updates Kafka dependency to 2.3.0 to bring the following 9 > client-side patches at least. > - > https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20fixVersion%20NOT%20IN%20(2.2.0%2C%202.2.1)%20AND%20component%20%3D%20clients > The following is a full release note. > - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27992) PySpark socket server should sync with JVM connection thread future
[ https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-27992: Assignee: Bryan Cutler > PySpark socket server should sync with JVM connection thread future > --- > > Key: SPARK-27992 > URL: https://issues.apache.org/jira/browse/SPARK-27992 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark > job are not propagated to Python. This is because toLocalIterator() and > toPandas() with Arrow enabled run Spark jobs asynchronously in a background > thread, after creating the socket connection info. The fix for these was to > catch a SparkException if the job errored and then send the exception through > the pyspark serializer. > A better fix would be to allow Python to await on the serving thread future > and join the thread. That way if the serving thread throws an exception, it > will be propagated on the call to awaitResult. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27992) PySpark socket server should sync with JVM connection thread future
[ https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-27992. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24834 [https://github.com/apache/spark/pull/24834] > PySpark socket server should sync with JVM connection thread future > --- > > Key: SPARK-27992 > URL: https://issues.apache.org/jira/browse/SPARK-27992 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark > job are not propagated to Python. This is because toLocalIterator() and > toPandas() with Arrow enabled run Spark jobs asynchronously in a background > thread, after creating the socket connection info. The fix for these was to > catch a SparkException if the job errored and then send the exception through > the pyspark serializer. > A better fix would be to allow Python to await on the serving thread future > and join the thread. That way if the serving thread throws an exception, it > will be propagated on the call to awaitResult. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
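The shape of the fix can be sketched with Python's own futures (a hedged analogy only; the actual change has the Python client join the JVM serving thread via awaitResult): instead of letting a background serving thread fail silently, the consumer joins its future, so the exception resurfaces at the call site:

```python
import concurrent.futures

def serve_partitions():
    # stand-in for the background serving thread; assume the job fails
    raise RuntimeError("job aborted")

# Run the job asynchronously, as toLocalIterator()/toPandas() do after
# creating the socket connection info, then join the future: an
# exception in the serving thread surfaces at result() instead of
# being lost in the background.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(serve_partitions)
try:
    future.result()          # the awaitResult-style join
except RuntimeError as e:
    error = str(e)
executor.shutdown()
```

This is why joining the thread future is preferable to catching a SparkException and forwarding it through the serializer: any failure mode propagates, not only the ones explicitly caught.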
[jira] [Created] (SPARK-28175) Exception when using userClassPathFirst
Shan Zhao created SPARK-28175: - Summary: Exception when using userClassPathFirst Key: SPARK-28175 URL: https://issues.apache.org/jira/browse/SPARK-28175 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.3.0, 2.0.0 Reporter: Shan Zhao We want to override the executor/driver spark version with our own spark version in a fat jar. Conf: spark.driver.userClassPathFirst=true spark.executor.userClassPathFirst=true, Build: spark 2.3.0 Deploy: cluster mode with spark 2.0.0 installed We encounter the following error: {code:java} Exception in thread "main" org.apache.spark.SparkException: Unable to load YARN support at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392) at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387) at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387) at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412) at org.apache.spark.deploy.yarn.Client.(Client.scala:69) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
org.apache.hadoop.security.GroupMappingServiceProvider at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273) at org.apache.hadoop.security.Groups.(Groups.java:99) at org.apache.hadoop.security.Groups.(Groups.java:95) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352) at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:51) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:49) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at java.lang.Class.newInstance(Class.java:374) at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389) ... 15 more Caused by: java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not org.apache.hadoop.security.GroupMappingServiceProvider at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267) ... 28 more{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28176) Add Dataset.collect(PartialFunction) method for parity with RDD API
Josh Rosen created SPARK-28176: -- Summary: Add Dataset.collect(PartialFunction) method for parity with RDD API Key: SPARK-28176 URL: https://issues.apache.org/jira/browse/SPARK-28176 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Josh Rosen RDDs have a {{collect(PartialFunction)}} method which behaves like the equivalent Scala collections method (see [https://github.com/apache/spark/commit/f4e6b9361ffeec1018d5834f09db9fd86f2ba7bd]). It would be nice if Datasets had a similar method (to ease migration of RDD code to Datasets). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
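[Editor's note] Scala's {{collect(pf)}} keeps only the elements on which the partial function is defined and applies it to those; until a Dataset equivalent exists, the same result requires a filter followed by a map. A minimal Python sketch of the semantics being requested (illustrative only, not the proposed API):

```python
def collect_partial(items, is_defined, transform):
    # Emulates Scala's collect(pf): keep only the elements where the
    # partial function is defined, and apply it to those elements.
    return [transform(x) for x in items if is_defined(x)]

# Example: collect the integers out of a mixed sequence, doubling them.
mixed = [1, "a", 2, None, 3]
result = collect_partial(mixed, lambda x: isinstance(x, int), lambda x: x * 2)
# result == [2, 4, 6]
```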
[jira] [Commented] (SPARK-27292) Spark Job Fails with Unknown Error writing to S3 from AWS EMR
[ https://issues.apache.org/jira/browse/SPARK-27292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873670#comment-16873670 ] Padmakar commented on SPARK-27292: -- Hi, am facing a similar issue on my 10 node cluster: {color:#205081}org.apache.hadoop.hdfs.DataStreamer: Exception for BP-181186916-10.138.0.22-1561568502844:blk_1073741840_1016{color} {color:#205081}java.io.EOFException: Unexpected EOF while trying to read response from server{color} {color:#205081} at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:453){color} {color:#205081} at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213){color} {color:#205081} at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1080){color} {color:#205081}19/06/26 21:18:24 WARN org.apache.hadoop.hdfs.DataStreamer: Error Recovery for BP-181186916-10.138.0.22-1561568502844:blk_1073741840_1016 in pipeline [DatanodeInfoWithStorage[10.138.0.23:9866,DS-197cd21a-3e5f-46cc-9bbb-fda94b2ba6c0,DISK], DatanodeInfoWithStorage[10.138.15.250:9866,DS-c2bc521d-37b0-43ea-a80d-191037fcb8a7,DISK]]: datanode 0(DatanodeInfoWithStorage[10.138.0.23:9866,DS-197cd21a-3e5f-46cc-9bbb-fda94b2ba6c0,DISK]) is bad.{color} {color:#205081}was there a fix or any solution or any insights on this issue? Please let me know{color} > Spark Job Fails with Unknown Error writing to S3 from AWS EMR > - > > Key: SPARK-27292 > URL: https://issues.apache.org/jira/browse/SPARK-27292 > Project: Spark > Issue Type: Question > Components: Input/Output >Affects Versions: 2.3.2 >Reporter: Olalekan Elesin >Priority: Major > > I am currently experiencing issues writing data to S3 from my Spark Job > running on AWS EMR. > The job writings to some staging path in S3 e.g > \{{.spark-random-alphanumeric}}. 
After which it fails with this error: > {code:java} > 9/03/26 10:54:07 WARN AsyncEventQueue: Dropped 196300 events from appStatus > since Tue Mar 26 10:52:05 UTC 2019. > 19/03/26 10:55:07 WARN AsyncEventQueue: Dropped 211186 events from appStatus > since Tue Mar 26 10:54:07 UTC 2019. > 19/03/26 11:37:09 WARN DataStreamer: Exception for > BP-312054361-10.41.97.71-1553586781241:blk_1073742995_2172 > java.io.EOFException: Unexpected EOF while trying to read response from server > at > org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402) > at > org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213) > at > org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073) > 19/03/26 11:37:09 WARN DataStreamer: Error Recovery for > BP-312054361-10.41.97.71-1553586781241:blk_1073742995_2172 in pipeline > [DatanodeInfoWithStorage[10.41.121.135:50010,DS-cba2a850-fa30-4933-af2a-05b40b58fdb5,DISK], > > DatanodeInfoWithStorage[10.41.71.181:50010,DS-c90a1d87-b40a-4928-a709-1aef027db65a,DISK]]: > datanode > 0(DatanodeInfoWithStorage[10.41.121.135:50010,DS-cba2a850-fa30-4933-af2a-05b40b58fdb5,DISK]) > is bad. > 19/03/26 11:50:34 WARN AsyncEventQueue: Dropped 157572 events from appStatus > since Tue Mar 26 10:55:07 UTC 2019. > 19/03/26 11:51:34 WARN AsyncEventQueue: Dropped 785 events from appStatus > since Tue Mar 26 11:50:34 UTC 2019. > 19/03/26 11:52:34 WARN AsyncEventQueue: Dropped 656 events from appStatus > since Tue Mar 26 11:51:34 UTC 2019. > 19/03/26 11:53:35 WARN AsyncEventQueue: Dropped 1335 events from appStatus > since Tue Mar 26 11:52:34 UTC 2019. > 19/03/26 11:54:35 WARN AsyncEventQueue: Dropped 1087 events from appStatus > since Tue Mar 26 11:53:35 UTC 2019. > ... > 19/03/26 13:39:39 WARN TaskSetManager: Lost task 33302.0 in stage 1444.0 (TID > 1324427, ip-10-41-122-224.eu-west-1.compute.internal, executor 18): > org.apache.spark.SparkException: Task failed while writing rows. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolEx
[jira] [Created] (SPARK-28177) Adjust post shuffle partition number in adaptive execution
Carson Wang created SPARK-28177: --- Summary: Adjust post shuffle partition number in adaptive execution Key: SPARK-28177 URL: https://issues.apache.org/jira/browse/SPARK-28177 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Carson Wang Implement a rule in the new adaptive execution framework introduced in SPARK-23128. This rule is used to adjust the post shuffle partitions based on the map output statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
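[Editor's note] The core idea of such a rule is to merge small post-shuffle partitions until each merged group reaches a target size, using the per-partition byte counts from the map output statistics. A minimal greedy sketch of that merging step, assuming a simple target-size policy (the actual rule in the SPARK-23128 framework is more involved):

```python
def coalesce_partitions(sizes, target_bytes):
    # sizes: per-partition byte counts from map output statistics.
    # Greedily merge adjacent partitions until each group reaches the target.
    groups, current, acc = [], [], 0
    for i, size in enumerate(sizes):
        current.append(i)
        acc += size
        if acc >= target_bytes:
            groups.append(current)
            current, acc = [], 0
    if current:  # leftover partitions form the final group
        groups.append(current)
    return groups

# Three 10-byte partitions merge into one 30-byte group; the 100-byte
# partition stays alone; the trailing 5-byte partition becomes a group.
print(coalesce_partitions([10, 10, 10, 100, 5], 30))  # [[0, 1, 2], [3], [4]]
```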
[jira] [Assigned] (SPARK-28177) Adjust post shuffle partition number in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28177: Assignee: Apache Spark > Adjust post shuffle partition number in adaptive execution > -- > > Key: SPARK-28177 > URL: https://issues.apache.org/jira/browse/SPARK-28177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Carson Wang >Assignee: Apache Spark >Priority: Major > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to adjust the post shuffle partitions based on > the map output statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28177) Adjust post shuffle partition number in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28177: Assignee: (was: Apache Spark) > Adjust post shuffle partition number in adaptive execution > -- > > Key: SPARK-28177 > URL: https://issues.apache.org/jira/browse/SPARK-28177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Carson Wang >Priority: Major > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to adjust the post shuffle partitions based on > the map output statistics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873696#comment-16873696 ] Shivu Sondur commented on SPARK-28036: -- [~yumwang] Can you close this JIRA? It works in Spark; the only difference is that the offset should be passed without the '-' sign. > Built-in udf left/right has inconsistent behavior > - > > Key: SPARK-28036 > URL: https://issues.apache.org/jira/browse/SPARK-28036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {code:sql} > postgres=# select left('ahoj', -2), right('ahoj', -2); > left | right > --+--- > ah | oj > (1 row) > {code} > Spark SQL: > {code:sql} > spark-sql> select left('ahoj', -2), right('ahoj', -2); > spark-sql> > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
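[Editor's note] In PostgreSQL, a negative length for left/right drops characters from the opposite end instead of returning nothing, so left('ahoj', -2) is 'ah' and right('ahoj', -2) is 'oj'. A small Python sketch of that semantics (for comparison only, not Spark code):

```python
def pg_left(s, n):
    # Positive n: first n characters. Negative n: all but the last |n|.
    # Python slicing happens to cover both cases (n == 0 yields '').
    return s[:n]

def pg_right(s, n):
    # Positive n: last n characters. Negative n: all but the first |n|.
    return '' if n == 0 else s[-n:]

assert (pg_left('ahoj', -2), pg_right('ahoj', -2)) == ('ah', 'oj')
assert (pg_left('ahoj', 2), pg_right('ahoj', 2)) == ('ah', 'oj')
```

Note that the positive and negative offsets agree here only because the example string has length 4; in general, dropping |n| characters from one end is not the same as keeping n from the other.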
[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27845: --- Description: Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl was: Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl * DataFrameWriter.insertInto("catalog.db.tbl") > DataSourceV2: InsertTable > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multiple catalogs in the following InsertInto use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27845: --- Description: Support multiple catalogs in the following use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl was: Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl > DataSourceV2: InsertTable > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multiple catalogs in the following use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInfo
John Zhuge created SPARK-28178: -- Summary: DataSourceV2: DataFrameWriter.insertInfo Key: SPARK-28178 URL: https://issues.apache.org/jira/browse/SPARK-28178 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl * DataFrameWriter.insertInto("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInfo
[ https://issues.apache.org/jira/browse/SPARK-28178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-28178: --- Description: Support multiple catalogs in the following InsertInto use cases: * DataFrameWriter.insertInto("catalog.db.tbl") was: Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl * DataFrameWriter.insertInto("catalog.db.tbl") > DataSourceV2: DataFrameWriter.insertInfo > > > Key: SPARK-28178 > URL: https://issues.apache.org/jira/browse/SPARK-28178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multiple catalogs in the following InsertInto use cases: > * DataFrameWriter.insertInto("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28175) Exception when using userClassPathFirst
[ https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873762#comment-16873762 ] Hyukjin Kwon commented on SPARK-28175: -- Are you able to test this against higher version of Spark? > Exception when using userClassPathFirst > --- > > Key: SPARK-28175 > URL: https://issues.apache.org/jira/browse/SPARK-28175 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0, 2.3.0 >Reporter: Shan Zhao >Priority: Major > > We want to override the executor/driver spark version with our own spark > version in a fat jar. > Conf: > spark.driver.userClassPathFirst=true > spark.executor.userClassPathFirst=true, > Build: spark 2.3.0 > > Deploy: cluster mode with spark 2.0.0 installed > We encounter the following error: > {code:java} > Exception in thread "main" org.apache.spark.SparkException: Unable to load > YARN support > at > org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392) > at > org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387) > at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387) > at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412) > at org.apache.spark.deploy.yarn.Client.(Client.scala:69) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class > org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not > org.apache.hadoop.security.GroupMappingServiceProvider > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273) > at org.apache.hadoop.security.Groups.(Groups.java:99) > at org.apache.hadoop.security.Groups.(Groups.java:95) > at > org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324) > at > org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352) > at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:51) > at > org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:49) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at java.lang.Class.newInstance(Class.java:374) > at > org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389) > ... 15 more > Caused by: java.lang.RuntimeException: class > org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not > org.apache.hadoop.security.GroupMappingServiceProvider > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267) > ... 28 more{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27369) Standalone worker can load resource conf and discover resources
[ https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingbo Jiang resolved SPARK-27369. -- Resolution: Done > Standalone worker can load resource conf and discover resources > --- > > Key: SPARK-27369 > URL: https://issues.apache.org/jira/browse/SPARK-27369 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: wuyi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28175) Exception when using userClassPathFirst
[ https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28175: - Labels: (was: version) > Exception when using userClassPathFirst > --- > > Key: SPARK-28175 > URL: https://issues.apache.org/jira/browse/SPARK-28175 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0, 2.3.0 >Reporter: Shan Zhao >Priority: Major > > We want to override the executor/driver spark version with our own spark > version in a fat jar. > Conf: > spark.driver.userClassPathFirst=true > spark.executor.userClassPathFirst=true, > Build: spark 2.3.0 > > Deploy: cluster mode with spark 2.0.0 installed > We encounter the following error: > {code:java} > Exception in thread "main" org.apache.spark.SparkException: Unable to load > YARN support > at > org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392) > at > org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387) > at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387) > at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412) > at org.apache.spark.deploy.yarn.Client.(Client.scala:69) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class > org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not > org.apache.hadoop.security.GroupMappingServiceProvider > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273) > at org.apache.hadoop.security.Groups.(Groups.java:99) > at org.apache.hadoop.security.Groups.(Groups.java:95) > at > org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324) > at > org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352) > at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:51) > at > org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:49) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at java.lang.Class.newInstance(Class.java:374) > at > org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389) > ... 15 more > Caused by: java.lang.RuntimeException: class > org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not > org.apache.hadoop.security.GroupMappingServiceProvider > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267) > ... 28 more{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28172) pyspark DataFrame equality operator
[ https://issues.apache.org/jira/browse/SPARK-28172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873764#comment-16873764 ] Hyukjin Kwon commented on SPARK-28172: -- {code} >>> df1 == df2 True {code} Wouldn't this require to execute both DataFrames and collect the data into driver side? When the datasets are large, it's very easy for users to shoot them in the foot. I won't do that without an explicit plan and design doc for all other operators. > pyspark DataFrame equality operator > --- > > Key: SPARK-28172 > URL: https://issues.apache.org/jira/browse/SPARK-28172 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Hugo >Priority: Minor > > *Motivation:* > Facilitating testing equality between DataFrames. Many approaches for > applying TDD practices in data science require checking function output > against expected output (in similarity to API snapshots). Having an equality > operator or something alike would make this easier. > A basic example: > {code} > from pyspark.ml.feature import Imputer > def create_mock_missing_df(): >return spark.createDataFrame([("a", 1.0),("b", 2.0),("c", float('nan'))], > ['COL1', 'COl2']) > def test_imputation(): >"""Test mean value imputation""" >df1 = create_mock_missing_df() >#load snapshot >pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect() >df2 = spark.createDataFrame(pickled_snapshot) >""" >>>> df2.show() >+++ >|COL1|COL2_imputed| >+++ >| a | 1.0| >| b | 2.0| >| c | 1.5| >+++ >""" >imputer = Imputer( > inputCols=['COL2'], > outputCols=['COL2_imputed'] >) >df1 = imputer.fit(df1).transform(df1) >df1 = df1.drop('COL2') >assert df1 == df2 > {code} > > Suggested change: > {code} > class DataFrame(object): >... >def __eq__(self, other): > """Returns ``True`` if DataFrame content is equal to other. 
> >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)]) > >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)]) > >>> df1 == df2 > True > """ > return self.unionAll(other) \ > .subtract(self.intersect(other)) \ > .count() == 0 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
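[Editor's note] Beyond the driver-side collection cost raised in the comment, the proposed {{__eq__}} is set-based: subtract and intersect ignore row multiplicity, so two DataFrames with the same distinct rows but different duplicate counts would compare equal. A duplicate-aware comparison on locally collected rows needs multiset semantics, sketched here in plain Python:

```python
from collections import Counter

def rows_equal(rows_a, rows_b):
    # Multiset equality: order-insensitive but duplicate-sensitive, unlike
    # the set-based union/subtract/intersect approach in the proposal.
    return Counter(rows_a) == Counter(rows_b)

# Same distinct rows, different multiplicities: unequal here, but the
# set-based check would report them as equal.
assert not rows_equal([('a', 1), ('a', 1)], [('a', 1)])
assert rows_equal([('b', 2), ('a', 1)], [('a', 1), ('b', 2)])
```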
[jira] [Resolved] (SPARK-28172) pyspark DataFrame equality operator
[ https://issues.apache.org/jira/browse/SPARK-28172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28172. -- Resolution: Duplicate > pyspark DataFrame equality operator > --- > > Key: SPARK-28172 > URL: https://issues.apache.org/jira/browse/SPARK-28172 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Hugo >Priority: Minor > > *Motivation:* > Facilitating testing equality between DataFrames. Many approaches for > applying TDD practices in data science require checking function output > against expected output (in similarity to API snapshots). Having an equality > operator or something alike would make this easier. > A basic example: > {code} > from pyspark.ml.feature import Imputer > def create_mock_missing_df(): >return spark.createDataFrame([("a", 1.0),("b", 2.0),("c", float('nan'))], > ['COL1', 'COl2']) > def test_imputation(): >"""Test mean value imputation""" >df1 = create_mock_missing_df() >#load snapshot >pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect() >df2 = spark.createDataFrame(pickled_snapshot) >""" >>>> df2.show() >+++ >|COL1|COL2_imputed| >+++ >| a | 1.0| >| b | 2.0| >| c | 1.5| >+++ >""" >imputer = Imputer( > inputCols=['COL2'], > outputCols=['COL2_imputed'] >) >df1 = imputer.fit(df1).transform(df1) >df1 = df1.drop('COL2') >assert df1 == df2 > {code} > > Suggested change: > {code} > class DataFrame(object): >... >def __eq__(self, other): > """Returns ``True`` if DataFrame content is equal to other. > >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)]) > >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)]) > >>> df1 == df2 > True > """ > return self.unionAll(other) \ > .subtract(self.intersect(other)) \ > .count() == 0 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28171) Correct interval type conversion behavior
[ https://issues.apache.org/jira/browse/SPARK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28171: Issue Type: Sub-task (was: Bug) Parent: SPARK-27764 > Correct interval type conversion behavior > -- > > Key: SPARK-28171 > URL: https://issues.apache.org/jira/browse/SPARK-28171 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > When calculate between date and interval. > {code:sql} > select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, > timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second > {code} > * PostgreSQL return *2019-01-02 02:03:04* *2018-12-31 02:03:04* > * SparkSQL return *2019-01-02 02:03:04* *2018-12-30 21:56:56* > {code:sql} > select timestamp '2019-01-01 00:00:00' + interval '1 -2:03:04' day to second, > timestamp '2019-01-01 00:00:00' + interval '-1 -2:03:04' day to second > {code} > * PostgreSQL return *2019-01-01 21:56:56* *2018-12-30 21:56:56* > * SparkSQL return _*Interval string does not match day-time format of 'd > h:m:s.n': '1 -2:03:04'(line 1, pos 50)*_ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28171) Difference of interval type conversion between SparkSQL and PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28171: Summary: Difference of interval type conversion between SparkSQL and PostgreSQL (was: Correct interval type conversion behavior ) > Difference of interval type conversion between SparkSQL and PostgreSQL > -- > > Key: SPARK-28171 > URL: https://issues.apache.org/jira/browse/SPARK-28171 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > When calculate between date and interval. > {code:sql} > select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, > timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second > {code} > * PostgreSQL return *2019-01-02 02:03:04* *2018-12-31 02:03:04* > * SparkSQL return *2019-01-02 02:03:04* *2018-12-30 21:56:56* > {code:sql} > select timestamp '2019-01-01 00:00:00' + interval '1 -2:03:04' day to second, > timestamp '2019-01-01 00:00:00' + interval '-1 -2:03:04' day to second > {code} > * PostgreSQL return *2019-01-01 21:56:56* *2018-12-30 21:56:56* > * SparkSQL return _*Interval string does not match day-time format of 'd > h:m:s.n': '1 -2:03:04'(line 1, pos 50)*_ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
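[Editor's note] The PostgreSQL values quoted in the ticket follow from applying the sign of the day field and the sign of the time field independently, rather than negating the whole interval. A minimal sketch with the standard library (not Spark code) that reproduces the quoted results:

```python
from datetime import datetime, timedelta

def day_to_second(days, hours, minutes, seconds):
    # Day and time components carry their own signs independently,
    # matching the PostgreSQL behavior described above.
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

ts = datetime(2019, 1, 1)
# interval '1 -2:03:04'  -> +1 day, then -2:03:04
assert ts + day_to_second(1, -2, -3, -4) == datetime(2019, 1, 1, 21, 56, 56)
# interval '-1 2:03:04'  -> -1 day, then +2:03:04
assert ts + day_to_second(-1, 2, 3, 4) == datetime(2018, 12, 31, 2, 3, 4)
# interval '-1 -2:03:04' -> -1 day, then -2:03:04
assert ts + day_to_second(-1, -2, -3, -4) == datetime(2018, 12, 30, 21, 56, 56)
```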
[jira] [Updated] (SPARK-28168) release-build.sh for Hadoop-3.2
[ https://issues.apache.org/jira/browse/SPARK-28168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28168: - Description: Hadoop-3.2 release process in o release-build.sh. (was: Add Hadoop-3.2 to release-build.sh?) > release-build.sh for Hadoop-3.2 > --- > > Key: SPARK-28168 > URL: https://issues.apache.org/jira/browse/SPARK-28168 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Hadoop-3.2 release process in o release-build.sh. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28168) release-build.sh for Hadoop-3.2
[ https://issues.apache.org/jira/browse/SPARK-28168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28168: - Description: Hadoop 3.2 support was added. We might have to add Hadoop-3.2 in release-build.sh. (was: Hadoop-3.2 release process in o release-build.sh.) > release-build.sh for Hadoop-3.2 > --- > > Key: SPARK-28168 > URL: https://issues.apache.org/jira/browse/SPARK-28168 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Hadoop 3.2 support was added. We might have to add Hadoop-3.2 in > release-build.sh. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase
Yuming Wang created SPARK-28179: --- Summary: Avoid hard-coded config: spark.sql.globalTempDatabase Key: SPARK-28179 URL: https://issues.apache.org/jira/browse/SPARK-28179 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase
[ https://issues.apache.org/jira/browse/SPARK-28179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28179: Assignee: Apache Spark > Avoid hard-coded config: spark.sql.globalTempDatabase > - > > Key: SPARK-28179 > URL: https://issues.apache.org/jira/browse/SPARK-28179 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase
[ https://issues.apache.org/jira/browse/SPARK-28179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28179: Assignee: (was: Apache Spark) > Avoid hard-coded config: spark.sql.globalTempDatabase > - > > Key: SPARK-28179 > URL: https://issues.apache.org/jira/browse/SPARK-28179 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major >
[jira] [Updated] (SPARK-28170) DenseVector .toArray() and .values documentation do not specify they are aliases
[ https://issues.apache.org/jira/browse/SPARK-28170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28170: - Component/s: PySpark > DenseVector .toArray() and .values documentation do not specify they are > aliases > > > Key: SPARK-28170 > URL: https://issues.apache.org/jira/browse/SPARK-28170 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 2.4.3 >Reporter: Sivam Pasupathipillai >Priority: Minor > > The documentation of the *toArray()* method and the *values* property in > pyspark.ml.linalg.DenseVector is confusing. > *toArray()*: Returns a numpy.ndarray > *values*: Returns a list of values > However, they are actually aliases and they both return a numpy.ndarray. > FIX: either change the documentation or change the *values* property to > return a Python list.
[jira] [Resolved] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28152. -- Resolution: Duplicate > ShortType and FloatTypes are not correctly mapped to right JDBC types when > using JDBC connector > --- > > Key: SPARK-28152 > URL: https://issues.apache.org/jira/browse/SPARK-28152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Shiv Prashant Sood >Priority: Minor > > ShortType and FloatType are not correctly mapped to the right JDBC types when > using the JDBC connector. This results in tables and Spark data frames being > created with unintended types. > Some example issues: > * A write from a df with a ShortType column results in a SQL table with column type > INTEGER as opposed to SMALLINT, thus a larger table than expected. > * A read results in a dataframe with type INTEGER as opposed to ShortType. > FloatType has an issue on the read path. On the write path, the Spark data type > 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But on > the read path, when JDBC data types need to be converted to Catalyst data > types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' > rather than 'FloatType'. >
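The mapping the report asks for can be sketched as a plain lookup table. This is an illustration only, not Spark's actual dialect code (which lives in Scala, in `JdbcUtils` and the `getCatalystType` path the reporter names); it just makes the intended correspondence explicit:

```java
import java.sql.Types;
import java.util.HashMap;
import java.util.Map;

// Illustrative two-way type table for the cases raised in SPARK-28152.
public class JdbcTypeMapping {
    // Write path: Catalyst type name -> java.sql.Types constant.
    static final Map<String, Integer> TO_JDBC = new HashMap<>();
    // Read path: java.sql.Types constant -> Catalyst type name.
    static final Map<Integer, String> TO_CATALYST = new HashMap<>();

    static {
        TO_JDBC.put("ShortType", Types.SMALLINT);     // issue: was written as INTEGER
        TO_JDBC.put("FloatType", Types.REAL);         // write path already correct
        TO_CATALYST.put(Types.SMALLINT, "ShortType"); // issue: was read back as IntegerType
        TO_CATALYST.put(Types.REAL, "FloatType");     // issue: was read back as DoubleType
    }

    public static void main(String[] args) {
        System.out.println(TO_JDBC.get("ShortType") == Types.SMALLINT);
        System.out.println(TO_CATALYST.get(Types.REAL));
    }
}
```

With a table like this applied symmetrically, a round trip (write then read) preserves ShortType and FloatType instead of widening them.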
[jira] [Commented] (SPARK-28175) Exception when using userClassPathFirst
[ https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873800#comment-16873800 ] Shan Zhao commented on SPARK-28175: --- No, since only 2.0.0 is available in our cluster. > Exception when using userClassPathFirst > --- > > Key: SPARK-28175 > URL: https://issues.apache.org/jira/browse/SPARK-28175 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0, 2.3.0 >Reporter: Shan Zhao >Priority: Major > > We want to override the executor/driver spark version with our own spark > version in a fat jar. > Conf: > spark.driver.userClassPathFirst=true > spark.executor.userClassPathFirst=true, > Build: spark 2.3.0 > > Deploy: cluster mode with spark 2.0.0 installed > We encounter the following error: > {code:java} > Exception in thread "main" org.apache.spark.SparkException: Unable to load > YARN support > at > org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392) > at > org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387) > at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387) > at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412) > at org.apache.spark.deploy.yarn.Client.(Client.scala:69) > at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290) > at org.apache.spark.deploy.yarn.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750) > at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class > org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not > org.apache.hadoop.security.GroupMappingServiceProvider > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273) > at org.apache.hadoop.security.Groups.(Groups.java:99) > at org.apache.hadoop.security.Groups.(Groups.java:95) > at > org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324) > at > org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352) > at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:51) > at > org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:49) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at java.lang.Class.newInstance(Class.java:374) > at > org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389) > ... 15 more > Caused by: java.lang.RuntimeException: class > org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not > org.apache.hadoop.security.GroupMappingServiceProvider > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267) > ... 28 more{code}
[jira] [Commented] (SPARK-28175) Exception when using userClassPathFirst
[ https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873801#comment-16873801 ] Hyukjin Kwon commented on SPARK-28175: -- hm, then was it not tested in Spark 2.3? > Exception when using userClassPathFirst > --- > > Key: SPARK-28175 > URL: https://issues.apache.org/jira/browse/SPARK-28175 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0, 2.3.0 >Reporter: Shan Zhao >Priority: Major >
[jira] [Commented] (SPARK-28175) Exception when using userClassPathFirst
[ https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873804#comment-16873804 ] Shan Zhao commented on SPARK-28175: --- We have another cluster with spark 2.3, and apparently it doesn't need the two "userClassPathFirst" settings; we haven't tested the code with the "userClassPathFirst" conf in spark 2.3. So what we're trying to do is to compile the code in spark 2.3 and be able to run it in both spark 2.0 and spark 2.3 clusters, if that makes sense. > Exception when using userClassPathFirst > --- > > Key: SPARK-28175 > URL: https://issues.apache.org/jira/browse/SPARK-28175 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.0, 2.3.0 >Reporter: Shan Zhao >Priority: Major >
[jira] [Commented] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
[ https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873813#comment-16873813 ] Hyukjin Kwon commented on SPARK-28161: -- Can you show the full error messages? Also, let's ask questions on the mailing list before filing an issue; you would get a better answer there. I don't think it's an issue within Spark for now; it rather sounds like an environment issue. > Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212 > -- > > Key: SPARK-28161 > URL: https://issues.apache.org/jira/browse/SPARK-28161 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.3 > Environment: {code} > # Dockerfile > # Pull base image. > FROM ubuntu:16.04 > RUN apt update --fix-missing > RUN apt-get install -y software-properties-common > RUN mkdir /usr/java > ADD jdk-8u212-linux-x64.tar /usr/java > ENV JAVA_HOME=/usr/java/jdk1.8.0_212 > RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java > 2 > RUN update-alternatives --install /usr/bin/javac javac > ${JAVA_HOME%*/}/bin/javac 2 > ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin" > ENV MAVEN_VERSION 3.6.1 > RUN apt-get install -y curl wget > RUN curl -fsSL > [http://archive.apache.org/dist/maven/maven-3/$]{MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz > | tar xzf - -C /usr/share \ > && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \ > && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn > ENV MAVEN_HOME /usr/share/maven > ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m" > ENV SPARK_SRC="/usr/src/spark" > ENV BRANCH="v2.4.3" > RUN apt-get update && apt-get install -y --no-install-recommends \ > git python3 python3-setuptools r-base-dev r-cran-evaluate > RUN mkdir -p $SPARK_SRC > RUN git clone --branch $BRANCH [https://github.com/apache/spark] $SPARK_SRC > WORKDIR $SPARK_SRC > RUN ./build/mvn -DskipTests clean package > {code} >Reporter: Martin Nigsch >Priority: Minor > > Trying
to build spark from source based on the Dockerfile attached locally > (launched on docker on OSX) fails. > Attempts to change/add the following things beyond what's recommended on the > build page do not bring improvement: > 1. adding ```*RUN ./dev/change-scala-version.sh 2.11```* --> doesn't help > 2. editing the pom.xml to exclude zinc as in one of the answers in > [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558] > --> doesn't help > 3. adding options -DrecompileMode=all --> doesn't help > I've downloaded the java from Oracle directly ( jdk-8u212-linux-x64.tar ) > which is manually put into /usr/java as the Oracle java seems to be > recommended. > Build fails at project streaming with: > {code} > [INFO] > > [INFO] Reactor Summary for Spark Project Parent POM 2.4.3: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 59.875 > s] > [INFO] Spark Project Tags . SUCCESS [ 20.386 > s] > [INFO] Spark Project Sketch ... SUCCESS [ 3.026 s] > [INFO] Spark Project Local DB . SUCCESS [ 5.654 s] > [INFO] Spark Project Networking ... SUCCESS [ 7.401 s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 3.400 s] > [INFO] Spark Project Unsafe ... SUCCESS [ 6.306 s] > [INFO] Spark Project Launcher . SUCCESS [ 17.471 > s] > [INFO] Spark Project Core . SUCCESS [02:36 > min] > [INFO] Spark Project ML Local Library . SUCCESS [ 50.313 > s] > [INFO] Spark Project GraphX ... SUCCESS [ 21.097 > s] > [INFO] Spark Project Streaming SUCCESS [ 52.537 > s] > [INFO] Spark Project Catalyst . SUCCESS [02:44 > min] > [INFO] Spark Project SQL .. FAILURE [10:44 > min] > [INFO] Spark Project ML Library ... SKIPPED > [INFO] Spark Project Tools SKIPPED > [INFO] Spark Project Hive . SKIPPED > [INFO] Spark Project REPL . SKIPPED > [INFO] Spark Project Assembly . SKIPPED > [INFO] Spark Integration for Kafka 0.10 ... SKIPPED > [INFO] Kafka 0.10+ Source for Structured Streaming SKIPPED > [INFO] Spark Project Examples . 
SKIPPED > [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED > [INFO] Spark Avro ...
[jira] [Resolved] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
[ https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28161. -- Resolution: Invalid > Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212 > -- > > Key: SPARK-28161 > URL: https://issues.apache.org/jira/browse/SPARK-28161 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.3 >Reporter: Martin Nigsch >Priority: Minor >
[jira] [Created] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset
M. Le Bihan created SPARK-28180: --- Summary: Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset Key: SPARK-28180 URL: https://issues.apache.org/jira/browse/SPARK-28180 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Environment: Debian 9, Java 8. Reporter: M. Le Bihan I am converting an _RDD_ spark program to a _Dataset_ one. Previously, it converted a CSV file, mapped with the help of a Jackson loader, to an RDD of Enterprise objects with Encoders.bean(Entreprise.class); now it does the conversion more simply, by loading the CSV content into a Dataset and applying the _Encoders.bean(Entreprise.class)_ on it. {code:java} Dataset csv = this.session.read().format("csv") .option("header","true").option("quote", "\"").option("escape", "\"") .load(source.getAbsolutePath()) .selectExpr( "ActivitePrincipaleUniteLegale as ActivitePrincipale", "CAST(AnneeCategorieEntreprise as INTEGER) as AnneeCategorieEntreprise", "CAST(AnneeEffectifsUniteLegale as INTEGER) as AnneeValiditeEffectifSalarie", "CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as CaractereEmployeur", "CategorieEntreprise", "CategorieJuridiqueUniteLegale as CategorieJuridique", "DateCreationUniteLegale as DateCreationEntreprise", "DateDebut as DateDebutHistorisation", "DateDernierTraitementUniteLegale as DateDernierTraitement", "DenominationUniteLegale as Denomination", "DenominationUsuelle1UniteLegale as DenominationUsuelle1", "DenominationUsuelle2UniteLegale as DenominationUsuelle2", "DenominationUsuelle3UniteLegale as DenominationUsuelle3", "CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as EconomieSocialeSolidaire", "CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active", "IdentifiantAssociationUniteLegale as IdentifiantAssociation", "NicSiegeUniteLegale as NicSiege", "CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes", "NomenclatureActivitePrincipaleUniteLegale as
NomenclatureActivitePrincipale", "NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage", "Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2", "Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4", "PrenomUsuelUniteLegale as PrenomUsuel", "PseudonymeUniteLegale as Pseudonyme", "SexeUniteLegale as Sexe", "SigleUniteLegale as Sigle", "Siren", "TrancheEffectifsUniteLegale as TrancheEffectifSalarie" ); {code} The _Dataset_ is successfully created. But the following call of _Encoders.bean(Enterprise.class)_ fails: {code:java} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:208) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) at fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178) at test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532) at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171) at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167) at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114) at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59) at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108) at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72) at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java
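One common reason the `ExpressionEncoder.javaBean` assertion trips is a target class that doesn't follow full JavaBean conventions. Whether that is the cause here is not established by the report; for comparison only, a class of the shape `Encoders.bean()` expects looks like this (the fields below are a hypothetical subset of the issue's `Entreprise` class):

```java
import java.io.Serializable;

// Minimal JavaBean: public class, public no-argument constructor, and a
// matching public getter/setter pair for every field, using types the
// bean encoder supports (boxed primitives, String, etc.).
public class Entreprise implements Serializable {
    private String siren;
    private Integer nombrePeriodes;
    private Boolean active;

    public Entreprise() {}  // required no-arg constructor

    public String getSiren() { return siren; }
    public void setSiren(String siren) { this.siren = siren; }

    public Integer getNombrePeriodes() { return nombrePeriodes; }
    public void setNombrePeriodes(Integer nombrePeriodes) { this.nombrePeriodes = nombrePeriodes; }

    public Boolean getActive() { return active; }
    public void setActive(Boolean active) { this.active = active; }

    public static void main(String[] args) {
        Entreprise e = new Entreprise();
        e.setSiren("123456789");
        System.out.println(e.getSiren());
    }
}
```

With a bean of this shape, the Dataset would be typed via something like `csv.as(Encoders.bean(Entreprise.class))`; the column aliases produced by the `selectExpr` above must match the bean's property names.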
[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset
[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset
[ https://issues.apache.org/jira/browse/SPARK-28180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-28180: Description: I am converting an _RDD_ Spark program to a _Dataset_ one. Previously it converted a CSV file, mapped with the help of a Jackson loader, to an RDD of Entreprise objects with Encoders.bean(Entreprise.class); now it does the conversion more simply, by loading the CSV content into a Dataset and applying _Encoders.bean(Entreprise.class)_ to it.

{code:java}
Dataset<Row> csv = this.session.read().format("csv")
    .option("header", "true").option("quote", "\"").option("escape", "\"")
    .load(source.getAbsolutePath())
    .selectExpr(
        "ActivitePrincipaleUniteLegale as ActivitePrincipale",
        "CAST(AnneeCategorieEntreprise as INTEGER) as AnneeCategorieEntreprise",
        "CAST(AnneeEffectifsUniteLegale as INTEGER) as AnneeValiditeEffectifSalarie",
        "CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as CaractereEmployeur",
        "CategorieEntreprise",
        "CategorieJuridiqueUniteLegale as CategorieJuridique",
        "DateCreationUniteLegale as DateCreationEntreprise",
        "DateDebut as DateDebutHistorisation",
        "DateDernierTraitementUniteLegale as DateDernierTraitement",
        "DenominationUniteLegale as Denomination",
        "DenominationUsuelle1UniteLegale as DenominationUsuelle1",
        "DenominationUsuelle2UniteLegale as DenominationUsuelle2",
        "DenominationUsuelle3UniteLegale as DenominationUsuelle3",
        "CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as EconomieSocialeSolidaire",
        "CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
        "IdentifiantAssociationUniteLegale as IdentifiantAssociation",
        "NicSiegeUniteLegale as NicSiege",
        "CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
        "NomenclatureActivitePrincipaleUniteLegale as NomenclatureActivitePrincipale",
        "NomUniteLegale as NomNaissance",
        "NomUsageUniteLegale as NomUsage",
        "Prenom1UniteLegale as Prenom1",
        "Prenom2UniteLegale as Prenom2",
        "Prenom3UniteLegale as Prenom3",
        "Prenom4UniteLegale as Prenom4",
        "PrenomUsuelUniteLegale as PrenomUsuel",
        "PseudonymeUniteLegale as Pseudonyme",
        "SexeUniteLegale as Sexe",
        "SigleUniteLegale as Sigle",
        "Siren",
        "TrancheEffectifsUniteLegale as TrancheEffectifSalarie");
{code}

The _Dataset_ is successfully created. But the subsequent call to _Encoders.bean(Entreprise.class)_ fails:

{code:java}
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:208)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
    at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
    at org.apache.spark.sql.Encoders.bean(Encoders.scala)
    at fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
    at test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
    at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
    at java.util.ArrayList.forEach(ArrayList.java:1257)
    at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAl
{code}
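The assertion fires inside ExpressionEncoder.javaBean, which inspects the target class through the standard JavaBean contract: a public no-arg constructor plus matching getter/setter pairs. As a minimal, Spark-free sketch of what that inspection sees (the class and field names below are hypothetical, chosen to mirror two of the mapped columns), the JDK's own java.beans.Introspector reports which properties expose both accessors; a property missing either one is invisible to bean-based encoders:

{code:java}
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.Arrays;

public class BeanCheck {

    // Hypothetical minimal POJO mirroring two of the mapped columns.
    public static class Entreprise {
        private String siren;
        private Integer nombrePeriodes;

        public Entreprise() {}  // public no-arg constructor required by the bean contract

        public String getSiren() { return siren; }
        public void setSiren(String siren) { this.siren = siren; }

        public Integer getNombrePeriodes() { return nombrePeriodes; }
        public void setNombrePeriodes(Integer nombrePeriodes) { this.nombrePeriodes = nombrePeriodes; }
    }

    // Counts the properties that expose BOTH a getter and a setter,
    // i.e. the ones a bean introspection like Encoders.bean can map.
    public static long countUsableProperties(Class<?> beanClass) {
        try {
            return Arrays.stream(Introspector.getBeanInfo(beanClass, Object.class)
                                             .getPropertyDescriptors())
                         .filter(pd -> pd.getReadMethod() != null && pd.getWriteMethod() != null)
                         .count();
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws IntrospectionException {
        // List each property and whether it is readable/writable.
        for (PropertyDescriptor pd : Introspector.getBeanInfo(Entreprise.class, Object.class)
                                                 .getPropertyDescriptors()) {
            System.out.println(pd.getName()
                    + " readable=" + (pd.getReadMethod() != null)
                    + " writable=" + (pd.getWriteMethod() != null));
        }
    }
}
{code}

Running this against the real Entreprise class of the project would show whether every selected column has a corresponding fully-accessible bean property; a mismatch between the Dataset schema and what the introspection finds is one plausible way to trip the assertion in ExpressionEncoder.scala:87.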