[jira] [Resolved] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified
[ https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-22017. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 19239 [https://github.com/apache/spark/pull/19239] > watermark evaluation with multi-input stream operators is unspecified > - > > Key: SPARK-22017 > URL: https://issues.apache.org/jira/browse/SPARK-22017 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres > Fix For: 3.0.0 > > > Watermarks are stored as a single value in StreamExecution. If a query has > multiple watermark nodes (which can generally only happen with multi input > operators like union), a headOption call will arbitrarily pick one to use as > the real one. This will happen independently in each batch, possibly leading > to strange and undefined behavior. > We should instead choose the minimum from all watermark exec nodes as the > query-wide watermark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
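The proposed fix amounts to taking the minimum over every watermark operator rather than the first one. A minimal Scala sketch of that selection follows; the names are illustrative and do not come from StreamExecution itself:
{code}
// Illustrative only: derive the query-wide watermark as the minimum of the
// values reported by the individual event-time watermark exec nodes, instead
// of letting headOption pick an arbitrary one.
def queryWideWatermarkMs(perOperatorWatermarksMs: Seq[Long]): Option[Long] =
  if (perOperatorWatermarksMs.isEmpty) None
  else Some(perOperatorWatermarksMs.min)

// Example: a union of two streams whose watermark operators report different values.
val watermarks = Seq(1505487000000L, 1505483400000L)
assert(queryWideWatermarkMs(watermarks).contains(1505483400000L))
{code}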
[jira] [Created] (SPARK-22034) CrossValidator's training and testing set with different set of labels, resulting in encoder transform error
AnChe Kuo created SPARK-22034: - Summary: CrossValidator's training and testing set with different set of labels, resulting in encoder transform error Key: SPARK-22034 URL: https://issues.apache.org/jira/browse/SPARK-22034 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.2.0 Environment: Ubuntu 16.04 Scala 2.11 Spark 2.2.0 Reporter: AnChe Kuo Let's say we have a VectorIndexer with maxCategories set to 13, and the training data has a column containing month labels. In CrossValidator, the dataframe is split into training and testing sets automatically. It can happen that the training split lacks month 2 (by chance, or quite frequently if the labels are unbalanced). When the model is trained within the cross validator, the pipeline is fitted with the training split only, resulting in a partial key map in VectorIndexer. When this pipeline is used to transform the held-out split, VectorIndexer throws a "key not found" error. Making CrossValidator an estimator so it can be connected to a whole pipeline is a cool idea, but bugs like this occur and are not expected. The solution, I am guessing, would be to check each stage in the pipeline, and when we see an encoder-type stage, fit that stage with the complete dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
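For readers trying to reproduce the report above, here is a rough sketch using the public Spark ML API; the input DataFrame df (with a "features" vector whose first slot holds a month value and a binary "label" column) and the empty parameter grid are assumptions for illustration, not part of the original report:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Months (1-12) are treated as a categorical feature because maxCategories >= 12.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(13)

val lr = new LogisticRegression().setFeaturesCol("indexed")
val pipeline = new Pipeline().setStages(Array(indexer, lr))

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(new ParamGridBuilder().build())
  .setNumFolds(3)

// If a fold's training split happens to miss one month, the fitted
// VectorIndexerModel holds a partial category map, and transforming the
// held-out split fails with "key not found: <missing month>".
val cvModel = cv.fit(df)
{code}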
[jira] [Commented] (SPARK-22033) BufferHolder size checks should account for the specific VM array size limitations
[ https://issues.apache.org/jira/browse/SPARK-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168650#comment-16168650 ] Sean Owen commented on SPARK-22033: --- Hm, good point. There may be other similar issues throughout the code. [~cloud_fan] does that sound right to you? > BufferHolder size checks should account for the specific VM array size > limitations > -- > > Key: SPARK-22033 > URL: https://issues.apache.org/jira/browse/SPARK-22033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Vadim Semenov >Priority: Minor > > User may get the following OOM Error while running a job with heavy > aggregations > ``` > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:88) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:33) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > ``` > The [`BufferHolder.grow` tries to create a byte array of `Integer.MAX_VALUE` > here](https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java#L72) > but the maximum size of an array depends on specifics of a VM. > The safest value seems to be `Integer.MAX_VALUE - 8` > http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229 > In my JVM: > ``` > java -version > openjdk version "1.8.0_141" > OpenJDK Runtime Environment (build 1.8.0_141-b16) > OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode) > ``` > the max is `new Array[Byte](Integer.MAX_VALUE - 2)` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22033) BufferHolder size checks should account for the specific VM array size limitations
[ https://issues.apache.org/jira/browse/SPARK-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168612#comment-16168612 ] Vadim Semenov commented on SPARK-22033: --- Leaving traces for others if they happen to hit the same issue. The issue isn't exclusive to the ObjectAggregator, it can happen in the SortBasedAggregator too ``` java.lang.OutOfMemoryError: Requested array size exceeds VM limit at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73) at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235) at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:159) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.initialize(SortBasedAggregationIterator.scala:103) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:113) at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86) at org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` > BufferHolder size checks should account for the specific VM array size > limitations > -- > > Key: SPARK-22033 > URL: https://issues.apache.org/jira/browse/SPARK-22033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Vadim Semenov >Priority: Minor > > User may get the following OOM Error while running a job with heavy > aggregations > ``` > java.lang.OutOfMemoryError: Requested array 
size exceeds VM limit > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:88) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregatio
[jira] [Created] (SPARK-22033) BufferHolder size checks should account for the specific VM array size limitations
Vadim Semenov created SPARK-22033: - Summary: BufferHolder size checks should account for the specific VM array size limitations Key: SPARK-22033 URL: https://issues.apache.org/jira/browse/SPARK-22033 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Vadim Semenov Priority: Minor User may get the following OOM Error while running a job with heavy aggregations ``` java.lang.OutOfMemoryError: Requested array size exceeds VM limit at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73) at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235) at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:88) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:33) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` The [`BufferHolder.grow` tries to create a byte array of `Integer.MAX_VALUE` here](https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java#L72) but the maximum size of an array depends on specifics of a VM. The safest value seems to be `Integer.MAX_VALUE - 8` http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229 In my JVM: ``` java -version openjdk version "1.8.0_141" OpenJDK Runtime Environment (build 1.8.0_141-b16) OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode) ``` the max is `new Array[Byte](Integer.MAX_VALUE - 2)` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
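A sketch of the kind of bound the growth logic could enforce; the constant mirrors the ArrayList MAX_ARRAY_SIZE convention cited above, and the method shape is illustrative rather than the actual BufferHolder code:
{code}
object SafeGrow {
  // Many VMs reject array sizes above Integer.MAX_VALUE - 8 even though the
  // length is an int, so use that as the hard ceiling.
  val MaxArraySize: Int = Int.MaxValue - 8

  def newLength(currentLength: Int, neededExtra: Int): Int = {
    val required = currentLength.toLong + neededExtra
    if (required > MaxArraySize) {
      throw new UnsupportedOperationException(
        s"Cannot grow buffer to $required bytes; VM array size limit is $MaxArraySize")
    }
    // Double the buffer as before, but never past the VM limit.
    math.min(math.max(currentLength.toLong * 2, required), MaxArraySize.toLong).toInt
  }
}
{code}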
[jira] [Assigned] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12297: Assignee: (was: Apache Spark) > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. 
Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168410#comment-16168410 ] Apache Spark commented on SPARK-12297: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/19250 > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. 
> HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12297: Assignee: Apache Spark > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue >Assignee: Apache Spark > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. 
Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
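A condensed way to run the two-shell reproduction described in this ticket, with the timezone set through JVM options rather than the OS clock; the paths are illustrative:
{code}
// Shell 1: spark-shell --driver-java-options "-Duser.timezone=America/Los_Angeles"
import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = new StructType().add("ts", TimestampType)
val rows = spark.sparkContext.parallelize(Seq(
  "2015-12-31 23:50:59.123",
  "2016-01-01 01:29:59.123"
).map(s => Row(Timestamp.valueOf(s))))
val rawData = spark.createDataFrame(rows, schema)

rawData.write.mode("overwrite").parquet("/tmp/la_parquet")
rawData.write.mode("overwrite").json("/tmp/la_json")

// Shell 2: spark-shell --driver-java-options "-Duser.timezone=America/New_York"
// The json copy still shows the wall-clock times that were written, while the
// parquet (int96) copy is shifted by three hours, so a join on "ts" between
// the two returns no rows -- the symptom shown above.
spark.read.parquet("/tmp/la_parquet").show(false)
spark.read.json("/tmp/la_json").show(false)
{code}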
[jira] [Commented] (SPARK-21842) Support Kerberos ticket renewal and creation in Mesos
[ https://issues.apache.org/jira/browse/SPARK-21842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168278#comment-16168278 ] Arthur Rand commented on SPARK-21842: - Hey [~kalvinnchau] I'm currently of the mind that using the RPC/broadcast approach is better (for Mesos) for a couple reasons. 1. We recently added Secret support to Spark on Mesos, this uses a temporary file system to put the keytab or TGT in the sandbox of the Spark driver. They are packed into the SparkAppConfig (CoarseGrainedSchedulerBackend.scala:L236) which is broadcast to the executors, so using RPC/broadcast is consistent with this. 2. Keeps all transfers of secure information within Spark. 3. Doesn't require HDFS. There is a little bit of a chicken-and-egg situation here w.r.t. YARN, but I'm obviously not familiar enough with how Spark-YARN-HDFS work together. However I understand that there is a potential risk with executors falsely registering with the Driver and getting tokens. I know in the case of DC/OS this is less of a concern (we have some protections around this). But this could still happen today due to the code mentioned above. We could prevent this by keeping track of the executor IDs and only allowing executors to register when they have an expected ID..? > Support Kerberos ticket renewal and creation in Mesos > -- > > Key: SPARK-21842 > URL: https://issues.apache.org/jira/browse/SPARK-21842 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand > > We at Mesosphere have written Kerberos support for Spark on Mesos. The code > to use Kerberos on a Mesos cluster has been added to Apache Spark > (SPARK-16742). This ticket is to complete the implementation and allow for > ticket renewal and creation. Specifically for long running and streaming jobs. > Mesosphere design doc (needs revision, wip): > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
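One way the "expected executor ID" idea above could look; this is purely an illustrative sketch, and none of these names come from the Spark or Mesos codebase:
{code}
// Only executors the driver itself launched are allowed to receive
// delegation tokens when they register.
class DelegationTokenGate {
  private val expectedExecutors = scala.collection.mutable.Set.empty[String]

  // Recorded by the scheduler backend at launch time.
  def executorLaunched(execId: String): Unit = synchronized {
    expectedExecutors += execId
  }

  // Consulted when an executor registers and asks for tokens.
  def tokensFor(execId: String, tokens: Array[Byte]): Option[Array[Byte]] = synchronized {
    if (expectedExecutors.contains(execId)) Some(tokens) else None
  }
}
{code}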
[jira] [Assigned] (SPARK-22032) Speed up StructType.fromInternal
[ https://issues.apache.org/jira/browse/SPARK-22032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22032: Assignee: Apache Spark > Speed up StructType.fromInternal > > > Key: SPARK-22032 > URL: https://issues.apache.org/jira/browse/SPARK-22032 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > > StructType.fromInternal is calling f.fromInternal(v) for every field. > We can use needConversion method to limit the number of function calls. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22032) Speed up StructType.fromInternal
[ https://issues.apache.org/jira/browse/SPARK-22032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22032: Assignee: (was: Apache Spark) > Speed up StructType.fromInternal > > > Key: SPARK-22032 > URL: https://issues.apache.org/jira/browse/SPARK-22032 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > StructType.fromInternal is calling f.fromInternal(v) for every field. > We can use needConversion method to limit the number of function calls. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22032) Speed up StructType.fromInternal
[ https://issues.apache.org/jira/browse/SPARK-22032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168273#comment-16168273 ] Apache Spark commented on SPARK-22032: -- User 'maver1ck' has created a pull request for this issue: https://github.com/apache/spark/pull/19249 > Speed up StructType.fromInternal > > > Key: SPARK-22032 > URL: https://issues.apache.org/jira/browse/SPARK-22032 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > StructType.fromInternal is calling f.fromInternal(v) for every field. > We can use needConversion method to limit the number of function calls. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22032) Speed up StructType.fromInternal
Maciej Bryński created SPARK-22032: -- Summary: Speed up StructType.fromInternal Key: SPARK-22032 URL: https://issues.apache.org/jira/browse/SPARK-22032 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.2.0 Reporter: Maciej Bryński StructType.fromInternal is calling f.fromInternal(v) for every field. We can use needConversion method to limit the number of function calls. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22031) KMeans - Compute cost for a single vector
[ https://issues.apache.org/jira/browse/SPARK-22031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22031: -- Target Version/s: (was: 2.3.0) Labels: (was: newbie) Priority: Minor (was: Major) > KMeans - Compute cost for a single vector > - > > Key: SPARK-22031 > URL: https://issues.apache.org/jira/browse/SPARK-22031 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Laurent Valdes >Priority: Minor > > BisectingKMeans models from ML package have the ability to compute cost for a > single data point. > We would like to port that ability to mainstream KMeansModels, both to ML and > MLLIB. > A pull request is provided. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168038#comment-16168038 ] Nicholas Chammas commented on SPARK-17025: -- I take that back. I won't be able to test this for the time being. If someone else wants to test this out and needs pointers, I'd be happy to help. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22030) GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB
[ https://issues.apache.org/jira/browse/SPARK-22030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22030: Assignee: (was: Apache Spark) > GraphiteSink fails to re-connect to Graphite instances behind an ELB or any > other auto-scaled LB > > > Key: SPARK-22030 > URL: https://issues.apache.org/jira/browse/SPARK-22030 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Alex Mikhailau >Priority: Critical > > Upgrade codahale metrics library so that Graphite constructor can re-resolve > hosts behind a CNAME with re-tried DNS lookups. When Graphite is deployed > behind an ELB, ELB may change IP addresses based on auto-scaling needs. Using > current approach yields Graphite usage impossible, fixing for that use case > Upgrade to codahale 3.1.5 > Use new Graphite(host, port) constructor instead of new Graphite(new > InetSocketAddress(host, port)) constructor > This are proposed changes for codahale lib - > dropwizard/metrics@v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. > Specifically, > https://github.com/dropwizard/metrics/blob/b4d246d34e8a059b047567848b3522567cbe6108/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L120 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22030) GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB
[ https://issues.apache.org/jira/browse/SPARK-22030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168030#comment-16168030 ] Apache Spark commented on SPARK-22030: -- User 'alexmnyc' has created a pull request for this issue: https://github.com/apache/spark/pull/19210 > GraphiteSink fails to re-connect to Graphite instances behind an ELB or any > other auto-scaled LB > > > Key: SPARK-22030 > URL: https://issues.apache.org/jira/browse/SPARK-22030 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Alex Mikhailau >Priority: Critical > > Upgrade codahale metrics library so that Graphite constructor can re-resolve > hosts behind a CNAME with re-tried DNS lookups. When Graphite is deployed > behind an ELB, ELB may change IP addresses based on auto-scaling needs. Using > current approach yields Graphite usage impossible, fixing for that use case > Upgrade to codahale 3.1.5 > Use new Graphite(host, port) constructor instead of new Graphite(new > InetSocketAddress(host, port)) constructor > This are proposed changes for codahale lib - > dropwizard/metrics@v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. > Specifically, > https://github.com/dropwizard/metrics/blob/b4d246d34e8a059b047567848b3522567cbe6108/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L120 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22030) GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB
[ https://issues.apache.org/jira/browse/SPARK-22030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22030: Assignee: Apache Spark > GraphiteSink fails to re-connect to Graphite instances behind an ELB or any > other auto-scaled LB > > > Key: SPARK-22030 > URL: https://issues.apache.org/jira/browse/SPARK-22030 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Alex Mikhailau >Assignee: Apache Spark >Priority: Critical > > Upgrade codahale metrics library so that Graphite constructor can re-resolve > hosts behind a CNAME with re-tried DNS lookups. When Graphite is deployed > behind an ELB, ELB may change IP addresses based on auto-scaling needs. Using > current approach yields Graphite usage impossible, fixing for that use case > Upgrade to codahale 3.1.5 > Use new Graphite(host, port) constructor instead of new Graphite(new > InetSocketAddress(host, port)) constructor > This are proposed changes for codahale lib - > dropwizard/metrics@v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. > Specifically, > https://github.com/dropwizard/metrics/blob/b4d246d34e8a059b047567848b3522567cbe6108/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L120 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22031) KMeans - Compute cost for a single vector
[ https://issues.apache.org/jira/browse/SPARK-22031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Laurent Valdes updated SPARK-22031: --- Summary: KMeans - Compute cost for a single vector (was: Compute cost for a single vector) > KMeans - Compute cost for a single vector > - > > Key: SPARK-22031 > URL: https://issues.apache.org/jira/browse/SPARK-22031 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Laurent Valdes > Labels: newbie > > BisectingKMeans models from ML package have the ability to compute cost for a > single data point. > We would like to port that ability to mainstream KMeansModels, both to ML and > MLLIB. > A pull request is provided. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22031) Compute cost for a single vector
Laurent Valdes created SPARK-22031: -- Summary: Compute cost for a single vector Key: SPARK-22031 URL: https://issues.apache.org/jira/browse/SPARK-22031 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.3.0 Reporter: Laurent Valdes BisectingKMeans models from ML package have the ability to compute cost for a single data point. We would like to port that ability to mainstream KMeansModels, both to ML and MLLIB. A pull request is provided. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
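Until such an API exists, the quantity can be computed from the public API of a fitted model; this sketch uses org.apache.spark.ml.clustering.KMeansModel and approximates what the pull request proposes, rather than reproducing its actual code:
{code}
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Cost of a single point: squared distance to its nearest cluster center,
// i.e. the per-point term that computeCost sums over a whole dataset.
def singleVectorCost(model: KMeansModel, point: Vector): Double =
  model.clusterCenters.map(center => Vectors.sqdist(center, point)).min

// Usage, assuming `kmeansModel` was fitted elsewhere:
// val cost = singleVectorCost(kmeansModel, Vectors.dense(1.0, 2.0))
{code}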
[jira] [Created] (SPARK-22030) GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB
Alex Mikhailau created SPARK-22030: -- Summary: GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB Key: SPARK-22030 URL: https://issues.apache.org/jira/browse/SPARK-22030 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Alex Mikhailau Priority: Critical Upgrade the codahale metrics library so that the Graphite constructor can re-resolve hosts behind a CNAME with retried DNS lookups. When Graphite is deployed behind an ELB, the ELB may change IP addresses based on auto-scaling needs. The current approach makes Graphite unusable in that setup; this change fixes that use case. Upgrade to codahale 3.1.5. Use the new Graphite(host, port) constructor instead of the new Graphite(new InetSocketAddress(host, port)) constructor. These are the proposed changes for the codahale lib - dropwizard/metrics@v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. Specifically, https://github.com/dropwizard/metrics/blob/b4d246d34e8a059b047567848b3522567cbe6108/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L120 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
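The constructor change requested above, shown side by side; the hostname and port are placeholders, and per the ticket the host/port constructor is the one to use once codahale 3.1.5 is on the classpath:
{code}
import java.net.InetSocketAddress
import com.codahale.metrics.graphite.Graphite

val host = "graphite.example.com"   // placeholder: typically an ELB CNAME
val port = 2003

// Current approach: the address is resolved once, so if the load balancer's
// IPs change the sink keeps writing to a stale address.
val resolvedOnce = new Graphite(new InetSocketAddress(host, port))

// Proposed approach: pass host and port so the hostname can be re-resolved
// when the sink (re)connects.
val reResolved = new Graphite(host, port)
{code}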
[jira] [Commented] (SPARK-22029) Cache of _parse_datatype_json_string function
[ https://issues.apache.org/jira/browse/SPARK-22029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168017#comment-16168017 ] Maciej Bryński commented on SPARK-22029: I did a proof of concept with functools.lru_cache, but that function is only available for Python >= 3.2. Any ideas for Python 2.x? > Cache of _parse_datatype_json_string function > - > > Key: SPARK-22029 > URL: https://issues.apache.org/jira/browse/SPARK-22029 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > _parse_datatype_json_string is called many times with the same arguments. > Caching these calls can dramatically speed up PySpark code execution (up to > 20%). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22029) Cache of _parse_datatype_json_string function
Maciej Bryński created SPARK-22029: -- Summary: Cache of _parse_datatype_json_string function Key: SPARK-22029 URL: https://issues.apache.org/jira/browse/SPARK-22029 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.2.0 Reporter: Maciej Bryński _parse_datatype_json_string is called many times with the same arguments. Caching these calls can dramatically speed up PySpark code execution (up to 20%). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22024) [pySpark] Speeding up internal conversion for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22024: --- Summary: [pySpark] Speeding up internal conversion for Spark SQL (was: [pySpark] Speeding up fromInternal methods) > [pySpark] Speeding up internal conversion for Spark SQL > --- > > Key: SPARK-22024 > URL: https://issues.apache.org/jira/browse/SPARK-22024 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > The fromInternal methods of PySpark data types are a bottleneck when using PySpark. > This is an umbrella ticket for different optimizations in these methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22028) spark-submit trips over environment variables
[ https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168009#comment-16168009 ] Sean Owen commented on SPARK-22028: --- But that's a Java error or limit, even. What would Spark do? You should unset this. > spark-submit trips over environment variables > - > > Key: SPARK-22028 > URL: https://issues.apache.org/jira/browse/SPARK-22028 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1 > Environment: Operating System: Windows 10 > Shell: CMD or bash.exe, both with the same result >Reporter: Franz Wimmer > Labels: windows > > I have a strange environment variable in my Windows operating system: > {code:none} > C:\Path>set "" > =::=::\ > {code} > According to [this issue at > stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some > sort of old MS-DOS relict that interacts with cygwin shells. > Leaving that aside for a moment, Spark tries to read environment variables on > submit and trips over it: > {code:none} > ./spark-submit.cmd > Running Spark using the REST application submission protocol. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch > an application in spark://:31824. > 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server > spark://***:31824. > Warning: Master endpoint spark://:31824 was not a REST server. > Falling back to legacy submission gateway instead. > 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > [ ... ] > 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > at > java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114) > at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61) > at > java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221) > at > org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:55) > at > org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:54) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221) > at > scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428) > at > scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:54) > at > org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:181) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:91) > {code} > Please note that _spark-submit.cmd_ is in this case my own script calling the > _spark-submit.cmd_ from the spark distribution. > I think that shouldn't happen. 
Spark should handle such a malformed > environment variable gracefully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
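One way the launch path could tolerate such an entry, as a sketch only (this is not the actual CommandUtils change):
{code}
// Skip environment entries that ProcessBuilder rejects (empty names or names
// containing '='), such as the "=::" variable described above, instead of
// letting the whole driver launch fail.
def putEnvironmentSafely(pb: ProcessBuilder, env: Map[String, String]): Unit = {
  val target = pb.environment()
  env.foreach { case (key, value) =>
    if (key.nonEmpty && !key.contains("=")) {
      target.put(key, value)
    } // else: drop (or log a warning for) the malformed entry
  }
}
{code}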
[jira] [Reopened] (SPARK-22028) spark-submit trips over environment variables
[ https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franz Wimmer reopened SPARK-22028: -- Sorry - the Error regarding the Hadoop binaries is normal for this system - I'm asking because of the exception below. > spark-submit trips over environment variables > - > > Key: SPARK-22028 > URL: https://issues.apache.org/jira/browse/SPARK-22028 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1 > Environment: Operating System: Windows 10 > Shell: CMD or bash.exe, both with the same result >Reporter: Franz Wimmer > Labels: windows > > I have a strange environment variable in my Windows operating system: > {code:none} > C:\Path>set "" > =::=::\ > {code} > According to [this issue at > stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some > sort of old MS-DOS relict that interacts with cygwin shells. > Leaving that aside for a moment, Spark tries to read environment variables on > submit and trips over it: > {code:none} > ./spark-submit.cmd > Running Spark using the REST application submission protocol. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch > an application in spark://:31824. > 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server > spark://***:31824. > Warning: Master endpoint spark://:31824 was not a REST server. > Falling back to legacy submission gateway instead. > 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Hadoop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at > org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273) > at > org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261) > at > org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791) > at > org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761) > at > org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634) > at > org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) > at > org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391) > at org.apache.spark.SecurityManager.(SecurityManager.scala:221) > at org.apache.spark.deploy.Client$.main(Client.scala:230) > at org.apache.spark.deploy.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > at > java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114) > at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61) > at > java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221) > at > or
[jira] [Issue Comment Deleted] (SPARK-22028) spark-submit trips over environment variables
[ https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franz Wimmer updated SPARK-22028: - Comment: was deleted (was: The Error with the Hadoop binaries is normal for this system - I'm asking because of the exception below.) > spark-submit trips over environment variables > - > > Key: SPARK-22028 > URL: https://issues.apache.org/jira/browse/SPARK-22028 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1 > Environment: Operating System: Windows 10 > Shell: CMD or bash.exe, both with the same result >Reporter: Franz Wimmer > Labels: windows > > I have a strange environment variable in my Windows operating system: > {code:none} > C:\Path>set "" > =::=::\ > {code} > According to [this issue at > stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some > sort of old MS-DOS relict that interacts with cygwin shells. > Leaving that aside for a moment, Spark tries to read environment variables on > submit and trips over it: > {code:none} > ./spark-submit.cmd > Running Spark using the REST application submission protocol. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch > an application in spark://:31824. > 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server > spark://***:31824. > Warning: Master endpoint spark://:31824 was not a REST server. > Falling back to legacy submission gateway instead. > 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Hadoop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at > org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273) > at > org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261) > at > org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791) > at > org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761) > at > org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634) > at > org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) > at > org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391) > at org.apache.spark.SecurityManager.(SecurityManager.scala:221) > at org.apache.spark.deploy.Client$.main(Client.scala:230) > at org.apache.spark.deploy.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > at > java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114) > at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61) > at > java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221) >
[jira] [Updated] (SPARK-22028) spark-submit trips over environment variables
[ https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franz Wimmer updated SPARK-22028: - Description: I have a strange environment variable in my Windows operating system: {code:none} C:\Path>set "" =::=::\ {code} According to [this issue at stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some sort of old MS-DOS relict that interacts with cygwin shells. Leaving that aside for a moment, Spark tries to read environment variables on submit and trips over it: {code:none} ./spark-submit.cmd Running Spark using the REST application submission protocol. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch an application in spark://:31824. 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server spark://***:31824. Warning: Master endpoint spark://:31824 was not a REST server. Falling back to legacy submission gateway instead. 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path [ ... ] 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: java.lang.IllegalArgumentException: Invalid environment variable name: "=::" java.lang.IllegalArgumentException: Invalid environment variable name: "=::" at java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114) at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61) at java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170) at java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242) at java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221) at org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:55) at org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:54) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:54) at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:181) at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:91) {code} Please note that _spark-submit.cmd_ is in this case my own script calling the _spark-submit.cmd_ from the spark distribution. I think that shouldn't happen. Spark should handle such a malformed environment variable gracefully. was: I have a strange environment variable in my Windows operating system: {code:none} C:\Path>set "" =::=::\ {code} According to [this issue at stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some sort of old MS-DOS relict that interacts with cygwin shells. Leaving that aside for a moment, Spark tries to read environment variables on submit and trips over it: {code:none} ./spark-submit.cmd Running Spark using the REST application submission protocol. 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch an application in spark://:31824. 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server spark://***:31824. Warning: Master endpoint spark://:31824 was not a REST server. Falling back to legacy submission gateway instead. 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) at org.apache.hadoop.util.Shell.(Shell.java:387) at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261) at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubje
[jira] [Commented] (SPARK-22028) spark-submit trips over environment variables
[ https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167970#comment-16167970 ] Franz Wimmer commented on SPARK-22028: -- The Error with the Hadoop binaries is normal for this system - I'm asking because of the exception below. > spark-submit trips over environment variables > - > > Key: SPARK-22028 > URL: https://issues.apache.org/jira/browse/SPARK-22028 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1 > Environment: Operating System: Windows 10 > Shell: CMD or bash.exe, both with the same result >Reporter: Franz Wimmer > Labels: windows > > I have a strange environment variable in my Windows operating system: > {code:none} > C:\Path>set "" > =::=::\ > {code} > According to [this issue at > stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some > sort of old MS-DOS relict that interacts with cygwin shells. > Leaving that aside for a moment, Spark tries to read environment variables on > submit and trips over it: > {code:none} > ./spark-submit.cmd > Running Spark using the REST application submission protocol. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch > an application in spark://:31824. > 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server > spark://***:31824. > Warning: Master endpoint spark://:31824 was not a REST server. > Falling back to legacy submission gateway instead. > 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Hadoop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at > org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273) > at > org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261) > at > org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791) > at > org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761) > at > org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634) > at > org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) > at > org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391) > at org.apache.spark.SecurityManager.(SecurityManager.scala:221) > at org.apache.spark.deploy.Client$.main(Client.scala:230) > at org.apache.spark.deploy.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > at > java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114) > at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61) > at > java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242) > at > java.lang.ProcessEnvironment$StringEnvironment.put(Proce
[jira] [Resolved] (SPARK-22028) spark-submit trips over environment variables
[ https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22028. --- Resolution: Not A Problem No, it indicates you don't have the win Hadoop binaries available. See the error and search JIRA here. > spark-submit trips over environment variables > - > > Key: SPARK-22028 > URL: https://issues.apache.org/jira/browse/SPARK-22028 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1 > Environment: Operating System: Windows 10 > Shell: CMD or bash.exe, both with the same result >Reporter: Franz Wimmer > Labels: windows > > I have a strange environment variable in my Windows operating system: > {code:none} > C:\Path>set "" > =::=::\ > {code} > According to [this issue at > stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some > sort of old MS-DOS relict that interacts with cygwin shells. > Leaving that aside for a moment, Spark tries to read environment variables on > submit and trips over it: > {code:none} > ./spark-submit.cmd > Running Spark using the REST application submission protocol. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch > an application in spark://:31824. > 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server > spark://***:31824. > Warning: Master endpoint spark://:31824 was not a REST server. > Falling back to legacy submission gateway instead. > 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Hadoop binaries. > at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) > at org.apache.hadoop.util.Shell.(Shell.java:387) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) > at > org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273) > at > org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261) > at > org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791) > at > org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761) > at > org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634) > at > org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) > at > org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391) > at org.apache.spark.SecurityManager.(SecurityManager.scala:221) > at org.apache.spark.deploy.Client$.main(Client.scala:230) > at org.apache.spark.deploy.Client.main(Client.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) > at 
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > java.lang.IllegalArgumentException: Invalid environment variable name: "=::" > at > java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114) > at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61) > at > java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242) > at > java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221) >
[jira] [Created] (SPARK-22028) spark-submit trips over environment variables
Franz Wimmer created SPARK-22028: Summary: spark-submit trips over environment variables Key: SPARK-22028 URL: https://issues.apache.org/jira/browse/SPARK-22028 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.1.1 Environment: Operating System: Windows 10 Shell: CMD or bash.exe, both with the same result Reporter: Franz Wimmer I have a strange environment variable in my Windows operating system: {code:none} C:\Path>set "" =::=::\ {code} According to [this issue at stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some sort of old MS-DOS relict that interacts with cygwin shells. Leaving that aside for a moment, Spark tries to read environment variables on submit and trips over it: {code:none} ./spark-submit.cmd Running Spark using the REST application submission protocol. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch an application in spark://:31824. 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server spark://***:31824. Warning: Master endpoint spark://:31824 was not a REST server. Falling back to legacy submission gateway instead. 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394) at org.apache.hadoop.util.Shell.(Shell.java:387) at org.apache.hadoop.util.StringUtils.(StringUtils.java:80) at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261) at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634) at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391) at org.apache.spark.SecurityManager.(SecurityManager.scala:221) at org.apache.spark.deploy.Client$.main(Client.scala:230) at org.apache.spark.deploy.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: java.lang.IllegalArgumentException: Invalid environment variable name: "=::" java.lang.IllegalArgumentException: Invalid environment variable name: "=::" at java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114) at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61) at java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170) at java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242) at java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221) at org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:55) at org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:54) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221) at scala.collection.immutable.HashMap$HashTrieMap.for
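As an aside on SPARK-22028: the second failure in the log (the {{Invalid environment variable name: "=::"}} exception) comes from copying the parent environment into {{ProcessBuilder}} when the worker launches the driver, as the {{CommandUtils.buildProcessBuilder}} frames show. Below is a minimal sketch of the kind of defensive filtering the reporter asks for; the helper name and the validity check are illustrative assumptions, not Spark's actual CommandUtils code.

{code}
import scala.collection.JavaConverters._

// Sketch: skip environment entries whose names java.lang.ProcessEnvironment
// would reject (empty names or names containing '=', such as the "=::" entry
// above), instead of letting the driver launch fail.
def buildProcess(command: Seq[String], extraEnv: Map[String, String]): ProcessBuilder = {
  val builder = new ProcessBuilder(command.asJava)
  val env = builder.environment()
  extraEnv.foreach { case (key, value) =>
    if (key.nonEmpty && !key.contains("=")) {
      env.put(key, value)
    }
  }
  builder
}
{code}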
[jira] [Updated] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc
[ https://issues.apache.org/jira/browse/SPARK-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazunori Sakamoto updated SPARK-22027: -- Labels: documentation (was: ) > Explanation of default value of GBTRegressor's maxIter is missing in API doc > > > Key: SPARK-22027 > URL: https://issues.apache.org/jira/browse/SPARK-22027 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Kazunori Sakamoto >Priority: Trivial > Labels: documentation > > The current API document of GBTRegressor's maxIter doesn't include the > explanation of the default value (20). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc
[ https://issues.apache.org/jira/browse/SPARK-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22027: Assignee: (was: Apache Spark) > Explanation of default value of GBTRegressor's maxIter is missing in API doc > > > Key: SPARK-22027 > URL: https://issues.apache.org/jira/browse/SPARK-22027 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Kazunori Sakamoto >Priority: Trivial > Labels: documentation > > The current API document of GBTRegressor's maxIter doesn't include the > explanation of the default value (20). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc
[ https://issues.apache.org/jira/browse/SPARK-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22027: Assignee: Apache Spark > Explanation of default value of GBTRegressor's maxIter is missing in API doc > > > Key: SPARK-22027 > URL: https://issues.apache.org/jira/browse/SPARK-22027 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Kazunori Sakamoto >Assignee: Apache Spark >Priority: Trivial > Labels: documentation > > The current API document of GBTRegressor's maxIter doesn't include the > explanation of the default value (20). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc
[ https://issues.apache.org/jira/browse/SPARK-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167944#comment-16167944 ] Apache Spark commented on SPARK-22027: -- User 'exKAZUu' has created a pull request for this issue: https://github.com/apache/spark/pull/19248 > Explanation of default value of GBTRegressor's maxIter is missing in API doc > > > Key: SPARK-22027 > URL: https://issues.apache.org/jira/browse/SPARK-22027 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, > 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Kazunori Sakamoto >Priority: Trivial > Labels: documentation > > The current API document of GBTRegressor's maxIter doesn't include the > explanation of the default value (20). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc
Kazunori Sakamoto created SPARK-22027: - Summary: Explanation of default value of GBTRegressor's maxIter is missing in API doc Key: SPARK-22027 URL: https://issues.apache.org/jira/browse/SPARK-22027 Project: Spark Issue Type: Documentation Components: ML Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 1.4.1, 1.4.0 Reporter: Kazunori Sakamoto Priority: Trivial The current API document of GBTRegressor's maxIter doesn't include the explanation of the default value (20). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
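The value reported in SPARK-22027 is easy to confirm against the API itself; a quick check, assuming nothing beyond the ml package named in the ticket (the param default is set at construction, so no SparkSession is needed):

{code}
import org.apache.spark.ml.regression.GBTRegressor

// maxIter defaults to 20, which is the value the API doc should state.
val gbt = new GBTRegressor()
println(gbt.getMaxIter)  // 20
{code}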
[jira] [Comment Edited] (SPARK-21996) Streaming ignores files with spaces in the file names
[ https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167941#comment-16167941 ] Xiayun Sun edited comment on SPARK-21996 at 9/15/17 2:30 PM: - I can reproduce this issue for master branch, and found out it was from using {{.toString}} instead of {{.getPath}} to get the file path in string format. I pushed a PR to make it work for streaming. But there're other places in spark where we're using `.toString`. Not sure if it's a general concern. was (Author: xiayunsun): I can reproduce this issue for master branch, and found out it was from using `.toString` instead of `.getPath` to get the file path in string format. I pushed a PR to make it work for streaming. But there're other places in spark where we're using `.toString`. Not sure if it's a general concern. > Streaming ignores files with spaces in the file names > - > > Key: SPARK-21996 > URL: https://issues.apache.org/jira/browse/SPARK-21996 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: openjdk version "1.8.0_131" > OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11) > OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode) >Reporter: Ivan Sharamet > Attachments: spark-streaming.zip > > > I tried to stream text files from folder and noticed that files inside this > folder with spaces in their names ignored and there are some warnings in the > log: > {code} > 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory > file:/in/two%20two.txt was not found. Was it deleted very recently? > {code} > I found that this happens due to duplicate file path URI encoding (I suppose) > and the actual URI inside path objects looks like this > {{file:/in/two%2520two.txt}}. > To reproduce this issue just place some text files with spaces in their names > and execute some simple streaming code: > {code:java} > /in > /one.txt > /two two.txt > /three.txt > {code} > {code} > sparkSession.readStream.textFile("/in") > .writeStream > .option("checkpointLocation", "/checkpoint") > .format("text") > .start("/out") > .awaitTermination() > {code} > The result will contain only content of files {{one.txt}} and {{three.txt}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21996) Streaming ignores files with spaces in the file names
[ https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167941#comment-16167941 ] Xiayun Sun commented on SPARK-21996: I can reproduce this issue for master branch, and found out it was from using `.toString` instead of `.getPath` to get the file path in string format. I pushed a PR to make it work for streaming. But there're other places in spark where we're using `.toString`. Not sure if it's a general concern. > Streaming ignores files with spaces in the file names > - > > Key: SPARK-21996 > URL: https://issues.apache.org/jira/browse/SPARK-21996 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: openjdk version "1.8.0_131" > OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11) > OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode) >Reporter: Ivan Sharamet > Attachments: spark-streaming.zip > > > I tried to stream text files from folder and noticed that files inside this > folder with spaces in their names ignored and there are some warnings in the > log: > {code} > 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory > file:/in/two%20two.txt was not found. Was it deleted very recently? > {code} > I found that this happens due to duplicate file path URI encoding (I suppose) > and the actual URI inside path objects looks like this > {{file:/in/two%2520two.txt}}. > To reproduce this issue just place some text files with spaces in their names > and execute some simple streaming code: > {code:java} > /in > /one.txt > /two two.txt > /three.txt > {code} > {code} > sparkSession.readStream.textFile("/in") > .writeStream > .option("checkpointLocation", "/checkpoint") > .format("text") > .start("/out") > .awaitTermination() > {code} > The result will contain only content of files {{one.txt}} and {{three.txt}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
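To make the double encoding in SPARK-21996 concrete, here is a small {{java.net.URI}} illustration; it is not the code path inside Spark's file stream source, just the same encoding behaviour applied twice:

{code}
import java.net.URI

// The multi-argument URI constructors percent-encode their path argument,
// including any '%' already present in it.
val once = new URI("file", null, "/in/two two.txt", null).toString
// file:/in/two%20two.txt

// Encoding the already-encoded form escapes the '%' itself, so the resulting
// path no longer matches the file on disk.
val twice = new URI("file", null, "/in/two%20two.txt", null).toString
// file:/in/two%2520two.txt
{code}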
[jira] [Commented] (SPARK-21996) Streaming ignores files with spaces in the file names
[ https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167927#comment-16167927 ] Apache Spark commented on SPARK-21996: -- User 'xysun' has created a pull request for this issue: https://github.com/apache/spark/pull/19247 > Streaming ignores files with spaces in the file names > - > > Key: SPARK-21996 > URL: https://issues.apache.org/jira/browse/SPARK-21996 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: openjdk version "1.8.0_131" > OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11) > OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode) >Reporter: Ivan Sharamet > Attachments: spark-streaming.zip > > > I tried to stream text files from folder and noticed that files inside this > folder with spaces in their names ignored and there are some warnings in the > log: > {code} > 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory > file:/in/two%20two.txt was not found. Was it deleted very recently? > {code} > I found that this happens due to duplicate file path URI encoding (I suppose) > and the actual URI inside path objects looks like this > {{file:/in/two%2520two.txt}}. > To reproduce this issue just place some text files with spaces in their names > and execute some simple streaming code: > {code:java} > /in > /one.txt > /two two.txt > /three.txt > {code} > {code} > sparkSession.readStream.textFile("/in") > .writeStream > .option("checkpointLocation", "/checkpoint") > .format("text") > .start("/out") > .awaitTermination() > {code} > The result will contain only content of files {{one.txt}} and {{three.txt}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21996) Streaming ignores files with spaces in the file names
[ https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21996: Assignee: Apache Spark > Streaming ignores files with spaces in the file names > - > > Key: SPARK-21996 > URL: https://issues.apache.org/jira/browse/SPARK-21996 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: openjdk version "1.8.0_131" > OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11) > OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode) >Reporter: Ivan Sharamet >Assignee: Apache Spark > Attachments: spark-streaming.zip > > > I tried to stream text files from folder and noticed that files inside this > folder with spaces in their names ignored and there are some warnings in the > log: > {code} > 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory > file:/in/two%20two.txt was not found. Was it deleted very recently? > {code} > I found that this happens due to duplicate file path URI encoding (I suppose) > and the actual URI inside path objects looks like this > {{file:/in/two%2520two.txt}}. > To reproduce this issue just place some text files with spaces in their names > and execute some simple streaming code: > {code:java} > /in > /one.txt > /two two.txt > /three.txt > {code} > {code} > sparkSession.readStream.textFile("/in") > .writeStream > .option("checkpointLocation", "/checkpoint") > .format("text") > .start("/out") > .awaitTermination() > {code} > The result will contain only content of files {{one.txt}} and {{three.txt}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21996) Streaming ignores files with spaces in the file names
[ https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21996: Assignee: (was: Apache Spark) > Streaming ignores files with spaces in the file names > - > > Key: SPARK-21996 > URL: https://issues.apache.org/jira/browse/SPARK-21996 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: openjdk version "1.8.0_131" > OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11) > OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode) >Reporter: Ivan Sharamet > Attachments: spark-streaming.zip > > > I tried to stream text files from folder and noticed that files inside this > folder with spaces in their names ignored and there are some warnings in the > log: > {code} > 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory > file:/in/two%20two.txt was not found. Was it deleted very recently? > {code} > I found that this happens due to duplicate file path URI encoding (I suppose) > and the actual URI inside path objects looks like this > {{file:/in/two%2520two.txt}}. > To reproduce this issue just place some text files with spaces in their names > and execute some simple streaming code: > {code:java} > /in > /one.txt > /two two.txt > /three.txt > {code} > {code} > sparkSession.readStream.textFile("/in") > .writeStream > .option("checkpointLocation", "/checkpoint") > .format("text") > .start("/out") > .awaitTermination() > {code} > The result will contain only content of files {{one.txt}} and {{three.txt}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22026) data source v2 write path
Wenchen Fan created SPARK-22026: --- Summary: data source v2 write path Key: SPARK-22026 URL: https://issues.apache.org/jira/browse/SPARK-22026 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22024) [pySpark] Speeding up fromInternal methods
[ https://issues.apache.org/jira/browse/SPARK-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22024: --- Description: fromInternal methods of pySpark datatypes are bottleneck when using pySpark. This is umbrella ticket to different optimization in these methods. was: fromInternal methods of pySpark datatypes are bottleneck when using pySpark. This is umbrella ticket to different optimization in thos methods. > [pySpark] Speeding up fromInternal methods > -- > > Key: SPARK-22024 > URL: https://issues.apache.org/jira/browse/SPARK-22024 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > fromInternal methods of pySpark datatypes are bottleneck when using pySpark. > This is umbrella ticket to different optimization in these methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.
[ https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-21958: -- Assignee: Travis Hegner > Attempting to save large Word2Vec model hangs driver in constant GC. > > > Key: SPARK-21958 > URL: https://issues.apache.org/jira/browse/SPARK-21958 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 > Environment: Running spark on yarn, hadoop 2.7.2 provided by the > cluster >Reporter: Travis Hegner >Assignee: Travis Hegner > Labels: easyfix, patch, performance > Fix For: 2.3.0 > > > In the new version of Word2Vec, the model saving was modified to estimate an > appropriate number of partitions based on the kryo buffer size. This is a > great improvement, but there is a caveat for very large models. > The {{(word, vector)}} tuple goes through a transformation to a local case > class of {{Data(word, vector)}}... I can only assume this is for the kryo > serialization process. The new version of the code iterates over the entire > vocabulary to do this transformation (the old version wrapped the entire > datum) in the driver's heap. Only to have the result then distributed to the > cluster to be written into it's parquet files. > With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, > and tri-grams), that local driver transformation is causing the driver to > hang indefinitely in GC as I can only assume that it's generating millions of > short lived objects which can't be evicted fast enough. > Perhaps I'm overlooking something, but it seems to me that since the result > is distributed over the cluster to be saved _after_ the transformation > anyway, we may as well distribute it _first_, allowing the cluster resources > to do the transformation more efficiently, and then write the parquet file > from there. > I have a patch implemented, and am in the process of testing it at scale. I > will open a pull request when I feel that the patch is successfully resolving > the issue, and after making sure that it passes unit tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.
[ https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-21958. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19191 [https://github.com/apache/spark/pull/19191] > Attempting to save large Word2Vec model hangs driver in constant GC. > > > Key: SPARK-21958 > URL: https://issues.apache.org/jira/browse/SPARK-21958 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 > Environment: Running spark on yarn, hadoop 2.7.2 provided by the > cluster >Reporter: Travis Hegner > Labels: easyfix, patch, performance > Fix For: 2.3.0 > > > In the new version of Word2Vec, the model saving was modified to estimate an > appropriate number of partitions based on the kryo buffer size. This is a > great improvement, but there is a caveat for very large models. > The {{(word, vector)}} tuple goes through a transformation to a local case > class of {{Data(word, vector)}}... I can only assume this is for the kryo > serialization process. The new version of the code iterates over the entire > vocabulary to do this transformation (the old version wrapped the entire > datum) in the driver's heap. Only to have the result then distributed to the > cluster to be written into it's parquet files. > With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, > and tri-grams), that local driver transformation is causing the driver to > hang indefinitely in GC as I can only assume that it's generating millions of > short lived objects which can't be evicted fast enough. > Perhaps I'm overlooking something, but it seems to me that since the result > is distributed over the cluster to be saved _after_ the transformation > anyway, we may as well distribute it _first_, allowing the cluster resources > to do the transformation more efficiently, and then write the parquet file > from there. > I have a patch implemented, and am in the process of testing it at scale. I > will open a pull request when I feel that the patch is successfully resolving > the issue, and after making sure that it passes unit tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
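A minimal sketch of the approach described in SPARK-21958 (distribute the vocabulary first and build the per-word records on the executors), using hypothetical names ({{wordVectors}}, {{Data}}, {{saveAsParquet}}) rather than the code from pull request 19191:

{code}
import org.apache.spark.sql.SparkSession

case class Data(word: String, vector: Array[Float])

// Ship the raw (word, vector) pairs to the executors and build the Data
// records there, instead of materialising millions of short-lived objects
// in the driver heap before writing.
def saveAsParquet(spark: SparkSession,
                  wordVectors: Map[String, Array[Float]],
                  path: String,
                  numPartitions: Int): Unit = {
  import spark.implicits._
  spark.sparkContext
    .parallelize(wordVectors.toSeq, numPartitions)
    .map { case (word, vector) => Data(word, vector) }
    .toDF()
    .write.parquet(path)
}
{code}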
[jira] [Assigned] (SPARK-22025) Speeding up fromInternal for StructField
[ https://issues.apache.org/jira/browse/SPARK-22025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22025: Assignee: (was: Apache Spark) > Speeding up fromInternal for StructField > > > Key: SPARK-22025 > URL: https://issues.apache.org/jira/browse/SPARK-22025 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > Current code for StructField is simple function call. > https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434 > We can change it to function reference. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22025) Speeding up fromInternal for StructField
[ https://issues.apache.org/jira/browse/SPARK-22025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22025: Assignee: Apache Spark > Speeding up fromInternal for StructField > > > Key: SPARK-22025 > URL: https://issues.apache.org/jira/browse/SPARK-22025 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > > Current code for StructField is simple function call. > https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434 > We can change it to function reference. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22025) Speeding up fromInternal for StructField
[ https://issues.apache.org/jira/browse/SPARK-22025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167846#comment-16167846 ] Apache Spark commented on SPARK-22025: -- User 'maver1ck' has created a pull request for this issue: https://github.com/apache/spark/pull/19246 > Speeding up fromInternal for StructField > > > Key: SPARK-22025 > URL: https://issues.apache.org/jira/browse/SPARK-22025 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > Current code for StructField is simple function call. > https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434 > We can change it to function reference. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karan Singh updated SPARK-22012: Description: My Spark Streaming duration is 5 seconds (5000) and kafka is all at its default properties , i am still facing this issue , Can anyone tell me how to resolve it what i am doing wrong ? JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000)); Exception in Spark Streamings Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-xxx pulse 1 163684030 after polling for 512 at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) at com.xxx.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) at com.xxx.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) was: My Spark Streaming duration is 5 seconds (5000) and kafka is all at its default properties , i am still facing this issue , Can anyone tell me how to resolve it what i am doing wrong ? 
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000)); Exception in Spark Streamings Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-xxx pulse 1 163684030 after polling for 512 at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) at com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) at com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after > polling for 512" > -- > > Key: SPARK-22012 >
[jira] [Comment Edited] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166112#comment-16166112 ] Karan Singh edited comment on SPARK-19275 at 9/15/17 1:02 PM: -- Hi Team , My Spark Streaming duration is 5 seconds (5000) and kafka is all at its default properties , i am still facing this issue , Can anyone tell me how to resolve it what i am doing wrong ? JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000)); Exception in Spark Streamings Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-xxx pulse 1 163684030 after polling for 512 at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) at com.xxx.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) at com.xxx.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) was (Author: karansinghkjs346): Hi Team , My Spark Streaming duration is 5 seconds (5000) and kafka is all at its default properties , i am still facing this issue , Can anyone tell me how to resolve it what i am doing wrong ? 
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000)); Exception in Spark Streamings Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-xxx pulse 1 163684030 after polling for 512 at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31) at com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165) at com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > Spark Streaming, Kafka receiver, "Failed to get records for ... after polling > for 512" >
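For the polling failures discussed in SPARK-19275 and SPARK-22012, the 512 in the assertion message is the consumer poll timeout in milliseconds. A sketch of the usual mitigation, assuming the {{spark.streaming.kafka.consumer.poll.ms}} setting of the kafka-0-10 integration is the relevant knob:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Give the executor-side Kafka consumer more time per poll than the 512 ms
// default before the CachedKafkaConsumer assertion fires.
val conf = new SparkConf()
  .setAppName("kafka-stream")
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")

val ssc = new StreamingContext(conf, Seconds(5))
{code}

Raising the timeout only buys time; if polls still fail, broker-side slowness or frequent consumer-group rebalances are the usual suspects.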
[jira] [Created] (SPARK-22025) Speeding up fromInternal for StructField
Maciej Bryński created SPARK-22025: -- Summary: Speeding up fromInternal for StructField Key: SPARK-22025 URL: https://issues.apache.org/jira/browse/SPARK-22025 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.2.0 Reporter: Maciej Bryński The current code for StructField's fromInternal is a simple function call. https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434 We can change it to a function reference. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22024) [pySpark] Speeding up fromInternal methods
[ https://issues.apache.org/jira/browse/SPARK-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22024: --- Summary: [pySpark] Speeding up fromInternal methods (was: Speeding up fromInternal methods) > [pySpark] Speeding up fromInternal methods > -- > > Key: SPARK-22024 > URL: https://issues.apache.org/jira/browse/SPARK-22024 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński > > fromInternal methods of pySpark datatypes are bottleneck when using pySpark. > This is umbrella ticket to different optimization in thos methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-22010: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-22024 > Slow fromInternal conversion for TimestampType > -- > > Key: SPARK-22010 > URL: https://issues.apache.org/jira/browse/SPARK-22010 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Bryński >Priority: Minor > Attachments: profile_fact_dok.png > > > To convert timestamp type to python we are using > {code}datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % > 100){code} > code. > {code} > In [34]: %%timeit > ...: > datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344) > ...: > 4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each) > {code} > It's slow, because: > # we're trying to get TZ on every conversion > # we're using replace method > Proposed solution: custom datetime conversion and move calculation of TZ to > module -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22024) Speeding up fromInternal methods
Maciej Bryński created SPARK-22024: -- Summary: Speeding up fromInternal methods Key: SPARK-22024 URL: https://issues.apache.org/jira/browse/SPARK-22024 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.2.0 Reporter: Maciej Bryński fromInternal methods of pySpark datatypes are bottleneck when using pySpark. This is umbrella ticket to different optimization in thos methods. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7276) withColumn is very slow on dataframe with large number of columns
[ https://issues.apache.org/jira/browse/SPARK-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167814#comment-16167814 ] Barry Becker commented on SPARK-7276: - Isn't there still a problem with withColumn performance in 2.1.1? I noticed that using a select that adds a bunch of derived columns at once (using column expressions) is *much* faster than adding a bunch of withColumn's. Has anyone else noticed this? It might be good to have a "withColumns" method to avoid the overhead of doing multiple withColumn's. > withColumn is very slow on dataframe with large number of columns > - > > Key: SPARK-7276 > URL: https://issues.apache.org/jira/browse/SPARK-7276 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.1 >Reporter: Alexandre CLEMENT >Assignee: Wenchen Fan > Fix For: 1.4.0 > > > The code snippet demonstrates the problem. > {code} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val sparkConf = new SparkConf().setAppName("Spark > Test").setMaster(System.getProperty("spark.master", "local[4]")) > val sc = new SparkContext(sparkConf) > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val custs = Seq( > Row(1, "Bob", 21, 80.5), > Row(2, "Bobby", 21, 80.5), > Row(3, "Jean", 21, 80.5), > Row(4, "Fatime", 21, 80.5) > ) > var fields = List( > StructField("id", IntegerType, true), > StructField("a", IntegerType, true), > StructField("b", StringType, true), > StructField("target", DoubleType, false)) > val schema = StructType(fields) > var rdd = sc.parallelize(custs) > var df = sqlContext.createDataFrame(rdd, schema) > for (i <- 1 to 200) { > val now = System.currentTimeMillis > df = df.withColumn("a_new_col_" + i, df("a") + i) > println(s"$i -> " + (System.currentTimeMillis - now)) > } > df.show() > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
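A sketch of the single-select approach from the comment above, reusing the {{df}} and column names from the quoted snippet: all 200 derived columns go into one projection, so the plan is analysed once rather than once per call.

{code}
import org.apache.spark.sql.functions.col

// Build every derived column up front and add them in a single projection.
val derived = (1 to 200).map(i => (col("a") + i).alias("a_new_col_" + i))
val dfAll = df.select(df.columns.map(col) ++ derived: _*)
{code}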
[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
[ https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167806#comment-16167806 ] Nick Pentreath commented on SPARK-22021: Why a JavaScript function? I think this is not a good fit to go into Spark ML core. You can easily have this as an external library or Spark package. We are looking at potentially a transformer for generic Scala functions in SPARK-20271 > Add a feature transformation to accept a function and apply it on all rows of > dataframe > --- > > Key: SPARK-22021 > URL: https://issues.apache.org/jira/browse/SPARK-22021 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Hosur Narahari > > More often we generate derived features in ML pipeline by doing some > mathematical or other kind of operation on columns of dataframe like getting > a total of few columns as a new column or if there is text field message and > we want the length of message etc. We currently don't have an efficient way > to handle such scenario in ML pipeline. > By Providing a transformer which accepts a function and performs that on > mentioned columns to generate output column of numerical type, user has the > flexibility to derive features by applying any domain specific logic. > Example: > val function = "function(a,b) { return a+b;}" > val transformer = new GenFuncTransformer().setInputCols(Array("v1", > "v2")).setOutputCol("result").setFunction(function) > val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2") > val result = transformer.transform(df) > result.show > v1 v2 result > 1.0 2.0 3.0 > 3.0 4.0 7.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
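For reference, one way to express the derived column from the ticket's example inside an existing ML Pipeline today is the SQLTransformer stage; a rough PySpark sketch, with the column names taken from the example:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

spark = SparkSession.builder.master("local[2]").appName("derived-feature").getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["v1", "v2"])

# SQLTransformer substitutes __THIS__ with the input DataFrame, so arbitrary SQL
# expressions can produce derived feature columns as a regular pipeline stage.
sum_stage = SQLTransformer(statement="SELECT *, v1 + v2 AS result FROM __THIS__")
Pipeline(stages=[sum_stage]).fit(df).transform(df).show()
{code}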
[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167788#comment-16167788 ] Jurgis Pods edited comment on SPARK-21994 at 9/15/17 12:13 PM: --- I have updated to CDH 5.12.1 and the problem persists. There is an existing topic on the Cloudera forums with exactly this problem: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Saving-Spark-2-2-dataframs-in-Hive-table/m-p/58463/thread-id/2867 was (Author: pederpansen): I have updated to CDH 5.12.1 and the problem persists. There is an existing topic on the Cloudera forums with exactly this problem: http://community.cloudera.com/t5/forums/replypage/board-id/Spark/message-id/2867 > Spark 2.2 can not read Parquet table created by itself > -- > > Key: SPARK-21994 > URL: https://issues.apache.org/jira/browse/SPARK-21994 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1 >Reporter: Jurgis Pods > > This seems to be a new bug introduced in Spark 2.2, since it did not occur > under Spark 2.1. > When writing a dataframe to a table in Parquet format, Spark SQL does not > write the 'path' of the table to the Hive metastore, unlike in previous > versions. > As a consequence, Spark 2.2 is not able to read the table it just created. It > just outputs the table header without any row content. > A parallel installation of Spark 1.6 at least produces an appropriate error > trace: > {code:java} > 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found > in metastore. hive.metastore.schema.verification is not enabled so recording > the schema version 1.1.0 > 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, > returning NoSuchObjectException > org.spark-project.guava.util.concurrent.UncheckedExecutionException: > java.util.NoSuchElementException: key not found: path > [...] > {code} > h3. Steps to reproduce: > Run the following in spark2-shell: > {code:java} > scala> val df = spark.sql("show databases") > scala> df.show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > scala> df.write.format("parquet").saveAsTable("test.spark22_test") > scala> spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > ++{code} > When manually setting the path (causing the data to be saved as external > table), it works: > {code:java} > scala> df.write.option("path", > "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path") > scala> spark.sql("select * from test.spark22_parquet_with_path").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > A second workaround is to update the metadata of the managed table created by > Spark 2.2: > {code} > spark.sql("alter table test.spark22_test set SERDEPROPERTIES > ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')") > spark.catalog.refreshTable("test.spark22_test") > spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > It is kind of a disaster that we are not able to read tables created by the > very same Spark version and have to manually specify the path as an explicit > option. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167788#comment-16167788 ] Jurgis Pods commented on SPARK-21994: - I have updated to CDH 5.12.1 and the problem persists. There is an existing topic on the Cloudera forums with exactly this problem: http://community.cloudera.com/t5/forums/replypage/board-id/Spark/message-id/2867 > Spark 2.2 can not read Parquet table created by itself > -- > > Key: SPARK-21994 > URL: https://issues.apache.org/jira/browse/SPARK-21994 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1 >Reporter: Jurgis Pods > > This seems to be a new bug introduced in Spark 2.2, since it did not occur > under Spark 2.1. > When writing a dataframe to a table in Parquet format, Spark SQL does not > write the 'path' of the table to the Hive metastore, unlike in previous > versions. > As a consequence, Spark 2.2 is not able to read the table it just created. It > just outputs the table header without any row content. > A parallel installation of Spark 1.6 at least produces an appropriate error > trace: > {code:java} > 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found > in metastore. hive.metastore.schema.verification is not enabled so recording > the schema version 1.1.0 > 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, > returning NoSuchObjectException > org.spark-project.guava.util.concurrent.UncheckedExecutionException: > java.util.NoSuchElementException: key not found: path > [...] > {code} > h3. Steps to reproduce: > Run the following in spark2-shell: > {code:java} > scala> val df = spark.sql("show databases") > scala> df.show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > scala> df.write.format("parquet").saveAsTable("test.spark22_test") > scala> spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > ++{code} > When manually setting the path (causing the data to be saved as external > table), it works: > {code:java} > scala> df.write.option("path", > "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path") > scala> spark.sql("select * from test.spark22_parquet_with_path").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > A second workaround is to update the metadata of the managed table created by > Spark 2.2: > {code} > spark.sql("alter table test.spark22_test set SERDEPROPERTIES > ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')") > spark.catalog.refreshTable("test.spark22_test") > spark.sql("select * from test.spark22_test").show() > ++ > |databaseName| > ++ > | mydb1| > | mydb2| > | default| > |test| > ++ > {code} > It is kind of a disaster that we are not able to read tables created by the > very same Spark version and have to manually specify the path as an explicit > option. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself
[ https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jurgis Pods updated SPARK-21994: Description: This seems to be a new bug introduced in Spark 2.2, since it did not occur under Spark 2.1. When writing a dataframe to a table in Parquet format, Spark SQL does not write the 'path' of the table to the Hive metastore, unlike in previous versions. As a consequence, Spark 2.2 is not able to read the table it just created. It just outputs the table header without any row content. A parallel installation of Spark 1.6 at least produces an appropriate error trace: {code:java} 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: key not found: path [...] {code} h3. Steps to reproduce: Run the following in spark2-shell: {code:java} scala> val df = spark.sql("show databases") scala> df.show() ++ |databaseName| ++ | mydb1| | mydb2| | default| |test| ++ scala> df.write.format("parquet").saveAsTable("test.spark22_test") scala> spark.sql("select * from test.spark22_test").show() ++ |databaseName| ++ ++{code} When manually setting the path (causing the data to be saved as external table), it works: {code:java} scala> df.write.option("path", "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path") scala> spark.sql("select * from test.spark22_parquet_with_path").show() ++ |databaseName| ++ | mydb1| | mydb2| | default| |test| ++ {code} A second workaround is to update the metadata of the managed table created by Spark 2.2: {code} spark.sql("alter table test.spark22_test set SERDEPROPERTIES ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')") spark.catalog.refreshTable("test.spark22_test") spark.sql("select * from test.spark22_test").show() ++ |databaseName| ++ | mydb1| | mydb2| | default| |test| ++ {code} It is kind of a disaster that we are not able to read tables created by the very same Spark version and have to manually specify the path as an explicit option. was: This seems to be a new bug introduced in Spark 2.2, since it did not occur under Spark 2.1. When writing a dataframe to a table in Parquet format, Spark SQL does not write the 'path' of the table to the Hive metastore, unlike in previous versions. As a consequence, Spark 2.2 is not able to read the table it just created. It just outputs the table header without any row content. A parallel installation of Spark 1.6 at least produces an appropriate error trace: {code:java} 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: key not found: path [...] {code} h3. 
Steps to reproduce: Run the following in spark2-shell: {code:java} scala> val df = spark.sql("show databases") scala> df.show() ++ |databaseName| ++ | mydb1| | mydb2| | default| |test| ++ scala> df.write.format("parquet").saveAsTable("test.spark22_test") scala> spark.sql("select * from test.spark22_test").show() ++ |databaseName| ++ ++{code} When manually setting the path, it works: {code:java} scala> df.write.option("path", "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path") scala> spark.sql("select * from test.spark22_parquet_with_path").show() ++ |databaseName| ++ | mydb1| | mydb2| | default| |test| ++ {code} It is kind of a disaster that we are not able to read tables created by the very same Spark version and have to manually specify the path as an explicit option. > Spark 2.2 can not read Parquet table created
[jira] [Updated] (SPARK-22023) Multi-column Spark SQL UDFs broken in Python 3
[ https://issues.apache.org/jira/browse/SPARK-22023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oli Hall updated SPARK-22023: - Description: I've been testing some existing PySpark code after migrating to Python3, and there seems to be an issue with multi-column UDFs in Spark SQL. Essentially, any UDF that takes in more than one column as input fails with an error relating to expansion of an underlying lambda expression: {code} org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main process() File "/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/python/lib/pyspark.zip/pyspark/serializers.py", line 237, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/python/lib/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream for obj in iterator: File "/python/lib/pyspark.zip/pyspark/serializers.py", line 226, in _batched for item in iterator: File "", line 1, in File "/python/lib/pyspark.zip/pyspark/worker.py", line 71, in return lambda *a: f(*a) TypeError: () takes 1 positional argument but 2 were given at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} This syntax should (and does) work in Python 2, but is not valid in Python 3, I believe. I have a minimal example that reproduces the error, running in the PySpark shell, with Python 3.6.2, Spark 2.2: {code} >>> from pyspark.sql.functions import udf >>> from pyspark.sql.types import LongType >>> >>> df = spark.createDataFrame(sc.parallelize([{'a': 1, 'b': 1}, {'a': 2, 'b': >>> 2}])) /Users/oli-hall/Documents/Code/spark2/python/pyspark/sql/session.py:351: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead warnings.warn("Using RDD of dict to inferSchema is deprecated. 
" >>> df.printSchema() root |-- a: long (nullable = true) |-- b: long (nullable = true) >>> sum_udf = udf(lambda x: x[0] + x[1], LongType()) >>> >>> with_sum = df.withColumn('sum', sum_udf('a', 'b')) >>> >>> with_sum.first() 17/09/15 11:43:56 ERROR executor.Executor: Exception in task 2.0 in stage 3.0 (TID 8) ... (error snipped) TypeError: () takes 1 positional argument but 2 were given {code} I've managed to work around it for now, by pushing the input columns into a struct, then modifying the UDF to read from the struct, but it'd be good if I could use multi-column input again. Workaround: {code} >>> from pyspark.sql.functions import udf, struct >>> from pyspark.sql.types import LongType >>> >>> df = spark.createDataFrame(sc.parallelize([{'a': 1, 'b': 1}, {'a': 2, 'b': >>> 2}])) >>> >>> sum_udf = udf(lambda x: x.a + x.b, LongType()) >>> >>> with_sum = df.withColumn('temp_struct', struct('a', 'b')).withColumn('sum', >>> sum_udf('temp_struct')) >>> with_sum.first() Row(a=1, b=1, temp_struct=Row(a=1, b=1), sum=2) {code} was: I've been testing some existing PySpark code after migrating to Python3, and there seems to be an issue with multi-column UDFs in Spark SQL. Es
[jira] [Created] (SPARK-22023) Multi-column Spark SQL UDFs broken in Python 3
Oli Hall created SPARK-22023: Summary: Multi-column Spark SQL UDFs broken in Python 3 Key: SPARK-22023 URL: https://issues.apache.org/jira/browse/SPARK-22023 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Environment: Python3 Reporter: Oli Hall I've been testing some existing PySpark code after migrating to Python3, and there seems to be an issue with multi-column UDFs in Spark SQL. Essentially, any UDF that takes in more than one column as input fails with an error relating to expansion of an underlying lambda expression: {code:bash} org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main process() File "/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/python/lib/pyspark.zip/pyspark/serializers.py", line 237, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/python/lib/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream for obj in iterator: File "/python/lib/pyspark.zip/pyspark/serializers.py", line 226, in _batched for item in iterator: File "", line 1, in File "/python/lib/pyspark.zip/pyspark/worker.py", line 71, in return lambda *a: f(*a) TypeError: () takes 1 positional argument but 2 were given at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144) at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} This syntax should (and does) work in Python 2, but is not valid in Python 3, I believe. I have a minimal example that reproduces the error, running in the PySpark shell, with Python 3.6.2, Spark 2.2: {code:python} >>> from pyspark.sql.functions import udf >>> from pyspark.sql.types import LongType >>> >>> df = spark.createDataFrame(sc.parallelize([{'a': 1, 'b': 1}, {'a': 2, 'b': >>> 2}])) /Users/oli-hall/Documents/Code/spark2/python/pyspark/sql/session.py:351: UserWarning: Using RDD of dict to inferSchema is deprecated. 
Use pyspark.sql.Row instead warnings.warn("Using RDD of dict to inferSchema is deprecated. " >>> df.printSchema() root |-- a: long (nullable = true) |-- b: long (nullable = true) >>> sum_udf = udf(lambda x: x[0] + x[1], LongType()) >>> >>> with_sum = df.withColumn('sum', sum_udf('a', 'b')) >>> >>> with_sum.first() 17/09/15 11:43:56 ERROR executor.Executor: Exception in task 2.0 in stage 3.0 (TID 8) ... TypeError: () takes 1 positional argument but 2 were given {code} I've managed to work around it for now, by pushing the input columns into a struct, then modifying the UDF to read from the struct, but it'd be good if I could use multi-column input again. Workaround: {code:python} >>> from pyspark.sql.functions import udf, struct >>> from pyspark.sql.types import LongType >>> >>> df = spark.createDataFrame(sc.parallelize([{'a': 1, 'b': 1}, {'a': 2, 'b': >>> 2}])) >>> >>> sum_udf = udf(lambda x: x.a + x.b, LongType()) >>> >>> with_sum = df.withColumn('temp_struct', struct('a', 'b')).withColumn('sum', >>> sum_udf('temp_struct')) >>> with_sum.first
[jira] [Assigned] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
[ https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22021: Assignee: (was: Apache Spark) > Add a feature transformation to accept a function and apply it on all rows of > dataframe > --- > > Key: SPARK-22021 > URL: https://issues.apache.org/jira/browse/SPARK-22021 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Hosur Narahari > > More often we generate derived features in ML pipeline by doing some > mathematical or other kind of operation on columns of dataframe like getting > a total of few columns as a new column or if there is text field message and > we want the length of message etc. We currently don't have an efficient way > to handle such scenario in ML pipeline. > By Providing a transformer which accepts a function and performs that on > mentioned columns to generate output column of numerical type, user has the > flexibility to derive features by applying any domain specific logic. > Example: > val function = "function(a,b) { return a+b;}" > val transformer = new GenFuncTransformer().setInputCols(Array("v1", > "v2")).setOutputCol("result").setFunction(function) > val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2") > val result = transformer.transform(df) > result.show > v1 v2 result > 1.0 2.0 3.0 > 3.0 4.0 7.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
[ https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167753#comment-16167753 ] Apache Spark commented on SPARK-22021: -- User 'narahari92' has created a pull request for this issue: https://github.com/apache/spark/pull/19244 > Add a feature transformation to accept a function and apply it on all rows of > dataframe > --- > > Key: SPARK-22021 > URL: https://issues.apache.org/jira/browse/SPARK-22021 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Hosur Narahari > > More often we generate derived features in ML pipeline by doing some > mathematical or other kind of operation on columns of dataframe like getting > a total of few columns as a new column or if there is text field message and > we want the length of message etc. We currently don't have an efficient way > to handle such scenario in ML pipeline. > By Providing a transformer which accepts a function and performs that on > mentioned columns to generate output column of numerical type, user has the > flexibility to derive features by applying any domain specific logic. > Example: > val function = "function(a,b) { return a+b;}" > val transformer = new GenFuncTransformer().setInputCols(Array("v1", > "v2")).setOutputCol("result").setFunction(function) > val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2") > val result = transformer.transform(df) > result.show > v1 v2 result > 1.0 2.0 3.0 > 3.0 4.0 7.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
[ https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22021: Assignee: Apache Spark > Add a feature transformation to accept a function and apply it on all rows of > dataframe > --- > > Key: SPARK-22021 > URL: https://issues.apache.org/jira/browse/SPARK-22021 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Hosur Narahari >Assignee: Apache Spark > > More often we generate derived features in ML pipeline by doing some > mathematical or other kind of operation on columns of dataframe like getting > a total of few columns as a new column or if there is text field message and > we want the length of message etc. We currently don't have an efficient way > to handle such scenario in ML pipeline. > By Providing a transformer which accepts a function and performs that on > mentioned columns to generate output column of numerical type, user has the > flexibility to derive features by applying any domain specific logic. > Example: > val function = "function(a,b) { return a+b;}" > val transformer = new GenFuncTransformer().setInputCols(Array("v1", > "v2")).setOutputCol("result").setFunction(function) > val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2") > val result = transformer.transform(df) > result.show > v1 v2 result > 1.0 2.0 3.0 > 3.0 4.0 7.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734 ] Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:29 AM: -- The alternative is giving the explicit schema instead inferring that you don't need to change your pojo class in above test case. {code} StructType schema = new StructType() .add("id", IntegerType) .add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} was (Author: jmchung): The alternative is giving the explicit schema instead inferring, means you don't need to change your pojo class. {code} StructType schema = new StructType() .add("id", IntegerType) .add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData data) { > ArrayList arr = new ArrayList<>(); > arr.add(new SampleDataFlat(data.getStr(), data.getId())); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+1)); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+2)); > return arr.iterator(); > } > } > {code} > ==Error message== > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 38, Column 16: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 38, Column 16: No applicable constructor/method found for actual parameters > "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)" > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new > SparkUnitTest$SampleData(); > /* 028 */ this.javaBean = value1; > /* 029 */ if (!false) { > /* 030 */ > /* 031 */ > /* 032 */ boolean isNull3 = i.isNullAt(0); > /* 033 */ long value3 = isNull3 ? 
-1L : (i.getLong(0)); > /* 034 */ > /* 035 */ if (isNull3) { > /* 036 */ throw new NullPointerException(((java.lang.String) > references[0])); > /* 037 */ } > /* 038 */ javaBean.setId(value3); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
[ https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167736#comment-16167736 ] Hosur Narahari commented on SPARK-22021: If I just apply this function, I can't use it in Spark's ML pipeline and would have to break it into two sub-pipelines, applying the function (the logic) in between. By providing a transformer we can keep a single pipeline end to end. It would also be a one-stop, generic way to derive any feature. > Add a feature transformation to accept a function and apply it on all rows of > dataframe > --- > > Key: SPARK-22021 > URL: https://issues.apache.org/jira/browse/SPARK-22021 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Hosur Narahari > > More often we generate derived features in ML pipeline by doing some > mathematical or other kind of operation on columns of dataframe like getting > a total of few columns as a new column or if there is text field message and > we want the length of message etc. We currently don't have an efficient way > to handle such scenario in ML pipeline. > By Providing a transformer which accepts a function and performs that on > mentioned columns to generate output column of numerical type, user has the > flexibility to derive features by applying any domain specific logic. > Example: > val function = "function(a,b) { return a+b;}" > val transformer = new GenFuncTransformer().setInputCols(Array("v1", > "v2")).setOutputCol("result").setFunction(function) > val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2") > val result = transformer.transform(df) > result.show > v1 v2 result > 1.0 2.0 3.0 > 3.0 4.0 7.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734 ] Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:28 AM: -- The alternative is giving the explicit schema instead inferring, means you don't need to change your pojo class. {code} StructType schema = new StructType() .add("id", IntegerType) .add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} was (Author: jmchung): The alternative is giving the explicit schema instead inferring, means you don't need to change your pojo class. {code} StructType schema = new StructType().add("id", IntegerType).add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData data) { > ArrayList arr = new ArrayList<>(); > arr.add(new SampleDataFlat(data.getStr(), data.getId())); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+1)); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+2)); > return arr.iterator(); > } > } > {code} > ==Error message== > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 38, Column 16: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 38, Column 16: No applicable constructor/method found for actual parameters > "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)" > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new > SparkUnitTest$SampleData(); > /* 028 */ this.javaBean = value1; > /* 029 */ if (!false) { > /* 030 */ > /* 031 */ > /* 032 */ boolean isNull3 = i.isNullAt(0); > /* 033 */ long value3 = isNull3 ? 
-1L : (i.getLong(0)); > /* 034 */ > /* 035 */ if (isNull3) { > /* 036 */ throw new NullPointerException(((java.lang.String) > references[0])); > /* 037 */ } > /* 038 */ javaBean.setId(value3); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734 ] Jen-Ming Chung commented on SPARK-22019: The alternative is giving the explicit schema instead inferring, means you don't need to change your pojo class. {code} StructType schema = new StructType().add("id", IntegerType).add("str", StringType); Dataset df = spark.read().schema(schema).json(stringdataset).as( org.apache.spark.sql.Encoders.bean(SampleData.class)); {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData data) { > ArrayList arr = new ArrayList<>(); > arr.add(new SampleDataFlat(data.getStr(), data.getId())); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+1)); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+2)); > return arr.iterator(); > } > } > {code} > ==Error message== > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 38, Column 16: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 38, Column 16: No applicable constructor/method found for actual parameters > "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)" > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new > SparkUnitTest$SampleData(); > /* 028 */ this.javaBean = value1; > /* 029 */ if (!false) { > /* 030 */ > /* 031 */ > /* 032 */ boolean isNull3 = i.isNullAt(0); > /* 033 */ long value3 = isNull3 ? -1L : (i.getLong(0)); > /* 034 */ > /* 035 */ if (isNull3) { > /* 036 */ throw new NullPointerException(((java.lang.String) > references[0])); > /* 037 */ } > /* 038 */ javaBean.setId(value3); -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167725#comment-16167725 ] Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:18 AM: -- Hi [~client.test], The schema inferred after {{sqc.read().json(stringdataset)}} as below, {code} root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} However, the pojo class {{SampleData.class}} the member {{id}} is declared as {{int}} instead of {{long}}, this will cause the subsequent exception in your test case. So set the {{long}} type to {{id}} in {{SampleData.class}} then executing the test case again, you can expect the following results: {code} ++ | str| ++ |everyone| |everyone| |everyone| | Hello| | Hello| | Hello| ++ root |-- str: string (nullable = true) {code} As you can see, we missing the {{id}} in schema, we need to add the {{id}} and corresponding getter and setter, {code} class SampleDataFlat { ... long id; public long getId() { return id; } public void setId(long id) { this.id = id; } public SampleDataFlat(String str, long id) { this.str = str; this.id = id; } ... } {code} Then you will get the following results: {code} +---++ | id| str| +---++ | 1|everyone| | 2|everyone| | 3|everyone| | 1| Hello| | 2| Hello| | 3| Hello| +---++ root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} was (Author: jmchung): Hi [~client.test], The schema inferred after {{sqc.read().json(stringdataset)}} as below, {code} root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} However, the pojo class {{SampleData.class}} the member {{id}} is declared as {{int}} instead of {{long}}, this will cause the subsequent exception in your test case. So set the {{long}} type to {{id}} in {{SampleData.class}} then executing the test case again, you can expect the following results: {code} ++ | str| ++ |everyone| |everyone| |everyone| | Hello| | Hello| | Hello| ++ root |-- str: string (nullable = true) {code} As you can see, we missing the {{id}} in schema, we need to add the {{id}} and corresponding getter and setter, {code} class SampleDataFlat { long id; public long getId() { return id; } public void setId(long id) { this.id = id; } public SampleDataFlat(String str, long id) { this.str = str; this.id = id; } } {code} Then you will get the following results: {code} +---++ | id| str| +---++ | 1|everyone| | 2|everyone| | 3|everyone| | 1| Hello| | 2| Hello| | 3| Hello| +---++ root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. 
convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData da
[jira] [Commented] (SPARK-22019) JavaBean int type property
[ https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167725#comment-16167725 ] Jen-Ming Chung commented on SPARK-22019: Hi [~client.test], The schema inferred after {{sqc.read().json(stringdataset)}} as below, {code} root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} However, the pojo class {{SampleData.class}} the member {{id}} is declared as {{int}} instead of {{long}}, this will cause the subsequent exception in your test case. So set the {{long}} type to {{id}} in {{SampleData.class}} then executing the test case again, you can expect the following results: {code} ++ | str| ++ |everyone| |everyone| |everyone| | Hello| | Hello| | Hello| ++ root |-- str: string (nullable = true) {code} As you can see, we missing the {{id}} in schema, we need to add the {{id}} and corresponding getter and setter, {code} class SampleDataFlat { long id; public long getId() { return id; } public void setId(long id) { this.id = id; } public SampleDataFlat(String str, long id) { this.str = str; this.id = id; } } {code} Then you will get the following results: {code} +---++ | id| str| +---++ | 1|everyone| | 2|everyone| | 3|everyone| | 1| Hello| | 2| Hello| | 3| Hello| +---++ root |-- id: long (nullable = true) |-- str: string (nullable = true) {code} > JavaBean int type property > --- > > Key: SPARK-22019 > URL: https://issues.apache.org/jira/browse/SPARK-22019 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: taiho choi > > when the type of SampleData's id is int, following code generates errors. > when long, it's ok. > > {code:java} > @Test > public void testDataSet2() { > ArrayList arr= new ArrayList(); > arr.add("{\"str\": \"everyone\", \"id\": 1}"); > arr.add("{\"str\": \"Hello\", \"id\": 1}"); > //1.read array and change to string dataset. > JavaRDD data = sc.parallelize(arr); > Dataset stringdataset = sqc.createDataset(data.rdd(), > Encoders.STRING()); > stringdataset.show(); //PASS > //2. 
convert string dataset to sampledata dataset > Dataset df = > sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class)); > df.show();//PASS > df.printSchema();//PASS > Dataset fad = df.flatMap(SampleDataFlat::flatMap, > Encoders.bean(SampleDataFlat.class)); > fad.show(); //ERROR > fad.printSchema(); > } > public static class SampleData implements Serializable { > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public int getId() { > return id; > } > public void setId(int id) { > this.id = id; > } > String str; > int id; > } > public static class SampleDataFlat { > String str; > public String getStr() { > return str; > } > public void setStr(String str) { > this.str = str; > } > public SampleDataFlat(String str, long id) { > this.str = str; > } > public static Iterator flatMap(SampleData data) { > ArrayList arr = new ArrayList<>(); > arr.add(new SampleDataFlat(data.getStr(), data.getId())); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+1)); > arr.add(new SampleDataFlat(data.getStr(), data.getId()+2)); > return arr.iterator(); > } > } > {code} > ==Error message== > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 38, Column 16: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 38, Column 16: No applicable constructor/method found for actual parameters > "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)" > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new > SparkUnitTest$SampleData(); > /* 028 */ this.javaBean = value1; > /* 029 */ if (!false) { > /* 030 */ > /* 031 */ > /* 032 */ boolean isNull3 = i.isNullAt(0); > /* 033 */ long value3 = isNull3 ? -1L : (i.getLong(0)); > /* 034 */ > /* 035 */ if (isNull3) { > /* 036 */ throw new NullPointerException(((java.lang.String) > references[0])); > /* 037 */ } > /* 038 */ javaBean.setId(value3); -- This message was sent by Atla
[jira] [Created] (SPARK-22022) Unable to use Python Profiler with SparkSession
Maciej Bryński created SPARK-22022: -- Summary: Unable to use Python Profiler with SparkSession Key: SPARK-22022 URL: https://issues.apache.org/jira/browse/SPARK-22022 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Reporter: Maciej Bryński Priority: Minor SparkContext has a profiler_cls option that makes it possible to use a Python profiler within a Spark app. There is no such option in SparkSession. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
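For context, a sketch of the existing SparkContext path the report refers to (profiler_cls together with the spark.python.profile conf); the SparkSession builder currently exposes no equivalent hook, which is the gap this ticket describes:

{code:python}
from pyspark import BasicProfiler, SparkConf, SparkContext

class MyProfiler(BasicProfiler):
    """Custom profiler hook for illustration; BasicProfiler already wraps cProfile."""
    pass

# Profiling must be switched on via conf, then the profiler class is injected here.
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(master="local[2]", appName="profiled", conf=conf, profiler_cls=MyProfiler)

sc.parallelize(range(1000)).map(lambda x: x * x).count()
sc.show_profiles()  # prints the collected stats per profiled RDD
{code}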
[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
[ https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167722#comment-16167722 ] Sean Owen commented on SPARK-22021: --- Why can't you just apply this function? Or implement Transformer. I'm not sure what the point of a transformer that just applies a function is. > Add a feature transformation to accept a function and apply it on all rows of > dataframe > --- > > Key: SPARK-22021 > URL: https://issues.apache.org/jira/browse/SPARK-22021 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Hosur Narahari > > More often we generate derived features in ML pipeline by doing some > mathematical or other kind of operation on columns of dataframe like getting > a total of few columns as a new column or if there is text field message and > we want the length of message etc. We currently don't have an efficient way > to handle such scenario in ML pipeline. > By Providing a transformer which accepts a function and performs that on > mentioned columns to generate output column of numerical type, user has the > flexibility to derive features by applying any domain specific logic. > Example: > val function = "function(a,b) { return a+b;}" > val transformer = new GenFuncTransformer().setInputCols(Array("v1", > "v2")).setOutputCol("result").setFunction(function) > val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2") > val result = transformer.transform(df) > result.show > v1 v2 result > 1.0 2.0 3.0 > 3.0 4.0 7.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
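A minimal user-side illustration of the "implement Transformer" route mentioned in the comment above; this sketch hard-codes the columns and skips the Params machinery, so it is only a shape, not a proposed API:

{code:python}
from pyspark.ml import Transformer
from pyspark.sql import functions as F

class SumColumnsTransformer(Transformer):
    """Adds v1 + v2 as 'result'; column names are hard-coded for brevity."""
    def _transform(self, dataset):
        return dataset.withColumn("result", F.col("v1") + F.col("v2"))

# Usage inside an ML pipeline: Pipeline(stages=[SumColumnsTransformer(), ...])
{code}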
[jira] [Created] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe
Hosur Narahari created SPARK-22021: -- Summary: Add a feature transformation to accept a function and apply it on all rows of dataframe Key: SPARK-22021 URL: https://issues.apache.org/jira/browse/SPARK-22021 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.3.0 Reporter: Hosur Narahari More often we generate derived features in ML pipeline by doing some mathematical or other kind of operation on columns of dataframe like getting a total of few columns as a new column or if there is text field message and we want the length of message etc. We currently don't have an efficient way to handle such scenario in ML pipeline. By Providing a transformer which accepts a function and performs that on mentioned columns to generate output column of numerical type, user has the flexibility to derive features by applying any domain specific logic. Example: val function = "function(a,b) { return a+b;}" val transformer = new GenFuncTransformer().setInputCols(Array("v1", "v2")).setOutputCol("result").setFunction(function) val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2") val result = transformer.transform(df) result.show v1 v2 result 1.0 2.0 3.0 3.0 4.0 7.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21780) Simpler Dataset.sample API in R
[ https://issues.apache.org/jira/browse/SPARK-21780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21780: Assignee: Apache Spark > Simpler Dataset.sample API in R > --- > > Key: SPARK-21780 > URL: https://issues.apache.org/jira/browse/SPARK-21780 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Apache Spark > > See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21780) Simpler Dataset.sample API in R
[ https://issues.apache.org/jira/browse/SPARK-21780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167571#comment-16167571 ] Apache Spark commented on SPARK-21780: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19243 > Simpler Dataset.sample API in R > --- > > Key: SPARK-21780 > URL: https://issues.apache.org/jira/browse/SPARK-21780 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21780) Simpler Dataset.sample API in R
[ https://issues.apache.org/jira/browse/SPARK-21780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21780: Assignee: (was: Apache Spark) > Simpler Dataset.sample API in R > --- > > Key: SPARK-21780 > URL: https://issues.apache.org/jira/browse/SPARK-21780 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > See parent ticket. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22020) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-22020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22020. --- Resolution: Duplicate > Support session local timezone > -- > > Key: SPARK-22020 > URL: https://issues.apache.org/jira/browse/SPARK-22020 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Navya Krishnappa > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > Input data: > Date,SparkDate,SparkDate1,SparkDate2 > 04/22/2017T03:30:02,2017-03-21T03:30:02,2017-03-21T03:30:02.02Z,2017-03-21T00:00:00Z > I have set the below value to set the timeZone to UTC. It is adding the > current timeZone value even though it is in the UTC format. > spark.conf.set("spark.sql.session.timeZone", "UTC") > Expected : Time should remain same as the input since it's already in UTC > format > var df1 = spark.read.option("delimiter", ",").option("qualifier", > "\"").option("inferSchema","true").option("header", "true").option("mode", > "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat", > "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv"); > df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more > fields] > scala> df1.show(false); > -- > Name Age Add DateSparkDate SparkDate1 SparkDate2 > -- > abc 21 bvxc04/22/2017T03:30:02 2017-03-21 03:30:02 > 2017-03-21 09:00:02.02 2017-03-21 05:30:00 > -- -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20921) While reading from oracle database, it converts to wrong type.
[ https://issues.apache.org/jira/browse/SPARK-20921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20921. --- Resolution: Duplicate > While reading from oracle database, it converts to wrong type. > -- > > Key: SPARK-20921 > URL: https://issues.apache.org/jira/browse/SPARK-20921 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: vivek dixit > > While reading data from Oracle over JDBC, Spark converts an integer column to boolean whenever the column size is 1, which is wrong. > https://github.com/apache/spark/commit/39a2b2ea74d420caa37019e3684f65b3a6fcb388#diff-edf9e48ce837c5f2a504428d9aa67970R46 > This condition in the Oracle JDBC dialect (OracleDialect) is responsible, and it has caused our data load process to fail. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
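A possible user-side workaround, until the built-in mapping changes, is to register a custom JdbcDialect that maps the affected columns back to an integer type. The sketch below is illustrative only; the object name and the choice of IntegerType are assumptions, not part of this ticket:
```
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, IntegerType, MetadataBuilder}

// Hypothetical workaround: map Oracle NUMBER(1) columns to IntegerType
// instead of the BooleanType chosen by the built-in dialect.
object OracleNumberOneAsIntDialect extends JdbcDialect {

  // Only take effect for Oracle JDBC URLs.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // Returning Some(...) here supplies an explicit Catalyst type for NUMBER(1).
    if (sqlType == Types.NUMERIC && size == 1) Some(IntegerType) else None
  }
}

// A dialect registered this way is consulted ahead of Spark's built-in OracleDialect.
JdbcDialects.registerDialect(OracleNumberOneAsIntDialect)
```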
[jira] [Commented] (SPARK-21713) Replace LogicalPlan.isStreaming with OutputMode
[ https://issues.apache.org/jira/browse/SPARK-21713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167506#comment-16167506 ] Apache Spark commented on SPARK-21713: -- User 'joseph-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/18925 > Replace LogicalPlan.isStreaming with OutputMode > --- > > Key: SPARK-21713 > URL: https://issues.apache.org/jira/browse/SPARK-21713 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jose Torres > > The isStreaming bit in LogicalPlan is based on an old model. Switching to > OutputMode will allow us to more easily integrate with things that require > specific OutputModes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs
[ https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21987. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.3.0 > Spark 2.3 cannot read 2.2 event logs > > > Key: SPARK-21987 > URL: https://issues.apache.org/jira/browse/SPARK-21987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 2.3.0 > > > Reported by [~jincheng] in a comment in SPARK-18085: > {noformat} > com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: > Unrecognized field "metadata" (class > org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 > known properties: "simpleString", "nodeName", "children", "metrics"]) > at [Source: > {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json > at > NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native > > Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"== > Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, > gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: > array\nRepartition 200, true\n+- LogicalRDD [uid#327L, > gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- > LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange > RoundRobinPartitioning(200)\n+- Scan > ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange > > RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan > > ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number > of output > rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data > size total (min, med, > max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: > 1, column: 1622] (through reference chain: > org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"]) > at > com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51) > {noformat} > This was caused by SPARK-17701 (which at this moment is still open even > though the patch has been committed). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
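The failure mode here is the standard one where a JSON reader rejects fields added by a newer writer. Below is a minimal, hypothetical Scala sketch of the usual mitigation, ignoring unknown properties with Jackson; this is not the actual Spark patch, and PlanInfo is an invented stand-in for the real SparkPlanInfo class:
```
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical, trimmed-down shape standing in for
// org.apache.spark.sql.execution.SparkPlanInfo.
case class PlanInfo(nodeName: String, simpleString: String)

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
// Skip fields written by newer versions (such as "metadata") instead of failing.
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

val json = """{"nodeName":"Exchange","simpleString":"Exchange RoundRobinPartitioning(200)","metadata":{}}"""
println(mapper.readValue(json, classOf[PlanInfo]))
```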