[jira] [Assigned] (SPARK-20373) Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute
[ https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20373: Assignee: Apache Spark > Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute > --- > > Key: SPARK-20373 > URL: https://issues.apache.org/jira/browse/SPARK-20373 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Minor > > Any Dataset/DataFrame batch query with the operation `withWatermark` does not > execute because the batch planner does not have any rule to explicitly handle > the EventTimeWatermark logical plan. The right solution is to simply remove > the plan node, as the watermark should not affect any batch query in any way. > {code} > from pyspark.sql.functions import * > eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", > 123)]).toDF("eventTime", "deviceId", > "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), > "deviceId", "signal") > windowedCountsDF = \ > eventsDF \ > .withWatermark("eventTime", "10 minutes") \ > .groupBy( > "deviceId", > window("eventTime", "5 minutes")) \ > .count() > windowedCountsDF.collect() > {code} > This throws as an error > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > eventTime#3762657: timestamp, interval 10 minutes > +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS > deviceId#3762651] >+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L] > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at
[jira] [Assigned] (SPARK-20373) Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute
[ https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20373: Assignee: (was: Apache Spark) > Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute > --- > > Key: SPARK-20373 > URL: https://issues.apache.org/jira/browse/SPARK-20373 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tathagata Das >Priority: Minor > > Any Dataset/DataFrame batch query with the operation `withWatermark` does not > execute because the batch planner does not have any rule to explicitly handle > the EventTimeWatermark logical plan. The right solution is to simply remove > the plan node, as the watermark should not affect any batch query in any way. > {code} > from pyspark.sql.functions import * > eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", > 123)]).toDF("eventTime", "deviceId", > "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), > "deviceId", "signal") > windowedCountsDF = \ > eventsDF \ > .withWatermark("eventTime", "10 minutes") \ > .groupBy( > "deviceId", > window("eventTime", "5 minutes")) \ > .count() > windowedCountsDF.collect() > {code} > This throws as an error > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > eventTime#3762657: timestamp, interval 10 minutes > +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS > deviceId#3762651] >+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L] > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at >
[jira] [Commented] (SPARK-20373) Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute
[ https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000294#comment-16000294 ] Apache Spark commented on SPARK-20373: -- User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/17896 > Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute > --- > > Key: SPARK-20373 > URL: https://issues.apache.org/jira/browse/SPARK-20373 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tathagata Das >Priority: Minor > > Any Dataset/DataFrame batch query with the operation `withWatermark` does not > execute because the batch planner does not have any rule to explicitly handle > the EventTimeWatermark logical plan. The right solution is to simply remove > the plan node, as the watermark should not affect any batch query in any way. > {code} > from pyspark.sql.functions import * > eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", > 123)]).toDF("eventTime", "deviceId", > "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), > "deviceId", "signal") > windowedCountsDF = \ > eventsDF \ > .withWatermark("eventTime", "10 minutes") \ > .groupBy( > "deviceId", > window("eventTime", "5 minutes")) \ > .count() > windowedCountsDF.collect() > {code} > This throws as an error > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > eventTime#3762657: timestamp, interval 10 minutes > +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS > deviceId#3762651] >+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L] > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at
[jira] [Commented] (SPARK-20588) from_utc_timestamp causes bottleneck
[ https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000290#comment-16000290 ] Takuya Ueshin commented on SPARK-20588: --- I agree with caching per thread for now. > from_utc_timestamp causes bottleneck > > > Key: SPARK-20588 > URL: https://issues.apache.org/jira/browse/SPARK-20588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS EMR AMI 5.2.1 >Reporter: Ameen Tayyebi > > We have a SQL query that makes use of the from_utc_timestamp function like > so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles') > This causes a major bottleneck. Our exact call is: > date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1) > Switching from the above to date_add(itemSigningTime, 1) reduces the job > running time from 40 minutes to 9. > When from_utc_timestamp function is used, several threads in the executors > are in the BLOCKED state, on this call stack: > "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 > tid=0x7f848472e000 nid=0x4294 waiting for monitor entry > [0x7f501981c000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.util.TimeZone.getTimeZone(TimeZone.java:516) > - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for > java.util.TimeZone) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Can we cache the locale's once per JVM so that we don't do this for every > record? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000213#comment-16000213 ] Takuya Ueshin commented on SPARK-12297: --- Issue resolved by pull request 16781 https://github.com/apache/spark/pull/16781 > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > Fix For: 2.3.0 > > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-12297. --- Resolution: Fixed Fix Version/s: 2.3.0 > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > Fix For: 2.3.0 > > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20634) result of MLlib KMeans cluster is not stabilize
Simon.J created SPARK-20634: --- Summary: result of MLlib KMeans cluster is not stabilize Key: SPARK-20634 URL: https://issues.apache.org/jira/browse/SPARK-20634 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.0.2 Environment: Windows 10 spark 2.0.2 standalone spyder 3.1.4 Anaconda 4.3.0 python 3.5.2 Reporter: Simon.J Priority: Critical 1.Get a DataFrame through python with Cx_Oracle lib. 2.Start a local Spark Session. 3.Convert the dataset for Kmeansmodel train. 4.Train the KMeans model and predict the same data.just set K =3 5.Get the ClassifierFeature of the KMeans model'predict. 6.Get the count of every ClassifierFeature. 7.Loop 4-6 for 20 times. 8.Compare the result of every time. 9.Find the KMeans result dose not stabilize. 10.The same dataset and param for ML package'KMeans, its result is the same. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16931) PySpark access to data-frame bucketing api
[ https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16931. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17077 [https://github.com/apache/spark/pull/17077] > PySpark access to data-frame bucketing api > -- > > Key: SPARK-16931 > URL: https://issues.apache.org/jira/browse/SPARK-16931 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Greg Bowyer >Assignee: Maciej Szymkiewicz > Fix For: 2.3.0 > > > Attached is a patch that enables bucketing for pyspark using the dataframe > API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16931) PySpark access to data-frame bucketing api
[ https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-16931: --- Assignee: Maciej Szymkiewicz > PySpark access to data-frame bucketing api > -- > > Key: SPARK-16931 > URL: https://issues.apache.org/jira/browse/SPARK-16931 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Greg Bowyer >Assignee: Maciej Szymkiewicz > Fix For: 2.3.0 > > > Attached is a patch that enables bucketing for pyspark using the dataframe > API. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17134: Assignee: Seth Hendrickson (was: Apache Spark) > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17134: Assignee: Apache Spark (was: Seth Hendrickson) > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Apache Spark > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000188#comment-16000188 ] Apache Spark commented on SPARK-17134: -- User 'VinceShieh' has created a pull request for this issue: https://github.com/apache/spark/pull/17894 > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20633) FileFormatWriter wrap the FetchFailedException which breaks job's failover
[ https://issues.apache.org/jira/browse/SPARK-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20633: Assignee: (was: Apache Spark) > FileFormatWriter wrap the FetchFailedException which breaks job's failover > -- > > Key: SPARK-20633 > URL: https://issues.apache.org/jira/browse/SPARK-20633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Liu Shaohui > > The task scheduler handles FetchFailedException separately for the task > failover. But the FileFormatWriter wraps the FetchFailedException with > SparkException. This causes the job cannot be recovered from the failure like > a external shuffle server is down. > See the stacktrace: > {code} > 2017-04-30,05:02:42,348 ERROR > org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task > attempt attempt_201704300443_0018_m_96_1 aborted. > 2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception > in task 96.1 in stage 18.0 (TID 26538) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: > java.lang.RuntimeException: Executor is not registered > (appId=application_1491898760056_636981, execId=546) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20633) FileFormatWriter wrap the FetchFailedException which breaks job's failover
[ https://issues.apache.org/jira/browse/SPARK-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20633: Assignee: Apache Spark > FileFormatWriter wrap the FetchFailedException which breaks job's failover > -- > > Key: SPARK-20633 > URL: https://issues.apache.org/jira/browse/SPARK-20633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Liu Shaohui >Assignee: Apache Spark > > The task scheduler handles FetchFailedException separately for the task > failover. But the FileFormatWriter wraps the FetchFailedException with > SparkException. This causes the job cannot be recovered from the failure like > a external shuffle server is down. > See the stacktrace: > {code} > 2017-04-30,05:02:42,348 ERROR > org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task > attempt attempt_201704300443_0018_m_96_1 aborted. > 2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception > in task 96.1 in stage 18.0 (TID 26538) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: > java.lang.RuntimeException: Executor is not registered > (appId=application_1491898760056_636981, execId=546) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20633) FileFormatWriter wrap the FetchFailedException which breaks job's failover
[ https://issues.apache.org/jira/browse/SPARK-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000184#comment-16000184 ] Apache Spark commented on SPARK-20633: -- User 'lshmouse' has created a pull request for this issue: https://github.com/apache/spark/pull/17893 > FileFormatWriter wrap the FetchFailedException which breaks job's failover > -- > > Key: SPARK-20633 > URL: https://issues.apache.org/jira/browse/SPARK-20633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Liu Shaohui > > The task scheduler handles FetchFailedException separately for the task > failover. But the FileFormatWriter wraps the FetchFailedException with > SparkException. This causes the job cannot be recovered from the failure like > a external shuffle server is down. > See the stacktrace: > {code} > 2017-04-30,05:02:42,348 ERROR > org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task > attempt attempt_201704300443_0018_m_96_1 aborted. > 2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception > in task 96.1 in stage 18.0 (TID 26538) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: > java.lang.RuntimeException: Executor is not registered > (appId=application_1491898760056_636981, execId=546) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20633) FileFormatWriter wrap the FetchFailedException which breaks job's failover
Liu Shaohui created SPARK-20633: --- Summary: FileFormatWriter wrap the FetchFailedException which breaks job's failover Key: SPARK-20633 URL: https://issues.apache.org/jira/browse/SPARK-20633 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Liu Shaohui The task scheduler handles FetchFailedException separately for the task failover. But the FileFormatWriter wraps the FetchFailedException with SparkException. This causes the job cannot be recovered from the failure like a external shuffle server is down. See the stacktrace: {code} 2017-04-30,05:02:42,348 ERROR org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task attempt attempt_201704300443_0018_m_96_1 aborted. 2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception in task 96.1 in stage 18.0 (TID 26538) org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1491898760056_636981, execId=546) at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87) at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000176#comment-16000176 ] Vincent commented on SPARK-17134: - I will submit a PR for this issue soon. > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000173#comment-16000173 ] Reynold Xin commented on SPARK-18350: - [~srowen] why was this reopened? > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Labels: releasenotes > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19112) add codec for ZStandard
[ https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000148#comment-16000148 ] Takeshi Yamamuro commented on SPARK-19112: -- I also put the result here: {code} scaleFactor: 4 AWS instance: c4.4xlarge -- zstd Running execution q4-v1.4 iteration: 1, StandardRun=true Execution time: 53.315878375s Running execution q4-v1.4 iteration: 2, StandardRun=true Execution time: 53.468174668s Running execution q4-v1.4 iteration: 3, StandardRun=true Execution time: 57.282403146s -- lz4 Running execution q4-v1.4 iteration: 1, StandardRun=true Execution time: 20.779643053s Running execution q4-v1.4 iteration: 2, StandardRun=true Execution time: 16.520911319s Running execution q4-v1.4 iteration: 3, StandardRun=true Execution time: 15.897124967s -- snappy Running execution q4-v1.4 iteration: 1, StandardRun=true Execution time: 21.13241203698s Running execution q4-v1.4 iteration: 2, StandardRun=true Execution time: 15.90886774398s Running execution q4-v1.4 iteration: 3, StandardRun=true Execution time: 15.789648712s -- lzf Running execution q4-v1.4 iteration: 1, StandardRun=true Execution time: 21.339518781s Running execution q4-v1.4 iteration: 2, StandardRun=true Execution time: 16.881225328s Running execution q4-v1.4 iteration: 3, StandardRun=true Execution time: 15.813455479s {code} ISTM it's okay to close the current pr for now. But we should close this ticket now? IMHO the performance depends on environments, configurations, code structure, and so on. So, we could keep this open for collecting other's performance results? > add codec for ZStandard > --- > > Key: SPARK-19112 > URL: https://issues.apache.org/jira/browse/SPARK-19112 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Thomas Graves >Priority: Minor > > ZStandard: https://github.com/facebook/zstd and > http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was > recently released. Hadoop > (https://issues.apache.org/jira/browse/HADOOP-13578) and others > (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it. > Zstd seems to give great results => Gzip level Compression with Lz4 level CPU. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000130#comment-16000130 ] Helena Edelson commented on SPARK-18057: Did that a while ago, my only point is not modifying artifacts ideally, by adding and excluding in builds. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20626) Fix SparkR test warning on Windows with timestamp time zone
[ https://issues.apache.org/jira/browse/SPARK-20626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20626: Assignee: Apache Spark > Fix SparkR test warning on Windows with timestamp time zone > --- > > Key: SPARK-20626 > URL: https://issues.apache.org/jira/browse/SPARK-20626 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Felix Cheung >Assignee: Apache Spark > > Warnings > --- > 1. infer types and check types (@test_sparkSQL.R#123) - unable to identify > current timezone 'C': > please set environment variable 'TZ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20626) Fix SparkR test warning on Windows with timestamp time zone
[ https://issues.apache.org/jira/browse/SPARK-20626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20626: Assignee: (was: Apache Spark) > Fix SparkR test warning on Windows with timestamp time zone > --- > > Key: SPARK-20626 > URL: https://issues.apache.org/jira/browse/SPARK-20626 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Felix Cheung > > Warnings > --- > 1. infer types and check types (@test_sparkSQL.R#123) - unable to identify > current timezone 'C': > please set environment variable 'TZ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20626) Fix SparkR test warning on Windows with timestamp time zone
[ https://issues.apache.org/jira/browse/SPARK-20626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000110#comment-16000110 ] Apache Spark commented on SPARK-20626: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/17892 > Fix SparkR test warning on Windows with timestamp time zone > --- > > Key: SPARK-20626 > URL: https://issues.apache.org/jira/browse/SPARK-20626 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Felix Cheung > > Warnings > --- > 1. infer types and check types (@test_sparkSQL.R#123) - unable to identify > current timezone 'C': > please set environment variable 'TZ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18350) Support session local timezone
[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18350: -- Fix Version/s: (was: 2.2.0) > Support session local timezone > -- > > Key: SPARK-18350 > URL: https://issues.apache.org/jira/browse/SPARK-18350 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Takuya Ueshin > Labels: releasenotes > > As of Spark 2.1, Spark SQL assumes the machine timezone for datetime > manipulation, which is bad if users are not in the same timezones as the > machines, or if different users have different timezones. > We should introduce a session local timezone setting that is used for > execution. > An explicit non-goal is locale handling. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20372) Word2Vec Continuous Bag Of Words model
[ https://issues.apache.org/jira/browse/SPARK-20372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20372: -- Fix Version/s: (was: 2.2.0) > Word2Vec Continuous Bag Of Words model > -- > > Key: SPARK-20372 > URL: https://issues.apache.org/jira/browse/SPARK-20372 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Shubham Chopra > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class
[ https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20548: -- Fix Version/s: (was: 2.2.0) > Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class > --- > > Key: SPARK-20548 > URL: https://issues.apache.org/jira/browse/SPARK-20548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > > {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been > failing in-deterministically : https://spark-tests.appspot.com/failed-tests > over the last few days. > https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18891) Support for specific collection types
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18891: -- Fix Version/s: (was: 2.2.0) > Support for specific collection types > - > > Key: SPARK-18891 > URL: https://issues.apache.org/jira/browse/SPARK-18891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Michael Armbrust >Priority: Critical > > Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}) which > force users to only define classes with the most generic type. > An [example > error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]: > {code} > case class SpecificCollection(aList: List[Int]) > Seq(SpecificCollection(1 :: Nil)).toDS().collect() > {code} > {code} > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 98, Column 120: No applicable constructor/method found > for actual parameters "scala.collection.Seq"; candidates are: > "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)" > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20617: -- Fix Version/s: (was: 2.2.0) > pyspark.sql filtering fails when using ~isin when there are nulls in column > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04, Python 3.5 >Reporter: Ed Lee > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below an example to replicate: > from pyspark.sql import SparkSession > from pyspark.sql import functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null > are missing: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20626) Fix SparkR test warning on Windows with timestamp time zone
[ https://issues.apache.org/jira/browse/SPARK-20626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-20626: - Summary: Fix SparkR test warning on Windows with timestamp time zone (was: Fix SparkR test on Windows with timestamp time zone) > Fix SparkR test warning on Windows with timestamp time zone > --- > > Key: SPARK-20626 > URL: https://issues.apache.org/jira/browse/SPARK-20626 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0, 2.2.0, 2.3.0 >Reporter: Felix Cheung > > Warnings > --- > 1. infer types and check types (@test_sparkSQL.R#123) - unable to identify > current timezone 'C': > please set environment variable 'TZ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11834) Ignore thresholds in LogisticRegression and update documentation
[ https://issues.apache.org/jira/browse/SPARK-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1625#comment-1625 ] Maciej Szymkiewicz commented on SPARK-11834: Sorry, for that. Wrong ticket in PR. > Ignore thresholds in LogisticRegression and update documentation > > > Key: SPARK-11834 > URL: https://issues.apache.org/jira/browse/SPARK-11834 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > ml.LogisticRegression does not support multiclass yet. So we should ignore > `thresholds` and update the documentation. In the next release, we can do > SPARK-11543. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params
[ https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20631: Assignee: Apache Spark > LogisticRegression._checkThresholdConsistency should use values not Params > -- > > Key: SPARK-20631 > URL: https://issues.apache.org/jira/browse/SPARK-20631 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Apache Spark >Priority: Minor > > {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in attempt to > access {{threshold}} and {{thresholds}} values. Furthermore it calls it with > {{Param}} instead of {{str}}: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25]) > Traceback (most recent call last): > ... > TypeError: getattr(): attribute name must be string > {code} > Finally exception message uses {{join}} without converting values to {{str}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params
[ https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20631: Assignee: (was: Apache Spark) > LogisticRegression._checkThresholdConsistency should use values not Params > -- > > Key: SPARK-20631 > URL: https://issues.apache.org/jira/browse/SPARK-20631 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in attempt to > access {{threshold}} and {{thresholds}} values. Furthermore it calls it with > {{Param}} instead of {{str}}: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25]) > Traceback (most recent call last): > ... > TypeError: getattr(): attribute name must be string > {code} > Finally exception message uses {{join}} without converting values to {{str}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params
[ https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1623#comment-1623 ] Apache Spark commented on SPARK-20631: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17891 > LogisticRegression._checkThresholdConsistency should use values not Params > -- > > Key: SPARK-20631 > URL: https://issues.apache.org/jira/browse/SPARK-20631 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in attempt to > access {{threshold}} and {{thresholds}} values. Furthermore it calls it with > {{Param}} instead of {{str}}: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25]) > Traceback (most recent call last): > ... > TypeError: getattr(): attribute name must be string > {code} > Finally exception message uses {{join}} without converting values to {{str}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params
[ https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-20631: --- Description: {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in attempt to access {{threshold}} and {{thresholds}} values. Furthermore it calls it with {{Param}} instead of {{str}}: {code} >>> from pyspark.ml.classification import LogisticRegression >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25]) Traceback (most recent call last): ... TypeError: getattr(): attribute name must be string {code} Finally exception message uses {{join}} without converting values to {{str}}. was:{{_checkThresholdConsistency}} incorrectly uses {{getParam}} in attempt to access {{threshold}} and {{thresholds}} values. > LogisticRegression._checkThresholdConsistency should use values not Params > -- > > Key: SPARK-20631 > URL: https://issues.apache.org/jira/browse/SPARK-20631 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > > {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in attempt to > access {{threshold}} and {{thresholds}} values. Furthermore it calls it with > {{Param}} instead of {{str}}: > {code} > >>> from pyspark.ml.classification import LogisticRegression > >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25]) > Traceback (most recent call last): > ... > TypeError: getattr(): attribute name must be string > {code} > Finally exception message uses {{join}} without converting values to {{str}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11834) Ignore thresholds in LogisticRegression and update documentation
[ https://issues.apache.org/jira/browse/SPARK-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1612#comment-1612 ] Apache Spark commented on SPARK-11834: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17891 > Ignore thresholds in LogisticRegression and update documentation > > > Key: SPARK-11834 > URL: https://issues.apache.org/jira/browse/SPARK-11834 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > ml.LogisticRegression does not support multiclass yet. So we should ignore > `thresholds` and update the documentation. In the next release, we can do > SPARK-11543. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20632) Allow 'Column.getItem()' API to accept Vector columns
Kevin Ushey created SPARK-20632: --- Summary: Allow 'Column.getItem()' API to accept Vector columns Key: SPARK-20632 URL: https://issues.apache.org/jira/browse/SPARK-20632 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1, 1.6.3 Reporter: Kevin Ushey Priority: Minor The 'getItem()' API is quite handy for extracting values from Dataset columns of type 'ArrayType'. It would be quite useful if this could also accept 'Vector' columns, e.g. those generated by the various MLLib routines (probability columns). If I understand correctly, users are forced to define custom UDFs to handle this case, and the UDF type required for vector columns is not always obvious. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params
Maciej Szymkiewicz created SPARK-20631: -- Summary: LogisticRegression._checkThresholdConsistency should use values not Params Key: SPARK-20631 URL: https://issues.apache.org/jira/browse/SPARK-20631 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Maciej Szymkiewicz Priority: Minor {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in attempt to access {{threshold}} and {{thresholds}} values. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20630: Description: Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors page displays *Thread Dump* column with an active link (that does nothing though). (was: Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors page displays **Thread Dump** column with an active link (that does nothing though).) > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays *Thread Dump* column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
[ https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20630: Attachment: spark-webui-executors-threadDump.png > Thread Dump link available in Executors tab irrespective of > spark.ui.threadDumpsEnabled > --- > > Key: SPARK-20630 > URL: https://issues.apache.org/jira/browse/SPARK-20630 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Minor > Attachments: spark-webui-executors-threadDump.png > > > Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors > page displays **Thread Dump** column with an active link (that does nothing > though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled
Jacek Laskowski created SPARK-20630: --- Summary: Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled Key: SPARK-20630 URL: https://issues.apache.org/jira/browse/SPARK-20630 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0 Reporter: Jacek Laskowski Priority: Minor Irrespective of {{spark.ui.threadDumpsEnabled}} property web UI's Executors page displays **Thread Dump** column with an active link (that does nothing though). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import SparkSession from pyspark.sql import functions as sf import pandas as pd spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate() test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null are missing: test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import SparkSession from pyspark.sql import functions as sf import pandas as pd spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate() test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null entries in col1 are considered 'isin' the list ["a"] (it is not in the list so it should show): test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql filtering fails when using ~isin when there are nulls in column > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04, Python 3.5 >Reporter: Ed Lee > Fix For: 2.2.0 > > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below an example to replicate: > from pyspark.sql import SparkSession > from pyspark.sql import functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null > are missing: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
[jira] [Updated] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Summary: pyspark.sql filtering fails when using ~isin when there are nulls in column (was: pyspark.sql, filtering with ~isin missing rows) > pyspark.sql filtering fails when using ~isin when there are nulls in column > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04, Python 3.5 >Reporter: Ed Lee > Fix For: 2.2.0 > > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below an example to replicate: > from pyspark.sql import SparkSession > from pyspark.sql import functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null entries in col1 are considered 'isin' the list ["a"] (it > is not in the list so it should show): > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Environment: Ubuntu Xenial 16.04, Python 3.5 (was: Ubuntu Xenial 16.04) > pyspark.sql, filtering with ~isin missing rows > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04, Python 3.5 >Reporter: Ed Lee > Fix For: 2.2.0 > > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below an example to replicate: > from pyspark.sql import SparkSession > from pyspark.sql import functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null entries in col1 are considered 'isin' the list ["a"] (it > is not in the list so it should show): > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ed Lee updated SPARK-20617: --- Fix Version/s: 2.2.0 Description: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import SparkSession from pyspark.sql import functions as sf import pandas as pd spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate() test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null entries in col1 are considered 'isin' the list ["a"] (it is not in the list so it should show): test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you was: Hello encountered a filtering bug using 'isin' in pyspark sql on version 2.2.0, Ubuntu 16.04. Enclosed below an example to replicate: from pyspark.sql import functions as sf import pandas as pd test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], "col2": range(5) }) test_sdf = spark.createDataFrame(test_df) test_sdf.show() |col1|col2| |null| 0| |null| 1| | a| 2| | b| 3| | c| 4| # Below shows null entries in col1 are considered 'isin' the list ["a"] (it is not in the list so it should show): test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() Or: test_sdf.filter(~sf.col("col1").isin(["a"])).show() *Expecting*: |col1|col2| |null| 0| |null| 1| | b| 3| | c| 4| *Got*: |col1|col2| | b| 3| | c| 4| My workarounds: 1. null is considered 'in', so add OR isNull conditon: test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( sf.col("col1").isNull())).show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| 2. Use left join and filter join_df = pd.DataFrame({"col1": ["a"], "isin": 1 }) join_sdf = spark.createDataFrame(join_df) test_sdf.join(join_sdf, on="col1", how="left") \ .filter(sf.col("isin").isNull()) \ .show() To get: |col1|col2|isin| |null| 0|null| |null| 1|null| | c| 4|null| | b| 3|null| Thank you > pyspark.sql, filtering with ~isin missing rows > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04 >Reporter: Ed Lee > Fix For: 2.2.0 > > > Hello encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below an example to replicate: > from pyspark.sql import SparkSession > from pyspark.sql import functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows null entries in col1 are considered 'isin' the list ["a"] (it > is not in the list so it should show): > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add OR isNull conditon: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| >
[jira] [Commented] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999751#comment-15999751 ] Apache Spark commented on SPARK-19900: -- User 'liyichao' has created a pull request for this issue: https://github.com/apache/spark/pull/17888 > [Standalone] Master registers application again when driver relaunched > -- > > Key: SPARK-19900 > URL: https://issues.apache.org/jira/browse/SPARK-19900 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.6.2 > Environment: Centos 6.5, spark standalone >Reporter: Sergey >Priority: Critical > Labels: Spark, network, standalone, supervise > > I've found some problems when node, where driver is running, has unstable > network. A situation is possible when two identical applications are running > on a cluster. > *Steps to Reproduce:* > # prepare 3 node. One for the spark master and two for the spark workers. > # submit an application with parameter spark.driver.supervise = true > # go to the node where driver is running (for example spark-worker-1) and > close 7077 port > {code} > # iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # wait more 60 seconds > # look at the spark master UI > There are two spark applications and one driver. The new application has > WAITING state and the second application has RUNNING state. Driver has > RUNNING or RELAUNCHING state (It depends on the resources available, as I > understand it) and it launched on other node (for example spark-worker-2) > # open the port > {code} > # iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # look an the spark UI again > There are no changes > In addition, if you look at the processes on the node spark-worker-1 > {code} > # ps ax | grep spark > {code} > you will see that the old driver is still working! > *Spark master logs:* > {code} > 17/03/10 05:26:27 WARN Master: Removing > worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 > seconds > 17/03/10 05:26:27 INFO Master: Removing worker > worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0 > 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347- > 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on > worker worker-20170310052411-spark-worker-2-40473 > 17/03/10 05:26:35 INFO Master: Registering app TestApplication > 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID > app-20170310052635-0001 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/1 > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/0 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from
[jira] [Comment Edited] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999741#comment-15999741 ] lyc edited comment on SPARK-19900 at 5/7/17 9:19 AM: - I can successfully reproduce in spark master ( commit 63d90e7d, and spark version is 2.3.0), though the symptom is a little different. To make the application run slowly enough, I modify `examples/LogQuery.scala` to this: {code} dataSet.map(line => (extractKey(line), extractStats(line))) .reduceByKey((a, b) => { Thread.sleep(20*60*1000) a.merge(b) }) {code} run two worker each with 1G memory and 2 cores. submit the application with: {code} bin/spark-submit --class org.apache.spark.examples.LogQuery --master spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 --num-executors 1 --executor-memory 512M --driver-memory 512M /vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar {code} and follow the steps as stated above. The outcome is that at step 5, there are not two drivers, but there are two applications, the old one is still running, the new one is waiting for resources. After step 6, the old application is finished, the new one running, and the only one driver is killed. This is problematic because the driver corresponding to the running application is killed instead of running. The problem is due to the fact that the relaunched driver uses the same driver id with the original one and when old application is killed, the driver is killed as a side effect. I have create a pull request to fix this. was (Author: lyc): I can successfully reproduce in spark master ( commit 63d90e7d, and spark version is 2.3.0), though the symptom is a little different. To make the application run slowly enough, I modify `examples/LogQuery.scala` to this: ``` dataSet.map(line => (extractKey(line), extractStats(line))) .reduceByKey((a, b) => { Thread.sleep(20*60*1000) a.merge(b) }) ``` run two worker each with 1G memory and 2 cores. submit the application with: ``` bin/spark-submit --class org.apache.spark.examples.LogQuery --master spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 --num-executors 1 --executor-memory 512M --driver-memory 512M /vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar ``` and follow the steps as stated above. The outcome is that at step 5, there are not two drivers, but there are two applications, the old one is still running, the new one is waiting for resources. After step 6, the old application is finished, the new one running, and the only one driver is killed. This is problematic because the driver corresponding to the running application is killed instead of running. The problem is due to the fact that the relaunched driver uses the same driver id with the original one and when old application is killed, the driver is killed as a side effect. I have create a pull request to fix this. > [Standalone] Master registers application again when driver relaunched > -- > > Key: SPARK-19900 > URL: https://issues.apache.org/jira/browse/SPARK-19900 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.6.2 > Environment: Centos 6.5, spark standalone >Reporter: Sergey >Priority: Critical > Labels: Spark, network, standalone, supervise > > I've found some problems when node, where driver is running, has unstable > network. A situation is possible when two identical applications are running > on a cluster. > *Steps to Reproduce:* > # prepare 3 node. One for the spark master and two for the spark workers. > # submit an application with parameter spark.driver.supervise = true > # go to the node where driver is running (for example spark-worker-1) and > close 7077 port > {code} > # iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # wait more 60 seconds > # look at the spark master UI > There are two spark applications and one driver. The new application has > WAITING state and the second application has RUNNING state. Driver has > RUNNING or RELAUNCHING state (It depends on the resources available, as I > understand it) and it launched on other node (for example spark-worker-2) > # open the port > {code} > # iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # look an the spark UI again > There are no changes > In addition, if you look at the processes on the node spark-worker-1 > {code} > # ps ax | grep spark > {code} > you will see that the old driver is still working! > *Spark master logs:* > {code} > 17/03/10 05:26:27 WARN Master: Removing > worker-20170310052240-spark-worker-1-35039
[jira] [Comment Edited] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999741#comment-15999741 ] lyc edited comment on SPARK-19900 at 5/7/17 9:18 AM: - I can successfully reproduce in spark master ( commit 63d90e7d, and spark version is 2.3.0), though the symptom is a little different. To make the application run slowly enough, I modify `examples/LogQuery.scala` to this: ``` dataSet.map(line => (extractKey(line), extractStats(line))) .reduceByKey((a, b) => { Thread.sleep(20*60*1000) a.merge(b) }) ``` run two worker each with 1G memory and 2 cores. submit the application with: ``` bin/spark-submit --class org.apache.spark.examples.LogQuery --master spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 --num-executors 1 --executor-memory 512M --driver-memory 512M /vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar ``` and follow the steps as stated above. The outcome is that at step 5, there are not two drivers, but there are two applications, the old one is still running, the new one is waiting for resources. After step 6, the old application is finished, the new one running, and the only one driver is killed. This is problematic because the driver corresponding to the running application is killed instead of running. The problem is due to the fact that the relaunched driver uses the same driver id with the original one and when old application is killed, the driver is killed as a side effect. I have create a pull request to fix this. was (Author: lyc): I can successfully reproduce in spark master ( commit 63d90e7d, and spark version is 2.3.0), though the symptom is a little different. To make the application run slowly enough, I modify `examples/LogQuery.scala` to this: ``` dataSet.map(line => (extractKey(line), extractStats(line))) .reduceByKey((a, b) => { Thread.sleep(20*60*1000) a.merge(b) }) ``` run two worker each with 1G memory and 2 cores. submit the application with: ``` bin/spark-submit --class org.apache.spark.examples.LogQuery --master spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 --num-executors 1 --executor-memory 512M --driver-memory 512M /vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar ``` and follow the steps as stated above. The outcome is that at step 5, there are not two drivers, but there are two applications, the old one is still running, the new one is waiting for resources. After step 6, the old application is finished, the new one running, and the only one driver is killed. This is problematic because the driver corresponding to the running application is killed instead of running. The problem is due to the fact that the relaunched driver uses the same driver id with the original one and when old application is killed, the driver is killed as a side effect. I have create a pull request to fix this. > [Standalone] Master registers application again when driver relaunched > -- > > Key: SPARK-19900 > URL: https://issues.apache.org/jira/browse/SPARK-19900 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.6.2 > Environment: Centos 6.5, spark standalone >Reporter: Sergey >Priority: Critical > Labels: Spark, network, standalone, supervise > > I've found some problems when node, where driver is running, has unstable > network. A situation is possible when two identical applications are running > on a cluster. > *Steps to Reproduce:* > # prepare 3 node. One for the spark master and two for the spark workers. > # submit an application with parameter spark.driver.supervise = true > # go to the node where driver is running (for example spark-worker-1) and > close 7077 port > {code} > # iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # wait more 60 seconds > # look at the spark master UI > There are two spark applications and one driver. The new application has > WAITING state and the second application has RUNNING state. Driver has > RUNNING or RELAUNCHING state (It depends on the resources available, as I > understand it) and it launched on other node (for example spark-worker-2) > # open the port > {code} > # iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # look an the spark UI again > There are no changes > In addition, if you look at the processes on the node spark-worker-1 > {code} > # ps ax | grep spark > {code} > you will see that the old driver is still working! > *Spark master logs:* > {code} > 17/03/10 05:26:27 WARN Master: Removing > worker-20170310052240-spark-worker-1-35039 because we got
[jira] [Assigned] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-7481: Assignee: Steve Loughran > Add spark-hadoop-cloud module to pull in object store support > - > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Steve Loughran >Assignee: Steve Loughran > Fix For: 2.3.0 > > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7481. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17834 [https://github.com/apache/spark/pull/17834] > Add spark-hadoop-cloud module to pull in object store support > - > > Key: SPARK-7481 > URL: https://issues.apache.org/jira/browse/SPARK-7481 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Steve Loughran > Fix For: 2.3.0 > > > To keep the s3n classpath right, to add s3a, swift & azure, the dependencies > of spark in a 2.6+ profile need to add the relevant object store packages > (hadoop-aws, hadoop-openstack, hadoop-azure) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20484) Add documentation to ALS code
[ https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20484: - Assignee: Daniel Li Priority: Minor (was: Trivial) > Add documentation to ALS code > - > > Key: SPARK-20484 > URL: https://issues.apache.org/jira/browse/SPARK-20484 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Assignee: Daniel Li >Priority: Minor > Fix For: 2.3.0 > > > The documentation (both Scaladocs and inline comments) for the ALS code (in > package {{org.apache.spark.ml.recommendation}}) can be clarified where needed > and expanded where incomplete. This is especially important for parts of the > code that are written imperatively for performance, as these parts don't > benefit from the intuitive self-documentation of Scala's higher-level > language abstractions. Specifically, I'd like to add documentation fully > explaining the key functionality of the in-block and out-block objects, their > purpose, how they relate to the overall ALS algorithm, and how they are > calculated in such a way that new maintainers can ramp up much more quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20484) Add documentation to ALS code
[ https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20484. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17793 [https://github.com/apache/spark/pull/17793] > Add documentation to ALS code > - > > Key: SPARK-20484 > URL: https://issues.apache.org/jira/browse/SPARK-20484 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Daniel Li >Priority: Trivial > Fix For: 2.3.0 > > > The documentation (both Scaladocs and inline comments) for the ALS code (in > package {{org.apache.spark.ml.recommendation}}) can be clarified where needed > and expanded where incomplete. This is especially important for parts of the > code that are written imperatively for performance, as these parts don't > benefit from the intuitive self-documentation of Scala's higher-level > language abstractions. Specifically, I'd like to add documentation fully > explaining the key functionality of the in-block and out-block objects, their > purpose, how they relate to the overall ALS algorithm, and how they are > calculated in such a way that new maintainers can ramp up much more quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20518) Supplement the new blockidsuite unit tests
[ https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20518: - Assignee: caoxuewen Priority: Minor (was: Major) > Supplement the new blockidsuite unit tests > -- > > Key: SPARK-20518 > URL: https://issues.apache.org/jira/browse/SPARK-20518 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: caoxuewen >Assignee: caoxuewen >Priority: Minor > Fix For: 2.3.0 > > > adds the new unit tests to support ShuffleDataBlockId , ShuffleIndexBlockId , > TempShuffleBlockId , TempLocalBlockId -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20518) Supplement the new blockidsuite unit tests
[ https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20518. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 17794 [https://github.com/apache/spark/pull/17794] > Supplement the new blockidsuite unit tests > -- > > Key: SPARK-20518 > URL: https://issues.apache.org/jira/browse/SPARK-20518 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.0 >Reporter: caoxuewen > Fix For: 2.3.0 > > > adds the new unit tests to support ShuffleDataBlockId , ShuffleIndexBlockId , > TempShuffleBlockId , TempLocalBlockId -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999741#comment-15999741 ] lyc commented on SPARK-19900: - I can successfully reproduce in spark master ( commit 63d90e7d, and spark version is 2.3.0), though the symptom is a little different. To make the application run slowly enough, I modify `examples/LogQuery.scala` to this: ``` dataSet.map(line => (extractKey(line), extractStats(line))) .reduceByKey((a, b) => { Thread.sleep(20*60*1000) a.merge(b) }) ``` run two worker each with 1G memory and 2 cores. submit the application with: ``` bin/spark-submit --class org.apache.spark.examples.LogQuery --master spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 --num-executors 1 --executor-memory 512M --driver-memory 512M /vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar ``` and follow the steps as stated above. The outcome is that at step 5, there are not two drivers, but there are two applications, the old one is still running, the new one is waiting for resources. After step 6, the old application is finished, the new one running, and the only one driver is killed. This is problematic because the driver corresponding to the running application is killed instead of running. The problem is due to the fact that the relaunched driver uses the same driver id with the original one and when old application is killed, the driver is killed as a side effect. I have create a pull request to fix this. > [Standalone] Master registers application again when driver relaunched > -- > > Key: SPARK-19900 > URL: https://issues.apache.org/jira/browse/SPARK-19900 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.6.2 > Environment: Centos 6.5, spark standalone >Reporter: Sergey >Priority: Critical > Labels: Spark, network, standalone, supervise > > I've found some problems when node, where driver is running, has unstable > network. A situation is possible when two identical applications are running > on a cluster. > *Steps to Reproduce:* > # prepare 3 node. One for the spark master and two for the spark workers. > # submit an application with parameter spark.driver.supervise = true > # go to the node where driver is running (for example spark-worker-1) and > close 7077 port > {code} > # iptables -A OUTPUT -p tcp --dport 7077 -j DROP > {code} > # wait more 60 seconds > # look at the spark master UI > There are two spark applications and one driver. The new application has > WAITING state and the second application has RUNNING state. Driver has > RUNNING or RELAUNCHING state (It depends on the resources available, as I > understand it) and it launched on other node (for example spark-worker-2) > # open the port > {code} > # iptables -D OUTPUT -p tcp --dport 7077 -j DROP > {code} > # look an the spark UI again > There are no changes > In addition, if you look at the processes on the node spark-worker-1 > {code} > # ps ax | grep spark > {code} > you will see that the old driver is still working! > *Spark master logs:* > {code} > 17/03/10 05:26:27 WARN Master: Removing > worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 > seconds > 17/03/10 05:26:27 INFO Master: Removing worker > worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1 > 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0 > 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347- > 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on > worker worker-20170310052411-spark-worker-2-40473 > 17/03/10 05:26:35 INFO Master: Registering app TestApplication > 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID > app-20170310052635-0001 > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker > worker-20170310052240-spark-worker-1-35039. Asking it to re-register. > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/1 > 17/03/10 05:31:07 WARN Master: Got status update for unknown executor > app-20170310052354-/0 > 17/03/10 05:31:07 WARN Master: Got heartbeat from