[jira] [Assigned] (SPARK-20373) Batch queries with `Dataset/DataFrame.withWatermark()` do not execute

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20373:


Assignee: Apache Spark

> Batch queries with `Dataset/DataFrame.withWatermark()` do not execute
> ---
>
> Key: SPARK-20373
> URL: https://issues.apache.org/jira/browse/SPARK-20373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Minor
>
> Any Dataset/DataFrame batch query with the operation `withWatermark` does not 
> execute because the batch planner does not have any rule to explicitly handle 
> the EventTimeWatermark logical plan. The right solution is to simply remove 
> the plan node, as the watermark should not affect any batch query in any way.
> {code}
> from pyspark.sql.functions import *
> eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", 
> 123)]).toDF("eventTime", "deviceId", 
> "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), 
> "deviceId", "signal")
> windowedCountsDF = \
>   eventsDF \
> .withWatermark("eventTime", "10 minutes") \
> .groupBy(
>   "deviceId",
>   window("eventTime", "5 minutes")) \
> .count()
> windowedCountsDF.collect()
> {code}
> This throws an error:
> {code}
> java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> eventTime#3762657: timestamp, interval 10 minutes
> +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS 
> deviceId#3762651]
>+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   ...
> {code}
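
The description above suggests simply dropping the EventTimeWatermark node when
planning a batch query. A minimal sketch of such a Catalyst rule, assuming the
usual Rule API (the rule name here is illustrative, not necessarily the merged fix):

{code}
import org.apache.spark.sql.catalyst.plans.logical.{EventTimeWatermark, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Drop the watermark node: it only matters for streaming queries, so a batch
// plan can keep just its child and the "No plan for EventTimeWatermark"
// assertion above is never reached.
object RemoveEventTimeWatermarkForBatch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case EventTimeWatermark(_, _, child) => child
  }
}
{code}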

[jira] [Assigned] (SPARK-20373) Batch queries with `Dataset/DataFrame.withWatermark()` do not execute

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20373:


Assignee: (was: Apache Spark)

> Batch queries with `Dataset/DataFrame.withWatermark()` do not execute
> ---
>
> Key: SPARK-20373
> URL: https://issues.apache.org/jira/browse/SPARK-20373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tathagata Das
>Priority: Minor
>
> Any Dataset/DataFrame batch query with the operation `withWatermark` does not 
> execute because the batch planner does not have any rule to explicitly handle 
> the EventTimeWatermark logical plan. The right solution is to simply remove 
> the plan node, as the watermark should not affect any batch query in any way.
> {code}
> from pyspark.sql.functions import *
> eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", 
> 123)]).toDF("eventTime", "deviceId", 
> "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), 
> "deviceId", "signal")
> windowedCountsDF = \
>   eventsDF \
> .withWatermark("eventTime", "10 minutes") \
> .groupBy(
>   "deviceId",
>   window("eventTime", "5 minutes")) \
> .count()
> windowedCountsDF.collect()
> {code}
> This throws an error:
> {code}
> java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> eventTime#3762657: timestamp, interval 10 minutes
> +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS 
> deviceId#3762651]
>+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   ...
> {code}

[jira] [Commented] (SPARK-20373) Batch queries with `Dataset/DataFrame.withWatermark()` do not execute

2017-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000294#comment-16000294
 ] 

Apache Spark commented on SPARK-20373:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17896

> Batch queries with `Dataset/DataFrame.withWatermark()` do not execute
> ---
>
> Key: SPARK-20373
> URL: https://issues.apache.org/jira/browse/SPARK-20373
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tathagata Das
>Priority: Minor
>
> Any Dataset/DataFrame batch query with the operation `withWatermark` does not 
> execute because the batch planner does not have any rule to explicitly handle 
> the EventTimeWatermark logical plan. The right solution is to simply remove 
> the plan node, as the watermark should not affect any batch query in any way.
> {code}
> from pyspark.sql.functions import *
> eventsDF = spark.createDataFrame([("2016-03-11 09:00:07", "dev1", 
> 123)]).toDF("eventTime", "deviceId", 
> "signal").select(col("eventTime").cast("timestamp").alias("eventTime"), 
> "deviceId", "signal")
> windowedCountsDF = \
>   eventsDF \
> .withWatermark("eventTime", "10 minutes") \
> .groupBy(
>   "deviceId",
>   window("eventTime", "5 minutes")) \
> .count()
> windowedCountsDF.collect()
> {code}
> This throws an error:
> {code}
> java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> eventTime#3762657: timestamp, interval 10 minutes
> +- Project [cast(_1#3762643 as timestamp) AS eventTime#3762657, _2#3762644 AS 
> deviceId#3762651]
>+- LogicalRDD [_1#3762643, _2#3762644, _3#3762645L]
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   at 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
>   ...
> {code}

[jira] [Commented] (SPARK-20588) from_utc_timestamp causes bottleneck

2017-05-07 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000290#comment-16000290
 ] 

Takuya Ueshin commented on SPARK-20588:
---

I agree with caching per thread for now.

> from_utc_timestamp causes bottleneck
> 
>
> Key: SPARK-20588
> URL: https://issues.apache.org/jira/browse/SPARK-20588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR AMI 5.2.1
>Reporter: Ameen Tayyebi
>
> We have a SQL query that makes use of the from_utc_timestamp function like 
> so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles')
> This causes a major bottleneck. Our exact call is:
> date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1)
> Switching from the above to date_add(itemSigningTime, 1) reduces the job 
> running time from 40 minutes to 9.
> When the from_utc_timestamp function is used, several threads in the executors 
> are in the BLOCKED state, on this call stack:
> "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
> tid=0x7f848472e000 nid=0x4294 waiting for monitor entry 
> [0x7f501981c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.TimeZone.getTimeZone(TimeZone.java:516)
> - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
> java.util.TimeZone)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Can we cache the time zones once per JVM so that we don't do this for every 
> record?
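
The comment above agrees with caching per thread; a minimal sketch of that idea
(the helper object is hypothetical, not Spark's actual code):

{code}
import java.util.TimeZone
import scala.collection.mutable

// Cache TimeZone instances per thread so executor threads stop contending on
// the class-level lock inside java.util.TimeZone.getTimeZone.
object TimeZoneCache {
  private val cache = new ThreadLocal[mutable.Map[String, TimeZone]] {
    override def initialValue(): mutable.Map[String, TimeZone] = mutable.Map.empty
  }

  def getTimeZone(id: String): TimeZone =
    cache.get().getOrElseUpdate(id, TimeZone.getTimeZone(id))
}
{code}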



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-07 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000213#comment-16000213
 ] 

Takuya Ueshin commented on SPARK-12297:
---

Issue resolved by pull request 16781
https://github.com/apache/spark/pull/16781

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
> Fix For: 2.3.0
>
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  E.g., here I write out a few timestamps to parquet and 
> textfile Hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and json based data shows the same times, and can be joined 
> against each other, while the times from the parquet data have changed (and 
> obviously joins fail).
> This is a big problem for any organization that may try to read the same data 
> (say in S3) with clusters in multiple timezones.  It can also be a nasty 
> surprise as an organization tries to migrate file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property which indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.
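
For reference, the three-hour shift in the parquet output above is exactly the
instant-vs-floating distinction; a small plain-JVM illustration (not Spark code)
of the same shift:

{code}
import java.text.SimpleDateFormat
import java.util.TimeZone

// Parse a wall-clock time as an instant in America/Los_Angeles, then render
// that instant in America/New_York: it moves by three hours, just like the
// parquet values above. A "floating" string representation would not move.
val laParser = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
laParser.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"))
val instant = laParser.parse("2015-12-31 23:50:59.123")

val nyFormatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
nyFormatter.setTimeZone(TimeZone.getTimeZone("America/New_York"))
println(nyFormatter.format(instant))  // 2016-01-01 02:50:59.123
{code}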



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-05-07 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-12297.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
> Fix For: 2.3.0
>
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  E.g., here I write out a few timestamps to parquet and 
> textfile Hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile and json based data shows the same times, and can be joined 
> against each other, while the times from the parquet data have changed (and 
> obviously joins fail).
> This is a big problem for any organization that may try to read the same data 
> (say in S3) with clusters in multiple timezones.  It can also be a nasty 
> surprise as an organization tries to migrate file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property which indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20634) Result of MLlib KMeans clustering is not stable

2017-05-07 Thread Simon.J (JIRA)
Simon.J created SPARK-20634:
---

 Summary: Result of MLlib KMeans clustering is not stable
 Key: SPARK-20634
 URL: https://issues.apache.org/jira/browse/SPARK-20634
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.0.2
 Environment: Windows 10
spark 2.0.2 standalone
spyder 3.1.4
Anaconda 4.3.0
python 3.5.2
Reporter: Simon.J
Priority: Critical


1. Get a DataFrame through Python with the cx_Oracle lib.
2. Start a local Spark session.
3. Convert the dataset for KMeans model training.
4. Train the KMeans model and predict on the same data, with k = 3.
5. Get the classifier feature from the KMeans model's predictions.
6. Get the count of every classifier feature.
7. Repeat steps 4-6 twenty times.
8. Compare the results across runs.
9. Observe that the KMeans result is not stable.
10. With the same dataset and parameters, the ML package's KMeans gives the same result every time.
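
One common source of this behavior is MLlib KMeans's random initialization; a
minimal sketch (illustrative data, `sc` assumed to be an existing SparkContext)
of pinning the seed so repeated runs are comparable:

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Tiny illustrative dataset; the real data comes from the Oracle query above.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(4.5, 4.5), Vectors.dense(4.6, 4.4),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1))).cache()

val model = new KMeans()
  .setK(3)
  .setMaxIterations(20)
  .setSeed(1L)          // fixed seed => deterministic initialization across runs
  .run(data)

println(model.clusterCenters.mkString(", "))
{code}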




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16931) PySpark access to data-frame bucketing api

2017-05-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16931.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17077
[https://github.com/apache/spark/pull/17077]

> PySpark access to data-frame bucketing api
> --
>
> Key: SPARK-16931
> URL: https://issues.apache.org/jira/browse/SPARK-16931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Greg Bowyer
>Assignee: Maciej Szymkiewicz
> Fix For: 2.3.0
>
>
> Attached is a patch that enables bucketing for pyspark using the dataframe 
> API.
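
For context, a minimal sketch of the DataFrame bucketing API that this patch
exposes to PySpark, shown via the existing Scala writer API (the table and
column names are illustrative):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketing-example").getOrCreate()
val df = spark.range(0, 100).toDF("id")

// Bucket the table by "id" into 4 buckets; bucketed writes require saveAsTable.
df.write
  .bucketBy(4, "id")
  .sortBy("id")
  .saveAsTable("bucketed_ids")
{code}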



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16931) PySpark access to data-frame bucketing api

2017-05-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-16931:
---

Assignee: Maciej Szymkiewicz

> PySpark access to data-frame bucketing api
> --
>
> Key: SPARK-16931
> URL: https://issues.apache.org/jira/browse/SPARK-16931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Greg Bowyer
>Assignee: Maciej Szymkiewicz
> Fix For: 2.3.0
>
>
> Attached is a patch that enables bucketing for pyspark using the dataframe 
> API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17134:


Assignee: Seth Hendrickson  (was: Apache Spark)

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Multinomial logistic regression uses the LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.
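
For context on the "level 2" idea: instead of computing one dot product per class
(level-1 operations), the per-example margins can be computed in a single
matrix-vector product. A small conceptual sketch (shapes and values are
placeholders, not the LogisticAggregator code):

{code}
import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector}

val numClasses = 3
val numFeatures = 4

// Coefficient matrix (numClasses x numFeatures) and one feature vector.
val coefficients = new DenseMatrix(numClasses, numFeatures,
  Array.fill(numClasses * numFeatures)(0.1))
val features = new DenseVector(Array(1.0, 2.0, 3.0, 4.0))

// One gemv-style call replaces numClasses separate dot products.
val margins: DenseVector = coefficients.multiply(features)
println(margins)
{code}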



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17134:


Assignee: Apache Spark  (was: Seth Hendrickson)

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Multinomial logistic regression uses the LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2017-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000188#comment-16000188
 ] 

Apache Spark commented on SPARK-17134:
--

User 'VinceShieh' has created a pull request for this issue:
https://github.com/apache/spark/pull/17894

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Multinomial logistic regression uses the LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20633) FileFormatWriter wraps the FetchFailedException, which breaks the job's failover

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20633:


Assignee: (was: Apache Spark)

> FileFormatWriter wraps the FetchFailedException, which breaks the job's failover
> --
>
> Key: SPARK-20633
> URL: https://issues.apache.org/jira/browse/SPARK-20633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Liu Shaohui
>
> The task scheduler handles FetchFailedException separately for task 
> failover, but FileFormatWriter wraps the FetchFailedException in a 
> SparkException. As a result, the job cannot recover from failures such as an 
> external shuffle server being down.
> See the stacktrace:
> {code}
> 2017-04-30,05:02:42,348 ERROR 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task 
> attempt attempt_201704300443_0018_m_96_1 aborted.
> 2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception 
> in task 96.1 in stage 18.0 (TID 26538)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: 
> java.lang.RuntimeException: Executor is not registered 
> (appId=application_1491898760056_636981, execId=546)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> {code}
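
A minimal sketch of the direction the report implies, namely letting
FetchFailedException propagate unwrapped so the scheduler can run its failover
path (an editorial sketch, not the actual patch):

{code}
import org.apache.spark.SparkException
import org.apache.spark.shuffle.FetchFailedException

// Rethrow fetch failures as-is so the task scheduler can resubmit the stage;
// only wrap other causes in the generic "Task failed while writing rows" error.
def writeRowsSafely(write: => Unit): Unit = {
  try {
    write
  } catch {
    case e: FetchFailedException => throw e
    case t: Throwable => throw new SparkException("Task failed while writing rows", t)
  }
}
{code}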



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20633) FileFormatWriter wraps the FetchFailedException, which breaks the job's failover

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20633:


Assignee: Apache Spark

> FileFormatWriter wraps the FetchFailedException, which breaks the job's failover
> --
>
> Key: SPARK-20633
> URL: https://issues.apache.org/jira/browse/SPARK-20633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Liu Shaohui
>Assignee: Apache Spark
>
> The task scheduler handles FetchFailedException separately for task 
> failover, but FileFormatWriter wraps the FetchFailedException in a 
> SparkException. As a result, the job cannot recover from failures such as an 
> external shuffle server being down.
> See the stacktrace:
> {code}
> 2017-04-30,05:02:42,348 ERROR 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task 
> attempt attempt_201704300443_0018_m_96_1 aborted.
> 2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception 
> in task 96.1 in stage 18.0 (TID 26538)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: 
> java.lang.RuntimeException: Executor is not registered 
> (appId=application_1491898760056_636981, execId=546)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20633) FileFormatWriter wraps the FetchFailedException, which breaks the job's failover

2017-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000184#comment-16000184
 ] 

Apache Spark commented on SPARK-20633:
--

User 'lshmouse' has created a pull request for this issue:
https://github.com/apache/spark/pull/17893

> FileFormatWriter wraps the FetchFailedException, which breaks the job's failover
> --
>
> Key: SPARK-20633
> URL: https://issues.apache.org/jira/browse/SPARK-20633
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Liu Shaohui
>
> The task scheduler handles FetchFailedException separately for task 
> failover, but FileFormatWriter wraps the FetchFailedException in a 
> SparkException. As a result, the job cannot recover from failures such as an 
> external shuffle server being down.
> See the stacktrace:
> {code}
> 2017-04-30,05:02:42,348 ERROR 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task 
> attempt attempt_201704300443_0018_m_96_1 aborted.
> 2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception 
> in task 96.1 in stage 18.0 (TID 26538)
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: 
> java.lang.RuntimeException: Executor is not registered 
> (appId=application_1491898760056_636981, execId=546)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152)
>   at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
>   at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20633) FileFormatWriter wraps the FetchFailedException, which breaks the job's failover

2017-05-07 Thread Liu Shaohui (JIRA)
Liu Shaohui created SPARK-20633:
---

 Summary: FileFormatWriter wraps the FetchFailedException, which 
breaks the job's failover
 Key: SPARK-20633
 URL: https://issues.apache.org/jira/browse/SPARK-20633
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Liu Shaohui


The task scheduler handles FetchFailedException separately for task failover, 
but FileFormatWriter wraps the FetchFailedException in a SparkException. As a 
result, the job cannot recover from failures such as an external shuffle server 
being down.

See the stacktrace:
{code}
2017-04-30,05:02:42,348 ERROR 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer: Task attempt 
attempt_201704300443_0018_m_96_1 aborted.
2017-04-30,05:02:42,392 ERROR org.apache.spark.executor.Executor: Exception in 
task 96.1 in stage 18.0 (TID 26538)
org.apache.spark.SparkException: Task failed while writing rows
  at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261)
  at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.shuffle.FetchFailedException: 
java.lang.RuntimeException: Executor is not registered 
(appId=application_1491898760056_636981, execId=546)
  at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:319)
  at 
org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:87)
  at 
org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:74)
  at 
org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:152)
  at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
  at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
  at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
  at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
  at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
  at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
  at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
  at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2017-05-07 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000176#comment-16000176
 ] 

Vincent commented on SPARK-17134:
-

I will submit a PR for this issue soon.

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Multinomial logistic regression uses the LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18350) Support session local timezone

2017-05-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000173#comment-16000173
 ] 

Reynold Xin commented on SPARK-18350:
-

[~srowen] why was this reopened?


> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.
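
For reference, the session-local timezone is exposed as the SQL conf
spark.sql.session.timeZone; a brief usage sketch (`spark` is an existing
SparkSession):

{code}
// Timestamps are now rendered and parsed in the session timezone rather than
// the JVM default timezone.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT current_timestamp() AS now").show(truncate = false)
{code}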



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19112) add codec for ZStandard

2017-05-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000148#comment-16000148
 ] 

Takeshi Yamamuro commented on SPARK-19112:
--

I also put the result here:
{code}
scaleFactor: 4
AWS instance: c4.4xlarge

-- zstd
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 53.315878375s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 53.468174668s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 57.282403146s 

-- lz4
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 20.779643053s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.520911319s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.897124967s

-- snappy
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 21.13241203698s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 15.90886774398s 
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.789648712s

-- lzf
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 21.339518781s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.881225328s   
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.813455479s
{code}

ISTM it's okay to close the current PR for now, but should we close this ticket 
now? IMHO the performance depends on environments, configurations, code 
structure, and so on, so we could keep this open for collecting others' 
performance results.
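
For anyone reproducing numbers like the above, the codec under test is selected
per run via spark.io.compression.codec (a sketch; "zstd" only becomes a valid
value once this ticket's codec is added):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codec-comparison")
  .config("spark.io.compression.codec", "lz4")  // or "lzf", "snappy"
  .getOrCreate()
{code}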

> add codec for ZStandard
> ---
>
> Key: SPARK-19112
> URL: https://issues.apache.org/jira/browse/SPARK-19112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Thomas Graves
>Priority: Minor
>
> ZStandard: https://github.com/facebook/zstd and 
> http://facebook.github.io/zstd/ has been in use for a while now. v1.0 was 
> recently released. Hadoop 
> (https://issues.apache.org/jira/browse/HADOOP-13578) and others 
> (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it.
> Zstd seems to give great results => Gzip level Compression with Lz4 level CPU.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-05-07 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000130#comment-16000130
 ] 

Helena Edelson commented on SPARK-18057:


Did that a while ago; my only point is that, ideally, we should not modify 
artifacts but instead handle this by adding and excluding dependencies in builds.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20626) Fix SparkR test warning on Windows with timestamp time zone

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20626:


Assignee: Apache Spark

> Fix SparkR test warning on Windows with timestamp time zone
> ---
>
> Key: SPARK-20626
> URL: https://issues.apache.org/jira/browse/SPARK-20626
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>
> Warnings 
> ---
> 1. infer types and check types (@test_sparkSQL.R#123) - unable to identify 
> current timezone 'C':
> please set environment variable 'TZ



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20626) Fix SparkR test warning on Windows with timestamp time zone

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20626:


Assignee: (was: Apache Spark)

> Fix SparkR test warning on Windows with timestamp time zone
> ---
>
> Key: SPARK-20626
> URL: https://issues.apache.org/jira/browse/SPARK-20626
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Felix Cheung
>
> Warnings 
> ---
> 1. infer types and check types (@test_sparkSQL.R#123) - unable to identify 
> current timezone 'C':
> please set environment variable 'TZ



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20626) Fix SparkR test warning on Windows with timestamp time zone

2017-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16000110#comment-16000110
 ] 

Apache Spark commented on SPARK-20626:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/17892

> Fix SparkR test warning on Windows with timestamp time zone
> ---
>
> Key: SPARK-20626
> URL: https://issues.apache.org/jira/browse/SPARK-20626
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Felix Cheung
>
> Warnings 
> ---
> 1. infer types and check types (@test_sparkSQL.R#123) - unable to identify 
> current timezone 'C':
> please set environment variable 'TZ



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18350) Support session local timezone

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18350:
--
Fix Version/s: (was: 2.2.0)

> Support session local timezone
> --
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takuya Ueshin
>  Labels: releasenotes
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20372) Word2Vec Continuous Bag Of Words model

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20372:
--
Fix Version/s: (was: 2.2.0)

> Word2Vec Continuous Bag Of Words model
> --
>
> Key: SPARK-20372
> URL: https://issues.apache.org/jira/browse/SPARK-20372
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Shubham Chopra
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20548:
--
Fix Version/s: (was: 2.2.0)

> Flaky Test:  ReplSuite.newProductSeqEncoder with REPL defined class
> ---
>
> Key: SPARK-20548
> URL: https://issues.apache.org/jira/browse/SPARK-20548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been 
> failing non-deterministically over the last few days: 
> https://spark-tests.appspot.com/failed-tests
> https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18891) Support for specific collection types

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18891:
--
Fix Version/s: (was: 2.2.0)

> Support for specific collection types
> -
>
> Key: SPARK-18891
> URL: https://issues.apache.org/jira/browse/SPARK-18891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}), which 
> forces users to define classes with only the most generic collection type.
> An [example 
> error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]:
> {code}
> case class SpecificCollection(aList: List[Int])
> Seq(SpecificCollection(1 :: Nil)).toDS().collect()
> {code}
> {code}
> java.lang.RuntimeException: Error while decoding: 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 98, Column 120: No applicable constructor/method found 
> for actual parameters "scala.collection.Seq"; candidates are: 
> "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20617:
--
Fix Version/s: (was: 2.2.0)

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 
> 2.2.0 (Ubuntu 16.04).
> Enclosed below is an example to replicate it:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so 
> add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you
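> A short sketch of the behaviour behind workaround 1 above (same 'test_sdf'; the 
> output is shown for illustration): the comparison itself yields null for null rows, 
> and 'filter' drops rows whose predicate is null, so both 'isin' and its negation 
> exclude them.
> {code}
> test_sdf.select("col1", sf.col("col1").isin(["a"]).alias("in_list")).show()
> #  |col1|in_list|
> #  |null|   null|
> #  |   a|   true|
> #  |   b|  false|
> #  |   c|  false|
> {code}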



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20626) Fix SparkR test warning on Windows with timestamp time zone

2017-05-07 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20626:
-
Summary: Fix SparkR test warning on Windows with timestamp time zone  (was: 
Fix SparkR test on Windows with timestamp time zone)

> Fix SparkR test warning on Windows with timestamp time zone
> ---
>
> Key: SPARK-20626
> URL: https://issues.apache.org/jira/browse/SPARK-20626
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Felix Cheung
>
> Warnings 
> ---
> 1. infer types and check types (@test_sparkSQL.R#123) - unable to identify 
> current timezone 'C':
> please set environment variable 'TZ



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11834) Ignore thresholds in LogisticRegression and update documentation

2017-05-07 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1625#comment-1625
 ] 

Maciej Szymkiewicz commented on SPARK-11834:


Sorry for that. Wrong ticket in the PR.

> Ignore thresholds in LogisticRegression and update documentation
> 
>
> Key: SPARK-11834
> URL: https://issues.apache.org/jira/browse/SPARK-11834
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> ml.LogisticRegression does not support multiclass yet. So we should ignore 
> `thresholds` and update the documentation. In the next release, we can do 
> SPARK-11543.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20631:


Assignee: Apache Spark

> LogisticRegression._checkThresholdConsistency should use values not Params
> --
>
> Key: SPARK-20631
> URL: https://issues.apache.org/jira/browse/SPARK-20631
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt to 
> access the {{threshold}} and {{thresholds}} values. Furthermore, it calls it with a 
> {{Param}} instead of a {{str}}:
> {code}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25])
> Traceback (most recent call last):
> ...
> TypeError: getattr(): attribute name must be string
> {code}
> Finally, the exception message uses {{join}} without converting the values to {{str}}.
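> A minimal sketch of what the consistency check intends, reading plain values instead 
> of {{Param}} objects (a hypothetical standalone helper, assuming the documented 
> relationship threshold = 1 / (1 + thresholds[0] / thresholds[1]) for the binary case):
> {code}
> def check_threshold_consistency(threshold, thresholds):
>     """Raise if a two-element thresholds list disagrees with threshold."""
>     if threshold is None or thresholds is None:
>         return
>     if len(thresholds) != 2:
>         raise ValueError("thresholds must have length 2, got: " + str(thresholds))
>     # For binary classification, threshold is equivalent to
>     # 1 / (1 + thresholds[0] / thresholds[1]).
>     derived = 1.0 / (1.0 + thresholds[0] / thresholds[1])
>     if abs(derived - threshold) > 1e-5:
>         # %s converts both values to strings when building the message
>         raise ValueError("threshold %s is inconsistent with thresholds %s"
>                          % (threshold, thresholds))
>
> check_threshold_consistency(0.25, [0.75, 0.25])  # consistent: no error raised
> {code}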



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params

2017-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20631:


Assignee: (was: Apache Spark)

> LogisticRegression._checkThresholdConsistency should use values not Params
> --
>
> Key: SPARK-20631
> URL: https://issues.apache.org/jira/browse/SPARK-20631
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt to 
> access the {{threshold}} and {{thresholds}} values. Furthermore, it calls it with a 
> {{Param}} instead of a {{str}}:
> {code}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25])
> Traceback (most recent call last):
> ...
> TypeError: getattr(): attribute name must be string
> {code}
> Finally, the exception message uses {{join}} without converting the values to {{str}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params

2017-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1623#comment-1623
 ] 

Apache Spark commented on SPARK-20631:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17891

> LogisticRegression._checkThresholdConsistency should use values not Params
> --
>
> Key: SPARK-20631
> URL: https://issues.apache.org/jira/browse/SPARK-20631
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt to 
> access the {{threshold}} and {{thresholds}} values. Furthermore, it calls it with a 
> {{Param}} instead of a {{str}}:
> {code}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25])
> Traceback (most recent call last):
> ...
> TypeError: getattr(): attribute name must be string
> {code}
> Finally, the exception message uses {{join}} without converting the values to {{str}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params

2017-05-07 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-20631:
---
Description: 
{{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt to 
access the {{threshold}} and {{thresholds}} values. Furthermore, it calls it with a 
{{Param}} instead of a {{str}}:

{code}
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25])
Traceback (most recent call last):
...
TypeError: getattr(): attribute name must be string
{code}

Finally, the exception message uses {{join}} without converting the values to {{str}}.

  was:{{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt 
to access the {{threshold}} and {{thresholds}} values.


> LogisticRegression._checkThresholdConsistency should use values not Params
> --
>
> Key: SPARK-20631
> URL: https://issues.apache.org/jira/browse/SPARK-20631
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> {{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt to 
> access the {{threshold}} and {{thresholds}} values. Furthermore, it calls it with a 
> {{Param}} instead of a {{str}}:
> {code}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> lr = LogisticRegression(threshold=0.25, thresholds=[0.75, 0.25])
> Traceback (most recent call last):
> ...
> TypeError: getattr(): attribute name must be string
> {code}
> Finally, the exception message uses {{join}} without converting the values to {{str}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11834) Ignore thresholds in LogisticRegression and update documentation

2017-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1612#comment-1612
 ] 

Apache Spark commented on SPARK-11834:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17891

> Ignore thresholds in LogisticRegression and update documentation
> 
>
> Key: SPARK-11834
> URL: https://issues.apache.org/jira/browse/SPARK-11834
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> ml.LogisticRegression does not support multiclass yet. So we should ignore 
> `thresholds` and update the documentation. In the next release, we can do 
> SPARK-11543.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20632) Allow 'Column.getItem()' API to accept Vector columns

2017-05-07 Thread Kevin Ushey (JIRA)
Kevin Ushey created SPARK-20632:
---

 Summary: Allow 'Column.getItem()' API to accept Vector columns
 Key: SPARK-20632
 URL: https://issues.apache.org/jira/browse/SPARK-20632
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1, 1.6.3
Reporter: Kevin Ushey
Priority: Minor


The 'getItem()' API is quite handy for extracting values from Dataset columns 
of type 'ArrayType'. It would be quite useful if this could also accept 
'Vector' columns, e.g. those generated by the various MLlib routines 
(probability columns).

If I understand correctly, users are forced to define custom UDFs to handle 
this case, and the UDF type required for vector columns is not always obvious.
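For reference, a hedged sketch of the UDF workaround described above (PySpark; it 
assumes a DataFrame 'predictions' with a Vector-valued 'probability' column, e.g. the 
output of a fitted LogisticRegressionModel; the names are illustrative only):
{code}
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

# getItem() does not work on Vector columns, so element i has to be pulled out via a
# UDF; the element must be converted to a Python float to match the declared DoubleType.
prob_of_class_one = udf(lambda v: float(v[1]), DoubleType())

predictions.select(prob_of_class_one(col("probability")).alias("p1")).show()
{code}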



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20631) LogisticRegression._checkThresholdConsistency should use values not Params

2017-05-07 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-20631:
--

 Summary: LogisticRegression._checkThresholdConsistency should use 
values not Params
 Key: SPARK-20631
 URL: https://issues.apache.org/jira/browse/SPARK-20631
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Maciej Szymkiewicz
Priority: Minor


{{_checkThresholdConsistency}} incorrectly uses {{getParam}} in an attempt to 
access the {{threshold}} and {{thresholds}} values.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled

2017-05-07 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-20630:

Description: Irrespective of the {{spark.ui.threadDumpsEnabled}} property, the web 
UI's Executors page displays a *Thread Dump* column with an active link (that 
does nothing, though).  (was: Irrespective of {{spark.ui.threadDumpsEnabled}} 
property web UI's Executors page displays **Thread Dump** column with an active 
link (that does nothing though).)

> Thread Dump link available in Executors tab irrespective of 
> spark.ui.threadDumpsEnabled
> ---
>
> Key: SPARK-20630
> URL: https://issues.apache.org/jira/browse/SPARK-20630
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: spark-webui-executors-threadDump.png
>
>
> Irrespective of the {{spark.ui.threadDumpsEnabled}} property, the web UI's Executors 
> page displays a *Thread Dump* column with an active link (that does nothing, 
> though).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled

2017-05-07 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-20630:

Attachment: spark-webui-executors-threadDump.png

> Thread Dump link available in Executors tab irrespective of 
> spark.ui.threadDumpsEnabled
> ---
>
> Key: SPARK-20630
> URL: https://issues.apache.org/jira/browse/SPARK-20630
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Minor
> Attachments: spark-webui-executors-threadDump.png
>
>
> Irrespective of the {{spark.ui.threadDumpsEnabled}} property, the web UI's Executors 
> page displays a **Thread Dump** column with an active link (that does nothing, 
> though).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20630) Thread Dump link available in Executors tab irrespective of spark.ui.threadDumpsEnabled

2017-05-07 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-20630:
---

 Summary: Thread Dump link available in Executors tab irrespective 
of spark.ui.threadDumpsEnabled
 Key: SPARK-20630
 URL: https://issues.apache.org/jira/browse/SPARK-20630
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: Jacek Laskowski
Priority: Minor


Irrespective of the {{spark.ui.threadDumpsEnabled}} property, the web UI's Executors 
page displays a **Thread Dump** column with an active link (that does nothing, 
though).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-05-07 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Description: 
Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 2.2.0 
(Ubuntu 16.04).

Enclosed below is an example to replicate it:

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
import pandas as pd
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows when filtering col1 NOT in list ['a'] the col1 rows with null are 
missing:

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so add 
an OR isNull condition:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 2.2.0 
(Ubuntu 16.04).

Enclosed below is an example to replicate it:

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
import pandas as pd
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows that null entries in col1 are treated as if they were 'isin' the list 
["a"] (null is not in the list, so those rows should be shown):

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so add 
an OR isNull condition:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
> Fix For: 2.2.0
>
>
> Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 
> 2.2.0 (Ubuntu 16.04).
> Enclosed below is an example to replicate it:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so 
> add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (

[jira] [Updated] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-05-07 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Summary: pyspark.sql filtering fails when using ~isin when there are nulls 
in column  (was: pyspark.sql,  filtering with ~isin missing rows)

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
> Fix For: 2.2.0
>
>
> Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 
> 2.2.0 (Ubuntu 16.04).
> Enclosed below is an example to replicate it:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows that null entries in col1 are treated as if they were 'isin' the list 
> ["a"] (null is not in the list, so those rows should be shown):
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so 
> add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows

2017-05-07 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Environment: Ubuntu Xenial 16.04, Python 3.5  (was: Ubuntu Xenial 16.04)

> pyspark.sql,  filtering with ~isin missing rows
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
> Fix For: 2.2.0
>
>
> Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 
> 2.2.0 (Ubuntu 16.04).
> Enclosed below is an example to replicate it:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows that null entries in col1 are treated as if they were 'isin' the list 
> ["a"] (null is not in the list, so those rows should be shown):
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so 
> add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20617) pyspark.sql, filtering with ~isin missing rows

2017-05-07 Thread Ed Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ed Lee updated SPARK-20617:
---
Fix Version/s: 2.2.0
  Description: 
Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 2.2.0 
(Ubuntu 16.04).

Enclosed below is an example to replicate it:

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
import pandas as pd
spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows that null entries in col1 are treated as if they were 'isin' the list 
["a"] (null is not in the list, so those rows should be shown):

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so add 
an OR isNull condition:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you


  was:
Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 2.2.0 
(Ubuntu 16.04).

Enclosed below is an example to replicate it:

from pyspark.sql import functions as sf
import pandas as pd
test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
"col2": range(5)
})

test_sdf = spark.createDataFrame(test_df)
test_sdf.show()

 |col1|col2|
 |null|   0|
 |null|   1|
 |   a|   2|
 |   b|   3|
 |   c|   4|

# Below shows that null entries in col1 are treated as if they were 'isin' the list 
["a"] (null is not in the list, so those rows should be shown):

test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
Or:
test_sdf.filter(~sf.col("col1").isin(["a"])).show()

*Expecting*:
 |col1|col2|
 |null|   0|
 |null|   1|
 |   b|   3|
 |   c|   4|

*Got*:
 |col1|col2|
 |   b|   3|
 |   c|   4|

My workarounds:

1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so add 
an OR isNull condition:
test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
sf.col("col1").isNull())).show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

2.  Use left join and filter
join_df = pd.DataFrame({"col1": ["a"],
"isin": 1
})

join_sdf = spark.createDataFrame(join_df)

test_sdf.join(join_sdf, on="col1", how="left") \
.filter(sf.col("isin").isNull()) \
.show()

To get:
 |col1|col2|isin|
 |null|   0|null|
 |null|   1|null|
 |   c|   4|null|
 |   b|   3|null|

Thank you



> pyspark.sql,  filtering with ~isin missing rows
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04
>Reporter: Ed Lee
> Fix For: 2.2.0
>
>
> Hello, I encountered a filtering bug using 'isin' in pyspark.sql on version 
> 2.2.0 (Ubuntu 16.04).
> Enclosed below is an example to replicate it:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows that null entries in col1 are treated as if they were 'isin' the list 
> ["a"] (null is not in the list, so those rows should be shown):
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is neither 'in' nor 'not in' the list (the comparison yields null), so 
> add an OR isNull condition:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  

[jira] [Commented] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999751#comment-15999751
 ] 

Apache Spark commented on SPARK-19900:
--

User 'liyichao' has created a pull request for this issue:
https://github.com/apache/spark/pull/17888

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
>
> I've found some problems when the node where the driver is running has an unstable 
> network. A situation is possible where two identical applications are running 
> on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the spark master and two for the spark workers.
> # submit an application with parameter spark.driver.supervise = true
> # go to the node where driver is running (for example spark-worker-1) and 
> close 7077 port
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more 60 seconds
> # look at the spark master UI
> There are two spark applications and one driver. The new application has 
> WAITING state and the second application has RUNNING state. Driver has 
> RUNNING or RELAUNCHING state (It depends on the resources available, as I 
> understand it) and it launched on other node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from 

[jira] [Comment Edited] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-05-07 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999741#comment-15999741
 ] 

lyc edited comment on SPARK-19900 at 5/7/17 9:19 AM:
-

I can reproduce this on Spark master (commit 63d90e7d, Spark version 2.3.0), 
though the symptom is a little different.

To make the application run slowly enough, I modified `examples/LogQuery.scala` 
to this:

{code}
dataSet.map(line => (extractKey(line), extractStats(line)))
  .reduceByKey((a, b) => {
Thread.sleep(20*60*1000)
a.merge(b)
  })
{code}

Run two workers, each with 1G memory and 2 cores, and submit the application with:

{code}
bin/spark-submit --class org.apache.spark.examples.LogQuery  --master 
spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 
--num-executors 1 --executor-memory 512M --driver-memory 512M 
/vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar
{code}

and follow the steps as stated above. 

The outcome is that at step 5 there are not two drivers, but there are two 
applications: the old one is still running and the new one is waiting for 
resources. After step 6, the old application is finished, the new one is running, 
and the only driver is killed. This is problematic because the driver 
corresponding to the running application is killed instead of kept running.

The problem is that the relaunched driver uses the same driver 
id as the original one, so when the old application is killed, the driver is 
killed as a side effect. I have created a pull request to fix this.


was (Author: lyc):
I can reproduce this on Spark master (commit 63d90e7d, Spark version 2.3.0), 
though the symptom is a little different.

To make the application run slowly enough, I modified `examples/LogQuery.scala` 
to this:

``` 
dataSet.map(line => (extractKey(line), extractStats(line)))
  .reduceByKey((a, b) => {
Thread.sleep(20*60*1000)
a.merge(b)
  })
```

Run two workers, each with 1G memory and 2 cores, and submit the application with:

```
bin/spark-submit --class org.apache.spark.examples.LogQuery  --master 
spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 
--num-executors 1 --executor-memory 512M --driver-memory 512M 
/vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar
```

and follow the steps as stated above. 

The outcome is that at step 5 there are not two drivers, but there are two 
applications: the old one is still running and the new one is waiting for 
resources. After step 6, the old application is finished, the new one is running, 
and the only driver is killed. This is problematic because the driver 
corresponding to the running application is killed instead of kept running.

The problem is that the relaunched driver uses the same driver 
id as the original one, so when the old application is killed, the driver is 
killed as a side effect. I have created a pull request to fix this.

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
>
> I've found some problems when the node where the driver is running has an unstable 
> network. A situation is possible where two identical applications are running 
> on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the spark master and two for the spark workers.
> # submit an application with parameter spark.driver.supervise = true
> # go to the node where driver is running (for example spark-worker-1) and 
> close 7077 port
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more 60 seconds
> # look at the spark master UI
> There are two spark applications and one driver. The new application has 
> WAITING state and the second application has RUNNING state. Driver has 
> RUNNING or RELAUNCHING state (It depends on the resources available, as I 
> understand it) and it launched on other node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 

[jira] [Comment Edited] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-05-07 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999741#comment-15999741
 ] 

lyc edited comment on SPARK-19900 at 5/7/17 9:18 AM:
-

I can reproduce this on Spark master (commit 63d90e7d, Spark version 2.3.0), 
though the symptom is a little different.

To make the application run slowly enough, I modified `examples/LogQuery.scala` 
to this:

``` 
dataSet.map(line => (extractKey(line), extractStats(line)))
  .reduceByKey((a, b) => {
Thread.sleep(20*60*1000)
a.merge(b)
  })
```

Run two workers, each with 1G memory and 2 cores, and submit the application with:

```
bin/spark-submit --class org.apache.spark.examples.LogQuery  --master 
spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 
--num-executors 1 --executor-memory 512M --driver-memory 512M 
/vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar
```

and follow the steps as stated above. 

The outcome is that at step 5 there are not two drivers, but there are two 
applications: the old one is still running and the new one is waiting for 
resources. After step 6, the old application is finished, the new one is running, 
and the only driver is killed. This is problematic because the driver 
corresponding to the running application is killed instead of kept running.

The problem is that the relaunched driver uses the same driver 
id as the original one, so when the old application is killed, the driver is 
killed as a side effect. I have created a pull request to fix this.


was (Author: lyc):
I can reproduce this on Spark master (commit 63d90e7d, Spark version 2.3.0), 
though the symptom is a little different.

To make the application run slowly enough, I modified `examples/LogQuery.scala` 
to this:
``` 
dataSet.map(line => (extractKey(line), extractStats(line)))
  .reduceByKey((a, b) => {
Thread.sleep(20*60*1000)
a.merge(b)
  })
```
Run two workers, each with 1G memory and 2 cores, and submit the application with:
```
bin/spark-submit --class org.apache.spark.examples.LogQuery  --master 
spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 
--num-executors 1 --executor-memory 512M --driver-memory 512M 
/vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar
```
and follow the steps as stated above. 

The outcome is that at step 5 there are not two drivers, but there are two 
applications: the old one is still running and the new one is waiting for 
resources. After step 6, the old application is finished, the new one is running, 
and the only driver is killed. This is problematic because the driver 
corresponding to the running application is killed instead of kept running.

The problem is that the relaunched driver uses the same driver 
id as the original one, so when the old application is killed, the driver is 
killed as a side effect. I have created a pull request to fix this.

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
>
> I've found some problems when the node where the driver is running has an unstable 
> network. A situation is possible where two identical applications are running 
> on a cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the spark master and two for the spark workers.
> # submit an application with parameter spark.driver.supervise = true
> # go to the node where driver is running (for example spark-worker-1) and 
> close 7077 port
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more 60 seconds
> # look at the spark master UI
> There are two spark applications and one driver. The new application has 
> WAITING state and the second application has RUNNING state. Driver has 
> RUNNING or RELAUNCHING state (It depends on the resources available, as I 
> understand it) and it launched on other node (for example spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got 

[jira] [Assigned] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-7481:


Assignee: Steve Loughran

> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
> Fix For: 2.3.0
>
>
> To keep the s3n classpath right and to add s3a, swift & azure support, the dependencies 
> of Spark in a Hadoop 2.6+ profile need to include the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7481.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17834
[https://github.com/apache/spark/pull/17834]

> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
> Fix For: 2.3.0
>
>
> To keep the s3n classpath right and to add s3a, swift & azure support, the dependencies 
> of Spark in a Hadoop 2.6+ profile need to include the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20484) Add documentation to ALS code

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20484:
-

Assignee: Daniel Li
Priority: Minor  (was: Trivial)

> Add documentation to ALS code
> -
>
> Key: SPARK-20484
> URL: https://issues.apache.org/jira/browse/SPARK-20484
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Assignee: Daniel Li
>Priority: Minor
> Fix For: 2.3.0
>
>
> The documentation (both Scaladocs and inline comments) for the ALS code (in 
> package {{org.apache.spark.ml.recommendation}}) can be clarified where needed 
> and expanded where incomplete. This is especially important for parts of the 
> code that are written imperatively for performance, as these parts don't 
> benefit from the intuitive self-documentation of Scala's higher-level 
> language abstractions. Specifically, I'd like to add documentation fully 
> explaining the key functionality of the in-block and out-block objects, their 
> purpose, how they relate to the overall ALS algorithm, and how they are 
> calculated in such a way that new maintainers can ramp up much more quickly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20484) Add documentation to ALS code

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20484.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17793
[https://github.com/apache/spark/pull/17793]

> Add documentation to ALS code
> -
>
> Key: SPARK-20484
> URL: https://issues.apache.org/jira/browse/SPARK-20484
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The documentation (both Scaladocs and inline comments) for the ALS code (in 
> package {{org.apache.spark.ml.recommendation}}) can be clarified where needed 
> and expanded where incomplete. This is especially important for parts of the 
> code that are written imperatively for performance, as these parts don't 
> benefit from the intuitive self-documentation of Scala's higher-level 
> language abstractions. Specifically, I'd like to add documentation fully 
> explaining the key functionality of the in-block and out-block objects, their 
> purpose, how they relate to the overall ALS algorithm, and how they are 
> calculated in such a way that new maintainers can ramp up much more quickly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20518) Supplement the new blockidsuite unit tests

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20518:
-

Assignee: caoxuewen
Priority: Minor  (was: Major)

> Supplement the new blockidsuite unit tests
> --
>
> Key: SPARK-20518
> URL: https://issues.apache.org/jira/browse/SPARK-20518
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Minor
> Fix For: 2.3.0
>
>
> Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, 
> TempShuffleBlockId, and TempLocalBlockId.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20518) Supplement the new blockidsuite unit tests

2017-05-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20518.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17794
[https://github.com/apache/spark/pull/17794]

> Supplement the new blockidsuite unit tests
> --
>
> Key: SPARK-20518
> URL: https://issues.apache.org/jira/browse/SPARK-20518
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: caoxuewen
> Fix For: 2.3.0
>
>
> Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, 
> TempShuffleBlockId, and TempLocalBlockId.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19900) [Standalone] Master registers application again when driver relaunched

2017-05-07 Thread lyc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999741#comment-15999741
 ] 

lyc commented on SPARK-19900:
-

I can reproduce this on Spark master (commit 63d90e7d, Spark version 2.3.0), 
though the symptom is a little different.

To make the application run slowly enough, I modified `examples/LogQuery.scala` 
to this:
``` 
dataSet.map(line => (extractKey(line), extractStats(line)))
  .reduceByKey((a, b) => {
Thread.sleep(20*60*1000)
a.merge(b)
  })
```
I ran two workers, each with 1G of memory and 2 cores, submitted the application with:
```
bin/spark-submit --class org.apache.spark.examples.LogQuery  --master 
spark://master:6066 --deploy-mode cluster --supervise --total-executor-cores 1 
--num-executors 1 --executor-memory 512M --driver-memory 512M 
/vagrant_data/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.3.0-SNAPSHOT.jar
```
and then followed the steps described above.

The outcome: at step 5 there are not two drivers, but there are two 
applications; the old one is still running and the new one is waiting for 
resources. After step 6, the old application is finished, the new one is 
running, and the only driver has been killed. This is problematic because the 
driver corresponding to the running application is killed instead of being left 
running.

The problem is that the relaunched driver reuses the driver id of the original 
one, so when the old application is killed, the driver is killed as a side 
effect. I have created a pull request to fix this.
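
To make that failure mode concrete, here is a small hypothetical sketch (a toy model of the bookkeeping, not Spark's actual Master code) of why a reused driver id couples the relaunched driver to the old application:

```
// Hypothetical toy model -- not Spark's Master implementation -- showing how a
// reused driver id couples the relaunched driver to the old application.
object DriverIdReuse extends App {
  // Drivers currently tracked, keyed by driver id (driverId -> appId).
  var drivers = Map.empty[String, String]
  // Driver id each application was registered under (appId -> driverId).
  var appToDriver = Map.empty[String, String]

  def launch(driverId: String, appId: String): Unit = {
    drivers += (driverId -> appId)
    appToDriver += (appId -> driverId)
  }

  // Removing an application also kills "its" driver, looked up purely by id.
  def removeApplication(appId: String): Unit =
    appToDriver.get(appId).foreach(driverId => drivers -= driverId)

  launch("driver-20170310052347-0000", "app-0000")  // original submission
  launch("driver-20170310052347-0000", "app-0001")  // relaunch reuses the same id
  removeApplication("app-0000")                     // old application finishes
  // The relaunched driver is gone as well, because it shares the old driver's id.
  assert(drivers.isEmpty)
}
```

Giving the relaunched driver a fresh id would break this coupling.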

> [Standalone] Master registers application again when driver relaunched
> --
>
> Key: SPARK-19900
> URL: https://issues.apache.org/jira/browse/SPARK-19900
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.6.2
> Environment: Centos 6.5, spark standalone
>Reporter: Sergey
>Priority: Critical
>  Labels: Spark, network, standalone, supervise
>
> I've found a problem that occurs when the node where the driver is running 
> has an unstable network: two identical applications can end up running on the 
> cluster.
> *Steps to Reproduce:*
> # prepare 3 nodes: one for the Spark master and two for the Spark workers
> # submit an application with the parameter spark.driver.supervise = true
> # go to the node where the driver is running (for example spark-worker-1) and 
> block port 7077
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # wait more than 60 seconds
> # look at the spark master UI
> There are two Spark applications and one driver. The new application is in 
> WAITING state and the original application is in RUNNING state. The driver is 
> in RUNNING or RELAUNCHING state (depending on the resources available, as I 
> understand it) and has been launched on another node (for example 
> spark-worker-2)
> # open the port
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # look at the Spark UI again
> There are no changes
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
>  you will see that the old driver is still working!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing 
> worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 
> seconds
> 17/03/10 05:26:27 INFO Master: Removing worker 
> worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on 
> worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID 
> app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker 
> worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor 
> app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from