[jira] [Resolved] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified

2017-09-15 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-22017.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 19239
[https://github.com/apache/spark/pull/19239]

> watermark evaluation with multi-input stream operators is unspecified
> -
>
> Key: SPARK-22017
> URL: https://issues.apache.org/jira/browse/SPARK-22017
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
> Fix For: 3.0.0
>
>
> Watermarks are stored as a single value in StreamExecution. If a query has 
> multiple watermark nodes (which can generally only happen with multi-input 
> operators like union), a headOption call will arbitrarily pick one to use as 
> the real one. This will happen independently in each batch, possibly leading 
> to strange and undefined behavior.
> We should instead choose the minimum from all watermark exec nodes as the 
> query-wide watermark.






[jira] [Created] (SPARK-22034) CrossValidator's training and testing set with different set of labels, resulting in encoder transform error

2017-09-15 Thread AnChe Kuo (JIRA)
AnChe Kuo created SPARK-22034:
-

 Summary: CrossValidator's training and testing set with different 
set of labels, resulting in encoder transform error
 Key: SPARK-22034
 URL: https://issues.apache.org/jira/browse/SPARK-22034
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.2.0
 Environment: Ubuntu 16.04
Scala 2.11
Spark 2.2.0
Reporter: AnChe Kuo


Let's say we have a VectorIndexer with maxCategories set to 13, and the training 
set has a column containing a month label.

In CrossValidator, the dataframe is split into training and validation sets 
automatically. It can happen that the training split lacks month 2 (by chance, or 
quite frequently if the labels are unbalanced).

When the training split is fitted within the cross validator, the pipeline is 
fitted with that split only, resulting in a partial key map in VectorIndexer. When 
this pipeline is then used to transform the validation split, VectorIndexer throws 
a "key not found" error.

Making CrossValidator an estimator, so that it can be plugged into a whole 
pipeline, is a cool idea, but a bug like this is unexpected.

The solution, I am guessing, would be to check each stage in the pipeline and, 
when we see an encoder-type stage, fit that stage's model with the complete 
dataset.
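A minimal Scala sketch of the failure mode (toy data and column names are assumptions, not the reporter's code): a VectorIndexer fitted on a split that is missing one month value fails on a row containing the unseen value, which is what happens inside CrossValidator when a fold lacks a category.

{code}
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors

// Training split that happens to lack month 2.0 (toy data, single feature).
val train = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0)), Tuple1(Vectors.dense(3.0)), Tuple1(Vectors.dense(4.0))
)).toDF("features")

// Validation split containing the unseen month.
val test = spark.createDataFrame(Seq(Tuple1(Vectors.dense(2.0)))).toDF("features")

val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(13)
  .fit(train)                       // category map is built from the training split only

indexerModel.transform(test).show() // fails with "key not found: 2.0"
{code}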






[jira] [Commented] (SPARK-22033) BufferHolder size checks should account for the specific VM array size limitations

2017-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168650#comment-16168650
 ] 

Sean Owen commented on SPARK-22033:
---

Hm, good point. There may be other similar issues throughout the code.
[~cloud_fan] does that sound right to you?

> BufferHolder size checks should account for the specific VM array size 
> limitations
> --
>
> Key: SPARK-22033
> URL: https://issues.apache.org/jira/browse/SPARK-22033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Vadim Semenov
>Priority: Minor
>
> Users may get the following OOM error while running a job with heavy 
> aggregations:
> ```
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:33)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> ```
> [`BufferHolder.grow` tries to create a byte array of `Integer.MAX_VALUE` 
> here](https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java#L72),
> but the maximum size of an array depends on the specifics of the VM.
> The safest value seems to be `Integer.MAX_VALUE - 8` 
> http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229
> In my JVM:
> ```
> java -version
> openjdk version "1.8.0_141"
> OpenJDK Runtime Environment (build 1.8.0_141-b16)
> OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)
> ```
> the max is `new Array[Byte](Integer.MAX_VALUE - 2)`






[jira] [Commented] (SPARK-22033) BufferHolder size checks should account for the specific VM array size limitations

2017-09-15 Thread Vadim Semenov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168612#comment-16168612
 ] 

Vadim Semenov commented on SPARK-22033:
---

Leaving traces for others if they happen to hit the same issue.

The issue isn't exclusive to the ObjectAggregator; it can happen in the 
SortBasedAggregator too:

```
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:159)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.initialize(SortBasedAggregationIterator.scala:103)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:113)
at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```

> BufferHolder size checks should account for the specific VM array size 
> limitations
> --
>
> Key: SPARK-22033
> URL: https://issues.apache.org/jira/browse/SPARK-22033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Vadim Semenov
>Priority: Minor
>
> Users may get the following OOM error while running a job with heavy 
> aggregations:
> ```
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregatio

[jira] [Created] (SPARK-22033) BufferHolder size checks should account for the specific VM array size limitations

2017-09-15 Thread Vadim Semenov (JIRA)
Vadim Semenov created SPARK-22033:
-

 Summary: BufferHolder size checks should account for the specific 
VM array size limitations
 Key: SPARK-22033
 URL: https://issues.apache.org/jira/browse/SPARK-22033
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Vadim Semenov
Priority: Minor


Users may get the following OOM error while running a job with heavy aggregations:

```
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:235)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:228)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:254)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$2.apply(AggregationIterator.scala:247)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:88)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.next(ObjectAggregationIterator.scala:33)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:167)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```

[`BufferHolder.grow` tries to create a byte array of `Integer.MAX_VALUE` 
here](https://github.com/apache/spark/blob/v2.2.0/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java#L72),
but the maximum size of an array depends on the specifics of the VM.

The safest value seems to be `Integer.MAX_VALUE - 8` 
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229

In my JVM:
```
java -version
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)
```

the max is `new Array[Byte](Integer.MAX_VALUE - 2)`
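A hedged Scala sketch of the proposed guard (not the actual BufferHolder code, which is Java; ARRAY_MAX and the method shape are illustrative): cap the allocation just below the VM array-size limit instead of at `Integer.MAX_VALUE`.

{code}
// Illustrative cap used by several JDK collections (e.g. java.util.ArrayList).
val ARRAY_MAX: Int = Integer.MAX_VALUE - 8

def grow(buffer: Array[Byte], usedBytes: Int, neededSize: Int): Array[Byte] = {
  require(neededSize <= ARRAY_MAX - usedBytes,
    s"Cannot grow buffer by $neededSize bytes: would exceed $ARRAY_MAX")
  val length = usedBytes + neededSize
  if (buffer.length < length) {
    // Double the buffer, but never request more than ARRAY_MAX, so the
    // VM-specific "Requested array size exceeds VM limit" OOM is avoided.
    val newLength = if (length < ARRAY_MAX / 2) length * 2 else ARRAY_MAX
    val tmp = new Array[Byte](newLength)
    System.arraycopy(buffer, 0, tmp, 0, usedBytes)
    tmp
  } else {
    buffer
  }
}
{code}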






[jira] [Assigned] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12297:


Assignee: (was: Apache Spark)

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  E.g., here I write out a few timestamps to parquet and 
> textfile hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile- and json-based data show the same times and can be joined 
> against each other, while the times from the parquet data have changed (and 
> the joins obviously fail).
> This is a big problem for any organization that may try to read the same data 
> (say in S3) with clusters in multiple timezones.  It can also be a nasty 
> surprise as an organization tries to migrate file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property which indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.






[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168410#comment-16168410
 ] 

Apache Spark commented on SPARK-12297:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/19250

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  E.g., here I write out a few timestamps to parquet and 
> textfile hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile- and json-based data show the same times and can be joined 
> against each other, while the times from the parquet data have changed (and 
> the joins obviously fail).
> This is a big problem for any organization that may try to read the same data 
> (say in S3) with clusters in multiple timezones.  It can also be a nasty 
> surprise as an organization tries to migrate file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property which indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.






[jira] [Assigned] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12297:


Assignee: Apache Spark

> Add work-around for Parquet/Hive int96 timestamp bug.
> -
>
> Key: SPARK-12297
> URL: https://issues.apache.org/jira/browse/SPARK-12297
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> Spark copied Hive's behavior for parquet, but this was inconsistent with 
> other file formats, and inconsistent with Impala (which is the original 
> source of putting a timestamp as an int96 in parquet, I believe).  This made 
> timestamps in parquet act more like timestamps with timezones, while in other 
> file formats, timestamps have no time zone; they are a "floating time".
> The easiest way to see this issue is to write out a table with timestamps in 
> multiple different formats from one timezone, then try to read them back in 
> another timezone.  E.g., here I write out a few timestamps to parquet and 
> textfile hive tables, and also just as a json file, all in the 
> "America/Los_Angeles" timezone:
> {code}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val tblPrefix = args(0)
> val schema = new StructType().add("ts", TimestampType)
> val rows = sc.parallelize(Seq(
>   "2015-12-31 23:50:59.123",
>   "2015-12-31 22:49:59.123",
>   "2016-01-01 00:39:59.123",
>   "2016-01-01 01:29:59.123"
> ).map { x => Row(java.sql.Timestamp.valueOf(x)) })
> val rawData = spark.createDataFrame(rows, schema).toDF()
> rawData.show()
> Seq("parquet", "textfile").foreach { format =>
>   val tblName = s"${tblPrefix}_$format"
>   spark.sql(s"DROP TABLE IF EXISTS $tblName")
>   spark.sql(
> raw"""CREATE TABLE $tblName (
>   |  ts timestamp
>   | )
>   | STORED AS $format
>  """.stripMargin)
>   rawData.write.insertInto(tblName)
> }
> rawData.write.json(s"${tblPrefix}_json")
> {code}
> Then I start a spark-shell in "America/New_York" timezone, and read the data 
> back from each table:
> {code}
> scala> spark.sql("select * from la_parquet").collect().foreach{println}
> [2016-01-01 02:50:59.123]
> [2016-01-01 01:49:59.123]
> [2016-01-01 03:39:59.123]
> [2016-01-01 04:29:59.123]
> scala> spark.sql("select * from la_textfile").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").collect().foreach{println}
> [2015-12-31 23:50:59.123]
> [2015-12-31 22:49:59.123]
> [2016-01-01 00:39:59.123]
> [2016-01-01 01:29:59.123]
> scala> spark.read.json("la_json").join(spark.sql("select * from 
> la_textfile"), "ts").show()
> ++
> |  ts|
> ++
> |2015-12-31 23:50:...|
> |2015-12-31 22:49:...|
> |2016-01-01 00:39:...|
> |2016-01-01 01:29:...|
> ++
> scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), 
> "ts").show()
> +---+
> | ts|
> +---+
> +---+
> {code}
> The textfile- and json-based data show the same times and can be joined 
> against each other, while the times from the parquet data have changed (and 
> the joins obviously fail).
> This is a big problem for any organization that may try to read the same data 
> (say in S3) with clusters in multiple timezones.  It can also be a nasty 
> surprise as an organization tries to migrate file formats.  Finally, it's a 
> source of incompatibility between Hive, Impala, and Spark.
> HIVE-12767 aims to fix this by introducing a table property which indicates 
> the "storage timezone" for the table.  Spark should add the same to ensure 
> consistency between file formats, and with Hive & Impala.






[jira] [Commented] (SPARK-21842) Support Kerberos ticket renewal and creation in Mesos

2017-09-15 Thread Arthur Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168278#comment-16168278
 ] 

Arthur Rand commented on SPARK-21842:
-

Hey [~kalvinnchau]

I'm currently of the mind that using the RPC/broadcast approach is better (for 
Mesos) for a couple of reasons:
1. We recently added Secret support to Spark on Mesos; this uses a temporary 
file system to put the keytab or TGT in the sandbox of the Spark driver. They 
are packed into the SparkAppConfig (CoarseGrainedSchedulerBackend.scala:L236), 
which is broadcast to the executors, so using RPC/broadcast is consistent with 
this.
2. It keeps all transfers of secure information within Spark.
3. It doesn't require HDFS. There is a little bit of a chicken-and-egg situation 
here w.r.t. YARN, but I'm not familiar enough with how Spark, YARN, and HDFS 
work together.

However, I understand that there is a potential risk of executors falsely 
registering with the driver and getting tokens. I know that in the case of DC/OS 
this is less of a concern (we have some protections around this), but it could 
still happen today due to the code mentioned above. We could prevent this by 
keeping track of the executor IDs and only allowing executors to register when 
they have an expected ID?
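A rough, hypothetical Scala sketch of that last idea (the TokenGate name and method shape are made up for illustration, not an existing Spark API): the driver hands delegation tokens only to executor IDs it already expects, and only once per ID.

{code}
// Hypothetical illustration of gating token distribution on expected executor IDs.
class TokenGate(expectedExecutorIds: Set[String]) {
  private val alreadyServed = scala.collection.mutable.Set.empty[String]

  /** Return tokens only for a known, not-yet-served executor ID. */
  def tokensFor(executorId: String, tokens: Array[Byte]): Option[Array[Byte]] =
    synchronized {
      if (expectedExecutorIds.contains(executorId) && alreadyServed.add(executorId)) {
        Some(tokens)
      } else {
        None // unknown or duplicate registration: refuse to hand out credentials
      }
    }
}
{code}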

> Support Kerberos ticket renewal and creation in Mesos 
> --
>
> Key: SPARK-21842
> URL: https://issues.apache.org/jira/browse/SPARK-21842
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Arthur Rand
>
> We at Mesosphere have written Kerberos support for Spark on Mesos. The code 
> to use Kerberos on a Mesos cluster has been added to Apache Spark 
> (SPARK-16742). This ticket is to complete the implementation and allow for 
> ticket renewal and creation. Specifically for long running and streaming jobs.
> Mesosphere design doc (needs revision, wip): 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6






[jira] [Assigned] (SPARK-22032) Speed up StructType.fromInternal

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22032:


Assignee: Apache Spark

> Speed up StructType.fromInternal
> 
>
> Key: SPARK-22032
> URL: https://issues.apache.org/jira/browse/SPARK-22032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>
> StructType.fromInternal calls f.fromInternal(v) for every field.
> We can use the needConversion method to limit the number of function calls.






[jira] [Assigned] (SPARK-22032) Speed up StructType.fromInternal

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22032:


Assignee: (was: Apache Spark)

> Speed up StructType.fromInternal
> 
>
> Key: SPARK-22032
> URL: https://issues.apache.org/jira/browse/SPARK-22032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> StructType.fromInternal calls f.fromInternal(v) for every field.
> We can use the needConversion method to limit the number of function calls.






[jira] [Commented] (SPARK-22032) Speed up StructType.fromInternal

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168273#comment-16168273
 ] 

Apache Spark commented on SPARK-22032:
--

User 'maver1ck' has created a pull request for this issue:
https://github.com/apache/spark/pull/19249

> Speed up StructType.fromInternal
> 
>
> Key: SPARK-22032
> URL: https://issues.apache.org/jira/browse/SPARK-22032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> StructType.fromInternal calls f.fromInternal(v) for every field.
> We can use the needConversion method to limit the number of function calls.






[jira] [Created] (SPARK-22032) Speed up StructType.fromInternal

2017-09-15 Thread JIRA
Maciej Bryński created SPARK-22032:
--

 Summary: Speed up StructType.fromInternal
 Key: SPARK-22032
 URL: https://issues.apache.org/jira/browse/SPARK-22032
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Maciej Bryński


StructType.fromInternal calls f.fromInternal(v) for every field.
We can use the needConversion method to limit the number of function calls.







[jira] [Updated] (SPARK-22031) KMeans - Compute cost for a single vector

2017-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22031:
--
Target Version/s:   (was: 2.3.0)
  Labels:   (was: newbie)
Priority: Minor  (was: Major)

> KMeans - Compute cost for a single vector
> -
>
> Key: SPARK-22031
> URL: https://issues.apache.org/jira/browse/SPARK-22031
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Laurent Valdes
>Priority: Minor
>
> BisectingKMeans models from the ML package have the ability to compute the cost 
> for a single data point.
> We would like to port that ability to the mainstream KMeansModels, in both ML 
> and MLlib.
> A pull request is provided.






[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-09-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168038#comment-16168038
 ] 

Nicholas Chammas commented on SPARK-17025:
--

I take that back. I won't be able to test this for the time being. If someone 
else wants to test this out and needs pointers, I'd be happy to help.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).






[jira] [Assigned] (SPARK-22030) GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22030:


Assignee: (was: Apache Spark)

> GraphiteSink fails to re-connect to Graphite instances behind an ELB or any 
> other auto-scaled LB
> 
>
> Key: SPARK-22030
> URL: https://issues.apache.org/jira/browse/SPARK-22030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Alex Mikhailau
>Priority: Critical
>
> Upgrade the codahale metrics library so that the Graphite constructor can 
> re-resolve hosts behind a CNAME with retried DNS lookups. When Graphite is 
> deployed behind an ELB, the ELB may change IP addresses based on auto-scaling 
> needs. The current approach makes Graphite unusable in that case; this fixes 
> that use case.
> Upgrade to codahale 3.1.5.
> Use the new Graphite(host, port) constructor instead of the new Graphite(new 
> InetSocketAddress(host, port)) constructor.
> These are the proposed changes in the codahale lib: 
> dropwizard/metrics@v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. 
> Specifically, 
> https://github.com/dropwizard/metrics/blob/b4d246d34e8a059b047567848b3522567cbe6108/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L120






[jira] [Commented] (SPARK-22030) GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168030#comment-16168030
 ] 

Apache Spark commented on SPARK-22030:
--

User 'alexmnyc' has created a pull request for this issue:
https://github.com/apache/spark/pull/19210

> GraphiteSink fails to re-connect to Graphite instances behind an ELB or any 
> other auto-scaled LB
> 
>
> Key: SPARK-22030
> URL: https://issues.apache.org/jira/browse/SPARK-22030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Alex Mikhailau
>Priority: Critical
>
> Upgrade the codahale metrics library so that the Graphite constructor can 
> re-resolve hosts behind a CNAME with retried DNS lookups. When Graphite is 
> deployed behind an ELB, the ELB may change IP addresses based on auto-scaling 
> needs. The current approach makes Graphite unusable in that case; this fixes 
> that use case.
> Upgrade to codahale 3.1.5.
> Use the new Graphite(host, port) constructor instead of the new Graphite(new 
> InetSocketAddress(host, port)) constructor.
> These are the proposed changes in the codahale lib: 
> dropwizard/metrics@v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. 
> Specifically, 
> https://github.com/dropwizard/metrics/blob/b4d246d34e8a059b047567848b3522567cbe6108/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L120






[jira] [Assigned] (SPARK-22030) GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22030:


Assignee: Apache Spark

> GraphiteSink fails to re-connect to Graphite instances behind an ELB or any 
> other auto-scaled LB
> 
>
> Key: SPARK-22030
> URL: https://issues.apache.org/jira/browse/SPARK-22030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Alex Mikhailau
>Assignee: Apache Spark
>Priority: Critical
>
> Upgrade the codahale metrics library so that the Graphite constructor can 
> re-resolve hosts behind a CNAME with retried DNS lookups. When Graphite is 
> deployed behind an ELB, the ELB may change IP addresses based on auto-scaling 
> needs. The current approach makes Graphite unusable in that case; this fixes 
> that use case.
> Upgrade to codahale 3.1.5.
> Use the new Graphite(host, port) constructor instead of the new Graphite(new 
> InetSocketAddress(host, port)) constructor.
> These are the proposed changes in the codahale lib: 
> dropwizard/metrics@v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. 
> Specifically, 
> https://github.com/dropwizard/metrics/blob/b4d246d34e8a059b047567848b3522567cbe6108/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L120






[jira] [Updated] (SPARK-22031) KMeans - Compute cost for a single vector

2017-09-15 Thread Laurent Valdes (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Laurent Valdes updated SPARK-22031:
---
Summary: KMeans - Compute cost for a single vector  (was: Compute cost for 
a single vector)

> KMeans - Compute cost for a single vector
> -
>
> Key: SPARK-22031
> URL: https://issues.apache.org/jira/browse/SPARK-22031
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Laurent Valdes
>  Labels: newbie
>
> BisectingKMeans models from the ML package have the ability to compute the cost 
> for a single data point.
> We would like to port that ability to the mainstream KMeansModels, in both ML 
> and MLlib.
> A pull request is provided.






[jira] [Created] (SPARK-22031) Compute cost for a single vector

2017-09-15 Thread Laurent Valdes (JIRA)
Laurent Valdes created SPARK-22031:
--

 Summary: Compute cost for a single vector
 Key: SPARK-22031
 URL: https://issues.apache.org/jira/browse/SPARK-22031
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Laurent Valdes


BisectingKMeans models from the ML package have the ability to compute the cost 
for a single data point.

We would like to port that ability to the mainstream KMeansModels, in both ML and 
MLlib.

A pull request is provided.
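For reference, a small Scala sketch of what the single-point cost amounts to with the existing public API (the helper name is ours; the proposed method itself lives in the pull request): the cost of one vector is its squared distance to the nearest cluster center.

{code}
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Illustrative helper, not the proposed API itself.
def singlePointCost(model: KMeansModel, point: Vector): Double =
  model.clusterCenters.map(center => Vectors.sqdist(center, point)).min
{code}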






[jira] [Created] (SPARK-22030) GraphiteSink fails to re-connect to Graphite instances behind an ELB or any other auto-scaled LB

2017-09-15 Thread Alex Mikhailau (JIRA)
Alex Mikhailau created SPARK-22030:
--

 Summary: GraphiteSink fails to re-connect to Graphite instances 
behind an ELB or any other auto-scaled LB
 Key: SPARK-22030
 URL: https://issues.apache.org/jira/browse/SPARK-22030
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Alex Mikhailau
Priority: Critical


Upgrade the codahale metrics library so that the Graphite constructor can 
re-resolve hosts behind a CNAME with retried DNS lookups. When Graphite is 
deployed behind an ELB, the ELB may change IP addresses based on auto-scaling 
needs. The current approach makes Graphite unusable in that case; this fixes 
that use case.

Upgrade to codahale 3.1.5.
Use the new Graphite(host, port) constructor instead of the new Graphite(new 
InetSocketAddress(host, port)) constructor.

These are the proposed changes in the codahale lib: 
dropwizard/metrics@v3.1.2...v3.1.5#diff-6916c85d2dd08d89fe771c952e3b8512R120. 
Specifically, 
https://github.com/dropwizard/metrics/blob/b4d246d34e8a059b047567848b3522567cbe6108/metrics-graphite/src/main/java/com/codahale/metrics/graphite/Graphite.java#L120
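A hedged Scala sketch of the constructor change (host and port values are placeholders; this is not the full GraphiteSink code): the host/port constructor lets the reporter re-resolve the CNAME when it reconnects, while the InetSocketAddress constructor freezes a single resolved IP.

{code}
import java.net.InetSocketAddress
import com.codahale.metrics.graphite.Graphite

val host = "graphite.example.com"  // placeholder
val port = 2003                    // placeholder

// Before: the CNAME is resolved once, so an ELB that changes IPs breaks the sink.
val before = new Graphite(new InetSocketAddress(host, port))

// After (codahale 3.1.5): the host/port constructor re-resolves the hostname on connect.
val after = new Graphite(host, port)
{code}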









[jira] [Commented] (SPARK-22029) Cache of _parse_datatype_json_string function

2017-09-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168017#comment-16168017
 ] 

Maciej Bryński commented on SPARK-22029:


I did a proof of concept with functools.lru_cache.
But this function is only available in Python >= 3.2.
Any ideas for Python 2.x?

> Cache of _parse_datatype_json_string function
> -
>
> Key: SPARK-22029
> URL: https://issues.apache.org/jira/browse/SPARK-22029
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> _parse_datatype_json_string is called many times with the same arguments.
> Caching these calls can dramatically speed up PySpark code execution (by up to 
> 20%).






[jira] [Created] (SPARK-22029) Cache of _parse_datatype_json_string function

2017-09-15 Thread JIRA
Maciej Bryński created SPARK-22029:
--

 Summary: Cache of _parse_datatype_json_string function
 Key: SPARK-22029
 URL: https://issues.apache.org/jira/browse/SPARK-22029
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Maciej Bryński


_parse_datatype_json_string is called many times with the same arguments.
Caching these calls can dramatically speed up PySpark code execution (by up to 20%).






[jira] [Updated] (SPARK-22024) [pySpark] Speeding up internal conversion for Spark SQL

2017-09-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22024:
---
Summary: [pySpark] Speeding up internal conversion for Spark SQL  (was: 
[pySpark] Speeding up fromInternal methods)

> [pySpark] Speeding up internal conversion for Spark SQL
> ---
>
> Key: SPARK-22024
> URL: https://issues.apache.org/jira/browse/SPARK-22024
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> fromInternal methods of PySpark data types are a bottleneck when using PySpark.
> This is an umbrella ticket for different optimizations in these methods.






[jira] [Commented] (SPARK-22028) spark-submit trips over environment variables

2017-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168009#comment-16168009
 ] 

Sean Owen commented on SPARK-22028:
---

But that's a Java error or limit, even. What would Spark do? You should unset 
this. 

> spark-submit trips over environment variables
> -
>
> Key: SPARK-22028
> URL: https://issues.apache.org/jira/browse/SPARK-22028
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: Operating System: Windows 10
> Shell: CMD or bash.exe, both with the same result
>Reporter: Franz Wimmer
>  Labels: windows
>
> I have a strange environment variable in my Windows operating system:
> {code:none}
> C:\Path>set ""
> =::=::\
> {code}
> According to [this issue at 
> stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
> sort of old MS-DOS relict that interacts with cygwin shells.
> Leaving that aside for a moment, Spark tries to read environment variables on 
> submit and trips over it: 
> {code:none}
> ./spark-submit.cmd
> Running Spark using the REST application submission protocol.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch 
> an application in spark://:31824.
> 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
> spark://***:31824.
> Warning: Master endpoint spark://:31824 was not a REST server. 
> Falling back to legacy submission gateway instead.
> 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> [ ... ]
> 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: 
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> at 
> java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114)
> at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61)
> at 
> java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221)
> at 
> org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:55)
> at 
> org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:54)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:54)
> at 
> org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:181)
> at 
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:91)
> {code}
> Please note that _spark-submit.cmd_ is in this case my own script calling the 
> _spark-submit.cmd_ from the spark distribution.
> I think that shouldn't happen. Spark should handle such a malformed 
> environment variable gracefully.
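A hedged Scala sketch of what "handle gracefully" could look like (illustrative only, not Spark's actual CommandUtils code; the helper name is ours): skip environment entries whose names the JVM would reject instead of letting ProcessEnvironment.put throw.

{code}
// Illustrative only: drop environment variables with names the JVM refuses,
// such as the Windows "=::" relict, instead of failing the whole submission.
def copyEnv(builder: ProcessBuilder, env: Map[String, String]): Unit = {
  val target = builder.environment()
  env.foreach { case (name, value) =>
    if (name.nonEmpty && !name.contains('=')) {
      target.put(name, value)
    } // else: silently ignore malformed names like "=::"
  }
}
{code}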






[jira] [Reopened] (SPARK-22028) spark-submit trips over environment variables

2017-09-15 Thread Franz Wimmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franz Wimmer reopened SPARK-22028:
--

Sorry, the error regarding the Hadoop binaries is normal for this system; I'm 
asking because of the exception below.

> spark-submit trips over environment variables
> -
>
> Key: SPARK-22028
> URL: https://issues.apache.org/jira/browse/SPARK-22028
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: Operating System: Windows 10
> Shell: CMD or bash.exe, both with the same result
>Reporter: Franz Wimmer
>  Labels: windows
>
> I have a strange environment variable in my Windows operating system:
> {code:none}
> C:\Path>set ""
> =::=::\
> {code}
> According to [this issue at 
> stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
> sort of old MS-DOS relict that interacts with cygwin shells.
> Leaving that aside for a moment, Spark tries to read environment variables on 
> submit and trips over it: 
> {code:none}
> ./spark-submit.cmd
> Running Spark using the REST application submission protocol.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch 
> an application in spark://:31824.
> 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
> spark://***:31824.
> Warning: Master endpoint spark://:31824 was not a REST server. 
> Falling back to legacy submission gateway instead.
> 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
> at org.apache.hadoop.util.Shell.(Shell.java:387)
> at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
> at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
> at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
> at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
> at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
> at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
> at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391)
> at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
> at org.apache.spark.deploy.Client$.main(Client.scala:230)
> at org.apache.spark.deploy.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: 
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> at 
> java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114)
> at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61)
> at 
> java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221)
> at 
> or

[jira] [Issue Comment Deleted] (SPARK-22028) spark-submit trips over environment variables

2017-09-15 Thread Franz Wimmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franz Wimmer updated SPARK-22028:
-
Comment: was deleted

(was: The Error with the Hadoop binaries is normal for this system - I'm asking 
because of the exception below.)

> spark-submit trips over environment variables
> -
>
> Key: SPARK-22028
> URL: https://issues.apache.org/jira/browse/SPARK-22028
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: Operating System: Windows 10
> Shell: CMD or bash.exe, both with the same result
>Reporter: Franz Wimmer
>  Labels: windows
>
> I have a strange environment variable in my Windows operating system:
> {code:none}
> C:\Path>set ""
> =::=::\
> {code}
> According to [this issue at 
> stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
> sort of old MS-DOS relict that interacts with cygwin shells.
> Leaving that aside for a moment, Spark tries to read environment variables on 
> submit and trips over it: 
> {code:none}
> ./spark-submit.cmd
> Running Spark using the REST application submission protocol.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch 
> an application in spark://:31824.
> 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
> spark://***:31824.
> Warning: Master endpoint spark://:31824 was not a REST server. 
> Falling back to legacy submission gateway instead.
> 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
> at org.apache.hadoop.util.Shell.(Shell.java:387)
> at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
> at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
> at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
> at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
> at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
> at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
> at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391)
> at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
> at org.apache.spark.deploy.Client$.main(Client.scala:230)
> at org.apache.spark.deploy.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: 
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> at 
> java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114)
> at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61)
> at 
> java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221)
>

[jira] [Updated] (SPARK-22028) spark-submit trips over environment variables

2017-09-15 Thread Franz Wimmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franz Wimmer updated SPARK-22028:
-
Description: 
I have a strange environment variable in my Windows operating system:

{code:none}
C:\Path>set ""
=::=::\
{code}

According to [this issue at 
stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
sort of old MS-DOS relic that interacts with Cygwin shells.

Leaving that aside for a moment, Spark tries to read environment variables on 
submit and trips over it: 

{code:none}
./spark-submit.cmd
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch an 
application in spark://:31824.
17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
spark://***:31824.
Warning: Master endpoint spark://:31824 was not a REST server. Falling 
back to legacy submission gateway instead.
17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
hadoop binary path
[ ... ]
17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: 
java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
at 
java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114)
at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61)
at 
java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170)
at 
java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242)
at 
java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221)
at 
org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:55)
at 
org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:54)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
at 
scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
at 
scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:54)
at 
org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:181)
at 
org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:91)
{code}

Please note that _spark-submit.cmd_ is in this case my own script calling the 
_spark-submit.cmd_ from the Spark distribution.

I think that shouldn't happen: Spark should handle such a malformed environment 
variable gracefully.
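
For illustration only, here is a minimal Scala sketch (not the actual Spark code or 
fix) of the kind of defensive filtering this would take: skip environment entries 
whose names the JVM rejects instead of letting the ProcessBuilder environment map 
throw. The {{EnvFilterSketch}} object and its {{isValidName}} check are assumptions 
made up for this example.

{code}
import scala.collection.JavaConverters._

// Sketch only: filter out malformed entries such as "=::" before handing the
// environment to ProcessBuilder. isValidName is an illustrative assumption.
object EnvFilterSketch {
  private def isValidName(name: String): Boolean =
    name.nonEmpty && !name.contains("=")

  def buildProcess(command: Seq[String], extraEnv: Map[String, String]): ProcessBuilder = {
    val builder = new ProcessBuilder(command.asJava)
    val env = builder.environment()
    // Skip entries that the JVM would otherwise reject with
    // IllegalArgumentException: Invalid environment variable name.
    extraEnv.filter { case (name, _) => isValidName(name) }
      .foreach { case (name, value) => env.put(name, value) }
    builder
  }
}
{code}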

  was:
I have a strange environment variable in my Windows operating system:

{code:none}
C:\Path>set ""
=::=::\
{code}

According to [this issue at 
stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
sort of old MS-DOS relict that interacts with cygwin shells.

Leaving that aside for a moment, Spark tries to read environment variables on 
submit and trips over it: 

{code:none}
./spark-submit.cmd
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch an 
application in spark://:31824.
17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
spark://***:31824.
Warning: Master endpoint spark://:31824 was not a REST server. Falling 
back to legacy submission gateway instead.
17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
at org.apache.hadoop.util.Shell.(Shell.java:387)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
at 
org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at 
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubje

[jira] [Commented] (SPARK-22028) spark-submit trips over environment variables

2017-09-15 Thread Franz Wimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167970#comment-16167970
 ] 

Franz Wimmer commented on SPARK-22028:
--

The error with the Hadoop binaries is normal for this system; I'm asking 
because of the exception below.

> spark-submit trips over environment variables
> -
>
> Key: SPARK-22028
> URL: https://issues.apache.org/jira/browse/SPARK-22028
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: Operating System: Windows 10
> Shell: CMD or bash.exe, both with the same result
>Reporter: Franz Wimmer
>  Labels: windows
>
> I have a strange environment variable in my Windows operating system:
> {code:none}
> C:\Path>set ""
> =::=::\
> {code}
> According to [this issue at 
> stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
> sort of old MS-DOS relic that interacts with Cygwin shells.
> Leaving that aside for a moment, Spark tries to read environment variables on 
> submit and trips over it: 
> {code:none}
> ./spark-submit.cmd
> Running Spark using the REST application submission protocol.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch 
> an application in spark://:31824.
> 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
> spark://***:31824.
> Warning: Master endpoint spark://:31824 was not a REST server. 
> Falling back to legacy submission gateway instead.
> 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
> at org.apache.hadoop.util.Shell.(Shell.java:387)
> at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
> at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
> at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
> at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
> at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
> at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
> at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391)
> at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
> at org.apache.spark.deploy.Client$.main(Client.scala:230)
> at org.apache.spark.deploy.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: 
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> at 
> java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114)
> at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61)
> at 
> java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(Proce

[jira] [Resolved] (SPARK-22028) spark-submit trips over environment variables

2017-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22028.
---
Resolution: Not A Problem

No, it indicates you don't have the Windows Hadoop binaries (winutils) available. 
See the error and search JIRA here. 

> spark-submit trips over environment variables
> -
>
> Key: SPARK-22028
> URL: https://issues.apache.org/jira/browse/SPARK-22028
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: Operating System: Windows 10
> Shell: CMD or bash.exe, both with the same result
>Reporter: Franz Wimmer
>  Labels: windows
>
> I have a strange environment variable in my Windows operating system:
> {code:none}
> C:\Path>set ""
> =::=::\
> {code}
> According to [this issue at 
> stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
> sort of old MS-DOS relic that interacts with Cygwin shells.
> Leaving that aside for a moment, Spark tries to read environment variables on 
> submit and trips over it: 
> {code:none}
> ./spark-submit.cmd
> Running Spark using the REST application submission protocol.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch 
> an application in spark://:31824.
> 17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
> spark://***:31824.
> Warning: Master endpoint spark://:31824 was not a REST server. 
> Falling back to legacy submission gateway instead.
> 17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
> at org.apache.hadoop.util.Shell.(Shell.java:387)
> at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
> at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
> at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
> at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
> at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
> at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
> at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
> at 
> org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391)
> at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
> at org.apache.spark.deploy.Client$.main(Client.scala:230)
> at org.apache.spark.deploy.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: 
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
> at 
> java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114)
> at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61)
> at 
> java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242)
> at 
> java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221)
>

[jira] [Created] (SPARK-22028) spark-submit trips over environment variables

2017-09-15 Thread Franz Wimmer (JIRA)
Franz Wimmer created SPARK-22028:


 Summary: spark-submit trips over environment variables
 Key: SPARK-22028
 URL: https://issues.apache.org/jira/browse/SPARK-22028
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.1.1
 Environment: Operating System: Windows 10
Shell: CMD or bash.exe, both with the same result
Reporter: Franz Wimmer


I have a strange environment variable in my Windows operating system:

{code:none}
C:\Path>set ""
=::=::\
{code}

According to [this issue at 
stackexchange|https://unix.stackexchange.com/a/251215/251326], this is some 
sort of old MS-DOS relic that interacts with Cygwin shells.

Leaving that aside for a moment, Spark tries to read environment variables on 
submit and trips over it: 

{code:none}
./spark-submit.cmd
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/15 15:57:51 INFO RestSubmissionClient: Submitting a request to launch an 
application in spark://:31824.
17/09/15 15:58:01 WARN RestSubmissionClient: Unable to connect to server 
spark://***:31824.
Warning: Master endpoint spark://:31824 was not a REST server. Falling 
back to legacy submission gateway instead.
17/09/15 15:58:02 ERROR Shell: Failed to locate the winutils binary in the 
hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
at org.apache.hadoop.util.Shell.(Shell.java:387)
at org.apache.hadoop.util.StringUtils.(StringUtils.java:80)
at 
org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at 
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at 
org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
at 
org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2391)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2391)
at org.apache.spark.SecurityManager.(SecurityManager.scala:221)
at org.apache.spark.deploy.Client$.main(Client.scala:230)
at org.apache.spark.deploy.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/09/15 15:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/09/15 15:58:08 ERROR ClientEndpoint: Exception from cluster was: 
java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
java.lang.IllegalArgumentException: Invalid environment variable name: "=::"
at 
java.lang.ProcessEnvironment.validateVariable(ProcessEnvironment.java:114)
at java.lang.ProcessEnvironment.access$200(ProcessEnvironment.java:61)
at 
java.lang.ProcessEnvironment$Variable.valueOf(ProcessEnvironment.java:170)
at 
java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:242)
at 
java.lang.ProcessEnvironment$StringEnvironment.put(ProcessEnvironment.java:221)
at 
org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:55)
at 
org.apache.spark.deploy.worker.CommandUtils$$anonfun$buildProcessBuilder$2.apply(CommandUtils.scala:54)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
at 
scala.collection.immutable.HashMap$HashTrieMap.for

[jira] [Updated] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc

2017-09-15 Thread Kazunori Sakamoto (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazunori Sakamoto updated SPARK-22027:
--
Labels: documentation  (was: )

> Explanation of default value of GBTRegressor's maxIter is missing in API doc
> 
>
> Key: SPARK-22027
> URL: https://issues.apache.org/jira/browse/SPARK-22027
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Kazunori Sakamoto
>Priority: Trivial
>  Labels: documentation
>
> The current API document of GBTRegressor's maxIter doesn't include the 
> explanation of the default value (20).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22027:


Assignee: (was: Apache Spark)

> Explanation of default value of GBTRegressor's maxIter is missing in API doc
> 
>
> Key: SPARK-22027
> URL: https://issues.apache.org/jira/browse/SPARK-22027
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Kazunori Sakamoto
>Priority: Trivial
>  Labels: documentation
>
> The current API document of GBTRegressor's maxIter doesn't include the 
> explanation of the default value (20).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22027:


Assignee: Apache Spark

> Explanation of default value of GBTRegressor's maxIter is missing in API doc
> 
>
> Key: SPARK-22027
> URL: https://issues.apache.org/jira/browse/SPARK-22027
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Kazunori Sakamoto
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: documentation
>
> The current API document of GBTRegressor's maxIter doesn't include the 
> explanation of the default value (20).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167944#comment-16167944
 ] 

Apache Spark commented on SPARK-22027:
--

User 'exKAZUu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19248

> Explanation of default value of GBTRegressor's maxIter is missing in API doc
> 
>
> Key: SPARK-22027
> URL: https://issues.apache.org/jira/browse/SPARK-22027
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 
> 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Kazunori Sakamoto
>Priority: Trivial
>  Labels: documentation
>
> The current API document of GBTRegressor's maxIter doesn't include the 
> explanation of the default value (20).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22027) Explanation of default value of GBTRegressor's maxIter is missing in API doc

2017-09-15 Thread Kazunori Sakamoto (JIRA)
Kazunori Sakamoto created SPARK-22027:
-

 Summary: Explanation of default value of GBTRegressor's maxIter is 
missing in API doc
 Key: SPARK-22027
 URL: https://issues.apache.org/jira/browse/SPARK-22027
 Project: Spark
  Issue Type: Documentation
  Components: ML
Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 
1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 1.4.1, 1.4.0
Reporter: Kazunori Sakamoto
Priority: Trivial


The current API documentation of GBTRegressor's maxIter doesn't explain the 
default value (20).
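
A quick way to confirm the default in question (a sketch; it only instantiates the 
estimator, so no SparkSession or data is needed):

{code}
import org.apache.spark.ml.regression.GBTRegressor

// Instantiating the estimator is enough to read its parameter defaults.
val gbt = new GBTRegressor()
println(gbt.getMaxIter)   // prints 20, the default the API doc should mention
{code}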



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21996) Streaming ignores files with spaces in the file names

2017-09-15 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167941#comment-16167941
 ] 

Xiayun Sun edited comment on SPARK-21996 at 9/15/17 2:30 PM:
-

I can reproduce this issue on the master branch, and found that it comes from using 
{{.toString}} instead of {{.getPath}} to get the file path as a string. 

I pushed a PR to make it work for streaming, but there are other places in Spark 
where we use {{.toString}}. I'm not sure whether it's a general concern. 


was (Author: xiayunsun):
I can reproduce this issue for master branch, and found out it was from using 
`.toString` instead of `.getPath` to get the file path in string format. 

I pushed a PR to make it work for streaming. But there're other places in spark 
where we're using `.toString`. Not sure if it's a general concern. 

> Streaming ignores files with spaces in the file names
> -
>
> Key: SPARK-21996
> URL: https://issues.apache.org/jira/browse/SPARK-21996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: openjdk version "1.8.0_131"
> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11)
> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ivan Sharamet
> Attachments: spark-streaming.zip
>
>
> I tried to stream text files from a folder and noticed that files inside this 
> folder with spaces in their names are ignored and there are some warnings in the 
> log:
> {code}
> 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory 
> file:/in/two%20two.txt was not found. Was it deleted very recently?
> {code}
> I found that this happens due to duplicate file path URI encoding (I suppose) 
> and the actual URI inside path objects looks like this 
> {{file:/in/two%2520two.txt}}.
> To reproduce this issue just place some text files with spaces in their names 
> and execute some simple streaming code:
> {code:java}
> /in
> /one.txt
> /two two.txt
> /three.txt
> {code}
> {code}
> sparkSession.readStream.textFile("/in")
>   .writeStream
>   .option("checkpointLocation", "/checkpoint")
>   .format("text")
>   .start("/out")
>   .awaitTermination()
> {code}
> The result will contain only content of files {{one.txt}} and {{three.txt}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21996) Streaming ignores files with spaces in the file names

2017-09-15 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167941#comment-16167941
 ] 

Xiayun Sun commented on SPARK-21996:


I can reproduce this issue on the master branch, and found that it comes from using 
`.toString` instead of `.getPath` to get the file path as a string. 

I pushed a PR to make it work for streaming, but there are other places in Spark 
where we use `.toString`. I'm not sure whether it's a general concern. 
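
To see the double encoding in isolation, here is a self-contained sketch using plain 
java.net.URI; it is not the Spark/Hadoop code path, only an illustration of how 
re-encoding an already encoded path string turns %20 into %2520, as in the warning 
above:

{code}
import java.net.URI

val plain = "/in/two two.txt"

// First pass: the multi-argument URI constructor quotes the space as %20.
val once = new URI("file", null, plain, null)
println(once)    // file:/in/two%20two.txt

// Second pass: these constructors always quote '%', so re-encoding the already
// encoded string yields %2520 -- the missing-file path seen in the warning.
val twice = new URI("file", null, once.getRawPath, null)
println(twice)   // file:/in/two%2520two.txt
{code}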

> Streaming ignores files with spaces in the file names
> -
>
> Key: SPARK-21996
> URL: https://issues.apache.org/jira/browse/SPARK-21996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: openjdk version "1.8.0_131"
> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11)
> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ivan Sharamet
> Attachments: spark-streaming.zip
>
>
> I tried to stream text files from a folder and noticed that files inside this 
> folder with spaces in their names are ignored and there are some warnings in the 
> log:
> {code}
> 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory 
> file:/in/two%20two.txt was not found. Was it deleted very recently?
> {code}
> I found that this happens due to duplicate file path URI encoding (I suppose) 
> and the actual URI inside path objects looks like this 
> {{file:/in/two%2520two.txt}}.
> To reproduce this issue just place some text files with spaces in their names 
> and execute some simple streaming code:
> {code:java}
> /in
> /one.txt
> /two two.txt
> /three.txt
> {code}
> {code}
> sparkSession.readStream.textFile("/in")
>   .writeStream
>   .option("checkpointLocation", "/checkpoint")
>   .format("text")
>   .start("/out")
>   .awaitTermination()
> {code}
> The result will contain only content of files {{one.txt}} and {{three.txt}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21996) Streaming ignores files with spaces in the file names

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167927#comment-16167927
 ] 

Apache Spark commented on SPARK-21996:
--

User 'xysun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19247

> Streaming ignores files with spaces in the file names
> -
>
> Key: SPARK-21996
> URL: https://issues.apache.org/jira/browse/SPARK-21996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: openjdk version "1.8.0_131"
> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11)
> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ivan Sharamet
> Attachments: spark-streaming.zip
>
>
> I tried to stream text files from a folder and noticed that files inside this 
> folder with spaces in their names are ignored and there are some warnings in the 
> log:
> {code}
> 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory 
> file:/in/two%20two.txt was not found. Was it deleted very recently?
> {code}
> I found that this happens due to duplicate file path URI encoding (I suppose) 
> and the actual URI inside path objects looks like this 
> {{file:/in/two%2520two.txt}}.
> To reproduce this issue just place some text files with spaces in their names 
> and execute some simple streaming code:
> {code:java}
> /in
> /one.txt
> /two two.txt
> /three.txt
> {code}
> {code}
> sparkSession.readStream.textFile("/in")
>   .writeStream
>   .option("checkpointLocation", "/checkpoint")
>   .format("text")
>   .start("/out")
>   .awaitTermination()
> {code}
> The result will contain only content of files {{one.txt}} and {{three.txt}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21996) Streaming ignores files with spaces in the file names

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21996:


Assignee: Apache Spark

> Streaming ignores files with spaces in the file names
> -
>
> Key: SPARK-21996
> URL: https://issues.apache.org/jira/browse/SPARK-21996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: openjdk version "1.8.0_131"
> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11)
> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ivan Sharamet
>Assignee: Apache Spark
> Attachments: spark-streaming.zip
>
>
> I tried to stream text files from a folder and noticed that files inside this 
> folder with spaces in their names are ignored and there are some warnings in the 
> log:
> {code}
> 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory 
> file:/in/two%20two.txt was not found. Was it deleted very recently?
> {code}
> I found that this happens due to duplicate file path URI encoding (I suppose) 
> and the actual URI inside path objects looks like this 
> {{file:/in/two%2520two.txt}}.
> To reproduce this issue just place some text files with spaces in their names 
> and execute some simple streaming code:
> {code:java}
> /in
> /one.txt
> /two two.txt
> /three.txt
> {code}
> {code}
> sparkSession.readStream.textFile("/in")
>   .writeStream
>   .option("checkpointLocation", "/checkpoint")
>   .format("text")
>   .start("/out")
>   .awaitTermination()
> {code}
> The result will contain only content of files {{one.txt}} and {{three.txt}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21996) Streaming ignores files with spaces in the file names

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21996:


Assignee: (was: Apache Spark)

> Streaming ignores files with spaces in the file names
> -
>
> Key: SPARK-21996
> URL: https://issues.apache.org/jira/browse/SPARK-21996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: openjdk version "1.8.0_131"
> OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.17.04.3-b11)
> OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
>Reporter: Ivan Sharamet
> Attachments: spark-streaming.zip
>
>
> I tried to stream text files from a folder and noticed that files inside this 
> folder with spaces in their names are ignored and there are some warnings in the 
> log:
> {code}
> 17/09/13 16:15:14 WARN InMemoryFileIndex: The directory 
> file:/in/two%20two.txt was not found. Was it deleted very recently?
> {code}
> I found that this happens due to duplicate file path URI encoding (I suppose) 
> and the actual URI inside path objects looks like this 
> {{file:/in/two%2520two.txt}}.
> To reproduce this issue just place some text files with spaces in their names 
> and execute some simple streaming code:
> {code:java}
> /in
> /one.txt
> /two two.txt
> /three.txt
> {code}
> {code}
> sparkSession.readStream.textFile("/in")
>   .writeStream
>   .option("checkpointLocation", "/checkpoint")
>   .format("text")
>   .start("/out")
>   .awaitTermination()
> {code}
> The result will contain only content of files {{one.txt}} and {{three.txt}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22026) data source v2 write path

2017-09-15 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22026:
---

 Summary: data source v2 write path
 Key: SPARK-22026
 URL: https://issues.apache.org/jira/browse/SPARK-22026
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22024) [pySpark] Speeding up fromInternal methods

2017-09-15 Thread Maciej Bryński (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22024:
---
Description: 
fromInternal methods of PySpark datatypes are a bottleneck when using PySpark.
This is an umbrella ticket for different optimizations in these methods.

  was:
fromInternal methods of pySpark datatypes are bottleneck when using pySpark.
This is umbrella ticket to different optimization in thos methods.


> [pySpark] Speeding up fromInternal methods
> --
>
> Key: SPARK-22024
> URL: https://issues.apache.org/jira/browse/SPARK-22024
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> fromInternal methods of PySpark datatypes are a bottleneck when using PySpark.
> This is an umbrella ticket for different optimizations in these methods.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-21958:
--

Assignee: Travis Hegner

> Attempting to save large Word2Vec model hangs driver in constant GC.
> 
>
> Key: SPARK-21958
> URL: https://issues.apache.org/jira/browse/SPARK-21958
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
> Environment: Running spark on yarn, hadoop 2.7.2 provided by the 
> cluster
>Reporter: Travis Hegner
>Assignee: Travis Hegner
>  Labels: easyfix, patch, performance
> Fix For: 2.3.0
>
>
> In the new version of Word2Vec, the model saving was modified to estimate an 
> appropriate number of partitions based on the kryo buffer size. This is a 
> great improvement, but there is a caveat for very large models.
> The {{(word, vector)}} tuple goes through a transformation to a local case 
> class of {{Data(word, vector)}}... I can only assume this is for the kryo 
> serialization process. The new version of the code iterates over the entire 
> vocabulary to do this transformation (the old version wrapped the entire 
> datum) in the driver's heap, only to have the result then distributed to the 
> cluster to be written into its parquet files.
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, 
> and tri-grams), that local driver transformation is causing the driver to 
> hang indefinitely in GC as I can only assume that it's generating millions of 
> short lived objects which can't be evicted fast enough.
> Perhaps I'm overlooking something, but it seems to me that since the result 
> is distributed over the cluster to be saved _after_ the transformation 
> anyway, we may as well distribute it _first_, allowing the cluster resources 
> to do the transformation more efficiently, and then write the parquet file 
> from there.
> I have a patch implemented, and am in the process of testing it at scale. I 
> will open a pull request when I feel that the patch is successfully resolving 
> the issue, and after making sure that it passes unit tests.
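
The description above proposes distributing the (word, vector) pairs before wrapping 
them into the local case class. A minimal sketch of that idea follows; it is not the 
actual patch, and {{saveWordVectors}}, its parameters, and the simplified {{Data}} 
class are stand-ins for the real Word2VecModel internals.

{code}
import org.apache.spark.sql.SparkSession

case class Data(word: String, vector: Array[Float])

def saveWordVectors(
    spark: SparkSession,
    wordVectors: Map[String, Array[Float]],
    numPartitions: Int,
    path: String): Unit = {
  import spark.implicits._
  // Parallelize the raw (word, vector) pairs first, so that wrapping each row
  // into Data happens on the executors rather than in the driver's heap.
  spark.sparkContext
    .parallelize(wordVectors.toSeq, numPartitions)
    .map { case (word, vector) => Data(word, vector) }
    .toDF()
    .write.parquet(path)
}
{code}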



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-15 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-21958.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19191
[https://github.com/apache/spark/pull/19191]

> Attempting to save large Word2Vec model hangs driver in constant GC.
> 
>
> Key: SPARK-21958
> URL: https://issues.apache.org/jira/browse/SPARK-21958
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
> Environment: Running spark on yarn, hadoop 2.7.2 provided by the 
> cluster
>Reporter: Travis Hegner
>  Labels: easyfix, patch, performance
> Fix For: 2.3.0
>
>
> In the new version of Word2Vec, the model saving was modified to estimate an 
> appropriate number of partitions based on the kryo buffer size. This is a 
> great improvement, but there is a caveat for very large models.
> The {{(word, vector)}} tuple goes through a transformation to a local case 
> class of {{Data(word, vector)}}... I can only assume this is for the kryo 
> serialization process. The new version of the code iterates over the entire 
> vocabulary to do this transformation (the old version wrapped the entire 
> datum) in the driver's heap, only to have the result then distributed to the 
> cluster to be written into its parquet files.
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, 
> and tri-grams), that local driver transformation is causing the driver to 
> hang indefinitely in GC as I can only assume that it's generating millions of 
> short lived objects which can't be evicted fast enough.
> Perhaps I'm overlooking something, but it seems to me that since the result 
> is distributed over the cluster to be saved _after_ the transformation 
> anyway, we may as well distribute it _first_, allowing the cluster resources 
> to do the transformation more efficiently, and then write the parquet file 
> from there.
> I have a patch implemented, and am in the process of testing it at scale. I 
> will open a pull request when I feel that the patch is successfully resolving 
> the issue, and after making sure that it passes unit tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22025) Speeding up fromInternal for StructField

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22025:


Assignee: (was: Apache Spark)

> Speeding up fromInternal for StructField
> 
>
> Key: SPARK-22025
> URL: https://issues.apache.org/jira/browse/SPARK-22025
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> The current code for StructField is a simple function call:
> https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434
> We can change it to a function reference.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22025) Speeding up fromInternal for StructField

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22025:


Assignee: Apache Spark

> Speeding up fromInternal for StructField
> 
>
> Key: SPARK-22025
> URL: https://issues.apache.org/jira/browse/SPARK-22025
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>
> The current code for StructField is a simple function call:
> https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434
> We can change it to a function reference.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22025) Speeding up fromInternal for StructField

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167846#comment-16167846
 ] 

Apache Spark commented on SPARK-22025:
--

User 'maver1ck' has created a pull request for this issue:
https://github.com/apache/spark/pull/19246

> Speeding up fromInternal for StructField
> 
>
> Key: SPARK-22025
> URL: https://issues.apache.org/jira/browse/SPARK-22025
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> The current code for StructField is a simple function call:
> https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434
> We can change it to a function reference.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-15 Thread Karan Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan Singh updated SPARK-22012:

Description: 
My Spark Streaming batch duration is 5 seconds (5000 ms) and Kafka is at its default 
properties, but I am still facing this issue. Can anyone tell me how to resolve 
it, or what am I doing wrong?
JavaStreamingContext ssc = new JavaStreamingContext(sc, 
new Duration(5000));


Exception in Spark Streaming:
Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-xxx pulse 1 163684030 after polling for 512
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at 
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
com.xxx.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
at 
com.xxx.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
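
For reference, the 512 in the assertion message is the consumer poll timeout in 
milliseconds used when fetching records. In the kafka-0-10 integration this is 
controlled by spark.streaming.kafka.consumer.poll.ms, and raising it is a commonly 
tried mitigation; the sketch below uses a purely illustrative value and is not a 
recommendation for this particular job.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: give the Kafka consumer more time to poll before the assertion fires.
val conf = new SparkConf()
  .setAppName("kafka-poll-timeout-sketch")
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")

val ssc = new StreamingContext(conf, Seconds(5))
{code}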

  was:
My Spark Streaming duration is 5 seconds (5000) and kafka is all at its default 
properties , i am still facing this issue , Can anyone tell me how to resolve 
it what i am doing wrong ?
JavaStreamingContext ssc = new JavaStreamingContext(sc, 
new Duration(5000));


Exception in Spark Streamings
Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-xxx pulse 1 163684030 after polling for 512
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at 
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
at 
com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


> CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after 
> polling for 512"
> --
>
> Key: SPARK-22012
>   

[jira] [Comment Edited] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-15 Thread Karan Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166112#comment-16166112
 ] 

Karan Singh edited comment on SPARK-19275 at 9/15/17 1:02 PM:
--

Hi Team, 
My Spark Streaming batch duration is 5 seconds (5000 ms) and Kafka is at its default 
properties, but I am still facing this issue. Can anyone tell me how to resolve 
it, or what am I doing wrong?
JavaStreamingContext ssc = new JavaStreamingContext(sc, 
new Duration(5000));


Exception in Spark Streaming:
Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-xxx pulse 1 163684030 after polling for 512
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at 
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
com.xxx.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
at 
com.xxx.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


was (Author: karansinghkjs346):
Hi Team , 
My Spark Streaming duration is 5 seconds (5000) and kafka is all at its default 
properties , i am still facing this issue , Can anyone tell me how to resolve 
it what i am doing wrong ?
JavaStreamingContext ssc = new JavaStreamingContext(sc, 
new Duration(5000));


Exception in Spark Streamings
Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-xxx pulse 1 163684030 after polling for 512
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at 
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
at 
com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

> Spark Streaming, Kafka receiver, "Failed to get records for ... after polling 
> for 512"
> 

[jira] [Created] (SPARK-22025) Speeding up fromInternal for StructField

2017-09-15 Thread Maciej Bryński (JIRA)
Maciej Bryński created SPARK-22025:
--

 Summary: Speeding up fromInternal for StructField
 Key: SPARK-22025
 URL: https://issues.apache.org/jira/browse/SPARK-22025
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Maciej Bryński


The current fromInternal code for StructField is a simple function call:
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L434
We can change it to a function reference.
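
A minimal sketch of the idea (illustrative only, not the actual pyspark source; 
class and attribute names are simplified):

{code}
# Before: every fromInternal call goes through an extra delegating method.
class StructField(object):
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType

    def fromInternal(self, obj):
        return self.dataType.fromInternal(obj)  # one extra frame per value

# After: bind the underlying converter once at construction time, so callers
# invoke the referenced function directly.
class StructFieldWithRef(object):
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType
        self.fromInternal = dataType.fromInternal  # function reference
{code}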



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22024) [pySpark] Speeding up fromInternal methods

2017-09-15 Thread Maciej Bryński (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22024:
---
Summary: [pySpark] Speeding up fromInternal methods  (was: Speeding up 
fromInternal methods)

> [pySpark] Speeding up fromInternal methods
> --
>
> Key: SPARK-22024
> URL: https://issues.apache.org/jira/browse/SPARK-22024
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> fromInternal methods of pySpark datatypes are a bottleneck when using pySpark.
> This is an umbrella ticket for different optimizations in those methods.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-15 Thread Maciej Bryński (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22010:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22024

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type values to Python we are using the following code:
> {code}datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
> 1000000){code}
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow because:
> # we're trying to get the TZ on every conversion
> # we're using the replace method
> Proposed solution: a custom datetime conversion, with the TZ calculation moved 
> to module level
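
A rough sketch of what moving the TZ work to module level could look like 
(illustrative only, not the actual patch; it ignores DST transitions, which a 
real implementation would have to handle):

{code}
import datetime

# Timezone-dependent work done once at import time instead of per value.
_EPOCH_LOCAL = datetime.datetime.fromtimestamp(0)

def timestamp_from_internal(ts):
    # ts is microseconds since the epoch (Spark's internal representation).
    return _EPOCH_LOCAL + datetime.timedelta(microseconds=ts)
{code}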



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22024) Speeding up fromInternal methods

2017-09-15 Thread Maciej Bryński (JIRA)
Maciej Bryński created SPARK-22024:
--

 Summary: Speeding up fromInternal methods
 Key: SPARK-22024
 URL: https://issues.apache.org/jira/browse/SPARK-22024
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Maciej Bryński


fromInternal methods of pySpark datatypes are a bottleneck when using pySpark.
This is an umbrella ticket for different optimizations in those methods.
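
A rough way to observe the overhead (an illustrative sketch, not a benchmark 
from this ticket; the session setup and sizes are arbitrary):

{code}
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(1000000).select(F.current_timestamp().alias("ts"), "id")

t0 = time.time()
df.count()      # stays on the JVM side, no Python-side conversion
print("count:   %.2fs" % (time.time() - t0))

t0 = time.time()
df.collect()    # each ts value goes through TimestampType.fromInternal
print("collect: %.2fs" % (time.time() - t0))
{code}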



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7276) withColumn is very slow on dataframe with large number of columns

2017-09-15 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167814#comment-16167814
 ] 

Barry Becker commented on SPARK-7276:
-

Isn't there still a problem with withColumn performance in 2.1.1? I noticed 
that using a select that adds a bunch of derived columns at once (using column 
expressions) is *much* faster than adding a bunch of withColumn calls. Has 
anyone else noticed this? It might be good to have a "withColumns" method to 
avoid the overhead of making multiple withColumn calls.
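
A small PySpark illustration of the difference (a sketch; the session setup and 
column names are assumptions):

{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "a")

# Many withColumn calls: each one produces and re-analyzes a new plan.
slow = df
for i in range(1, 201):
    slow = slow.withColumn("a_new_col_%d" % i, F.col("a") + i)

# One select adding all derived columns at once: a single projection.
fast = df.select([F.col("a")] +
                 [(F.col("a") + i).alias("a_new_col_%d" % i)
                  for i in range(1, 201)])
{code}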

> withColumn is very slow on dataframe with large number of columns
> -
>
> Key: SPARK-7276
> URL: https://issues.apache.org/jira/browse/SPARK-7276
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Alexandre CLEMENT
>Assignee: Wenchen Fan
> Fix For: 1.4.0
>
>
> The code snippet demonstrates the problem.
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val sparkConf = new SparkConf().setAppName("Spark 
> Test").setMaster(System.getProperty("spark.master", "local[4]"))
> val sc = new SparkContext(sparkConf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> val custs = Seq(
>   Row(1, 21, "Bob", 80.5),
>   Row(2, 21, "Bobby", 80.5),
>   Row(3, 21, "Jean", 80.5),
>   Row(4, 21, "Fatime", 80.5)
> )
> var fields = List(
>   StructField("id", IntegerType, true),
>   StructField("a", IntegerType, true),
>   StructField("b", StringType, true),
>   StructField("target", DoubleType, false))
> val schema = StructType(fields)
> var rdd = sc.parallelize(custs)
> var df = sqlContext.createDataFrame(rdd, schema)
> for (i <- 1 to 200) {
>   val now = System.currentTimeMillis
>   df = df.withColumn("a_new_col_" + i, df("a") + i)
>   println(s"$i -> " + (System.currentTimeMillis - now))
> }
> df.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167806#comment-16167806
 ] 

Nick Pentreath commented on SPARK-22021:


Why a JavaScript function? I think this is not a good fit to go into Spark ML 
core. You can easily have this as an external library or Spark package.

We are looking at potentially adding a transformer for generic Scala functions 
in SPARK-20271

> Add a feature transformation to accept a function and apply it on all rows of 
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hosur Narahari
>
> We often generate derived features in an ML pipeline by applying a 
> mathematical or other operation to dataframe columns, e.g. adding the total 
> of a few columns as a new column, or computing the length of a text message 
> field. We currently don't have an efficient way to handle such scenarios in 
> an ML pipeline.
> By providing a transformer which accepts a function and applies it to the 
> given columns to generate an output column of numerical type, the user has 
> the flexibility to derive features by applying any domain-specific logic.
> Example:
> val function = "function(a,b) { return a+b;}"
> val transformer = new GenFuncTransformer().setInputCols(Array("v1", 
> "v2")).setOutputCol("result").setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
> v1   v2  result
> 1.0 2.0 3.0
> 3.0 4.0 7.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-15 Thread Jurgis Pods (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167788#comment-16167788
 ] 

Jurgis Pods edited comment on SPARK-21994 at 9/15/17 12:13 PM:
---

I have updated to CDH 5.12.1 and the problem persists. There is an existing 
topic on the Cloudera forums with exactly this problem: 
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Saving-Spark-2-2-dataframs-in-Hive-table/m-p/58463/thread-id/2867


was (Author: pederpansen):
I have updated to CDH 5.12.1 and the problem persists. There is an existing 
topic on the Cloudera forums with exactly this problem: 
http://community.cloudera.com/t5/forums/replypage/board-id/Spark/message-id/2867

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-15 Thread Jurgis Pods (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167788#comment-16167788
 ] 

Jurgis Pods commented on SPARK-21994:
-

I have updated to CDH 5.12.1 and the problem persists. There is an existing 
topic on the Cloudera forums with exactly this problem: 
http://community.cloudera.com/t5/forums/replypage/board-id/Spark/message-id/2867

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-15 Thread Jurgis Pods (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurgis Pods updated SPARK-21994:

Description: 
This seems to be a new bug introduced in Spark 2.2, since it did not occur 
under Spark 2.1.

When writing a dataframe to a table in Parquet format, Spark SQL does not write 
the 'path' of the table to the Hive metastore, unlike in previous versions.

As a consequence, Spark 2.2 is not able to read the table it just created. It 
just outputs the table header without any row content. 

A parallel installation of Spark 1.6 at least produces an appropriate error 
trace:

{code:java}
17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so recording the 
schema version 1.1.0
17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
returning NoSuchObjectException
org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
java.util.NoSuchElementException: key not found: path
[...]
{code}


h3. Steps to reproduce:

Run the following in spark2-shell:

{code:java}
scala> val df = spark.sql("show databases")
scala> df.show()
++
|databaseName|
++
|   mydb1|
|   mydb2|
| default|
|test|
++
scala> df.write.format("parquet").saveAsTable("test.spark22_test")
scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
++{code}

When manually setting the path (causing the data to be saved as external 
table), it works:


{code:java}
scala> df.write.option("path", 
"/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")

scala> spark.sql("select * from test.spark22_parquet_with_path").show()
++
|databaseName|
++
|   mydb1|
|   mydb2|
| default|
|test|
++
{code}

A second workaround is to update the metadata of the managed table created by 
Spark 2.2:
{code}
spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
spark.catalog.refreshTable("test.spark22_test")
spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
|   mydb1|
|   mydb2|
| default|
|test|
++
{code}

It is kind of a disaster that we are not able to read tables created by the 
very same Spark version and have to manually specify the path as an explicit 
option.

  was:
This seems to be a new bug introduced in Spark 2.2, since it did not occur 
under Spark 2.1.

When writing a dataframe to a table in Parquet format, Spark SQL does not write 
the 'path' of the table to the Hive metastore, unlike in previous versions.

As a consequence, Spark 2.2 is not able to read the table it just created. It 
just outputs the table header without any row content. 

A parallel installation of Spark 1.6 at least produces an appropriate error 
trace:

{code:java}
17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so recording the 
schema version 1.1.0
17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
returning NoSuchObjectException
org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
java.util.NoSuchElementException: key not found: path
[...]
{code}


h3. Steps to reproduce:

Run the following in spark2-shell:

{code:java}
scala> val df = spark.sql("show databases")
scala> df.show()
++
|databaseName|
++
|   mydb1|
|   mydb2|
| default|
|test|
++
scala> df.write.format("parquet").saveAsTable("test.spark22_test")
scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
++{code}

When manually setting the path, it works:


{code:java}
scala> df.write.option("path", 
"/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")

scala> spark.sql("select * from test.spark22_parquet_with_path").show()
++
|databaseName|
++
|   mydb1|
|   mydb2|
| default|
|test|
++
{code}

It is kind of a disaster that we are not able to read tables created by the 
very same Spark version and have to manually specify the path as an explicit 
option.


> Spark 2.2 can not read Parquet table created 

[jira] [Updated] (SPARK-22023) Multi-column Spark SQL UDFs broken in Python 3

2017-09-15 Thread Oli Hall (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oli Hall updated SPARK-22023:
-
Description: 
I've been testing some existing PySpark code after migrating to Python3, and 
there seems to be an issue with multi-column UDFs in Spark SQL. Essentially, 
any UDF that takes in more than one column as input fails with an error 
relating to expansion of an underlying lambda expression:

{code}
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/python/lib/pyspark.zip/pyspark/worker.py", line 177, 
in main
process()
  File "/python/lib/pyspark.zip/pyspark/worker.py", line 172, 
in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/python/lib/pyspark.zip/pyspark/serializers.py", line 
237, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File "/python/lib/pyspark.zip/pyspark/serializers.py", line 
138, in dump_stream
for obj in iterator:
  File "/python/lib/pyspark.zip/pyspark/serializers.py", line 
226, in _batched
for item in iterator:
  File "", line 1, in 
  File "/python/lib/pyspark.zip/pyspark/worker.py", line 71, in 

return lambda *a: f(*a)
TypeError: () takes 1 positional argument but 2 were given

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

This syntax should (and does) work in Python 2, but is not valid in Python 3, I 
believe.

I have a minimal example that reproduces the error, running in the PySpark 
shell, with Python 3.6.2, Spark 2.2:

{code}
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import LongType
>>> 
>>> df = spark.createDataFrame(sc.parallelize([{'a': 1, 'b': 1}, {'a': 2, 'b': 
>>> 2}]))
/Users/oli-hall/Documents/Code/spark2/python/pyspark/sql/session.py:351: 
UserWarning: Using RDD of dict to inferSchema is deprecated. Use 
pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
>>> df.printSchema()
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)

>>> sum_udf = udf(lambda x: x[0] + x[1], LongType())
>>> 
>>> with_sum = df.withColumn('sum', sum_udf('a', 'b'))
>>> 
>>> with_sum.first()
17/09/15 11:43:56 ERROR executor.Executor: Exception in task 2.0 in stage 3.0 
(TID 8)
... (error snipped)
TypeError: <lambda>() takes 1 positional argument but 2 were given
{code}

I've managed to work around it for now, by pushing the input columns into a 
struct, then modifying the UDF to read from the struct, but it'd be good if I 
could use multi-column input again.

Workaround:
{code}
>>> from pyspark.sql.functions import udf, struct
>>> from pyspark.sql.types import LongType
>>> 
>>> df = spark.createDataFrame(sc.parallelize([{'a': 1, 'b': 1}, {'a': 2, 'b': 
>>> 2}]))
>>> 
>>> sum_udf = udf(lambda x: x.a + x.b, LongType())
>>> 
>>> with_sum = df.withColumn('temp_struct', struct('a', 'b')).withColumn('sum', 
>>> sum_udf('temp_struct'))
>>> with_sum.first()
Row(a=1, b=1, temp_struct=Row(a=1, b=1), sum=2)
{code}
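
A related observation: a UDF whose lambda declares one parameter per input 
column appears to work directly (a minimal sketch, not part of the original 
report, reusing the df from the example above):

{code}
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import LongType
>>>
>>> sum_udf2 = udf(lambda a, b: a + b, LongType())
>>> df.withColumn('sum', sum_udf2('a', 'b')).first()
Row(a=1, b=1, sum=2)
{code}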

  was:
I've been testing some existing PySpark code after migrating to Python3, and 
there seems to be an issue with multi-column UDFs in Spark SQL. Es

[jira] [Created] (SPARK-22023) Multi-column Spark SQL UDFs broken in Python 3

2017-09-15 Thread Oli Hall (JIRA)
Oli Hall created SPARK-22023:


 Summary: Multi-column Spark SQL UDFs broken in Python 3
 Key: SPARK-22023
 URL: https://issues.apache.org/jira/browse/SPARK-22023
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: Python3
Reporter: Oli Hall


I've been testing some existing PySpark code after migrating to Python3, and 
there seems to be an issue with multi-column UDFs in Spark SQL. Essentially, 
any UDF that takes in more than one column as input fails with an error 
relating to expansion of an underlying lambda expression:

{code:bash}
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/python/lib/pyspark.zip/pyspark/worker.py", line 177, 
in main
process()
  File "/python/lib/pyspark.zip/pyspark/worker.py", line 172, 
in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/python/lib/pyspark.zip/pyspark/serializers.py", line 
237, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File "/python/lib/pyspark.zip/pyspark/serializers.py", line 
138, in dump_stream
for obj in iterator:
  File "/python/lib/pyspark.zip/pyspark/serializers.py", line 
226, in _batched
for item in iterator:
  File "", line 1, in 
  File "/python/lib/pyspark.zip/pyspark/worker.py", line 71, in 

return lambda *a: f(*a)
TypeError: () takes 1 positional argument but 2 were given

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

This syntax should (and does) work in Python 2, but is not valid in Python 3, I 
believe.

I have a minimal example that reproduces the error, running in the PySpark 
shell, with Python 3.6.2, Spark 2.2:

{code:python}
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import LongType
>>> 
>>> df = spark.createDataFrame(sc.parallelize([{'a': 1, 'b': 1}, {'a': 2, 'b': 
>>> 2}]))
/Users/oli-hall/Documents/Code/spark2/python/pyspark/sql/session.py:351: 
UserWarning: Using RDD of dict to inferSchema is deprecated. Use 
pyspark.sql.Row instead
  warnings.warn("Using RDD of dict to inferSchema is deprecated. "
>>> df.printSchema()
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)

>>> sum_udf = udf(lambda x: x[0] + x[1], LongType())
>>> 
>>> with_sum = df.withColumn('sum', sum_udf('a', 'b'))
>>> 
>>> with_sum.first()
17/09/15 11:43:56 ERROR executor.Executor: Exception in task 2.0 in stage 3.0 
(TID 8)
...
TypeError: <lambda>() takes 1 positional argument but 2 were given
{code}

I've managed to work around it for now, by pushing the input columns into a 
struct, then modifying the UDF to read from the struct, but it'd be good if I 
could use multi-column input again.

Workaround:
{code:python}
>>> from pyspark.sql.functions import udf, struct
>>> from pyspark.sql.types import LongType
>>> 
>>> df = spark.createDataFrame(sc.parallelize([{'a': 1, 'b': 1}, {'a': 2, 'b': 
>>> 2}]))
>>> 
>>> sum_udf = udf(lambda x: x.a + x.b, LongType())
>>> 
>>> with_sum = df.withColumn('temp_struct', struct('a', 'b')).withColumn('sum', 
>>> sum_udf('temp_struct'))
>>> with_sum.first

[jira] [Assigned] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22021:


Assignee: (was: Apache Spark)

> Add a feature transformation to accept a function and apply it on all rows of 
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hosur Narahari
>
> We often generate derived features in an ML pipeline by applying a 
> mathematical or other operation to dataframe columns, e.g. adding the total 
> of a few columns as a new column, or computing the length of a text message 
> field. We currently don't have an efficient way to handle such scenarios in 
> an ML pipeline.
> By providing a transformer which accepts a function and applies it to the 
> given columns to generate an output column of numerical type, the user has 
> the flexibility to derive features by applying any domain-specific logic.
> Example:
> val function = "function(a,b) { return a+b;}"
> val transformer = new GenFuncTransformer().setInputCols(Array("v1", 
> "v2")).setOutputCol("result").setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
> v1   v2  result
> 1.0 2.0 3.0
> 3.0 4.0 7.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167753#comment-16167753
 ] 

Apache Spark commented on SPARK-22021:
--

User 'narahari92' has created a pull request for this issue:
https://github.com/apache/spark/pull/19244

> Add a feature transformation to accept a function and apply it on all rows of 
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hosur Narahari
>
> We often generate derived features in an ML pipeline by applying a 
> mathematical or other operation to dataframe columns, e.g. adding the total 
> of a few columns as a new column, or computing the length of a text message 
> field. We currently don't have an efficient way to handle such scenarios in 
> an ML pipeline.
> By providing a transformer which accepts a function and applies it to the 
> given columns to generate an output column of numerical type, the user has 
> the flexibility to derive features by applying any domain-specific logic.
> Example:
> val function = "function(a,b) { return a+b;}"
> val transformer = new GenFuncTransformer().setInputCols(Array("v1", 
> "v2")).setOutputCol("result").setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
> v1   v2  result
> 1.0 2.0 3.0
> 3.0 4.0 7.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22021:


Assignee: Apache Spark

> Add a feature transformation to accept a function and apply it on all rows of 
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hosur Narahari
>Assignee: Apache Spark
>
> We often generate derived features in an ML pipeline by applying a 
> mathematical or other operation to dataframe columns, e.g. adding the total 
> of a few columns as a new column, or computing the length of a text message 
> field. We currently don't have an efficient way to handle such scenarios in 
> an ML pipeline.
> By providing a transformer which accepts a function and applies it to the 
> given columns to generate an output column of numerical type, the user has 
> the flexibility to derive features by applying any domain-specific logic.
> Example:
> val function = "function(a,b) { return a+b;}"
> val transformer = new GenFuncTransformer().setInputCols(Array("v1", 
> "v2")).setOutputCol("result").setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
> v1   v2  result
> 1.0 2.0 3.0
> 3.0 4.0 7.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734
 ] 

Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:29 AM:
--

The alternative is giving an explicit schema instead of inferring one, so that 
you don't need to change your pojo class in the above test case.

{code}
StructType schema = new StructType()
.add("id", IntegerType)
.add("str", StringType);
Dataset df = spark.read().schema(schema).json(stringdataset).as(
org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}



was (Author: jmchung):
The alternative is giving the explicit schema instead inferring, means you 
don't need to change your pojo class.

{code}
StructType schema = new StructType()
.add("id", IntegerType)
.add("str", StringType);
Dataset df = spark.read().schema(schema).json(stringdataset).as(
org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}


> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData data) {
> ArrayList arr = new ArrayList<>();
> arr.add(new SampleDataFlat(data.getStr(), data.getId()));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+1));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+2));
> return arr.iterator();
> }
> }
> {code}
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 38, Column 16: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 38, Column 16: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
> SparkUnitTest$SampleData();
> /* 028 */ this.javaBean = value1;
> /* 029 */ if (!false) {
> /* 030 */
> /* 031 */
> /* 032 */   boolean isNull3 = i.isNullAt(0);
> /* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
> /* 034 */
> /* 035 */   if (isNull3) {
> /* 036 */ throw new NullPointerException(((java.lang.String) 
> references[0]));
> /* 037 */   }
> /* 038 */   javaBean.setId(value3);



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Hosur Narahari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167736#comment-16167736
 ] 

Hosur Narahari commented on SPARK-22021:


If I just apply this function, I can't use it in Spark's ML pipeline and will 
have to break it into two sub-pipelines, between which I have to apply the 
function (logic). But by providing a transformer we can keep one single 
pipeline end to end. It will also be a one-stop generic solution for getting 
any derived feature.

> Add a feature transformation to accept a function and apply it on all rows of 
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hosur Narahari
>
> We often generate derived features in an ML pipeline by applying a 
> mathematical or other operation to dataframe columns, e.g. adding the total 
> of a few columns as a new column, or computing the length of a text message 
> field. We currently don't have an efficient way to handle such scenarios in 
> an ML pipeline.
> By providing a transformer which accepts a function and applies it to the 
> given columns to generate an output column of numerical type, the user has 
> the flexibility to derive features by applying any domain-specific logic.
> Example:
> val function = "function(a,b) { return a+b;}"
> val transformer = new GenFuncTransformer().setInputCols(Array("v1", 
> "v2")).setOutputCol("result").setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
> v1   v2  result
> 1.0 2.0 3.0
> 3.0 4.0 7.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734
 ] 

Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:28 AM:
--

The alternative is giving an explicit schema instead of inferring one, which 
means you don't need to change your pojo class.

{code}
StructType schema = new StructType()
.add("id", IntegerType)
.add("str", StringType);
Dataset df = spark.read().schema(schema).json(stringdataset).as(
org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}



was (Author: jmchung):
The alternative is giving the explicit schema instead inferring, means you 
don't need to change your pojo class.

{code}
StructType schema = new StructType().add("id", IntegerType).add("str", 
StringType);
Dataset df = spark.read().schema(schema).json(stringdataset).as(
org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}


> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData data) {
> ArrayList arr = new ArrayList<>();
> arr.add(new SampleDataFlat(data.getStr(), data.getId()));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+1));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+2));
> return arr.iterator();
> }
> }
> {code}
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 38, Column 16: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 38, Column 16: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
> SparkUnitTest$SampleData();
> /* 028 */ this.javaBean = value1;
> /* 029 */ if (!false) {
> /* 030 */
> /* 031 */
> /* 032 */   boolean isNull3 = i.isNullAt(0);
> /* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
> /* 034 */
> /* 035 */   if (isNull3) {
> /* 036 */ throw new NullPointerException(((java.lang.String) 
> references[0]));
> /* 037 */   }
> /* 038 */   javaBean.setId(value3);



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167734#comment-16167734
 ] 

Jen-Ming Chung commented on SPARK-22019:


The alternative is giving an explicit schema instead of inferring one, which 
means you don't need to change your pojo class.

{code}
StructType schema = new StructType().add("id", IntegerType).add("str", 
StringType);
Dataset df = spark.read().schema(schema).json(stringdataset).as(
org.apache.spark.sql.Encoders.bean(SampleData.class));
{code}


> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData data) {
> ArrayList arr = new ArrayList<>();
> arr.add(new SampleDataFlat(data.getStr(), data.getId()));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+1));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+2));
> return arr.iterator();
> }
> }
> {code}
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 38, Column 16: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 38, Column 16: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
> SparkUnitTest$SampleData();
> /* 028 */ this.javaBean = value1;
> /* 029 */ if (!false) {
> /* 030 */
> /* 031 */
> /* 032 */   boolean isNull3 = i.isNullAt(0);
> /* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
> /* 034 */
> /* 035 */   if (isNull3) {
> /* 036 */ throw new NullPointerException(((java.lang.String) 
> references[0]));
> /* 037 */   }
> /* 038 */   javaBean.setId(value3);



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167725#comment-16167725
 ] 

Jen-Ming Chung edited comment on SPARK-22019 at 9/15/17 11:18 AM:
--

Hi [~client.test],

The schema inferred after {{sqc.read().json(stringdataset)}} is as below:
{code}
root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

However, in the pojo class {{SampleData.class}} the member {{id}} is declared 
as {{int}} instead of {{long}}, which causes the subsequent exception in your 
test case.
So set {{id}} to {{long}} in {{SampleData.class}} and run the test case again; 
you can expect the following results:
{code}
++
| str|
++
|everyone|
|everyone|
|everyone|
|   Hello|
|   Hello|
|   Hello|
++

root
 |-- str: string (nullable = true)
{code}


As you can see, we are missing {{id}} in the schema, so we need to add {{id}} 
and the corresponding getter and setter:
{code}
class SampleDataFlat {
...
long id;
public long getId() {
return id;
}

public void setId(long id) {
this.id = id;
}

public SampleDataFlat(String str, long id) {
this.str = str;
this.id = id;
}
...
}
{code}

Then you will get the following results:
{code}
+---++
| id| str|
+---++
|  1|everyone|
|  2|everyone|
|  3|everyone|
|  1|   Hello|
|  2|   Hello|
|  3|   Hello|
+---++

root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}


was (Author: jmchung):
Hi [~client.test],

The schema inferred after {{sqc.read().json(stringdataset)}} as below,
{code}
root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

However, the pojo class {{SampleData.class}} the member {{id}} is declared as 
{{int}} instead of {{long}}, this will cause the subsequent exception in your 
test case.
So set the {{long}} type to {{id}} in {{SampleData.class}} then executing the 
test case again, you can expect the following results:
{code}
++
| str|
++
|everyone|
|everyone|
|everyone|
|   Hello|
|   Hello|
|   Hello|
++

root
 |-- str: string (nullable = true)
{code}


As you can see, we missing the {{id}} in schema, we need to add the {{id}} and 
corresponding getter and setter,
{code}
class SampleDataFlat {
long id;
public long getId() {
return id;
}

public void setId(long id) {
this.id = id;
}

public SampleDataFlat(String str, long id) {
this.str = str;
this.id = id;
}
}
{code}

Then you will get the following results:
{code}
+---++
| id| str|
+---++
|  1|everyone|
|  2|everyone|
|  3|everyone|
|  1|   Hello|
|  2|   Hello|
|  3|   Hello|
+---++

root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData da

[jira] [Commented] (SPARK-22019) JavaBean int type property

2017-09-15 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167725#comment-16167725
 ] 

Jen-Ming Chung commented on SPARK-22019:


Hi [~client.test],

The schema inferred after {{sqc.read().json(stringdataset)}} is as below:
{code}
root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

However, in the pojo class {{SampleData.class}} the member {{id}} is declared 
as {{int}} instead of {{long}}, which causes the subsequent exception in your 
test case.
So set {{id}} to {{long}} in {{SampleData.class}} and run the test case again; 
you can expect the following results:
{code}
++
| str|
++
|everyone|
|everyone|
|everyone|
|   Hello|
|   Hello|
|   Hello|
++

root
 |-- str: string (nullable = true)
{code}


As you can see, we are missing {{id}} in the schema, so we need to add {{id}} 
and the corresponding getter and setter:
{code}
class SampleDataFlat {
long id;
public long getId() {
return id;
}

public void setId(long id) {
this.id = id;
}

public SampleDataFlat(String str, long id) {
this.str = str;
this.id = id;
}
}
{code}

Then you will get the following results:
{code}
+---++
| id| str|
+---++
|  1|everyone|
|  2|everyone|
|  3|everyone|
|  1|   Hello|
|  2|   Hello|
|  3|   Hello|
+---++

root
 |-- id: long (nullable = true)
 |-- str: string (nullable = true)
{code}

> JavaBean int type property 
> ---
>
> Key: SPARK-22019
> URL: https://issues.apache.org/jira/browse/SPARK-22019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> when the type of SampleData's id is int, following code generates errors.
> when long, it's ok.
>  
> {code:java}
> @Test
> public void testDataSet2() {
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\", \"id\": 1}");
> arr.add("{\"str\": \"Hello\", \"id\": 1}");
> //1.read array and change to string dataset.
> JavaRDD data = sc.parallelize(arr);
> Dataset stringdataset = sqc.createDataset(data.rdd(), 
> Encoders.STRING());
> stringdataset.show(); //PASS
> //2. convert string dataset to sampledata dataset
> Dataset df = 
> sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
> df.show();//PASS
> df.printSchema();//PASS
> Dataset fad = df.flatMap(SampleDataFlat::flatMap, 
> Encoders.bean(SampleDataFlat.class));
> fad.show(); //ERROR
> fad.printSchema();
> }
> public static class SampleData implements Serializable {
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public int getId() {
> return id;
> }
> public void setId(int id) {
> this.id = id;
> }
> String str;
> int id;
> }
> public static class SampleDataFlat {
> String str;
> public String getStr() {
> return str;
> }
> public void setStr(String str) {
> this.str = str;
> }
> public SampleDataFlat(String str, long id) {
> this.str = str;
> }
> public static Iterator flatMap(SampleData data) {
> ArrayList arr = new ArrayList<>();
> arr.add(new SampleDataFlat(data.getStr(), data.getId()));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+1));
> arr.add(new SampleDataFlat(data.getStr(), data.getId()+2));
> return arr.iterator();
> }
> }
> {code}
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 38, Column 16: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 38, Column 16: No applicable constructor/method found for actual parameters 
> "long"; candidates are: "public void SparkUnitTest$SampleData.setId(int)"
> /* 024 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 025 */ InternalRow i = (InternalRow) _i;
> /* 026 */
> /* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
> SparkUnitTest$SampleData();
> /* 028 */ this.javaBean = value1;
> /* 029 */ if (!false) {
> /* 030 */
> /* 031 */
> /* 032 */   boolean isNull3 = i.isNullAt(0);
> /* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
> /* 034 */
> /* 035 */   if (isNull3) {
> /* 036 */ throw new NullPointerException(((java.lang.String) 
> references[0]));
> /* 037 */   }
> /* 038 */   javaBean.setId(value3);



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Created] (SPARK-22022) Unable to use Python Profiler with SparkSession

2017-09-15 Thread JIRA
Maciej Bryński created SPARK-22022:
--

 Summary: Unable to use Python Profiler with SparkSession
 Key: SPARK-22022
 URL: https://issues.apache.org/jira/browse/SPARK-22022
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Maciej Bryński
Priority: Minor


SparkContext has a profiler_cls option that makes it possible to use a Python 
profiler within a Spark app.

There is no such option in SparkSession.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167722#comment-16167722
 ] 

Sean Owen commented on SPARK-22021:
---

Why can't you just apply this function? Or implement Transformer. I'm not sure 
what the point of a transformer that just applies a function is. 
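
For illustration, a bare-bones {{Transformer}} that derives one column really is 
small. This is only a sketch (the class name, the fixed {{v1 + v2}} logic and the 
output column name are made up to mirror the example below; it is not an existing 
API):
{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Adds a "result" column equal to v1 + v2; the logic is hard-coded purely for the sketch.
class SumColumnsTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("sumCols"))

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("result", col("v1") + col("v2"))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("result", DoubleType, nullable = true))

  override def copy(extra: ParamMap): SumColumnsTransformer = defaultCopy(extra)
}

// Usage (with spark.implicits._ in scope):
//   new SumColumnsTransformer().transform(Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")).show()
{code}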

> Add a feature transformation to accept a function and apply it on all rows of 
> dataframe
> ---
>
> Key: SPARK-22021
> URL: https://issues.apache.org/jira/browse/SPARK-22021
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Hosur Narahari
>
> We often generate derived features in an ML pipeline by applying a 
> mathematical or other operation to columns of a dataframe, for example summing 
> a few columns into a new column, or taking the length of a text message field. 
> We currently don't have an efficient way to handle such scenarios in an ML 
> pipeline.
> By providing a transformer that accepts a function and applies it to the 
> specified columns to generate an output column of numerical type, the user 
> gains the flexibility to derive features with any domain-specific logic.
> Example:
> val function = "function(a,b) { return a+b;}"
> val transformer = new GenFuncTransformer().setInputCols(Array("v1", 
> "v2")).setOutputCol("result").setFunction(function)
> val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
> val result = transformer.transform(df)
> result.show
> v1   v2  result
> 1.0 2.0 3.0
> 3.0 4.0 7.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22021) Add a feature transformation to accept a function and apply it on all rows of dataframe

2017-09-15 Thread Hosur Narahari (JIRA)
Hosur Narahari created SPARK-22021:
--

 Summary: Add a feature transformation to accept a function and 
apply it on all rows of dataframe
 Key: SPARK-22021
 URL: https://issues.apache.org/jira/browse/SPARK-22021
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.3.0
Reporter: Hosur Narahari


We often generate derived features in an ML pipeline by applying a mathematical 
or other operation to columns of a dataframe, for example summing a few columns 
into a new column, or taking the length of a text message field. We currently 
don't have an efficient way to handle such scenarios in an ML pipeline.

By providing a transformer that accepts a function and applies it to the 
specified columns to generate an output column of numerical type, the user gains 
the flexibility to derive features with any domain-specific logic.

Example:

val function = "function(a,b) { return a+b;}"
val transformer = new GenFuncTransformer().setInputCols(Array("v1", 
"v2")).setOutputCol("result").setFunction(function)
val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
val result = transformer.transform(df)
result.show

v1   v2  result
1.0 2.0 3.0
3.0 4.0 7.0
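
For reference, when the derived column can be expressed in SQL, the existing 
{{SQLTransformer}} already covers this inside a pipeline. A sketch reusing the 
column names from the example above (assumes {{spark.implicits._}} is in scope):
{code:scala}
import org.apache.spark.ml.feature.SQLTransformer

// Derive the column with a SQL expression over the __THIS__ placeholder.
val sumTransformer = new SQLTransformer()
  .setStatement("SELECT *, v1 + v2 AS result FROM __THIS__")

val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("v1", "v2")
sumTransformer.transform(df).show()
// result column: 3.0 and 7.0, matching the expected output above
{code}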




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21780) Simpler Dataset.sample API in R

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21780:


Assignee: Apache Spark

> Simpler Dataset.sample API in R
> ---
>
> Key: SPARK-21780
> URL: https://issues.apache.org/jira/browse/SPARK-21780
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> See parent ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21780) Simpler Dataset.sample API in R

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167571#comment-16167571
 ] 

Apache Spark commented on SPARK-21780:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19243

> Simpler Dataset.sample API in R
> ---
>
> Key: SPARK-21780
> URL: https://issues.apache.org/jira/browse/SPARK-21780
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> See parent ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21780) Simpler Dataset.sample API in R

2017-09-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21780:


Assignee: (was: Apache Spark)

> Simpler Dataset.sample API in R
> ---
>
> Key: SPARK-21780
> URL: https://issues.apache.org/jira/browse/SPARK-21780
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> See parent ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22020) Support session local timezone

2017-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22020.
---
Resolution: Duplicate

> Support session local timezone
> --
>
> Key: SPARK-22020
> URL: https://issues.apache.org/jira/browse/SPARK-22020
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Navya Krishnappa
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> Input data:
> Date,SparkDate,SparkDate1,SparkDate2
> 04/22/2017T03:30:02,2017-03-21T03:30:02,2017-03-21T03:30:02.02Z,2017-03-21T00:00:00Z
> I have set the value below to change the session time zone to UTC, but the 
> current (machine) time zone is still applied even though the input is already 
> in UTC format.
> spark.conf.set("spark.sql.session.timeZone", "UTC")
> Expected: the time should remain the same as the input since it is already in 
> UTC format.
> var df1 = spark.read.option("delimiter", ",").option("qualifier", 
> "\"").option("inferSchema","true").option("header", "true").option("mode", 
> "PERMISSIVE").option("timestampFormat","MM/dd/'T'HH:mm:ss.SSS").option("dateFormat",
>  "MM/dd/'T'HH:mm:ss").csv("DateSpark.csv");
> df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 5 more 
> fields]
> scala> df1.show(false);
> Name | Age | Add  | Date                | SparkDate           | SparkDate1             | SparkDate2
> abc  | 21  | bvxc | 04/22/2017T03:30:02 | 2017-03-21 03:30:02 | 2017-03-21 09:00:02.02 | 2017-03-21 05:30:00



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20921) While reading from oracle database, it converts to wrong type.

2017-09-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20921.
---
Resolution: Duplicate

> While reading from oracle database, it converts to wrong type.
> --
>
> Key: SPARK-20921
> URL: https://issues.apache.org/jira/browse/SPARK-20921
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: vivek dixit
>
> While reading data from Oracle using JDBC, Spark converts an integer column to 
> boolean if the column size is one, which is wrong. This condition in 
> OracleDialect is doing it: 
> https://github.com/apache/spark/commit/39a2b2ea74d420caa37019e3684f65b3a6fcb388#diff-edf9e48ce837c5f2a504428d9aa67970R46
> This has caused our data load process to fail.
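
Until the mapping is changed upstream, one possible workaround is to register a 
custom {{JdbcDialect}} that maps NUMBER(1) back to an integer type; registered 
dialects are consulted before the built-in Oracle dialect. A rough Scala sketch 
(untested; the {{Types.NUMERIC}}/size-1 check is an assumption about how the 
driver reports NUMBER(1)):
{code:scala}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, IntegerType, MetadataBuilder}

// Map Oracle NUMBER(1) columns to IntegerType instead of BooleanType.
object OracleNumberOneAsInt extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (sqlType == Types.NUMERIC && size == 1) Some(IntegerType) else None
}

JdbcDialects.registerDialect(OracleNumberOneAsInt)
{code}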



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21713) Replace LogicalPlan.isStreaming with OutputMode

2017-09-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167506#comment-16167506
 ] 

Apache Spark commented on SPARK-21713:
--

User 'joseph-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/18925

> Replace LogicalPlan.isStreaming with OutputMode
> ---
>
> Key: SPARK-21713
> URL: https://issues.apache.org/jira/browse/SPARK-21713
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>
> The isStreaming bit in LogicalPlan is based on an old model. Switching to 
> OutputMode will allow us to more easily integrate with things that require 
> specific OutputModes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs

2017-09-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21987.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.3.0

> Spark 2.3 cannot read 2.2 event logs
> 
>
> Key: SPARK-21987
> URL: https://issues.apache.org/jira/browse/SPARK-21987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Reported by [~jincheng] in a comment in SPARK-18085:
> {noformat}
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: 
> Unrecognized field "metadata" (class 
> org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 
> known properties: "simpleString", "nodeName", "children", "metrics"])
>  at [Source: 
> {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json
>  at 
> NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
>  
> Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"==
>  Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: 
> array\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- 
> LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange 
> RoundRobinPartitioning(200)\n+- Scan 
> ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange
>  
> RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan
>  
> ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number
>  of output 
> rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data
>  size total (min, med, 
> max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 
> 1, column: 1622] (through reference chain: 
> org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"])
>   at 
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
> {noformat}
> This was caused by SPARK-17701 (which at this moment is still open even 
> though the patch has been committed).
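
For context, the failure itself is just Jackson's default strict handling of 
unknown JSON fields: 2.2 event logs still carry a {{metadata}} field that the 2.3 
{{SparkPlanInfo}} no longer declares. A standalone Scala illustration of the 
mechanism (not the actual fix; the case class is a simplified stand-in):
{code:scala}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// Stand-in for the post-SPARK-17701 SparkPlanInfo, which dropped "metadata".
case class PlanInfoWithoutMetadata(nodeName: String, simpleString: String)

val mapper = new ObjectMapper with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)

// A 2.2-era fragment still carries "metadata"; with Jackson's default
// FAIL_ON_UNKNOWN_PROPERTIES behaviour this throws UnrecognizedPropertyException.
val oldJson = """{"nodeName":"Exchange","simpleString":"Exchange","metadata":{}}"""
mapper.readValue[PlanInfoWithoutMetadata](oldJson)
{code}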



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org