[jira] [Commented] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361933#comment-16361933
 ] 

Apache Spark commented on SPARK-23404:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/20596

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we 
> should copy them to the heap memory.
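For illustration only, here is a hedged Scala sketch of the copy described above; the method name is hypothetical and this is not the change from the pull request linked in this thread:

{code}
import java.nio.ByteBuffer

// Hedged sketch: if the memory mode is ON_HEAP but the buffer handed to us is
// direct (off-heap), copy its readable bytes into a heap-backed buffer.
def copyToHeapIfDirect(buf: ByteBuffer): ByteBuffer =
  if (buf.isDirect) {
    val heap = ByteBuffer.allocate(buf.remaining()) // heap-backed copy
    heap.put(buf.duplicate())                       // duplicate() leaves the caller's position untouched
    heap.flip()                                     // make the copy readable from the start
    heap
  } else {
    buf                                             // already on the heap, nothing to copy
  }
{code}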



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23404:


Assignee: (was: Apache Spark)

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we 
> should copy them to the heap memory.






[jira] [Assigned] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23404:


Assignee: Apache Spark

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we 
> should copy them to the heap memory.






[jira] [Updated] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-12 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23404:

Description: If the memory mode is _ON_HEAP_,when the underlying buffers 
are direct, we should copy them to the heap memory.  (was: If the memory mode 
is _ON_HEAP_,when the underlying buffers are direct, we should copy it to the 
heap memory.)

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we 
> should copy them to the heap memory.






[jira] [Updated] (SPARK-23404) When the underlying buffers are already direct, we should copy them to the heap memory

2018-02-12 Thread liuxian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-23404:

Summary: When the underlying buffers are already direct, we should copy 
them to the heap memory  (was: When the underlying buffers are already direct, 
we should copy it to the heap memory)

> When the underlying buffers are already direct, we should copy them to the 
> heap memory
> --
>
> Key: SPARK-23404
> URL: https://issues.apache.org/jira/browse/SPARK-23404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Priority: Minor
>
> If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we 
> should copy it to the heap memory.






[jira] [Created] (SPARK-23404) When the underlying buffers are already direct, we should copy it to the heap memory

2018-02-12 Thread liuxian (JIRA)
liuxian created SPARK-23404:
---

 Summary: When the underlying buffers are already direct, we should 
copy it to the heap memory
 Key: SPARK-23404
 URL: https://issues.apache.org/jira/browse/SPARK-23404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian


If the memory mode is _ON_HEAP_, when the underlying buffers are direct, we 
should copy it to the heap memory.






[jira] [Updated] (SPARK-23403) java.lang.ArrayIndexOutOfBoundsException: 10

2018-02-12 Thread Naresh Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naresh Kumar updated SPARK-23403:
-
Docs Text: 
val 
washing_flat=sc.textFile("hdfs://ip-172-31-53-45:8020/user/narine91267897/washing_flat.csv")
washing_flat: org.apache.spark.rdd.RDD[String] = 
hdfs://ip-172-31-55-77:8020/user/narine91267897/washing_flat.csv 
MapPartitionsRDD[24] at textFile at <console>:33
scala> val schema=StructType(Array(
 |  StructField("id",StringType,true),
 |  StructField("rev",StringType,true),
 |  StructField("count",LongType,true),
 |  StructField("flowrate",LongType,true),
 |  StructField("fluidlevel",StringType,true),
 |  StructField("frequency",LongType,true),
 |  StructField("hardness",LongType,true),
 |  StructField("speed",LongType,true),
 |  StructField("temperature",LongType,true),
 |  StructField("ts",LongType,true),
 |  StructField("voltage",LongType,true)))

scala> val rowRDD=washing_flat.map(line => line.split(",")).map(row => 
Row(row(0)
 | ,row(1)
 | ,row(2),
 | row(3),
 | row(4),
 | row(5),
 | row(6),
 | row(7),
 | row(8),
 | row(9),
 | row(10)))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
MapPartitionsRDD[26] at map at <console>:35
scala> val washing_df=spark.createDataFrame(rowRDD,schema)
washing_df: org.apache.spark.sql.DataFrame = [id: string, rev: string ... 9 
more fields]
scala> washing_df.printSchema
root
 |-- id: string (nullable = true)
 |-- rev: string (nullable = true)
 |-- count: long (nullable = true)
 |-- flowrate: long (nullable = true)
 |-- fluidlevel: string (nullable = true)
 |-- frequency: long (nullable = true)
 |-- hardness: long (nullable = true)
 |-- speed: long (nullable = true)
 |-- temperature: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- voltage: long (nullable = true)
scala> washing_df.show(5)
18/02/13 05:54:51 ERROR executor.Executor: Exception in task 0.0 in stage 4.0 
(TID 5)
java.lang.ArrayIndexOutOfBoundsException: 10


  was:
val 
washing_flat=sc.textFile("hdfs://ip-172-31-53-48.ec2.internal:8020/user/narine91267897/washing_flat.csv")
washing_flat: org.apache.spark.rdd.RDD[String] = 
hdfs://ip-172-31-55-77:8020/user/narine91267897/washing_flat.csv 
MapPartitionsRDD[24] at textFile at <console>:33
scala> val schema=StructType(Array(
 |  StructField("id",StringType,true),
 |  StructField("rev",StringType,true),
 |  StructField("count",LongType,true),
 |  StructField("flowrate",LongType,true),
 |  StructField("fluidlevel",StringType,true),
 |  StructField("frequency",LongType,true),
 |  StructField("hardness",LongType,true),
 |  StructField("speed",LongType,true),
 |  StructField("temperature",LongType,true),
 |  StructField("ts",LongType,true),
 |  StructField("voltage",LongType,true)))

scala> val rowRDD=washing_flat.map(line => line.split(",")).map(row => 
Row(row(0)
 | ,row(1)
 | ,row(2),
 | row(3),
 | row(4),
 | row(5),
 | row(6),
 | row(7),
 | row(8),
 | row(9),
 | row(10)))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
MapPartitionsRDD[26] at map at <console>:35
scala> val washing_df=spark.createDataFrame(rowRDD,schema)
washing_df: org.apache.spark.sql.DataFrame = [id: string, rev: string ... 9 
more fields]
scala> washing_df.printSchema
root
 |-- id: string (nullable = true)
 |-- rev: string (nullable = true)
 |-- count: long (nullable = true)
 |-- flowrate: long (nullable = true)
 |-- fluidlevel: string (nullable = true)
 |-- frequency: long (nullable = true)
 |-- hardness: long (nullable = true)
 |-- speed: long (nullable = true)
 |-- temperature: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- voltage: long (nullable = true)
scala> washing_df.show(5)
18/02/13 05:54:51 ERROR executor.Executor: Exception in task 0.0 in stage 4.0 
(TID 5)
java.lang.ArrayIndexOutOfBoundsException: 10



> java.lang.ArrayIndexOutOfBoundsException: 10
> 
>
> Key: SPARK-23403
> URL: https://issues.apache.org/jira/browse/SPARK-23403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Naresh Kumar
>Priority: Major
>
> java.lang.ArrayIndexOutOfBoundsException: 10 while retrieving records from a 
> DataFrame in spark-shell
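A hedged side note for readers of the Docs Text above: java.lang.String#split(",") drops trailing empty fields, so a CSV line whose last columns are empty yields fewer than 11 tokens and row(10) then fails. Whether that is the cause of this report is an assumption; the snippet below only illustrates the split behaviour:

{code}
// Illustration of String.split semantics only, not a confirmed diagnosis.
val line = "a,b,,,,,,,,,"      // 11 comma-separated fields, the last 9 empty
line.split(",").length         // 2  -- trailing empty strings are dropped
line.split(",", -1).length     // 11 -- a negative limit keeps every field
{code}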






[jira] [Created] (SPARK-23403) java.lang.ArrayIndexOutOfBoundsException: 10

2018-02-12 Thread Naresh Kumar (JIRA)
Naresh Kumar created SPARK-23403:


 Summary: java.lang.ArrayIndexOutOfBoundsException: 10
 Key: SPARK-23403
 URL: https://issues.apache.org/jira/browse/SPARK-23403
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.2.0
Reporter: Naresh Kumar


java.lang.ArrayIndexOutOfBoundsException: 10 while retrieving records from a 
DataFrame in spark-shell






[jira] [Updated] (SPARK-23340) Empty float/double array columns in ORC file should not raise EOFException

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Description: 
This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th.
The Apache ORC 1.4.2 release removes unnecessary dependencies, and 1.4.3 adds 5 more 
patches including bug fixes (https://s.apache.org/Fll8).
In particular, the following ORC-285 is fixed in 1.4.3.

{code}
scala> val df = Seq(Array.empty[Float]).toDF()

scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array<float>]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: 
file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for 
column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
{code}

  was:
This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more 
patches (https://s.apache.org/Fll8).
Especially, the following ORC-285 is fixed at 1.4.3.

{code}
scala> val df = Seq(Array.empty[Float]).toDF()

scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: 
file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for 
column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
{code}


> Empty float/double array columns in ORC file should not raise EOFException
> --
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> This issue updates the Apache ORC dependencies to 1.4.3, released on February 9th.
> The Apache ORC 1.4.2 release removes unnecessary dependencies, and 1.4.3 adds 5 
> more patches including bug fixes (https://s.apache.org/Fll8).
> In particular, the following ORC-285 is fixed in 1.4.3.
> {code}
> scala> val df = Seq(Array.empty[Float]).toDF()
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array<float>]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}
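For readers following along, the dependency bump itself would look roughly like the sbt sketch below; the artifact coordinates and the nohive classifier are assumptions about how Spark packages ORC, not text taken from this issue:

{code}
// Hedged sbt sketch of bumping ORC to 1.4.3; coordinates and classifier are assumptions.
libraryDependencies ++= Seq(
  "org.apache.orc" % "orc-core"      % "1.4.3" classifier "nohive",
  "org.apache.orc" % "orc-mapreduce" % "1.4.3" classifier "nohive"
)
{code}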






[jira] [Updated] (SPARK-23340) Empty float/double array columns in ORC file should not raise EOFException

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Description: 
This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more 
patches (https://s.apache.org/Fll8).
Especially, the following ORC-285 is fixed at 1.4.3.

{code}
scala> val df = Seq(Array.empty[Float]).toDF()

scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: 
file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for 
column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
{code}

  was:
This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more 
patches (https://s.apache.org/Fll8).
Especially, the following ORC-285 is fixed at 1.4.3.

{code}
scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: 
file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for 
column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
{code}


> Empty float/double array columns in ORC file should not raise EOFException
> --
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).
> Especially, the following ORC-285 is fixed at 1.4.3.
> {code}
> scala> val df = Seq(Array.empty[Float]).toDF()
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Updated] (SPARK-23340) Empty float/double array columns in ORC file should not raise EOFException

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Summary: Empty float/double array columns in ORC file should not raise 
EOFException  (was: Empty float/double array columns in ORC file raise 
EOFException)

> Empty float/double array columns in ORC file should not raise EOFException
> --
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).
> Especially, the following ORC-285 is fixed at 1.4.3.
> {code}
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Updated] (SPARK-23340) Empty float/double array columns in ORC file raise EOFException

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Summary: Empty float/double array columns in ORC file raise EOFException  
(was: Empty float/double array columns raise EOFException)

> Empty float/double array columns in ORC file raise EOFException
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).
> Especially, the following ORC-285 is fixed at 1.4.3.
> {code}
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Updated] (SPARK-23340) Empty float/double array columns raise EOFException

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Description: 
This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more 
patches (https://s.apache.org/Fll8).
Especially, the following ORC-285 is fixed at 1.4.3.

{code}
scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: 
file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for 
column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
{code}

  was:
This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more 
patches (https://s.apache.org/Fll8).

{code}
scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: 
file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for 
column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
{code}


> Empty float/double array columns raise EOFException
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).
> Especially, the following ORC-285 is fixed at 1.4.3.
> {code}
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Updated] (SPARK-23340) Empty float/double array columns raise EOFException

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Priority: Critical  (was: Major)

> Empty float/double array columns raise EOFException
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).
> {code}
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Updated] (SPARK-23340) Empty float/double array columns raise EOFException

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Summary: Empty float/double array columns raise EOFException  (was:  Update 
ORC to 1.4.3)

> Empty float/double array columns raise EOFException
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).
> {code}
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Updated] (SPARK-23340) Empty float/double array columns raise EOFException

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Component/s: SQL

> Empty float/double array columns raise EOFException
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).
> {code}
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Updated] (SPARK-23340) Update ORC to 1.4.3

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23340:
--
Description: 
This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more 
patches (https://s.apache.org/Fll8).

{code}
scala> df.write.format("orc").save("/tmp/floatarray")

scala> spark.read.orc("/tmp/floatarray")
res1: org.apache.spark.sql.DataFrame = [value: array]

scala> spark.read.orc("/tmp/floatarray").show()
18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.io.IOException: Error reading file: 
file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
at 
org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
at 
org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past EOF for compressed stream Stream for 
column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
{code}

  was:
This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 more 
patches (https://s.apache.org/Fll8).


>  Update ORC to 1.4.3
> 
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 
> more patches (https://s.apache.org/Fll8).
> {code}
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}






[jira] [Updated] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-12 Thread Pallapothu Jyothi Swaroop (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pallapothu Jyothi Swaroop updated SPARK-23402:
--
Description: 
I am using spark dataset write to insert data on postgresql existing table. For 
this I am using  write method mode as append mode. While using i am getting 
exception like table already exists. But, I gave option as append mode.

It's strange. When i change options to sqlserver/oracle append mode is working 
as expected.

 

*Database Properties:*

{{destinationProps.put("driver", "org.postgresql.Driver"); 
destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
destinationProps.put("user", "dbmig");}}

{{destinationProps.put("password", "dbmig");}}

 

*Dataset Write Code:*

{{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
 "dqvalue", destinationdbProperties);}} 

 

 

{{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: relation 
"dqvalue" already exists at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) at 
org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
 at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
 at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
 at 
com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
 at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
com.ads.dqam.Client.main(Client.java:71)}}

 

 

 

  was:
I am using spark dataset write to insert data on postgresql existing table. For 
this I am using  write method mode as append mode. While using i am getting 
exception like table already exists. But, I gave option as append mode.

It's strange. When i change options to sqlserver/oracle append mode is working 
as expected.

 

*Database Properties:*

{{destinationProps.put("driver", "org.postgresql.Driver"); 
destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
destinationProps.put("user", "dbmig"); destinationProps.put("password", 
"dbmig");}}

*Dataset Write Code:*

{{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
 "dqvalue", destinationdbProperties);}} 

 

 

{{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: relation 
"dqvalue" already exists at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) at 
org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 

[jira] [Updated] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-12 Thread Pallapothu Jyothi Swaroop (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pallapothu Jyothi Swaroop updated SPARK-23402:
--
Description: 
I am using spark dataset write to insert data on postgresql existing table. For 
this I am using  write method mode as append mode. While using i am getting 
exception like table already exists. But, I gave option as append mode.

It's strange. When i change options to sqlserver/oracle append mode is working 
as expected.

 

*Database Properties:*

{{destinationProps.put("driver", "org.postgresql.Driver"); 
destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
destinationProps.put("user", "dbmig"); destinationProps.put("password", 
"dbmig");}}

*Dataset Write Code:*

{{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
 "dqvalue", destinationdbProperties);}} 

 

 

{{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: relation 
"dqvalue" already exists at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) at 
org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
 at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
 at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
 at 
com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
 at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
com.ads.dqam.Client.main(Client.java:71)}}

 

 

 

  was:
I am using spark dataset write to insert data on postgresql existing table. For 
this I am using  write method mode as append mode. While using i am getting 
exception like table already exists. But, I gave option as append mode.

It's strange. When i change options to sqlserver/oracle append mode is working 
as expected.

 

{{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: relation 
"dqvalue" already exists at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) at 
org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
 at 

[jira] [Updated] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-12 Thread Pallapothu Jyothi Swaroop (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pallapothu Jyothi Swaroop updated SPARK-23402:
--
Attachment: Emsku[1].jpg

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using the Spark Dataset write API to insert data into an existing 
> PostgreSQL table, calling the write method with append mode. While doing so I 
> get an exception saying the table already exists, even though I specified 
> append mode.
> It's strange: when I change the options to SQL Server/Oracle, append mode 
> works as expected.
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  at 
> com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
>  at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
> com.ads.dqam.Client.main(Client.java:71)}}
>  
>  
>  
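For reference, a stripped-down spark-shell (Scala) sketch of the write path quoted above; the dataset is a stand-in and the table name and connection settings are the reporter's placeholder values, so this is only a hedged reproduction outline, not a diagnosis:

{code}
// Hedged sketch of the reported append write; spark is the spark-shell session.
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.put("driver", "org.postgresql.Driver")
props.put("user", "dbmig")
props.put("password", "dbmig")

val valueAnalysisDataset = spark.range(3).toDF("id")   // stand-in for the real dataset

valueAnalysisDataset.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:postgresql://127.0.0.1:30001/dbmig", "dqvalue", props)
{code}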






[jira] [Created] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-12 Thread Pallapothu Jyothi Swaroop (JIRA)
Pallapothu Jyothi Swaroop created SPARK-23402:
-

 Summary: Dataset write method not working as expected for 
postgresql database
 Key: SPARK-23402
 URL: https://issues.apache.org/jira/browse/SPARK-23402
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.2.1
 Environment: PostgreSQL: 9.5.8 (10 + Also same issue)

OS: Cent OS 7 & Windows 7,8

JDBC: 9.4-1201-jdbc41

 

Spark:  I executed in both 2.1.0 and 2.2.1

Mode: Standalone

OS: Windows 7
Reporter: Pallapothu Jyothi Swaroop


I am using the Spark Dataset write API to insert data into an existing PostgreSQL 
table, calling the write method with append mode. While doing so I get an 
exception saying the table already exists, even though I specified append mode.

It's strange: when I change the options to SQL Server/Oracle, append mode works 
as expected.

 

{{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: relation 
"dqvalue" already exists at 
org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
 at 
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
 at 
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) at 
org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
 at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
 at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
 at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
 at 
com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
 at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
com.ads.dqam.Client.main(Client.java:71)}}

 

 

 






[jira] [Commented] (SPARK-23397) Scheduling delay causes Spark Streaming to miss batches.

2018-02-12 Thread Vivek Patangiwar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361851#comment-16361851
 ] 

Vivek Patangiwar commented on SPARK-23397:
--

Thanks for your response Sean.

An example to elaborate on what Shahbaz said: suppose I have code like this (Scala):

sourceDstream.transform(rdd => {
  SQLContext.getOrCreate(rdd.sparkContext)
    .createDataFrame(rdd, someStruct)
    .select($"col1", $"col2").rdd
}).foreachRDD(rdd => rdd.foreach(println))

The action results in Spark plan generation; the only problem is that Spark 
Streaming does it for every mini-batch. The example above is really simple, so 
turning it into a Spark plan is not a big deal, but the operations inside 
transform() can get really complex, and in that case it may take a considerable 
amount of time (comparable to the batch interval) to generate the Spark plan, so 
the batches get delayed significantly.

I would instead like to generate the Spark plan once (the logic stays the same; 
only the input RDD changes) and reuse it for every subsequent mini-batch by 
substituting the new RDD into the plan.

I'm sure Structured Streaming solves this problem, but is there any way I could 
do that in Spark Streaming (the DStream API)?

> Scheduling delay causes Spark Streaming to miss batches.
> 
>
> Key: SPARK-23397
> URL: https://issues.apache.org/jira/browse/SPARK-23397
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.1
>Reporter: Shahbaz Hussain
>Priority: Major
>
> * For complex Spark (Scala) DStream-based applications that create, e.g., 40 
> jobs for every batch, it has been observed that batches are not created at 
> the expected time. For example, if I start a Spark Streaming application 
> with a batch interval of 20 seconds and the application creates 40-odd jobs, 
> the next batch is not created 20 seconds after the previous job creation 
> time.
>  * This is because job creation is single-threaded: if the job creation delay 
> is greater than the batch interval, batch execution misses its schedule.






[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361838#comment-16361838
 ] 

Apache Spark commented on SPARK-20090:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20595

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.
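For comparison, the existing Scala API being mirrored, as a small spark-shell sketch (the schema fields are arbitrary examples):

{code}
import org.apache.spark.sql.types._

// StructType.fieldNames returns the column names of the schema in order.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("count", LongType)))

schema.fieldNames   // Array(id, count)
{code}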






[jira] [Updated] (SPARK-20090) Add StructType.fieldNames to Python API

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20090:

Target Version/s: 2.3.0

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23303) improve the explain result for data source v2 relations

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23303.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> improve the explain result for data source v2 relations
> ---
>
> Key: SPARK-23303
> URL: https://issues.apache.org/jira/browse/SPARK-23303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361830#comment-16361830
 ] 

Apache Spark commented on SPARK-23377:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20594

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23316) AnalysisException after max iteration reached for IN query

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23316:

Target Version/s: 2.3.0

> AnalysisException after max iteration reached for IN query
> --
>
> Key: SPARK-23316
> URL: https://issues.apache.org/jira/browse/SPARK-23316
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Bogdan Raducanu
>Priority: Major
>
> Query to reproduce:
> {code:scala}
> spark.range(10).where("(id,id) in (select id, null from range(3))").show
> {code}
> {code}
> 18/02/02 11:32:31 WARN BaseSessionStateBuilder$$anon$1: Max iterations (100) 
> reached for batch Resolution
> org.apache.spark.sql.AnalysisException: cannot resolve '(named_struct('id', 
> `id`, 'id', `id`) IN (listquery()))' due to data type mismatch:
> The data type of one or more elements in the left hand side of an IN subquery
> is not compatible with the data type of the output of the subquery
> Mismatched columns:
> []
> Left side:
> [bigint, bigint].
> Right side:
> [bigint, bigint].;;
> {code}
> The error message includes the last plan, which contains ~100 useless Projects.
> This does not happen in branch-2.2.
> It has something to do with TypeCoercion: it is making a futile attempt to 
> change nullability.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361824#comment-16361824
 ] 

Joseph K. Bradley commented on SPARK-23377:
---

Thanks for reconsidering here [~viirya]!  I can help get the PR merged when 
ready.

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23379) remove redundant metastore access if the current database name is the same

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23379.
-
   Resolution: Fixed
 Assignee: Feng Liu
Fix Version/s: 2.4.0

> remove redundant metastore access if the current database name is the same
> --
>
> Key: SPARK-23379
> URL: https://issues.apache.org/jira/browse/SPARK-23379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Feng Liu
>Assignee: Feng Liu
>Priority: Major
> Fix For: 2.4.0
>
>
> We should be able to avoid one metastore access if the target database name 
> is the same as the current database:
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295
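
A minimal sketch of the idea, using a hypothetical client interface (the names 
below do not match the real HiveClientImpl internals): skip the metastore round 
trip when the requested database is already the current one.

{code:scala}
// Hypothetical minimal client interface, for illustration only.
trait MetastoreClient {
  def getCurrentDatabase(): String
  def setCurrentDatabase(db: String): Unit
}

// Illustrative wrapper: avoid one metastore access when the target database
// is already current.
class CachedCurrentDatabase(client: MetastoreClient) {
  private var current: String = client.getCurrentDatabase()

  def setCurrentDatabase(db: String): Unit = {
    if (db != current) {
      client.setCurrentDatabase(db)
      current = db
    }
  }
}
{code}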



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23400) Add the extra constructors for ScalaUDF

2018-02-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23400:


Assignee: Apache Spark  (was: Xiao Li)

> Add the extra constructors for ScalaUDF
> ---
>
> Key: SPARK-23400
> URL: https://issues.apache.org/jira/browse/SPARK-23400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Over the last few releases, we changed the interface of ScalaUDF. 
> Unfortunately, some Spark packages (spark-deep-learning) are using our 
> internal class `ScalaUDF`. In releases 2.2 and 2.3, we added new parameters 
> to this class. Users hit binary compatibility issues and got the exception:
> > java.lang.NoSuchMethodError: 
> > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V
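
The usual technique is to add auxiliary constructors that keep the old byte-code 
signatures and delegate to the primary constructor. A hedged sketch on a 
simplified stand-in class (the parameter list of the real ScalaUDF differs):

{code:scala}
// Simplified stand-in; only the binary-compatibility technique is the point.
case class MyUDF(
    function: AnyRef,
    dataType: String,
    children: Seq[String],
    inputTypes: Seq[String],
    udfName: Option[String],
    nullable: Boolean) {

  // Auxiliary constructor preserving an older five-argument signature (as in
  // the NoSuchMethodError above), so code compiled against the old class still
  // links after new fields were appended to the primary constructor.
  def this(
      function: AnyRef,
      dataType: String,
      children: Seq[String],
      inputTypes: Seq[String],
      udfName: Option[String]) =
    this(function, dataType, children, inputTypes, udfName, true)
}
{code}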



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23400) Add the extra constructors for ScalaUDF

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23400:

Affects Version/s: (was: 2.2.1)
   (was: 2.1.2)

> Add the extra constructors for ScalaUDF
> ---
>
> Key: SPARK-23400
> URL: https://issues.apache.org/jira/browse/SPARK-23400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Over the last few releases, we changed the interface of ScalaUDF. 
> Unfortunately, some Spark packages (spark-deep-learning) are using our 
> internal class `ScalaUDF`. In release 2.3, we added new parameters to this 
> class. Users hit binary compatibility issues and got the exception:
> > java.lang.NoSuchMethodError: 
> > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23400) Add the extra constructors for ScalaUDF

2018-02-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23400:


Assignee: Xiao Li  (was: Apache Spark)

> Add the extra constructors for ScalaUDF
> ---
>
> Key: SPARK-23400
> URL: https://issues.apache.org/jira/browse/SPARK-23400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Over the last few releases, we changed the interface of ScalaUDF. 
> Unfortunately, some Spark packages (spark-deep-learning) are using our 
> internal class `ScalaUDF`. In releases 2.2 and 2.3, we added new parameters 
> to this class. Users hit binary compatibility issues and got the exception:
> > java.lang.NoSuchMethodError: 
> > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23400) Add the extra constructors for ScalaUDF

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23400:

Summary: Add the extra constructors for ScalaUDF  (was: Add two extra 
constructors for ScalaUDF)

> Add the extra constructors for ScalaUDF
> ---
>
> Key: SPARK-23400
> URL: https://issues.apache.org/jira/browse/SPARK-23400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Over the last few releases, we changed the interface of ScalaUDF. 
> Unfortunately, some Spark packages (spark-deep-learning) are using our 
> internal class `ScalaUDF`. In releases 2.2 and 2.3, we added new parameters 
> to this class. Users hit binary compatibility issues and got the exception:
> > java.lang.NoSuchMethodError: 
> > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23400) Add the extra constructors for ScalaUDF

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23400:

Description: 
Over the last few releases, we changed the interface of ScalaUDF. 
Unfortunately, some Spark packages (spark-deep-learning) are using our 
internal class `ScalaUDF`. In release 2.3, we added new parameters to this 
class. Users hit binary compatibility issues and got the exception:

> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V


  was:
Over the last few releases, we changed the interface of ScalaUDF. 
Unfortunately, some Spark packages (spark-deep-learning) are using our 
internal class `ScalaUDF`. In releases 2.2 and 2.3, we added new parameters to 
this class. Users hit binary compatibility issues and got the exception:

> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V



> Add the extra constructors for ScalaUDF
> ---
>
> Key: SPARK-23400
> URL: https://issues.apache.org/jira/browse/SPARK-23400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Over the last few releases, we changed the interface of ScalaUDF. 
> Unfortunately, some Spark packages (spark-deep-learning) are using our 
> internal class `ScalaUDF`. In release 2.3, we added new parameters to this 
> class. Users hit binary compatibility issues and got the exception:
> > java.lang.NoSuchMethodError: 
> > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23400) Add the extra constructors for ScalaUDF

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361819#comment-16361819
 ] 

Apache Spark commented on SPARK-23400:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20591

> Add the extra constructors for ScalaUDF
> ---
>
> Key: SPARK-23400
> URL: https://issues.apache.org/jira/browse/SPARK-23400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> Over the last few releases, we changed the interface of ScalaUDF. 
> Unfortunately, some Spark packages (spark-deep-learning) are using our 
> internal class `ScalaUDF`. In releases 2.2 and 2.3, we added new parameters 
> to this class. Users hit binary compatibility issues and got the exception:
> > java.lang.NoSuchMethodError: 
> > org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23323) DataSourceV2 should use the output commit coordinator.

2018-02-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23323.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20490
[https://github.com/apache/spark/pull/20490]

> DataSourceV2 should use the output commit coordinator.
> --
>
> Key: SPARK-23323
> URL: https://issues.apache.org/jira/browse/SPARK-23323
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 writes should use the output commit coordinator to guarantee 
> that only one task successfully commits.
> Not all sources will need coordination, so I propose adding 
> {{coordinateCommits: boolean}} to the {{DataWriterFactory}}.
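
For illustration, a hedged sketch of the commit-coordination pattern a write 
task would follow, including the proposed coordinateCommits flag; the 
coordinator trait below is hypothetical and is not the actual 
OutputCommitCoordinator API.

{code:scala}
// Hypothetical coordinator interface, for illustration only.
trait CommitCoordinator {
  def canCommit(stageId: Int, partitionId: Int, attemptNumber: Int): Boolean
}

// Sketch of a write task asking for permission before committing, so that at
// most one attempt per partition commits its output.
def runWriteTask(
    coordinator: CommitCoordinator,
    coordinateCommits: Boolean,
    stageId: Int,
    partitionId: Int,
    attemptNumber: Int)(
    write: () => Unit, commit: () => Unit, abort: () => Unit): Unit = {
  write()
  if (!coordinateCommits || coordinator.canCommit(stageId, partitionId, attemptNumber)) {
    commit()
  } else {
    abort()  // another attempt was authorized to commit this partition
  }
}
{code}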



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23323) DataSourceV2 should use the output commit coordinator.

2018-02-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23323:
---

Assignee: Ryan Blue

> DataSourceV2 should use the output commit coordinator.
> --
>
> Key: SPARK-23323
> URL: https://issues.apache.org/jira/browse/SPARK-23323
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 writes should use the output commit coordinator to guarantee 
> that only one task successfully commits.
> Not all sources will need coordination, so I propose adding 
> {{coordinateCommits: boolean}} to the {{DataWriterFactory}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-12 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361724#comment-16361724
 ] 

Liang-Chi Hsieh commented on SPARK-23377:
-

For now, I think neither the 3rd option nor my current patch can easily go into 
2.3, because neither is a small change. So I also support doing the 2nd option 
first as a quick fix in 2.3. If there is no objection, I will prepare the fix as 
a PR soon.

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-12 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361718#comment-16361718
 ] 

Liang-Chi Hsieh commented on SPARK-23377:
-

I have no objection to [~josephkb]'s proposal (the 2nd option first and the 3rd later).

 

The design question is whether we should keep the default values of the original 
Spark version when saving the model, or use the default values of the current 
Spark version when loading the model. Keeping the default values of the original 
Spark version makes the behavior of saved models reproducible. However, the 
behavior of loaded models and of models created with the current Spark version 
can then differ. E.g., the model "foo" from 2.1 with default value "a" behaves 
reproducibly when loaded back into 2.3, but it behaves differently from the same 
"foo" model created in 2.3 if the default value has been changed to "b".

 

In other words, one option keeps the model behavior consistent before and after 
persistence, even across Spark versions. The other lets the same kind of model 
behave consistently even when the models come from different Spark versions.

 

Currently my patch follows the latter. I think users should be made aware of 
changes to default values in an upgraded Spark if they want to use old models. 
Btw, there is also a rare but possible situation: if we remove a default value 
that existed in an old version, the old models may not be easily loaded into the 
new Spark.

 

 

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23230) When hive.default.fileformat is other kinds of file types, create textfile table cause a serde error

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361696#comment-16361696
 ] 

Apache Spark commented on SPARK-23230:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/20593

> When hive.default.fileformat is other kinds of file types, create textfile 
> table cause a serde error
> 
>
> Key: SPARK-23230
> URL: https://issues.apache.org/jira/browse/SPARK-23230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
> Fix For: 2.3.0
>
>
> When hive.default.fileformat is set to another file type, creating a textfile 
> table causes a SerDe error.
>  We should use org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe as the 
> default SerDe for both textfile and sequencefile.
> {code:java}
> set hive.default.fileformat=orc;
> create table tbl( i string ) stored as textfile;
> desc formatted tbl;
> Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat  org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat{code}
>  
> {code:java}
> set hive.default.fileformat=orc;
> create table tbl stored as textfile
> as
> select  1
> {code}
> {{It failed because it used the wrong SERDE}}
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow cannot be cast to 
> org.apache.hadoop.io.BytesWritable
>   at 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat$1.write(HiveIgnoreKeyTextOutputFormat.java:91)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
>   ... 16 more
> {code}
>  
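
A hedged sketch of the proposed default (the case class and value below are 
illustrative, not the actual Spark internals): "stored as textfile" should 
always resolve to LazySimpleSerDe, regardless of hive.default.fileformat.

{code:scala}
// Illustrative only; the real mapping lives in Spark's Hive support code.
case class SerDeInfo(inputFormat: String, outputFormat: String, serde: String)

val textFileSerDe = SerDeInfo(
  inputFormat = "org.apache.hadoop.mapred.TextInputFormat",
  outputFormat = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
  serde = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")

// "stored as textfile" (and likewise sequencefile) would use this entry even
// when hive.default.fileformat is set to orc.
{code}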



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361154#comment-16361154
 ] 

Joseph K. Bradley edited comment on SPARK-23377 at 2/13/18 1:10 AM:


[~viirya]'s patch currently changes DefaultParamsWriter to save only the 
explicitly set Param values.  This means that loading a model into a new 
version of Spark could use different Param values if the default values have 
changed.

In the original design of persistence (see [SPARK-6725]), the goal was to make 
behavior exactly reproducible.  This means that default Param values do need to 
be saved.  I recommend that we maintain this guarantee.

I can see a couple of possibilities:
1. Simplest: Change the loading logic of Bucketizer so that it handles this 
edge case (by removing the value for inputCol when inputCols is set).  This may 
be best for Spark 2.3 since it's the fastest fix.
2. Reasonable: Change the saving logic of Bucketizer to handle this case.  This 
will be best in terms of fixing the edge case and being pretty quick to do.
3. Largest: Change DefaultParamsWriter to separate explicitly set values and 
default values.  Then update Bucketizer's loading logic to make use of this 
distinction.  I'm not a fan of this approach since it would involve shoving a 
huge change into branch-2.3 during late QA.

I'd vote strongly for the 2nd option now, and perhaps the 3rd option later on.  
Opinions?
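
For illustration, a hedged sketch of the check that both options rely on, 
written here as a post-load fixup corresponding to option 1 and assuming the 
public Params.isSet/clear methods; the real fix would live in Bucketizer's 
save/load code, not in user code.

{code:scala}
import org.apache.spark.ml.feature.Bucketizer

// Illustrative helper: if the multi-column params are in use, the
// single-column ones must not remain set on the loaded model.
def dropSingleColumnParams(b: Bucketizer): Bucketizer = {
  if (b.isSet(b.inputCols)) {
    b.clear(b.inputCol)
    b.clear(b.outputCol)
  }
  b
}

// e.g. val fixed = dropSingleColumnParams(Bucketizer.read.load(path))
{code}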


was (Author: josephkb):
[~viirya]'s patch currently changes DefaultParamsWriter to save only the 
explicitly set Param values.  This means that loading a model into a new 
version of Spark could use different Param values if the default values have 
changed.

In the original design of persistence (see [SPARK-6725]), the goal was to make 
behavior exactly reproducible.  This means that default Param values do need to 
be saved.  I recommend that we maintain this guarantee.

I can see a couple of possibilities:
1. Simplest: Change the loading logic of Bucketizer so that it handles this 
edge case (by removing the value for inputCol when inputCols is set).  This may 
be best for Spark 2.3 since it's the fastest fix.
2. Reasonable: Change the saving logic of Bucketizer to handle this case.  This 
will be best in terms of fixing the edge case and being pretty quick to do.
3. Largest: Change DefaultParamsWriter to separate explicitly set values and 
default values.  Then update Bucketizer's loading logic to make use of this 
distinction.  I'm not a fan of this approach since it would involve shoving a 
huge change into branch-2.3 during late QA.

I'd vote strongly for the 2nd option.  Opinions?

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-23352) Explicitly specify supported types in Pandas UDFs

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23352:

Fix Version/s: 2.3.1

> Explicitly specify supported types in Pandas UDFs
> -
>
> Key: SPARK-23352
> URL: https://issues.apache.org/jira/browse/SPARK-23352
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Currently, we don't support {{BinaryType}} in Pandas UDFs:
> {code}
> >>> from pyspark.sql.functions import pandas_udf
> >>> pudf = pandas_udf(lambda x: x, "binary")
> >>> df = spark.createDataFrame([[bytearray("a")]])
> >>> df.select(pudf("_1")).show()
> ...
> TypeError: Unsupported type in conversion to Arrow: BinaryType
> {code}
> Also, the grouped aggregate Pandas UDF fails fast on {{ArrayType}}, but it 
> seems we can support this case.
> We should clarify this in the Pandas UDF docs, and fail fast with type 
> checking ahead of time rather than at execution time.
> Please consider this case:
> {code}
> pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage 
> because we know the schema ahead
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361654#comment-16361654
 ] 

Apache Spark commented on SPARK-23154:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/20592

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2018-02-12 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361631#comment-16361631
 ] 

Miao Wang commented on SPARK-20307:
---

[~felixcheung] I will do it during the Lunar New Year vacation. I have been away 
from the Spark community for a while. It is time to return :)

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> When training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> 

[jira] [Resolved] (SPARK-23230) When hive.default.fileformat is other kinds of file types, create textfile table cause a serde error

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23230.
-
   Resolution: Fixed
 Assignee: dzcxzl
Fix Version/s: 2.3.0

> When hive.default.fileformat is other kinds of file types, create textfile 
> table cause a serde error
> 
>
> Key: SPARK-23230
> URL: https://issues.apache.org/jira/browse/SPARK-23230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
> Fix For: 2.3.0
>
>
> When hive.default.fileformat is set to another file type, creating a textfile 
> table causes a SerDe error.
>  We should use org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe as the 
> default SerDe for both textfile and sequencefile.
> {code:java}
> set hive.default.fileformat=orc;
> create table tbl( i string ) stored as textfile;
> desc formatted tbl;
> Serde Library org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat  org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat{code}
>  
> {code:java}
> set hive.default.fileformat=orc;
> create table tbl stored as textfile
> as
> select  1
> {code}
> {{It failed because it used the wrong SERDE}}
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow cannot be cast to 
> org.apache.hadoop.io.BytesWritable
>   at 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat$1.write(HiveIgnoreKeyTextOutputFormat.java:91)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
>   ... 16 more
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22820) Spark 2.3 SQL API audit

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22820:

Fix Version/s: 2.3.0

> Spark 2.3 SQL API audit
> ---
>
> Key: SPARK-22820
> URL: https://issues.apache.org/jira/browse/SPARK-22820
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Check all the API changes in Spark 2.3 release



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22820) Spark 2.3 SQL API audit

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22820.
-
Resolution: Fixed

> Spark 2.3 SQL API audit
> ---
>
> Key: SPARK-22820
> URL: https://issues.apache.org/jira/browse/SPARK-22820
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> Check all the API changes in Spark 2.3 release



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23313) Add a migration guide for ORC

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23313.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add a migration guide for ORC
> -
>
> Key: SPARK-23313
> URL: https://issues.apache.org/jira/browse/SPARK-23313
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23313) Add a migration guide for ORC

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23313:
---

Assignee: Dongjoon Hyun

> Add a migration guide for ORC
> -
>
> Key: SPARK-23313
> URL: https://issues.apache.org/jira/browse/SPARK-23313
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-02-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361575#comment-16361575
 ] 

Joseph K. Bradley commented on SPARK-23154:
---

I'd prefer to put it in the subsection on saving & loading.  I'll send a PR now.

[~yanboliang] I actually spent a long time trying to come up with ways to test 
this, and it's non-trivial.  The main blocker is that I got pushback from 
others about putting binary files (Parquet model data files) in the git repo.  
Without that, there isn't a way to store example models from past versions.  I 
may just build a separate project to test this outside of apache/spark itself 
when I get the chance.  You can find more notes in the JIRA linked in the 
description above.

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23378.
-
   Resolution: Fixed
 Assignee: Feng Liu
Fix Version/s: 2.4.0

> move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
> --
>
> Key: SPARK-23378
> URL: https://issues.apache.org/jira/browse/SPARK-23378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Feng Liu
>Assignee: Feng Liu
>Priority: Major
> Fix For: 2.4.0
>
>
> Conceptually, no methods of HiveExternalCatalog, besides the 
> `setCurrentDatabase`, should change the `currentDatabase` in the hive session 
> state. We can enforce this rule by removing the usage of `setCurrentDatabase` 
> in the HiveExternalCatalog.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23401) Improve test cases for all supported types and unsupported types

2018-02-12 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-23401:


 Summary: Improve test cases for all supported types and 
unsupported types
 Key: SPARK-23401
 URL: https://issues.apache.org/jira/browse/SPARK-23401
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


It looks like there are some missing types to test among the supported types. 

For example, please see 
https://github.com/apache/spark/blob/c338c8cf8253c037ecd4f39bbd58ed5a86581b37/python/pyspark/sql/tests.py#L4397-L4401

We can improve this test coverage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361537#comment-16361537
 ] 

Apache Spark commented on SPARK-23390:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20591

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23400) Add two extra constructors for ScalaUDF

2018-02-12 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23400:
---

 Summary: Add two extra constructors for ScalaUDF
 Key: SPARK-23400
 URL: https://issues.apache.org/jira/browse/SPARK-23400
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.1.2, 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


Over the last few releases, we changed the interface of ScalaUDF. Unfortunately, 
some Spark Packages (e.g. spark-deep-learning) are using our internal class 
`ScalaUDF`. In releases 2.2 and 2.3, we added new parameters to this class. 
Users hit binary compatibility issues and got the exception:

> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.init(Ljava/lang/Object;Lorg/apache/spark/sql/types/DataType;Lscala/collection/Seq;Lscala/collection/Seq;Lscala/Option;)V
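
A minimal sketch of the general technique for avoiding this kind of breakage, assuming the fix keeps the older constructor shapes as overloads that forward to the newest one. The class and parameter types below are illustrative placeholders, not the real ScalaUDF signature:

{code:scala}
// Illustrative only: the parameter list does not match the actual ScalaUDF.
class SimpleUDF(
    val function: AnyRef,
    val children: Seq[String],
    val inputTypes: Seq[String],
    val udfName: Option[String],
    val nullable: Boolean) {

  // Constructor shape from an earlier release; bytecode compiled against this
  // signature keeps linking because the symbol is still present.
  def this(function: AnyRef, children: Seq[String], inputTypes: Seq[String],
      udfName: Option[String]) =
    this(function, children, inputTypes, udfName, nullable = true)

  // Even older shape, without udfName.
  def this(function: AnyRef, children: Seq[String], inputTypes: Seq[String]) =
    this(function, children, inputTypes, None, nullable = true)
}
{code}

Callers built against either older shape then resolve to the forwarding constructors at link time instead of hitting NoSuchMethodError.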




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23399:


Assignee: (was: Apache Spark)

> Register a task completion listener first for OrcColumnarBatchReader
> ---
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This is related to SPARK-23390.
> Currently, there is an open file leak in OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
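
As a rough sketch of the ordering fix the title describes (the openFile/closeFile helpers are placeholders, not the actual OrcColumnarBatchReader code), the completion listener can be registered before any file handle is opened, so cleanup still runs when initialization throws or the task is killed mid-setup:

{code:scala}
import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

// openFile/closeFile stand in for the reader's real open/close logic.
def initialize(openFile: () => Unit, closeFile: () => Unit): Unit = {
  Option(TaskContext.get()).foreach { ctx =>
    // Registered first, so it fires even if openFile() below throws.
    ctx.addTaskCompletionListener(new TaskCompletionListener {
      override def onTaskCompletion(context: TaskContext): Unit = closeFile()
    })
  }
  openFile()
}
{code}

The actual patch may structure this differently; the point is only the ordering of listener registration versus opening the stream.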



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23399:


Assignee: Apache Spark

> Register a task completion listener first for OrcColumnarBatchReader
> ---
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This is related to SPARK-23390.
> Currently, there is an open file leak in OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361508#comment-16361508
 ] 

Apache Spark commented on SPARK-23399:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20590

> Register a task completion listener first for OrcColumnarBatchReader
> ---
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This is related to SPARK-23390.
> Currently, there is an open file leak in OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23399:
--
Description: 
This is related to SPARK-23390.

Currently, there is an open file leak in OrcColumnarBatchReader.

{code}
[info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
connection created at:
java.lang.Throwable
at 
org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
at 
org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
at 
org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
at 
org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
{code}

  was:This is related with SPARK-23390.


> Register a task completion listener first for OrcColumnarBatchReader
> ---
>
> Key: SPARK-23399
> URL: https://issues.apache.org/jira/browse/SPARK-23399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This is related to SPARK-23390.
> Currently, there is an open file leak in OrcColumnarBatchReader.
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23399) Register a task completion listener first for OrcColumnarBatchReader

2018-02-12 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23399:
-

 Summary: Register a task completion listener first for 
OrcColumnarBatchReader
 Key: SPARK-23399
 URL: https://issues.apache.org/jira/browse/SPARK-23399
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


This is related to SPARK-23390.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23394) Storage info's Cached Partitions doesn't consider the replications (but sc.getRDDStorageInfo does)

2018-02-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23394:


Assignee: Apache Spark

> Storage info's Cached Partitions doesn't consider the replications (but 
> sc.getRDDStorageInfo does)
> --
>
> Key: SPARK-23394
> URL: https://issues.apache.org/jira/browse/SPARK-23394
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
> Attachments: Spark_2.2.1.png, Spark_2.4.0-SNAPSHOT.png, 
> Storage_Tab.png
>
>
> Start spark as:
> {code:bash}
> $ bin/spark-shell --master local-cluster[2,1,1024]
> {code}
> {code:scala}
> scala> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.storage.StorageLevel._
> scala> sc.parallelize((1 to 100), 10).persist(MEMORY_AND_DISK_2).count
> res0: Long = 100  
>   
> scala> sc.getRDDStorageInfo(0).numCachedPartitions
> res1: Int = 20
> {code}
> h2. Cached Partitions 
> On the UI at the Storage tab Cached Partitions is 10:
>  !Storage_Tab.png! .
> h2. Full tab
> Moreover, the replicated partitions were also listed on the old 2.2.1 UI, like:
>  !Spark_2.2.1.png! 
> But now it is like:
>  !Spark_2.4.0-SNAPSHOT.png! 
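
The arithmetic behind the mismatch, as a small illustration using the numbers from the repro above:

{code:scala}
val partitions  = 10
val replication = 2                                  // MEMORY_AND_DISK_2 stores each partition twice
val cachedBlockReplicas = partitions * replication   // 20, what sc.getRDDStorageInfo reports
val cachedPartitions    = partitions                 // 10, what the Storage tab currently shows
{code}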



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23394) Storage info's Cached Partitions doesn't consider the replications (but sc.getRDDStorageInfo does)

2018-02-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23394:


Assignee: (was: Apache Spark)

> Storage info's Cached Partitions doesn't consider the replications (but 
> sc.getRDDStorageInfo does)
> --
>
> Key: SPARK-23394
> URL: https://issues.apache.org/jira/browse/SPARK-23394
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Priority: Major
> Attachments: Spark_2.2.1.png, Spark_2.4.0-SNAPSHOT.png, 
> Storage_Tab.png
>
>
> Start spark as:
> {code:bash}
> $ bin/spark-shell --master local-cluster[2,1,1024]
> {code}
> {code:scala}
> scala> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.storage.StorageLevel._
> scala> sc.parallelize((1 to 100), 10).persist(MEMORY_AND_DISK_2).count
> res0: Long = 100  
>   
> scala> sc.getRDDStorageInfo(0).numCachedPartitions
> res1: Int = 20
> {code}
> h2. Cached Partitions 
> On the UI at the Storage tab Cached Partitions is 10:
>  !Storage_Tab.png! .
> h2. Full tab
> Moreover, the replicated partitions were also listed on the old 2.2.1 UI, like:
>  !Spark_2.2.1.png! 
> But now it is like:
>  !Spark_2.4.0-SNAPSHOT.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23394) Storage info's Cached Partitions doesn't consider the replications (but sc.getRDDStorageInfo does)

2018-02-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361347#comment-16361347
 ] 

Apache Spark commented on SPARK-23394:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/20589

> Storage info's Cached Partitions doesn't consider the replications (but 
> sc.getRDDStorageInfo does)
> --
>
> Key: SPARK-23394
> URL: https://issues.apache.org/jira/browse/SPARK-23394
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Priority: Major
> Attachments: Spark_2.2.1.png, Spark_2.4.0-SNAPSHOT.png, 
> Storage_Tab.png
>
>
> Start spark as:
> {code:bash}
> $ bin/spark-shell --master local-cluster[2,1,1024]
> {code}
> {code:scala}
> scala> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.storage.StorageLevel._
> scala> sc.parallelize((1 to 100), 10).persist(MEMORY_AND_DISK_2).count
> res0: Long = 100  
>   
> scala> sc.getRDDStorageInfo(0).numCachedPartitions
> res1: Int = 20
> {code}
> h2. Cached Partitions 
> On the UI at the Storage tab Cached Partitions is 10:
>  !Storage_Tab.png! .
> h2. Full tab
> Moreover, the replicated partitions were also listed on the old 2.2.1 UI, like:
>  !Spark_2.2.1.png! 
> But now it is like:
>  !Spark_2.4.0-SNAPSHOT.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23388) Support for Parquet Binary DecimalType in VectorizedColumnReader

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23388:
---

Assignee: James Thompson

> Support for Parquet Binary DecimalType in VectorizedColumnReader
> 
>
> Key: SPARK-23388
> URL: https://issues.apache.org/jira/browse/SPARK-23388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: James Thompson
>Assignee: James Thompson
>Priority: Major
> Fix For: 2.3.0
>
>
> The following commit to spark removed support for decimal binary types: 
> [https://github.com/apache/spark/commit/9c29c557635caf739fde942f53255273aac0d7b1#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428]
> As per the parquet spec, decimal can be used to annotate binary types, so 
> support should be re-added: 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23388) Support for Parquet Binary DecimalType in VectorizedColumnReader

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23388.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Support for Parquet Binary DecimalType in VectorizedColumnReader
> 
>
> Key: SPARK-23388
> URL: https://issues.apache.org/jira/browse/SPARK-23388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: James Thompson
>Assignee: James Thompson
>Priority: Major
> Fix For: 2.3.0
>
>
> The following commit to spark removed support for decimal binary types: 
> [https://github.com/apache/spark/commit/9c29c557635caf739fde942f53255273aac0d7b1#diff-7bdf5fd0ce0b1ccbf4ecf083611976e6R428]
> As per the parquet spec, decimal can be used to annotate binary types, so 
> support should be re-added: 
> [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23310) Perf regression introduced by SPARK-21113

2018-02-12 Thread Nicolas Poggi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361071#comment-16361071
 ] 

Nicolas Poggi edited comment on SPARK-23310 at 2/12/18 6:35 PM:


Q72 of TPC-DS is also affected around 30% at scale factor 1000. 
[~juliuszsompolski] SPARK-23366 also fixes it.


was (Author: npoggi):
Q72 of TPC-DS is also affected around 30% at scale factor 1000. 
[~juliuszsompolski] SPARK-23355 also fixes it.

> Perf regression introduced by SPARK-21113
> -
>
> Key: SPARK-23310
> URL: https://issues.apache.org/jira/browse/SPARK-23310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Assignee: Sital Kedia
>Priority: Blocker
> Fix For: 2.3.0
>
>
> While running all TPC-DS queries with SF set to 1000, we noticed that Q95 
> (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql)
>  has noticeable regression (11%). After looking into it, we found that the 
> regression was introduced by SPARK-21113. Specifically, ReadAheadInputStream 
> gets lock congestion. After setting 
> spark.unsafe.sorter.spill.read.ahead.enabled to false, the regression 
> disappears and the overall performance of all TPC-DS queries improves.
>  
> I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to 
> false by default for Spark 2.3 and re-enable it after addressing the lock 
> congestion issue. 
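
For anyone wanting to verify the workaround locally, a minimal sketch of disabling the flag mentioned above (the config key comes from the description; the app name is arbitrary):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tpcds-benchmark")
  // Disable the read-ahead wrapper for spill reads to avoid the lock congestion.
  .config("spark.unsafe.sorter.spill.read.ahead.enabled", "false")
  .getOrCreate()
{code}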



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23398) DataSourceV2 should provide a way to get the source schema

2018-02-12 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-23398:
-

 Summary: DataSourceV2 should provide a way to get the source schema
 Key: SPARK-23398
 URL: https://issues.apache.org/jira/browse/SPARK-23398
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue


To validate writes with DataSourceV2, the planner needs to get a source's 
schema. The current API has no direct way to get that schema. SPARK-23321 
instantiates a reader to get the schema, but sources are not required to 
implement {{ReadSupport}} or {{ReadSupportWithSchema}}. V2 should either add a 
method to get the schema of a source, or require sources implement 
{{ReadSupport}}.
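
Purely as a sketch of the first alternative (a source-level method), and not an existing Spark interface, the missing capability could look roughly like this, letting the planner ask for the schema without instantiating a reader:

{code:scala}
import org.apache.spark.sql.types.StructType

// Hypothetical mix-in for DataSourceV2 implementations; not part of the current API.
trait SchemaProvider {
  def sourceSchema(options: Map[String, String]): StructType
}
{code}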



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23398) DataSourceV2 should provide a way to get a source's schema.

2018-02-12 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-23398:
--
Summary: DataSourceV2 should provide a way to get a source's schema.  (was: 
DataSourceV2 should provide a way to get the source schema)

> DataSourceV2 should provide a way to get a source's schema.
> ---
>
> Key: SPARK-23398
> URL: https://issues.apache.org/jira/browse/SPARK-23398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> To validate writes with DataSourceV2, the planner needs to get a source's 
> schema. The current API has no direct way to get that schema. SPARK-23321 
> instantiates a reader to get the schema, but sources are not required to 
> implement {{ReadSupport}} or {{ReadSupportWithSchema}}. V2 should either add 
> a method to get the schema of a source, or require sources implement 
> {{ReadSupport}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361154#comment-16361154
 ] 

Joseph K. Bradley commented on SPARK-23377:
---

[~viirya]'s patch currently changes DefaultParamsWriter to save only the 
explicitly set Param values.  This means that loading a model into a new 
version of Spark could use different Param values if the default values have 
changed.

In the original design of persistence (see [SPARK-6725]), the goal was to make 
behavior exactly reproducible.  This means that default Param values do need to 
be saved.  I recommend that we maintain this guarantee.

I can see a couple of possibilities:
1. Simplest: Change the loading logic of Bucketizer so that it handles this 
edge case (by removing the value for inputCol when inputCols is set).  This may 
be best for Spark 2.3 since it's the fastest fix.
2. Reasonable: Change the saving logic of Bucketizer to handle this case.  This 
will be best in terms of fixing the edge case and being pretty quick to do.
3. Largest: Change DefaultParamsWriter to separate explicitly set values and 
default values.  Then update Bucketizer's loading logic to make use of this 
distinction.  I'm not a fan of this approach since it would involve shoving a 
huge change into branch-2.3 during late QA.

I'd vote strongly for the 2nd option.  Opinions?
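
To make the 2nd option concrete, a rough sketch of the save-side filtering (the helper and its filtering rule are illustrative, not the actual DefaultParamsWriter/Bucketizer code):

{code:scala}
import org.apache.spark.ml.param.{ParamMap, Params}

// Drop the single-column params from what gets persisted whenever the
// multi-column variant is explicitly set on the stage.
def paramsToSave(stage: Params): ParamMap = {
  val all = stage.extractParamMap()
  val multiColumn = stage.hasParam("inputCols") &&
    stage.isSet(stage.getParam("inputCols"))
  if (multiColumn) {
    ParamMap(all.toSeq.filterNot { pair =>
      pair.param.name == "inputCol" || pair.param.name == "outputCol"
    }: _*)
  } else {
    all
  }
}
{code}

On load, the single-column params would then simply be absent, so the exclusive-params check shown in the stack trace above should pass.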

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23377:
--
Priority: Critical  (was: Major)

> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23310) Perf regression introduced by SPARK-21113

2018-02-12 Thread Nicolas Poggi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361071#comment-16361071
 ] 

Nicolas Poggi commented on SPARK-23310:
---

Q72 of TPC-DS is also affected around 30% at scale factor 1000. 
[~juliuszsompolski] SPARK-23355 also fixes it.

> Perf regression introduced by SPARK-21113
> -
>
> Key: SPARK-23310
> URL: https://issues.apache.org/jira/browse/SPARK-23310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Assignee: Sital Kedia
>Priority: Blocker
> Fix For: 2.3.0
>
>
> While running all TPC-DS queries with SF set to 1000, we noticed that Q95 
> (https://github.com/databricks/spark-sql-perf/blob/master/src/main/resources/tpcds_2_4/q95.sql)
>  has noticeable regression (11%). After looking into it, we found that the 
> regression was introduced by SPARK-21113. Specifically, ReadAheadInputStream 
> gets lock congestion. After setting 
> spark.unsafe.sorter.spill.read.ahead.enabled to false, the regression 
> disappears and the overall performance of all TPC-DS queries improves.
>  
> I am proposing that we set spark.unsafe.sorter.spill.read.ahead.enabled to 
> false by default for Spark 2.3 and re-enable it after addressing the lock 
> congestion issue. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23390.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.3.0

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-02-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360944#comment-16360944
 ] 

Marcelo Vanzin commented on SPARK-20327:


bq. I think the point is, without reflection, using 3.x+ APIs with 2.x results 
in complete failures at runtime

More than that, without reflection, the code will not compile against 2.x. And 
even when we add support for 3.x, we won't drop support for 2.x at the same 
time. So reflection is the only solution here (well, not the only, but the 
easiest).

bq. then we handle this case as an error explicitly and inform the user about 
this

The way this was done in the past was to print a warning and continue without 
using the feature. Either way should be fine though.
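
A rough sketch of the reflection pattern being referred to, assuming Hadoop 3's {{Resource#setResourceValue(String, long)}} as the target method (the helper name and the warning text are illustrative):

{code:scala}
import org.apache.hadoop.yarn.api.records.Resource

def trySetCustomResource(resource: Resource, name: String, amount: Long): Unit = {
  try {
    // Looked up at runtime so the code still compiles and links against Hadoop 2.x.
    val m = resource.getClass.getMethod("setResourceValue", classOf[String], classOf[Long])
    m.invoke(resource, name, java.lang.Long.valueOf(amount))
  } catch {
    case _: NoSuchMethodException =>
      // Hadoop 2.x on the classpath: warn and continue without the custom resource.
      println(s"WARN: ignoring custom resource '$name'; requires Hadoop 3.x or later")
  }
}
{code}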




> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Priority: Major
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-02-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360933#comment-16360933
 ] 

Sean Owen commented on SPARK-20327:
---

I think the point is, without reflection, using 3.x+ APIs with 2.x results in 
complete failures at runtime because classes won't link. You can't even present 
an error. 

This isn't Hadoop-specific; this is how you have to deal with handling multiple 
incompatible APIs at the same time in the JVM in general.

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Priority: Major
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-02-12 Thread Szilard Nemeth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360904#comment-16360904
 ] 

Szilard Nemeth edited comment on SPARK-20327 at 2/12/18 3:43 PM:
-

Hey [~vanzin]!

I see what you said about compatibility.

By reflection, did you mean a solution similar to this? 
[https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L268]

By the way, wouldn't it be a good way forward that, if users specify any of 
these new resource configs while the Hadoop version Spark currently depends on 
is 2.x, we handle this case as an explicit error and inform the user about it?

In general, is there a conventional way to detect the Hadoop version in Spark code?

Thanks!

 


was (Author: snemeth):
Hey [~vanzin]!

I see what you said about compatibility.

By reflection, you meant a similar solution like this? 
[https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L268]

By the way, isn't it a good way to go forward  if users specify any of these 
new resource onfigs and the Hadoop version Spark currently depends on is 2.x 
then we handle this case as an error explicitly and inform the user about this?

In general, is that a conventional way to detect Hadoop version in Spark code?

Thanks!

 

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Priority: Major
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-02-12 Thread Szilard Nemeth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360904#comment-16360904
 ] 

Szilard Nemeth commented on SPARK-20327:


Hey [~vanzin]!

I see what you said about compatibility.

By reflection, did you mean a solution similar to this? 
[https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L268]

By the way, wouldn't it be a good way forward that, if users specify any of 
these new resource configs while the Hadoop version Spark currently depends on 
is 2.x, we handle this case as an explicit error and inform the user about it?

In general, is there a conventional way to detect the Hadoop version in Spark code?

Thanks!

 

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Priority: Major
>  Labels: newbie
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23391) It may lead to overflow for some integer multiplication

2018-02-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23391.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 20581
[https://github.com/apache/spark/pull/20581]

> It may lead to overflow for some integer multiplication 
> 
>
> Key: SPARK-23391
> URL: https://issues.apache.org/jira/browse/SPARK-23391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> In {{getBlockData}}, {{blockId.reduceId}} is of type {{Int}}; when it is 
> greater than 2^28, {{blockId.reduceId*8}} will overflow.
> In _decompress0_, _len_ and _unitSize_ are of type {{Int}}, so _len * 
> unitSize_ may also overflow.
>  
>  
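
A quick illustration of the first overflow mentioned above, with a value chosen just past the 2^28 boundary:

{code:scala}
val reduceId = (1 << 28) + 1     // 268435457, just above 2^28
val wrong: Long = reduceId * 8   // Int multiplication overflows first, then widens: -2147483640
val right: Long = reduceId * 8L  // widened to Long before multiplying: 2147483656
{code}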



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23391) It may lead to overflow for some integer multiplication

2018-02-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23391:
-

Assignee: liuxian

> It may lead to overflow for some integer multiplication 
> 
>
> Key: SPARK-23391
> URL: https://issues.apache.org/jira/browse/SPARK-23391
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> In {{getBlockData}}, {{blockId.reduceId}} is of type {{Int}}; when it is 
> greater than 2^28, {{blockId.reduceId*8}} will overflow.
> In _decompress0_, _len_ and _unitSize_ are of type {{Int}}, so _len * 
> unitSize_ may also overflow.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360809#comment-16360809
 ] 

Steve Loughran commented on SPARK-23308:


BTW

bq. I should get at least ~82k partitions, thus the same number of S3 requests. 

if your input stream is doing abort/reopen on seek & positioned read, then you 
get many more S3 requests when reading columnar data, which bounces around a 
lot.  See HADOOP-13203 for the work there, & the recent change of HADOOP-14965 
to at least reduce the TCP abort call count, but not doing anything for the GET 
count, just using smaller ranges in the GET calls. Oh, and if you use SSE-KMS, 
separate throttling, but AFAIK it should only surface on the initial GET, when 
the encryption kicks off

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set it totally ignores any kind 
> of RuntimeException or IOException, but some possible IOExceptions may happen 
> even if the file is not corrupted.
> One example is the SocketTimeoutException which can be retried to possibly 
> fetch the data without meaning the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
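
A hedged sketch of the kind of distinction the report asks for, i.e. classifying transient I/O errors as retryable instead of silently treating the file as corrupt (the predicate below is illustrative, not the FileScanRDD logic):

{code:scala}
import java.io.IOException
import java.net.SocketTimeoutException

// Returns true for errors worth retrying rather than skipping the file as corrupt.
def isRetryable(e: Throwable): Boolean = e match {
  case _: SocketTimeoutException => true
  case io: IOException => Option(io.getMessage).exists(_.toLowerCase.contains("timed out"))
  case _ => false
}
{code}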



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23397) Scheduling delay causes Spark Streaming to miss batches.

2018-02-12 Thread Shahbaz Hussain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360802#comment-16360802
 ] 

Shahbaz Hussain edited comment on SPARK-23397 at 2/12/18 2:29 PM:
--

Could we make job creation happen only once and keep it static, rather 
than creating jobs for every batch, since the jobs by their nature do not change?


was (Author: mhuss...@informatica.com):
can we be able to make job creation a only once and have it as static ,rather 
than creating jobs for every batch ,since job by nature do not change.

> Scheduling delay causes Spark Streaming to miss batches.
> 
>
> Key: SPARK-23397
> URL: https://issues.apache.org/jira/browse/SPARK-23397
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.1
>Reporter: Shahbaz Hussain
>Priority: Major
>
> * For complex Spark (Scala) DStream-based applications that require 
> creating, e.g., 40 jobs for every batch, it has been observed that batches do 
> not get created at the expected time. For example, if a Spark Streaming 
> application is started with a batch interval of 20 seconds and creates 40-odd 
> jobs, the next batch is not created 20 seconds after the previous job 
> creation time.
>  * This is because job creation is single-threaded: if the job creation 
> delay is greater than the batch interval, batch execution misses its 
> schedule.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23397) Scheduling delay causes Spark Streaming to miss batches.

2018-02-12 Thread Shahbaz Hussain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360802#comment-16360802
 ] 

Shahbaz Hussain commented on SPARK-23397:
-

Could we make job creation happen only once and keep it static, rather 
than creating jobs for every batch, since the jobs by their nature do not change?

> Scheduling delay causes Spark Streaming to miss batches.
> 
>
> Key: SPARK-23397
> URL: https://issues.apache.org/jira/browse/SPARK-23397
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.1
>Reporter: Shahbaz Hussain
>Priority: Major
>
> * For complex Spark (Scala) DStream-based applications that require 
> creating, e.g., 40 jobs for every batch, it has been observed that batches do 
> not get created at the expected time. For example, if a Spark Streaming 
> application is started with a batch interval of 20 seconds and creates 40-odd 
> jobs, the next batch is not created 20 seconds after the previous job 
> creation time.
>  * This is because job creation is single-threaded: if the job creation 
> delay is greater than the batch interval, batch execution misses its 
> schedule.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23397) Scheduling delay causes Spark Streaming to miss batches.

2018-02-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360798#comment-16360798
 ] 

Sean Owen commented on SPARK-23397:
---

That sounds correct. The next batch executes as soon as possible. There's no 
notion that it is the one originally scheduled for that time or not.

> Scheduling delay causes Spark Streaming to miss batches.
> 
>
> Key: SPARK-23397
> URL: https://issues.apache.org/jira/browse/SPARK-23397
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.1
>Reporter: Shahbaz Hussain
>Priority: Major
>
> * For complex Spark (Scala) DStream-based applications that require 
> creating, e.g., 40 jobs for every batch, it has been observed that batches do 
> not get created at the expected time. For example, if a Spark Streaming 
> application is started with a batch interval of 20 seconds and creates 40-odd 
> jobs, the next batch is not created 20 seconds after the previous job 
> creation time.
>  * This is because job creation is single-threaded: if the job creation 
> delay is greater than the batch interval, batch execution misses its 
> schedule.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23397) Scheduling delay causes Spark Streaming to miss batches.

2018-02-12 Thread Shahbaz Hussain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360793#comment-16360793
 ] 

Shahbaz Hussain commented on SPARK-23397:
-

Yes, if the current batch processing time is greater than the batch interval, the 
next batch is delayed or queued. However, in this case, with a complex Spark 
application, batch execution is missed entirely. For example, say my application 
is started at 12:20:00 with a batch interval of 5 seconds; if job creation takes 
20 seconds, it misses that many batches, and in the Spark UI we would see the 
next batch at 12:20:25 appear while the batches of 12:20:05, 12:20:10, 12:20:15, 
and 12:20:20 never get triggered.

> Scheduling delay causes Spark Streaming to miss batches.
> 
>
> Key: SPARK-23397
> URL: https://issues.apache.org/jira/browse/SPARK-23397
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.1
>Reporter: Shahbaz Hussain
>Priority: Major
>
> * For complex Spark (Scala) DStream-based applications that require 
> creating, e.g., 40 jobs for every batch, it has been observed that batches do 
> not get created at the expected time. For example, if a Spark Streaming 
> application is started with a batch interval of 20 seconds and creates 40-odd 
> jobs, the next batch is not created 20 seconds after the previous job 
> creation time.
>  * This is because job creation is single-threaded: if the job creation 
> delay is greater than the batch interval, batch execution misses its 
> schedule.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Description: 
If the event log is big, the history server web UI will run out of memory. My 
event log size is 5.1 GB:

!eventlog.png! When I open the web UI, it goes OOM. 

  was:if the event log is  big, the historyServer web will be out of memory  
!eventlog.png!


> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: eventlog.png, historyServer.png
>
>
> If the event log is big, the history server web UI will run out of memory. My 
> event log size is 5.1 GB:
> !eventlog.png! When I open the web UI, it goes OOM. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23392) Add some test case for images feature

2018-02-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23392:
--
Priority: Trivial  (was: Major)

> Add some test case for images feature
> -
>
> Key: SPARK-23392
> URL: https://issues.apache.org/jira/browse/SPARK-23392
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: xubo245
>Priority: Trivial
>
> Add some test case for images feature: SPARK-21866



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Description: 
If the event log is big, the history server web UI will run out of memory. My 
event log size is 5.1 GB:

!eventlog.png!

When I open the web UI, it goes OOM. 

 

!historyServer.png!

  was:
if the event log is  big, the historyServer web will be out of memory . My 
eventlog size is 5.1G:

!eventlog.png! I open the web, the ui will be OMM. 


> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: eventlog.png, historyServer.png
>
>
> If the event log is big, the history server web UI will run out of memory. My 
> event log size is 5.1 GB:
>  
> !eventlog.png!
> When I open the web UI, it goes OOM. 
>  
> !historyServer.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Attachment: eventlog.png

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: eventlog.png, historyServer.png
>
>
> If the event log is big, the history server web UI will run out of memory  
> !eventlog.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Description: If the event log is big, the history server web UI will run out of 
memory.  !eventlog.png!  (was: if the event log is  big, the historyServer web 
will be out of memory !eventlog.png!!eventlog.png!

 )

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> If the event log is big, the history server web UI will run out of memory  
> !eventlog.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Description: 
If the event log is big, the history server web UI will run out of memory 
!eventlog.png!!eventlog.png!

 

  was:
if the event log is  big, the historyServer web will be out of memory

 


> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory 
> !eventlog.png!!eventlog.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Attachment: (was: eventlog.png)

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory. 
> !eventlog.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23397) Scheduling delay causes Spark Streaming to miss batches.

2018-02-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360789#comment-16360789
 ] 

Sean Owen commented on SPARK-23397:
---

This is how it's supposed to work. Batches don't overlap. If one overruns, the 
rest are delayed.

> Scheduling delay causes Spark Streaming to miss batches.
> 
>
> Key: SPARK-23397
> URL: https://issues.apache.org/jira/browse/SPARK-23397
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.1
>Reporter: Shahbaz Hussain
>Priority: Major
>
> * For complex Spark (Scala) DStream-based applications that create many jobs per 
> batch (e.g. 40), batches are not always created at the scheduled time. For example, 
> with a batch interval of 20 seconds and about 40 jobs per batch, the next batch is 
> not created 20 seconds after the previous batch's job-creation time.
>  * This is because job creation is single-threaded: if the job-creation delay is 
> greater than the batch interval, batch execution misses its schedule.
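
One way to quantify the delay described above is a streaming listener that reports each 
batch's scheduling and processing delay. The class below is a small sketch (the class 
name is made up) that uses only the public listener API.

{code:scala}
// Logs how late each batch started and how long it ran.
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchDelayLogger extends StreamingListener {
  override def onBatchCompleted(completed: StreamingListenerBatchCompleted): Unit = {
    val info = completed.batchInfo
    val schedulingMs = info.schedulingDelay.getOrElse(-1L)  // how long the batch waited past its scheduled time
    val processingMs = info.processingDelay.getOrElse(-1L)  // how long the batch took to run
    println(s"batch ${info.batchTime}: schedulingDelay=${schedulingMs}ms processingDelay=${processingMs}ms")
  }
}

// Attach it to an existing StreamingContext `ssc` before ssc.start():
//   ssc.addStreamingListener(new BatchDelayLogger)
{code}

A steadily growing schedulingDelay is the signature of the behaviour reported here: each 
batch starts later than the one before because the previous batch's work has not finished.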



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23343) Increase the exception test for the bind port

2018-02-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23343.
---
Resolution: Won't Fix

> Increase the exception test for the bind port
> -
>
> Key: SPARK-23343
> URL: https://issues.apache.org/jira/browse/SPARK-23343
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Minor
>
> This PR adds new test cases:
> 1. a boundary-value test for port 65535,
> 2. a test of binding to a privileged port,
> 3. a rebinding test with `spark.port.maxRetries` set to 1,
> 4. a test that `Utils.userPort` wraps around when generating candidate ports.
> In addition, in the existing test case, if `spark.testing` is not set to true, the 
> default value of `spark.port.maxRetries` is 16 rather than 100, so using 
> (expectedPort + 100) is a small mistake (see the wrap-around sketch below).
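
For context on the wrap-around behaviour the boundary test has to cover, here is a 
self-contained sketch; `userPort` and `bindWithRetries` below are illustrative 
stand-ins written for this example, not Spark's actual helpers.

{code:scala}
// Illustrative stand-ins for the behaviour the new tests target.
import java.io.IOException
import java.net.ServerSocket

object PortRetrySketch {
  // Keep candidate ports inside the non-privileged range [1024, 65535] by wrapping around.
  def userPort(base: Int, offset: Int): Int =
    (base + offset - 1024) % (65536 - 1024) + 1024

  // Try `base` and then up to `maxRetries` wrapped successors, mirroring the retry loop
  // that `spark.port.maxRetries` configures (default 16, or 100 when `spark.testing` is set).
  def bindWithRetries(base: Int, maxRetries: Int): ServerSocket = {
    var offset = 0
    var bound: ServerSocket = null
    var lastError: Throwable = null
    while (bound == null && offset <= maxRetries) {
      try {
        bound = new ServerSocket(userPort(base, offset))
      } catch {
        case e: IOException =>
          lastError = e
          offset += 1
      }
    }
    if (bound == null) {
      throw new RuntimeException(s"could not bind within $maxRetries retries starting at $base", lastError)
    }
    bound
  }

  def main(args: Array[String]): Unit = {
    // Boundary case from the test plan: starting at 65535 must wrap to 1024, not try 65536.
    println((0 to 2).map(userPort(65535, _)))            // Vector(65535, 1024, 1025)
    val socket = bindWithRetries(65535, maxRetries = 1)
    println(s"bound to port ${socket.getLocalPort}")
    socket.close()
  }
}
{code}

Starting at 65535 with a single retry exercises exactly the boundary the first test 
targets: the second candidate must wrap to 1024 rather than overflow to 65536.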



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23397) Scheduling delay causes Spark Streaming to miss batches.

2018-02-12 Thread Shahbaz Hussain (JIRA)
Shahbaz Hussain created SPARK-23397:
---

 Summary: Scheduling delay causes Spark Streaming to miss batches.
 Key: SPARK-23397
 URL: https://issues.apache.org/jira/browse/SPARK-23397
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.2.1
Reporter: Shahbaz Hussain


* For complex Spark (Scala) DStream-based applications that create many jobs per 
batch (e.g. 40), batches are not always created at the scheduled time. For example, 
with a batch interval of 20 seconds and about 40 jobs per batch, the next batch is 
not created 20 seconds after the previous batch's job-creation time.
 * This is because job creation is single-threaded: if the job-creation delay is 
greater than the batch interval, batch execution misses its schedule (see the sketch 
below).
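
To make the failure mode concrete, here is a minimal, hypothetical sketch of such an 
application (the object name and the exact transformations are made up for this 
example): it registers 40 output operations on a 20-second batch interval, and every 
one of them becomes a job that has to be created and submitted each interval.

{code:scala}
// Hypothetical app shape for the scenario described above: 40 output operations
// on a 20-second batch interval, so ~40 jobs must be created and submitted per batch.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ManyJobsPerBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("many-jobs-per-batch")
    val ssc  = new StreamingContext(conf, Seconds(20))   // batch interval from the report

    val lines = ssc.socketTextStream("localhost", 9999)  // any input DStream would do
    val words = lines.flatMap(_.split("\\s+"))

    // Each registered output operation turns into at least one job every batch.
    (1 to 40).foreach { i =>
      words.filter(_.length >= i % 10).foreachRDD { rdd =>
        rdd.count()                                      // stand-in for real per-job work
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}

If generating and submitting those jobs takes longer than 20 seconds, the next batch 
time has already passed when it is reached, which matches the comment earlier in the 
thread: batches don't overlap, so an overrun simply delays the batches that follow.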



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Attachment: eventlog.png

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: eventlog.png, historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Attachment: (was: historyServer.png)

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Description: 
if the event log is big, the historyServer web will be out of memory

 

  was:if the event log is big, the historyServer web will be out of memory


> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360784#comment-16360784
 ] 

Sean Owen commented on SPARK-23396:
---

This is far too vague. It seems to overlap with recent improvements in the SHS from 
[~vanzin]. I'd close this unless it's a lot more specific.

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Attachment: historyServer.png

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Attachment: historyServer.png

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-23396:
--
Attachment: (was: historyServer.png)

> Spark HistoryServer will OOM if the event log is big
> 
>
> Key: SPARK-23396
> URL: https://issues.apache.org/jira/browse/SPARK-23396
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>Priority: Major
> Attachments: historyServer.png
>
>
> if the event log is big, the historyServer web will be out of memory



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23396) Spark HistoryServer will OOM if the event log is big

2018-02-12 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-23396:
-

 Summary: Spark HistoryServer will OOM if the event log is big
 Key: SPARK-23396
 URL: https://issues.apache.org/jira/browse/SPARK-23396
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: KaiXinXIaoLei
 Attachments: historyServer.png

if the event log is big, the historyServer web will be out of memory



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


