[jira] [Resolved] (SPARK-26808) Pruned schema should not change nullability
[ https://issues.apache.org/jira/browse/SPARK-26808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-26808. - Resolution: Won't Fix > Pruned schema should not change nullability > --- > > Key: SPARK-26808 > URL: https://issues.apache.org/jira/browse/SPARK-26808 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Minor > > We prune unnecessary nested fields from the requested schema when reading > Parquet. It seems we don't currently keep the original nullability in the pruned > schema. We should preserve it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
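For readers outside the Parquet code path, the intent of the issue can be illustrated with a plain-Python sketch (a StructType modeled as nested dicts; this is not Spark's actual schema-pruning code, and the helper name is made up):

```python
# Illustrative sketch in plain Python (not Spark's actual Catalyst code):
# prune a struct schema to the requested fields while carrying over each
# field's original nullable flag unchanged, instead of resetting it.
def prune_schema(schema, requested):
    kept = [dict(f) for f in schema["fields"] if f["name"] in requested]
    return {"type": "struct", "fields": kept}

full = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": False},
    {"name": "name", "type": "string", "nullable": True},
]}

pruned = prune_schema(full, {"id"})
# The surviving field keeps nullable=False rather than being reset.
assert pruned["fields"][0]["nullable"] is False
```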
[jira] [Commented] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects
[ https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758906#comment-16758906 ] Boris Shminke commented on SPARK-18161: --- [~ssimmons] thanks for starting this work. [~hyukjin.kwon] thanks for guiding me during the review. :) > Default PickleSerializer pickle protocol doesn't handle > 4GB objects > - > > Key: SPARK-18161 > URL: https://issues.apache.org/jira/browse/SPARK-18161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.0.1 >Reporter: Sloane Simmons >Priority: Major > Fix For: 3.0.0 > > > When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there > is an error serializing the object with: > {{OverflowError: cannot serialize a bytes object larger than 4 GiB}} > in the stack trace. > This is because Python's pickle serialization (with protocol <= 3) uses a > 32-bit integer for the object size, and so cannot handle objects larger than > 4 gigabytes. This was changed in Protocol 4 of pickle > (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and > is available in Python 3.4+. > I would like to use this protocol for broadcasting and in the default > PickleSerializer where available to make pyspark more robust to broadcasting > large variables.
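The proposal above can be sketched with the stdlib alone (illustrative, not pyspark's actual serializer code):

```python
import pickle

# pickle protocols <= 3 store a bytes object's length in a 32-bit field,
# so objects over 4 GiB overflow, while protocol 4 (Python 3.4+, PEP 3154)
# adds 8-byte-length opcodes. A tiny payload is used here; the protocol
# selection is the point.
payload = b"x" * 1024

for proto in (2, 3, 4):
    assert pickle.loads(pickle.dumps(payload, protocol=proto)) == payload

# Choose protocol 4 where available, as the issue proposes:
best_protocol = min(pickle.HIGHEST_PROTOCOL, 4)
assert best_protocol == 4  # holds on Python 3.4+
```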
[jira] [Assigned] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26813: Assignee: Apache Spark > Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Apache Spark >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Assigned] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26813: Assignee: (was: Apache Spark) > Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26813: - Description: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. was: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. > Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. 
They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's even better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26813: - Description: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. was: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. > Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. 
They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools
[ https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26813: - Description: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. was: The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * For SBT build, both javac options and scalac options related to java version are provided. For Maven build, scala-maven-plugin compiles both Java and Scala code. However, javac options related to java version (-source, -target) are provided while scalac options related to java version (-target:TARGET) are not provided, which means scalac will depend on the default options (jvm-1.8). It's better for Maven build to explicitly provide scalac options as well. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader. 
> Consolidate java version across language compilers and build tools > -- > > Key: SPARK-26813 > URL: https://issues.apache.org/jira/browse/SPARK-26813 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Minor > > The java version here means versions of javac source, javac target, scalac > target. They could be consolidated as a single version (currently 1.8) > || ||javac||scalac|| > |source|1.8|2.12/2.11| > |target|1.8|1.8| > The current issues are as follows > * Maven defines a single property to specify java version (java.version) > while SBT build defines different properties for javac (javacJVMVersion) and > scalac (scalacJVMVersion). SBT should use a single property as Maven does. > * Furthermore, it's even better for SBT to refer to java.version defined by > Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Created] (SPARK-26813) Consolidate java version across language compilers and build tools
Chenxiao Mao created SPARK-26813: Summary: Consolidate java version across language compilers and build tools Key: SPARK-26813 URL: https://issues.apache.org/jira/browse/SPARK-26813 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Chenxiao Mao The java version here means versions of javac source, javac target, scalac target. They could be consolidated as a single version (currently 1.8) || ||javac||scalac|| |source|1.8|2.12/2.11| |target|1.8|1.8| The current issues are as follows * Maven defines a single property to specify java version (java.version) while SBT build defines different properties for javac (javacJVMVersion) and scalac (scalacJVMVersion). SBT should use a single property as Maven does. * For SBT build, both javac options and scalac options related to java version are provided. For Maven build, scala-maven-plugin compiles both Java and Scala code. However, javac options related to java version (-source, -target) are provided while scalac options related to java version (-target:TARGET) are not provided, which means scalac will depend on the default options (jvm-1.8). It's better for Maven build to explicitly provide scalac options as well. * Furthermore, it's even better for SBT to refer to java.version defined by Maven. This is possible since we've already been using sbt-pom-reader.
[jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message
[ https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758878#comment-16758878 ] Hyukjin Kwon commented on SPARK-26810: -- Also, the workaround is very easy: just add one {{*}}: {code} from pyspark.sql import Row r = Row(*['a','b']) r('1', '2') {code} Is {{r = Row(['a','b'])}} usage documented somewhere? I think it was a mistake that we ever supported it. > Fixing SPARK-25072 broke existing code and fails to show error message > -- > > Key: SPARK-26810 > URL: https://issues.apache.org/jira/browse/SPARK-26810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Arttu Voutilainen >Priority: Minor > > Hey, > We upgraded Spark recently, and > https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail > after the upgrade. Annoyingly, the error message formatting also threw an > exception itself, thus hiding the message we should have seen. > Repro using gettyimages/docker-spark, on 2.4.0: > {code} > from pyspark.sql import Row > r = Row(['a','b']) > r('1', '2') > {code} > {code} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__ > "but got %s" % (self, len(self), args)) > File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__ > return "<Row(%s)>" % ", ".join(self) > TypeError: sequence item 0: expected str instance, list found > {code} > On 2.3.1, and also showing how this was used: > {code} > from pyspark.sql import Row, types as T > r = Row(['a','b']) > df = spark.createDataFrame([Row(col='doesntmatter')]) > rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')]) > spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), > T.StructField('b', T.StringType())])).collect() > {code} > {code} > [Row(a='a1', b='b2'), Row(a='a1', b='b2')] > {code} > While I do think the code we had was quite horrible, it used to work. 
The > unexpected error came from __repr__ as it assumes that the arguments given to > Row constructor are strings. That sounds like a reasonable assumption, should > the Row constructor validate that it holds true maybe? (I guess that might be > another potentially breaking change though, if someone has as weird code as > this one...)
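The difference the extra {{*}} makes is plain Python argument unpacking, independent of pyspark (the {{fields()}} helper below is hypothetical and merely stands in for the Row constructor, which expects field names as separate arguments):

```python
# Plain-Python illustration of what the extra {{*}} does: Row(['a','b'])
# passes ONE argument (a list), while Row(*['a','b']) unpacks the list
# into two separate field-name arguments.
def fields(*names):
    return names

assert fields(*["a", "b"]) == ("a", "b")    # unpacked: two field names
assert fields(["a", "b"]) == (["a", "b"],)  # one argument: a single list
```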
[jira] [Commented] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message
[ https://issues.apache.org/jira/browse/SPARK-26810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758876#comment-16758876 ] Hyukjin Kwon commented on SPARK-26810: -- {code} Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__ "but got %s" % (self, len(self), args)) File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__ return "<Row(%s)>" % ", ".join(self) TypeError: sequence item 0: expected str instance, list found {code} That traceback looks like another issue, I guess SPARK-23299. Are you sure SPARK-25072 is the cause? I don't see the relevant error messages. > Fixing SPARK-25072 broke existing code and fails to show error message > -- > > Key: SPARK-26810 > URL: https://issues.apache.org/jira/browse/SPARK-26810 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Arttu Voutilainen >Priority: Minor > > Hey, > We upgraded Spark recently, and > https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail > after the upgrade. Annoyingly, the error message formatting also threw an > exception itself, thus hiding the message we should have seen. 
> Repro using gettyimages/docker-spark, on 2.4.0: > {code} > from pyspark.sql import Row > r = Row(['a','b']) > r('1', '2') > {code} > {code} > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__ > "but got %s" % (self, len(self), args)) > File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__ > return "<Row(%s)>" % ", ".join(self) > TypeError: sequence item 0: expected str instance, list found > {code} > On 2.3.1, and also showing how this was used: > {code} > from pyspark.sql import Row, types as T > r = Row(['a','b']) > df = spark.createDataFrame([Row(col='doesntmatter')]) > rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')]) > spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), > T.StructField('b', T.StringType())])).collect() > {code} > {code} > [Row(a='a1', b='b2'), Row(a='a1', b='b2')] > {code} > While I do think the code we had was quite horrible, it used to work. The > unexpected error came from __repr__ as it assumes that the arguments given to > Row constructor are strings. That sounds like a reasonable assumption, should > the Row constructor validate that it holds true maybe? (I guess that might be > another potentially breaking change though, if someone has as weird code as > this one...)
[jira] [Commented] (SPARK-26809) insert overwrite directory + concat function => error
[ https://issues.apache.org/jira/browse/SPARK-26809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758873#comment-16758873 ] Hyukjin Kwon commented on SPARK-26809: -- Are you able to post a self-contained reproducer? It will avoid duplicated effort when other people start to investigate. > insert overwrite directory + concat function => error > - > > Key: SPARK-26809 > URL: https://issues.apache.org/jira/browse/SPARK-26809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: ant_nebula >Priority: Critical > > insert overwrite directory '/tmp/xx' > select concat(col1, col2) > from tableXX > limit 3 > > Caused by: org.apache.hadoop.hive.serde2.SerDeException: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements > while columns.types has 2 elements! > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145) > at > org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85) > at > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119) > at > org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:121) > at > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)
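The SerDeException above is a metadata consistency check: Hive's LazySimpleSerDe requires the serde properties listing column names and column types to have equal length. A plain-Python illustration (the property values below are hypothetical, chosen only to reproduce the report's 3-vs-2 mismatch):

```python
# Hypothetical serde properties mirroring the mismatch in the stack trace:
# three column names but only two column types.
props = {"columns": "c0,c1,c2", "columns.types": "string,string"}

def check_columns(props):
    # Mirrors the consistency check in LazySerDeParameters.extractColumnInfo.
    names = props["columns"].split(",")
    types = props["columns.types"].split(",")
    if len(names) != len(types):
        raise ValueError("columns has %d elements while columns.types "
                         "has %d elements!" % (len(names), len(types)))

try:
    check_columns(props)
    message = None
except ValueError as exc:
    message = str(exc)
```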
[jira] [Resolved] (SPARK-26796) Testcases failing with "org.apache.hadoop.fs.ChecksumException" error
[ https://issues.apache.org/jira/browse/SPARK-26796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26796. -- Resolution: Cannot Reproduce > Testcases failing with "org.apache.hadoop.fs.ChecksumException" error > - > > Key: SPARK-26796 > URL: https://issues.apache.org/jira/browse/SPARK-26796 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.2, 2.4.0 > Environment: Ubuntu 16.04 > Java Version > openjdk version "1.8.0_192" > OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9) > Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Compressed References > 20181107_80 (JIT enabled, AOT enabled) > OpenJ9 - 090ff9dcd > OMR - ea548a66 > JCL - b5a3affe73 based on jdk8u192-b12) > > Hadoop Version > Hadoop 2.7.1 > Subversion Unknown -r Unknown > Compiled by test on 2019-01-29T09:09Z > Compiled with protoc 2.5.0 > From source with checksum 5e94a235f9a71834e2eb73fb36ee873f > This command was run using > /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar > > > >Reporter: Anuja Jakhade >Priority: Major > > Observing test case failures due to Checksum error > Below is the error log > [ERROR] checkpointAndComputation(test.org.apache.spark.JavaAPISuite) Time > elapsed: 1.232 s <<< ERROR! 
> org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor > driver): org.apache.hadoop.fs.ChecksumException: Checksum error: > file:/home/test/spark/core/target/tmp/1548319689411-0/fd0ba388-539c-49aa-bf76-e7d50aa2d1fc/rdd-0/part-0 > at 0 exp: 222499834 got: 1400184476 > at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:323) > at > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279) > at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214) > at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232) > at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2769) > at > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2785) > at > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3262) > at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:968) > at java.io.ObjectInputStream.<init>(ObjectInputStream.java:390) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:300) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:813) > Driver stacktrace: > at > test.org.apache.spark.JavaAPISuite.checkpointAndComputation(JavaAPISuite.java:1243) > Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error:
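The mechanism behind a ChecksumException can be illustrated with a stdlib CRC (a conceptual sketch only: Hadoop's ChecksumFileSystem actually uses CRC32/CRC32C over fixed-size chunks stored in a side ".crc" file, not this code):

```python
import zlib

# A CRC computed at write time must match the CRC recomputed at read time;
# any change to the bytes in between produces the "exp: ... got: ..."
# mismatch seen in the report.
def write_chunk(data):
    return data, zlib.crc32(data)

def read_chunk(data, stored_crc):
    actual = zlib.crc32(data)
    if actual != stored_crc:
        raise IOError("Checksum error: exp: %d got: %d" % (stored_crc, actual))
    return data

chunk, crc = write_chunk(b"spark checkpoint bytes")
assert read_chunk(chunk, crc) == chunk

# A single flipped bit ('s' -> 'S') changes the recomputed CRC.
try:
    read_chunk(b"Spark checkpoint bytes", crc)
    detected = False
except IOError:
    detected = True
```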
[jira] [Commented] (SPARK-26804) Spark sql carries newline char from last csv column when imported
[ https://issues.apache.org/jira/browse/SPARK-26804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758872#comment-16758872 ] Hyukjin Kwon commented on SPARK-26804: -- Can you show your input file? It would be easier to verify the issue if there's a self-contained reproducer. I am leaving this JIRA resolved until the details are provided. > Spark sql carries newline char from last csv column when imported > - > > Key: SPARK-26804 > URL: https://issues.apache.org/jira/browse/SPARK-26804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Raj >Priority: Major > > I am trying to generate external sql tables in DataBricks using Spark sql > query. Below is my query. The query reads csv file and creates external table > but it carries the newline char while creating the last column. Is there a > way to resolve this issue? > > %sql > create table if not exists <> > using CSV > options ("header"="true", "inferschema"="true","multiLine"="true", > "escape"='"') > location
[jira] [Resolved] (SPARK-26804) Spark sql carries newline char from last csv column when imported
[ https://issues.apache.org/jira/browse/SPARK-26804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26804. -- Resolution: Incomplete > Spark sql carries newline char from last csv column when imported > - > > Key: SPARK-26804 > URL: https://issues.apache.org/jira/browse/SPARK-26804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Raj >Priority: Major > > I am trying to generate external sql tables in DataBricks using Spark sql > query. Below is my query. The query reads csv file and creates external table > but it carries the newline char while creating the last column. Is there a > way to resolve this issue? > > %sql > create table if not exists <> > using CSV > options ("header"="true", "inferschema"="true","multiLine"="true", > "escape"='"') > location
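The {{multiLine}} and {{escape}} options in the query above matter because a quoted CSV field may legally contain an embedded newline; a pyspark-free illustration using Python's stdlib csv module (the sample data is hypothetical):

```python
import csv
import io

# A quoted CSV field may legally contain an embedded newline. A reader
# that splits the file on raw line breaks (Spark's default behavior
# without multiLine=true) sees a broken last column; a CSV-aware parse
# keeps the field intact.
raw = 'a,b\n"1","line1\nline2"\n'

rows = list(csv.reader(io.StringIO(raw)))  # CSV-aware parse
naive = raw.strip().split("\n")            # raw line splitting

assert rows == [["a", "b"], ["1", "line1\nline2"]]  # 2 records, field intact
assert len(naive) == 3                              # record split in half
```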
[jira] [Commented] (SPARK-26801) Spark unable to read valid avro types
[ https://issues.apache.org/jira/browse/SPARK-26801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758869#comment-16758869 ] Hyukjin Kwon commented on SPARK-26801: -- Thanks for reporting this. Would you be interested in narrowing down the problem? > Spark unable to read valid avro types > - > > Key: SPARK-26801 > URL: https://issues.apache.org/jira/browse/SPARK-26801 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dhruve Ashar >Priority: Major > > Currently the external avro package reads avro schemas for type records only. > This is probably because of representation of InternalRow in spark sql. As a > result, if the avro file has anything other than a sequence of records it > fails to read it. > We faced this issue earlier while trying to read primitive types. We > encountered this again while trying to read an array of records. Below are > code examples trying to read valid avro data showing the stack traces. > {code:java} > spark.read.format("avro").load("avroTypes/randomInt.avro").show > java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL > StructType: > "int" > at > org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > ... 
49 elided > == > scala> spark.read.format("avro").load("avroTypes/randomEnum.avro").show > java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL > StructType: > { > "type" : "enum", > "name" : "Suit", > "symbols" : [ "SPADES", "HEARTS", "DIAMONDS", "CLUBS" ] > } > at > org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > ... 49 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
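The error above follows from Spark SQL's row model: a DataFrame row is a struct, so only a top-level Avro record maps onto a StructType, and bare primitives, enums, or arrays at the top level are rejected at schema-inference time. A toy, Spark-free sketch of that constraint (the function and schema encoding here are illustrative, not Spark's actual internals):

```python
def to_struct_fields(avro_schema):
    """Toy model of the Avro-to-SQL conversion: only a top-level 'record'
    schema can become a StructType (one column per record field)."""
    if isinstance(avro_schema, dict) and avro_schema.get("type") == "record":
        return [(f["name"], f["type"]) for f in avro_schema["fields"]]
    # Mirrors the error in the report: primitives, enums, arrays, ... are rejected.
    raise RuntimeError(
        "Avro schema cannot be converted to a Spark SQL StructType: %r" % (avro_schema,))

record = {"type": "record", "name": "r",
          "fields": [{"name": "id", "type": "int"}]}
print(to_struct_fields(record))  # [('id', 'int')]

try:
    to_struct_fields("int")  # a bare primitive at the top level
except RuntimeError as e:
    print(e)
```

Wrapping the value in a single-field record before writing the Avro file is the usual way to sidestep the limitation.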
[jira] [Commented] (SPARK-26796) Testcases failing with "org.apache.hadoop.fs.ChecksumException" error
[ https://issues.apache.org/jira/browse/SPARK-26796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758868#comment-16758868 ] Hyukjin Kwon commented on SPARK-26796: -- I'm unable to reproduce this in my local environment, and the tests seem to work fine in Jenkins. Can you run the tests via Maven or SBT? Let me leave this resolved until other people can reproduce it via Maven or SBT, not via an IDE. > Testcases failing with "org.apache.hadoop.fs.ChecksumException" error > - > > Key: SPARK-26796 > URL: https://issues.apache.org/jira/browse/SPARK-26796 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.2, 2.4.0 > Environment: Ubuntu 16.04 > Java Version > openjdk version "1.8.0_192" > OpenJDK Runtime Environment (build 1.8.0_192-b12_openj9) > Eclipse OpenJ9 VM (build openj9-0.11.0, JRE 1.8.0 Compressed References > 20181107_80 (JIT enabled, AOT enabled) > OpenJ9 - 090ff9dcd > OMR - ea548a66 > JCL - b5a3affe73 based on jdk8u192-b12) > > Hadoop Version > Hadoop 2.7.1 > Subversion Unknown -r Unknown > Compiled by test on 2019-01-29T09:09Z > Compiled with protoc 2.5.0 > From source with checksum 5e94a235f9a71834e2eb73fb36ee873f > This command was run using > /home/test/hadoop-release-2.7.1/hadoop-dist/target/hadoop-2.7.1/share/hadoop/common/hadoop-common-2.7.1.jar > > > >Reporter: Anuja Jakhade >Priority: Major > > Observing test case failures due to a checksum error. > Below is the error log: > [ERROR] checkpointAndComputation(test.org.apache.spark.JavaAPISuite) Time > elapsed: 1.232 s <<< ERROR!
> org.apache.spark.SparkException: > Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most > recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor > driver): org.apache.hadoop.fs.ChecksumException: Checksum error: > file:/home/test/spark/core/target/tmp/1548319689411-0/fd0ba388-539c-49aa-bf76-e7d50aa2d1fc/rdd-0/part-0 > at 0 exp: 222499834 got: 1400184476 > at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:323) > at > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:279) > at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:214) > at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:232) > at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2769) > at > java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2785) > at > java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3262) > at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:968) > at java.io.ObjectInputStream.(ObjectInputStream.java:390) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:63) > at > org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122) > at > org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:300) > at > org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:813) > Driver stacktrace: > at > test.org.apache.spark.JavaAPISuite.checkpointAndComputation(JavaAPISuite.java:1243) > Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
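For background on this failure mode: Hadoop's local filesystem verifies data through a checksummed layer that stores one CRC32 per fixed-size chunk (512 bytes by default) in a side `.crc` file and recomputes each CRC at read time; a mismatch surfaces as the ChecksumException above, with the expected and actual values in the message. A minimal stdlib sketch of that chunked scheme (not Hadoop's actual on-disk format):

```python
import zlib

CHUNK = 512  # matches Hadoop's default 512-byte checksum chunk

def chunk_crcs(data):
    """One CRC32 per CHUNK-sized slice, as a checksummed filesystem would store."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, stored):
    """Recompute CRCs on read and fail loudly on the first mismatch."""
    for n, (got, exp) in enumerate(zip(chunk_crcs(data), stored)):
        if got != exp:
            raise IOError(f"Checksum error at chunk {n}: exp {exp} got {got}")

payload = b"x" * 1300
crcs = chunk_crcs(payload)            # what gets written to the side .crc file
verify(payload, crcs)                 # clean read: no exception
try:
    verify(b"y" + payload[1:], crcs)  # data changed after checksums were stored
except IOError as e:
    print(e)                          # mirrors the "exp: ... got: ..." in the log
```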
[jira] [Commented] (SPARK-26791) Some Scala code doesn't display well and some description about foreachBatch is misleading
[ https://issues.apache.org/jira/browse/SPARK-26791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758867#comment-16758867 ] Hyukjin Kwon commented on SPARK-26791: -- Can you post a PR to improve the doc? > Some Scala code doesn't display well and some description about > foreachBatch is misleading > > > Key: SPARK-26791 > URL: https://issues.apache.org/jira/browse/SPARK-26791 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.0 > Environment: NA >Reporter: chaiyongqiang >Priority: Minor > Attachments: foreachBatch.jpg, multi-watermark.jpg > > > [Introduction about > foreachbatch|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#foreachbatch] > [Introduction about > policy-for-handling-multiple-watermarks|http://spark.apache.org/docs/2.4.0/structured-streaming-programming-guide.html#policy-for-handling-multiple-watermarks] > The introductions to foreachBatch and > policy-for-handling-multiple-watermarks don't display well with the Scala code. > Besides, the discussion of foreachBatch mentions an uncache API which doesn't > exist, which may be misleading. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26807) Confusing documentation regarding installation from PyPi
[ https://issues.apache.org/jira/browse/SPARK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758866#comment-16758866 ] Hyukjin Kwon commented on SPARK-26807: -- Can you post a PR? > Confusing documentation regarding installation from PyPi > > > Key: SPARK-26807 > URL: https://issues.apache.org/jira/browse/SPARK-26807 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Emmanuel Arias >Priority: Minor > > Hello! > I am new to using Spark. Reading the documentation, I think the > Downloading section is a little confusing. > [https://spark.apache.org/docs/latest/#downloading|https://spark.apache.org/docs/latest/#downloading] > writes: "Scala and Java users can include Spark in their projects using its > Maven coordinates and in the future Python users can also install Spark from > PyPI.", which I interpret to mean that Spark is not on PyPI yet. But > [https://spark.apache.org/downloads.html] writes: > "[PySpark|https://pypi.python.org/pypi/pyspark] is now available in pypi. To > install just run {{pip install pyspark}}." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26651: Assignee: Maxim Gekk (was: Apache Spark) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26651: Assignee: Apache Spark (was: Maxim Gekk) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26651. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23722 [https://github.com/apache/spark/pull/23722] > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > Fix For: 3.0.0 > > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
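Python's datetime module, like java.time's IsoChronology, already uses the Proleptic Gregorian calendar, which makes the behavioral difference called out in the release note easy to see. In the hybrid (Julian + Gregorian) calendar, October 4, 1582 is followed directly by October 15, 1582; in the proleptic calendar the ten skipped dates exist:

```python
from datetime import date

# datetime uses the proleptic Gregorian calendar, like java.time's IsoChronology.
before, after = date(1582, 10, 4), date(1582, 10, 15)
print((after - before).days)  # 11: the "skipped" dates all exist here

# In the hybrid calendar, Oct 15, 1582 came the day after Oct 4, 1582,
# so a date like this one simply never existed:
print(date(1582, 10, 10))  # 1582-10-10, valid in the proleptic calendar
```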
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26651: - Fix Version/s: (was: 3.0.0) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-26651: -- > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > Fix For: 3.0.0 > > > Spark 2.4 and previous versions use a hybrid calendar (Julian + Gregorian) in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark to the Proleptic Gregorian calendar, and to use the java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purposes of switching > to the Proleptic Gregorian calendar is to conform to the SQL standard, which assumes > such a calendar. > *Release note:* > Spark 3.0 has switched to the Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps, as well as in extracting > sub-components like years, days, etc. It uses Java 8 API classes from the > java.time packages that are based on [ISO chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might affect the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18161) Default PickleSerializer pickle protocol doesn't handle > 4GB objects
[ https://issues.apache.org/jira/browse/SPARK-18161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-18161. -- Resolution: Fixed Fix Version/s: 3.0.0 This is fixed by upgrading cloudpickle at https://github.com/apache/spark/pull/20691 > Default PickleSerializer pickle protocol doesn't handle > 4GB objects > - > > Key: SPARK-18161 > URL: https://issues.apache.org/jira/browse/SPARK-18161 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0, 2.0.1 >Reporter: Sloane Simmons >Priority: Major > Fix For: 3.0.0 > > > When broadcasting a fairly large numpy matrix in a Spark 2.0.1 program, there > is an error serializing the object with: > {{OverflowError: cannot serialize a bytes object larger than 4 GiB}} > in the stack trace. > This is because Python's pickle serialization (with protocol <= 3) uses a > 32-bit integer for the object size, and so cannot handle objects larger than > 4 gigabytes. This was changed in Protocol 4 of pickle > (https://www.python.org/dev/peps/pep-3154/#bit-opcodes-for-large-objects) and > is available in Python 3.4+. > I would like to use this protocol for broadcasting and in the default > PickleSerializer where available to make pyspark more robust to broadcasting > large variables. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
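For context, the 4 GiB cap comes from the 32-bit length field used by the bytes opcodes in pickle protocols 3 and below; protocol 4 (PEP 3154, Python 3.4+) adds 8-byte-length opcodes such as BINBYTES8. A small sketch of selecting the protocol the way a pickle-based serializer might (the helper name is illustrative, not PySpark's actual API):

```python
import pickle

# Protocols <= 3 encode a bytes object's length in a 32-bit field, which is why
# broadcasting > 4 GiB fails with:
#   OverflowError: cannot serialize a bytes object larger than 4 GiB
# Protocol 4 adds 8-byte-length opcodes (e.g. BINBYTES8), lifting the cap.
def best_pickle_protocol():
    """Prefer protocol 4 where the interpreter supports it, as the ticket proposes."""
    return min(4, pickle.HIGHEST_PROTOCOL)

payload = {"matrix": b"\x00" * 4096}  # stand-in for a large numpy broadcast variable
blob = pickle.dumps(payload, protocol=best_pickle_protocol())
assert pickle.loads(blob) == payload
print(best_pickle_protocol())  # 4 on any Python >= 3.4
```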
[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758825#comment-16758825 ] Rajesh Chandramohan commented on SPARK-21733: - It's based on the symptom of the actual issue: when there was a container limit in the YARN cluster, the already-spawned containers got a SIGTERM error: CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > - > > Key: SPARK-21733 > URL: https://issues.apache.org/jira/browse/SPARK-21733 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 > Environment: Apache Spark 2.1.1 > CDH5.12.0 Yarn >Reporter: Jepson >Priority: Major > Original Estimate: 96h > Remaining Estimate: 96h > > Kafka + Spark Streaming throw this error: > {code:java} > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8003 took 11 ms > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 > (TID 64178). 
1740 bytes result sent to driver > 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002 > 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 64186 > 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 > (TID 64186) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 8004 > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8004 took 8 ms > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 > (TID 64186). 1740 bytes result sent to driver > h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED > SIGNAL TERM > 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called > 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26812) PushProjectionThroughUnion nullability issue
Bogdan Raducanu created SPARK-26812: --- Summary: PushProjectionThroughUnion nullability issue Key: SPARK-26812 URL: https://issues.apache.org/jira/browse/SPARK-26812 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Bogdan Raducanu Union output data types are the output data types of the first child. However, the other union children may have different value nullability. This means that we can't always push down a project onto the children. To reproduce: {code} Seq(Map("foo" -> "bar")).toDF("a").write.saveAsTable("table1") sql("SELECT 1 AS b").write.saveAsTable("table2") sql("CREATE OR REPLACE VIEW test1 AS SELECT map() AS a FROM table2 UNION ALL SELECT a FROM table1") sql("select * from test1").show {code} This fails because the plan is no longer resolved. The plan is broken by the PushProjectionThroughUnion rule, which pushed down a cast to a map with value nullability=true onto a child whose type is a map with value nullability=false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
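The invariant the rule violates can be modeled without Spark: a union's output column must be nullable if the corresponding column in any child is nullable (and the same widening applies recursively to nested nullability, such as the map value nullability in the report above), so a projection pushed into one child must not assume the first child's narrower nullability. A toy sketch of the correct widening (names are illustrative, not Spark internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    nullable: bool

def union_output(children):
    """Union keeps the first child's names and types, but a column must be
    nullable if it is nullable in ANY child: taking only the first child's
    nullability (what the buggy pushdown assumes) is too narrow."""
    first = children[0]
    return [
        Field(f.name, f.dtype, any(child[i].nullable for child in children))
        for i, f in enumerate(first)
    ]

child_a = [Field("a", "map<string,string>", nullable=False)]
child_b = [Field("a", "map<string,string>", nullable=True)]
print(union_output([child_a, child_b])[0].nullable)  # True
```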
[jira] [Resolved] (SPARK-26714) The job whose partition num is zero not shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-26714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26714. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23637 [https://github.com/apache/spark/pull/23637] > The job whose partition num is zero not shown in WebUI > - > > Key: SPARK-26714 > URL: https://issues.apache.org/jira/browse/SPARK-26714 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1, 2.4.0 >Reporter: deshanxiao >Assignee: deshanxiao >Priority: Minor > Fix For: 3.0.0 > > > When the job's partition count is zero, it will still get a job ID but is not shown in the > UI. I think that's strange. > Example: > mkdir /home/test/testdir > sc.textFile("/home/test/testdir") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26714) The job whose partition num is zero not shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-26714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26714: - Assignee: deshanxiao > The job whose partition num is zero not shown in WebUI > - > > Key: SPARK-26714 > URL: https://issues.apache.org/jira/browse/SPARK-26714 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1, 2.4.0 >Reporter: deshanxiao >Assignee: deshanxiao >Priority: Minor > > When the job's partition count is zero, it will still get a job ID but is not shown in the > UI. I think that's strange. > Example: > mkdir /home/test/testdir > sc.textFile("/home/test/testdir") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26771) Make .unpersist(), .destroy() consistently non-blocking by default
[ https://issues.apache.org/jira/browse/SPARK-26771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26771. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23685 [https://github.com/apache/spark/pull/23685] > Make .unpersist(), .destroy() consistently non-blocking by default > -- > > Key: SPARK-26771 > URL: https://issues.apache.org/jira/browse/SPARK-26771 > Project: Spark > Issue Type: Improvement > Components: GraphX, Spark Core >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > See https://issues.apache.org/jira/browse/SPARK-26728 and > https://github.com/apache/spark/pull/23650 . > RDD and DataFrame expose an .unpersist() method with optional "blocking" > argument. So does Broadcast.destroy(). This argument is false by default > except for the Scala RDD (not Pyspark) implementation and its GraphX > subclasses. Most usages of these methods request non-blocking behavior > already, and indeed, it's not typical to want to wait for the resources to be > freed, except in tests asserting behavior about these methods (where blocking > is typically requested). > This proposes to make the default false across these methods, and adjust > callers to only request non-default blocking behavior where important, such > as in a few key tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26754) Add hasTrainingSummary to replace duplicate code in PySpark
[ https://issues.apache.org/jira/browse/SPARK-26754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26754: - Assignee: Huaxin Gao > Add hasTrainingSummary to replace duplicate code in PySpark > --- > > Key: SPARK-26754 > URL: https://issues.apache.org/jira/browse/SPARK-26754 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Python version of https://issues.apache.org/jira/browse/SPARK-20351. > Add HasTrainingSummary to avoid code duplicate related to training summary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
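For readers unfamiliar with the pattern: the deduplication is a mixin that centralizes the summary accessors each model class had been re-implementing. A toy, pure-Python sketch in that spirit (class and attribute names are illustrative, not PySpark's actual API):

```python
class HasTrainingSummary:
    """Shared accessors for a model's training summary, so every model class
    stops re-implementing the same two properties."""
    _summary = None

    @property
    def hasSummary(self):
        return self._summary is not None

    @property
    def summary(self):
        if self._summary is None:
            raise RuntimeError("No training summary available for this model")
        return self._summary

class ToyModel(HasTrainingSummary):
    """Stand-in for a fitted estimator's model class."""
    def fit(self):
        self._summary = {"accuracy": 0.9}  # placeholder metrics
        return self

m = ToyModel()
print(m.hasSummary)        # False before fit()
print(m.fit().hasSummary)  # True afterwards
```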
[jira] [Resolved] (SPARK-26754) Add hasTrainingSummary to replace duplicate code in PySpark
[ https://issues.apache.org/jira/browse/SPARK-26754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26754. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23676 [https://github.com/apache/spark/pull/23676] > Add hasTrainingSummary to replace duplicate code in PySpark > --- > > Key: SPARK-26754 > URL: https://issues.apache.org/jira/browse/SPARK-26754 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Python version of https://issues.apache.org/jira/browse/SPARK-20351. > Add HasTrainingSummary to avoid code duplicate related to training summary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv
[ https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758677#comment-16758677 ] vishnuram selvaraj commented on SPARK-26786: Thanks [~hyukjin.kwon]. I have raised a git issue (https://github.com/uniVocity/univocity-parsers/issues/308) in the univocity project as well. I will post any updates I get from there here. > Handle to treat escaped newline characters('\r','\n') in spark csv > -- > > Key: SPARK-26786 > URL: https://issues.apache.org/jira/browse/SPARK-26786 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark, SQL >Affects Versions: 2.3.0 >Reporter: vishnuram selvaraj >Priority: Major > > There are some systems, like AWS Redshift, which write CSV files by escaping > newline characters ('\r','\n') in addition to escaping the quote characters, > if they come as part of the data. > Redshift documentation > link ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]), and > below is their description of the escaping requirements from that link: > ESCAPE > For CHAR and VARCHAR columns in delimited unload files, an escape character > ({{\}}) is placed before every occurrence of the following characters: > * Linefeed: {{\n}} > * Carriage return: {{\r}} > * The delimiter character specified for the unloaded data. > * The escape character: {{\}} > * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are > specified in the UNLOAD command). > > *Problem statement:* > But the Spark CSV reader doesn't have an option to treat/remove the escape > characters in front of the newline characters in the data. > It would really help if we could add a feature to handle the escaped newline > characters through another parameter like (escapeNewline = 'true/false'). > *Example:* > Below are the details of my test data set up in a file. 
> * The first record in that file has an escaped Windows newline character (\r\n) > * The third record in that file has an escaped Unix newline character (\n) > * The fourth record in that file has the escaped quote character (") > The file looks like this in a vi editor: > > {code:java} > "1","this is \^M\ > line1"^M > "2","this is line2"^M > "3","this is \ > line3"^M > "4","this is \" line4"^M > "5","this is line5"^M{code} > > When I read the file with Python's csv module using an escape character, it is able to remove > the added escape characters, as you can see below: > > {code:java} > >>> with open('/tmp/test3.csv','r') as readCsv: > ... readFile = > csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False) > ... for row in readFile: > ... print(row) > ... > ['1', 'this is \r\n line1'] > ['2', 'this is line2'] > ['3', 'this is \n line3'] > ['4', 'this is " line4'] > ['5', 'this is line5'] > {code} > But if I read the same file with the Spark CSV reader, the escape characters > in front of the newline characters are not removed. But the escape before the > (") is removed. > {code:java} > >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false') > >>> redDf.show() > +---+--+ > |_c0| _c1| > +---+--+ > | 1|this is \ > line1| > | 2| this is line2| > | 3| this is \ > line3| > | 4| this is " line4| > | 5| this is line5| > +---+--+ > {code} > *Expected result:* > {code:java} > +---+--+ > |_c0| _c1| > +---+--+ > | 1|this is > line1| > | 2| this is line2| > | 3| this is > line3| > | 4| this is " line4| > | 5| this is line5| > +---+--+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
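Until such an option exists, one hedged workaround is to normalize the file before handing it to any CSV parser: strip the escape character only where it precedes a newline, leaving the escaped quote and escape characters for the parser to interpret. A stdlib-only sketch under the Redshift escaping rules described above (the inline sample data stands in for the reporter's file):

```python
import csv
import io
import re

# Sample data in the Redshift style described above: backslash escapes
# newlines and quotes inside quoted fields.
raw = ('"1","this is \\\r\n line1"\r\n'
       '"2","this is line2"\r\n'
       '"3","this is \\\n line3"\r\n'
       '"4","this is \\" line4"\r\n'
       '"5","this is line5"\r\n')

# Drop the escape character only where it precedes a newline; leave \" alone
# so the CSV parser can still interpret it.
normalized = re.sub(r"\\(?=\r\n|\r|\n)", "", raw)

rows = list(csv.reader(io.StringIO(normalized), quotechar='"',
                       escapechar='\\', doublequote=False))
print(rows[0])  # ['1', 'this is \r\n line1']: the newline survives, unescaped
```

The same normalization could be done as a streaming pre-pass over large files before Spark reads them, at the cost of an extra copy of the data.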
[jira] [Commented] (SPARK-24541) TCP based shuffle
[ https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758673#comment-16758673 ] Jungtaek Lim commented on SPARK-24541: -- Same understanding here: I think there is little chance we would want to deal with a lower level than Netty, but we may also want to send an amount of data close to the size of the data structure. Btw, I don't know which mechanism Spark leverages to send (pull) shuffle data; whichever it is, it should be fine to leverage it here as well, since it has presumably already been vetted for performance, security, etc. > TCP based shuffle > - > > Key: SPARK-24541 > URL: https://issues.apache.org/jira/browse/SPARK-24541 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-26651: --- Labels: ReleaseNote (was: ) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. > Release notes: > Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps as well as in extracting > sub-components like years, days and etc. It uses Java 8 API classes from the > java.time packages that based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations by using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact on the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-26651: --- Description: Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in date/timestamp parsing, functions and expressions. The ticket aims to switch Spark on Proleptic Gregorian calendar, and use java.time classes introduced in Java 8 for timestamp/date manipulations. One of the purpose of switching on Proleptic Gregorian calendar is to conform to SQL standard which supposes such calendar. *Release note:* Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. It uses Java 8 API classes from the java.time packages that based on [ISO chronology |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. Previous versions of Spark performed those operations by using [the hybrid calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] (Julian + Gregorian). The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian). was: Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in date/timestamp parsing, functions and expressions. The ticket aims to switch Spark on Proleptic Gregorian calendar, and use java.time classes introduced in Java 8 for timestamp/date manipulations. One of the purpose of switching on Proleptic Gregorian calendar is to conform to SQL standard which supposes such calendar. Release notes: Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. It uses Java 8 API classes from the java.time packages that based on [ISO chronology |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. 
Previous versions of Spark performed those operations by using [the hybrid calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] (Julian + Gregorian). The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian). > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: ReleaseNote > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. > *Release note:* > Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps as well as in extracting > sub-components like years, days and etc. It uses Java 8 API classes from the > java.time packages that based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations by using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact on the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-26651: --- Description: Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in date/timestamp parsing, functions and expressions. The ticket aims to switch Spark on Proleptic Gregorian calendar, and use java.time classes introduced in Java 8 for timestamp/date manipulations. One of the purpose of switching on Proleptic Gregorian calendar is to conform to SQL standard which supposes such calendar. Release notes: Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, formatting, and converting dates and timestamps as well as in extracting sub-components like years, days and etc. It uses Java 8 API classes from the java.time packages that based on [ISO chronology |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. Previous versions of Spark performed those operations by using [the hybrid calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] (Julian + Gregorian). The changes might impact on the results for dates and timestamps before October 15, 1582 (Gregorian). was:Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in date/timestamp parsing, functions and expressions. The ticket aims to switch Spark on Proleptic Gregorian calendar, and use java.time classes introduced in Java 8 for timestamp/date manipulations. One of the purpose of switching on Proleptic Gregorian calendar is to conform to SQL standard which supposes such calendar. 
> Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. > Release notes: > Spark 3.0 has switched on Proleptic Gregorian calendar in parsing, > formatting, and converting dates and timestamps as well as in extracting > sub-components like years, days and etc. It uses Java 8 API classes from the > java.time packages that based on [ISO chronology > |https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]. > Previous versions of Spark performed those operations by using [the hybrid > calendar|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html] > (Julian + Gregorian). The changes might impact on the results for dates and > timestamps before October 15, 1582 (Gregorian). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
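For reference, the calendar difference the ticket describes can be illustrated with Python's `datetime`, which already implements the Proleptic Gregorian calendar that Spark 3.0 switches to:

```python
from datetime import date, timedelta

# Python's datetime uses the Proleptic Gregorian calendar, so there is no
# Gregorian cutover gap in October 1582:
d = date(1582, 10, 4) + timedelta(days=1)
print(d)  # 1582-10-05

# Under the hybrid Julian + Gregorian calendar of java.util.GregorianCalendar
# (used by Spark 2.4 and earlier), October 4, 1582 is followed directly by
# October 15, 1582 -- which is why results for old dates and timestamps can
# differ between the two calendars.
```

This is exactly the behavior change called out in the release note for dates before October 15, 1582.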
[jira] [Created] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.
Ryan Blue created SPARK-26811: - Summary: Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis. Key: SPARK-26811 URL: https://issues.apache.org/jira/browse/SPARK-26811 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Ryan Blue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26651: Assignee: Maxim Gekk (was: Apache Spark) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26651) Use Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-26651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26651: Assignee: Apache Spark (was: Maxim Gekk) > Use Proleptic Gregorian calendar > > > Key: SPARK-26651 > URL: https://issues.apache.org/jira/browse/SPARK-26651 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Spark 2.4 and previous versions use a hybrid calendar - Julian + Gregorian in > date/timestamp parsing, functions and expressions. The ticket aims to switch > Spark on Proleptic Gregorian calendar, and use java.time classes introduced > in Java 8 for timestamp/date manipulations. One of the purpose of switching > on Proleptic Gregorian calendar is to conform to SQL standard which supposes > such calendar. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Affects Version/s: 2.3.3 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0, 2.2.4 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26810) Fixing SPARK-25072 broke existing code and fails to show error message
Arttu Voutilainen created SPARK-26810: - Summary: Fixing SPARK-25072 broke existing code and fails to show error message Key: SPARK-26810 URL: https://issues.apache.org/jira/browse/SPARK-26810 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: Arttu Voutilainen Hey, We upgraded Spark recently, and https://issues.apache.org/jira/browse/SPARK-25072 caused our pipeline to fail after the upgrade. Annoyingly, the error message formatting also threw an exception itself, thus hiding the message we should have seen. Repro using gettyimages/docker-spark, on 2.4.0: {code} from pyspark.sql import Row r = Row(['a','b']) r('1', '2') {code} {code} Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1505, in __call__ "but got %s" % (self, len(self), args)) File "/usr/spark-2.4.0/python/pyspark/sql/types.py", line 1552, in __repr__ return "<Row(%s)>" % ", ".join(self) TypeError: sequence item 0: expected str instance, list found {code} On 2.3.1, and also showing how this was used: {code} from pyspark.sql import Row, types as T r = Row(['a','b']) df = spark.createDataFrame([Row(col='doesntmatter')]) rdd = df.rdd.mapPartitions(lambda p: [r('a1','b2')]) spark.createDataFrame(rdd, T.StructType([T.StructField('a', T.StringType()), T.StructField('b', T.StringType())])).collect() {code} {code} [Row(a='a1', b='b2'), Row(a='a1', b='b2')] {code} While I do think the code we had was quite horrible, it used to work. The unexpected error came from __repr__ as it assumes that the arguments given to the Row constructor are strings. That sounds like a reasonable assumption; maybe the Row constructor should validate that it holds? (I guess that might be another potentially breaking change though, if someone has as weird code as this one...) 
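The masking part of the bug can be reproduced without pyspark at all; the sketch below (plain Python, with a hypothetical `stored` list standing in for what the Row ends up holding) shows why the `", ".join(self)` inside `__repr__` raises its own TypeError and hides the original error message:

```python
# Row(['a','b']) stores a single list element rather than two field names,
# so any str.join over the stored values fails on the list item.
stored = [['a', 'b']]
try:
    ", ".join(stored)
except TypeError as e:
    print(e)  # sequence item 0: expected str instance, list found
```

That is the exact TypeError from the traceback, raised while formatting the *real* error about the wrong number of arguments.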
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Fix Version/s: (was: 2.3.3) 2.3.4 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.4, 2.4.1, 3.0.0, 2.2.4 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26806: - Affects Version/s: 2.2.2 2.2.3 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0, 2.2.4 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26806) EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly
[ https://issues.apache.org/jira/browse/SPARK-26806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-26806. -- Resolution: Fixed Fix Version/s: 3.0.0 2.4.1 2.3.3 2.2.4 > EventTimeStats.merge doesn't handle "zero.merge(zero)" correctly > > > Key: SPARK-26806 > URL: https://issues.apache.org/jira/browse/SPARK-26806 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1, 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: liancheng >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.2.4, 2.3.3, 2.4.1, 3.0.0 > > > Right now, EventTimeStats.merge doesn't handle "zero.merge(zero)". This will > make "avg" become "NaN". And whatever gets merged with the result of > "zero.merge(zero)", "avg" will still be "NaN". Then finally, "NaN".toLong > will return "0" and the user will see the following incorrect report: > {code} > "eventTime" : { > "avg" : "1970-01-01T00:00:00.000Z", > "max" : "2019-01-31T12:57:00.000Z", > "min" : "2019-01-30T18:44:04.000Z", > "watermark" : "1970-01-01T00:00:00.000Z" > } > {code} > This issue was reported by [~liancheng] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
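A hedged Python sketch (hypothetical names; Spark's actual implementation is Scala) of how a weighted-average merge produces NaN for zero.merge(zero) and then propagates it through every later merge:

```python
import math

# On the JVM, 0.0 / 0.0 evaluates to NaN instead of raising; math.nan
# stands in for that here.
def merge(a, b):
    count = a["count"] + b["count"]
    if count == 0:
        avg = math.nan  # JVM: (0.0 * 0 + 0.0 * 0) / 0 == NaN
    else:
        avg = (a["avg"] * a["count"] + b["avg"] * b["count"]) / count
    return {"avg": avg, "count": count}

zero = {"avg": 0.0, "count": 0}
poisoned = merge(zero, zero)                        # avg becomes NaN
later = merge(poisoned, {"avg": 5.0, "count": 2})   # NaN * 0 is still NaN
print(math.isnan(later["avg"]))  # True
# Finally, Scala's Double.NaN.toLong is 0, which renders as the bogus
# "1970-01-01T00:00:00.000Z" timestamps in the report above.
```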
[jira] [Commented] (SPARK-24961) sort operation causes out of memory
[ https://issues.apache.org/jira/browse/SPARK-24961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758557#comment-16758557 ] Mono Shiro commented on SPARK-24961: Spark Version 2.3.2. I have a very similar issue when simply reading a file that is bigger than the available memory on my machine. Changing the StorageLevel to DISK_ONLY also blows up despite having ample space. [Please see the question on stackoverflow|https://stackoverflow.com/questions/54469243/spark-storagelevel-in-local-mode-not-working/54470393#54470393] It's important that local mode work for this sort of thing. > sort operation causes out of memory > > > Key: SPARK-24961 > URL: https://issues.apache.org/jira/browse/SPARK-24961 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1 > Environment: Java 1.8u144+ > Windows 10 > Spark 2.3.1 in local mode > -Xms4g -Xmx4g > optional: -XX:+UseParallelOldGC >Reporter: Markus Breuer >Priority: Major > > A sort operation on a large rdd - which does not fit in memory - causes an out of > memory exception. I made the effect reproducible by a sample; the sample > creates large objects of about 2 MB size. When saving the result the oom occurs. I > tried several StorageLevels, but if memory is included (MEMORY_AND_DISK, > MEMORY_AND_DISK_SER, none) the application runs out of memory. Only DISK_ONLY > seems to work. > When replacing sort() with sortWithinPartitions() no StorageLevel is required > and the application succeeds. 
> {code:java} > package de.bytefusion.examples; > import breeze.storage.Storage; > import de.bytefusion.utils.Options; > import org.apache.hadoop.io.MapFile; > import org.apache.hadoop.io.SequenceFile; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapred.SequenceFileOutputFormat; > import org.apache.spark.api.java.JavaRDD; > import org.apache.spark.api.java.JavaSparkContext; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructType; > import org.apache.spark.storage.StorageLevel; > import scala.Tuple2; > import static org.apache.spark.sql.functions.*; > import java.util.ArrayList; > import java.util.List; > import java.util.UUID; > import java.util.stream.Collectors; > import java.util.stream.IntStream; > public class Example3 { > public static void main(String... args) { > // create spark session > SparkSession spark = SparkSession.builder() > .appName("example1") > .master("local[4]") > .config("spark.driver.maxResultSize","1g") > .config("spark.driver.memory","512m") > .config("spark.executor.memory","512m") > .config("spark.local.dir","d:/temp/spark-tmp") > .getOrCreate(); > JavaSparkContext sc = > JavaSparkContext.fromSparkContext(spark.sparkContext()); > // base to generate huge data > List list = new ArrayList<>(); > for (int val = 1; val < 1; val++) { > int valueOf = Integer.valueOf(val); > list.add(valueOf); > } > // create simple rdd of int > JavaRDD rdd = sc.parallelize(list,200); > // use map to create large object per row > JavaRDD rowRDD = > rdd > .map(value -> > RowFactory.create(String.valueOf(value), > createLongText(UUID.randomUUID().toString(), 2 * 1024 * 1024))) > // no persist => out of memory exception on write() > // persist MEMORY_AND_DISK => out of memory exception > on write() > // persist MEMORY_AND_DISK_SER => out of memory > 
exception on write() > // persist(StorageLevel.DISK_ONLY()) > ; > StructType type = new StructType(); > type = type > .add("c1", DataTypes.StringType) > .add( "c2", DataTypes.StringType ); > Dataset df = spark.createDataFrame(rowRDD, type); > // works > df.show(); > df = df > .sort(col("c1").asc() ) > ; > df.explain(); > // takes a lot of time but works > df.show(); > // OutOfMemoryError: java heap space > df > .write() > .mode("overwrite") > .csv("d:/temp/my.csv"); > // OutOfMemoryError: java heap space > df > .toJavaRDD() > .mapToPair(row -> new Tuple2(new Text(row.getString(0)), new > Text( row.getString(1 > .saveAsHadoopFile("d:\\temp\\foo", Text.class,
[jira] [Commented] (SPARK-24541) TCP based shuffle
[ https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758488#comment-16758488 ] Jose Torres commented on SPARK-24541: - I'm not gonna lie, I didn't put a tremendous amount of thought into the title of the Jira ticket. There's a strong argument that using Netty is indeed the right decision here. (Although we have to keep scalability in mind; we'll eventually need to do some kind of multiplexing to support even moderately sized N to N shuffles, so we should probably stay compatible with that.) I'd guess that the RPC framework does carry a performance penalty from things such as extra headers, but I'd argue the major disadvantage is that it's not the right abstraction layer. RPCs normally live exclusively in the control plane. > TCP based shuffle > - > > Key: SPARK-24541 > URL: https://issues.apache.org/jira/browse/SPARK-24541 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24541) TCP based shuffle
[ https://issues.apache.org/jira/browse/SPARK-24541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758482#comment-16758482 ] Imran Rashid commented on SPARK-24541: -- well, rpc is over tcp, so I'm still not really sure what this means. Is the point sending raw data directly over sockets? I'd be interested in knowing what the purpose is. I guess to avoid the overhead associated w/ the extra headers etc from the rpc framework? And if this is really going to try to use raw sockets, not through netty, then you'd have to reimplement encryption, manage your own buffers, etc. > TCP based shuffle > - > > Key: SPARK-24541 > URL: https://issues.apache.org/jira/browse/SPARK-24541 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758351#comment-16758351 ] Gabor Somogyi commented on SPARK-23685: --- [~sindiri] We've tried to reproduce the issue without success; do you have example code? > Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive > Offsets (i.e. Log Compaction) > - > > Key: SPARK-23685 > URL: https://issues.apache.org/jira/browse/SPARK-23685 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sirisha >Priority: Major > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always > be just an increment of 1. If not, it throws the below exception: > > "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). > Some data may have been lost because they are not available in Kafka any > more; either the data was aged out by Kafka or the topic may have been > deleted before all the data in the topic was processed. If you don't want > your streaming query to fail on such cases, set the source option > "failOnDataLoss" to "false". " > > FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
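A minimal Python sketch (hypothetical function, not the actual KafkaSourceRDD/CachedKafkaConsumer code) of the consecutive-offset assumption that log compaction violates:

```python
# The fetch loop effectively expects each record's offset to be exactly
# previous + 1; a compacted partition has gaps, so the check fires.
def fetch(offsets, start, end):
    expected = start
    for off in offsets:
        if off != expected:
            raise RuntimeError(
                f"Cannot fetch records in [{start}, {end}): got offset {off}, "
                f"expected {expected}")
        expected = off + 1

compacted = [5589, 5590, 5592]  # offset 5591 was removed by compaction
try:
    fetch(compacted, 5589, 5693)
except RuntimeError as e:
    print(e)
```

The resulting error mirrors the "Cannot fetch records in [5589, 5693)" message quoted above, even though no data was actually lost.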
[jira] [Commented] (SPARK-26783) Kafka parameter documentation doesn't match with the reality (upper/lowercase)
[ https://issues.apache.org/jira/browse/SPARK-26783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758348#comment-16758348 ] Gabor Somogyi commented on SPARK-26783: --- [~zsxwing] [~kabhwan] The more I play with this, the more I think there could be different issues involved (not sure whether they affect each other). 1. "failOnDataLoss": I'll ask the reporter on SPARK-23685 because I've not yet been able to repro. Let's see whether the code or the doc has to be updated. 2. Generic data source implementation issue. Namely, the API doesn't guarantee lowercase params, but the user code depends on that. For example [this|https://github.com/apache/spark/blob/aea5f506463c19fac97547ba7a28f9dd491e3a6a/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L66], but there are other places. Not sure it has anything to do with the first, but it could potentially cause such issues. > Kafka parameter documentation doesn't match with the reality (upper/lowercase) > -- > > Key: SPARK-26783 > URL: https://issues.apache.org/jira/browse/SPARK-26783 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Minor > > A good example for this is "failOnDataLoss" which is reported in SPARK-23685. > I've just checked and there are several other parameters which suffer from > the same issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
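A small Python sketch (the normalization step is hypothetical, not Spark's actual code) of how an upper/lowercase mismatch between documentation and implementation can silently drop a user-supplied option:

```python
# The user passes the documented camelCase name:
user_options = {"failOnDataLoss": "false"}

# Hypothetical normalization, as a data source implementation might apply:
normalized = {k.lower(): v for k, v in user_options.items()}

# A verbatim lookup against the normalized map misses the user's setting
# and silently falls back to the default:
print(normalized.get("failOnDataLoss", "true"))  # true

# A case-insensitive lookup finds it:
print(normalized.get("failOnDataLoss".lower()))  # false
```

Either the docs must state the exact casing the code expects, or the lookup must be made case-insensitive end to end.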
[jira] [Updated] (SPARK-26734) StackOverflowError on WAL serialization caused by large receivedBlockQueue
[ https://issues.apache.org/jira/browse/SPARK-26734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Somogyi updated SPARK-26734:
----------------------------------
    Component/s: DStreams

> StackOverflowError on WAL serialization caused by large receivedBlockQueue
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26734
>                 URL: https://issues.apache.org/jira/browse/SPARK-26734
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, DStreams
>    Affects Versions: 2.3.1, 2.3.2, 2.4.0
>         Environment: spark 2.4.0 streaming job
> java 1.8
> scala 2.11.12
>            Reporter: Ross M. Lodge
>            Priority: Major
>
> We encountered an intermittent StackOverflowError with a stack trace similar to:
>
> {noformat}
> Exception in thread "JobGenerator" java.lang.StackOverflowError
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509){noformat}
> The name of the thread has been seen to be either "JobGenerator" or
> "streaming-start", depending on when in the lifecycle of the job the problem
> occurs. It appears to only occur in streaming jobs with checkpointing and
> WAL enabled; this has prevented us from upgrading to v2.4.0.
>
> Via debugging, we tracked this down to allocateBlocksToBatch in
> ReceivedBlockTracker:
> {code:java}
> /**
>  * Allocate all unallocated blocks to the given batch.
>  * This event will get written to the write ahead log (if enabled).
>  */
> def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
>   if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
>     val streamIdToBlocks = streamIds.map { streamId =>
>       (streamId, getReceivedBlockQueue(streamId).clone())
>     }.toMap
>     val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
>     if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
>       streamIds.foreach(getReceivedBlockQueue(_).clear())
>       timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
>       lastAllocatedBatchTime = batchTime
>     } else {
>       logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
>     }
>   } else {
>     // This situation occurs when:
>     // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
>     //    possibly processed batch job or half-processed batch job need to be processed again,
>     //    so the batchTime will be equal to lastAllocatedBatchTime.
>     // 2. Slow checkpointing makes recovered batch time older than WAL recovered
>     //    lastAllocatedBatchTime.
>     // This situation will only occurs in recovery time.
>     logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
>   }
> }
> {code}
> Prior to 2.3.1, this code did
> {code:java}
> getReceivedBlockQueue(streamId).dequeueAll(x => true){code}
> but it was changed as part of SPARK-23991 to
> {code:java}
> getReceivedBlockQueue(streamId).clone(){code}
> We've not been able to reproduce this in a test of the actual above method,
> but we've been able to produce a test that reproduces it by putting a lot of
> values into the queue:
>
> {code:java}
> class SerializationFailureTest extends FunSpec {
>   private val logger = LoggerFactory.getLogger(getClass)
>
>   private type ReceivedBlockQueue = mutable.Queue[ReceivedBlockInfo]
>
>   describe("Queue") {
>     it("should be serializable") {
>       runTest(1062)
>     }
>     it("should not be serializable") {
>       runTest(1063)
>     }
>     it("should DEFINITELY not be serializable") {
>       runTest(199952)
>     }
>   }
>
>   private def runTest(mx: Int): Array[Byte] = {
>     try {
>       val random = new scala.util.Random()
>       val queue = new ReceivedBlockQueue()
>       for (_ <- 0 until mx) {
>         queue += ReceivedBlockInfo(
>           streamId = 0,
>           numRecords = Some(random.nextInt(5)),
>           metadataOption = None,
>           blockStoreResult = WriteAheadLogBasedStoreResult(
>             blockId = StreamBlockId(0, random.nextInt()),
>             numRecords = Some(random.nextInt(5)),
>             walRecordHandle = FileBasedWriteAheadLogSegment(
>               path = s"""hdfs://foo.bar.com:8080/spark/streaming/BAZ/7/receivedData/0/log-${random.nextInt()}-${random.nextInt()}""",
>               offset = random.nextLong(),
>               length = random.nextInt()
>             )
>           )
>         )
>       }
>       val record = BatchAllocationEvent(
>         Time(154832040L), AllocatedBlocks(
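The failure mode in the report, default Java serialization recursing once per element of a linked structure, can be reproduced without Spark. A hypothetical minimal sketch (the `Node` class and the chain lengths are illustrative, not Spark code):

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// A hand-rolled singly linked list with no custom writeObject: default
// serialization follows the "next" reference recursively, consuming one
// group of stack frames per node, so a long enough chain overflows the
// stack. scala.collection.mutable.Queue (pre-2.13) is backed by linked
// cells in the same way, which matches the reported stack trace.
class Node implements Serializable {
    int value;
    Node next;
}

class WalOverflowSketch {
    static Node build(int n) {
        Node head = null;
        for (int i = 0; i < n; i++) {
            Node node = new Node();
            node.value = i;
            node.next = head;
            head = node;
        }
        return head;
    }

    static boolean serializes(Node head) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(OutputStream.nullOutputStream())) {
            out.writeObject(head);
            return true;
        } catch (IOException e) {
            return false;
        } catch (StackOverflowError e) {
            // The exact threshold depends on -Xss; the failure mode does not.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(build(100)));     // short chain: fine
        System.out.println(serializes(build(500_000))); // long chain: overflows
    }
}
```

This also suggests why `dequeueAll(x => true)` masked the problem: the WAL record then held a flat sequence rather than the cloned linked queue. Collections that must survive serialization at scale typically need an array-backed representation or a custom iterative `writeObject`.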
[jira] [Created] (SPARK-26809) insert overwrite directory + concat function => error
ant_nebula created SPARK-26809:
----------------------------------

             Summary: insert overwrite directory + concat function => error
                 Key: SPARK-26809
                 URL: https://issues.apache.org/jira/browse/SPARK-26809
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: ant_nebula

insert overwrite directory '/tmp/xx' select concat(col1, col2) from tableXX limit 3

Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 3 elements while columns.types has 2 elements!
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[jira] [Assigned] (SPARK-26797) Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one
[ https://issues.apache.org/jira/browse/SPARK-26797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26797:
------------------------------------
    Assignee: (was: Apache Spark)

> Start using the new logical types API of Parquet 1.11.0 instead of the
> deprecated one
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26797
>                 URL: https://issues.apache.org/jira/browse/SPARK-26797
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zoltan Ivanfi
>            Priority: Major
>
> The 1.11.0 release of parquet-mr will deprecate its logical type API in
> favour of a newly introduced one. The new API also introduces new subtypes
> for different timestamp semantics, support for which should be added to Spark
> in order to read those types correctly.
> At this point only a release candidate of parquet-mr 1.11.0 is available, but
> that already allows implementing and reviewing this change.
[jira] [Assigned] (SPARK-26797) Start using the new logical types API of Parquet 1.11.0 instead of the deprecated one
[ https://issues.apache.org/jira/browse/SPARK-26797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26797:
------------------------------------
    Assignee: Apache Spark

> Start using the new logical types API of Parquet 1.11.0 instead of the
> deprecated one
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26797
>                 URL: https://issues.apache.org/jira/browse/SPARK-26797
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zoltan Ivanfi
>            Assignee: Apache Spark
>            Priority: Major
>
> The 1.11.0 release of parquet-mr will deprecate its logical type API in
> favour of a newly introduced one. The new API also introduces new subtypes
> for different timestamp semantics, support for which should be added to Spark
> in order to read those types correctly.
> At this point only a release candidate of parquet-mr 1.11.0 is available, but
> that already allows implementing and reviewing this change.
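For context on the "new subtypes for different timestamp semantics" mentioned in the issue, the parquet-format logical types specification distinguishes instant (UTC-adjusted) from local timestamps. An illustrative sketch in Parquet's message syntax; the field names are hypothetical:

```text
message spark_schema {
  // The deprecated API could only express: required int64 ts (TIMESTAMP_MICROS);
  // The new LogicalTypeAnnotation API carries an isAdjustedToUTC flag:
  required int64 ts_instant (TIMESTAMP(MICROS,true));   // instant semantics, adjusted to UTC
  required int64 ts_local   (TIMESTAMP(MICROS,false));  // local semantics, not adjusted
}
```

A reader that ignores the flag and treats both columns alike will silently shift local timestamps, which is why the issue argues Spark should understand the new annotations before consuming files written with them.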
[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758102#comment-16758102 ]

Gera Shegalov commented on SPARK-23155:
---------------------------------------

[~kabhwan], [~vanzin] I would still be interested in being able to use the new mechanism with the old logs. [https://github.com/apache/spark/pull/23720] is a quick draft to demo how we could achieve this flexibly with named capture groups.

> YARN-aggregated executor/driver logs appear unavailable when NM is down
> -------------------------------------------------------------------------
>
>                 Key: SPARK-23155
>                 URL: https://issues.apache.org/jira/browse/SPARK-23155
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 2.2.1
>            Reporter: Gera Shegalov
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.0.0
>
> Unlike the MapReduce JobHistory Server, the Spark history server does not
> rewrite container log URLs to point to the aggregated yarn.log.server.url
> location; it relies on the NodeManager web UI to trigger a redirect. This
> fails when the NM is down. Note that the NM may be down permanently, either
> after decommissioning in traditional environments or in a cloud environment
> such as AWS EMR, where worker nodes are taken away by autoscaling or the
> whole cluster is used to run a single job.
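The named-capture-group approach mentioned in the comment can be sketched independently of the linked PR. A hypothetical Java example, not the actual draft (the URL shapes, host names, and target layout are assumptions for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: a configured regex with named groups pulls the pieces out of an
// old-style NodeManager container log URL, and the pieces are re-assembled
// into a log-server URL that stays valid after the NM goes away.
class LogUrlRewrite {
    // Matches e.g. http://node1:8042/node/containerlogs/container_.../user
    private static final Pattern NM_URL = Pattern.compile(
        "https?://(?<host>[^:/]+):(?<port>\\d+)/node/containerlogs/"
            + "(?<container>[^/]+)/(?<user>[^/]+).*");

    static String rewrite(String nmUrl, String logServerUrl) {
        Matcher m = NM_URL.matcher(nmUrl);
        if (!m.matches()) {
            return nmUrl; // unknown shape: leave the URL untouched
        }
        return String.format("%s/%s:%s/%s/%s/%s",
            logServerUrl, m.group("host"), m.group("port"),
            m.group("container"), m.group("container"), m.group("user"));
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
            "http://node1.example.com:8042/node/containerlogs/container_1_0001_01_000002/spark",
            "http://history.example.com:19888/jobhistory/logs"));
    }
}
```

Because the groups are named rather than positional, the same rewrite code works for any cluster-specific URL pattern the operator configures, which is the flexibility the comment is after.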
[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI
[ https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758095#comment-16758095 ]

Gera Shegalov commented on SPARK-26792:
---------------------------------------

[~kabhwan] thanks for doing this work. I verified that I can configure the SHS so it satisfies our use case. Changing the default in Spark is a nice-to-have but not a high priority from my perspective.

> Apply custom log URL to Spark UI
> --------------------------------
>
>                 Key: SPARK-26792
>                 URL: https://issues.apache.org/jira/browse/SPARK-26792
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 3.0.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> SPARK-23155 enables the SHS to set up custom log URLs for incomplete and
> completed apps.
> While getting reviews on SPARK-23155, I received two comments suggesting that
> applying custom log URLs to the UI would help. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a
> separate jira, but one thing I thought of here is it would be really nice not
> to have specifically stderr/stdout. Users can specify any log4j.properties,
> and some tools like Oozie by default end up using the Hadoop log4j rather
> than the Spark log4j, so the files aren't necessarily the same. Also users
> can put other log files in there, so it would be nice to have links to those
> from the UI. It seems simpler if we just had a link to the directory and it
> read the files within there. Other things in Hadoop do it this way, but I'm
> not sure if that works well for other resource managers, any thoughts on
> that? As long as this doesn't prevent the above I can file a separate jira
> for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump heap on OOM
> using , and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work around this with a custom
> submit process that logs all important URLs in the submit-side log.
> {quote}