[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2018-04-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455734#comment-16455734
 ] 

Dilip Biswal commented on SPARK-21274:
--

[~maropu] Thanks. Yeah, I will have two separate PRs.

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL:
> {code}
> SELECT a,b,c FROM tab1
>   EXCEPT ALL
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a,b,c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>   ON (
>     (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>   )
> WHERE
>   COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temporary view named "*t1_except_t2_df*";
> it can also be used to compute INTERSECT ALL below)
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>   INTERSECT ALL
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df view
> defined above:
> {code}
> SELECT a,b,c
> FROM tab1 t1
> WHERE
>   NOT EXISTS
>     (SELECT 1
>      FROM t1_except_t2_df e
>      WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>     )
> {code}
> So the suggestion is just to use the above query rewrites to implement both
> the EXCEPT ALL and INTERSECT ALL SQL set operations.
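For reference, a minimal Scala sketch of the workflow the description outlines, wiring the two rewrites together through the temporary view. It reuses the exact SQL from the description and assumes tab1 and tab2 are already registered; it is an illustration, not the proposed implementation.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: same table/column names as the description above; assumes
// tab1 and tab2 are already registered as tables or temp views.
val spark = SparkSession.builder().appName("except-intersect-all-rewrite").getOrCreate()

// 1) EXCEPT ALL via the left outer join rewrite, registered as the
//    "t1_except_t2_df" temp view mentioned above.
spark.sql(
  """SELECT t1.a, t1.b, t1.c
    |FROM tab1 t1
    |LEFT OUTER JOIN tab2 t2
    |  ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
    |WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
    |""".stripMargin).createOrReplaceTempView("t1_except_t2_df")

// 2) INTERSECT ALL via the anti-join against that view.
val intersectAll = spark.sql(
  """SELECT a, b, c
    |FROM tab1 t1
    |WHERE NOT EXISTS (
    |  SELECT 1 FROM t1_except_t2_df e
    |  WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
    |)""".stripMargin)

intersectAll.show()
{code}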



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24109) Remove class SnappyOutputStreamWrapper

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24109:


Assignee: (was: Apache Spark)

> Remove class SnappyOutputStreamWrapper
> --
>
> Key: SPARK-24109
> URL: https://issues.apache.org/jira/browse/SPARK-24109
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Input/Output
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: wangjinhai
>Priority: Minor
> Fix For: 2.4.0
>
>
> Wrapper over `SnappyOutputStream` which guards against write-after-close and 
> double-close issues. See SPARK-7660 for more details.
> This wrapping can be removed if we upgrade to a version of snappy-java that 
> contains the fix for [https://github.com/xerial/snappy-java/issues/107].
> {{snappy-java}} 1.1.2+ fixed that bug, so the wrapper class can now be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24109) Remove class SnappyOutputStreamWrapper

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455731#comment-16455731
 ] 

Apache Spark commented on SPARK-24109:
--

User 'manbuyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/21176

> Remove class SnappyOutputStreamWrapper
> --
>
> Key: SPARK-24109
> URL: https://issues.apache.org/jira/browse/SPARK-24109
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Input/Output
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: wangjinhai
>Priority: Minor
> Fix For: 2.4.0
>
>
> Wrapper over `SnappyOutputStream` which guards against write-after-close and 
> double-close issues. See SPARK-7660 for more details.
> This wrapping can be removed if we upgrade to a version of snappy-java that 
> contains the fix for [https://github.com/xerial/snappy-java/issues/107].
> {{snappy-java}} 1.1.2+ fixed that bug, so the wrapper class can now be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24109) Remove class SnappyOutputStreamWrapper

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24109:


Assignee: Apache Spark

> Remove class SnappyOutputStreamWrapper
> --
>
> Key: SPARK-24109
> URL: https://issues.apache.org/jira/browse/SPARK-24109
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Input/Output
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: wangjinhai
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.4.0
>
>
> Wrapper over `SnappyOutputStream` which guards against write-after-close and 
> double-close issues. See SPARK-7660 for more details.
> This wrapping can be removed if we upgrade to a version of snappy-java that 
> contains the fix for [https://github.com/xerial/snappy-java/issues/107].
> {{snappy-java}} 1.1.2+ fixed that bug, so the wrapper class can now be removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24109) Remove class SnappyOutputStreamWrapper

2018-04-26 Thread wangjinhai (JIRA)
wangjinhai created SPARK-24109:
--

 Summary: Remove class SnappyOutputStreamWrapper
 Key: SPARK-24109
 URL: https://issues.apache.org/jira/browse/SPARK-24109
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Input/Output
Affects Versions: 2.3.0, 2.2.1, 2.2.0
Reporter: wangjinhai
 Fix For: 2.4.0


Wrapper over `SnappyOutputStream` which guards against write-after-close and 
double-close issues. See SPARK-7660 for more details.

This wrapping can be removed if we upgrade to a version of snappy-java that 
contains the fix for [https://github.com/xerial/snappy-java/issues/107].

{{snappy-java}} 1.1.2+ fixed that bug, so the wrapper class can now be removed.
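For context, a minimal Scala sketch of the kind of guard the description refers to. This is only an illustration, not Spark's actual SnappyOutputStreamWrapper, and the class name below is made up.

{code:scala}
import java.io.{IOException, OutputStream}

// Illustrative sketch only: fail loudly on write-after-close and silently
// ignore a double-close instead of corrupting the underlying stream.
class CloseGuardedOutputStream(underlying: OutputStream) extends OutputStream {
  private var closed = false

  private def ensureOpen(): Unit =
    if (closed) throw new IOException("Stream is already closed")

  override def write(b: Int): Unit = {
    ensureOpen()
    underlying.write(b)
  }

  override def flush(): Unit = {
    ensureOpen()
    underlying.flush()
  }

  override def close(): Unit = {
    if (!closed) { // ignore double-close
      closed = true
      underlying.close()
    }
  }
}
{code}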



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2018-04-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455716#comment-16455716
 ] 

Takeshi Yamamuro commented on SPARK-21274:
--

OK, thanks! IMO it'd be better to make two separate PRs to implement `EXCEPT 
ALL` and `INTERSECT ALL`.

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL:
> {code}
> SELECT a,b,c FROM tab1
>   EXCEPT ALL
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a,b,c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>   ON (
>     (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>   )
> WHERE
>   COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temporary view named "*t1_except_t2_df*";
> it can also be used to compute INTERSECT ALL below)
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>   INTERSECT ALL
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df view
> defined above:
> {code}
> SELECT a,b,c
> FROM tab1 t1
> WHERE
>   NOT EXISTS
>     (SELECT 1
>      FROM t1_except_t2_df e
>      WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>     )
> {code}
> So the suggestion is just to use the above query rewrites to implement both
> the EXCEPT ALL and INTERSECT ALL SQL set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24108) ChunkedByteBuffer.writeFully method has not reset the limit value

2018-04-26 Thread wangjinhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangjinhai resolved SPARK-24108.

Resolution: Won't Do

> ChunkedByteBuffer.writeFully method has not reset the limit value
> -
>
> Key: SPARK-24108
> URL: https://issues.apache.org/jira/browse/SPARK-24108
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Input/Output
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: wangjinhai
>Priority: Major
> Fix For: 2.4.0
>
>
> The ChunkedByteBuffer.writeFully method does not reset the buffer's limit value. 
> When a chunk is larger than bufferWriteChunkSize, for example 80*1024*1024 bytes 
> versus config.BUFFER_WRITE_CHUNK_SIZE (64*1024*1024), the write loop runs only 
> once and the remaining 16*1024*1024 bytes are lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24107) ChunkedByteBuffer.writeFully method has not reset the limit value

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24107:


Assignee: Apache Spark

> ChunkedByteBuffer.writeFully method has not reset the limit value
> -
>
> Key: SPARK-24107
> URL: https://issues.apache.org/jira/browse/SPARK-24107
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Input/Output
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: wangjinhai
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> The ChunkedByteBuffer.writeFully method does not reset the buffer's limit value. 
> When a chunk is larger than bufferWriteChunkSize, for example 80*1024*1024 bytes 
> versus config.BUFFER_WRITE_CHUNK_SIZE (64*1024*1024), the write loop runs only 
> once and the remaining 16*1024*1024 bytes are lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24107) ChunkedByteBuffer.writeFully method has not reset the limit value

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24107:


Assignee: (was: Apache Spark)

> ChunkedByteBuffer.writeFully method has not reset the limit value
> -
>
> Key: SPARK-24107
> URL: https://issues.apache.org/jira/browse/SPARK-24107
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Input/Output
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: wangjinhai
>Priority: Major
> Fix For: 2.4.0
>
>
> The ChunkedByteBuffer.writeFully method does not reset the buffer's limit value. 
> When a chunk is larger than bufferWriteChunkSize, for example 80*1024*1024 bytes 
> versus config.BUFFER_WRITE_CHUNK_SIZE (64*1024*1024), the write loop runs only 
> once and the remaining 16*1024*1024 bytes are lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24107) ChunkedByteBuffer.writeFully method has not reset the limit value

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455700#comment-16455700
 ] 

Apache Spark commented on SPARK-24107:
--

User 'manbuyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/21175

> ChunkedByteBuffer.writeFully method has not reset the limit value
> -
>
> Key: SPARK-24107
> URL: https://issues.apache.org/jira/browse/SPARK-24107
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Input/Output
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: wangjinhai
>Priority: Major
> Fix For: 2.4.0
>
>
> The ChunkedByteBuffer.writeFully method does not reset the buffer's limit value. 
> When a chunk is larger than bufferWriteChunkSize, for example 80*1024*1024 bytes 
> versus config.BUFFER_WRITE_CHUNK_SIZE (64*1024*1024), the write loop runs only 
> once and the remaining 16*1024*1024 bytes are lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2018-04-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455697#comment-16455697
 ] 

Dilip Biswal commented on SPARK-21274:
--

[~maropu] I am currently testing the code. I will open the PRs as soon as I 
finish.

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL:
> {code}
> SELECT a,b,c FROM tab1
>   EXCEPT ALL
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a,b,c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>   ON (
>     (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>   )
> WHERE
>   COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temporary view named "*t1_except_t2_df*";
> it can also be used to compute INTERSECT ALL below)
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>   INTERSECT ALL
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df view
> defined above:
> {code}
> SELECT a,b,c
> FROM tab1 t1
> WHERE
>   NOT EXISTS
>     (SELECT 1
>      FROM t1_except_t2_df e
>      WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>     )
> {code}
> So the suggestion is just to use the above query rewrites to implement both
> the EXCEPT ALL and INTERSECT ALL SQL set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24108) ChunkedByteBuffer.writeFully method has not reset the limit value

2018-04-26 Thread wangjinhai (JIRA)
wangjinhai created SPARK-24108:
--

 Summary: ChunkedByteBuffer.writeFully method has not reset the 
limit value
 Key: SPARK-24108
 URL: https://issues.apache.org/jira/browse/SPARK-24108
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Input/Output
Affects Versions: 2.3.0, 2.2.1, 2.2.0
Reporter: wangjinhai
 Fix For: 2.4.0


The ChunkedByteBuffer.writeFully method does not reset the buffer's limit value. 
When a chunk is larger than bufferWriteChunkSize, for example 80*1024*1024 bytes 
versus config.BUFFER_WRITE_CHUNK_SIZE (64*1024*1024), the write loop runs only 
once and the remaining 16*1024*1024 bytes are lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24107) ChunkedByteBuffer.writeFully method has not reset the limit value

2018-04-26 Thread wangjinhai (JIRA)
wangjinhai created SPARK-24107:
--

 Summary: ChunkedByteBuffer.writeFully method has not reset the 
limit value
 Key: SPARK-24107
 URL: https://issues.apache.org/jira/browse/SPARK-24107
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Input/Output
Affects Versions: 2.3.0, 2.2.1, 2.2.0
Reporter: wangjinhai
 Fix For: 2.4.0


The ChunkedByteBuffer.writeFully method does not reset the buffer's limit value. 
When a chunk is larger than bufferWriteChunkSize, for example 80*1024*1024 bytes 
versus config.BUFFER_WRITE_CHUNK_SIZE (64*1024*1024), the write loop runs only 
once and the remaining 16*1024*1024 bytes are lost.
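To make the failure mode concrete, here is a minimal standalone sketch (not Spark's actual code) of writing a ByteBuffer in bounded chunks by moving its limit. The bug described above corresponds to not restoring the limit after the first chunk, which silently drops the remaining bytes.

{code:scala}
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// Standalone sketch of the chunked-write pattern described above. The key
// point is restoring the buffer's limit after each bounded chunk; skipping
// that restore reproduces the reported data loss for buffers larger than
// the chunk size.
def writeFully(channel: WritableByteChannel, bytes: ByteBuffer, chunkSize: Int): Unit = {
  val originalLimit = bytes.limit()
  while (bytes.hasRemaining) {
    // Cap the writable window at chunkSize bytes from the current position.
    val chunkEnd = math.min(bytes.position() + chunkSize, originalLimit)
    bytes.limit(chunkEnd)
    while (bytes.hasRemaining) {
      channel.write(bytes)
    }
    // Restore the limit so the next iteration can see the rest of the buffer.
    bytes.limit(originalLimit)
  }
}
{code}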



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23355) convertMetastore should not ignore table properties

2018-04-26 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23355.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20522
[https://github.com/apache/spark/pull/20522]

> convertMetastore should not ignore table properties
> ---
>
> Key: SPARK-23355
> URL: https://issues.apache.org/jira/browse/SPARK-23355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> `convertMetastoreOrc/Parquet` ignores table properties.
> This happens for a table created by `STORED AS ORC/PARQUET`.
> Currently, `convertMetastoreParquet` is `true` by default, so it's a 
> well-known user-facing issue. `convertMetastoreOrc` has been `false`, and 
> SPARK-22279 tried to turn it on for feature parity with Parquet.
> However, that is indeed a regression for existing Hive ORC table users, so 
> it will be reverted for Apache Spark 2.3 via 
> https://github.com/apache/spark/pull/20536.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23355) convertMetastore should not ignore table properties

2018-04-26 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23355:
---

Assignee: Dongjoon Hyun

> convertMetastore should not ignore table properties
> ---
>
> Key: SPARK-23355
> URL: https://issues.apache.org/jira/browse/SPARK-23355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> `convertMetastoreOrc/Parquet` ignores table properties.
> This happens for a table created by `STORED AS ORC/PARQUET`.
> Currently, `convertMetastoreParquet` is `true` by default, so it's a 
> well-known user-facing issue. `convertMetastoreOrc` has been `false`, and 
> SPARK-22279 tried to turn it on for feature parity with Parquet.
> However, that is indeed a regression for existing Hive ORC table users, so 
> it will be reverted for Apache Spark 2.3 via 
> https://github.com/apache/spark/pull/20536.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.

2018-04-26 Thread caijie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caijie updated SPARK-24100:
---
Component/s: (was: DStreams)

> Add the CompressionCodec to the saveAsTextFiles interface.
> --
>
> Key: SPARK-24100
> URL: https://issues.apache.org/jira/browse/SPARK-24100
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: caijie
>Priority: Minor
>
> Add the CompressionCodec to the saveAsTextFiles interface.
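For context, the core Scala RDD API already accepts a compression codec class; a minimal sketch of that existing overload is shown below (the output path is made up). The request here is to expose an equivalent parameter through the PySpark saveAsTextFiles interface.

{code:scala}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.sql.SparkSession

// Existing Scala API: RDD.saveAsTextFile has an overload that takes a
// CompressionCodec class; the improvement above asks for the same ability
// in PySpark's saveAsTextFiles.
val spark = SparkSession.builder().appName("save-with-codec").getOrCreate()
val sc = spark.sparkContext

sc.parallelize(Seq("a", "b", "c"))
  .saveAsTextFile("/tmp/save-with-codec-output", classOf[GzipCodec])
{code}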



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object

2018-04-26 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455621#comment-16455621
 ] 

Saisai Shao commented on SPARK-23830:
-

I agree [~emaynard].

> Spark on YARN in cluster deploy mode fail with NullPointerException when a 
> Spark application is a Scala class not object
> 
>
> Key: SPARK-23830
> URL: https://issues.apache.org/jira/browse/SPARK-23830
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As reported on StackOverflow in [Why does Spark on YARN fail with “Exception 
> in thread ”Driver“ 
> java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344]
>  the following Spark application fails with {{Exception in thread "Driver" 
> java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode:
> {code}
> class MyClass {
>   def main(args: Array[String]): Unit = {
> val c = new MyClass()
> c.process()
>   }
>   def process(): Unit = {
> val sparkConf = new SparkConf().setAppName("my-test")
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sparkConf).getOrCreate()
> import sparkSession.implicits._
> 
>   }
>   ...
> }
> {code}
> The exception is as follows:
> {code}
> 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a 
> separate Thread
> 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context 
> initialization...
> Exception in thread "Driver" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {code}
> I think the exception {{Exception in thread "Driver" 
> java.lang.NullPointerException}} is caused by [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]:
> {code}
> val mainMethod = userClassLoader.loadClass(args.userClass)
>   .getMethod("main", classOf[Array[String]])
> {code}
> For a Scala class (rather than an object), {{main}} is an instance method, not 
> a static one, so when {{mainMethod}} is invoked with a {{null}} receiver in 
> [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706]
>  it simply throws an NPE.
> {code}
> mainMethod.invoke(null, userArgs.toArray)
> {code}
> That could easily be avoided with an extra check on {{mainMethod}} that gives 
> the user a message about the likely reason.
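As an illustration of the suggested extra check, a hedged sketch (not the actual Spark change; the helper name is made up) that verifies the reflected main method is static before invoking it:

{code:scala}
import java.lang.reflect.{Method, Modifier}

// Sketch only: the kind of extra check the description suggests before the
// `mainMethod.invoke(null, ...)` call shown above.
def invokeStaticMain(mainMethod: Method, userArgs: Seq[String]): Unit = {
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    // A Scala `object` compiles to a class with a static main forwarder; a
    // plain Scala class does not, so invoking main with a null receiver is
    // what produces the NullPointerException reported in this issue.
    throw new IllegalStateException(
      s"${mainMethod.getDeclaringClass.getName} does not declare a static main method; " +
        "use a Scala object (or a class with a static main) as the application entry point.")
  }
  mainMethod.invoke(null, userArgs.toArray)
}
{code}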



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24099) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data in JSON

2018-04-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24099.
--
Resolution: Duplicate

It will be fixed in SPARK-23723 soon.

> java.io.CharConversionException: Invalid UTF-32 character prevents me from 
> querying my data in JSON
> ---
>
> Key: SPARK-24099
> URL: https://issues.apache.org/jira/browse/SPARK-24099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE 11
> Spark Version: 2.3
>  
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Steps:
>  # Launch spark-sql --master yarn
>  # create table json(name STRING, age int, gender string, id INT) using 
> org.apache.spark.sql.json options(path "hdfs:///user/testdemo/");
>  # Execute the below SQL queries 
> INSERT into json
> SELECT 'Shaan',21,'Male',1
> UNION ALL
> SELECT 'Xing',20,'Female',11
> UNION ALL
> SELECT 'Mile',4,'Female',20
> UNION ALL
> SELECT 'Malan',10,'Male',9;
>  # Select * from json;
> This throws the exception below:
> Caused by: *java.io.CharConversionException: Invalid UTF-32 character* 
> 0x151a15(above 10) at char #1, byte #7)
>  at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
>  at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017)
>  at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577)
>  at 
> org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:350)
>  at 
> org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:347)
>  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
>  at 
> org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:347)
>  at 
> org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:126)
>  at 
> org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:126)
>  at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
>  at 
> org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:130)
>  at 
> org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:130)
>  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  
> Note:
> https://issues.apache.org/jira/browse/SPARK-16548 was raised against 1.6 and 
> reported as fixed in 2.3, but I am still getting the same error.
> Please update on this.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-04-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455611#comment-16455611
 ] 

Hyukjin Kwon edited comment on SPARK-24068 at 4/27/18 12:51 AM:


I roughly assume the fix will be small, similar or the same? I think it's fine 
to describe both issues here.


was (Author: hyukjin.kwon):
I roughly assume the fix will be small, similar or the same? I think it's fix 
to describe both issues here.

> CSV schema inferring doesn't work for compressed files
> --
>
> Key: SPARK-24068
> URL: https://issues.apache.org/jira/browse/SPARK-24068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Here is a simple csv file compressed by lzo
> {code}
> $ cat ./test.csv
> col1,col2
> a,1
> $ lzop ./test.csv
> $ ls
> test.csv test.csv.lzo
> {code}
> Reading test.csv.lzo with LZO codec (see 
> https://github.com/twitter/hadoop-lzo, for example):
> {code:scala}
> scala> val ds = spark.read.option("header", true).option("inferSchema", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
> ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
> scala> ds.printSchema
> root
>  |-- �LZO: string (nullable = true)
> scala> ds.show
> +----+
> |�LZO|
> +----+
> |   a|
> +----+
> {code}
> but the file can be read if the schema is specified:
> {code}
> scala> import org.apache.spark.sql.types._
> scala> val schema = new StructType().add("col1", StringType).add("col2", 
> IntegerType)
> scala> val ds = spark.read.schema(schema).option("header", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
> scala> ds.show
> +----+----+
> |col1|col2|
> +----+----+
> |   a|   1|
> +----+----+
> {code}
> Just in case, schema inferring works for the original uncompressed file:
> {code:scala}
> scala> spark.read.option("header", true).option("inferSchema", 
> true).csv("test.csv").printSchema
> root
>  |-- col1: string (nullable = true)
>  |-- col2: integer (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-04-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455611#comment-16455611
 ] 

Hyukjin Kwon commented on SPARK-24068:
--

I roughly assume the fix will be small, similar or the same? I think it's fine 
to describe both issues here.

> CSV schema inferring doesn't work for compressed files
> --
>
> Key: SPARK-24068
> URL: https://issues.apache.org/jira/browse/SPARK-24068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Here is a simple csv file compressed by lzo
> {code}
> $ cat ./test.csv
> col1,col2
> a,1
> $ lzop ./test.csv
> $ ls
> test.csv test.csv.lzo
> {code}
> Reading test.csv.lzo with LZO codec (see 
> https://github.com/twitter/hadoop-lzo, for example):
> {code:scala}
> scala> val ds = spark.read.option("header", true).option("inferSchema", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
> ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
> scala> ds.printSchema
> root
>  |-- �LZO: string (nullable = true)
> scala> ds.show
> +----+
> |�LZO|
> +----+
> |   a|
> +----+
> {code}
> but the file can be read if the schema is specified:
> {code}
> scala> import org.apache.spark.sql.types._
> scala> val schema = new StructType().add("col1", StringType).add("col2", 
> IntegerType)
> scala> val ds = spark.read.schema(schema).option("header", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
> scala> ds.show
> +----+----+
> |col1|col2|
> +----+----+
> |   a|   1|
> +----+----+
> {code}
> Just in case, schema inferring works for the original uncompressed file:
> {code:scala}
> scala> spark.read.option("header", true).option("inferSchema", 
> true).csv("test.csv").printSchema
> root
>  |-- col1: string (nullable = true)
>  |-- col2: integer (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23925) High-order function: repeat(element, count) → array

2018-04-26 Thread Florent Pepin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455506#comment-16455506
 ] 

Florent Pepin commented on SPARK-23925:
---

Hey, sorry, I didn't realise the complexity of this one, mainly because of the 
existing repeat function that takes in a String and outputs a String. I've 
first created an array-specific function that works fine; now I'm trying to see 
how I can consolidate it with the existing String one into a single function.

> High-order function: repeat(element, count) → array
> ---
>
> Key: SPARK-23925
> URL: https://issues.apache.org/jira/browse/SPARK-23925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Repeat element for count times.
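For clarity, the intended semantics (following the Presto function referenced above) can be illustrated with plain Scala collections. This is only an illustration of the behaviour, not the Spark implementation.

{code:scala}
// Plain-Scala illustration of repeat(element, count) -> array semantics,
// as opposed to the existing string repeat, which concatenates a String.
def repeatElement[T](element: T, count: Int): Seq[T] = Seq.fill(count)(element)

repeatElement("123", 3)  // Seq("123", "123", "123")
repeatElement(2, 4)      // Seq(2, 2, 2, 2)
{code}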



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24083) Diagnostics message for uncaught exceptions should include the stacktrace

2018-04-26 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24083:
--

Assignee: zhoukang

> Diagnostics message for uncaught exceptions should include the stacktrace
> -
>
> Key: SPARK-24083
> URL: https://issues.apache.org/jira/browse/SPARK-24083
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Minor
> Fix For: 2.4.0
>
>
> Like [SPARK-23296|https://issues.apache.org/jira/browse/SPARK-23296].
> For uncaught exceptions, we should also print the stacktrace.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24083) Diagnostics message for uncaught exceptions should include the stacktrace

2018-04-26 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24083.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21151
[https://github.com/apache/spark/pull/21151]

> Diagnostics message for uncaught exceptions should include the stacktrace
> -
>
> Key: SPARK-24083
> URL: https://issues.apache.org/jira/browse/SPARK-24083
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Minor
> Fix For: 2.4.0
>
>
> Like [SPARK-23296|https://issues.apache.org/jira/browse/SPARK-23296].
> For uncaught exceptions, we should also print the stacktrace.
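For illustration, a minimal sketch (helper name made up, not the actual Spark change) of folding a stacktrace into a diagnostics string:

{code:scala}
import java.io.{PrintWriter, StringWriter}

// Illustrative helper only: render an uncaught exception's stacktrace into
// the diagnostics message instead of reporting just the exception itself.
def diagnosticsWithStackTrace(message: String, e: Throwable): String = {
  val sw = new StringWriter()
  e.printStackTrace(new PrintWriter(sw, true))
  s"$message\n$sw"
}
{code}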



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2018-04-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455470#comment-16455470
 ] 

Takeshi Yamamuro commented on SPARK-21274:
--

Looks great to me. I checked that the queries above worked well. Any plan to 
make a PR?

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> 1) *EXCEPT ALL* / MINUS ALL:
> {code}
> SELECT a,b,c FROM tab1
>   EXCEPT ALL
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a,b,c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>   ON (
>     (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>   )
> WHERE
>   COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temporary view named "*t1_except_t2_df*";
> it can also be used to compute INTERSECT ALL below)
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>   INTERSECT ALL
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df view
> defined above:
> {code}
> SELECT a,b,c
> FROM tab1 t1
> WHERE
>   NOT EXISTS
>     (SELECT 1
>      FROM t1_except_t2_df e
>      WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>     )
> {code}
> So the suggestion is just to use the above query rewrites to implement both
> the EXCEPT ALL and INTERSECT ALL SQL set operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24085) Scalar subquery error

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24085:


Assignee: (was: Apache Spark)

> Scalar subquery error
> -
>
> Key: SPARK-24085
> URL: https://issues.apache.org/jira/browse/SPARK-24085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Alexey Baturin
>Priority: Major
>
> Error
> {noformat}
> SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: scalar-subquery{noformat}
> This occurs when querying a partitioned table based on a parquet file and 
> filtering by a partition column with a scalar subquery.
> Query to reproduce:
> {code:sql}
> CREATE TABLE test_prc_bug (
> id_value string
> )
> partitioned by (id_type string)
> location '/tmp/test_prc_bug'
> stored as parquet;
> insert into test_prc_bug values ('1','a');
> insert into test_prc_bug values ('2','a');
> insert into test_prc_bug values ('3','b');
> insert into test_prc_bug values ('4','b');
> select * from test_prc_bug
> where id_type = (select 'b');
> {code}
> If the table is in ORC format, it works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24085) Scalar subquery error

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24085:


Assignee: Apache Spark

> Scalar subquery error
> -
>
> Key: SPARK-24085
> URL: https://issues.apache.org/jira/browse/SPARK-24085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Alexey Baturin
>Assignee: Apache Spark
>Priority: Major
>
> Error
> {noformat}
> SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: scalar-subquery{noformat}
> This occurs when querying a partitioned table based on a parquet file and 
> filtering by a partition column with a scalar subquery.
> Query to reproduce:
> {code:sql}
> CREATE TABLE test_prc_bug (
> id_value string
> )
> partitioned by (id_type string)
> location '/tmp/test_prc_bug'
> stored as parquet;
> insert into test_prc_bug values ('1','a');
> insert into test_prc_bug values ('2','a');
> insert into test_prc_bug values ('3','b');
> insert into test_prc_bug values ('4','b');
> select * from test_prc_bug
> where id_type = (select 'b');
> {code}
> If the table is in ORC format, it works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24085) Scalar subquery error

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455468#comment-16455468
 ] 

Apache Spark commented on SPARK-24085:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/21174

> Scalar subquery error
> -
>
> Key: SPARK-24085
> URL: https://issues.apache.org/jira/browse/SPARK-24085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Alexey Baturin
>Priority: Major
>
> Error
> {noformat}
> SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: scalar-subquery{noformat}
> This occurs when querying a partitioned table based on a parquet file and 
> filtering by a partition column with a scalar subquery.
> Query to reproduce:
> {code:sql}
> CREATE TABLE test_prc_bug (
> id_value string
> )
> partitioned by (id_type string)
> location '/tmp/test_prc_bug'
> stored as parquet;
> insert into test_prc_bug values ('1','a');
> insert into test_prc_bug values ('2','a');
> insert into test_prc_bug values ('3','b');
> insert into test_prc_bug values ('4','b');
> select * from test_prc_bug
> where id_type = (select 'b');
> {code}
> If the table is in ORC format, it works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24044) Explicitly print out skipped tests from unittest module

2018-04-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24044.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21107
[https://github.com/apache/spark/pull/21107]

> Explicitly print out skipped tests from unittest module
> ---
>
> Key: SPARK-24044
> URL: https://issues.apache.org/jira/browse/SPARK-24044
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> There was an actual issue, SPARK-23300, which we fixed by manually checking 
> whether the package is installed. That approach required duplicated code and 
> could only check dependencies, while there are many other conditions to 
> handle, for example Python-version-specific tests or other packages such as 
> NumPy. I think this is something we should fix.
> The `unittest` module can print out the skipped-test messages, but so far 
> these were swallowed by our own testing script.
> It would be nicer to remove the duplication and print out all the skipped 
> tests. This PR proposes to remove the duplicated dependency-checking logic 
> and print the skipped tests from unittest, sorted, as below:
> {code}
> Skipped tests in pyspark.sql.tests with pypy:
> test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) 
> ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
> test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) 
> ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
> ...
> Skipped tests in pyspark.sql.tests with python3:
> test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) 
> ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
> test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) 
> ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
> ...
> {code}
> Actual format can be a bit varied per the discussion in the PR. Please check 
> out the PR for exact format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24044) Explicitly print out skipped tests from unittest module

2018-04-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24044:


Assignee: Hyukjin Kwon

> Explicitly print out skipped tests from unittest module
> ---
>
> Key: SPARK-24044
> URL: https://issues.apache.org/jira/browse/SPARK-24044
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> There was an actual issue, SPARK-23300, which we fixed by manually checking 
> whether the package is installed. That approach required duplicated code and 
> could only check dependencies, while there are many other conditions to 
> handle, for example Python-version-specific tests or other packages such as 
> NumPy. I think this is something we should fix.
> The `unittest` module can print out the skipped-test messages, but so far 
> these were swallowed by our own testing script.
> It would be nicer to remove the duplication and print out all the skipped 
> tests. This PR proposes to remove the duplicated dependency-checking logic 
> and print the skipped tests from unittest, sorted, as below:
> {code}
> Skipped tests in pyspark.sql.tests with pypy:
> test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) 
> ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
> test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) 
> ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
> ...
> Skipped tests in pyspark.sql.tests with python3:
> test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) 
> ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
> test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) 
> ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
> ...
> {code}
> Actual format can be a bit varied per the discussion in the PR. Please check 
> out the PR for exact format.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23856) Spark jdbc setQueryTimeout option

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23856:


Assignee: (was: Apache Spark)

> Spark jdbc setQueryTimeout option
> -
>
> Key: SPARK-23856
> URL: https://issues.apache.org/jira/browse/SPARK-23856
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dmitry Mikhailov
>Priority: Minor
>
> It would be nice if a user could set the JDBC setQueryTimeout option when 
> running JDBC queries in Spark. I think it can be easily implemented by adding 
> an option field to the _JDBCOptions_ class and applying it when initializing 
> JDBC statements in the _JDBCRDD_ class. However, only some DB vendors support 
> this JDBC feature. Is it worth starting work on this option?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23856) Spark jdbc setQueryTimeout option

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455342#comment-16455342
 ] 

Apache Spark commented on SPARK-23856:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21173

> Spark jdbc setQueryTimeout option
> -
>
> Key: SPARK-23856
> URL: https://issues.apache.org/jira/browse/SPARK-23856
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dmitry Mikhailov
>Priority: Minor
>
> It would be nice if a user could set the JDBC setQueryTimeout option when 
> running JDBC queries in Spark. I think it can be easily implemented by adding 
> an option field to the _JDBCOptions_ class and applying it when initializing 
> JDBC statements in the _JDBCRDD_ class. However, only some DB vendors support 
> this JDBC feature. Is it worth starting work on this option?
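For illustration, a hedged sketch of how such an option might look from the user side. The "queryTimeout" option key and the connection details are assumptions for this sketch; java.sql.Statement.setQueryTimeout is the established JDBC API it would map to.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical usage of the proposed option; the "queryTimeout" key and the
// connection details below are illustrative assumptions, not an existing API.
val spark = SparkSession.builder().appName("jdbc-query-timeout").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "sales")
  .option("user", "report")
  .option("queryTimeout", "30")  // proposed: seconds before the statement is cancelled
  .load()

df.show()
{code}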



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23856) Spark jdbc setQueryTimeout option

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23856:


Assignee: Apache Spark

> Spark jdbc setQueryTimeout option
> -
>
> Key: SPARK-23856
> URL: https://issues.apache.org/jira/browse/SPARK-23856
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dmitry Mikhailov
>Assignee: Apache Spark
>Priority: Minor
>
> It would be nice if a user could set the JDBC setQueryTimeout option when 
> running JDBC queries in Spark. I think it can be easily implemented by adding 
> an option field to the _JDBCOptions_ class and applying it when initializing 
> JDBC statements in the _JDBCRDD_ class. However, only some DB vendors support 
> this JDBC feature. Is it worth starting work on this option?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-04-26 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454918#comment-16454918
 ] 

Bruce Robbins commented on SPARK-23580:
---

Should SortPrefix also get this treatment?

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does not 
> work or blows past the JVM class limits: we currently cannot gracefully fall 
> back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. The 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset-related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24057) put the real data type in the AssertionError message

2018-04-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24057.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21159
[https://github.com/apache/spark/pull/21159]

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py, line 405, in __init__; it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it was meant to be
> {{assert isinstance(dataType, DataType), "%s should be %s" % (dataType, 
> DataType)}}
> so the error message would be something like
> {{AssertionError: <the actual dataType> should be <class 
> 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24057) put the real data type in the AssertionError message

2018-04-26 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24057:


Assignee: Huaxin Gao

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py, line 405, in __init__; it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it was meant to be
> {{assert isinstance(dataType, DataType), "%s should be %s" % (dataType, 
> DataType)}}
> so the error message would be something like
> {{AssertionError: <the actual dataType> should be <class 
> 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24105) Spark 2.3.0 on kubernetes

2018-04-26 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454847#comment-16454847
 ] 

Anirudh Ramanathan commented on SPARK-24105:


> To avoid this deadlock, it's required to support node selector (in future 
> affinity/anti-affinity) configuration by driver & executor.

Would inter-pod anti-affinity be a better bet here for this use-case?
In the extreme case, this is a gang scheduling issue IMO, where we don't want 
to schedule drivers if there are no executors that can be scheduled.
There's some work on gang scheduling ongoing in 
https://github.com/kubernetes/kubernetes/issues/61012 under sig-scheduling.

> Spark 2.3.0 on kubernetes
> -
>
> Key: SPARK-24105
> URL: https://issues.apache.org/jira/browse/SPARK-24105
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> Right now it is only possible to define node selector configurations through 
> spark.kubernetes.node.selector.[labelKey]. This gets used for both driver 
> & executor pods. Without the capability to isolate driver & executor pods, 
> the cluster can run into a livelock scenario: if there are a lot of Spark 
> submits, the driver pods can fill up the cluster capacity, leaving no room 
> for executor pods to do any work.
>  
> To avoid this deadlock, it is required to support separate node selector (and, 
> in future, affinity/anti-affinity) configuration for the driver & executors.
>  
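For context, a sketch of how the shared selector is configured today versus the requested split. The driver/executor-specific keys below are hypothetical; only spark.kubernetes.node.selector.[labelKey] exists per the description.

{code:scala}
import org.apache.spark.SparkConf

// Today's shared setting (from the description): applies to BOTH driver and
// executor pods, which is what allows drivers alone to fill the cluster.
val conf = new SparkConf()
  .set("spark.kubernetes.node.selector.nodePool", "spark")

// Requested capability, with hypothetical key names: separate selectors so
// driver and executor pods can be isolated onto different node pools.
//   .set("spark.kubernetes.driver.node.selector.nodePool", "spark-drivers")
//   .set("spark.kubernetes.executor.node.selector.nodePool", "spark-executors")
{code}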



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

2018-04-26 Thread Tamilselvan Veeramani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tamilselvan Veeramani updated SPARK-24106:
--
Target Version/s: 2.3.0, 2.2.1  (was: 2.2.1, 2.3.0)
  Issue Type: Improvement  (was: Bug)

> Spark Structure Streaming with RF model taking long time in processing 
> probability for each mini batch
> --
>
> Key: SPARK-24106
> URL: https://issues.apache.org/jira/browse/SPARK-24106
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topic with 6 partition
>Reporter: Tamilselvan Veeramani
>Priority: Major
>  Labels: performance
> Fix For: 2.3.0, 2.4.0
>
>
> RandomForestClassificationModel broadcasted to executors for every mini batch 
> in spark streaming while try to find probability
> RF model size 45MB
> spark kafka streaming job jar size 8 MB (including kafka dependency’s)
> following log show model broad cast to executors for every mini batch when we 
> call rf_model.transform(dataset).select("probability").
> due to which task deserialization time also increases comes to 6 to 7 second 
> for 45MB of rf model, although processing time is just 400 to 600 ms for mini 
> batch
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: 
> KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
>  18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory 
> on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
> After 2 to 3 weeks of struggle, I found a potentially solution which will 
> help many people who is looking to use RF model for “probability” in real 
> time streaming context
> Since RandomForestClassificationModel class of transformImpl method 
> implements only “prediction” in current version of spark. Which can be 
> leveraged to implement “probability” also in RandomForestClassificationModel 
> class of transformImpl method.
> I have modified the code and implemented in our server and it’s working as 
> fast as 400ms to 500ms for every mini batch
> I see many people our there facing this issue and no solution provided in any 
> of the forums, Can you please review and put this fix in next release ? thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing

2018-04-26 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454788#comment-16454788
 ] 

Jose Torres commented on SPARK-24036:
-

https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE

I wrote a quick doc summarizing my thoughts. TLDR is:
 * I think it's better to not reuse the existing shuffle infrastructure - we'll 
have to do more work to get good performance later, but current shuffle has 
very bad characteristics for what continuous processing is trying to do. In 
particular I doubt we'd be able to maintain millisecond-scale latency with 
anything like UnsafeShuffleWriter.
 * It's a small diff on top of a working shuffle to support exactly-once state 
management. I don't think the coordinator needs to worry about stateful 
operators; a writer will never commit if a stateful operator below it fails to 
checkpoint, and the stateful operator itself can rewind if it commits an epoch 
that ends up failing.

Let me know what you two think. I'll send this out to the dev list if it looks 
reasonable, and then we can start thinking about how this breaks down into 
individual tasks.

> Stateful operators in continuous processing
> ---
>
> Key: SPARK-24036
> URL: https://issues.apache.org/jira/browse/SPARK-24036
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> The first iteration of continuous processing in Spark 2.3 does not work with 
> stateful operators.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

2018-04-26 Thread Tamilselvan Veeramani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tamilselvan Veeramani updated SPARK-24106:
--
Target Version/s: 2.3.0, 2.2.1  (was: 2.2.1, 2.3.0)
 Component/s: (was: MLlib)
  ML

> Spark Structure Streaming with RF model taking long time in processing 
> probability for each mini batch
> --
>
> Key: SPARK-24106
> URL: https://issues.apache.org/jira/browse/SPARK-24106
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topic with 6 partition
>Reporter: Tamilselvan Veeramani
>Priority: Major
>  Labels: performance
> Fix For: 2.3.0, 2.4.0
>
>
> RandomForestClassificationModel broadcasted to executors for every mini batch 
> in spark streaming while try to find probability
> RF model size 45MB
> spark kafka streaming job jar size 8 MB (including kafka dependency’s)
> following log show model broad cast to executors for every mini batch when we 
> call rf_model.transform(dataset).select("probability").
> due to which task deserialization time also increases comes to 6 to 7 second 
> for 45MB of rf model, although processing time is just 400 to 600 ms for mini 
> batch
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: 
> KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
>  18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory 
> on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
> After 2 to 3 weeks of struggle, I found a potentially solution which will 
> help many people who is looking to use RF model for “probability” in real 
> time streaming context
> Since RandomForestClassificationModel class of transformImpl method 
> implements only “prediction” in current version of spark. Which can be 
> leveraged to implement “probability” also in RandomForestClassificationModel 
> class of transformImpl method.
> I have modified the code and implemented in our server and it’s working as 
> fast as 400ms to 500ms for every mini batch
> I see many people our there facing this issue and no solution provided in any 
> of the forums, Can you please review and put this fix in next release ? thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

2018-04-26 Thread Tamilselvan Veeramani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454774#comment-16454774
 ] 

Tamilselvan Veeramani commented on SPARK-24106:
---

I have the code change ready and tested on our server. I can provide it for your 
review.

> Spark Structure Streaming with RF model taking long time in processing 
> probability for each mini batch
> --
>
> Key: SPARK-24106
> URL: https://issues.apache.org/jira/browse/SPARK-24106
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
> Environment: Spark yarn / Standalone cluster
> 2 master nodes - 32 cores - 124 GB
> 9 worker nodes - 32 cores - 124 GB
> Kafka input and output topic with 6 partition
>Reporter: Tamilselvan Veeramani
>Priority: Major
>  Labels: performance
> Fix For: 2.3.0, 2.4.0
>
>
> RandomForestClassificationModel broadcasted to executors for every mini batch 
> in spark streaming while try to find probability
> RF model size 45MB
> spark kafka streaming job jar size 8 MB (including kafka dependency’s)
> following log show model broad cast to executors for every mini batch when we 
> call rf_model.transform(dataset).select("probability").
> due to which task deserialization time also increases comes to 6 to 7 second 
> for 45MB of rf model, although processing time is just 400 to 600 ms for mini 
> batch
> 18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: 
> KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
>  
> KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
>  18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
> 18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory 
> on xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)
> After 2 to 3 weeks of struggle, I found a potentially solution which will 
> help many people who is looking to use RF model for “probability” in real 
> time streaming context
> Since RandomForestClassificationModel class of transformImpl method 
> implements only “prediction” in current version of spark. Which can be 
> leveraged to implement “probability” also in RandomForestClassificationModel 
> class of transformImpl method.
> I have modified the code and implemented in our server and it’s working as 
> fast as 400ms to 500ms for every mini batch
> I see many people our there facing this issue and no solution provided in any 
> of the forums, Can you please review and put this fix in next release ? thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24106) Spark Structure Streaming with RF model taking long time in processing probability for each mini batch

2018-04-26 Thread Tamilselvan Veeramani (JIRA)
Tamilselvan Veeramani created SPARK-24106:
-

 Summary: Spark Structure Streaming with RF model taking long time 
in processing probability for each mini batch
 Key: SPARK-24106
 URL: https://issues.apache.org/jira/browse/SPARK-24106
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.3.0, 2.2.1, 2.2.0
 Environment: Spark yarn / Standalone cluster
2 master nodes - 32 cores - 124 GB
9 worker nodes - 32 cores - 124 GB
Kafka input and output topic with 6 partition
Reporter: Tamilselvan Veeramani
 Fix For: 2.4.0, 2.3.0


The RandomForestClassificationModel is broadcast to the executors for every mini 
batch in Spark Streaming when we try to compute the probability column.

RF model size: 45 MB
Spark Kafka streaming job jar size: 8 MB (including Kafka dependencies)

The following log shows the model being broadcast to the executors for every mini 
batch when we call rf_model.transform(dataset).select("probability").
Because of this, task deserialization time also rises to 6 to 7 seconds for the 
45 MB RF model, although the processing time is just 400 to 600 ms per mini 
batch.

18/04/15 03:21:23 INFO KafkaSource: GetBatch generating RDD of offset range: 
KafkaSourceRDDOffsetRange(Kafka_input_topic-0,242,242,Some(executor_xx.xxx.xx.110_2)),
 
KafkaSourceRDDOffsetRange(Kafka_input_topic-1,239,239,Some(executor_xx.xxx.xx.107_0)),
 
KafkaSourceRDDOffsetRange(Kafka_input_topic-2,241,241,Some(executor_xx.xxx.xx.102_3)),
 
KafkaSourceRDDOffsetRange(Kafka_input_topic-3,238,239,Some(executor_xx.xxx.xx.138_4)),
 
KafkaSourceRDDOffsetRange(Kafka_input_topic-4,240,240,Some(executor_xx.xxx.xx.137_1)),
 
KafkaSourceRDDOffsetRange(Kafka_input_topic-5,242,242,Some(executor_xx.xxx.xx.111_5))
 18/04/15 03:21:24 INFO SparkContext: Starting job: start at App.java:106
18/04/15 03:21:31 INFO BlockManagerInfo: Added broadcast_92_piece0 in memory on 
xx.xxx.xx.137:44682 (size: 62.6 MB, free: 1942.0 MB)

After 2 to 3 weeks of struggle, I found a potential solution which will help 
many people who are looking to use an RF model for “probability” in a real-time 
streaming context.
In the current version of Spark, the transformImpl method of 
RandomForestClassificationModel implements only “prediction”; the same approach 
can be leveraged to implement “probability” in transformImpl as well.

I have modified the code and deployed it on our server, and it runs as fast as 
400 ms to 500 ms for every mini batch.

I see many people out there facing this issue with no solution provided in any 
of the forums. Can you please review and include this fix in the next release? 
Thanks.
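
For reference, here is a rough sketch of the kind of change described above. It 
reuses the broadcast-then-UDF pattern that transformImpl already uses for the 
prediction column; it is a fragment meant to live inside 
RandomForestClassificationModel, not the reporter's actual patch, and method 
signatures may differ between Spark versions.

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}

// Broadcast the fitted model once per transform call and derive both the
// prediction and the probability columns from that single broadcast, instead
// of leaving the probability column to the slower generic code path.
override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
  val bcastModel = dataset.sparkSession.sparkContext.broadcast(this)
  val predictUDF = udf { features: Vector =>
    bcastModel.value.predict(features)
  }
  val probabilityUDF = udf { features: Vector =>
    bcastModel.value.predictProbability(features)
  }
  dataset
    .withColumn($(predictionCol), predictUDF(col($(featuresCol))))
    .withColumn($(probabilityCol), probabilityUDF(col($(featuresCol))))
}
{code}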




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files

2018-04-26 Thread Maxim Gekk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454761#comment-16454761
 ] 

Maxim Gekk commented on SPARK-24068:


The same issue exists in JSON datasource. [~hyukjin.kwon] Do we need a separate 
ticket for that?

> CSV schema inferring doesn't work for compressed files
> --
>
> Key: SPARK-24068
> URL: https://issues.apache.org/jira/browse/SPARK-24068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Here is a simple csv file compressed by lzo
> {code}
> $ cat ./test.csv
> col1,col2
> a,1
> $ lzop ./test.csv
> $ ls
> test.csv test.csv.lzo
> {code}
> Reading test.csv.lzo with LZO codec (see 
> https://github.com/twitter/hadoop-lzo, for example):
> {code:scala}
> scala> val ds = spark.read.option("header", true).option("inferSchema", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
> ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
> scala> ds.printSchema
> root
>  |-- �LZO: string (nullable = true)
> scala> ds.show
> +-+
> |�LZO|
> +-+
> |a|
> +-+
> {code}
> but the file can be read if the schema is specified:
> {code}
> scala> import org.apache.spark.sql.types._
> scala> val schema = new StructType().add("col1", StringType).add("col2", 
> IntegerType)
> scala> val ds = spark.read.schema(schema).option("header", 
> true).option("io.compression.codecs", 
> "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
> scala> ds.show
> +++
> |col1|col2|
> +++
> |   a|   1|
> +++
> {code}
> Just in case, schema inferring works for the original uncompressed file:
> {code:scala}
> scala> spark.read.option("header", true).option("inferSchema", 
> true).csv("test.csv").printSchema
> root
>  |-- col1: string (nullable = true)
>  |-- col2: integer (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23120) Add PMML pipeline export support to PySpark

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454716#comment-16454716
 ] 

Apache Spark commented on SPARK-23120:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21172

> Add PMML pipeline export support to PySpark
> ---
>
> Key: SPARK-23120
> URL: https://issues.apache.org/jira/browse/SPARK-23120
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
>
> Once we have Scala support go back and fill in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23120) Add PMML pipeline export support to PySpark

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23120:


Assignee: holdenk  (was: Apache Spark)

> Add PMML pipeline export support to PySpark
> ---
>
> Key: SPARK-23120
> URL: https://issues.apache.org/jira/browse/SPARK-23120
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
>
> Once we have Scala support go back and fill in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23120) Add PMML pipeline export support to PySpark

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23120:


Assignee: Apache Spark  (was: holdenk)

> Add PMML pipeline export support to PySpark
> ---
>
> Key: SPARK-23120
> URL: https://issues.apache.org/jira/browse/SPARK-23120
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Major
>
> Once we have Scala support go back and fill in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24085) Scalar subquery error

2018-04-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454707#comment-16454707
 ] 

Dilip Biswal edited comment on SPARK-24085 at 4/26/18 7:09 PM:
---

Working on a fix for this. My current thinking is to not consider subquery 
expressions in the partition pruning process.
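
A rough sketch of that idea (not the actual patch, and the helper name below is 
made up): when collecting filters for partition pruning, any predicate that 
still contains an unevaluated subquery expression would simply be skipped, 
since it cannot be evaluated at pruning time and can be applied as an ordinary 
filter once the subquery has run.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, SubqueryExpression}

// Keep only predicates that are free of subquery expressions for pruning.
def prunablePredicates(predicates: Seq[Expression]): Seq[Expression] =
  predicates.filterNot(p => p.find(_.isInstanceOf[SubqueryExpression]).isDefined)
{code}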


was (Author: dkbiswal):
Working on a fix for this.

> Scalar subquery error
> -
>
> Key: SPARK-24085
> URL: https://issues.apache.org/jira/browse/SPARK-24085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Alexey Baturin
>Priority: Major
>
> Error
> {noformat}
> SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: scalar-subquery{noformat}
> This happens when querying a partitioned table backed by a parquet file and 
> filtering on a partition column with a scalar subquery.
> Query to reproduce:
> {code:sql}
> CREATE TABLE test_prc_bug (
> id_value string
> )
> partitioned by (id_type string)
> location '/tmp/test_prc_bug'
> stored as parquet;
> insert into test_prc_bug values ('1','a');
> insert into test_prc_bug values ('2','a');
> insert into test_prc_bug values ('3','b');
> insert into test_prc_bug values ('4','b');
> select * from test_prc_bug
> where id_type = (select 'b');
> {code}
> If table in ORC format it works fine



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24085) Scalar subquery error

2018-04-26 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454707#comment-16454707
 ] 

Dilip Biswal commented on SPARK-24085:
--

Working on a fix for this.

> Scalar subquery error
> -
>
> Key: SPARK-24085
> URL: https://issues.apache.org/jira/browse/SPARK-24085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Alexey Baturin
>Priority: Major
>
> Error
> {noformat}
> SQL Error: java.lang.UnsupportedOperationException: Cannot evaluate 
> expression: scalar-subquery{noformat}
> This happens when querying a partitioned table backed by a parquet file and 
> filtering on a partition column with a scalar subquery.
> Query to reproduce:
> {code:sql}
> CREATE TABLE test_prc_bug (
> id_value string
> )
> partitioned by (id_type string)
> location '/tmp/test_prc_bug'
> stored as parquet;
> insert into test_prc_bug values ('1','a');
> insert into test_prc_bug values ('2','a');
> insert into test_prc_bug values ('3','b');
> insert into test_prc_bug values ('4','b');
> select * from test_prc_bug
> where id_type = (select 'b');
> {code}
> If table in ORC format it works fine



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24105) Spark 2.3.0 on kubernetes

2018-04-26 Thread Lenin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lenin updated SPARK-24105:
--
Description: 
Right now it is only possible to define node selector configurations through 
spark.kubernetes.node.selector.[labelKey]. This gets used for both driver and 
executor pods. Without the capability to isolate driver and executor pods, the 
cluster can run into a livelock scenario: if there are a lot of spark-submits, 
the driver pods can fill up the cluster capacity, leaving no room for executor 
pods to do any work.

 

To avoid this deadlock, it is required to support node selector (and, in the 
future, affinity/anti-affinity) configuration separately for the driver and executors.

 

  was:
Right now it is only possible to define node selector configurations through 
spark.kubernetes.node.selector.[labelKey]. This gets used for both driver and 
executor pods. Without the capability to isolate driver and executor pods, the 
cluster can run into a deadlock scenario: if there are a lot of spark-submits, 
the driver pods can fill up the cluster capacity, leaving no room for executor 
pods to do any work.

 

To avoid this deadlock, it is required to support node selector (and, in the 
future, affinity/anti-affinity) configuration separately for the driver and executors.

 


> Spark 2.3.0 on kubernetes
> -
>
> Key: SPARK-24105
> URL: https://issues.apache.org/jira/browse/SPARK-24105
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Lenin
>Priority: Major
>
> Right now its only possible to define node selector configurations 
> thruspark.kubernetes.node.selector.[labelKey]. This gets used for both driver 
> & executor pods. Without the capability to isolate driver & executor pods, 
> the cluster can run into a livelock scenario, where if there are a lot of 
> spark submits, can cause the driver pods to fill up the cluster capacity, 
> with no room for executor pods to do any work.
>  
> To avoid this deadlock, its required to support node selector (in future 
> affinity/anti-affinity) configruation by driver & executor.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24104:


Assignee: (was: Apache Spark)

> SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of 
> updating them
> -
>
> Key: SPARK-24104
> URL: https://issues.apache.org/jira/browse/SPARK-24104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> SqlAppStatusListener does 
> {code}
> exec.driverAccumUpdates = accumUpdates.toMap
> update(exec)
> {code}
> in onDriverAccumUpdates.
> But postDriverMetricUpdates is called multiple time per query, e.g. from each 
> FileSourceScanExec and BroadcastExchangeExec.
> If the update does not really update it in the KV store (depending on 
> liveUpdatePeriodNs), the previously posted metrics are lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24104:


Assignee: Apache Spark

> SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of 
> updating them
> -
>
> Key: SPARK-24104
> URL: https://issues.apache.org/jira/browse/SPARK-24104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Major
>
> SqlAppStatusListener does 
> {code}
> exec.driverAccumUpdates = accumUpdates.toMap
> update(exec)
> {code}
> in onDriverAccumUpdates.
> But postDriverMetricUpdates is called multiple time per query, e.g. from each 
> FileSourceScanExec and BroadcastExchangeExec.
> If the update does not really update it in the KV store (depending on 
> liveUpdatePeriodNs), the previously posted metrics are lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454630#comment-16454630
 ] 

Apache Spark commented on SPARK-24104:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/21171

> SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of 
> updating them
> -
>
> Key: SPARK-24104
> URL: https://issues.apache.org/jira/browse/SPARK-24104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> SqlAppStatusListener does 
> {code}
> exec.driverAccumUpdates = accumUpdates.toMap
> update(exec)
> {code}
> in onDriverAccumUpdates.
> But postDriverMetricUpdates is called multiple time per query, e.g. from each 
> FileSourceScanExec and BroadcastExchangeExec.
> If the update does not really update it in the KV store (depending on 
> liveUpdatePeriodNs), the previously posted metrics are lost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24105) Spark 2.3.0 on kubernetes

2018-04-26 Thread Lenin (JIRA)
Lenin created SPARK-24105:
-

 Summary: Spark 2.3.0 on kubernetes
 Key: SPARK-24105
 URL: https://issues.apache.org/jira/browse/SPARK-24105
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Lenin


Right now it is only possible to define node selector configurations through 
spark.kubernetes.node.selector.[labelKey]. This gets used for both driver and 
executor pods. Without the capability to isolate driver and executor pods, the 
cluster can run into a deadlock scenario: if there are a lot of spark-submits, 
the driver pods can fill up the cluster capacity, leaving no room for executor 
pods to do any work.

 

To avoid this deadlock, it is required to support node selector (and, in the 
future, affinity/anti-affinity) configuration separately for the driver and executors.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23962) Flaky tests from SQLMetricsTestUtils.currentExecutionIds

2018-04-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454594#comment-16454594
 ] 

Dongjoon Hyun commented on SPARK-23962:
---

[~irashid], [~cloud_fan], Please check the build status on `branch-2.3/sbt`.
[https://github.com/apache/spark/pull/21041#issuecomment-384721509]

Also, cc [~smilegator]. 

> Flaky tests from SQLMetricsTestUtils.currentExecutionIds
> 
>
> Key: SPARK-23962
> URL: https://issues.apache.org/jira/browse/SPARK-23962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
> Attachments: unit-tests.log
>
>
> I've seen 
> {{org.apache.spark.sql.execution.metric.SQLMetricsSuite.SortMergeJoin 
> metrics}} fail 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4150/testReport/org.apache.spark.sql.execution.metric/SQLMetricsSuite/SortMergeJoin_metrics/
>  
> with
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedException: 2 did not equal 1
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 2 did 
> not equal 1
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.getSparkPlanMetrics(SQLMetricsTestUtils.scala:146)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.getSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsTestUtils$class.testSparkPlanMetrics(SQLMetricsTestUtils.scala:187)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite.testSparkPlanMetrics(SQLMetricsSuite.scala:33)
>   at 
> org.apache.spark.sql.execution.metric.SQLMetricsSuite$$anonfun$7$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SQLMetricsSuite.scala:188)
>   at 
> org.apache.spark.sql.test.SQLTestUtilsBase$class.withTempView(SQLTestUtils.scala:260)
> ...
> {noformat}
> I believe this is because {{SQLMetricsTestUtils.currentExecutionId()}} is 
> racing with the listener bus.
> I'll attach trimmed logs as well
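
For context, a minimal sketch of the usual mitigation in Spark's own SQL test 
utilities (assuming test code living in an org.apache.spark package, since the 
listener bus is private[spark]): drain the listener bus before reading the 
current execution IDs, so the assertion is not racing with event delivery.

{code:scala}
// Inside a test helper such as SQLMetricsTestUtils.getSparkPlanMetrics:
spark.sparkContext.listenerBus.waitUntilEmpty(10 * 1000)
val executionIds = currentExecutionIds()  // now reflects all completed executions
{code}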



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23842) accessing java from PySpark lambda functions

2018-04-26 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-23842.
-
Resolution: Won't Fix

Not supported by the current design, alternatives do exist though.

> accessing java from PySpark lambda functions
> 
>
> Key: SPARK-23842
> URL: https://issues.apache.org/jira/browse/SPARK-23842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cloudpickle, py4j, pyspark
>
> Copied from https://github.com/bartdag/py4j/issues/311 but it seems to be 
> more of a Spark issue than py4j.. 
> |We have a third-party Java library that is distributed to Spark executors 
> through {{--jars}} parameter.
> We want to call a static Java method in that library on executor's side 
> through Spark's {{map()}} or create an object of that library's class through 
> {{mapPartitions()}} call. 
> None of the approaches worked so far. It seems Spark tries to serialize 
> everything it sees in a lambda function, distribute to executors etc.
> I am aware of an older py4j issue/question 
> [#171|https://github.com/bartdag/py4j/issues/171] but looking at that 
> discussion isn't helpful.
> We thought to create a reference to that "class" through a call like 
> {{genmodel = spark._jvm.hex.genmodel}}and then operate through py4j to expose 
> functionality of that library in pyspark executors' lambda functions.
> We don't want Spark to try to serialize spark session variables "spark" nor 
> its reference to py4j gateway {{spark._jvm}} (because it leads to expected 
> non-serializable exceptions), so tried to "trick" Spark not to try to 
> serialize those by nested the above {{genmodel = spark._jvm.hex.genmodel}} 
> into {{exec()}} call.
> It led to another issue that {{spark}} (spark session) nor {{sc}} (spark 
> context) variables seems not available in spark executors' lambda functions. 
> So we're stuck and don't know how to call a generic java class through py4j 
> on executor's side (from within {{map}} or {{mapPartitions}} lambda 
> functions).
> It would be an easier adventure from Scala/ Java for Spark as those can 
> directly call that 3rd-party libraries methods, but our users ask to have a 
> way to do the same from PySpark.|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23842) accessing java from PySpark lambda functions

2018-04-26 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454585#comment-16454585
 ] 

holdenk commented on SPARK-23842:
-

So the py4j gateway only exists on the driver program, on the worker programs 
Spark uses its own method of communicating between the worker and Spark. You 
might find looking at [https://github.com/sparklingpandas/sparklingml] to be 
helpful, if you factor your code so that the Java function takes in a DataFrame 
you can use it that way (or register Java UDFs as shown).
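
For what it's worth, a minimal sketch of the "register Java UDFs" route (all 
class and method names below are made-up stand-ins for the real third-party 
library): ship a small JVM wrapper in the jar passed via --jars, then register 
it from the PySpark driver with something like 
spark.udf.registerJavaFunction("third_party_score", "com.example.ThirdPartyScoreUDF", DoubleType()) 
and call it from SQL or selectExpr, so the JVM call happens on the executors 
without going through py4j.

{code:scala}
package com.example

import org.apache.spark.sql.api.java.UDF1

// Stand-in for the third-party static method we want to call on executors.
object ThirdPartyLib {
  def score(v: Double): Double = v * 2.0
}

// JVM-side wrapper that PySpark can register by class name.
class ThirdPartyScoreUDF extends UDF1[Double, Double] {
  override def call(v: Double): Double = ThirdPartyLib.score(v)
}
{code}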

> accessing java from PySpark lambda functions
> 
>
> Key: SPARK-23842
> URL: https://issues.apache.org/jira/browse/SPARK-23842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: cloudpickle, py4j, pyspark
>
> Copied from https://github.com/bartdag/py4j/issues/311 but it seems to be 
> more of a Spark issue than py4j.. 
> |We have a third-party Java library that is distributed to Spark executors 
> through {{--jars}} parameter.
> We want to call a static Java method in that library on executor's side 
> through Spark's {{map()}} or create an object of that library's class through 
> {{mapPartitions()}} call. 
> None of the approaches worked so far. It seems Spark tries to serialize 
> everything it sees in a lambda function, distribute to executors etc.
> I am aware of an older py4j issue/question 
> [#171|https://github.com/bartdag/py4j/issues/171] but looking at that 
> discussion isn't helpful.
> We thought to create a reference to that "class" through a call like 
> {{genmodel = spark._jvm.hex.genmodel}}and then operate through py4j to expose 
> functionality of that library in pyspark executors' lambda functions.
> We don't want Spark to try to serialize spark session variables "spark" nor 
> its reference to py4j gateway {{spark._jvm}} (because it leads to expected 
> non-serializable exceptions), so tried to "trick" Spark not to try to 
> serialize those by nested the above {{genmodel = spark._jvm.hex.genmodel}} 
> into {{exec()}} call.
> It led to another issue that {{spark}} (spark session) nor {{sc}} (spark 
> context) variables seems not available in spark executors' lambda functions. 
> So we're stuck and don't know how to call a generic java class through py4j 
> on executor's side (from within {{map}} or {{mapPartitions}} lambda 
> functions).
> It would be an easier adventure from Scala/ Java for Spark as those can 
> directly call that 3rd-party libraries methods, but our users ask to have a 
> way to do the same from PySpark.|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24104) SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them

2018-04-26 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-24104:
-

 Summary: SQLAppStatusListener overwrites metrics 
onDriverAccumUpdates instead of updating them
 Key: SPARK-24104
 URL: https://issues.apache.org/jira/browse/SPARK-24104
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Juliusz Sompolski


SqlAppStatusListener does 
{code}
exec.driverAccumUpdates = accumUpdates.toMap
update(exec)
{code}
in onDriverAccumUpdates.
But postDriverMetricUpdates is called multiple times per query, e.g. from each 
FileSourceScanExec and BroadcastExchangeExec.

If the update does not really update it in the KV store (depending on 
liveUpdatePeriodNs), the previously posted metrics are lost.
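
A minimal, Spark-free illustration of the difference (the accumulator IDs and 
values below are made up):

{code:scala}
object DriverAccumUpdateDemo {
  def main(args: Array[String]): Unit = {
    // Updates posted by two different operators during the same query.
    val fromScan      = Map(1L -> 100L) // e.g. from a FileSourceScanExec
    val fromBroadcast = Map(2L -> 42L)  // e.g. from a BroadcastExchangeExec

    // Current behaviour: plain assignment keeps only the last poster's metrics.
    var overwritten = Map.empty[Long, Long]
    overwritten = fromScan
    overwritten = fromBroadcast
    println(overwritten) // Map(2 -> 42): the scan metrics are lost

    // Merging instead preserves everything posted so far.
    var merged = Map.empty[Long, Long]
    merged = merged ++ fromScan
    merged = merged ++ fromBroadcast
    println(merged) // Map(1 -> 100, 2 -> 42)
  }
}
{code}

One possible shape of the fix, then, would be to merge rather than assign, e.g. 
{{exec.driverAccumUpdates = exec.driverAccumUpdates ++ accumUpdates}} before 
calling {{update(exec)}}.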



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23925) High-order function: repeat(element, count) → array

2018-04-26 Thread Marek Novotny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454526#comment-16454526
 ] 

Marek Novotny edited comment on SPARK-23925 at 4/26/18 4:47 PM:


[~pepinoflo] Any joy? I can take this one or help if you want?


was (Author: mn-mikke):
@pepinoflo Any joy? I can take this one or help if you want?

> High-order function: repeat(element, count) → array
> ---
>
> Key: SPARK-23925
> URL: https://issues.apache.org/jira/browse/SPARK-23925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Repeat element for count times.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23925) High-order function: repeat(element, count) → array

2018-04-26 Thread Marek Novotny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454526#comment-16454526
 ] 

Marek Novotny commented on SPARK-23925:
---

@pepinoflo Any joy? I can take this one or help if you want?

> High-order function: repeat(element, count) → array
> ---
>
> Key: SPARK-23925
> URL: https://issues.apache.org/jira/browse/SPARK-23925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Repeat element for count times.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22732) Add DataSourceV2 streaming APIs

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454501#comment-16454501
 ] 

Apache Spark commented on SPARK-22732:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21170

> Add DataSourceV2 streaming APIs
> ---
>
> Key: SPARK-22732
> URL: https://issues.apache.org/jira/browse/SPARK-22732
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.3.0
>
>
> Structured Streaming APIs are currently tucked in a spark internal package. 
> We need to expose a new version in the DataSourceV2 framework, and add the 
> APIs required for continuous processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23715:


Assignee: Apache Spark

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --
>
> Key: SPARK-23715
> URL: https://issues.apache.org/jira/browse/SPARK-23715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 07:18:23|
> +---+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> These issues stem from the fact that the FromUTCTimestamp expression, despite 
> its name, expects the input to be in the user's local timezone. There is some 
> magic under the covers to make things work (mostly) as the user expects.
> As an example, let's say a user in Los Angeles issues the following:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> {noformat}
> FromUTCTimestamp gets as input a Timestamp (long) value representing
> {noformat}
> 2018-03-13T06:18:23-07:00 (long value 1520947103000000)
> {noformat}
> What FromUTCTimestamp needs instead is
> {noformat}
> 2018-03-13T06:18:23+00:00 (long value 1520921903000000)
> {noformat}
> So, it applies the local timezone's offset to the input timestamp to get the 
> correct value (1520947103000000 minus 7 hours is 1520921903000000). Then it 
> can process the value and produce the expected output.
> When the user explicitly specifies a time zone, FromUTCTimestamp's 
> assumptions break down. The input is no longer in the local time zone. 
> Because of the way input data is implicitly casted, FromUTCTimestamp never 
> knows whether the input data had an explicit timezone.
> Here are some gory details:
> There is sometimes a mismatch in expectations between the (string => 
> timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp 
> expression never sees the actual input string (the cast "intercepts" the 
> input and converts it to a long timestamp before FromUTCTimestamp uses the 
> value), FromUTCTimestamp cannot reject any input value that would exercise 
> this mismatch in expectations.
> There is a similar mismatch in expectations in the (integer => timestamp) 
> cast and FromUTCTimestamp. As a result, Unix time input almost always 
> produces incorrect output.
> h3. When things work as expected for String input:
> When from_utc_timestamp is passed a string time value with no time zone, 
> DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
> datetime string as though it's in the user's local time zone. Because 
> DateTimeUtils.stringToTimestamp is a general function, this is reasonable.
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local 
> time zone's offset. FromUTCTimestamp assumes this (or more accurately, a 
> utility function called by FromUTCTimestamp assumes this), so the first thing 
> it does is reverse-shift to get it back the correct value. Now that the long 
> value has been shifted back to the correct timestamp value, it can now 
> process it (by shifting it again based on the specified time zone).
> h3. When things go wrong with String input:
> When from_utc_timestamp is passed a string datetime value with an explicit 
> time zone, stringToTimestamp honors that timezone and ignores the local time 
> zone. stringToTimestamp does not shift the timestamp by the local timezone's 
> offset, but by the timezone specified on the datetime string.
> Unfortunately, FromUTCTimestamp, which has no insight into the actual input 
> or the conversion, still assumes the timestamp is shifted by the local time 
> zone. So it reverse-shifts the long value by the local time zone's offset, 
> which produces a incorrect timestamp (except in the case where the input 
> 

[jira] [Commented] (SPARK-24096) create table as select not using hive.default.fileformat

2018-04-26 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454463#comment-16454463
 ] 

Yuming Wang commented on SPARK-24096:
-

Another related PR: https://github.com/apache/spark/pull/14430

> create table as select not using hive.default.fileformat
> 
>
> Key: SPARK-24096
> URL: https://issues.apache.org/jira/browse/SPARK-24096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: StephenZou
>Priority: Major
>
> In my spark conf directory, hive-site.xml have an item indicating orc is the 
> default file format.
> <property>
>   <name>hive.default.fileformat</name>
>   <value>orc</value>
> </property>
> But when I use "create table as select ..." to create a table, the output 
> format is plain text. 
> It works only I use "set hive.default.fileformat=orc"
>  
> Then I walked through the spark code and found in 
> sparkSqlParser:visitCreateHiveTable(), 
> val defaultStorage = HiveSerDe.getDefaultStorage(conf)  the conf is SQLConf,
> that explains the above observation, 
> "set hive.default.fileformat=orc" is put into conf map, hive-site.xml is not. 
>  
> It's quite misleading, How to unify the settings?
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454462#comment-16454462
 ] 

Apache Spark commented on SPARK-23715:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21169

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --
>
> Key: SPARK-23715
> URL: https://issues.apache.org/jira/browse/SPARK-23715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 07:18:23|
> +---+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> These issues stem from the fact that the FromUTCTimestamp expression, despite 
> its name, expects the input to be in the user's local timezone. There is some 
> magic under the covers to make things work (mostly) as the user expects.
> As an example, let's say a user in Los Angeles issues the following:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> {noformat}
> FromUTCTimestamp gets as input a Timestamp (long) value representing
> {noformat}
> 2018-03-13T06:18:23-07:00 (long value 1520947103000000)
> {noformat}
> What FromUTCTimestamp needs instead is
> {noformat}
> 2018-03-13T06:18:23+00:00 (long value 1520921903000000)
> {noformat}
> So, it applies the local timezone's offset to the input timestamp to get the 
> correct value (1520947103000000 minus 7 hours is 1520921903000000). Then it 
> can process the value and produce the expected output.
> When the user explicitly specifies a time zone, FromUTCTimestamp's 
> assumptions break down. The input is no longer in the local time zone. 
> Because of the way input data is implicitly casted, FromUTCTimestamp never 
> knows whether the input data had an explicit timezone.
> Here are some gory details:
> There is sometimes a mismatch in expectations between the (string => 
> timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp 
> expression never sees the actual input string (the cast "intercepts" the 
> input and converts it to a long timestamp before FromUTCTimestamp uses the 
> value), FromUTCTimestamp cannot reject any input value that would exercise 
> this mismatch in expectations.
> There is a similar mismatch in expectations in the (integer => timestamp) 
> cast and FromUTCTimestamp. As a result, Unix time input almost always 
> produces incorrect output.
> h3. When things work as expected for String input:
> When from_utc_timestamp is passed a string time value with no time zone, 
> DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
> datetime string as though it's in the user's local time zone. Because 
> DateTimeUtils.stringToTimestamp is a general function, this is reasonable.
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local 
> time zone's offset. FromUTCTimestamp assumes this (or more accurately, a 
> utility function called by FromUTCTimestamp assumes this), so the first thing 
> it does is reverse-shift to get it back the correct value. Now that the long 
> value has been shifted back to the correct timestamp value, it can now 
> process it (by shifting it again based on the specified time zone).
> h3. When things go wrong with String input:
> When from_utc_timestamp is passed a string datetime value with an explicit 
> time zone, stringToTimestamp honors that timezone and ignores the local time 
> zone. stringToTimestamp does not shift the timestamp by the local timezone's 
> offset, but by the timezone specified on the datetime string.
> Unfortunately, FromUTCTimestamp, which has no insight into the actual input 
> or the conversion, still assumes the timestamp is shifted by the local time 
> zone. So it reverse-shifts the long value by the local time zone's offset, 
> which 

[jira] [Assigned] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23715:


Assignee: (was: Apache Spark)

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --
>
> Key: SPARK-23715
> URL: https://issues.apache.org/jira/browse/SPARK-23715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 07:18:23|
> +---+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> These issues stem from the fact that the FromUTCTimestamp expression, despite 
> its name, expects the input to be in the user's local timezone. There is some 
> magic under the covers to make things work (mostly) as the user expects.
> As an example, let's say a user in Los Angeles issues the following:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> {noformat}
> FromUTCTimestamp gets as input a Timestamp (long) value representing
> {noformat}
> 2018-03-13T06:18:23-07:00 (long value 1520947103000000)
> {noformat}
> What FromUTCTimestamp needs instead is
> {noformat}
> 2018-03-13T06:18:23+00:00 (long value 1520921903000000)
> {noformat}
> So, it applies the local timezone's offset to the input timestamp to get the 
> correct value (1520947103000000 minus 7 hours is 1520921903000000). Then it 
> can process the value and produce the expected output.
> When the user explicitly specifies a time zone, FromUTCTimestamp's 
> assumptions break down. The input is no longer in the local time zone. 
> Because of the way input data is implicitly casted, FromUTCTimestamp never 
> knows whether the input data had an explicit timezone.
> Here are some gory details:
> There is sometimes a mismatch in expectations between the (string => 
> timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp 
> expression never sees the actual input string (the cast "intercepts" the 
> input and converts it to a long timestamp before FromUTCTimestamp uses the 
> value), FromUTCTimestamp cannot reject any input value that would exercise 
> this mismatch in expectations.
> There is a similar mismatch in expectations in the (integer => timestamp) 
> cast and FromUTCTimestamp. As a result, Unix time input almost always 
> produces incorrect output.
> h3. When things work as expected for String input:
> When from_utc_timestamp is passed a string time value with no time zone, 
> DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
> datetime string as though it's in the user's local time zone. Because 
> DateTimeUtils.stringToTimestamp is a general function, this is reasonable.
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local 
> time zone's offset. FromUTCTimestamp assumes this (or more accurately, a 
> utility function called by FromUTCTimestamp assumes this), so the first thing 
> it does is reverse-shift to get it back the correct value. Now that the long 
> value has been shifted back to the correct timestamp value, it can now 
> process it (by shifting it again based on the specified time zone).
> h3. When things go wrong with String input:
> When from_utc_timestamp is passed a string datetime value with an explicit 
> time zone, stringToTimestamp honors that timezone and ignores the local time 
> zone. stringToTimestamp does not shift the timestamp by the local timezone's 
> offset, but by the timezone specified on the datetime string.
> Unfortunately, FromUTCTimestamp, which has no insight into the actual input 
> or the conversion, still assumes the timestamp is shifted by the local time 
> zone. So it reverse-shifts the long value by the local time zone's offset, 
> which produces an incorrect timestamp (except in the case where the input 
> datetime string just happened to specify a time zone matching the local time 
> zone).
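
As an editorial illustration of the shift arithmetic described above, here is a simplified Scala sketch using the example's values. It mirrors the behavior described in the prose, not Spark's actual implementation; the variable names are made up.

{code}
import java.util.TimeZone

// Value stored after the (string => timestamp) cast for a Los Angeles session:
// 2018-03-13T06:18:23-07:00, i.e. 1520947103000000 microseconds since the epoch.
val inputMicros = 1520947103000000L

// Offsets in microseconds: local zone (GMT-7 on that date) and target zone (GMT+1).
val localOffsetUs  = TimeZone.getTimeZone("America/Los_Angeles").getOffset(inputMicros / 1000L) * 1000L
val targetOffsetUs = TimeZone.getTimeZone("GMT+1").getOffset(inputMicros / 1000L) * 1000L

val reverseShifted = inputMicros + localOffsetUs      // 1520921903000000 = 2018-03-13 06:18:23 UTC
val result         = reverseShifted + targetOffsetUs  // renders as 2018-03-13 07:18:23
{code}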

[jira] [Resolved] (SPARK-24096) create table as select not using hive.default.fileformat

2018-04-26 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-24096.
-
Resolution: Duplicate

> create table as select not using hive.default.fileformat
> 
>
> Key: SPARK-24096
> URL: https://issues.apache.org/jira/browse/SPARK-24096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: StephenZou
>Priority: Major
>
> In my Spark conf directory, hive-site.xml has an item indicating ORC is the 
> default file format:
> 
> <property>
>   <name>hive.default.fileformat</name>
>   <value>orc</value>
> </property>
> But when I use "create table as select ..." to create a table, the output 
> format is plain text. 
> It works only if I use "set hive.default.fileformat=orc".
>  
> Then I walked through the Spark code and found that in 
> SparkSqlParser.visitCreateHiveTable(), 
> val defaultStorage = HiveSerDe.getDefaultStorage(conf), where conf is a SQLConf. 
> That explains the observation above: 
> "set hive.default.fileformat=orc" is put into the conf map, but hive-site.xml is not. 
>  
> It's quite misleading. How can the settings be unified?
>  
>  
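
For reference, a minimal spark-shell sketch of the session-level workaround mentioned above (assumes a Hive-enabled SparkSession named spark and an existing source table src):

{code}
// Setting the property on the session puts it into SQLConf, which is what
// SparkSqlParser.visitCreateHiveTable() consults for the default CTAS format.
spark.sql("SET hive.default.fileformat=orc")
spark.sql("CREATE TABLE t_orc AS SELECT * FROM src")
spark.sql("DESCRIBE FORMATTED t_orc").show(100, false)  // storage section should report ORC
{code}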



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454454#comment-16454454
 ] 

Apache Spark commented on SPARK-20087:
--

User 'advancedxy' has created a pull request for this issue:
https://github.com/apache/spark/pull/21165

> Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd 
> listeners
> -
>
> Key: SPARK-20087
> URL: https://issues.apache.org/jira/browse/SPARK-20087
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Charles Lewis
>Priority: Major
>
> When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive 
> accumulators / task metrics for that task, if they were still available. 
> These metrics are not currently sent when tasks are killed intentionally, 
> such as when a speculative retry finishes, and the original is killed (or 
> vice versa). Since we're killing these tasks ourselves, these metrics should 
> almost always exist, and we should treat them the same way as we treat 
> ExceptionFailures.
> Sending these metrics with the TaskKilled end reason makes aggregation across 
> all tasks in an app more accurate. This data can inform decisions about how 
> to tune the speculation parameters in order to minimize duplicated work, and 
> in general, the total cost of an app should include both successful and 
> failed tasks, if that information exists.
> PR: https://github.com/apache/spark/pull/17422
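
For context, a rough sketch of the kind of listener-side aggregation this change helps; this is illustrative only, and the class and field names are not from the PR.

{code}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TotalRunTimeListener extends SparkListener {
  private val totalRunTimeMs = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Without this change, taskMetrics may be missing when the end reason is TaskKilled,
    // so killed speculative duplicates silently drop out of the aggregate.
    Option(taskEnd.taskMetrics).foreach(m => totalRunTimeMs.addAndGet(m.executorRunTime))
  }

  def totalMs: Long = totalRunTimeMs.get()
}

// Usage, assuming a SparkContext named sc:
// sc.addSparkListener(new TotalRunTimeListener)
{code}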



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454445#comment-16454445
 ] 

Apache Spark commented on SPARK-24101:
--

User 'imatiach-msft' has created a pull request for this issue:
https://github.com/apache/spark/pull/17086

> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24101
> URL: https://issues.apache.org/jira/browse/SPARK-24101
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24101:


Assignee: (was: Apache Spark)

> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24101
> URL: https://issues.apache.org/jira/browse/SPARK-24101
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24101:


Assignee: Apache Spark

> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24101
> URL: https://issues.apache.org/jira/browse/SPARK-24101
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Assignee: Apache Spark
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Matiach updated SPARK-24101:
-
Description: The LogisticRegression and LinearRegression models support 
training with a weight column, but the corresponding evaluators do not support 
computing metrics using those weights. This breaks model selection using 
CrossValidator.

> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24101
> URL: https://issues.apache.org/jira/browse/SPARK-24101
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Matiach updated SPARK-24102:
-
Description: The LogisticRegression and LinearRegression models support 
training with a weight column, but the corresponding evaluators do not support 
computing metrics using those weights. This breaks model selection using 
CrossValidator.

> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24103) BinaryClassificationEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Matiach updated SPARK-24103:
-
Description: The LogisticRegression and LinearRegression models support 
training with a weight column, but the corresponding evaluators do not support 
computing metrics using those weights. This breaks model selection using 
CrossValidator.

> BinaryClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24103
> URL: https://issues.apache.org/jira/browse/SPARK-24103
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454435#comment-16454435
 ] 

Ilya Matiach commented on SPARK-18693:
--

[~josephkb] sure, I've added 3 JIRAs for tracking and have linked to them:

https://issues.apache.org/jira/browse/SPARK-24103

https://issues.apache.org/jira/browse/SPARK-24101

https://issues.apache.org/jira/browse/SPARK-24102

Thank you, Ilya

> BinaryClassificationEvaluator, RegressionEvaluator, and 
> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-18693
> URL: https://issues.apache.org/jira/browse/SPARK-18693
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Devesh Parekh
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.
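
A short Scala sketch of the gap (illustrative; train and test are assumed DataFrames with label, features and weight columns):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Training honors the weight column...
val lr = new LogisticRegression().setWeightCol("weight")
val model = lr.fit(train)

// ...but the evaluator has no weight parameter, so CrossValidator's metric ignores it.
val evaluator = new MulticlassClassificationEvaluator().setMetricName("f1")
val f1 = evaluator.evaluate(model.transform(test))  // unweighted metric
{code}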



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24103) BinaryClassificationEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)
Ilya Matiach created SPARK-24103:


 Summary: BinaryClassificationEvaluator should use sample weight 
data
 Key: SPARK-24103
 URL: https://issues.apache.org/jira/browse/SPARK-24103
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.0.2
Reporter: Ilya Matiach






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Matiach updated SPARK-24102:
-
Issue Type: Improvement  (was: Bug)

> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Matiach updated SPARK-24101:
-
Issue Type: Improvement  (was: Bug)

> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24101
> URL: https://issues.apache.org/jira/browse/SPARK-24101
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)
Ilya Matiach created SPARK-24102:


 Summary: RegressionEvaluator should use sample weight data
 Key: SPARK-24102
 URL: https://issues.apache.org/jira/browse/SPARK-24102
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.2
Reporter: Ilya Matiach






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24101) MulticlassClassificationEvaluator should use sample weight data

2018-04-26 Thread Ilya Matiach (JIRA)
Ilya Matiach created SPARK-24101:


 Summary: MulticlassClassificationEvaluator should use sample 
weight data
 Key: SPARK-24101
 URL: https://issues.apache.org/jira/browse/SPARK-24101
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.2
Reporter: Ilya Matiach






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-26 Thread Alex Wajda (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454393#comment-16454393
 ] 

Alex Wajda commented on SPARK-23933:


Oh, now I get it, thanks. I overlooked the case where the keys could themselves be 
arrays. Looks like another function name is needed, e.g. to_map or zip_map.

> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}
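
A plain-Scala sketch of the proposed semantics (illustrative only, not the Spark implementation): build a map by zipping a key array with a value array.

{code}
val keys   = Array(1, 3)
val values = Array(2, 4)
val m: Map[Int, Int] = keys.zip(values).toMap  // Map(1 -> 2, 3 -> 4), matching the Presto example
{code}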



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object

2018-04-26 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454365#comment-16454365
 ] 

Eric Maynard commented on SPARK-23830:
--

[~jerryshao] I agree we should not support using a {{class}}. However, I also 
believe that it's bad practice to throw a non-descriptive NPE. I created a PR 
here which adds more useful logging in the event that a proper main method 
isn't found and prevents throwing an NPE:
https://github.com/apache/spark/pull/21168  

> Spark on YARN in cluster deploy mode fail with NullPointerException when a 
> Spark application is a Scala class not object
> 
>
> Key: SPARK-23830
> URL: https://issues.apache.org/jira/browse/SPARK-23830
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As reported on StackOverflow in [Why does Spark on YARN fail with “Exception 
> in thread ”Driver“ 
> java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344]
>  the following Spark application fails with {{Exception in thread "Driver" 
> java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode:
> {code}
> class MyClass {
>   def main(args: Array[String]): Unit = {
> val c = new MyClass()
> c.process()
>   }
>   def process(): Unit = {
> val sparkConf = new SparkConf().setAppName("my-test")
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sparkConf).getOrCreate()
> import sparkSession.implicits._
> 
>   }
>   ...
> }
> {code}
> The exception is as follows:
> {code}
> 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a 
> separate Thread
> 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context 
> initialization...
> Exception in thread "Driver" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {code}
> I think the reason for the exception {{Exception in thread "Driver" 
> java.lang.NullPointerException}} is due to [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]:
> {code}
> val mainMethod = userClassLoader.loadClass(args.userClass)
>   .getMethod("main", classOf[Array[String]])
> {code}
> So when {{mainMethod}} is used in [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706]
>  it simply gives NPE.
> {code}
> mainMethod.invoke(null, userArgs.toArray)
> {code}
> That could easily be avoided with an extra check that the {{mainMethod}} is 
> initialized, giving the user a message about what the likely reason is.
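
A minimal sketch of the kind of guard described above; it is not the actual PR and simply reuses the names from the quoted ApplicationMaster code.

{code}
import java.lang.reflect.Modifier

val mainMethod = userClassLoader.loadClass(args.userClass)
  .getMethod("main", classOf[Array[String]])

if (!Modifier.isStatic(mainMethod.getModifiers)) {
  // A Scala class (not an object) compiles main into an instance method, so invoking
  // it with a null receiver throws the NPE seen above. Fail with a clear message instead.
  throw new IllegalArgumentException(
    s"${args.userClass} has no static main(Array[String]) method; is it a class instead of an object?")
} else {
  mainMethod.invoke(null, userArgs.toArray)
}
{code}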



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23830:


Assignee: (was: Apache Spark)

> Spark on YARN in cluster deploy mode fail with NullPointerException when a 
> Spark application is a Scala class not object
> 
>
> Key: SPARK-23830
> URL: https://issues.apache.org/jira/browse/SPARK-23830
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As reported on StackOverflow in [Why does Spark on YARN fail with “Exception 
> in thread ”Driver“ 
> java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344]
>  the following Spark application fails with {{Exception in thread "Driver" 
> java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode:
> {code}
> class MyClass {
>   def main(args: Array[String]): Unit = {
> val c = new MyClass()
> c.process()
>   }
>   def process(): Unit = {
> val sparkConf = new SparkConf().setAppName("my-test")
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sparkConf).getOrCreate()
> import sparkSession.implicits._
> 
>   }
>   ...
> }
> {code}
> The exception is as follows:
> {code}
> 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a 
> separate Thread
> 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context 
> initialization...
> Exception in thread "Driver" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {code}
> I think the reason for the exception {{Exception in thread "Driver" 
> java.lang.NullPointerException}} is due to [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]:
> {code}
> val mainMethod = userClassLoader.loadClass(args.userClass)
>   .getMethod("main", classOf[Array[String]])
> {code}
> So when {{mainMethod}} is used in [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706]
>  it simply gives NPE.
> {code}
> mainMethod.invoke(null, userArgs.toArray)
> {code}
> That could easily be avoided with an extra check that the {{mainMethod}} is 
> initialized, giving the user a message about what the likely reason is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23830:


Assignee: Apache Spark

> Spark on YARN in cluster deploy mode fail with NullPointerException when a 
> Spark application is a Scala class not object
> 
>
> Key: SPARK-23830
> URL: https://issues.apache.org/jira/browse/SPARK-23830
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Trivial
>
> As reported on StackOverflow in [Why does Spark on YARN fail with “Exception 
> in thread ”Driver“ 
> java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344]
>  the following Spark application fails with {{Exception in thread "Driver" 
> java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode:
> {code}
> class MyClass {
>   def main(args: Array[String]): Unit = {
> val c = new MyClass()
> c.process()
>   }
>   def process(): Unit = {
> val sparkConf = new SparkConf().setAppName("my-test")
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sparkConf).getOrCreate()
> import sparkSession.implicits._
> 
>   }
>   ...
> }
> {code}
> The exception is as follows:
> {code}
> 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a 
> separate Thread
> 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context 
> initialization...
> Exception in thread "Driver" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {code}
> I think the reason for the exception {{Exception in thread "Driver" 
> java.lang.NullPointerException}} is due to [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]:
> {code}
> val mainMethod = userClassLoader.loadClass(args.userClass)
>   .getMethod("main", classOf[Array[String]])
> {code}
> So when {{mainMethod}} is used in [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706]
>  it simply gives NPE.
> {code}
> mainMethod.invoke(null, userArgs.toArray)
> {code}
> That could easily be avoided with an extra check that the {{mainMethod}} is 
> initialized, giving the user a message about what the likely reason is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454358#comment-16454358
 ] 

Apache Spark commented on SPARK-23830:
--

User 'eric-maynard' has created a pull request for this issue:
https://github.com/apache/spark/pull/21168

> Spark on YARN in cluster deploy mode fail with NullPointerException when a 
> Spark application is a Scala class not object
> 
>
> Key: SPARK-23830
> URL: https://issues.apache.org/jira/browse/SPARK-23830
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> As reported on StackOverflow in [Why does Spark on YARN fail with “Exception 
> in thread ”Driver“ 
> java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344]
>  the following Spark application fails with {{Exception in thread "Driver" 
> java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode:
> {code}
> class MyClass {
>   def main(args: Array[String]): Unit = {
> val c = new MyClass()
> c.process()
>   }
>   def process(): Unit = {
> val sparkConf = new SparkConf().setAppName("my-test")
> val sparkSession: SparkSession = 
> SparkSession.builder().config(sparkConf).getOrCreate()
> import sparkSession.implicits._
> 
>   }
>   ...
> }
> {code}
> The exception is as follows:
> {code}
> 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a 
> separate Thread
> 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context 
> initialization...
> Exception in thread "Driver" java.lang.NullPointerException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
> {code}
> I think the reason for the exception {{Exception in thread "Driver" 
> java.lang.NullPointerException}} is due to [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]:
> {code}
> val mainMethod = userClassLoader.loadClass(args.userClass)
>   .getMethod("main", classOf[Array[String]])
> {code}
> So when {{mainMethod}} is used in [the following 
> code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706]
>  it simply gives NPE.
> {code}
> mainMethod.invoke(null, userArgs.toArray)
> {code}
> That could easily be avoided with an extra check that the {{mainMethod}} is 
> initialized, giving the user a message about what the likely reason is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2018-04-26 Thread Tavis Barr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454278#comment-16454278
 ] 

Tavis Barr commented on SPARK-13446:


Enjoy!

 

This is with Spark 2.3.0 and Hive 2.2.0

 

scala> spark2.sql("SHOW DATABASES").show
java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
 at 
org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:205)
 at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
 at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
 at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
 at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
 at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
 at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
 at 
org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
 at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
 at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
 ... 51 elided

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore 
> directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been 
> released, it's better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-26 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454221#comment-16454221
 ] 

Wenchen Fan edited comment on SPARK-23715 at 4/26/18 1:53 PM:
--

It seems the `from_utc_timestamp` doesn't make a lot of sense in Spark SQL. The 
timestamp in Spark SQL is TIMESTAMP WITH LOCAL TIME ZONE according to the SQL 
standard. Physically it stores microseconds from unix epoch, and when it's 
involved in timezone aware operations, like convert to string format, get the 
hour component, etc., Spark SQL uses session local timezone to interpret it.

That said, the timestamp in Spark does not carry the timezone information, so 
`from_utc_timestamp` doesn't make sense in Spark to change the timezone of a 
timestamp.

`from_utc_timestamp` was added in Spark 1.5, I think we can't remove it now, we 
should think of a reasonable definition for it to make it work as users expect. 
What do users expect?

{code}
scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show
+-+
|from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)|
+-+
|  2018-01-01 08:00:00|
+-+

scala> spark.conf.set("spark.sql.session.timeZone", "GMT+4")

scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show
+-+
|from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)|
+-+
|  2018-01-01 08:00:00|
+-+
{code}

The expected behavior is to shift the timestamp from UTC to the specified 
timezone, assuming the timestamp is timezone-agnostic. So the session local 
timezone should not affect the result. This also means the timestamp string 
can't carry the timezone and this is a bug in Spark.

For integer input:
{code}
scala> java.util.TimeZone.getDefault.getID
res27: String = Asia/Shanghai  // This is GMT+8

scala> sql("select from_utc_timestamp(cast(0 as timestamp), 'GMT+8')").show
+---+
|from_utc_timestamp(CAST(0 AS TIMESTAMP), GMT+8)|
+---+
|1970-01-01 16:00:00|
+---+


scala> spark.conf.set("spark.sql.session.timeZone", "GMT+4")

scala> sql("select from_utc_timestamp(cast(0 as timestamp), 'GMT+8')").show
+---+
|from_utc_timestamp(CAST(0 AS TIMESTAMP), GMT+8)|
+---+
|1970-01-01 12:00:00|
+---+
{code}

so what happened is, `cast(0 as timestamp)` assumes the input integer is 
seconds from unix epoch, and it converts the seconds to microseconds and done, 
no timezone information is needed. According to the semantic of Spark 
timestamp, `cast(0 as timestamp)` is effectively `1970-01-01 08:00:00`(local 
timezone is GMT+8) as an input to `from_utc_timestamp`, and that's why the 
result is `1970-01-01 16:00:00`. And the result changes if the local timezone 
changes.

Since this behavior is consistent with Hive, let's stick with it. Personally I 
don't think it's the correct behavior, but it's too late to change, and we 
don't have to change it as long as it's well defined, i.e. the result is 
deterministic.

Now the only thing we need to change is: if the input is a string, return null if 
it contains a timezone.


was (Author: cloud_fan):
It seems the `from_utc_timestamp` doesn't make a lot of sense in Spark SQL. The 
timestamp in Spark SQL is TIMESTAMP WITH LOCAL TIME ZONE according to the SQL 
standard. Physically it stores microseconds from unix epoch, and when it's 
involved in timezone aware operations, like convert to string format, get the 
hour component, etc., Spark SQL uses session local timezone to interpret it.

That said, the timestamp in Spark does not carry the timezone information, so 
`from_utc_timestamp` doesn't make sense in Spark to change the timezone of a 
timestamp.

`from_utc_timestamp` was added in Spark 1.5, I think we can't remove it now, we 
should think of a reasonable definition for it to make it work as users expect. 
What do users expect?

{code}
scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show
+-+
|from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)|
+-+
|  2018-01-01 

[jira] [Commented] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-26 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454221#comment-16454221
 ] 

Wenchen Fan commented on SPARK-23715:
-

It seems the `from_utc_timestamp` doesn't make a lot of sense in Spark SQL. The 
timestamp in Spark SQL is TIMESTAMP WITH LOCAL TIME ZONE according to the SQL 
standard. Physically it stores microseconds from unix epoch, and when it's 
involved in timezone aware operations, like convert to string format, get the 
hour component, etc., Spark SQL uses session local timezone to interpret it.

That said, the timestamp in Spark does not carry the timezone information, so 
`from_utc_timestamp` doesn't make sense in Spark to change the timezone of a 
timestamp.

`from_utc_timestamp` was added in Spark 1.5, I think we can't remove it now, we 
should think of a reasonable definition for it to make it work as users expect. 
What do users expect?

{code}
scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show
+-+
|from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)|
+-+
|  2018-01-01 08:00:00|
+-+

scala> spark.conf.set("spark.sql.session.timeZone", "GMT+4")

scala> sql("select from_utc_timestamp('2018-01-01 00:00:00', 'GMT+8')").show
+-+
|from_utc_timestamp(CAST(2018-01-01 00:00:00 AS TIMESTAMP), GMT+8)|
+-+
|  2018-01-01 08:00:00|
+-+
{code}

The expected behavior is to shift the timestamp from UTC to the specified 
timezone, assuming the timestamp is timezone-agnostic. So the session local 
timezone should not affect the result. This also means the timestamp string 
can't carry the timezone and this is a bug in Spark.

For integer input,
{code}
scala> java.util.TimeZone.getDefault.getID
res27: String = Asia/Shanghai  // This is GMT+8

scala> sql("select from_utc_timestamp(cast(0 as timestamp), 'GMT+8')").show
+---+
|from_utc_timestamp(CAST(0 AS TIMESTAMP), GMT+8)|
+---+
|1970-01-01 16:00:00|
+---+


scala> spark.conf.set("spark.sql.session.timeZone", "GMT+4")

scala> sql("select from_utc_timestamp(cast(0 as timestamp), 'GMT+8')").show
+---+
|from_utc_timestamp(CAST(0 AS TIMESTAMP), GMT+8)|
+---+
|1970-01-01 12:00:00|
+---+
{code}

so what happened is, `cast(0 as timestamp)` assumes the input integer is 
seconds from unix epoch, and it converts the seconds to microseconds and done, 
no timezone information is needed. According to the semantic of Spark 
timestamp, `cast(0 as timestamp)` is effectively `1970-01-01 08:00:00`(local 
timezone is GMT+8) as an input to `from_utc_timestamp`, and that's why the 
result is `1970-01-01 16:00:00`. And the result changes if the local timezone 
changes.

Since this behavior is consistent with Hive, let's stick with it.

Now the only thing we need to change is: if the input is a string, return null if 
it contains a timezone.

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --
>
> Key: SPARK-23715
> URL: https://issues.apache.org/jira/browse/SPARK-23715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 07:18:23|
> +-------------------+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +-------------------+
> |                 dt|
> +-------------------+
> |2018-03-13 00:18:23|
> +-------------------+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +---+
> | 

[jira] [Commented] (SPARK-21661) SparkSQL can't merge load table from Hadoop

2018-04-26 Thread Li Yuanjian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454070#comment-16454070
 ] 

Li Yuanjian commented on SPARK-21661:
-

Got it.

> SparkSQL can't merge load table from Hadoop
> ---
>
> Key: SPARK-21661
> URL: https://issues.apache.org/jira/browse/SPARK-21661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dapeng Sun
>Assignee: Li Yuanjian
>Priority: Major
> Fix For: 2.3.0
>
>
> Here is the original listing of the external table on HDFS:
> {noformat}
> Permission  Owner  Group       Size   Last Modified          Replication  Block Size  Name
> -rw-r--r--  root   supergroup  0 B    8/6/2017, 11:43:03 PM  3            256 MB      income_band_001.dat
> -rw-r--r--  root   supergroup  0 B    8/6/2017, 11:39:31 PM  3            256 MB      income_band_002.dat
> ...
> -rw-r--r--  root   supergroup  327 B  8/6/2017, 11:44:47 PM  3            256 MB      income_band_530.dat
> {noformat}
> After the SparkSQL load, every input file has a corresponding output file, even 
> the files that are 0 B. For a load on Hive, the data files would be merged 
> according to the data size of the original files.
> Reproduce:
> {noformat}
> CREATE EXTERNAL TABLE t1 (a int,b string)  STORED AS TEXTFILE LOCATION 
> "hdfs://xxx:9000/data/t1"
> CREATE TABLE t2 STORED AS PARQUET AS SELECT * FROM t1;
> {noformat}
> The table t2 has many small files without data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-04-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453991#comment-16453991
 ] 

Steve Loughran commented on SPARK-23151:


Well, there's "have everything work on Hadoop 3" and "release it", so it could 
be kept open, or just made an implicit aspect of SPARK-23534.

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package
> only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication 
> is
> that using up to date Kinesis libraries alongside s3 causes a clash w.r.t
> aws-java-sdk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453952#comment-16453952
 ] 

Apache Spark commented on SPARK-24100:
--

User 'WzRaCai' has created a pull request for this issue:
https://github.com/apache/spark/pull/21167

> Add the CompressionCodec to the saveAsTextFiles interface.
> --
>
> Key: SPARK-24100
> URL: https://issues.apache.org/jira/browse/SPARK-24100
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, PySpark
>Affects Versions: 2.3.0
>Reporter: caijie
>Priority: Minor
>
> Add the CompressionCodec to the saveAsTextFiles interface.
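
An illustrative Scala sketch (not the proposed API) of how compressed text output can be obtained today by dropping to the RDD-level saveAsTextFile overload that already accepts a codec; the path prefix and method name are assumptions.

{code}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.streaming.dstream.DStream

def saveCompressed(stream: DStream[String], prefix: String): Unit = {
  stream.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      // saveAsTextFile already takes a compression codec at the RDD level.
      rdd.saveAsTextFile(s"$prefix-${time.milliseconds}", classOf[GzipCodec])
    }
  }
}
{code}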



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24100:


Assignee: (was: Apache Spark)

> Add the CompressionCodec to the saveAsTextFiles interface.
> --
>
> Key: SPARK-24100
> URL: https://issues.apache.org/jira/browse/SPARK-24100
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, PySpark
>Affects Versions: 2.3.0
>Reporter: caijie
>Priority: Minor
>
> Add the CompressionCodec to the saveAsTextFiles interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.

2018-04-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24100:


Assignee: Apache Spark

> Add the CompressionCodec to the saveAsTextFiles interface.
> --
>
> Key: SPARK-24100
> URL: https://issues.apache.org/jira/browse/SPARK-24100
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, PySpark
>Affects Versions: 2.3.0
>Reporter: caijie
>Assignee: Apache Spark
>Priority: Minor
>
> Add the CompressionCodec to the saveAsTextFiles interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24100) Add the CompressionCodec to the saveAsTextFiles interface.

2018-04-26 Thread caijie (JIRA)
caijie created SPARK-24100:
--

 Summary: Add the CompressionCodec to the saveAsTextFiles interface.
 Key: SPARK-24100
 URL: https://issues.apache.org/jira/browse/SPARK-24100
 Project: Spark
  Issue Type: Improvement
  Components: DStreams, PySpark
Affects Versions: 2.3.0
Reporter: caijie


Add the CompressionCodec to the saveAsTextFiles interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23856) Spark jdbc setQueryTimeout option

2018-04-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453948#comment-16453948
 ] 

Takeshi Yamamuro commented on SPARK-23856:
--

[~dmitrymikhailov] You don't have time to make a pr? If no problem, I'll take 
it over.

> Spark jdbc setQueryTimeout option
> -
>
> Key: SPARK-23856
> URL: https://issues.apache.org/jira/browse/SPARK-23856
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dmitry Mikhailov
>Priority: Minor
>
> It would be nice if a user could set the JDBC setQueryTimeout option when 
> running JDBC in Spark. I think it can be easily implemented by adding an option 
> field to the _JDBCOptions_ class and applying this option when initializing JDBC 
> statements in the _JDBCRDD_ class. But only some DB vendors support this JDBC 
> feature. Is it worth starting work on this option?
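
For reference, the underlying JDBC call the proposed option would surface (plain JDBC sketch; the connection URL and credentials are placeholders):

{code}
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "pass")
try {
  val stmt = conn.createStatement()
  stmt.setQueryTimeout(30)                      // seconds; support varies by driver/vendor
  val rs = stmt.executeQuery("SELECT 1")
  while (rs.next()) println(rs.getInt(1))
} finally {
  conn.close()
}
{code}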



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-26 Thread Tr3wory (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453880#comment-16453880
 ] 

Tr3wory commented on SPARK-23929:
-

Yes, the documentation is a must, even for 2.3 if possible (it's a new feature 
of 2.3, right?).

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4781) Column values become all NULL after doing ALTER TABLE CHANGE for renaming column names (Parquet external table in HiveContext)

2018-04-26 Thread Peter Simon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453861#comment-16453861
 ] 

Peter Simon commented on SPARK-4781:


As commented under SPARK-11748, 

Possible workaround can be:
{code:java}
scala> spark.sql("set parquet.column.index.access=true").show
scala> spark.sql("set spark.sql.hive.convertMetastoreParquet=false").show
scala> spark.sql ("select * from test_parq").show
+----+------+
|a_ch|     b|
+----+------+
|   1|     a|
|   2|test 2|
+----+------+{code}

> Column values become all NULL after doing ALTER TABLE CHANGE for renaming 
> column names (Parquet external table in HiveContext)
> --
>
> Key: SPARK-4781
> URL: https://issues.apache.org/jira/browse/SPARK-4781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
>Reporter: Jianshi Huang
>Priority: Major
>
> I have a table say created like follows:
> {code}
> CREATE EXTERNAL TABLE pmt (
>   `sorted::cre_ts` string
> )
> STORED AS PARQUET
> LOCATION '...'
> {code}
> And I renamed the column from sorted::cre_ts to cre_ts by doing:
> {code}
> ALTER TABLE pmt CHANGE `sorted::cre_ts` cre_ts string
> {code}
> After renaming the column, the values in the column become all NULLs.
> {noformat}
> Before renaming:
> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
> Execute renaming:
> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
> res13: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[972] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> 
> After renaming:
> scala> sql("select cre_ts from pmt limit 1").collect
> res16: Array[org.apache.spark.sql.Row] = Array([null])
> {noformat}
> Jianshi



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11748) Result is null after alter column name of table stored as Parquet

2018-04-26 Thread Peter Simon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453858#comment-16453858
 ] 

Peter Simon commented on SPARK-11748:
-

Possible workaround can be:
{code:java}
scala> spark.sql("set parquet.column.index.access=true").show
scala> spark.sql("set spark.sql.hive.convertMetastoreParquet=false").show
scala> spark.sql ("select * from test_parq").show
+----+------+
|a_ch|     b|
+----+------+
|   1|     a|
|   2|test 2|
+----+------+
{code}
 

> Result is null after alter column name of table stored as Parquet 
> --
>
> Key: SPARK-11748
> URL: https://issues.apache.org/jira/browse/SPARK-11748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: pin_zhang
>Priority: Major
>
> 1. Test with the following code
> hctx.sql(" create table " + table + " (id int, str string) STORED AS 
> PARQUET ")
> val df = hctx.jsonFile("g:/vip.json")
> df.write.format("parquet").mode(SaveMode.Append).saveAsTable(table)
> hctx.sql(" select * from " + table).show()
> // alter table
> val alter = "alter table " + table + " CHANGE id i_d int "
> hctx.sql(alter)
>  
> hctx.sql(" select * from " + table).show()
> 2. Result
> After changing the table column name, data is null for the changed column.
> Result before alter table
> +---+---+
> | id|str|
> +---+---+
> |  1| s1|
> |  2| s2|
> +---+---+
> Result after alter table
> +----+---+
> | i_d|str|
> +----+---+
> |null| s1|
> |null| s2|
> +----+---+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11334) numRunningTasks can't be less than 0, or it will affect executor allocation

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453838#comment-16453838
 ] 

Apache Spark commented on SPARK-11334:
--

User 'sadhen' has created a pull request for this issue:
https://github.com/apache/spark/pull/21166

> numRunningTasks can't be less than 0, or it will affect executor allocation
> ---
>
> Key: SPARK-11334
> URL: https://issues.apache.org/jira/browse/SPARK-11334
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: meiyoula
>Assignee: Sital Kedia
>Priority: Major
> Fix For: 2.3.0
>
>
> With the *Dynamic Allocation* feature, when a task has failed more than *maxFailure* 
> times, all the dependent jobs, stages, and tasks will be killed or aborted. In this 
> process, the *SparkListenerTaskEnd* event can arrive after *SparkListenerStageCompleted* 
> and *SparkListenerJobEnd*, as in the event log below:
> {code}
> {"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":20,"Stage 
> Attempt ID":0,"Stage Name":"run at AccessController.java:-2","Number of 
> Tasks":200}
> {"Event":"SparkListenerJobEnd","Job ID":9,"Completion Time":1444914699829}
> {"Event":"SparkListenerTaskEnd","Stage ID":20,"Stage Attempt ID":0,"Task 
> Type":"ResultTask","Task End Reason":{"Reason":"TaskKilled"},"Task 
> Info":{"Task ID":1955,"Index":88,"Attempt":2,"Launch 
> Time":1444914699763,"Executor 
> ID":"5","Host":"linux-223","Locality":"PROCESS_LOCAL","Speculative":false,"Getting
>  Result Time":0,"Finish Time":1444914699864,"Failed":true,"Accumulables":[]}}
> {code}
> Because of that, *numRunningTasks* in the *ExecutorAllocationManager* class can 
> drop below 0, which affects executor allocation.
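
As an illustration of the fix direction only (hypothetical code, not the actual ExecutorAllocationManager), tracking running tasks per stage and dropping a stage's bookkeeping when it completes keeps late SparkListenerTaskEnd events from pushing the global count below zero:

{code:scala}
import scala.collection.mutable

// Hypothetical sketch: running tasks are tracked per stage, and a completed
// stage's entry is dropped so that straggler task-end events for it can no
// longer distort the global count.
class RunningTaskTracker {
  private val runningTasksPerStage = mutable.Map.empty[Int, Int]

  def onTaskStart(stageId: Int): Unit = synchronized {
    runningTasksPerStage(stageId) = runningTasksPerStage.getOrElse(stageId, 0) + 1
  }

  def onTaskEnd(stageId: Int): Unit = synchronized {
    runningTasksPerStage.get(stageId).foreach { n =>
      if (n <= 1) runningTasksPerStage.remove(stageId)
      else runningTasksPerStage(stageId) = n - 1
    }
  }

  def onStageCompleted(stageId: Int): Unit = synchronized {
    runningTasksPerStage.remove(stageId)
  }

  // Never negative, by construction.
  def totalRunningTasks: Int = synchronized(runningTasksPerStage.values.sum)
}
{code}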



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore

2018-04-26 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453825#comment-16453825
 ] 

Steve Loughran commented on SPARK-13446:


[~Tavis]: can you paste in the stack trace you see?

> Spark need to support reading data from Hive 2.0.0 metastore
> 
>
> Key: SPARK-13446
> URL: https://issues.apache.org/jira/browse/SPARK-13446
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Lifeng Wang
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.2.0
>
>
> Spark provides the HiveContext class to read data from the Hive metastore directly, 
> but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been released, it 
> would be better to upgrade to support Hive 2.0.0.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI 
> thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current 
> connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
> at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
> at 
> org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
> at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}
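
For reference, on releases that include this fix, talking to a newer metastore is mostly a configuration exercise; a minimal sketch, assuming the Hive 2.0 client jars are resolvable (the metastore URI is taken from the log above, everything else is a placeholder):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: spark.sql.hive.metastore.version/jars must match the Hive client
// jars actually available via the configured classpath or Maven.
val spark = SparkSession.builder()
  .appName("hive-2.x-metastore")
  .config("hive.metastore.uris", "thrift://hsw-node13:9083")
  .config("spark.sql.hive.metastore.version", "2.0.0")
  .config("spark.sql.hive.metastore.jars", "maven")  // or a classpath of Hive 2.0 jars
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}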



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23885) trying to spark submit 2.3.0 on minikube

2018-04-26 Thread anant pukale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453803#comment-16453803
 ] 

anant pukale commented on SPARK-23885:
--

Hi Anirudh,

Kindly suggest any workaround.

 

Thanks

Anant

> trying to spark submit 2.3.0 on minikube
> 
>
> Key: SPARK-23885
> URL: https://issues.apache.org/jira/browse/SPARK-23885
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Submit
>Affects Versions: 2.3.0
>Reporter: anant pukale
>Assignee: Anirudh Ramanathan
>Priority: Major
>
> spark-submit on minikube (Kubernetes) is failing.
> Kindly refer to the link below for details:
>  
> [https://stackoverflow.com/questions/49689298/exception-in-thread-main-org-apache-spark-sparkexception-must-specify-the-dri|http://example.com]
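
Not a fix for the underlying issue, but the error in the linked question typically means that no container image reaches the submission client; a hedged sketch of a programmatic submission via the standard SparkLauncher API (every value below is a placeholder):

{code:scala}
import org.apache.spark.launcher.SparkLauncher

// Sketch only: the API server address, image name, and jar path are placeholders.
val submission = new SparkLauncher()
  .setMaster("k8s://https://192.168.99.100:8443")  // minikube API server
  .setDeployMode("cluster")
  .setMainClass("org.apache.spark.examples.SparkPi")
  .setAppResource("local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar")
  // Without a container image the submission fails before any pod is created.
  .setConf("spark.kubernetes.container.image", "my-registry/spark:2.3.0")
  .setConf("spark.executor.instances", "2")
  .launch()

submission.waitFor()
{code}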



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24099) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data in JSON

2018-04-26 Thread ABHISHEK KUMAR GUPTA (JIRA)
ABHISHEK KUMAR GUPTA created SPARK-24099:


 Summary: java.io.CharConversionException: Invalid UTF-32 character 
prevents me from querying my data in JSON
 Key: SPARK-24099
 URL: https://issues.apache.org/jira/browse/SPARK-24099
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: OS: SUSE 11

Spark Version: 2.3

 
Reporter: ABHISHEK KUMAR GUPTA


Steps:
 # Launch spark-sql --master yarn
 # create table json(name STRING, age int, gender string, id INT) using 
org.apache.spark.sql.json options(path "hdfs:///user/testdemo/");
 # Execute the below SQL queries 
INSERT into json
SELECT 'Shaan',21,'Male',1
UNION ALL
SELECT 'Xing',20,'Female',11
UNION ALL
SELECT 'Mile',4,'Female',20
UNION ALL
SELECT 'Malan',10,'Male',9;
 # Select * from json;

This throws the exception below:

Caused by: *java.io.CharConversionException: Invalid UTF-32 character* 
0x151a15(above 10) at char #1, byte #7)
 at 
com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
 at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
 at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
 at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017)
 at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577)
 at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:350)
 at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:347)
 at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2585)
 at 
org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:347)
 at 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:126)
 at 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:126)
 at 
org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
 at 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:130)
 at 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:130)
 at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 

Note:

https://issues.apache.org/jira/browse/SPARK-16548 was raised against 1.6 and is marked 
as fixed in 2.3, but I am still getting the same error.

Please update on this.
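
Independent of any fix, a way to narrow down which input tripped Jackson's charset auto-detection is to read the directory as plain text, so no JSON parsing is involved; a sketch, using the path from the table definition above:

{code:scala}
import org.apache.spark.sql.functions.input_file_name

// Read the table location as raw text and keep track of the source file,
// to spot files whose leading bytes look like UTF-32 to Jackson.
val raw = spark.read.text("hdfs:///user/testdemo/")
  .withColumn("file", input_file_name())

raw.show(20, false)
{code}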

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24098) ScriptTransformationExec should wait process exiting before output iterator finish

2018-04-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453716#comment-16453716
 ] 

Apache Spark commented on SPARK-24098:
--

User 'liutang123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21164

> ScriptTransformationExec should wait process exiting before output iterator 
> finish
> --
>
> Key: SPARK-24098
> URL: https://issues.apache.org/jira/browse/SPARK-24098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
> Environment: Spark Version: 2.2.1
> Hadoop Version: 2.7.1
>  
>Reporter: Lijia Liu
>Priority: Critical
>
> In our Spark cluster, some users found that Spark may lose data when they use 
> TRANSFORM in SQL.
> We checked the output files and discovered that some of them are empty.
> Then we checked the executor logs and found exceptions like the following:
>  
>  
> {code:java}
> 18/04/19 03:33:03 ERROR SparkUncaughtExceptionHandler: [Container in 
> shutdown] Uncaught exception in thread 
> Thread[Thread-ScriptTransformation-Feed,5,main]
> java.io.IOException: Broken pipe
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:326)
> at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at 
> org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(ScriptTransformationExec.scala:341)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$2.apply(ScriptTransformationExec.scala:317)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformationExec.scala:317)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:306)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:306)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1952)
> at 
> org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:306)
> 18/04/19 03:33:03 INFO FileOutputCommitter: Saved output of task 
> 'attempt_20180419033229_0049_m_01_0' to 
> viewfs://hadoop-meituan/ghnn07/warehouse/mart_waimai_crm.db/.hive-staging_hive_2018-04-19_03-18-12_389_7869324671857417311-1/-ext-1/_temporary/0/task_20180419033229_0049_m_01
> 18/04/19 03:33:03 INFO SparkHadoopMapRedUtil: 
> attempt_20180419033229_0049_m_01_0: Committed
> 18/04/19 03:33:03 INFO Executor: Finished task 1.0 in stage 49.0 (TID 9843). 
> 17342 bytes result sent to driver{code}
>  
> The ScriptTransformation-Feed thread failed, but the task finished successfully.
> Finally, we analysed the ScriptTransformationExec class and found the following: 
> when the feed thread does not set its _exception variable and the process has 
> not exited completely, the output iterator returns false from its hasNext function.
>  
> To reproduce the bug:
> 1. Add Thread._sleep_(1000 * 600) before the assignment to _exception.
> 2. Create a Python script that throws an exception, like the following:
> test.py
>  
> {code:java}
> import sys 
> for line in sys.stdin:   
>   raise Exception('error') 
>   print line
> {code}
> 3. Use the script created in step 2 in a transform:
> {code:java}
> spark-sql>add files /path_to/test.py;
> spark-sql>select transform(id) using 'python test.py' as id from city;
> {code}
> The result is that Spark finishes successfully even though data was lost.
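
The invariant the report asks for can be sketched outside Spark (generic, hypothetical code, not ScriptTransformationExec itself): after draining the child's output, the consumer must also wait for the process to exit and surface a non-zero exit code instead of treating an empty stream as success.

{code:scala}
import scala.io.Source

// Generic illustration of the "wait before declaring success" pattern.
val proc = new ProcessBuilder("python", "test.py").start()

// (A feeder thread would write rows to proc.getOutputStream here.)
val output = Source.fromInputStream(proc.getInputStream).getLines().toList

// Crucial step: do not finish the iterator until the child has exited cleanly;
// otherwise a crashed script is indistinguishable from an empty result.
val exitCode = proc.waitFor()
if (exitCode != 0) {
  throw new IllegalStateException(s"transform script failed with exit code $exitCode")
}
{code}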



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


