[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463391#comment-16463391 ] Hyukjin Kwon commented on SPARK-23291: -- [~felixcheung] WDYT? > SparkR : substr : In SparkR dataframe , starting and ending position > arguments in "substr" is giving wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0 >Reporter: Narendra >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > Defect Description : > - > For example ,an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a a new column named "col2" with the value "12" > which is inside the string ."12" can be extracted with "starting position" as > "6" and "Ending position" as "7" > (the starting position of the first character is considered as "1" ) > But,the current code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,7,8))) > Observe that the first argument in the "substr" API , which indicates the > 'starting position', is mentioned as "7" > Also, observe that the second argument in the "substr" API , which indicates > the 'ending position', is mentioned as "8" > i.e the number that should be mentioned to indicate the position should be > the "actual position + 1" > Expected behavior : > > The code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,6,7))) > Note : > --- > This defect is observed with only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24176) The hdfs file path with wildcard can not be identified when loading data
ABHISHEK KUMAR GUPTA created SPARK-24176: Summary: The hdfs file path with wildcard can not be identified when loading data Key: SPARK-24176 URL: https://issues.apache.org/jira/browse/SPARK-24176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: OS: SUSE11 Spark Version:2.3 Reporter: ABHISHEK KUMAR GUPTA # Launch spark-sql # create table wild1 (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) row format delimited fields terminated by ',' stored as textfile; # loaded data in table as below and it failed some cases not consistent # load data inpath '/user/testdemo1/user1/?ype* ' into table wild1; - Success load data inpath '/user/testdemo1/user1/t??eddata60.txt' into table wild1; - *Failed* load data inpath '/user/testdemo1/user1/?ypeddata60.txt' into table wild1; - Success Exception as below > load data inpath '/user/testdemo1/user1/t??eddata61.txt' into table wild1; 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_database: one 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com ip=unknown-ip-addr cmd=get_database: one 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1 *Error in query: LOAD DATA input path does not exist: /user/testdemo1/user1/t??eddata61.txt;* spark-sql> Behavior is not consistent. Need to fix with all combination of wild card char as it is not consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
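One way to narrow down whether the inconsistency comes from path resolution itself is to expand each wildcard pattern directly with Hadoop's glob API, outside of LOAD DATA. A minimal spark-shell sketch, assuming the same HDFS paths as in the reproduction steps above:

{code:scala}
// Hedged sketch: expand each wildcard pattern with Hadoop's glob API to see which
// files it actually matches, independently of LOAD DATA. Paths are taken from the
// reproduction steps above and are assumptions about the test cluster layout.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

Seq("/user/testdemo1/user1/?ype*",
    "/user/testdemo1/user1/t??eddata60.txt",
    "/user/testdemo1/user1/?ypeddata60.txt",
    "/user/testdemo1/user1/t??eddata61.txt").foreach { pattern =>
  val matches = fs.globStatus(new Path(pattern))
  val names = if (matches == null) Seq.empty else matches.map(_.getPath.getName).toSeq
  println(s"$pattern -> ${names.mkString(", ")}")
}
{code}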
[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463369#comment-16463369 ] Wenchen Fan commented on SPARK-23291: - Yea I think we should backport it. > SparkR : substr : In SparkR dataframe , starting and ending position > arguments in "substr" is giving wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0 >Reporter: Narendra >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > Defect Description : > - > For example ,an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a a new column named "col2" with the value "12" > which is inside the string ."12" can be extracted with "starting position" as > "6" and "Ending position" as "7" > (the starting position of the first character is considered as "1" ) > But,the current code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,7,8))) > Observe that the first argument in the "substr" API , which indicates the > 'starting position', is mentioned as "7" > Also, observe that the second argument in the "substr" API , which indicates > the 'ending position', is mentioned as "8" > i.e the number that should be mentioned to indicate the position should be > the "actual position + 1" > Expected behavior : > > The code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,6,7))) > Note : > --- > This defect is observed with only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463359#comment-16463359 ] Michael Ransley commented on SPARK-22918: - I think this relates to https://issues.apache.org/jira/browse/DERBY-6648 which was change made in Derby 10.12.1.1. To quote: {quote}Users who run Derby under a SecurityManager must edit the policy file and grant the following additional permission to derby.jar, derbynet.jar, and derbyoptionaltools.jar: permission org.apache.derby.security.SystemPermission "engine", "usederbyinternals";{quote} > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot >Priority: Major > > After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started > to fail with following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325) > at > 
org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreMana
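Concretely, the Derby note quoted above corresponds to a policy-file grant of roughly the following shape. Only the permission line comes from the Derby documentation quoted above; the codeBase URL is a placeholder and has to point at wherever derby.jar (and likewise derbynet.jar and derbyoptionaltools.jar) actually lives in the build:

{code:none}
// Hypothetical codeBase path; adjust to the real location of derby.jar in your build.
grant codeBase "file:/path/to/derby.jar" {
  permission org.apache.derby.security.SystemPermission "engine", "usederbyinternals";
};
{code}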
[jira] [Updated] (SPARK-24174) Expose Hadoop config as part of /environment API
[ https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov updated SPARK-24174: Affects Version/s: (was: 2.3.0) 2.1.0 > Expose Hadoop config as part of /environment API > > > Key: SPARK-24174 > URL: https://issues.apache.org/jira/browse/SPARK-24174 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Nikolay Sokolov >Priority: Minor > Labels: features, usability > > Currently, /environment API call exposes only system properties and > SparkConf. However, in some cases when Spark is used in conjunction with > Hadoop, it is useful to know Hadoop configuration properties. For example, > HDFS or GS buffer sizes, hive metastore settings, and so on. > So it would be good to have hadoop properties being exposed in /environment > API, for example: > {code:none} > GET .../application_1525395994996_5/environment > { >"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} >"sparkProperties": ["java.io.tmpdir","/tmp", ...], >"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], > ...], >"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System > Classpath"], ...], >"hadoopProperties": [["dfs.stream-buffer-size": 4096], ...], > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24170) [Spark SQL] json file format is not dropped after dropping table
[ https://issues.apache.org/jira/browse/SPARK-24170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463336#comment-16463336 ] ABHISHEK KUMAR GUPTA commented on SPARK-24170: -- I think this behavior need to re look and can be taken as improvement for next version. I feel it should json file should also be deleted. > [Spark SQL] json file format is not dropped after dropping table > > > Key: SPARK-24170 > URL: https://issues.apache.org/jira/browse/SPARK-24170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > # Launch spark-sql --master yarn > # create table json(name STRING, age int, gender string, id INT) using > org.apache.spark.sql.json options(path "hdfs:///user/testdemo/"); > # Execute the below SQL queries > INSERT into json > SELECT 'Shaan',21,'Male',1 > UNION ALL > SELECT 'Xing',20,'Female',11 > UNION ALL > SELECT 'Mile',4,'Female',20 > UNION ALL > SELECT 'Malan',10,'Male',9; > Below 4 json file format created > BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls > /user/testdemo > Found 14 items > -rw-r--r-- 3 spark hadoop 0 2018-04-26 17:44 /user/testdemo/_SUCCESS > -rw-r--r-- 3 spark hadoop 4802 2018-04-24 18:20 /user/testdemo/customer1.csv > -rw-r--r-- 3 spark hadoop 92 2018-04-26 17:02 /user/testdemo/json1.txt > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-0-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-0-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:32 > /user/testdemo/part-1-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:44 > /user/testdemo/part-1-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:32 > /user/testdemo/part-2-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:44 > /user/testdemo/part-2-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-3-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-3-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > > Issue is: > Now executed below drop command > spark-sql> drop table json; > > Table dropped successfully but json file still present in the path > /user/testdemo -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
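One thing worth checking before treating this purely as a defect (an assumption about the cause, not a confirmed diagnosis): a table created with an explicit path option is normally registered as an external table, and dropping an external table removes only the metadata, not the files under that path. The registered table type can be inspected as sketched below (table name taken from the steps above):

{code:scala}
// Hedged check: DESCRIBE FORMATTED shows whether the table was registered as
// MANAGED or EXTERNAL. For EXTERNAL tables, DROP TABLE leaves the data files in place.
spark.sql("DESCRIBE FORMATTED json").show(100, truncate = false)
{code}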
[jira] [Updated] (SPARK-24175) improve the Spark 2.4 migration guide
[ https://issues.apache.org/jira/browse/SPARK-24175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24175: Description: The current Spark 2.4 migration guide is not well phrased. We should 1. State the before behavior 2. State the after behavior 3. Add a concrete example with code to illustrate. For example: Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0. --> In version 2.3 and earlier, Spark implicitly casts a timestamp column to date type when comparing with a date column. In version 2.4 and later, Spark casts the date column to timestamp type instead. As an example, "xxx" would result in ".." in Spark 2.3, and in Spark 2.4, the result would be "..." was:The current Spark 2.4 migration guide is not well phrased. > improve the Spark 2.4 migration guide > - > > Key: SPARK-24175 > URL: https://issues.apache.org/jira/browse/SPARK-24175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > The current Spark 2.4 migration guide is not well phrased. We should > 1. State the before behavior > 2. State the after behavior > 3. Add a concrete example with code to illustrate. > For example: > Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after > promotes both sides to TIMESTAMP. To set `false` to > `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous > behavior. This option will be removed in Spark 3.0. > --> > In version 2.3 and earlier, Spark implicitly casts a timestamp column to date > type when comparing with a date column. In version 2.4 and later, Spark casts > the date column to timestamp type instead. As an example, "xxx" would result > in ".." in Spark 2.3, and in Spark 2.4, the result would be "..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
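For completeness, the legacy flag named in the text above can be toggled from a running session; a one-line sketch, assuming the flag name quoted from the guide is accurate:

{code:scala}
// Restores the pre-2.4 DATE vs TIMESTAMP comparison behavior; the guide text above
// notes this flag is slated for removal in Spark 3.0.
spark.conf.set("spark.sql.hive.compareDateTimestampInTimestamp", "false")
{code}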
[jira] [Created] (SPARK-24175) improve the Spark 2.4 migration guide
Wenchen Fan created SPARK-24175: --- Summary: improve the Spark 2.4 migration guide Key: SPARK-24175 URL: https://issues.apache.org/jira/browse/SPARK-24175 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24175) improve the Spark 2.4 migration guide
[ https://issues.apache.org/jira/browse/SPARK-24175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24175: Description: The current Spark 2.4 migration guide is not well phrased. > improve the Spark 2.4 migration guide > - > > Key: SPARK-24175 > URL: https://issues.apache.org/jira/browse/SPARK-24175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > The current Spark 2.4 migration guide is not well phrased. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463280#comment-16463280 ] Hyukjin Kwon commented on SPARK-23291: -- [~cloud_fan], do you feel we should backport this? I was looking through PRs for backporting, and it seems we had better backport it with a note in the migration guide. I think it was just a bad design mistake; it doesn't even match R's own substr. The workaround should be relatively easy if users are aware of this change. If you also feel it should be backported, I would rather backport this. > SparkR : substr : In SparkR dataframe , starting and ending position > arguments in "substr" is giving wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0 >Reporter: Narendra >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > Defect Description : > - > For example ,an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a a new column named "col2" with the value "12" > which is inside the string ."12" can be extracted with "starting position" as > "6" and "Ending position" as "7" > (the starting position of the first character is considered as "1" ) > But,the current code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,7,8))) > Observe that the first argument in the "substr" API , which indicates the > 'starting position', is mentioned as "7" > Also, observe that the second argument in the "substr" API , which indicates > the 'ending position', is mentioned as "8" > i.e the number that should be mentioned to indicate the position should be > the "actual position + 1" > Expected behavior : > > The code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,6,7))) > Note : > --- > This defect is observed with only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24167) ParquetFilters should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24167. - Resolution: Fixed Fix Version/s: 2.4.0 > ParquetFilters should not access SQLConf at executor side > - > > Key: SPARK-24167 > URL: https://issues.apache.org/jira/browse/SPARK-24167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24167) ParquetFilters should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24167: Affects Version/s: (was: 2.3.0) 2.4.0 > ParquetFilters should not access SQLConf at executor side > - > > Key: SPARK-24167 > URL: https://issues.apache.org/jira/browse/SPARK-24167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24147) .count() reports wrong size of dataframe when filtering dataframe on corrupt record field
[ https://issues.apache.org/jira/browse/SPARK-24147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24147. -- Resolution: Duplicate > .count() reports wrong size of dataframe when filtering dataframe on corrupt > record field > - > > Key: SPARK-24147 > URL: https://issues.apache.org/jira/browse/SPARK-24147 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.2.1 > Environment: Spark version 2.2.1 > Pyspark > Python version 3.6.4 >Reporter: Rich Smith >Priority: Major > > Spark reports the wrong size of dataframe using .count() after filtering on a > corruptField field. > Example file that shows the problem: > > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import StringType, StructType, StructField, DoubleType > from pyspark.sql.functions import col, lit > spark = > SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate() > spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") > SCHEMA = StructType([ > StructField("headerDouble", DoubleType(), False), > StructField("ErrorField", StringType(), False) > ]) > dataframe = ( > spark.read > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "ErrorField") > .schema(SCHEMA).csv("./x.csv") > ) > total_row_count = dataframe.count() > print("total_row_count = " + str(total_row_count)) > errors = dataframe.filter(col("ErrorField").isNotNull()) > errors.show() > error_count = errors.count() > print("errors count = " + str(error_count)) > {code} > > > Using input file x.csv: > > {code:java} > headerDouble > wrong > {code} > > > Output text. As shown, contents of dataframe contains a row, but .count() > reports 0. > > {code:java} > total_row_count = 1 > ++--+ > |headerDouble|ErrorField| > ++--+ > |null| wrong| > ++--+ > errors count = 0 > {code} > > > Also discussed briefly on StackOverflow: > [https://stackoverflow.com/questions/50121899/how-can-sparks-count-function-be-different-to-the-contents-of-the-dataframe] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24174) Expose Hadoop config as part of /environment API
Nikolay Sokolov created SPARK-24174: --- Summary: Expose Hadoop config as part of /environment API Key: SPARK-24174 URL: https://issues.apache.org/jira/browse/SPARK-24174 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 2.3.0 Reporter: Nikolay Sokolov Currently, /environment API call exposes only system properties and SparkConf. However, in some cases when Spark is used in conjunction with Hadoop, it is useful to know Hadoop configuration properties. For example, HDFS or GS buffer sizes, hive metastore settings, and so on. So it would be good to have hadoop properties being exposed in /environment API, for example: {code:none} GET .../application_1525395994996_5/environment { "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} "sparkProperties": ["java.io.tmpdir","/tmp", ...], "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], ...], "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System Classpath"], ...], "hadoopProperties": [["dfs.stream-buffer-size": 4096], ...], } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
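Until something like the proposed hadoopProperties section exists, the same information can be pulled from inside the application itself; a hedged Scala sketch (the property names are only examples taken from the description above):

{code:scala}
// Hedged workaround sketch: read Hadoop properties from the driver's hadoopConfiguration,
// which is not currently exposed through the /environment REST endpoint.
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hadoop-conf-dump").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// A couple of properties of interest (names are examples from the description above).
Seq("dfs.stream-buffer-size", "hive.metastore.uris").foreach { key =>
  println(s"$key = ${Option(hadoopConf.get(key)).getOrElse("<unset>")}")
}

// Or dump everything, roughly what a "hadoopProperties" section would contain.
hadoopConf.iterator().asScala.foreach(e => println(s"${e.getKey} = ${e.getValue}"))
{code}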
[jira] [Resolved] (SPARK-24168) WindowExec should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24168. - Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 > WindowExec should not access SQLConf at executor side > - > > Key: SPARK-24168 > URL: https://issues.apache.org/jira/browse/SPARK-24168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.1, 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23489) Flaky Test: HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-23489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23489: Fix Version/s: 2.2.2 > Flaky Test: HiveExternalCatalogVersionsSuite > > > Key: SPARK-23489 > URL: https://issues.apache.org/jira/browse/SPARK-23489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Marco Gaido >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > I saw this error in an unrelated PR. It seems to me a bad configuration in > the Jenkins node where the tests are run. > {code} > Error Message > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Cannot run program > "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, > No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file > or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > {code} > This is the link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87615/testReport/. 
> *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389 > *BRANCH 2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/321/ > *NOTE: This failure frequently looks as `Test Result (no failures)`* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24163) Support "ANY" or sub-query for Pivot "IN" clause
[ https://issues.apache.org/jira/browse/SPARK-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463219#comment-16463219 ] Xiao Li commented on SPARK-24163: - This is just nice to have. > Support "ANY" or sub-query for Pivot "IN" clause > > > Key: SPARK-24163 > URL: https://issues.apache.org/jira/browse/SPARK-24163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maryann Xue >Priority: Major > > This is part of a functionality extension to Pivot SQL support as SPARK-24035. > Currently, only literal values are allowed in Pivot "IN" clause. To support > ANY or a sub-query in the "IN" clause (the examples of which provided below), > we need to enable evaluation of a sub-query before/during query analysis time. > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ANY > );{code} > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ( > SELECT course FROM courses > WHERE region = 'AZ' > ) > ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24164) Support column list as the pivot column in Pivot
[ https://issues.apache.org/jira/browse/SPARK-24164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24164: --- Assignee: Maryann Xue > Support column list as the pivot column in Pivot > > > Key: SPARK-24164 > URL: https://issues.apache.org/jira/browse/SPARK-24164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Major > > This is part of a functionality extension to Pivot SQL support as SPARK-24035. > Currently, we only support a single column as the pivot column, while a > column list as the pivot column would look like: > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR (course, year) IN (('dotNET', 2012), ('Java', 2013)) > );{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24162) Support aliased literal values for Pivot "IN" clause
[ https://issues.apache.org/jira/browse/SPARK-24162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24162: --- Assignee: Maryann Xue > Support aliased literal values for Pivot "IN" clause > > > Key: SPARK-24162 > URL: https://issues.apache.org/jira/browse/SPARK-24162 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > > This is part of a functionality extension to Pivot SQL support as SPARK-24035. > When literal values are specified in Pivot IN clause, it would be nice to > allow aliases for those values so that the output column names can be > customized. For example: > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET' as c1, 'Java' as c2) > );{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24035) SQL syntax for Pivot
[ https://issues.apache.org/jira/browse/SPARK-24035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24035. - Resolution: Fixed Fix Version/s: 2.4.0 > SQL syntax for Pivot > > > Key: SPARK-24035 > URL: https://issues.apache.org/jira/browse/SPARK-24035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Maryann Xue >Priority: Major > Fix For: 2.4.0 > > > Some users who are SQL experts but don’t know an ounce of Scala/Python or R. > Thus, we prefer to supporting the SQL syntax for Pivot too -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463209#comment-16463209 ] Jungtaek Lim commented on SPARK-23703: -- Agreed. Is it worth discussing on the dev mailing list, or can we simply propose a patch for the fix? > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463190#comment-16463190 ] Jose Torres commented on SPARK-23703: - No, I don't know of any actual use cases for this. I think just writing an analyzer rule disallowing it could be a valid resolution here. > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463189#comment-16463189 ] Jungtaek Lim commented on SPARK-23703: -- Actually, I haven't heard about multiple watermarks on the same source, which is what makes things complicated. What I have heard of is an event-time window with a single time field, and a watermark on that field. Do you have, or have you heard of, actual use cases for this? > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
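For reference, the plan shape the ticket describes can be produced with two back-to-back withWatermark calls; a hedged sketch (the rate source and the column choice are assumptions, not taken from the ticket):

{code:scala}
// Hedged sketch of a query that yields two sequential EventTimeWatermark nodes:
// two withWatermark calls with no stateful operator in between. The topmost call
// overrides the event-time metadata set by the first, but both nodes stay in the plan.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sequential-watermarks").getOrCreate()

val events = spark.readStream
  .format("rate")            // built-in test source; provides a `timestamp` column
  .load()

val doubled = events
  .withWatermark("timestamp", "10 minutes")
  .withWatermark("timestamp", "5 minutes")

// Inspect the analyzed plan to see both EventTimeWatermark nodes.
println(doubled.queryExecution.analyzed.numberedTreeString)
{code}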
[jira] [Comment Edited] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463003#comment-16463003 ] thua...@fsoft.com.vn edited comment on SPARK-13446 at 5/3/18 11:10 PM: --- Hi Tavis and Spark Team, Tavis explanation is very clear. As I read in the other thread, this bug is fixed and fully tested by QA team. Could someone please help to look into the merging the code. I am running into the same problem in pyspark. Bellow my pyspark submit stack trace: py4j.protocol.Py4JJavaError: An error occurred while calling o24.sessionState. : java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:200) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:265) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at scala.Option.getOrElse(Option.scala:121) was (Author: thua...@fsoft.com.vn): I am running into the same problem in pyspark. Could anyone help resolve this issue, please? Bellow my pyspark submit stack trace: py4j.protocol.Py4JJavaError: An error occurred while calling o24.sessionState. 
: java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:200) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:265) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at scala.Option.getOrElse(Option.scala:121) > Spark need to support reading data from Hive 2
[jira] [Commented] (SPARK-21824) DROP TABLE should automatically drop any dependent referential constraints or raise error.
[ https://issues.apache.org/jira/browse/SPARK-21824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463054#comment-16463054 ] Sunitha Kambhampati commented on SPARK-21824: - I'll look into this. > DROP TABLE should automatically drop any dependent referential constraints > or raise error. > > > Key: SPARK-21824 > URL: https://issues.apache.org/jira/browse/SPARK-21824 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Suresh Thalamati >Priority: Major > > DROP TABLE should raise error if there are any dependent referential > constraints unless user specifies CASCADE CONSTRAINTS > Syntax : > {code:sql} > DROP TABLE [CASCADE CONSTRAINTS] > {code} > Hive drops the referential constraints automatically. Oracle requires user > specify _CASCADE CONSTRAINTS_ clause to automatically drop the referential > constraints, otherwise raises the error. Should we stick to the *Hive > behavior* ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463003#comment-16463003 ] thua...@fsoft.com.vn commented on SPARK-13446: -- I am running into the same problem in pyspark. Could anyone help resolve this issue, please? Bellow my pyspark submit stack trace: py4j.protocol.Py4JJavaError: An error occurred while calling o24.sessionState. : java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:200) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:265) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at scala.Option.getOrElse(Option.scala:121) > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li >Priority: Major > Fix For: 2.2.0 > > > Spark provided HIveContext class to read data from hive metastore directly. > While it only supports hive 1.2.1 version and older. Since hive 2.0.0 has > released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. 
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
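As a side note for anyone hitting the stack trace above: on a Spark release that supports the target metastore version, the usual way to talk to a newer Hive metastore is through the isolated metastore-client settings rather than by swapping Hive jars on Spark's own classpath. A hedged sketch, where the version string and jar source are assumptions:

{code:scala}
// Hedged sketch: point Spark's isolated Hive client at a 2.x metastore.
// "maven" tells Spark to download the matching Hive client jars; a local classpath
// of Hive jars can be supplied instead.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2-metastore")
  .config("spark.sql.hive.metastore.version", "2.0.0")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}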
[jira] [Commented] (SPARK-23489) Flaky Test: HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-23489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462971#comment-16462971 ] Apache Spark commented on SPARK-23489: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/21232 > Flaky Test: HiveExternalCatalogVersionsSuite > > > Key: SPARK-23489 > URL: https://issues.apache.org/jira/browse/SPARK-23489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Marco Gaido >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I saw this error in an unrelated PR. It seems to me a bad configuration in > the Jenkins node where the tests are run. > {code} > Error Message > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Cannot run program > "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, > No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file > or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > {code} > This is the link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87615/testReport/. 
> *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389 > *BRANCH 2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/321/ > *NOTE: This failure frequently looks as `Test Result (no failures)`* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24119) Add interpreted execution to SortPrefix expression
[ https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24119: Assignee: (was: Apache Spark) > Add interpreted execution to SortPrefix expression > -- > > Key: SPARK-24119 > URL: https://issues.apache.org/jira/browse/SPARK-24119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > [~hvanhovell] [~kiszk] > I noticed SortPrefix did not support interpreted execution when I was testing > the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for > adding interpreted execution (SPARK-23580) > Since I had to implement interpreted execution for SortPrefix to complete > testing, I am creating this Jira. If there's no good reason why eval wasn't > implemented, I will make the PR in a few days. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24119) Add interpreted execution to SortPrefix expression
[ https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462939#comment-16462939 ] Apache Spark commented on SPARK-24119: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/21231 > Add interpreted execution to SortPrefix expression > -- > > Key: SPARK-24119 > URL: https://issues.apache.org/jira/browse/SPARK-24119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > [~hvanhovell] [~kiszk] > I noticed SortPrefix did not support interpreted execution when I was testing > the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for > adding interpreted execution (SPARK-23580) > Since I had to implement interpreted execution for SortPrefix to complete > testing, I am creating this Jira. If there's no good reason why eval wasn't > implemented, I will make the PR in a few days. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24119) Add interpreted execution to SortPrefix expression
[ https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24119: Assignee: Apache Spark > Add interpreted execution to SortPrefix expression > -- > > Key: SPARK-24119 > URL: https://issues.apache.org/jira/browse/SPARK-24119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > [~hvanhovell] [~kiszk] > I noticed SortPrefix did not support interpreted execution when I was testing > the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for > adding interpreted execution (SPARK-23580) > Since I had to implement interpreted execution for SortPrefix to complete > testing, I am creating this Jira. If there's no good reason why eval wasn't > implemented, I will make the PR in a few days. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24173) Flaky Test: VersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-24173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24173: -- Description: *BRANCH-2.2* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/ *BRANCH-2.3* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ was:- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ > Flaky Test: VersionsSuite > - > > Key: SPARK-24173 > URL: https://issues.apache.org/jira/browse/SPARK-24173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > *BRANCH-2.2* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/ > *BRANCH-2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12764) XML Column type is not supported (JDBC connection to Postgres)
[ https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462827#comment-16462827 ] Martin Tapp commented on SPARK-12764: - Using Spark 2.1.1 and can't load a table with a JSONB column even if I'm explicitly doing a select ignoring the JSONB column. Thanks > XML Column type is not supported (JDBC connection to Postgres) > -- > > Key: SPARK-12764 > URL: https://issues.apache.org/jira/browse/SPARK-12764 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 1.6.0 > Environment: Mac Os X El Capitan >Reporter: Rajeshwar Gaini >Priority: Major > > Hi All, > I am using PostgreSQL database. I am using the following jdbc call to access > a customer table (customer_id int, event text, country text, content xml) in > my database. > {code} > val dataframe1 = sqlContext.load("jdbc", Map("url" -> > "jdbc:postgresql://localhost/customerlogs?user=postgres&password=postgres", > "dbtable" -> "customer")) > {code} > When i run above command in spark-shell i receive the following error. > {code} > java.sql.SQLException: Unsupported type > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91) > at > org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119) > at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:34) > at $iwC$$iwC$$iwC$$iwC.(:36) > at $iwC$$iwC$$iwC.(:38) > at $iwC$$iwC.(:40) > at $iwC.(:42) > at (:44) > at .(:48) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) > at > 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at
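A hedged workaround sketch for the report above (an active {{sqlContext}} from spark-shell is assumed, and the table/column names are taken from the reporter's example): because the JDBC source resolves the full table schema before any column pruning, even a later select cannot avoid the unsupported {{xml}} column, but a parenthesised subquery passed as {{dbtable}} can cast it to {{text}} on the Postgres side before Spark ever sees the type.

{code:scala}
// Hedged workaround sketch only, not an official fix: push the cast down to Postgres
// via a subquery so the JDBC reader never encounters the unsupported xml column type.
val dataframe1 = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://localhost/customerlogs?user=postgres&password=postgres",
  "dbtable" -> "(select customer_id, event, country, cast(content as text) as content from customer) as c"
)).load()
{code}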
[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462797#comment-16462797 ] Stephen Boesch commented on SPARK-10943: Given the comment by Daniel Davis can this issue be reopened? > NullType Column cannot be written to Parquet > > > Key: SPARK-10943 > URL: https://issues.apache.org/jira/browse/SPARK-10943 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl >Priority: Major > > {code} > var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null > as comments") > {code} > //FAIL - Try writing a NullType column (where all the values are NULL) > {code} > data02.write.parquet("/tmp/test/dataset2") > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 179.0 (TID 39924, 10.0.196.208): > org.apache.spark.sql.AnalysisException: Unsupported data type > StructField(comments,NullType,true).dataType; > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at org.apache.spark.sql.types.StructType.map(StructType.scala:92) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) > at > org.apache.spark.sql.execution.datasources.parquet.P
[jira] [Created] (SPARK-24173) Flaky Test: VersionsSuite
Dongjoon Hyun created SPARK-24173: - Summary: Flaky Test: VersionsSuite Key: SPARK-24173 URL: https://issues.apache.org/jira/browse/SPARK-24173 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Dongjoon Hyun - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23622) Flaky Test: HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23622: -- Summary: Flaky Test: HiveClientSuites (was: HiveClientSuites fails with InvocationTargetException) > Flaky Test: HiveClientSuites > > > Key: SPARK-23622 > URL: https://issues.apache.org/jira/browse/SPARK-23622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test > (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325 > {code} > Error Message > java.lang.reflect.InvocationTargetException: null > Stacktrace > sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270) > at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58) > at > org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41) > at > org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48) > at > org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255) > at > org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24) > at org.scalatest.Suite$class.run(Suite.scala:1144) > at > org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444) > at > org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117) > ... 29 more > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:63) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425) > ... 31 more > Caused by: sbt.ForkMain$ForkError: > java.lang.reflect.InvocationTargetException:
[jira] [Updated] (SPARK-23622) HiveClientSuites fails with InvocationTargetException
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23622: -- Description: - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325 {code} Error Message java.lang.reflect.InvocationTargetException: null Stacktrace sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270) at org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58) at org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41) at org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48) at org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255) at org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24) at org.scalatest.Suite$class.run(Suite.scala:1144) at org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117) ... 
29 more Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:63) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425) ... 31 more Caused by: sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1451) ... 36 more Caused by: sbt.ForkMai
[jira] [Resolved] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23433. -- Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 2.2.2 fixed by https://github.com/apache/spark/pull/21131 > java.lang.IllegalStateException: more than one active taskSet for stage > --- > > Key: SPARK-23433 > URL: https://issues.apache.org/jira/browse/SPARK-23433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shixiong Zhu >Assignee: Imran Rashid >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > This following error thrown by DAGScheduler stopped the cluster: > {code} > 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage > 7580621: 7580621.2,7580621.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23433: Assignee: Imran Rashid > java.lang.IllegalStateException: more than one active taskSet for stage > --- > > Key: SPARK-23433 > URL: https://issues.apache.org/jira/browse/SPARK-23433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shixiong Zhu >Assignee: Imran Rashid >Priority: Major > > This following error thrown by DAGScheduler stopped the cluster: > {code} > 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage > 7580621: 7580621.2,7580621.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24154) AccumulatorV2 loses type information during serialization
[ https://issues.apache.org/jira/browse/SPARK-24154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462630#comment-16462630 ] Sergey Zhemzhitsky commented on SPARK-24154: > If users have to support mixin traits, they can still use accumulator v1. But accumulators V1 are deprecated and will be removed one day I believe. > AccumulatorV2 loses type information during serialization > - > > Key: SPARK-24154 > URL: https://issues.apache.org/jira/browse/SPARK-24154 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.3.1 > Environment: Scala 2.11 > Spark 2.2.0 >Reporter: Sergey Zhemzhitsky >Priority: Major > > AccumulatorV2 loses type information during serialization. > It happens > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L164] > during *writeReplace* call > {code:scala} > final protected def writeReplace(): Any = { > if (atDriverSide) { > if (!isRegistered) { > throw new UnsupportedOperationException( > "Accumulator must be registered before send to executor") > } > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy") > val isInternalAcc = name.isDefined && > name.get.startsWith(InternalAccumulator.METRICS_PREFIX) > if (isInternalAcc) { > // Do not serialize the name of internal accumulator and send it to > executor. > copyAcc.metadata = metadata.copy(name = None) > } else { > // For non-internal accumulators, we still need to send the name > because users may need to > // access the accumulator name at executor side, or they may keep the > accumulators sent from > // executors and access the name when the registered accumulator is > already garbage > // collected(e.g. SQLMetrics). > copyAcc.metadata = metadata > } > copyAcc > } else { > this > } > } > {code} > It means that it is hardly possible to create new accumulators easily by > adding new behaviour to existing ones by means of mix-ins or inheritance > (without overriding *copy*). > For example the following snippet ... > {code:scala} > trait TripleCount { > self: LongAccumulator => > abstract override def add(v: jl.Long): Unit = { > self.add(v * 3) > } > } > val acc = new LongAccumulator with TripleCount > sc.register(acc) > val data = 1 to 10 > val rdd = sc.makeRDD(data, 5) > rdd.foreach(acc.add(_)) > acc.value shouldBe 3 * data.sum > {code} > ... fails with > {code:none} > org.scalatest.exceptions.TestFailedException: 55 was not equal to 165 > at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at org.scalatest.Matchers$AnyShouldWrapper.shouldBe(Matchers.scala:6864) > {code} > Also such a behaviour seems to be error prone and confusing because an > implementor gets not the same thing as he/she sees in the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24169) JsonToStructs should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24169. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21226 [https://github.com/apache/spark/pull/21226] > JsonToStructs should not access SQLConf at executor side > > > Key: SPARK-24169 > URL: https://issues.apache.org/jira/browse/SPARK-24169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0, 2.3.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
[ https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24172: Assignee: Wenchen Fan (was: Apache Spark) > we should not apply operator pushdown to data source v2 many times > -- > > Key: SPARK-24172 > URL: https://issues.apache.org/jira/browse/SPARK-24172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
[ https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462587#comment-16462587 ] Apache Spark commented on SPARK-24172: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21230 > we should not apply operator pushdown to data source v2 many times > -- > > Key: SPARK-24172 > URL: https://issues.apache.org/jira/browse/SPARK-24172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
[ https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24172: Assignee: Apache Spark (was: Wenchen Fan) > we should not apply operator pushdown to data source v2 many times > -- > > Key: SPARK-24172 > URL: https://issues.apache.org/jira/browse/SPARK-24172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462584#comment-16462584 ] Jose Torres commented on SPARK-23703: - I'm no longer entirely convinced that this (and the parent JIRA) are correct. We might not want to support these scenarios at all. The question here is what we should do with the query: df.withWatermark(“a”, …) .withWatermark(“b”, …) .agg(...) What we do right now is definitely wrong. We (in MicroBatchExecution) calculate separate watermarks on "a" and "b", take their minimum, and then pass that as the watermark value to the aggregate. But the aggregate only sees "b" as a watermarked column, because only "b" has EventTimeWatermark.delayKey set in its attribute metadata at the aggregate node. EventTimeWatermark("b").output erases the metadata for "a" in its output. So we need to somehow resolve this mismatch. > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
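For readers following along, a hedged sketch of the query shape discussed in the comment above, using the built-in rate source purely for illustration (the column names "a" and "b" and the duplication of the event-time column are hypothetical; an active {{spark}} session is assumed):

{code:scala}
import org.apache.spark.sql.functions.{col, window}

// Hedged sketch only: the rate source provides a single event-time column, which is
// duplicated here so that two withWatermark calls precede the aggregate, mirroring the
// df.withWatermark("a", ...).withWatermark("b", ...).agg(...) shape described above.
val events = spark.readStream.format("rate").load()   // columns: timestamp, value
val aggregated = events
  .withColumnRenamed("timestamp", "a")
  .withColumn("b", col("a"))
  .withWatermark("a", "10 minutes")
  .withWatermark("b", "5 minutes")
  .groupBy(window(col("b"), "10 minutes"))
  .count()
{code}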
[jira] [Created] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
Wenchen Fan created SPARK-24172: --- Summary: we should not apply operator pushdown to data source v2 many times Key: SPARK-24172 URL: https://issues.apache.org/jira/browse/SPARK-24172 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24133) Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-24133: -- Fix Version/s: 2.3.1 > Reading Parquet files containing large strings can fail with > java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-24133 > URL: https://issues.apache.org/jira/browse/SPARK-24133 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ala Luszczak >Assignee: Ala Luszczak >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > ColumnVectors store string data in one big byte array. Since the array size > is capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store > more than 2GB of string data. > However, since the Parquet files commonly contain large blobs stored as > strings, and ColumnVectors by default carry 4096 values, it's entirely > possible to go past that limit. > In such cases a negative capacity is requested from > WritableColumnVector.reserve(). The call succeeds (requested capacity is > smaller than already allocated), and consequently > java.lang.ArrayIndexOutOfBoundsException is thrown when the reader actually > attempts to put the data into the array. > This behavior is hard to troubleshoot for the users. Spark should instead > check for negative requested capacity in WritableColumnVector.reserve() and > throw more informative error, instructing the user to tweak ColumnarBatch > size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
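A small worked example of the overflow described in the ticket (the per-value size is hypothetical): with the default 4096 values per batch, moderately large strings already push the required byte-array capacity past Integer.MAX_VALUE, and 32-bit arithmetic wraps the requested capacity to a negative number.

{code:scala}
// Hedged illustration of the arithmetic only; this is not Spark source code.
val valuesPerBatch = 4096                        // default ColumnarBatch size
val avgStringBytes = 600 * 1024                  // hypothetical ~600 KB per string value
val needed = valuesPerBatch.toLong * avgStringBytes
println(needed > Int.MaxValue)                   // true: a single ColumnVector cannot hold it
println(valuesPerBatch * avgStringBytes)         // Int arithmetic wraps: prints -1778384896
{code}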
[jira] [Commented] (SPARK-24170) [Spark SQL] json file format is not dropped after dropping table
[ https://issues.apache.org/jira/browse/SPARK-24170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462545#comment-16462545 ] Marco Gaido commented on SPARK-24170: - This is true for every datasource. This is the expected behavior when you set the location, because by default if the location is set, Spark assumes that the table is external (and not managed). I am not sure whether this is the right thing to do, but it is how it works. cc [~smilegator] [~dongjoon] any further comments on this? Shall we discuss if this is the right behavior? > [Spark SQL] json file format is not dropped after dropping table > > > Key: SPARK-24170 > URL: https://issues.apache.org/jira/browse/SPARK-24170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > # Launch spark-sql --master yarn > # create table json(name STRING, age int, gender string, id INT) using > org.apache.spark.sql.json options(path "hdfs:///user/testdemo/"); > # Execute the below SQL queries > INSERT into json > SELECT 'Shaan',21,'Male',1 > UNION ALL > SELECT 'Xing',20,'Female',11 > UNION ALL > SELECT 'Mile',4,'Female',20 > UNION ALL > SELECT 'Malan',10,'Male',9; > Below 4 json file format created > BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls > /user/testdemo > Found 14 items > -rw-r--r-- 3 spark hadoop 0 2018-04-26 17:44 /user/testdemo/_SUCCESS > -rw-r--r-- 3 spark hadoop 4802 2018-04-24 18:20 /user/testdemo/customer1.csv > -rw-r--r-- 3 spark hadoop 92 2018-04-26 17:02 /user/testdemo/json1.txt > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-0-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-0-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:32 > /user/testdemo/part-1-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:44 > /user/testdemo/part-1-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:32 > /user/testdemo/part-2-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:44 > /user/testdemo/part-2-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-3-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-3-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > > Issue is: > Now executed below drop command > spark-sql> drop table json; > > Table dropped successfully but json file still present in the path > /user/testdemo -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
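To make the behaviour Marco describes concrete, a hedged sketch (run from spark-shell with an active {{spark}} session assumed; table names here are hypothetical, and the path is the reporter's): a datasource table created with an explicit path is registered as external, so DROP TABLE removes only the metadata, whereas a managed table's files under the warehouse directory are deleted along with it.

{code:scala}
// Hedged illustration of external vs. managed datasource tables; not a fix.
spark.sql("""
  CREATE TABLE json_ext (name STRING, age INT, gender STRING, id INT)
  USING org.apache.spark.sql.json
  OPTIONS (path 'hdfs:///user/testdemo/')
""")
spark.sql("DROP TABLE json_ext")      // external table: files under hdfs:///user/testdemo/ are kept

spark.sql("CREATE TABLE json_managed (name STRING, age INT, gender STRING, id INT) USING json")
spark.sql("DROP TABLE json_managed")  // managed table: files under the warehouse directory are removed
{code}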
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462540#comment-16462540 ] Imran Rashid commented on SPARK-24135: -- Honestly I don't understand the failure mode described here at all, but I can make some comparisons to YARN's handling of executor failures at the allocator level. In YARN, Spark already has a check for the number of executor failures, and it fails the entire application if there are too many. It's controlled by "spark.yarn.max.executor.failures". The failures expire over time, controlled by "spark.yarn.executor.failuresValidityInterval", so really long-running apps are not penalized by a few errors spread out over a long period of time. See the code in ApplicationMaster & YarnAllocator. There is also ongoing work to have Spark recognize when container initialization has failed and then request that other nodes be selected instead, SPARK-16630. There is a PR under review for that now. From the bug description, I do think there should be some better error handling than what there is now so the user at least knows what is going on, but it sounds like you're all in agreement about that already :). > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
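For reference, a hedged sketch of the YARN-side knobs Imran mentions (the values are examples only, and they apply only when the application runs on YARN):

{code:scala}
import org.apache.spark.SparkConf

// Example values only: fail the application after too many executor failures, but let
// old failures expire so long-running applications are not penalized by stale errors.
val conf = new SparkConf()
  .set("spark.yarn.max.executor.failures", "20")
  .set("spark.yarn.executor.failuresValidityInterval", "1h")
{code}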
[jira] [Assigned] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23697: Assignee: (was: Apache Spark) > Accumulators of Spark 1.x no longer work with Spark 2.x > --- > > Key: SPARK-23697 > URL: https://issues.apache.org/jira/browse/SPARK-23697 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark 2.2.0 > Scala 2.11 >Reporter: Sergey Zhemzhitsky >Priority: Major > > I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x > failing with > {code:java} > java.lang.AssertionError: assertion failed: copyAndReset must return a zero > value copy{code} > It happens while serializing an accumulator > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] > {code:java} > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > ... although copyAndReset returns zero-value copy for sure, just consider the > accumulator below > {code:java} > val concatParam = new AccumulatorParam[jl.StringBuilder] { > override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new > jl.StringBuilder() > override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): > jl.StringBuilder = r1.append(r2) > }{code} > So, Spark treats zero value as non-zero due to how > [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] > is implemented in LegacyAccumulatorWrapper. > {code:java} > override def isZero: Boolean = _value == param.zero(initialValue){code} > All this means that the values to be accumulated must implement equals and > hashCode, otherwise isZero is very likely to always return false. > So I'm wondering whether the assertion > {code:java} > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > is really necessary and whether it can be safely removed from there? > If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper > to prevent such failures? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23697: Assignee: Apache Spark > Accumulators of Spark 1.x no longer work with Spark 2.x > --- > > Key: SPARK-23697 > URL: https://issues.apache.org/jira/browse/SPARK-23697 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark 2.2.0 > Scala 2.11 >Reporter: Sergey Zhemzhitsky >Assignee: Apache Spark >Priority: Major > > I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x > failing with > {code:java} > java.lang.AssertionError: assertion failed: copyAndReset must return a zero > value copy{code} > It happens while serializing an accumulator > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] > {code:java} > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > ... although copyAndReset returns zero-value copy for sure, just consider the > accumulator below > {code:java} > val concatParam = new AccumulatorParam[jl.StringBuilder] { > override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new > jl.StringBuilder() > override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): > jl.StringBuilder = r1.append(r2) > }{code} > So, Spark treats zero value as non-zero due to how > [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] > is implemented in LegacyAccumulatorWrapper. > {code:java} > override def isZero: Boolean = _value == param.zero(initialValue){code} > All this means that the values to be accumulated must implement equals and > hashCode, otherwise isZero is very likely to always return false. > So I'm wondering whether the assertion > {code:java} > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > is really necessary and whether it can be safely removed from there? > If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper > to prevent such failures? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462465#comment-16462465 ] Apache Spark commented on SPARK-23697: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21229 > Accumulators of Spark 1.x no longer work with Spark 2.x > --- > > Key: SPARK-23697 > URL: https://issues.apache.org/jira/browse/SPARK-23697 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark 2.2.0 > Scala 2.11 >Reporter: Sergey Zhemzhitsky >Priority: Major > > I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x > failing with > {code:java} > java.lang.AssertionError: assertion failed: copyAndReset must return a zero > value copy{code} > It happens while serializing an accumulator > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] > {code:java} > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > ... although copyAndReset returns zero-value copy for sure, just consider the > accumulator below > {code:java} > val concatParam = new AccumulatorParam[jl.StringBuilder] { > override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new > jl.StringBuilder() > override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): > jl.StringBuilder = r1.append(r2) > }{code} > So, Spark treats zero value as non-zero due to how > [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] > is implemented in LegacyAccumulatorWrapper. > {code:java} > override def isZero: Boolean = _value == param.zero(initialValue){code} > All this means that the values to be accumulated must implement equals and > hashCode, otherwise isZero is very likely to always return false. > So I'm wondering whether the assertion > {code:java} > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > is really necessary and whether it can be safely removed from there? > If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper > to prevent such failures? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
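A hedged workaround sketch for the assertion failure described above, assuming an active SparkContext {{sc}} from spark-shell: accumulate into a type whose equals/hashCode are value-based (String rather than java.lang.StringBuilder), so LegacyAccumulatorWrapper's {{isZero}} recognises the zero value and the "copyAndReset must return a zero value copy" assertion holds.

{code:scala}
import org.apache.spark.AccumulatorParam

// Hedged sketch, not a fix for the underlying issue: String has value-based equality,
// so isZero correctly compares the reset value against param.zero(initialValue).
val concatParam = new AccumulatorParam[String] {
  override def zero(initialValue: String): String = ""
  override def addInPlace(r1: String, r2: String): String = r1 + r2
}
val acc = sc.accumulator("")(concatParam)   // the Accumulator V1 API is deprecated but still available in 2.x
{code}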
[jira] [Commented] (SPARK-24154) AccumulatorV2 loses type information during serialization
[ https://issues.apache.org/jira/browse/SPARK-24154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462413#comment-16462413 ] Wenchen Fan commented on SPARK-24154: - I think this is a trade-off: In accumulator v1, we have separated classes for keeping states and defining computation, i.e. `Accumulable` and `AccumulableParam`. Users can mix in different traits into their `AccumulableParam` implementation, it works fine because the `copy` is defined in `Accumulable`. In accumulator v2, we have a single class for keeping states and defining computation, i.e. `AccumulatorV2`. It simplifies the accumulator framework a lot, and makes it much easier for users to eliminate boxing, although it loses the flexibility to mix in traits. In general, I don't think it's possible to support this pattern in accumulator v2, users would have to override the `copy` method. I think this is worth, compared to what accumulator v2 brings in. If users have to support mixin traits, they can still use accumulator v1. > AccumulatorV2 loses type information during serialization > - > > Key: SPARK-24154 > URL: https://issues.apache.org/jira/browse/SPARK-24154 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.3.1 > Environment: Scala 2.11 > Spark 2.2.0 >Reporter: Sergey Zhemzhitsky >Priority: Major > > AccumulatorV2 loses type information during serialization. > It happens > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L164] > during *writeReplace* call > {code:scala} > final protected def writeReplace(): Any = { > if (atDriverSide) { > if (!isRegistered) { > throw new UnsupportedOperationException( > "Accumulator must be registered before send to executor") > } > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy") > val isInternalAcc = name.isDefined && > name.get.startsWith(InternalAccumulator.METRICS_PREFIX) > if (isInternalAcc) { > // Do not serialize the name of internal accumulator and send it to > executor. > copyAcc.metadata = metadata.copy(name = None) > } else { > // For non-internal accumulators, we still need to send the name > because users may need to > // access the accumulator name at executor side, or they may keep the > accumulators sent from > // executors and access the name when the registered accumulator is > already garbage > // collected(e.g. SQLMetrics). > copyAcc.metadata = metadata > } > copyAcc > } else { > this > } > } > {code} > It means that it is hardly possible to create new accumulators easily by > adding new behaviour to existing ones by means of mix-ins or inheritance > (without overriding *copy*). > For example the following snippet ... > {code:scala} > trait TripleCount { > self: LongAccumulator => > abstract override def add(v: jl.Long): Unit = { > self.add(v * 3) > } > } > val acc = new LongAccumulator with TripleCount > sc.register(acc) > val data = 1 to 10 > val rdd = sc.makeRDD(data, 5) > rdd.foreach(acc.add(_)) > acc.value shouldBe 3 * data.sum > {code} > ... 
fails with > {code:none} > org.scalatest.exceptions.TestFailedException: 55 was not equal to 165 > at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at org.scalatest.Matchers$AnyShouldWrapper.shouldBe(Matchers.scala:6864) > {code} > Also such a behaviour seems to be error prone and confusing because an > implementor gets not the same thing as he/she sees in the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
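Following the suggestion above, a hedged sketch of what overriding {{copy}} could look like for the tripling example from the description (this is illustrative only, not an API the framework provides):

{code:scala}
import java.{lang => jl}
import org.apache.spark.util.LongAccumulator

// Hedged sketch: subclass instead of mixing in a trait, and override copy() so that
// writeReplace's copyAndReset() produces an instance of the subclass (keeping its
// behaviour) when the accumulator is shipped to executors.
class TripleCountAccumulator extends LongAccumulator {
  override def add(v: jl.Long): Unit = super.add(v.longValue * 3L)
  override def add(v: Long): Unit = super.add(v * 3L)

  override def copy(): TripleCountAccumulator = {
    val newAcc = new TripleCountAccumulator
    newAcc.merge(this)   // copies sum and count; copyAndReset() resets the copy afterwards
    newAcc
  }
}
{code}

Registered with {{sc.register(new TripleCountAccumulator, "tripleCount")}}, the executors would then receive a TripleCountAccumulator rather than a plain LongAccumulator.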
[jira] [Commented] (SPARK-24090) Kubernetes Backend Hotlist for Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-24090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462363#comment-16462363 ] Stavros Kontopoulos commented on SPARK-24090: - Any plans for adding more items on this list? > Kubernetes Backend Hotlist for Spark 2.4 > > > Key: SPARK-24090 > URL: https://issues.apache.org/jira/browse/SPARK-24090 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Scheduler >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24171) Update comments for non-deterministic functions
[ https://issues.apache.org/jira/browse/SPARK-24171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462360#comment-16462360 ] Apache Spark commented on SPARK-24171: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21228 > Update comments for non-deterministic functions > --- > > Key: SPARK-24171 > URL: https://issues.apache.org/jira/browse/SPARK-24171 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Description of non-deterministic functions like the _collect_list()_ and > _first()_ doesn't contain information about that. Need to add a notice about > it to show the behavior in user facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24171) Update comments for non-deterministic functions
[ https://issues.apache.org/jira/browse/SPARK-24171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24171: Assignee: Apache Spark > Update comments for non-deterministic functions > --- > > Key: SPARK-24171 > URL: https://issues.apache.org/jira/browse/SPARK-24171 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Description of non-deterministic functions like the _collect_list()_ and > _first()_ doesn't contain information about that. Need to add a notice about > it to show the behavior in user facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24171) Update comments for non-deterministic functions
[ https://issues.apache.org/jira/browse/SPARK-24171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24171: Assignee: (was: Apache Spark) > Update comments for non-deterministic functions > --- > > Key: SPARK-24171 > URL: https://issues.apache.org/jira/browse/SPARK-24171 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Description of non-deterministic functions like the _collect_list()_ and > _first()_ doesn't contain information about that. Need to add a notice about > it to show the behavior in user facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24166) InMemoryTableScanExec should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24166. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21223 [https://github.com/apache/spark/pull/21223] > InMemoryTableScanExec should not access SQLConf at executor side > > > Key: SPARK-24166 > URL: https://issues.apache.org/jira/browse/SPARK-24166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0, 2.3.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462339#comment-16462339 ] Jungtaek Lim commented on SPARK-23703: -- [~joseph.torres] Could you provide a simple code snippet or query showing this behavior? It would give me (and possibly other contributors) a better understanding of the rationale for this issue, and maybe of the relevant internals too. Once I understand the details I'd also like to work on this. > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
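A minimal sketch of the plan shape described in SPARK-23703 above, assuming a rate source and illustrative column names (none of this is taken from the issue itself): two sequential withWatermark calls produce two EventTimeWatermark nodes, with the upper one overriding the event-time metadata of the lower one while the lower node stays in the plan.
{code}
import spark.implicits._

// A streaming Dataset with an event-time column (assumed schema).
val events = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTime")

// Two sequential watermarks with no stateful operator in between; per the
// issue, the lower EventTimeWatermark node could be removed entirely.
val doubleWatermark = events
  .withWatermark("eventTime", "10 minutes")
  .select($"eventTime", $"value")
  .withWatermark("eventTime", "20 minutes")
{code}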
[jira] [Created] (SPARK-24171) Update comments for non-deterministic functions
Maxim Gekk created SPARK-24171: -- Summary: Update comments for non-deterministic functions Key: SPARK-24171 URL: https://issues.apache.org/jira/browse/SPARK-24171 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk The descriptions of non-deterministic functions like _collect_list()_ and _first()_ don't mention that they are non-deterministic. A notice about this behavior needs to be added to the user-facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
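As a hedged illustration of the behavior SPARK-24171 wants documented (the data and grouping below are made up for this sketch): _first()_ and _collect_list()_ depend on the row order that happens to arrive after a shuffle, so repeated runs can return different results.
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(0, 1000).repartition(8).toDF("id")

// With no explicit ordering, the row picked by first() and the order of the
// list built by collect_list() are not deterministic across runs.
df.groupBy(($"id" % 10).as("bucket"))
  .agg(first($"id").as("sample_id"), collect_list($"id").as("ids"))
  .show()
{code}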
[jira] [Assigned] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23715: Assignee: Wenchen Fan > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact that the FromUTCTimestamp expression, despite > its name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. > As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. 
FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted by the local time > zone. So it reverse-shifts the long value by the local time zone's offset, > which produces a incorrect timestamp (except in the c
[jira] [Resolved] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23715. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21169 [https://github.com/apache/spark/pull/21169] > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact that the FromUTCTimestamp expression, despite > its name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. 
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted by the local time > zone. So it reverse-shifts th
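To make the shifting described in SPARK-23715 concrete, here is a minimal sketch of the offset arithmetic, assuming a local time zone of America/Los_Angeles; it only illustrates why the reverse-shift is correct for zone-less input and seven hours off for input that already carried an explicit offset.
{code}
import java.util.TimeZone

// Values are microseconds since the epoch, matching the issue text.
val localTz = TimeZone.getTimeZone("America/Los_Angeles")

// "2018-03-13T06:18:23" parsed in the local zone (what the cast produces):
val parsedLocalMicros = 1520947103000000L               // 2018-03-13T06:18:23-07:00
// FromUTCTimestamp reverse-shifts by the local offset (-7 hours here) to
// recover the UTC instant it expects:
val offsetMicros = localTz.getOffset(parsedLocalMicros / 1000L) * 1000L
val assumedUtcMicros = parsedLocalMicros + offsetMicros // 2018-03-13T06:18:23+00:00

// If the input string already carried an explicit offset, the cast did not
// apply the local shift, so the same reverse-shift moves the timestamp seven
// hours too far back, which is the wrong answer shown in the issue.
{code}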
[jira] [Resolved] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24152. -- Resolution: Fixed > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > PR builder and master branch test fails with the following SparkR error with > unknown reason. The following is an error message from that. > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fail with no failures) > This is critical because we already start to merge the PR by ignoring this > **known unkonwn** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24133) Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462287#comment-16462287 ] Apache Spark commented on SPARK-24133: -- User 'ala' has created a pull request for this issue: https://github.com/apache/spark/pull/21227 > Reading Parquet files containing large strings can fail with > java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-24133 > URL: https://issues.apache.org/jira/browse/SPARK-24133 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ala Luszczak >Assignee: Ala Luszczak >Priority: Major > Fix For: 2.4.0 > > > ColumnVectors store string data in one big byte array. Since the array size > is capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store > more than 2GB of string data. > However, since the Parquet files commonly contain large blobs stored as > strings, and ColumnVectors by default carry 4096 values, it's entirely > possible to go past that limit. > In such cases a negative capacity is requested from > WritableColumnVector.reserve(). The call succeeds (requested capacity is > smaller than already allocated), and consequently > java.lang.ArrayIndexOutOfBoundsException is thrown when the reader actually > attempts to put the data into the array. > This behavior is hard to troubleshoot for the users. Spark should instead > check for negative requested capacity in WritableColumnVector.reserve() and > throw more informative error, instructing the user to tweak ColumnarBatch > size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
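A minimal sketch of the defensive check SPARK-24133 proposes; this is not the actual WritableColumnVector code, only an illustration of catching the integer overflow (which surfaces as a negative requested capacity) and reporting it with an actionable message.
{code}
// Sketch only: the class, message, and growth policy are illustrative.
class SketchColumnVector(private var capacity: Int) {
  def reserve(requiredCapacity: Int): Unit = {
    if (requiredCapacity < 0) {
      // More than ~2GB of string data in one batch overflows Int and shows up
      // here as a negative request; fail fast with a hint instead of hitting
      // ArrayIndexOutOfBoundsException later.
      throw new RuntimeException(
        s"Cannot reserve additional contiguous bytes in the vectorized reader " +
          s"(requested $requiredCapacity bytes); consider reducing the ColumnarBatch size.")
    } else if (requiredCapacity > capacity) {
      capacity = math.max(requiredCapacity, capacity * 2) // grow (allocation elided)
    }
  }
}
{code}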
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462280#comment-16462280 ] Liang-Chi Hsieh commented on SPARK-24152: - Can be resolved now as I saw Jenkins test passed. > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > PR builder and master branch test fails with the following SparkR error with > unknown reason. The following is an error message from that. > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fail with no failures) > This is critical because we already start to merge the PR by ignoring this > **known unkonwn** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462244#comment-16462244 ] Daniel Davis commented on SPARK-10943: -- According to parquet data types [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md], now a Null type should be supported. So perhaps this issue should be reconsidered? > NullType Column cannot be written to Parquet > > > Key: SPARK-10943 > URL: https://issues.apache.org/jira/browse/SPARK-10943 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl >Priority: Major > > {code} > var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null > as comments") > {code} > //FAIL - Try writing a NullType column (where all the values are NULL) > {code} > data02.write.parquet("/tmp/test/dataset2") > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 179.0 (TID 39924, 10.0.196.208): > org.apache.spark.sql.AnalysisException: Unsupported data type > StructField(comments,NullType,true).dataType; > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at org.apache.spark.sql.types.StructType.map(StructType.scala:92) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > org.apache.parquet.hadoop.ParquetOu
[jira] [Commented] (SPARK-21429) show on structured Dataset is equivalent to writeStream to console once
[ https://issues.apache.org/jira/browse/SPARK-21429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462225#comment-16462225 ] Jungtaek Lim commented on SPARK-21429: -- I agree that shortcut would help, but a bit afraid that such shortcut might hide the detail, difference between executing batch and streaming. Unless source data has changed, running batch will not have side effect. But running streaming will change source offset. (I guess it is true even for Trigger.once() but please correct me if I'm missing something.) > show on structured Dataset is equivalent to writeStream to console once > --- > > Key: SPARK-21429 > URL: https://issues.apache.org/jira/browse/SPARK-21429 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > While working with Datasets it's often helpful to do {{show}}. It does not > work for streaming Datasets (and leads to {{AnalysisException}} - see below), > but think it could just be the following under the covers and very helpful > (would cut plenty of keystrokes for sure). > {code} > val sq = ... > scala> sq.isStreaming > res0: Boolean = true > import org.apache.spark.sql.streaming.Trigger > scala> sq.writeStream.format("console").trigger(Trigger.Once).start > {code} > Since {{show}} returns {{Unit}} that could just work. > Currently {{show}} reports {{AnalysisException}}. > {code} > scala> sq.show > org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with writeStream.start();; > rate > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) > at > org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) > at > 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3027) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2340) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2553) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:241) > at org.apache.spark.sql.Dataset.show(Dataset.scala:671) > at org.apache.spark.sql.Dataset.show(Dataset.scala:630) > at org.apache.spark.sql.Dataset.show(Dataset.scala:639) > ... 50 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
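To make the trade-off discussed above concrete, here is a hedged sketch of what the proposed shortcut for SPARK-21429 could look like; this helper is not an existing Spark API, and, as the comment notes, unlike a batch show() it starts a streaming query and therefore advances the source offsets.
{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.Trigger

// Hypothetical helper, not part of Spark: run a streaming Dataset once to the
// console sink, roughly what show() could do under the covers.
def showStream(ds: Dataset[_]): Unit = {
  val query = ds.writeStream
    .format("console")
    .trigger(Trigger.Once())
    .start()
  // Unlike a batch show(), this commits progress on the source.
  query.awaitTermination()
}
{code}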
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462149#comment-16462149 ] Jungtaek Lim commented on SPARK-10816: -- I'm still curious about out-of-the-box support for session windows. I'm aware of the session window example in the Structured Streaming guide that leverages the advanced API, but out-of-the-box support could enable the feature in SQL statements instead of requiring mapGroupsWithState. If the Spark community is still interested in out-of-the-box support, I'd like to spend some time taking a deep look to see whether we can get it. Thanks! > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
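For reference, a hedged sketch of what out-of-the-box session window support could look like if it were surfaced like the existing tumbling/sliding window function; session_window is not an existing Spark function at the time of this discussion, and the column names are assumptions.
{code}
import spark.implicits._

// Hypothetical API by analogy with window(): group events into per-user
// sessions that close after a 10-minute gap in activity.
events
  .withWatermark("eventTime", "30 minutes")
  .groupBy(session_window($"eventTime", "10 minutes"), $"userId")
  .count()
{code}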
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462145#comment-16462145 ] Anirudh Ramanathan commented on SPARK-24135: cc/ [~mridulm80] [~irashid] for thoughts on whether this behavior would be intuitive to an existing Spark user. > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462139#comment-16462139 ] Anirudh Ramanathan edited comment on SPARK-24135 at 5/3/18 9:01 AM: It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. That's the reason why they never give up after seeing failures. However, if you see the "batch" type built-in controller, the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], it does implement a [backoff policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy] that covers the initialization and runtime errors in containers. As I see it, we should have safe limits for all kinds of failures to eventually give up. I'm ok with having this limit similar to the job controller, as a configurable number and one might want to set it very high in your case to do near infinite retries, but I'm not convinced that that behavior is a safe choice in the general case. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". was (Author: foxish): It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. The only "batch" type built-in controller is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], which does implement a [backoff policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy] that covers the initialization and runtime errors in containers. 
As I see it, we should have safe limits for all kinds of failures to eventually give up; it's more a question of whether this should be treated differently as a framework error. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. Thi
[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462139#comment-16462139 ] Anirudh Ramanathan edited comment on SPARK-24135 at 5/3/18 8:58 AM: It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. The only "batch" type built-in controller is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], which does implement a [backoff policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy] that covers the initialization and runtime errors in containers. As I see it, we should have safe limits for all kinds of failures to eventually give up; it's more a question of whether this should be treated differently as a framework error. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". was (Author: foxish): It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. The only "batch" type built-in controller is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], which does implement a backoff policy that covers the initialization and runtime errors in containers. As I see it, we should have safe limits for all kinds of failures to eventually give up; it's more a question of whether this should be treated differently as a framework error. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. 
That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. >
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462139#comment-16462139 ] Anirudh Ramanathan commented on SPARK-24135: It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. The only "batch" type built-in controller is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], which does implement a backoff policy that covers the initialization and runtime errors in containers. As I see it, we should have safe limits for all kinds of failures to eventually give up; it's more a question of whether this should be treated differently as a framework error. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462106#comment-16462106 ] Matt Cheah edited comment on SPARK-24135 at 5/3/18 8:35 AM: Not necessarily - if the pods fail to start up, we should retry them indefinitely as a replication controller or a deployment would. There's an argument that can be made that we should be using those higher level primitives to run executors instead of raw pods anyways, just that Spark's scheduler code would need non-trivial changes to do so right now. was (Author: mcheah): Not necessarily - if the pods fail to start up, we should retry them indefinitely as a replication controller or a deployment would. There's an argument that can be made that we should be using those lower level primitives to run executors instead of raw pods anyways, just that Spark's scheduler code would need non-trivial changes to do so right now. > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462106#comment-16462106 ] Matt Cheah commented on SPARK-24135: Not necessarily - if the pods fail to start up, we should retry them indefinitely as a replication controller or a deployment would. There's an argument that can be made that we should be using those lower level primitives to run executors instead of raw pods anyways, just that Spark's scheduler code would need non-trivial changes to do so right now. > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24030) SparkSQL percentile_approx function is too slow for over 1,060,000 records.
[ https://issues.apache.org/jira/browse/SPARK-24030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24030. -- Resolution: Cannot Reproduce > SparkSQL percentile_approx function is too slow for over 1,060,000 records. > --- > > Key: SPARK-24030 > URL: https://issues.apache.org/jira/browse/SPARK-24030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and a local laptop. >Reporter: Seok-Joon,Yun >Priority: Major > Attachments: screenshot_2018-04-20 23.15.02.png > > > I used the percentile_approx function on over 1,060,000 records. It is too > slow: it takes about 90 minutes. For 1,040,000 records it takes about > 10 seconds. > I tested with data read over JDBC and from Parquet; both take the same amount of time. > I suspect the function is not designed to run on multiple workers. > I checked Ganglia and the Spark history server; it ran on only one worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
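For context, a minimal sketch of the kind of query the SPARK-24030 report describes; the data location and column name are assumptions, not taken from the issue.
{code}
// Assumed data location and column name; illustrative only.
val df = spark.read.parquet("/tmp/records")
df.createOrReplaceTempView("records")

spark.sql(
  "SELECT percentile_approx(value, array(0.25, 0.5, 0.75)) AS quartiles FROM records"
).show()
{code}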
[jira] [Assigned] (SPARK-23489) Flaky Test: HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-23489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23489: --- Assignee: Dongjoon Hyun > Flaky Test: HiveExternalCatalogVersionsSuite > > > Key: SPARK-23489 > URL: https://issues.apache.org/jira/browse/SPARK-23489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Marco Gaido >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I saw this error in an unrelated PR. It seems to me a bad configuration in > the Jenkins node where the tests are run. > {code} > Error Message > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Cannot run program > "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, > No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file > or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > {code} > This is the link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87615/testReport/. 
> *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389 > *BRANCH 2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/321/ > *NOTE: This failure frequently looks as `Test Result (no failures)`* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23489) Flaky Test: HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-23489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23489. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21210 [https://github.com/apache/spark/pull/21210] > Flaky Test: HiveExternalCatalogVersionsSuite > > > Key: SPARK-23489 > URL: https://issues.apache.org/jira/browse/SPARK-23489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Marco Gaido >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > I saw this error in an unrelated PR. It seems to me a bad configuration in > the Jenkins node where the tests are run. > {code} > Error Message > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Cannot run program > "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, > No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file > or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > {code} > This is the link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87615/testReport/. 
> *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389 > *BRANCH 2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/321/ > *NOTE: This failure frequently appears as `Test Result (no failures)`* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
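The IOException above is the plain failure mode of launching a relative "./bin/spark-submit" from a working directory that is missing on the Jenkins node, which matches the reporter's guess that the node, not the test logic, is at fault. The following Scala snippet is a minimal, hypothetical sketch of that launch pattern; it is not the suite's actual code, and the object name, directory path, and arguments are assumptions used only to illustrate the failure:

{code}
import java.io.{File, IOException}

// Hypothetical sketch: invoke an older release's spark-submit from its
// unpacked distribution directory, roughly the way the versions suite does.
object SparkSubmitLaunchSketch {
  def runSparkSubmit(sparkHome: File, args: String*): Int = {
    val pb = new ProcessBuilder(("./bin/spark-submit" +: args): _*)
    pb.directory(sparkHome)   // the relative command resolves against this directory
    pb.inheritIO()
    // If sparkHome (e.g. /tmp/test-spark/spark-2.2.1) was never downloaded or was
    // cleaned up, start() throws: java.io.IOException: Cannot run program
    // "./bin/spark-submit" (in directory "..."): error=2, No such file or directory
    pb.start().waitFor()
  }

  def main(args: Array[String]): Unit = {
    try {
      val exit = runSparkSubmit(new File("/tmp/test-spark/spark-2.2.1"), "--version")
      println(s"spark-submit exited with code $exit")
    } catch {
      case e: IOException => println(s"Launch failed as in the flaky runs: ${e.getMessage}")
    }
  }
}
{code}

Under that assumption, the flakiness depends on whether the downloaded distribution under /tmp/test-spark is still present when the suite runs, not on the queries the suite executes.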
[jira] [Commented] (SPARK-24116) SparkSQL inserting overwrite table has inconsistent behavior regarding HDFS trash
[ https://issues.apache.org/jira/browse/SPARK-24116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462040#comment-16462040 ] Rui Li commented on SPARK-24116: To reproduce: {code} create table test_text(x int); insert overwrite table test_text values (1),(2); insert overwrite table test_text values (3),(4); -- the old data is moved to trash create table test_parquet(x int) using parquet; insert overwrite table test_parquet values (1),(2); insert overwrite table test_parquet values (3),(4); -- the old data is not moved to trash {code} > SparkSQL inserting overwrite table has inconsistent behavior regarding HDFS > trash > - > > Key: SPARK-24116 > URL: https://issues.apache.org/jira/browse/SPARK-24116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Rui Li >Priority: Major > > When overwriting a table with INSERT OVERWRITE, the old data may or may not go to the trash, > depending on: > # Data format. E.g. a text table may go to trash but a parquet table doesn't. > # Whether the table is partitioned. E.g. a partitioned text table doesn't go to > trash while a non-partitioned one does. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
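A quick way to observe the inconsistency described above is to replay the reproduction and then list the current user's HDFS trash between runs. The Scala sketch below is illustrative only: the table names mirror the comment, while the Hive-enabled session and the trash location (the .Trash folder under the user's HDFS home directory) are assumptions rather than details from the report.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Illustrative sketch: run the reproduction, then inspect the user's HDFS trash.
object InsertOverwriteTrashCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("insert-overwrite-trash-check")
      .enableHiveSupport()
      .getOrCreate()

    Seq(
      "create table test_text(x int)",
      "insert overwrite table test_text values (1),(2)",
      "insert overwrite table test_text values (3),(4)",     // old files reportedly moved to trash
      "create table test_parquet(x int) using parquet",
      "insert overwrite table test_parquet values (1),(2)",
      "insert overwrite table test_parquet values (3),(4)"   // old files reportedly deleted outright
    ).foreach(spark.sql)

    // Assumed trash location: <HDFS home>/.Trash for the current user.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val trash = new Path(fs.getHomeDirectory, ".Trash")
    if (fs.exists(trash)) {
      fs.listStatus(trash).foreach(status => println(status.getPath))
    } else {
      println(s"No trash directory found at $trash")
    }
    spark.stop()
  }
}
{code}

A plausible, but unconfirmed, explanation is that the two tables take different write paths: Hive-serde text tables go through Hive's file-replacement logic, which can move the old files to the trash, while data source tables such as parquet have their old files deleted directly.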
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462030#comment-16462030 ] Liang-Chi Hsieh commented on SPARK-24152: - I think it is fixed now. It works locally, but it is better to check the Jenkins test results too. > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > The PR builder and master branch tests fail with the following SparkR error for an > unknown reason: > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fails while reporting no test failures) > This is critical because we have already started merging PRs while ignoring this > **known unknown** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org