[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463391#comment-16463391 ] Hyukjin Kwon commented on SPARK-23291: -- [~felixcheung] WDYT? > SparkR : substr : In SparkR dataframe , starting and ending position > arguments in "substr" is giving wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0 >Reporter: Narendra >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > Defect Description : > - > For example ,an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a a new column named "col2" with the value "12" > which is inside the string ."12" can be extracted with "starting position" as > "6" and "Ending position" as "7" > (the starting position of the first character is considered as "1" ) > But,the current code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,7,8))) > Observe that the first argument in the "substr" API , which indicates the > 'starting position', is mentioned as "7" > Also, observe that the second argument in the "substr" API , which indicates > the 'ending position', is mentioned as "8" > i.e the number that should be mentioned to indicate the position should be > the "actual position + 1" > Expected behavior : > > The code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,6,7))) > Note : > --- > This defect is observed with only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24176) The hdfs file path with wildcard can not be identified when loading data
ABHISHEK KUMAR GUPTA created SPARK-24176: Summary: The hdfs file path with wildcard can not be identified when loading data Key: SPARK-24176 URL: https://issues.apache.org/jira/browse/SPARK-24176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: OS: SUSE11 Spark Version:2.3 Reporter: ABHISHEK KUMAR GUPTA # Launch spark-sql # create table wild1 (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) row format delimited fields terminated by ',' stored as textfile; # loaded data in table as below and it failed some cases not consistent # load data inpath '/user/testdemo1/user1/?ype* ' into table wild1; - Success load data inpath '/user/testdemo1/user1/t??eddata60.txt' into table wild1; - *Failed* load data inpath '/user/testdemo1/user1/?ypeddata60.txt' into table wild1; - Success Exception as below > load data inpath '/user/testdemo1/user1/t??eddata61.txt' into table wild1; 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_database: one 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com ip=unknown-ip-addr cmd=get_database: one 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1 *Error in query: LOAD DATA input path does not exist: /user/testdemo1/user1/t??eddata61.txt;* spark-sql> Behavior is not consistent. Need to fix with all combination of wild card char as it is not consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
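One way to narrow down whether the inconsistency comes from path resolution itself is to expand each wildcard pattern directly with Hadoop's glob API, outside of LOAD DATA. A minimal spark-shell sketch, assuming the same HDFS paths as in the reproduction steps above:

{code:scala}
// Hedged sketch: expand each wildcard pattern with Hadoop's glob API to see which
// files it actually matches, independently of LOAD DATA. Paths are taken from the
// reproduction steps above and are assumptions about the test cluster layout.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

Seq("/user/testdemo1/user1/?ype*",
    "/user/testdemo1/user1/t??eddata60.txt",
    "/user/testdemo1/user1/?ypeddata60.txt",
    "/user/testdemo1/user1/t??eddata61.txt").foreach { pattern =>
  val matches = fs.globStatus(new Path(pattern))
  val names = if (matches == null) Seq.empty else matches.map(_.getPath.getName).toSeq
  println(s"$pattern -> ${names.mkString(", ")}")
}
{code}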
[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463369#comment-16463369 ] Wenchen Fan commented on SPARK-23291: - Yea I think we should backport it. > SparkR : substr : In SparkR dataframe , starting and ending position > arguments in "substr" is giving wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0 >Reporter: Narendra >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > Defect Description : > - > For example ,an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a a new column named "col2" with the value "12" > which is inside the string ."12" can be extracted with "starting position" as > "6" and "Ending position" as "7" > (the starting position of the first character is considered as "1" ) > But,the current code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,7,8))) > Observe that the first argument in the "substr" API , which indicates the > 'starting position', is mentioned as "7" > Also, observe that the second argument in the "substr" API , which indicates > the 'ending position', is mentioned as "8" > i.e the number that should be mentioned to indicate the position should be > the "actual position + 1" > Expected behavior : > > The code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,6,7))) > Note : > --- > This defect is observed with only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463359#comment-16463359 ] Michael Ransley commented on SPARK-22918: - I think this relates to https://issues.apache.org/jira/browse/DERBY-6648 which was change made in Derby 10.12.1.1. To quote: {quote}Users who run Derby under a SecurityManager must edit the policy file and grant the following additional permission to derby.jar, derbynet.jar, and derbyoptionaltools.jar: permission org.apache.derby.security.SystemPermission "engine", "usederbyinternals";{quote} > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot >Priority: Major > > After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started > to fail with following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325) > at > 
org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreMana
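Concretely, the Derby note quoted above corresponds to a policy-file grant of roughly the following shape. Only the permission line comes from the Derby documentation quoted above; the codeBase URL is a placeholder and has to point at wherever derby.jar (and likewise derbynet.jar and derbyoptionaltools.jar) actually lives in the build:

{code:none}
// Hypothetical codeBase path; adjust to the real location of derby.jar in your build.
grant codeBase "file:/path/to/derby.jar" {
  permission org.apache.derby.security.SystemPermission "engine", "usederbyinternals";
};
{code}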
[jira] [Updated] (SPARK-24174) Expose Hadoop config as part of /environment API
[ https://issues.apache.org/jira/browse/SPARK-24174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolay Sokolov updated SPARK-24174: Affects Version/s: (was: 2.3.0) 2.1.0 > Expose Hadoop config as part of /environment API > > > Key: SPARK-24174 > URL: https://issues.apache.org/jira/browse/SPARK-24174 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Nikolay Sokolov >Priority: Minor > Labels: features, usability > > Currently, /environment API call exposes only system properties and > SparkConf. However, in some cases when Spark is used in conjunction with > Hadoop, it is useful to know Hadoop configuration properties. For example, > HDFS or GS buffer sizes, hive metastore settings, and so on. > So it would be good to have hadoop properties being exposed in /environment > API, for example: > {code:none} > GET .../application_1525395994996_5/environment > { >"runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} >"sparkProperties": ["java.io.tmpdir","/tmp", ...], >"systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], > ...], >"classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System > Classpath"], ...], >"hadoopProperties": [["dfs.stream-buffer-size": 4096], ...], > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24170) [Spark SQL] json file format is not dropped after dropping table
[ https://issues.apache.org/jira/browse/SPARK-24170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463336#comment-16463336 ] ABHISHEK KUMAR GUPTA commented on SPARK-24170: -- I think this behavior need to re look and can be taken as improvement for next version. I feel it should json file should also be deleted. > [Spark SQL] json file format is not dropped after dropping table > > > Key: SPARK-24170 > URL: https://issues.apache.org/jira/browse/SPARK-24170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > # Launch spark-sql --master yarn > # create table json(name STRING, age int, gender string, id INT) using > org.apache.spark.sql.json options(path "hdfs:///user/testdemo/"); > # Execute the below SQL queries > INSERT into json > SELECT 'Shaan',21,'Male',1 > UNION ALL > SELECT 'Xing',20,'Female',11 > UNION ALL > SELECT 'Mile',4,'Female',20 > UNION ALL > SELECT 'Malan',10,'Male',9; > Below 4 json file format created > BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls > /user/testdemo > Found 14 items > -rw-r--r-- 3 spark hadoop 0 2018-04-26 17:44 /user/testdemo/_SUCCESS > -rw-r--r-- 3 spark hadoop 4802 2018-04-24 18:20 /user/testdemo/customer1.csv > -rw-r--r-- 3 spark hadoop 92 2018-04-26 17:02 /user/testdemo/json1.txt > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-0-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-0-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:32 > /user/testdemo/part-1-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:44 > /user/testdemo/part-1-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:32 > /user/testdemo/part-2-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:44 > /user/testdemo/part-2-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-3-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-3-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > > Issue is: > Now executed below drop command > spark-sql> drop table json; > > Table dropped successfully but json file still present in the path > /user/testdemo -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
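One thing worth checking before treating this purely as a defect (an assumption about the cause, not a confirmed diagnosis): a table created with an explicit path option is normally registered as an external table, and dropping an external table removes only the metadata, not the files under that path. The registered table type can be inspected as sketched below (table name taken from the steps above):

{code:scala}
// Hedged check: DESCRIBE FORMATTED shows whether the table was registered as
// MANAGED or EXTERNAL. For EXTERNAL tables, DROP TABLE leaves the data files in place.
spark.sql("DESCRIBE FORMATTED json").show(100, truncate = false)
{code}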
[jira] [Updated] (SPARK-24175) improve the Spark 2.4 migration guide
[ https://issues.apache.org/jira/browse/SPARK-24175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24175: Description: The current Spark 2.4 migration guide is not well phrased. We should 1. State the before behavior 2. State the after behavior 3. Add a concrete example with code to illustrate. For example: Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after promotes both sides to TIMESTAMP. To set `false` to `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous behavior. This option will be removed in Spark 3.0. --> In version 2.3 and earlier, Spark implicitly casts a timestamp column to date type when comparing with a date column. In version 2.4 and later, Spark casts the date column to timestamp type instead. As an example, "xxx" would result in ".." in Spark 2.3, and in Spark 2.4, the result would be "..." was:The current Spark 2.4 migration guide is not well phrased. > improve the Spark 2.4 migration guide > - > > Key: SPARK-24175 > URL: https://issues.apache.org/jira/browse/SPARK-24175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > The current Spark 2.4 migration guide is not well phrased. We should > 1. State the before behavior > 2. State the after behavior > 3. Add a concrete example with code to illustrate. > For example: > Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after > promotes both sides to TIMESTAMP. To set `false` to > `spark.sql.hive.compareDateTimestampInTimestamp` restores the previous > behavior. This option will be removed in Spark 3.0. > --> > In version 2.3 and earlier, Spark implicitly casts a timestamp column to date > type when comparing with a date column. In version 2.4 and later, Spark casts > the date column to timestamp type instead. As an example, "xxx" would result > in ".." in Spark 2.3, and in Spark 2.4, the result would be "..." -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
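For completeness, the legacy flag named in the text above can be toggled from a running session; a one-line sketch, assuming the flag name quoted from the guide is accurate:

{code:scala}
// Restores the pre-2.4 DATE vs TIMESTAMP comparison behavior; the guide text above
// notes this flag is slated for removal in Spark 3.0.
spark.conf.set("spark.sql.hive.compareDateTimestampInTimestamp", "false")
{code}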
[jira] [Created] (SPARK-24175) improve the Spark 2.4 migration guide
Wenchen Fan created SPARK-24175: --- Summary: improve the Spark 2.4 migration guide Key: SPARK-24175 URL: https://issues.apache.org/jira/browse/SPARK-24175 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24175) improve the Spark 2.4 migration guide
[ https://issues.apache.org/jira/browse/SPARK-24175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24175: Description: The current Spark 2.4 migration guide is not well phrased. > improve the Spark 2.4 migration guide > - > > Key: SPARK-24175 > URL: https://issues.apache.org/jira/browse/SPARK-24175 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Major > > The current Spark 2.4 migration guide is not well phrased. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463280#comment-16463280 ] Hyukjin Kwon commented on SPARK-23291: -- [~cloud_fan], do you feel we should backport this? I was looking through PRs for backporting, and it seems we had better backport it with a note in the migration guide. I think it was just a bad design mistake; it doesn't even match R's own substr. The workaround should be relatively easy if users are aware of this change. If you also feel it should be backported, I would rather backport this. > SparkR : substr : In SparkR dataframe , starting and ending position > arguments in "substr" is giving wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0 >Reporter: Narendra >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > Defect Description : > - > For example ,an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a a new column named "col2" with the value "12" > which is inside the string ."12" can be extracted with "starting position" as > "6" and "Ending position" as "7" > (the starting position of the first character is considered as "1" ) > But,the current code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,7,8))) > Observe that the first argument in the "substr" API , which indicates the > 'starting position', is mentioned as "7" > Also, observe that the second argument in the "substr" API , which indicates > the 'ending position', is mentioned as "8" > i.e the number that should be mentioned to indicate the position should be > the "actual position + 1" > Expected behavior : > > The code that needs to be written is : > > df <- withColumn(df,"col2",substr(df$col1,6,7))) > Note : > --- > This defect is observed with only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24167) ParquetFilters should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24167. - Resolution: Fixed Fix Version/s: 2.4.0 > ParquetFilters should not access SQLConf at executor side > - > > Key: SPARK-24167 > URL: https://issues.apache.org/jira/browse/SPARK-24167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24167) ParquetFilters should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24167: Affects Version/s: (was: 2.3.0) 2.4.0 > ParquetFilters should not access SQLConf at executor side > - > > Key: SPARK-24167 > URL: https://issues.apache.org/jira/browse/SPARK-24167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24147) .count() reports wrong size of dataframe when filtering dataframe on corrupt record field
[ https://issues.apache.org/jira/browse/SPARK-24147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24147. -- Resolution: Duplicate > .count() reports wrong size of dataframe when filtering dataframe on corrupt > record field > - > > Key: SPARK-24147 > URL: https://issues.apache.org/jira/browse/SPARK-24147 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.2.1 > Environment: Spark version 2.2.1 > Pyspark > Python version 3.6.4 >Reporter: Rich Smith >Priority: Major > > Spark reports the wrong size of dataframe using .count() after filtering on a > corruptField field. > Example file that shows the problem: > > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import StringType, StructType, StructField, DoubleType > from pyspark.sql.functions import col, lit > spark = > SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate() > spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") > SCHEMA = StructType([ > StructField("headerDouble", DoubleType(), False), > StructField("ErrorField", StringType(), False) > ]) > dataframe = ( > spark.read > .option("header", "true") > .option("mode", "PERMISSIVE") > .option("columnNameOfCorruptRecord", "ErrorField") > .schema(SCHEMA).csv("./x.csv") > ) > total_row_count = dataframe.count() > print("total_row_count = " + str(total_row_count)) > errors = dataframe.filter(col("ErrorField").isNotNull()) > errors.show() > error_count = errors.count() > print("errors count = " + str(error_count)) > {code} > > > Using input file x.csv: > > {code:java} > headerDouble > wrong > {code} > > > Output text. As shown, contents of dataframe contains a row, but .count() > reports 0. > > {code:java} > total_row_count = 1 > ++--+ > |headerDouble|ErrorField| > ++--+ > |null| wrong| > ++--+ > errors count = 0 > {code} > > > Also discussed briefly on StackOverflow: > [https://stackoverflow.com/questions/50121899/how-can-sparks-count-function-be-different-to-the-contents-of-the-dataframe] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24174) Expose Hadoop config as part of /environment API
Nikolay Sokolov created SPARK-24174: --- Summary: Expose Hadoop config as part of /environment API Key: SPARK-24174 URL: https://issues.apache.org/jira/browse/SPARK-24174 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 2.3.0 Reporter: Nikolay Sokolov Currently, /environment API call exposes only system properties and SparkConf. However, in some cases when Spark is used in conjunction with Hadoop, it is useful to know Hadoop configuration properties. For example, HDFS or GS buffer sizes, hive metastore settings, and so on. So it would be good to have hadoop properties being exposed in /environment API, for example: {code:none} GET .../application_1525395994996_5/environment { "runtime": {"javaVersion": "1.8.0_131 (Oracle Corporation)", ...} "sparkProperties": ["java.io.tmpdir","/tmp", ...], "systemProperties": [["spark.yarn.jars", "local:/usr/lib/spark/jars/*"], ...], "classpathEntries": [["/usr/lib/hadoop/hadoop-annotations.jar","System Classpath"], ...], "hadoopProperties": [["dfs.stream-buffer-size": 4096], ...], } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
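Until something like the proposed hadoopProperties section exists, the same information can be pulled from inside the application itself; a hedged Scala sketch (the property names are only examples taken from the description above):

{code:scala}
// Hedged workaround sketch: read Hadoop properties from the driver's hadoopConfiguration,
// which is not currently exposed through the /environment REST endpoint.
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hadoop-conf-dump").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// A couple of properties of interest (names are examples from the description above).
Seq("dfs.stream-buffer-size", "hive.metastore.uris").foreach { key =>
  println(s"$key = ${Option(hadoopConf.get(key)).getOrElse("<unset>")}")
}

// Or dump everything, roughly what a "hadoopProperties" section would contain.
hadoopConf.iterator().asScala.foreach(e => println(s"${e.getKey} = ${e.getValue}"))
{code}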
[jira] [Resolved] (SPARK-24168) WindowExec should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24168. - Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 > WindowExec should not access SQLConf at executor side > - > > Key: SPARK-24168 > URL: https://issues.apache.org/jira/browse/SPARK-24168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.1, 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23489) Flaky Test: HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-23489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23489: Fix Version/s: 2.2.2 > Flaky Test: HiveExternalCatalogVersionsSuite > > > Key: SPARK-23489 > URL: https://issues.apache.org/jira/browse/SPARK-23489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Marco Gaido >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > I saw this error in an unrelated PR. It seems to me a bad configuration in > the Jenkins node where the tests are run. > {code} > Error Message > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Cannot run program > "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, > No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file > or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > {code} > This is the link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87615/testReport/. 
> *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389 > *BRANCH 2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/321/ > *NOTE: This failure frequently looks as `Test Result (no failures)`* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24163) Support "ANY" or sub-query for Pivot "IN" clause
[ https://issues.apache.org/jira/browse/SPARK-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463219#comment-16463219 ] Xiao Li commented on SPARK-24163: - This is just nice to have. > Support "ANY" or sub-query for Pivot "IN" clause > > > Key: SPARK-24163 > URL: https://issues.apache.org/jira/browse/SPARK-24163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maryann Xue >Priority: Major > > This is part of a functionality extension to Pivot SQL support as SPARK-24035. > Currently, only literal values are allowed in Pivot "IN" clause. To support > ANY or a sub-query in the "IN" clause (the examples of which provided below), > we need to enable evaluation of a sub-query before/during query analysis time. > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ANY > );{code} > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ( > SELECT course FROM courses > WHERE region = 'AZ' > ) > ); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24164) Support column list as the pivot column in Pivot
[ https://issues.apache.org/jira/browse/SPARK-24164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24164: --- Assignee: Maryann Xue > Support column list as the pivot column in Pivot > > > Key: SPARK-24164 > URL: https://issues.apache.org/jira/browse/SPARK-24164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Major > > This is part of a functionality extension to Pivot SQL support as SPARK-24035. > Currently, we only support a single column as the pivot column, while a > column list as the pivot column would look like: > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR (course, year) IN (('dotNET', 2012), ('Java', 2013)) > );{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24162) Support aliased literal values for Pivot "IN" clause
[ https://issues.apache.org/jira/browse/SPARK-24162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24162: --- Assignee: Maryann Xue > Support aliased literal values for Pivot "IN" clause > > > Key: SPARK-24162 > URL: https://issues.apache.org/jira/browse/SPARK-24162 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maryann Xue >Assignee: Maryann Xue >Priority: Minor > > This is part of a functionality extension to Pivot SQL support as SPARK-24035. > When literal values are specified in Pivot IN clause, it would be nice to > allow aliases for those values so that the output column names can be > customized. For example: > {code:java} > SELECT * FROM ( > SELECT year, course, earnings FROM courseSales > ) > PIVOT ( > sum(earnings) > FOR course IN ('dotNET' as c1, 'Java' as c2) > );{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24035) SQL syntax for Pivot
[ https://issues.apache.org/jira/browse/SPARK-24035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24035. - Resolution: Fixed Fix Version/s: 2.4.0 > SQL syntax for Pivot > > > Key: SPARK-24035 > URL: https://issues.apache.org/jira/browse/SPARK-24035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Maryann Xue >Priority: Major > Fix For: 2.4.0 > > > Some users who are SQL experts but don’t know an ounce of Scala/Python or R. > Thus, we prefer to supporting the SQL syntax for Pivot too -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463209#comment-16463209 ] Jungtaek Lim commented on SPARK-23703: -- Agreed. Is it worth discussing on the dev mailing list, or can we simply propose a patch for the fix? > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463190#comment-16463190 ] Jose Torres commented on SPARK-23703: - No, I don't know of any actual use cases for this. I think just writing an analyzer rule disallowing it could be a valid resolution here. > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463189#comment-16463189 ] Jungtaek Lim commented on SPARK-23703: -- Actually, I haven't heard about multiple watermarks on the same source, which is what makes things complicated. What I have heard of is an event-time window with a single time field, and a watermark on that field. Do you have, or have you heard of, actual use cases for this? > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
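For reference, the plan shape the ticket describes can be produced with two back-to-back withWatermark calls; a hedged sketch (the rate source and the column choice are assumptions, not taken from the ticket):

{code:scala}
// Hedged sketch of a query that yields two sequential EventTimeWatermark nodes:
// two withWatermark calls with no stateful operator in between. The topmost call
// overrides the event-time metadata set by the first, but both nodes stay in the plan.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sequential-watermarks").getOrCreate()

val events = spark.readStream
  .format("rate")            // built-in test source; provides a `timestamp` column
  .load()

val doubled = events
  .withWatermark("timestamp", "10 minutes")
  .withWatermark("timestamp", "5 minutes")

// Inspect the analyzed plan to see both EventTimeWatermark nodes.
println(doubled.queryExecution.analyzed.numberedTreeString)
{code}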
[jira] [Comment Edited] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463003#comment-16463003 ] thua...@fsoft.com.vn edited comment on SPARK-13446 at 5/3/18 11:10 PM: --- Hi Tavis and Spark Team, Tavis explanation is very clear. As I read in the other thread, this bug is fixed and fully tested by QA team. Could someone please help to look into the merging the code. I am running into the same problem in pyspark. Bellow my pyspark submit stack trace: py4j.protocol.Py4JJavaError: An error occurred while calling o24.sessionState. : java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:200) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:265) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at scala.Option.getOrElse(Option.scala:121) was (Author: thua...@fsoft.com.vn): I am running into the same problem in pyspark. Could anyone help resolve this issue, please? Bellow my pyspark submit stack trace: py4j.protocol.Py4JJavaError: An error occurred while calling o24.sessionState. 
: java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:200) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:265) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at scala.Option.getOrElse(Option.scala:121) > Spark need to support reading data from Hive 2
[jira] [Commented] (SPARK-21824) DROP TABLE should automatically drop any dependent referential constraints or raise error.
[ https://issues.apache.org/jira/browse/SPARK-21824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463054#comment-16463054 ] Sunitha Kambhampati commented on SPARK-21824: - I'll look into this. > DROP TABLE should automatically drop any dependent referential constraints > or raise error. > > > Key: SPARK-21824 > URL: https://issues.apache.org/jira/browse/SPARK-21824 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Suresh Thalamati >Priority: Major > > DROP TABLE should raise error if there are any dependent referential > constraints unless user specifies CASCADE CONSTRAINTS > Syntax : > {code:sql} > DROP TABLE [CASCADE CONSTRAINTS] > {code} > Hive drops the referential constraints automatically. Oracle requires user > specify _CASCADE CONSTRAINTS_ clause to automatically drop the referential > constraints, otherwise raises the error. Should we stick to the *Hive > behavior* ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463003#comment-16463003 ] thua...@fsoft.com.vn commented on SPARK-13446: -- I am running into the same problem in pyspark. Could anyone help resolve this issue, please? Bellow my pyspark submit stack trace: py4j.protocol.Py4JJavaError: An error occurred while calling o24.sessionState. : java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT at org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:200) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:265) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:194) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:193) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:105) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:93) at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52) at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289) at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1050) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:130) at scala.Option.getOrElse(Option.scala:121) > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li >Priority: Major > Fix For: 2.2.0 > > > Spark provided HIveContext class to read data from hive metastore directly. > While it only supports hive 1.2.1 version and older. Since hive 2.0.0 has > released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. 
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
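As a side note for anyone hitting the stack trace above: on a Spark release that supports the target metastore version, the usual way to talk to a newer Hive metastore is through the isolated metastore-client settings rather than by swapping Hive jars on Spark's own classpath. A hedged sketch, where the version string and jar source are assumptions:

{code:scala}
// Hedged sketch: point Spark's isolated Hive client at a 2.x metastore.
// "maven" tells Spark to download the matching Hive client jars; a local classpath
// of Hive jars can be supplied instead.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2-metastore")
  .config("spark.sql.hive.metastore.version", "2.0.0")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}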
[jira] [Commented] (SPARK-23489) Flaky Test: HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-23489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462971#comment-16462971 ] Apache Spark commented on SPARK-23489: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/21232 > Flaky Test: HiveExternalCatalogVersionsSuite > > > Key: SPARK-23489 > URL: https://issues.apache.org/jira/browse/SPARK-23489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Marco Gaido >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I saw this error in an unrelated PR. It seems to me a bad configuration in > the Jenkins node where the tests are run. > {code} > Error Message > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Cannot run program > "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, > No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file > or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > {code} > This is the link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87615/testReport/. 
> *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389 > *BRANCH 2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/321/ > *NOTE: This failure frequently looks as `Test Result (no failures)`* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24119) Add interpreted execution to SortPrefix expression
[ https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24119: Assignee: (was: Apache Spark) > Add interpreted execution to SortPrefix expression > -- > > Key: SPARK-24119 > URL: https://issues.apache.org/jira/browse/SPARK-24119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > [~hvanhovell] [~kiszk] > I noticed SortPrefix did not support interpreted execution when I was testing > the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for > adding interpreted execution (SPARK-23580) > Since I had to implement interpreted execution for SortPrefix to complete > testing, I am creating this Jira. If there's no good reason why eval wasn't > implemented, I will make the PR in a few days. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24119) Add interpreted execution to SortPrefix expression
[ https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462939#comment-16462939 ] Apache Spark commented on SPARK-24119: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/21231 > Add interpreted execution to SortPrefix expression > -- > > Key: SPARK-24119 > URL: https://issues.apache.org/jira/browse/SPARK-24119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > [~hvanhovell] [~kiszk] > I noticed SortPrefix did not support interpreted execution when I was testing > the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for > adding interpreted execution (SPARK-23580) > Since I had to implement interpreted execution for SortPrefix to complete > testing, I am creating this Jira. If there's no good reason why eval wasn't > implemented, I will make the PR in a few days. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24119) Add interpreted execution to SortPrefix expression
[ https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24119: Assignee: Apache Spark > Add interpreted execution to SortPrefix expression > -- > > Key: SPARK-24119 > URL: https://issues.apache.org/jira/browse/SPARK-24119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > [~hvanhovell] [~kiszk] > I noticed SortPrefix did not support interpreted execution when I was testing > the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for > adding interpreted execution (SPARK-23580) > Since I had to implement interpreted execution for SortPrefix to complete > testing, I am creating this Jira. If there's no good reason why eval wasn't > implemented, I will make the PR in a few days. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24173) Flaky Test: VersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-24173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24173: -- Description: *BRANCH-2.2* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/ *BRANCH-2.3* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ was:- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ > Flaky Test: VersionsSuite > - > > Key: SPARK-24173 > URL: https://issues.apache.org/jira/browse/SPARK-24173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > *BRANCH-2.2* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/519/ > *BRANCH-2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12764) XML Column type is not supported (JDBC connection to Postgres)
[ https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462827#comment-16462827 ] Martin Tapp commented on SPARK-12764: - Using Spark 2.1.1 and can't load a table with a JSONB column even if I'm explicitly doing a select ignoring the JSONB column. Thanks > XML Column type is not supported (JDBC connection to Postgres) > -- > > Key: SPARK-12764 > URL: https://issues.apache.org/jira/browse/SPARK-12764 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 1.6.0 > Environment: Mac Os X El Capitan >Reporter: Rajeshwar Gaini >Priority: Major > > Hi All, > I am using PostgreSQL database. I am using the following jdbc call to access > a customer table (customer_id int, event text, country text, content xml) in > my database. > {code} > val dataframe1 = sqlContext.load("jdbc", Map("url" -> > "jdbc:postgresql://localhost/customerlogs?user=postgres&password=postgres", > "dbtable" -> "customer")) > {code} > When i run above command in spark-shell i receive the following error. > {code} > java.sql.SQLException: Unsupported type > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91) > at > org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119) > at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:34) > at $iwC$$iwC$$iwC$$iwC.(:36) > at $iwC$$iwC$$iwC.(:38) > at $iwC$$iwC.(:40) > at $iwC.(:42) > at (:44) > at .(:48) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) > at > 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at
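A hedged workaround sketch for the report above (an active {{sqlContext}} from spark-shell is assumed, and the table/column names are taken from the reporter's example): because the JDBC source resolves the full table schema before any column pruning, even a later select cannot avoid the unsupported {{xml}} column, but a parenthesised subquery passed as {{dbtable}} can cast it to {{text}} on the Postgres side before Spark ever sees the type.

{code:scala}
// Hedged workaround sketch only, not an official fix: push the cast down to Postgres
// via a subquery so the JDBC reader never encounters the unsupported xml column type.
val dataframe1 = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://localhost/customerlogs?user=postgres&password=postgres",
  "dbtable" -> "(select customer_id, event, country, cast(content as text) as content from customer) as c"
)).load()
{code}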
[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462797#comment-16462797 ] Stephen Boesch commented on SPARK-10943: Given the comment by Daniel Davis can this issue be reopened? > NullType Column cannot be written to Parquet > > > Key: SPARK-10943 > URL: https://issues.apache.org/jira/browse/SPARK-10943 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl >Priority: Major > > {code} > var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null > as comments") > {code} > //FAIL - Try writing a NullType column (where all the values are NULL) > {code} > data02.write.parquet("/tmp/test/dataset2") > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 179.0 (TID 39924, 10.0.196.208): > org.apache.spark.sql.AnalysisException: Unsupported data type > StructField(comments,NullType,true).dataType; > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at 
scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at org.apache.spark.sql.types.StructType.map(StructType.scala:92) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) > at > org.apache.spark.sql.execution.datasources.parquet.P
[jira] [Created] (SPARK-24173) Flaky Test: VersionsSuite
Dongjoon Hyun created SPARK-24173: - Summary: Flaky Test: VersionsSuite Key: SPARK-24173 URL: https://issues.apache.org/jira/browse/SPARK-24173 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Dongjoon Hyun - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23622) Flaky Test: HiveClientSuites
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23622: -- Summary: Flaky Test: HiveClientSuites (was: HiveClientSuites fails with InvocationTargetException) > Flaky Test: HiveClientSuites > > > Key: SPARK-23622 > URL: https://issues.apache.org/jira/browse/SPARK-23622 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test > (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325 > {code} > Error Message > java.lang.reflect.InvocationTargetException: null > Stacktrace > sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270) > at > org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58) > at > org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41) > at > org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48) > at > org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) > at > org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255) > at > org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24) > at org.scalatest.Suite$class.run(Suite.scala:1144) > at > org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444) > at > org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183) > at > org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117) > ... 29 more > Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to > instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:63) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683) > at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425) > ... 31 more > Caused by: sbt.ForkMain$ForkError: > java.lang.reflect.InvocationTargetException:
[jira] [Updated] (SPARK-23622) HiveClientSuites fails with InvocationTargetException
[ https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23622: -- Description: - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325 {code} Error Message java.lang.reflect.InvocationTargetException: null Stacktrace sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270) at org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58) at org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41) at org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48) at org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71) at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210) at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257) at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255) at org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24) at org.scalatest.Suite$class.run(Suite.scala:1144) at org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24) at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) at sbt.ForkMain$Run$2.call(ForkMain.java:296) at sbt.ForkMain$Run$2.call(ForkMain.java:286) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183) at org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:117) ... 
29 more Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:63) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:73) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2664) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2683) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:425) ... 31 more Caused by: sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1451) ... 36 more Caused by: sbt.ForkMai
[jira] [Resolved] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23433. -- Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 2.2.2 fixed by https://github.com/apache/spark/pull/21131 > java.lang.IllegalStateException: more than one active taskSet for stage > --- > > Key: SPARK-23433 > URL: https://issues.apache.org/jira/browse/SPARK-23433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shixiong Zhu >Assignee: Imran Rashid >Priority: Major > Fix For: 2.2.2, 2.3.1, 2.4.0 > > > This following error thrown by DAGScheduler stopped the cluster: > {code} > 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage > 7580621: 7580621.2,7580621.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23433: Assignee: Imran Rashid > java.lang.IllegalStateException: more than one active taskSet for stage > --- > > Key: SPARK-23433 > URL: https://issues.apache.org/jira/browse/SPARK-23433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Shixiong Zhu >Assignee: Imran Rashid >Priority: Major > > This following error thrown by DAGScheduler stopped the cluster: > {code} > 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage > 7580621: 7580621.2,7580621.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24154) AccumulatorV2 loses type information during serialization
[ https://issues.apache.org/jira/browse/SPARK-24154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462630#comment-16462630 ] Sergey Zhemzhitsky commented on SPARK-24154: > If users have to support mixin traits, they can still use accumulator v1. But accumulators V1 are deprecated and will be removed one day I believe. > AccumulatorV2 loses type information during serialization > - > > Key: SPARK-24154 > URL: https://issues.apache.org/jira/browse/SPARK-24154 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.3.1 > Environment: Scala 2.11 > Spark 2.2.0 >Reporter: Sergey Zhemzhitsky >Priority: Major > > AccumulatorV2 loses type information during serialization. > It happens > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L164] > during *writeReplace* call > {code:scala} > final protected def writeReplace(): Any = { > if (atDriverSide) { > if (!isRegistered) { > throw new UnsupportedOperationException( > "Accumulator must be registered before send to executor") > } > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy") > val isInternalAcc = name.isDefined && > name.get.startsWith(InternalAccumulator.METRICS_PREFIX) > if (isInternalAcc) { > // Do not serialize the name of internal accumulator and send it to > executor. > copyAcc.metadata = metadata.copy(name = None) > } else { > // For non-internal accumulators, we still need to send the name > because users may need to > // access the accumulator name at executor side, or they may keep the > accumulators sent from > // executors and access the name when the registered accumulator is > already garbage > // collected(e.g. SQLMetrics). > copyAcc.metadata = metadata > } > copyAcc > } else { > this > } > } > {code} > It means that it is hardly possible to create new accumulators easily by > adding new behaviour to existing ones by means of mix-ins or inheritance > (without overriding *copy*). > For example the following snippet ... > {code:scala} > trait TripleCount { > self: LongAccumulator => > abstract override def add(v: jl.Long): Unit = { > self.add(v * 3) > } > } > val acc = new LongAccumulator with TripleCount > sc.register(acc) > val data = 1 to 10 > val rdd = sc.makeRDD(data, 5) > rdd.foreach(acc.add(_)) > acc.value shouldBe 3 * data.sum > {code} > ... fails with > {code:none} > org.scalatest.exceptions.TestFailedException: 55 was not equal to 165 > at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at org.scalatest.Matchers$AnyShouldWrapper.shouldBe(Matchers.scala:6864) > {code} > Also such a behaviour seems to be error prone and confusing because an > implementor gets not the same thing as he/she sees in the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24169) JsonToStructs should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24169. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21226 [https://github.com/apache/spark/pull/21226] > JsonToStructs should not access SQLConf at executor side > > > Key: SPARK-24169 > URL: https://issues.apache.org/jira/browse/SPARK-24169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0, 2.3.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
[ https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24172: Assignee: Wenchen Fan (was: Apache Spark) > we should not apply operator pushdown to data source v2 many times > -- > > Key: SPARK-24172 > URL: https://issues.apache.org/jira/browse/SPARK-24172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
[ https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462587#comment-16462587 ] Apache Spark commented on SPARK-24172: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21230 > we should not apply operator pushdown to data source v2 many times > -- > > Key: SPARK-24172 > URL: https://issues.apache.org/jira/browse/SPARK-24172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
[ https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24172: Assignee: Apache Spark (was: Wenchen Fan) > we should not apply operator pushdown to data source v2 many times > -- > > Key: SPARK-24172 > URL: https://issues.apache.org/jira/browse/SPARK-24172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462584#comment-16462584 ] Jose Torres commented on SPARK-23703: - I'm no longer entirely convinced that this (and the parent JIRA) are correct. We might not want to support these scenarios at all. The question here is what we should do with the query: df.withWatermark(“a”, …) .withWatermark(“b”, …) .agg(...) What we do right now is definitely wrong. We (in MicroBatchExecution) calculate separate watermarks on "a" and "b", take their minimum, and then pass that as the watermark value to the aggregate. But the aggregate only sees "b" as a watermarked column, because only "b" has EventTimeWatermark.delayKey set in its attribute metadata at the aggregate node. EventTimeWatermark("b").output erases the metadata for "a" in its output. So we need to somehow resolve this mismatch. > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
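For readers following along, a hedged sketch of the query shape discussed in the comment above, using the built-in rate source purely for illustration (the column names "a" and "b" and the duplication of the event-time column are hypothetical; an active {{spark}} session is assumed):

{code:scala}
import org.apache.spark.sql.functions.{col, window}

// Hedged sketch only: the rate source provides a single event-time column, which is
// duplicated here so that two withWatermark calls precede the aggregate, mirroring the
// df.withWatermark("a", ...).withWatermark("b", ...).agg(...) shape described above.
val events = spark.readStream.format("rate").load()   // columns: timestamp, value
val aggregated = events
  .withColumnRenamed("timestamp", "a")
  .withColumn("b", col("a"))
  .withWatermark("a", "10 minutes")
  .withWatermark("b", "5 minutes")
  .groupBy(window(col("b"), "10 minutes"))
  .count()
{code}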
[jira] [Created] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
Wenchen Fan created SPARK-24172: --- Summary: we should not apply operator pushdown to data source v2 many times Key: SPARK-24172 URL: https://issues.apache.org/jira/browse/SPARK-24172 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24133) Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-24133: -- Fix Version/s: 2.3.1 > Reading Parquet files containing large strings can fail with > java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-24133 > URL: https://issues.apache.org/jira/browse/SPARK-24133 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ala Luszczak >Assignee: Ala Luszczak >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > ColumnVectors store string data in one big byte array. Since the array size > is capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store > more than 2GB of string data. > However, since the Parquet files commonly contain large blobs stored as > strings, and ColumnVectors by default carry 4096 values, it's entirely > possible to go past that limit. > In such cases a negative capacity is requested from > WritableColumnVector.reserve(). The call succeeds (requested capacity is > smaller than already allocated), and consequently > java.lang.ArrayIndexOutOfBoundsException is thrown when the reader actually > attempts to put the data into the array. > This behavior is hard to troubleshoot for the users. Spark should instead > check for negative requested capacity in WritableColumnVector.reserve() and > throw more informative error, instructing the user to tweak ColumnarBatch > size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
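A small worked example of the overflow described in the ticket (the per-value size is hypothetical): with the default 4096 values per batch, moderately large strings already push the required byte-array capacity past Integer.MAX_VALUE, and 32-bit arithmetic wraps the requested capacity to a negative number.

{code:scala}
// Hedged illustration of the arithmetic only; this is not Spark source code.
val valuesPerBatch = 4096                        // default ColumnarBatch size
val avgStringBytes = 600 * 1024                  // hypothetical ~600 KB per string value
val needed = valuesPerBatch.toLong * avgStringBytes
println(needed > Int.MaxValue)                   // true: a single ColumnVector cannot hold it
println(valuesPerBatch * avgStringBytes)         // Int arithmetic wraps: prints -1778384896
{code}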
[jira] [Commented] (SPARK-24170) [Spark SQL] json file format is not dropped after dropping table
[ https://issues.apache.org/jira/browse/SPARK-24170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462545#comment-16462545 ] Marco Gaido commented on SPARK-24170: - This is true for every datasource. This is the expected behavior when you set the location, because by default if the location is set, Spark assumes that the table is external (and not managed). I am not sure whether this is the right thing to do, but it is how it works. cc [~smilegator] [~dongjoon] any further comments on this? Shall we discuss if this is the right behavior? > [Spark SQL] json file format is not dropped after dropping table > > > Key: SPARK-24170 > URL: https://issues.apache.org/jira/browse/SPARK-24170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > # Launch spark-sql --master yarn > # create table json(name STRING, age int, gender string, id INT) using > org.apache.spark.sql.json options(path "hdfs:///user/testdemo/"); > # Execute the below SQL queries > INSERT into json > SELECT 'Shaan',21,'Male',1 > UNION ALL > SELECT 'Xing',20,'Female',11 > UNION ALL > SELECT 'Mile',4,'Female',20 > UNION ALL > SELECT 'Malan',10,'Male',9; > Below 4 json file format created > BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls > /user/testdemo > Found 14 items > -rw-r--r-- 3 spark hadoop 0 2018-04-26 17:44 /user/testdemo/_SUCCESS > -rw-r--r-- 3 spark hadoop 4802 2018-04-24 18:20 /user/testdemo/customer1.csv > -rw-r--r-- 3 spark hadoop 92 2018-04-26 17:02 /user/testdemo/json1.txt > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-0-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-0-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:32 > /user/testdemo/part-1-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:44 > /user/testdemo/part-1-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:32 > /user/testdemo/part-2-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:44 > /user/testdemo/part-2-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-3-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-3-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > > Issue is: > Now executed below drop command > spark-sql> drop table json; > > Table dropped successfully but json file still present in the path > /user/testdemo -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
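To make the behaviour Marco describes concrete, a hedged sketch (run from spark-shell with an active {{spark}} session assumed; table names here are hypothetical, and the path is the reporter's): a datasource table created with an explicit path is registered as external, so DROP TABLE removes only the metadata, whereas a managed table's files under the warehouse directory are deleted along with it.

{code:scala}
// Hedged illustration of external vs. managed datasource tables; not a fix.
spark.sql("""
  CREATE TABLE json_ext (name STRING, age INT, gender STRING, id INT)
  USING org.apache.spark.sql.json
  OPTIONS (path 'hdfs:///user/testdemo/')
""")
spark.sql("DROP TABLE json_ext")      // external table: files under hdfs:///user/testdemo/ are kept

spark.sql("CREATE TABLE json_managed (name STRING, age INT, gender STRING, id INT) USING json")
spark.sql("DROP TABLE json_managed")  // managed table: files under the warehouse directory are removed
{code}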
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462540#comment-16462540 ] Imran Rashid commented on SPARK-24135: -- Honestly I don't understand the failure mode described here at all, but I can make some comparisons to YARN's handling of executor failures at the allocator level. In YARN, Spark already has a check for the number of executor failures, and it fails the entire application if there are too many. It's controlled by "spark.yarn.max.executor.failures". The failures expire over time, controlled by "spark.yarn.executor.failuresValidityInterval", so really long-running apps are not penalized by a few errors spread out over a long period of time. See the code in ApplicationMaster & YarnAllocator. There is also ongoing work to have Spark recognize when container initialization has failed and then request that other nodes be selected instead, SPARK-16630. There is a PR under review for that now. From the bug description, I do think there should be some better error handling than what there is now so the user at least knows what is going on, but it sounds like you're all in agreement about that already :). > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
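For reference, a hedged sketch of the YARN-side knobs Imran mentions (the values are examples only, and they apply only when the application runs on YARN):

{code:scala}
import org.apache.spark.SparkConf

// Example values only: fail the application after too many executor failures, but let
// old failures expire so long-running applications are not penalized by stale errors.
val conf = new SparkConf()
  .set("spark.yarn.max.executor.failures", "20")
  .set("spark.yarn.executor.failuresValidityInterval", "1h")
{code}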
[jira] [Assigned] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23697: Assignee: (was: Apache Spark) > Accumulators of Spark 1.x no longer work with Spark 2.x > --- > > Key: SPARK-23697 > URL: https://issues.apache.org/jira/browse/SPARK-23697 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark 2.2.0 > Scala 2.11 >Reporter: Sergey Zhemzhitsky >Priority: Major > > I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x > failing with > {code:java} > java.lang.AssertionError: assertion failed: copyAndReset must return a zero > value copy{code} > It happens while serializing an accumulator > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] > {code:java} > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > ... although copyAndReset returns zero-value copy for sure, just consider the > accumulator below > {code:java} > val concatParam = new AccumulatorParam[jl.StringBuilder] { > override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new > jl.StringBuilder() > override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): > jl.StringBuilder = r1.append(r2) > }{code} > So, Spark treats zero value as non-zero due to how > [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] > is implemented in LegacyAccumulatorWrapper. > {code:java} > override def isZero: Boolean = _value == param.zero(initialValue){code} > All this means that the values to be accumulated must implement equals and > hashCode, otherwise isZero is very likely to always return false. > So I'm wondering whether the assertion > {code:java} > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > is really necessary and whether it can be safely removed from there? > If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper > to prevent such failures? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23697: Assignee: Apache Spark > Accumulators of Spark 1.x no longer work with Spark 2.x > --- > > Key: SPARK-23697 > URL: https://issues.apache.org/jira/browse/SPARK-23697 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark 2.2.0 > Scala 2.11 >Reporter: Sergey Zhemzhitsky >Assignee: Apache Spark >Priority: Major > > I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x > failing with > {code:java} > java.lang.AssertionError: assertion failed: copyAndReset must return a zero > value copy{code} > It happens while serializing an accumulator > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] > {code:java} > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > ... although copyAndReset returns zero-value copy for sure, just consider the > accumulator below > {code:java} > val concatParam = new AccumulatorParam[jl.StringBuilder] { > override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new > jl.StringBuilder() > override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): > jl.StringBuilder = r1.append(r2) > }{code} > So, Spark treats zero value as non-zero due to how > [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] > is implemented in LegacyAccumulatorWrapper. > {code:java} > override def isZero: Boolean = _value == param.zero(initialValue){code} > All this means that the values to be accumulated must implement equals and > hashCode, otherwise isZero is very likely to always return false. > So I'm wondering whether the assertion > {code:java} > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > is really necessary and whether it can be safely removed from there? > If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper > to prevent such failures? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462465#comment-16462465 ] Apache Spark commented on SPARK-23697: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21229 > Accumulators of Spark 1.x no longer work with Spark 2.x > --- > > Key: SPARK-23697 > URL: https://issues.apache.org/jira/browse/SPARK-23697 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0 > Environment: Spark 2.2.0 > Scala 2.11 >Reporter: Sergey Zhemzhitsky >Priority: Major > > I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x > failing with > {code:java} > java.lang.AssertionError: assertion failed: copyAndReset must return a zero > value copy{code} > It happens while serializing an accumulator > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] > {code:java} > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > ... although copyAndReset returns zero-value copy for sure, just consider the > accumulator below > {code:java} > val concatParam = new AccumulatorParam[jl.StringBuilder] { > override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new > jl.StringBuilder() > override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): > jl.StringBuilder = r1.append(r2) > }{code} > So, Spark treats zero value as non-zero due to how > [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] > is implemented in LegacyAccumulatorWrapper. > {code:java} > override def isZero: Boolean = _value == param.zero(initialValue){code} > All this means that the values to be accumulated must implement equals and > hashCode, otherwise isZero is very likely to always return false. > So I'm wondering whether the assertion > {code:java} > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > is really necessary and whether it can be safely removed from there? > If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper > to prevent such failures? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
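A hedged workaround sketch for the assertion failure described above, assuming an active SparkContext {{sc}} from spark-shell: accumulate into a type whose equals/hashCode are value-based (String rather than java.lang.StringBuilder), so LegacyAccumulatorWrapper's {{isZero}} recognises the zero value and the "copyAndReset must return a zero value copy" assertion holds.

{code:scala}
import org.apache.spark.AccumulatorParam

// Hedged sketch, not a fix for the underlying issue: String has value-based equality,
// so isZero correctly compares the reset value against param.zero(initialValue).
val concatParam = new AccumulatorParam[String] {
  override def zero(initialValue: String): String = ""
  override def addInPlace(r1: String, r2: String): String = r1 + r2
}
val acc = sc.accumulator("")(concatParam)   // the Accumulator V1 API is deprecated but still available in 2.x
{code}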
[jira] [Commented] (SPARK-24154) AccumulatorV2 loses type information during serialization
[ https://issues.apache.org/jira/browse/SPARK-24154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462413#comment-16462413 ] Wenchen Fan commented on SPARK-24154: - I think this is a trade-off: In accumulator v1, we have separated classes for keeping states and defining computation, i.e. `Accumulable` and `AccumulableParam`. Users can mix in different traits into their `AccumulableParam` implementation, it works fine because the `copy` is defined in `Accumulable`. In accumulator v2, we have a single class for keeping states and defining computation, i.e. `AccumulatorV2`. It simplifies the accumulator framework a lot, and makes it much easier for users to eliminate boxing, although it loses the flexibility to mix in traits. In general, I don't think it's possible to support this pattern in accumulator v2, users would have to override the `copy` method. I think this is worth, compared to what accumulator v2 brings in. If users have to support mixin traits, they can still use accumulator v1. > AccumulatorV2 loses type information during serialization > - > > Key: SPARK-24154 > URL: https://issues.apache.org/jira/browse/SPARK-24154 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.3.1 > Environment: Scala 2.11 > Spark 2.2.0 >Reporter: Sergey Zhemzhitsky >Priority: Major > > AccumulatorV2 loses type information during serialization. > It happens > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L164] > during *writeReplace* call > {code:scala} > final protected def writeReplace(): Any = { > if (atDriverSide) { > if (!isRegistered) { > throw new UnsupportedOperationException( > "Accumulator must be registered before send to executor") > } > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy") > val isInternalAcc = name.isDefined && > name.get.startsWith(InternalAccumulator.METRICS_PREFIX) > if (isInternalAcc) { > // Do not serialize the name of internal accumulator and send it to > executor. > copyAcc.metadata = metadata.copy(name = None) > } else { > // For non-internal accumulators, we still need to send the name > because users may need to > // access the accumulator name at executor side, or they may keep the > accumulators sent from > // executors and access the name when the registered accumulator is > already garbage > // collected(e.g. SQLMetrics). > copyAcc.metadata = metadata > } > copyAcc > } else { > this > } > } > {code} > It means that it is hardly possible to create new accumulators easily by > adding new behaviour to existing ones by means of mix-ins or inheritance > (without overriding *copy*). > For example the following snippet ... > {code:scala} > trait TripleCount { > self: LongAccumulator => > abstract override def add(v: jl.Long): Unit = { > self.add(v * 3) > } > } > val acc = new LongAccumulator with TripleCount > sc.register(acc) > val data = 1 to 10 > val rdd = sc.makeRDD(data, 5) > rdd.foreach(acc.add(_)) > acc.value shouldBe 3 * data.sum > {code} > ... 
fails with > {code:none} > org.scalatest.exceptions.TestFailedException: 55 was not equal to 165 > at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340) > at org.scalatest.Matchers$AnyShouldWrapper.shouldBe(Matchers.scala:6864) > {code} > Also such a behaviour seems to be error prone and confusing because an > implementor gets not the same thing as he/she sees in the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
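Following the suggestion above, a hedged sketch of what overriding {{copy}} could look like for the tripling example from the description (this is illustrative only, not an API the framework provides):

{code:scala}
import java.{lang => jl}
import org.apache.spark.util.LongAccumulator

// Hedged sketch: subclass instead of mixing in a trait, and override copy() so that
// writeReplace's copyAndReset() produces an instance of the subclass (keeping its
// behaviour) when the accumulator is shipped to executors.
class TripleCountAccumulator extends LongAccumulator {
  override def add(v: jl.Long): Unit = super.add(v.longValue * 3L)
  override def add(v: Long): Unit = super.add(v * 3L)

  override def copy(): TripleCountAccumulator = {
    val newAcc = new TripleCountAccumulator
    newAcc.merge(this)   // copies sum and count; copyAndReset() resets the copy afterwards
    newAcc
  }
}
{code}

Registered with {{sc.register(new TripleCountAccumulator, "tripleCount")}}, the executors would then receive a TripleCountAccumulator rather than a plain LongAccumulator.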
[jira] [Commented] (SPARK-24090) Kubernetes Backend Hotlist for Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-24090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462363#comment-16462363 ] Stavros Kontopoulos commented on SPARK-24090: - Any plans for adding more items on this list? > Kubernetes Backend Hotlist for Spark 2.4 > > > Key: SPARK-24090 > URL: https://issues.apache.org/jira/browse/SPARK-24090 > Project: Spark > Issue Type: Umbrella > Components: Kubernetes, Scheduler >Affects Versions: 2.4.0 >Reporter: Anirudh Ramanathan >Assignee: Anirudh Ramanathan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24171) Update comments for non-deterministic functions
[ https://issues.apache.org/jira/browse/SPARK-24171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462360#comment-16462360 ] Apache Spark commented on SPARK-24171: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21228 > Update comments for non-deterministic functions > --- > > Key: SPARK-24171 > URL: https://issues.apache.org/jira/browse/SPARK-24171 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Description of non-deterministic functions like the _collect_list()_ and > _first()_ doesn't contain information about that. Need to add a notice about > it to show the behavior in user facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24171) Update comments for non-deterministic functions
[ https://issues.apache.org/jira/browse/SPARK-24171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24171: Assignee: Apache Spark > Update comments for non-deterministic functions > --- > > Key: SPARK-24171 > URL: https://issues.apache.org/jira/browse/SPARK-24171 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Description of non-deterministic functions like the _collect_list()_ and > _first()_ doesn't contain information about that. Need to add a notice about > it to show the behavior in user facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24171) Update comments for non-deterministic functions
[ https://issues.apache.org/jira/browse/SPARK-24171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24171: Assignee: (was: Apache Spark) > Update comments for non-deterministic functions > --- > > Key: SPARK-24171 > URL: https://issues.apache.org/jira/browse/SPARK-24171 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Minor > > Description of non-deterministic functions like the _collect_list()_ and > _first()_ doesn't contain information about that. Need to add a notice about > it to show the behavior in user facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24166) InMemoryTableScanExec should not access SQLConf at executor side
[ https://issues.apache.org/jira/browse/SPARK-24166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24166. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21223 [https://github.com/apache/spark/pull/21223] > InMemoryTableScanExec should not access SQLConf at executor side > > > Key: SPARK-24166 > URL: https://issues.apache.org/jira/browse/SPARK-24166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0, 2.3.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23703) Collapse sequential watermarks
[ https://issues.apache.org/jira/browse/SPARK-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462339#comment-16462339 ] Jungtaek Lim commented on SPARK-23703: -- [~joseph.torres] Could you provide a simple code snippet or query showing this behavior? It would give me (and possibly other contributors) a better understanding of the rationale for this issue, and maybe of the relevant internals too. Once I understand the details I'd also like to work on this. > Collapse sequential watermarks > --- > > Key: SPARK-23703 > URL: https://issues.apache.org/jira/browse/SPARK-23703 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > When there are two sequential EventTimeWatermark nodes in a query plan, the > topmost one overrides the column tracking metadata from its children, but > leaves the nodes themselves untouched. When there is no intervening stateful > operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
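A minimal sketch of the plan shape described in SPARK-23703 above, assuming a rate source and illustrative column names (none of this is taken from the issue itself): two sequential withWatermark calls produce two EventTimeWatermark nodes, with the upper one overriding the event-time metadata of the lower one while the lower node stays in the plan.
{code}
import spark.implicits._

// A streaming Dataset with an event-time column (assumed schema).
val events = spark.readStream.format("rate").load()
  .withColumnRenamed("timestamp", "eventTime")

// Two sequential watermarks with no stateful operator in between; per the
// issue, the lower EventTimeWatermark node could be removed entirely.
val doubleWatermark = events
  .withWatermark("eventTime", "10 minutes")
  .select($"eventTime", $"value")
  .withWatermark("eventTime", "20 minutes")
{code}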
[jira] [Created] (SPARK-24171) Update comments for non-deterministic functions
Maxim Gekk created SPARK-24171: -- Summary: Update comments for non-deterministic functions Key: SPARK-24171 URL: https://issues.apache.org/jira/browse/SPARK-24171 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk The descriptions of non-deterministic functions like _collect_list()_ and _first()_ don't mention that they are non-deterministic. A notice about this behavior needs to be added to the user-facing docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
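As a hedged illustration of the behavior SPARK-24171 wants documented (the data and grouping below are made up for this sketch): _first()_ and _collect_list()_ depend on the row order that happens to arrive after a shuffle, so repeated runs can return different results.
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(0, 1000).repartition(8).toDF("id")

// With no explicit ordering, the row picked by first() and the order of the
// list built by collect_list() are not deterministic across runs.
df.groupBy(($"id" % 10).as("bucket"))
  .agg(first($"id").as("sample_id"), collect_list($"id").as("ids"))
  .show()
{code}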
[jira] [Assigned] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23715: Assignee: Wenchen Fan > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact that the FromUTCTimestamp expression, despite > its name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. > As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. 
FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted by the local time > zone. So it reverse-shifts the long value by the local time zone's offset, > which produces a incorrect timestamp (except in the c
[jira] [Resolved] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23715. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21169 [https://github.com/apache/spark/pull/21169] > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact that the FromUTCTimestamp expression, despite > its name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. 
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted by the local time > zone. So it reverse-shifts th
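To make the shifting described in SPARK-23715 concrete, here is a minimal sketch of the offset arithmetic, assuming a local time zone of America/Los_Angeles; it only illustrates why the reverse-shift is correct for zone-less input and seven hours off for input that already carried an explicit offset.
{code}
import java.util.TimeZone

// Values are microseconds since the epoch, matching the issue text.
val localTz = TimeZone.getTimeZone("America/Los_Angeles")

// "2018-03-13T06:18:23" parsed in the local zone (what the cast produces):
val parsedLocalMicros = 1520947103000000L               // 2018-03-13T06:18:23-07:00
// FromUTCTimestamp reverse-shifts by the local offset (-7 hours here) to
// recover the UTC instant it expects:
val offsetMicros = localTz.getOffset(parsedLocalMicros / 1000L) * 1000L
val assumedUtcMicros = parsedLocalMicros + offsetMicros // 2018-03-13T06:18:23+00:00

// If the input string already carried an explicit offset, the cast did not
// apply the local shift, so the same reverse-shift moves the timestamp seven
// hours too far back, which is the wrong answer shown in the issue.
{code}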
[jira] [Resolved] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24152. -- Resolution: Fixed > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > PR builder and master branch test fails with the following SparkR error with > unknown reason. The following is an error message from that. > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fail with no failures) > This is critical because we already start to merge the PR by ignoring this > **known unkonwn** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24133) Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-24133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462287#comment-16462287 ] Apache Spark commented on SPARK-24133: -- User 'ala' has created a pull request for this issue: https://github.com/apache/spark/pull/21227 > Reading Parquet files containing large strings can fail with > java.lang.ArrayIndexOutOfBoundsException > - > > Key: SPARK-24133 > URL: https://issues.apache.org/jira/browse/SPARK-24133 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ala Luszczak >Assignee: Ala Luszczak >Priority: Major > Fix For: 2.4.0 > > > ColumnVectors store string data in one big byte array. Since the array size > is capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store > more than 2GB of string data. > However, since the Parquet files commonly contain large blobs stored as > strings, and ColumnVectors by default carry 4096 values, it's entirely > possible to go past that limit. > In such cases a negative capacity is requested from > WritableColumnVector.reserve(). The call succeeds (requested capacity is > smaller than already allocated), and consequently > java.lang.ArrayIndexOutOfBoundsException is thrown when the reader actually > attempts to put the data into the array. > This behavior is hard to troubleshoot for the users. Spark should instead > check for negative requested capacity in WritableColumnVector.reserve() and > throw more informative error, instructing the user to tweak ColumnarBatch > size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
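A minimal sketch of the defensive check SPARK-24133 proposes; this is not the actual WritableColumnVector code, only an illustration of catching the integer overflow (which surfaces as a negative requested capacity) and reporting it with an actionable message.
{code}
// Sketch only: the class, message, and growth policy are illustrative.
class SketchColumnVector(private var capacity: Int) {
  def reserve(requiredCapacity: Int): Unit = {
    if (requiredCapacity < 0) {
      // More than ~2GB of string data in one batch overflows Int and shows up
      // here as a negative request; fail fast with a hint instead of hitting
      // ArrayIndexOutOfBoundsException later.
      throw new RuntimeException(
        s"Cannot reserve additional contiguous bytes in the vectorized reader " +
          s"(requested $requiredCapacity bytes); consider reducing the ColumnarBatch size.")
    } else if (requiredCapacity > capacity) {
      capacity = math.max(requiredCapacity, capacity * 2) // grow (allocation elided)
    }
  }
}
{code}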
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462280#comment-16462280 ] Liang-Chi Hsieh commented on SPARK-24152: - Can be resolved now as I saw Jenkins test passed. > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > PR builder and master branch test fails with the following SparkR error with > unknown reason. The following is an error message from that. > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fail with no failures) > This is critical because we already start to merge the PR by ignoring this > **known unkonwn** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10943) NullType Column cannot be written to Parquet
[ https://issues.apache.org/jira/browse/SPARK-10943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462244#comment-16462244 ] Daniel Davis commented on SPARK-10943: -- According to parquet data types [https://github.com/apache/parquet-format/blob/master/LogicalTypes.md], now a Null type should be supported. So perhaps this issue should be reconsidered? > NullType Column cannot be written to Parquet > > > Key: SPARK-10943 > URL: https://issues.apache.org/jira/browse/SPARK-10943 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl >Priority: Major > > {code} > var data02 = sqlContext.sql("select 1 as id, \"cat in the hat\" as text, null > as comments") > {code} > //FAIL - Try writing a NullType column (where all the values are NULL) > {code} > data02.write.parquet("/tmp/test/dataset2") > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:156) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 0 in stage 179.0 failed 4 times, most recent failure: Lost task 0.3 in > stage 179.0 (TID 39924, 10.0.196.208): > org.apache.spark.sql.AnalysisException: Unsupported data type > StructField(comments,NullType,true).dataType; > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:524) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:92) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at org.apache.spark.sql.types.StructType.map(StructType.scala:92) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > org.apache.parquet.hadoop.ParquetOu
[jira] [Commented] (SPARK-21429) show on structured Dataset is equivalent to writeStream to console once
[ https://issues.apache.org/jira/browse/SPARK-21429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462225#comment-16462225 ] Jungtaek Lim commented on SPARK-21429: -- I agree that shortcut would help, but a bit afraid that such shortcut might hide the detail, difference between executing batch and streaming. Unless source data has changed, running batch will not have side effect. But running streaming will change source offset. (I guess it is true even for Trigger.once() but please correct me if I'm missing something.) > show on structured Dataset is equivalent to writeStream to console once > --- > > Key: SPARK-21429 > URL: https://issues.apache.org/jira/browse/SPARK-21429 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > While working with Datasets it's often helpful to do {{show}}. It does not > work for streaming Datasets (and leads to {{AnalysisException}} - see below), > but think it could just be the following under the covers and very helpful > (would cut plenty of keystrokes for sure). > {code} > val sq = ... > scala> sq.isStreaming > res0: Boolean = true > import org.apache.spark.sql.streaming.Trigger > scala> sq.writeStream.format("console").trigger(Trigger.Once).start > {code} > Since {{show}} returns {{Unit}} that could just work. > Currently {{show}} reports {{AnalysisException}}. > {code} > scala> sq.show > org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with writeStream.start();; > rate > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34) > at > org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78) > at > 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3027) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2340) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2553) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:241) > at org.apache.spark.sql.Dataset.show(Dataset.scala:671) > at org.apache.spark.sql.Dataset.show(Dataset.scala:630) > at org.apache.spark.sql.Dataset.show(Dataset.scala:639) > ... 50 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
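To make the trade-off discussed above concrete, here is a hedged sketch of what the proposed shortcut for SPARK-21429 could look like; this helper is not an existing Spark API, and, as the comment notes, unlike a batch show() it starts a streaming query and therefore advances the source offsets.
{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.Trigger

// Hypothetical helper, not part of Spark: run a streaming Dataset once to the
// console sink, roughly what show() could do under the covers.
def showStream(ds: Dataset[_]): Unit = {
  val query = ds.writeStream
    .format("console")
    .trigger(Trigger.Once())
    .start()
  // Unlike a batch show(), this commits progress on the source.
  query.awaitTermination()
}
{code}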
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462149#comment-16462149 ] Jungtaek Lim commented on SPARK-10816: -- I'm still curious about out-of-the-box support for session windows. I'm aware of the session window example in the Structured Streaming guide that leverages the advanced API, but out-of-the-box support could enable the feature in SQL statements instead of requiring mapGroupsWithState. If the Spark community is still interested in out-of-the-box support, I'd like to spend some time taking a deep look to see whether we can get it. Thanks! > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
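For reference, a hedged sketch of what out-of-the-box session window support could look like if it were surfaced like the existing tumbling/sliding window function; session_window is not an existing Spark function at the time of this discussion, and the column names are assumptions.
{code}
import spark.implicits._

// Hypothetical API by analogy with window(): group events into per-user
// sessions that close after a 10-minute gap in activity.
events
  .withWatermark("eventTime", "30 minutes")
  .groupBy(session_window($"eventTime", "10 minutes"), $"userId")
  .count()
{code}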
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462145#comment-16462145 ] Anirudh Ramanathan commented on SPARK-24135: cc/ [~mridulm80] [~irashid] for thoughts on whether this behavior would be intuitive to an existing Spark user. > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462139#comment-16462139 ] Anirudh Ramanathan edited comment on SPARK-24135 at 5/3/18 9:01 AM: It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. That's the reason why they never give up after seeing failures. However, if you see the "batch" type built-in controller, the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], it does implement a [backoff policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy] that covers the initialization and runtime errors in containers. As I see it, we should have safe limits for all kinds of failures to eventually give up. I'm ok with having this limit similar to the job controller, as a configurable number and one might want to set it very high in your case to do near infinite retries, but I'm not convinced that that behavior is a safe choice in the general case. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". was (Author: foxish): It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. The only "batch" type built-in controller is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], which does implement a [backoff policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy] that covers the initialization and runtime errors in containers. 
As I see it, we should have safe limits for all kinds of failures to eventually give up; it's more a question of whether this should be treated differently as a framework error. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. Thi
[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462139#comment-16462139 ] Anirudh Ramanathan edited comment on SPARK-24135 at 5/3/18 8:58 AM: It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. The only "batch" type built-in controller is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], which does implement a [backoff policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy] that covers the initialization and runtime errors in containers. As I see it, we should have safe limits for all kinds of failures to eventually give up; it's more a question of whether this should be treated differently as a framework error. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". was (Author: foxish): It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. The only "batch" type built-in controller is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], which does implement a backoff policy that covers the initialization and runtime errors in containers. As I see it, we should have safe limits for all kinds of failures to eventually give up; it's more a question of whether this should be treated differently as a framework error. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. 
That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. >
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462139#comment-16462139 ] Anirudh Ramanathan commented on SPARK-24135: It is increasingly common for people to write custom controllers and custom resources and not use the built-in controllers, especially when the workloads have special characteristics. This is the whole reason why people are working on tooling like the [operator framework|https://coreos.com/blog/introducing-operator-framework]. I don't think the future lies in shoehorning applications to use the existing controllers. The existing controllers are a good starting point but for any custom orchestration, the recommendation from the k8s community at large would be to write an operator which in some sense is what we've done. So, I think moving towards the built-in controllers doesn't give us anything more. Also, replication controllers and deployments are not used for applications with termination semantics. They're suitable for long running services. The only "batch" type built-in controller is the [job controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion], which does implement a backoff policy that covers the initialization and runtime errors in containers. As I see it, we should have safe limits for all kinds of failures to eventually give up; it's more a question of whether this should be treated differently as a framework error. Also, flakiness due to admission webhooks seems like it should be handled by retries in the init container, or by some other automation, since it's outside Spark land. That makes me apprehensive about handling such specific cases within Spark, instead of dealing with it as "framework error" and "app error". > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462106#comment-16462106 ] Matt Cheah edited comment on SPARK-24135 at 5/3/18 8:35 AM: Not necessarily - if the pods fail to start up, we should retry them indefinitely as a replication controller or a deployment would. There's an argument that can be made that we should be using those higher level primitives to run executors instead of raw pods anyways, just that Spark's scheduler code would need non-trivial changes to do so right now. was (Author: mcheah): Not necessarily - if the pods fail to start up, we should retry them indefinitely as a replication controller or a deployment would. There's an argument that can be made that we should be using those lower level primitives to run executors instead of raw pods anyways, just that Spark's scheduler code would need non-trivial changes to do so right now. > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size
[ https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462106#comment-16462106 ] Matt Cheah commented on SPARK-24135: Not necessarily - if the pods fail to start up, we should retry them indefinitely as a replication controller or a deployment would. There's an argument that can be made that we should be using those lower level primitives to run executors instead of raw pods anyways, just that Spark's scheduler code would need non-trivial changes to do so right now. > [K8s] Executors that fail to start up because of init-container errors are > not retried and limit the executor pool size > --- > > Key: SPARK-24135 > URL: https://issues.apache.org/jira/browse/SPARK-24135 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > In KubernetesClusterSchedulerBackend, we detect if executors disconnect after > having been started or if executors hit the {{ERROR}} or {{DELETED}} states. > When executors fail in these ways, they are removed from the pending > executors pool and the driver should retry requesting these executors. > However, the driver does not handle a different class of error: when the pod > enters the {{Init:Error}} state. This state comes up when the executor fails > to launch because one of its init-containers fails. Spark itself doesn't > attach any init-containers to the executors. However, custom web hooks can > run on the cluster and attach init-containers to the executor pods. > Additionally, pod presets can specify init containers to run on these pods. > Therefore Spark should be handling the {{Init:Error}} cases regardless if > Spark itself is aware of init-containers or not. > This class of error is particularly bad because when we hit this state, the > failed executor will never start, but it's still seen as pending by the > executor allocator. The executor allocator won't request more rounds of > executors because its current batch hasn't been resolved to either running or > failed. Therefore we end up with being stuck with the number of executors > that successfully started before the faulty one failed to start, potentially > creating a fake resource bottleneck. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24030) SparkSQL percentile_approx function is too slow for over 1,060,000 records.
[ https://issues.apache.org/jira/browse/SPARK-24030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24030. -- Resolution: Cannot Reproduce > SparkSQL percentile_approx function is too slow for over 1,060,000 records. > --- > > Key: SPARK-24030 > URL: https://issues.apache.org/jira/browse/SPARK-24030 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and a local laptop. >Reporter: Seok-Joon,Yun >Priority: Major > Attachments: screenshot_2018-04-20 23.15.02.png > > > I used the percentile_approx function on over 1,060,000 records. It is too > slow: it takes about 90 minutes. For 1,040,000 records it takes about > 10 seconds. > I tested with data read over JDBC and from Parquet; both take the same amount of time. > I suspect the function is not designed to run on multiple workers. > I checked Ganglia and the Spark history server; it ran on only one worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
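For context, a minimal sketch of the kind of query the SPARK-24030 report describes; the data location and column name are assumptions, not taken from the issue.
{code}
// Assumed data location and column name; illustrative only.
val df = spark.read.parquet("/tmp/records")
df.createOrReplaceTempView("records")

spark.sql(
  "SELECT percentile_approx(value, array(0.25, 0.5, 0.75)) AS quartiles FROM records"
).show()
{code}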
[jira] [Assigned] (SPARK-23489) Flaky Test: HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-23489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23489: --- Assignee: Dongjoon Hyun > Flaky Test: HiveExternalCatalogVersionsSuite > > > Key: SPARK-23489 > URL: https://issues.apache.org/jira/browse/SPARK-23489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Marco Gaido >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I saw this error in an unrelated PR. It seems to me a bad configuration in > the Jenkins node where the tests are run. > {code} > Error Message > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Cannot run program > "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, > No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file > or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > {code} > This is the link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87615/testReport/. 
> *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389 > *BRANCH 2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/321/ > *NOTE: This failure frequently looks as `Test Result (no failures)`* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23489) Flaky Test: HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-23489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23489. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21210 [https://github.com/apache/spark/pull/21210] > Flaky Test: HiveExternalCatalogVersionsSuite > > > Key: SPARK-23489 > URL: https://issues.apache.org/jira/browse/SPARK-23489 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.1, 2.3.0, 2.4.0 >Reporter: Marco Gaido >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > I saw this error in an unrelated PR. It seems to me a bad configuration in > the Jenkins node where the tests are run. > {code} > Error Message > java.io.IOException: Cannot run program "./bin/spark-submit" (in directory > "/tmp/test-spark/spark-2.2.1"): error=2, No such file or directory > Stacktrace > sbt.ForkMain$ForkError: java.io.IOException: Cannot run program > "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.1"): error=2, > No such file or directory > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:73) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:43) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:176) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:161) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:161) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.io.IOException: error=2, No such file > or directory > at java.lang.UNIXProcess.forkAndExec(Native Method) > at java.lang.UNIXProcess.(UNIXProcess.java:248) > at java.lang.ProcessImpl.start(ProcessImpl.java:134) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) > ... 17 more > {code} > This is the link: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87615/testReport/. 
> *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389 > *BRANCH 2.3* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/321/ > *NOTE: This failure frequently appears as `Test Result (no failures)`* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
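The IOException above is the plain failure mode of launching a relative "./bin/spark-submit" from a working directory that is missing on the Jenkins node, which matches the reporter's guess that the node, not the test logic, is at fault. The following Scala snippet is a minimal, hypothetical sketch of that launch pattern; it is not the suite's actual code, and the object name, directory path, and arguments are assumptions used only to illustrate the failure:

{code}
import java.io.{File, IOException}

// Hypothetical sketch: invoke an older release's spark-submit from its
// unpacked distribution directory, roughly the way the versions suite does.
object SparkSubmitLaunchSketch {
  def runSparkSubmit(sparkHome: File, args: String*): Int = {
    val pb = new ProcessBuilder(("./bin/spark-submit" +: args): _*)
    pb.directory(sparkHome)   // the relative command resolves against this directory
    pb.inheritIO()
    // If sparkHome (e.g. /tmp/test-spark/spark-2.2.1) was never downloaded or was
    // cleaned up, start() throws: java.io.IOException: Cannot run program
    // "./bin/spark-submit" (in directory "..."): error=2, No such file or directory
    pb.start().waitFor()
  }

  def main(args: Array[String]): Unit = {
    try {
      val exit = runSparkSubmit(new File("/tmp/test-spark/spark-2.2.1"), "--version")
      println(s"spark-submit exited with code $exit")
    } catch {
      case e: IOException => println(s"Launch failed as in the flaky runs: ${e.getMessage}")
    }
  }
}
{code}

Under that assumption, the flakiness depends on whether the downloaded distribution under /tmp/test-spark is still present when the suite runs, not on the queries the suite executes.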
[jira] [Commented] (SPARK-24116) SparkSQL inserting overwrite table has inconsistent behavior regarding HDFS trash
[ https://issues.apache.org/jira/browse/SPARK-24116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462040#comment-16462040 ] Rui Li commented on SPARK-24116: To reproduce: {code} create table test_text(x int); insert overwrite table test_text values (1),(2); insert overwrite table test_text values (3),(4); -- the old data is moved to trash create table test_parquet(x int) using parquet; insert overwrite table test_parquet values (1),(2); insert overwrite table test_parquet values (3),(4); -- the old data is not moved to trash {code} > SparkSQL inserting overwrite table has inconsistent behavior regarding HDFS > trash > - > > Key: SPARK-24116 > URL: https://issues.apache.org/jira/browse/SPARK-24116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Rui Li >Priority: Major > > When overwriting a table with INSERT OVERWRITE, the old data may or may not go to the trash, > depending on: > # Data format. E.g. a text table may go to trash but a parquet table doesn't. > # Whether the table is partitioned. E.g. a partitioned text table doesn't go to > trash while a non-partitioned one does. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
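A quick way to observe the inconsistency described above is to replay the reproduction and then list the current user's HDFS trash between runs. The Scala sketch below is illustrative only: the table names mirror the comment, while the Hive-enabled session and the trash location (the .Trash folder under the user's HDFS home directory) are assumptions rather than details from the report.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Illustrative sketch: run the reproduction, then inspect the user's HDFS trash.
object InsertOverwriteTrashCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("insert-overwrite-trash-check")
      .enableHiveSupport()
      .getOrCreate()

    Seq(
      "create table test_text(x int)",
      "insert overwrite table test_text values (1),(2)",
      "insert overwrite table test_text values (3),(4)",     // old files reportedly moved to trash
      "create table test_parquet(x int) using parquet",
      "insert overwrite table test_parquet values (1),(2)",
      "insert overwrite table test_parquet values (3),(4)"   // old files reportedly deleted outright
    ).foreach(spark.sql)

    // Assumed trash location: <HDFS home>/.Trash for the current user.
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    val trash = new Path(fs.getHomeDirectory, ".Trash")
    if (fs.exists(trash)) {
      fs.listStatus(trash).foreach(status => println(status.getPath))
    } else {
      println(s"No trash directory found at $trash")
    }
    spark.stop()
  }
}
{code}

A plausible, but unconfirmed, explanation is that the two tables take different write paths: Hive-serde text tables go through Hive's file-replacement logic, which can move the old files to the trash, while data source tables such as parquet have their old files deleted directly.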
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462030#comment-16462030 ] Liang-Chi Hsieh commented on SPARK-24152: - I think it is fixed now. It works locally, but it is better to check the Jenkins test results too. > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > The PR builder and master branch tests fail with the following SparkR error for an > unknown reason: > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fails while reporting no test failures) > This is critical because we have already started merging PRs while ignoring this > **known unknown** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org