[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451743#comment-16451743 ] yucai commented on SPARK-24076: --- 1. When shuffle.partition = 8192, tuples in the same partition satisfy the relation below: hash(tuple x) = hash(tuple y) + n * 8192. 2. In the next HashAggregate stage, tuples from the same partition need to be put into a 16K BytesToBytesMap (unsafeRowAggBuffer). Because the HashAggregate uses the same hash algorithm and seed as the shuffle, all tuples are actually hashed to only 2 distinct slots, which is why the hash conflict happens. > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 in some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
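[Editor's note] The collision arithmetic in the comment above can be illustrated with a small Python model. This is a sketch, not Spark's actual Murmur3-based hashing; only the partition count 8192 and the 16K map capacity are taken from the report, and `bucket` is a hypothetical stand-in for the map's slot computation:

```python
# Toy model: keys routed to one shuffle partition p satisfy
# hash(key) % 8192 == p. A 16K (2^14) hash map that reuses the same
# hash values can therefore place them in only two buckets:
# p and p + 8192.
NUM_PARTITIONS = 8192      # spark.sql.shuffle.partitions from the report
MAP_CAPACITY = 16384       # 16K BytesToBytesMap capacity from the report

def bucket(h: int) -> int:
    # slot index in a power-of-two-sized table (hypothetical stand-in)
    return h & (MAP_CAPACITY - 1)

p = 42  # an arbitrary partition id
# hash values of tuples that all landed in partition p
hashes = [p + n * NUM_PARTITIONS for n in range(1000)]
buckets = {bucket(h) for h in hashes}
print(sorted(buckets))  # [42, 8234] -- 1000 keys, only 2 buckets
```

With 1000 distinct keys from one partition, every insert lands in one of just two buckets, so probing degenerates into long linear scans, matching the reported slowdown.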
[jira] [Commented] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners
[ https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451732#comment-16451732 ] Xianjin YE commented on SPARK-20087: cc [~jiangxb1987] [~irashid], I am going to send a new pr if you still think this is the desired behaviour. > Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd > listeners > - > > Key: SPARK-20087 > URL: https://issues.apache.org/jira/browse/SPARK-20087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Lewis >Priority: Major > > When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive > accumulators / task metrics for that task, if they were still available. > These metrics are not currently sent when tasks are killed intentionally, > such as when a speculative retry finishes, and the original is killed (or > vice versa). Since we're killing these tasks ourselves, these metrics should > almost always exist, and we should treat them the same way as we treat > ExceptionFailures. > Sending these metrics with the TaskKilled end reason makes aggregation across > all tasks in an app more accurate. This data can inform decisions about how > to tune the speculation parameters in order to minimize duplicated work, and > in general, the total cost of an app should include both successful and > failed tasks, if that information exists. > PR: https://github.com/apache/spark/pull/17422
[jira] [Created] (SPARK-24082) [Spark SQL] Tables are not listing under DB
ABHISHEK KUMAR GUPTA created SPARK-24082: Summary: [Spark SQL] Tables are not listing under DB Key: SPARK-24082 URL: https://issues.apache.org/jira/browse/SPARK-24082 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: OS: Suse11 Spark Version: 2.3 Reporter: ABHISHEK KUMAR GUPTA Steps: # Launch spark-sql --master yarn # use one; (the DB name is "one") # create table csvTable (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) using CSV options (path "/user/datatmo/customer1.csv"); # CREATE TABLE Parquettemp USING org.apache.spark.sql.parquet OPTIONS (path "/user/sparkhive/warehouse/one.db/test1.parquet/"); # show tables; The tables are listed as below: one csvtable false one parquettemp false But when listing the contents of "one.db" in the HDFS file system, the csvTable and Parquettemp tables do not appear. The command below was used: BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls /user/sparkhive/warehouse/one.db
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451728#comment-16451728 ] yucai commented on SPARK-24076: --- Root cause: a severe hash conflict in HashAggregate. !image-2018-04-25-14-29-39-958.png! > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Updated] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24076: -- Attachment: image-2018-04-25-14-29-39-958.png > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Updated] (SPARK-24081) Spark SQL drops the table while writing into table in "overwrite" mode.
[ https://issues.apache.org/jira/browse/SPARK-24081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashish updated SPARK-24081: --- Priority: Blocker (was: Major) > Spark SQL drops the table while writing into table in "overwrite" mode. > > > Key: SPARK-24081 > URL: https://issues.apache.org/jira/browse/SPARK-24081 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Ashish >Priority: Blocker > > I read data from a table, modify it, and when I write it back to the table in > overwrite mode, all the records are deleted. > Expectation: the table should be updated with the modified data.
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451727#comment-16451727 ] yucai commented on SPARK-24076: --- An example query (a closing parenthesis and subquery alias, missing in the original comment, have been restored): {code:sql} insert overwrite table target_xxx SELECT item_id, auct_end_dt FROM (select cast(item_id as double) as item_id, auct_end_dt from source_xxx GROUP BY item_id, auct_end_dt) t {code} > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Commented] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451699#comment-16451699 ] Nadav Samet commented on SPARK-24074: - Also, this problem doesn't occur with Spark 2.2.1, but happens on Spark 2.3.0 - this also makes it appear as though it is related to the Maven resolution mechanism in Spark. > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > For some reason Spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2. Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later Spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Commented] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451692#comment-16451692 ] Nadav Samet commented on SPARK-24074: - I think that breeze is fine: SBT is able to correctly resolve and download the dependencies of breeze. Also, the pom files for both projects seem valid (I only inspected them manually). I am not sure what's causing this, but somehow Spark ends up downloading the wrong artifact, while SBT doesn't. > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > For some reason Spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2. Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later Spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Commented] (SPARK-19256) Hive bucketing support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451689#comment-16451689 ] Xianjin YE commented on SPARK-19256: Hi [~tejasp] [~cloud_fan], are you still working on this? We also need this feature in our internal Spark stack; is anything still pending? cc [~XuanYuan] > Hive bucketing support > -- > > Key: SPARK-19256 > URL: https://issues.apache.org/jira/browse/SPARK-19256 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Priority: Minor > > JIRA to track design discussions and tasks related to Hive bucketing support > in Spark. > Proposal : > https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing
[jira] [Updated] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'
[ https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chris_j updated SPARK-24009: Description: In local mode, Spark executes "INSERT OVERWRITE LOCAL DIRECTORY" successfully. On YARN, Spark fails to execute "INSERT OVERWRITE LOCAL DIRECTORY", and it is not a permission problem either. 1. spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" writes to the local directory successfully. 2. spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab' row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" writes to HDFS successfully. 3. spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" on YARN fails to write to the local directory: Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) ... 8 more Caused by: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801) at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
[jira] [Commented] (SPARK-24078) reduce with unionAll takes a long time
[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451679#comment-16451679 ] Hyukjin Kwon commented on SPARK-24078: -- Would you be able to test this in higher versions? > reduce with unionAll takes a long time > -- > > Key: SPARK-24078 > URL: https://issues.apache.org/jira/browse/SPARK-24078 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.3 >Reporter: zhangsongcheng >Priority: Major > > I try to sample the training sets for each category and then union all the > samples together. This is my code: > def balance4Single(dataSet: DataFrame): DataFrame = { > val samples = LabelConf.cardIDList.map { cardID => > val tmpDataSet = dataSet.filter(col("card_id") === cardID) > val sample = underSample(tmpDataSet, cardID) > sample > } > samples.reduce((x, y) => x.unionAll(y)) > } > def underSample(dataSet: DataFrame, cardID: String): DataFrame = { > val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1) > val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1) > positiveSample.unionAll(negativeSample).distinct() > } > But the code blocks in {{samples.reduce((x, y) => x.unionAll(y))}}; it runs > more and more slowly and eventually cannot make progress at all. This has > confused me for a long time. Who can help me? Thank you!
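[Editor's note] The quoted reduce builds a left-deep chain of unionAll plans: each step nests the entire previous plan, so plan depth (and analysis cost) grows with every union. A toy Python model can show the shape of the problem; this is an editor's sketch, not Spark code, and `union` is only a hypothetical stand-in for `DataFrame.unionAll`:

```python
# Compare plan depth for a left-deep reduce vs. a balanced (tree) reduce
# over 64 inputs. In Spark, a shallower plan is much cheaper to analyze.
from functools import reduce

def union(a, b):
    # stand-in for DataFrame.unionAll: a binary plan node
    return ("union", a, b)

def depth(plan):
    # depth of the nested plan tree; leaves are plain strings
    if isinstance(plan, tuple):
        return 1 + max(depth(plan[1]), depth(plan[2]))
    return 1

leaves = [f"df{i}" for i in range(64)]

# left-deep chain, as in samples.reduce((x, y) => x.unionAll(y))
left_deep = reduce(union, leaves)

def tree_reduce(xs):
    # balanced alternative: pairwise merge halves recursively
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return union(tree_reduce(xs[:mid]), tree_reduce(xs[mid:]))

balanced = tree_reduce(leaves)
print(depth(left_deep), depth(balanced))  # 64 vs 7
```

A balanced, tree-shaped reduce keeps the plan logarithmically shallow, which is one common mitigation for exactly this kind of slowdown.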
[jira] [Resolved] (SPARK-24077) Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24077. -- Resolution: Invalid Fix Version/s: (was: 3.0.0) Target Version/s: (was: 2.3.0) Questions should go to mailing lists. I am resolving this. > Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? > > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > > Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? > > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided
[jira] [Commented] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451677#comment-16451677 ] Hyukjin Kwon commented on SPARK-24074: -- I haven't looked into this yet, but doesn't this sound more likely to be specific to breeze? > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > For some reason Spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2. Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later Spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451666#comment-16451666 ] Spark User commented on SPARK-5594: --- In my case, this issue happened when the Spark context did not close successfully. When the context closes abruptly, the files in the spark-local and spark-worker directories are left uncleaned, and the next time any job runs, the broadcast exception occurs. I worked around it by redirecting the spark-worker and spark-local outputs to specific folders and cleaning them up whenever the Spark context does not close successfully. > SparkException: Failed to get broadcast (TorrentBroadcast) > -- > > Key: SPARK-5594 > URL: https://issues.apache.org/jira/browse/SPARK-5594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: John Sandiford >Priority: Critical > > I am uncertain whether this is a bug; however, I am getting the error below > when running on a cluster (it works locally), and I have no idea what is causing > it or where to look for more information. > Any help is appreciated. Others appear to experience the same issue, but I > have not found any solutions online. > Please note that this only happens with certain code and is repeatable; all > my other Spark jobs work fine.
> {noformat} > ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: > Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: > org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of > broadcast_6 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 > of broadcast_6 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) > at > 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) > ... 11 more > {noformat} > Driver stacktrace: > {noformat} > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.
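[Editor's note] The cleanup workaround described in the comment above can be sketched as a small shell routine. This is a hedged sketch: the directory paths below are illustrative assumptions, while SPARK_LOCAL_DIRS and SPARK_WORKER_DIR are Spark's standard environment variables for relocating these outputs.

```shell
# Point Spark's scratch space and worker output at dedicated, known
# folders (paths are illustrative assumptions, not from the report).
export SPARK_LOCAL_DIRS=/tmp/spark-local-demo
export SPARK_WORKER_DIR=/tmp/spark-worker-demo
mkdir -p "$SPARK_LOCAL_DIRS" "$SPARK_WORKER_DIR"

# Simulate debris left behind by a SparkContext that died abruptly.
touch "$SPARK_LOCAL_DIRS/blockmgr-leftover" "$SPARK_WORKER_DIR/app-leftover"

# Before launching the next job, purge anything left over from the
# failed run so stale broadcast/block files cannot confuse the new
# context. ${var:?} aborts instead of expanding an unset variable.
rm -rf "${SPARK_LOCAL_DIRS:?}"/* "${SPARK_WORKER_DIR:?}"/*
```

Running such a purge only between jobs (never while a context is live) matches the commenter's workaround of cleaning the folders whenever a context fails to shut down cleanly.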
[jira] [Created] (SPARK-24081) Spark SQL drops the table while writing into table in "overwrite" mode.
Ashish created SPARK-24081: -- Summary: Spark SQL drops the table while writing into table in "overwrite" mode. Key: SPARK-24081 URL: https://issues.apache.org/jira/browse/SPARK-24081 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.3.0 Reporter: Ashish I read data from a table, modify it, and when I write it back to the table in overwrite mode, all the records are deleted. Expectation: the table should be updated with the modified data.
[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451653#comment-16451653 ] Jungtaek Lim commented on SPARK-24036: -- Hello, I'm quite interested in this issue, since I just read the recent continuous-mode changes in the codebase and observed the same limitations. Do you have ideas or any design docs for this? Moreover, do you plan to share these tasks with the Spark community? I'm willing to contribute here, but it's completely OK if you plan to drive all the tasks yourself. > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators.
[jira] [Created] (SPARK-24080) Update the nullability of Filter output based on inferred predicates
Takeshi Yamamuro created SPARK-24080: Summary: Update the nullability of Filter output based on inferred predicates Key: SPARK-24080 URL: https://issues.apache.org/jira/browse/SPARK-24080 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takeshi Yamamuro In master, a logical `Filter` node does not respect the nullability change that the optimizer rule `InferFiltersFromConstraints` might introduce when inferred predicates include `IsNotNull`, e.g., {code} scala> val df = Seq((Some(1), Some(2))).toDF("a", "b") scala> val filteredDf = df.where("a = 3") scala> filteredDf.explain scala> filteredDf.queryExecution.optimizedPlan.children(0) res4: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (isnotnull(_1#2) && (_1#2 = 3)) +- LocalRelation [_1#2, _2#3] scala> filteredDf.queryExecution.optimizedPlan.children(0).output.map(_.nullable) res5: Seq[Boolean] = List(true, true) {code} But these `nullable` values should be: {code} scala> filteredDf.queryExecution.optimizedPlan.children(0).output.map(_.nullable) res5: Seq[Boolean] = List(false, true) {code} This ticket comes from the previous discussion: https://github.com/apache/spark/pull/18576#pullrequestreview-107585997
[jira] [Commented] (SPARK-24079) Update the nullability of Join output based on inferred predicates
[ https://issues.apache.org/jira/browse/SPARK-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451650#comment-16451650 ] Apache Spark commented on SPARK-24079: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/21148 > Update the nullability of Join output based on inferred predicates > -- > > Key: SPARK-24079 > URL: https://issues.apache.org/jira/browse/SPARK-24079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In the master, a logical `Join` node does not respect the nullability that > the optimizer rule `InferFiltersFromConstraints` > might change when inferred predicates have `IsNotNull`, e.g., > {code} > scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") > scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") > scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") > scala> joinedDf.explain > == Physical Plan == > *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight > :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] > : +- *(2) Filter isnotnull(_1#80) > : +- LocalTableScan [_1#80, _2#81] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))) >+- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] > +- *(1) Filter isnotnull(_1#89) > +- LocalTableScan [_1#89, _2#90] > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(true, true, true, true) > {code} > But, these `nullable` values should be: > {code} > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(false, true, false, true) > {code} > This ticket comes from the previous discussion: > https://github.com/apache/spark/pull/18576#pullrequestreview-107585997 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24079) Update the nullability of Join output based on inferred predicates
[ https://issues.apache.org/jira/browse/SPARK-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24079: Assignee: (was: Apache Spark) > Update the nullability of Join output based on inferred predicates > -- > > Key: SPARK-24079 > URL: https://issues.apache.org/jira/browse/SPARK-24079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In the master, a logical `Join` node does not respect the nullability that > the optimizer rule `InferFiltersFromConstraints` > might change when inferred predicates have `IsNotNull`, e.g., > {code} > scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") > scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") > scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") > scala> joinedDf.explain > == Physical Plan == > *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight > :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] > : +- *(2) Filter isnotnull(_1#80) > : +- LocalTableScan [_1#80, _2#81] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))) >+- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] > +- *(1) Filter isnotnull(_1#89) > +- LocalTableScan [_1#89, _2#90] > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(true, true, true, true) > {code} > But, these `nullable` values should be: > {code} > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(false, true, false, true) > {code} > This ticket comes from the previous discussion: > https://github.com/apache/spark/pull/18576#pullrequestreview-107585997 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24079) Update the nullability of Join output based on inferred predicates
[ https://issues.apache.org/jira/browse/SPARK-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24079: Assignee: Apache Spark > Update the nullability of Join output based on inferred predicates > -- > > Key: SPARK-24079 > URL: https://issues.apache.org/jira/browse/SPARK-24079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > In the master, a logical `Join` node does not respect the nullability that > the optimizer rule `InferFiltersFromConstraints` > might change when inferred predicates have `IsNotNull`, e.g., > {code} > scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") > scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") > scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") > scala> joinedDf.explain > == Physical Plan == > *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight > :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] > : +- *(2) Filter isnotnull(_1#80) > : +- LocalTableScan [_1#80, _2#81] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))) >+- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] > +- *(1) Filter isnotnull(_1#89) > +- LocalTableScan [_1#89, _2#90] > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(true, true, true, true) > {code} > But, these `nullable` values should be: > {code} > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(false, true, false, true) > {code} > This ticket comes from the previous discussion: > https://github.com/apache/spark/pull/18576#pullrequestreview-107585997 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24079) Update the nullability of Join output based on inferred predicates
Takeshi Yamamuro created SPARK-24079: Summary: Update the nullability of Join output based on inferred predicates Key: SPARK-24079 URL: https://issues.apache.org/jira/browse/SPARK-24079 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takeshi Yamamuro In the master, a logical `Join` node does not respect the nullability that the optimizer rule `InferFiltersFromConstraints` might change when inferred predicates have `IsNotNull`, e.g., {code} scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") scala> joinedDf.explain == Physical Plan == *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] : +- *(2) Filter isnotnull(_1#80) : +- LocalTableScan [_1#80, _2#81] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) +- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] +- *(1) Filter isnotnull(_1#89) +- LocalTableScan [_1#89, _2#90] scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) res15: Seq[Boolean] = List(true, true, true, true) {code} But, these `nullable` values should be: {code} scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) res15: Seq[Boolean] = List(false, true, false, true) {code} This ticket comes from the previous discussion: https://github.com/apache/spark/pull/18576#pullrequestreview-107585997 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
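For the Join case the same idea applies per side: an inner equi-join lets the optimizer push `IsNotNull` onto both join keys, so the two key attributes in the join's output can become non-nullable while the value columns stay nullable. A Spark-independent sketch (the `Attr` and `joinOutputNullability` names are illustrative, not Catalyst APIs):

```scala
// Sketch: the join's output is left output ++ right output; attributes whose
// names appear in the inferred not-null set are marked non-nullable.
case class Attr(name: String, nullable: Boolean)

def joinOutputNullability(left: Seq[Attr], right: Seq[Attr], notNull: Set[String]): Seq[Attr] =
  (left ++ right).map(a => if (notNull(a.name)) a.copy(nullable = false) else a)

val left  = Seq(Attr("k", nullable = true), Attr("v0", nullable = true))
val right = Seq(Attr("k2", nullable = true), Attr("v1", nullable = true))
// Inner equi-join on k = k2: IsNotNull is inferred for both keys.
joinOutputNullability(left, right, Set("k", "k2")).map(_.nullable)
// List(false, true, false, true), matching the expected result above
```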
[jira] [Assigned] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24070: --- Assignee: Takeshi Yamamuro > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24078) reduce with unionAll takes a long time
[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangsongcheng updated SPARK-24078: --- Description: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balance4Single(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.cardIDList.map { cardID =>
    val tmpDataSet = dataSet.filter(col("card_id") === cardID)
    val sample = underSample(tmpDataSet, cardID)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample).distinct()
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

was: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categories.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

> reduce with unionAll takes a long time
> --
>
> Key: SPARK-24078
> URL: https://issues.apache.org/jira/browse/SPARK-24078
> Project: Spark
> Issue Type: Bug
> Components: Build
>Affects Versions: 1.6.3
>Reporter: zhangsongcheng
>Priority: Major
>
> I try to sample the training set for each category, and then union all the
> samples together. This is my code:
> {code}
> def balance4Single(dataSet: DataFrame): DataFrame = {
>   val samples = LabelConf.cardIDList.map { cardID =>
>     val tmpDataSet = dataSet.filter(col("card_id") === cardID)
>     val sample = underSample(tmpDataSet, cardID)
>     sample
>   }
>   samples.reduce((x, y) => x.unionAll(y))
> }
>
> def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
>   val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
>   val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
>   positiveSample.unionAll(negativeSample).distinct()
> }
> {code}
> But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs
> more and more slowly, and eventually cannot make progress at all. This
> confused me for a long time. Who can help me? Thank you!
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
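The slowdown reported above is the classic symptom of left-folding {{unionAll}}: each union adds a node to the logical plan, so the fold builds a plan whose depth grows linearly with the number of samples, and Spark re-analyzes that ever-deeper plan at every step. A minimal, Spark-independent sketch of the shape of the problem and of a tree-shaped reduction (all names below are illustrative, not Spark APIs, and plan depth is modeled as a plain Int):

```scala
// Depth of the plan a left fold of binary unions produces: linear in n.
def foldDepth(n: Int): Int = n - 1

// Depth when unions are combined pairwise instead: logarithmic in n.
def treeDepth(n: Int): Int =
  if (n <= 1) 0 else 1 + treeDepth((n + 1) / 2)

// A tree-shaped reduce that could stand in for samples.reduce(_ unionAll _).
// Assumes xs is non-empty.
def treeReduce[T](xs: Seq[T])(op: (T, T) => T): T =
  if (xs.size == 1) xs.head
  else treeReduce(xs.grouped(2).map(g => g.reduce(op)).toSeq)(op)
```

In Spark itself the usual workarounds in this situation are combining the underlying RDDs with a single {{SparkContext.union}} and rebuilding the DataFrame, or periodically checkpointing, both of which keep the plan/lineage shallow.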
[jira] [Commented] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics
[ https://issues.apache.org/jira/browse/SPARK-23799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451609#comment-16451609 ] Apache Spark commented on SPARK-23799: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/21147 > [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of > empty table with analyzed statistics > > > Key: SPARK-23799 > URL: https://issues.apache.org/jira/browse/SPARK-23799 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.2.1, 2.3.0 >Reporter: Michael Shtelma >Assignee: Michael Shtelma >Priority: Major > Fix For: 2.4.0 > > > Spark 2.2.1 and 2.3.0 can produce a NumberFormatException (see below) during > the analysis of queries that use previously analyzed Hive tables. > The NumberFormatException occurs because in [FilterEstimation.scala on lines > 50 and > 52|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala?utf8=%E2%9C%93#L50-L52] > the method calculateFilterSelectivity returns NaN, which is caused by > division by zero. This leads to a NumberFormatException during the conversion > from Double to BigDecimal. > The NaN is caused by division by zero in the evaluateInSet method. 
> Exception: > java.lang.NumberFormatException > at java.math.BigDecimal.(BigDecimal.java:494) > at java.math.BigDecimal.(BigDecimal.java:824) > at scala.math.BigDecimal$.decimal(BigDecimal.scala:52) > at scala.math.BigDecimal$.decimal(BigDecimal.scala:55) > at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43) > at 
scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) > at > org.apache.spark.sql.c
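The underlying fix amounts to guarding the zero-distinct-count case so the selectivity never becomes NaN in the first place. A Spark-independent sketch of that guard (`inSetSelectivity` is an illustrative name, not Spark's actual method signature):

```scala
// Sketch: when the column's distinct count is zero (an analyzed but empty
// table), an evaluateInSet-style selectivity must short-circuit instead of
// dividing by zero and yielding NaN, which later breaks the Double ->
// BigDecimal conversion seen in the stack trace above.
def inSetSelectivity(matchedNdv: Long, totalNdv: Long): Double =
  if (totalNdv == 0L) 0.0                        // empty table: nothing matches
  else matchedNdv.toDouble / totalNdv.toDouble   // safe: totalNdv > 0
```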
[jira] [Updated] (SPARK-24078) reduce with unionAll takes a long time
[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangsongcheng updated SPARK-24078: --- Description: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categories.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

was: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categorys.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

> reduce with unionAll takes a long time
> --
>
> Key: SPARK-24078
> URL: https://issues.apache.org/jira/browse/SPARK-24078
> Project: Spark
> Issue Type: Bug
> Components: Build
>Affects Versions: 1.6.3
>Reporter: zhangsongcheng
>Priority: Major
>
> I try to sample the training set for each category, and then union all the
> samples together. This is my code:
> {code}
> def balanceCategory(dataSet: DataFrame): DataFrame = {
>   val samples = LabelConf.categories.map { category =>
>     val tmpDataSet = dataSet.filter(col("category_id") === category)
>     val sample = underSample(tmpDataSet, category)
>     sample
>   }
>   samples.reduce((x, y) => x.unionAll(y))
> }
>
> def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
>   val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
>   val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
>   positiveSample.unionAll(negativeSample)
> }
> {code}
> But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs
> more and more slowly, and eventually cannot make progress at all. This
> confused me for a long time. Who can help me? Thank you!
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24078) reduce with unionAll takes a long time
[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangsongcheng updated SPARK-24078: --- Description: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categorys.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

was: [the same report with the earlier, differently formatted revision of the {{balanceCategory}}/{{underSample}} code]

> reduce with unionAll takes a long time
> --
>
> Key: SPARK-24078
> URL: https://issues.apache.org/jira/browse/SPARK-24078
> Project: Spark
> Issue Type: Bug
> Components: Build
>Affects Versions: 1.6.3
>Reporter: zhangsongcheng
>Priority: Major
>
> I try to sample the training set for each category, and then union all the
> samples together. This is my code:
> {code}
> def balanceCategory(dataSet: DataFrame): DataFrame = {
>   val samples = LabelConf.categorys.map { category =>
>     val tmpDataSet = dataSet.filter(col("category_id") === category)
>     val sample = underSample(tmpDataSet, category)
>     sample
>   }
>   samples.reduce((x, y) => x.unionAll(y))
> }
>
> def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
>   val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
>   val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
>   positiveSample.unionAll(negativeSample)
> }
> {code}
> But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs
> more and more slowly, and eventually cannot make progress at all. This
> confused me for a long time. Who can help me? Thank you!
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24078) reduce with unionAll takes a long time
zhangsongcheng created SPARK-24078: -- Summary: reduce with unionAll takes a long time Key: SPARK-24078 URL: https://issues.apache.org/jira/browse/SPARK-24078 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.6.3 Reporter: zhangsongcheng I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categorys.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451540#comment-16451540 ] yucai commented on SPARK-24076: --- shuffle.partition = 8192 !p1.png! shuffle.partition = 8000 !p2.png! > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 on some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24076: -- Attachment: p2.png p1.png > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 on some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24077) Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
Benedict Jin created SPARK-24077: Summary: Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? Key: SPARK-24077 URL: https://issues.apache.org/jira/browse/SPARK-24077 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.3.0 Reporter: Benedict Jin Fix For: 3.0.0 Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? scala> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan'") org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) == SQL == CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan' -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) ... 48 elided -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24076) very bad performance when shuffle.partition = 8192
yucai created SPARK-24076: - Summary: very bad performance when shuffle.partition = 8192 Key: SPARK-24076 URL: https://issues.apache.org/jira/browse/SPARK-24076 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: yucai We see very bad performance when shuffle.partition = 8192 on some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23821) High-order function: flatten(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23821. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20938 [https://github.com/apache/spark/pull/20938] > High-order function: flatten(x) → array > --- > > Key: SPARK-23821 > URL: https://issues.apache.org/jira/browse/SPARK-23821 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Assignee: Marek Novotny >Priority: Major > Fix For: 2.4.0 > > > Add the flatten function that transforms an Array-of-Arrays column into a > column of the array elements. If the array structure contains more than two > levels of nesting, the function removes only one nesting level. > Example: > {{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => > [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
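The described semantics, removing exactly one nesting level, mirror Scala's own `Seq.flatten`, which makes for a quick sanity check of the example (`flattenOne` below is an illustrative plain-collections stand-in, not the SQL function itself):

```scala
// Sketch of the flatten semantics on plain Scala collections:
// only one level of nesting is removed.
def flattenOne[T](xs: Seq[Seq[T]]): Seq[T] = xs.flatten

val twoLevels = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))
flattenOne(twoLevels)
// Seq(1, 2, 3, 4, 5, 6, 7, 8, 9)

val threeLevels: Seq[Seq[Seq[Int]]] = Seq(Seq(Seq(1, 2)), Seq(Seq(3)))
flattenOne(threeLevels)
// Seq(Seq(1, 2), Seq(3)) -- the innermost level is kept
```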
[jira] [Assigned] (SPARK-23821) High-order function: flatten(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23821: - Assignee: Marek Novotny > High-order function: flatten(x) → array > --- > > Key: SPARK-23821 > URL: https://issues.apache.org/jira/browse/SPARK-23821 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Assignee: Marek Novotny >Priority: Major > > Add the flatten function that transforms an Array-of-Arrays column into a > column of the array elements. If the array structure contains more than two > levels of nesting, the function removes only one nesting level. > Example: > {{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => > [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24075) [Mesos] Supervised driver upon failure will be retried indefinitely unless explicitly killed
Yogesh Natarajan created SPARK-24075: Summary: [Mesos] Supervised driver upon failure will be retried indefinitely unless explicitly killed Key: SPARK-24075 URL: https://issues.apache.org/jira/browse/SPARK-24075 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.3.0 Reporter: Yogesh Natarajan If supervise is enabled, MesosClusterScheduler will retry a failing driver indefinitely. This takes up cluster resources which is freed up only when the driver is explicitly killed. The proposed solution is to introduce spark configuration "spark.driver.supervise.maxRetries" which allows the maximum number of retries to be specified while preserving the default behavior of retrying the driver indefinitely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
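The proposed cap could behave like the following sketch. This is hypothetical: the configuration name "spark.driver.supervise.maxRetries" and a default of -1 (meaning unlimited, preserving today's indefinite retries) come from the proposal, not from current Spark behaviour:

```scala
// Sketch of the relaunch decision MesosClusterScheduler would make for a
// failed supervised driver under the proposed setting.
//   maxRetries <  0 : retry indefinitely (assumed default, current behaviour)
//   maxRetries >= 0 : relaunch only while retriesSoFar is below the cap
def shouldRelaunch(retriesSoFar: Int, maxRetries: Int): Boolean =
  maxRetries < 0 || retriesSoFar < maxRetries
```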
[jira] [Commented] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451499#comment-16451499 ] Nadav Samet commented on SPARK-24074: - I was only able to reproduce this problem with this particular dependency, whether I request breeze, my own package that depends on breeze, or directly this package - it always fails to download it > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > {code:java} > // code placeholder > {code} > From some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2.Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. 
The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Updated] (SPARK-24064) [Spark SQL] Create table using csv does not support binary column Type
[ https://issues.apache.org/jira/browse/SPARK-24064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24064: - Target Version/s: (was: 2.3.1) Please avoid setting a target version; that is usually done by a committer. Yeah, I know about this limitation. Do you have an idea of how to support this type? > [Spark SQL] Create table using csv does not support binary column Type > --- > > Key: SPARK-24064 > URL: https://issues.apache.org/jira/browse/SPARK-24064 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS Type: Suse 11 > Spark Version: 2.3.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > Labels: test > > # Launch spark-sql --master yarn > # create table csvTable (time timestamp, name string, isright boolean, > datetoday date, num binary, height double, score float, decimaler > decimal(10,0), id tinyint, age int, license bigint, length smallint) using > CSV options (path "/user/datatmo/customer1.csv"); > Result: Table creation is successful > 3. Select * from csvTable; > Throws the exception below: > ERROR SparkSQLDriver:91 - Failed in [select * from csvtable] > java.lang.UnsupportedOperationException: *CSV data source does not support > binary data type*. > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.org$apache$spark$sql$execution$datasources$csv$CSVUtils$$verifyType$1(CSVUtils.scala:127) > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131) > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) > > But a normal table supports the binary data type. 
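Since the CSV source rejects binary columns, the practical workaround is to encode the bytes as text before writing. The sketch below shows the idea with plain Python and the stdlib csv module; it is not code from the ticket, and in Spark SQL the same idea would presumably be applied with its base64/unbase64 functions around the CSV write and read:

```python
import base64
import csv
import io

# CSV cells can only hold text, so raw bytes must be encoded first.
raw = bytes([0x00, 0xFF, 0x10, 0x42])
encoded = base64.b64encode(raw).decode("ascii")

# Write and re-read a row through an in-memory CSV "file".
buf = io.StringIO()
csv.writer(buf).writerow(["row1", encoded])
buf.seek(0)
name, payload = next(csv.reader(buf))

# The original bytes survive the text round trip.
assert base64.b64decode(payload) == raw
print(name)  # row1
```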
[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451483#comment-16451483 ] Hyukjin Kwon commented on SPARK-24068: -- Hm, [~maxgekk], btw is this specific to CSV (not, for example JSON)? > CSV schema inferring doesn't work for compressed files > -- > > Key: SPARK-24068 > URL: https://issues.apache.org/jira/browse/SPARK-24068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Here is a simple csv file compressed by lzo > {code} > $ cat ./test.csv > col1,col2 > a,1 > $ lzop ./test.csv > $ ls > test.csv test.csv.lzo > {code} > Reading test.csv.lzo with LZO codec (see > https://github.com/twitter/hadoop-lzo, for example): > {code:scala} > scala> val ds = spark.read.option("header", true).option("inferSchema", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo") > ds: org.apache.spark.sql.DataFrame = [�LZO?: string] > scala> ds.printSchema > root > |-- �LZO: string (nullable = true) > scala> ds.show > +-+ > |�LZO| > +-+ > |a| > +-+ > {code} > but the file can be read if the schema is specified: > {code} > scala> import org.apache.spark.sql.types._ > scala> val schema = new StructType().add("col1", StringType).add("col2", > IntegerType) > scala> val ds = spark.read.schema(schema).option("header", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo") > scala> ds.show > +++ > |col1|col2| > +++ > | a| 1| > +++ > {code} > Just in case, schema inferring works for the original uncompressed file: > {code:scala} > scala> spark.read.option("header", true).option("inferSchema", > true).csv("test.csv").printSchema > root > |-- col1: string (nullable = true) > |-- col2: integer (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: 
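The garbled column name in the report is consistent with schema inference reading the compressed bytes directly instead of running them through the codec. The stand-in below uses gzip from the Python stdlib (an LZO codec is not available there), so it only illustrates the symptom, not Spark's actual code path:

```python
import gzip

csv_text = "col1,col2\na,1\n"
compressed = gzip.compress(csv_text.encode("utf-8"))

# A reader that ignores the codec sees the container format, not the CSV:
# the first bytes are the gzip magic number, which inference would mistake
# for a garbled header line.
assert compressed[:2] == b"\x1f\x8b"

# Decompressing first recovers the real header, which is effectively what
# specifying the right codec (or an explicit schema) achieves.
header = gzip.decompress(compressed).decode("utf-8").splitlines()[0]
print(header)  # col1,col2
```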
[jira] [Updated] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24074: - Priority: Major (was: Critical) Please avoid setting Critical+, which is usually reserved for committers. Does this consistently happen with other packages too? > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > {code:java} > // code placeholder > {code} > From some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2.Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. 
The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451466#comment-16451466 ] Takeshi Yamamuro commented on SPARK-24070: -- ok > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-24038. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21116 [https://github.com/apache/spark/pull/21116] > refactor continuous write exec to its own class > --- > > Key: SPARK-24038 > URL: https://issues.apache.org/jira/browse/SPARK-24038 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-24038: - Assignee: Jose Torres > refactor continuous write exec to its own class > --- > > Key: SPARK-24038 > URL: https://issues.apache.org/jira/browse/SPARK-24038 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451427#comment-16451427 ] Xiao Li commented on SPARK-24070: - Yeah, please do it here. Thanks! If you have the bandwidth to write the micro benchmark suite, that needs a separate PR. > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451404#comment-16451404 ] Takeshi Yamamuro commented on SPARK-24070: -- ok, this ticket means we will put the performance results here instead of pr? > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nadav Samet updated SPARK-24074: Environment: (was: {code:java} // code placeholder {code}) > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Critical > > {code:java} > // code placeholder > {code} > From some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2.Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seem fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nadav Samet updated SPARK-24074: Description: {code:java} // code placeholder {code} >From some reason spark downloads a javadoc artifact of a package instead of >the jar. Steps to reproduce: # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and fetch artifacts from central: {code:java} rm -rf ~/.ivy2 {code} 1. Run: {code:java} ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages org.scalanlp:breeze_2.11:0.13.2{code} 2.Spark would download the javadoc instead of the jar: {code:java} downloading https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar ... [SUCCESSFUL ] net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar (610ms){code} 3. Later spark would complain that it couldn't find the jar: {code:java} Warning: Local jar /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar does not exist, skipping. Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).{code} 4. The dependency of breeze on f2j_arpack_combined seem fine: [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] was: {code:java} // code placeholder {code} >From some reason spark downloads a javadoc artifact of a package instead of >the jar. Steps to reproduce: # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and fetch artifacts from central: {code:java} rm -rf ~/.ivy2 {code} # Run: {code:java} ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages org.scalanlp:breeze_2.11:0.13.2{code} # Spark would download the javadoc instead of the jar: {code:java} downloading https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar ... 
[SUCCESSFUL ] net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar (610ms){code} # Later spark would complain that it couldn't find the jar: {code:java} Warning: Local jar /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar does not exist, skipping. Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).{code} # The dependency of breeze on f2j_arpack_combined seem fine: [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 > Environment: {code:java} > // code placeholder > {code} >Reporter: Nadav Samet >Priority: Critical > > {code:java} > // code placeholder > {code} > From some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2.Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. 
The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Created] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
Nadav Samet created SPARK-24074: --- Summary: Maven package resolver downloads javadoc instead of jar Key: SPARK-24074 URL: https://issues.apache.org/jira/browse/SPARK-24074 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.3.0 Environment: {code:java} // code placeholder {code} Reporter: Nadav Samet {code:java} // code placeholder {code} >From some reason spark downloads a javadoc artifact of a package instead of >the jar. Steps to reproduce: # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and fetch artifacts from central: {code:java} rm -rf ~/.ivy2 {code} # Run: {code:java} ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages org.scalanlp:breeze_2.11:0.13.2{code} # Spark would download the javadoc instead of the jar: {code:java} downloading https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar ... [SUCCESSFUL ] net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar (610ms){code} # Later spark would complain that it couldn't find the jar: {code:java} Warning: Local jar /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar does not exist, skipping. Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).{code} # The dependency of breeze on f2j_arpack_combined seem fine: [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23852) Parquet MR bug can lead to incorrect SQL results
[ https://issues.apache.org/jira/browse/SPARK-23852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451263#comment-16451263 ] Henry Robinson commented on SPARK-23852: Yes it has - the Parquet community are going to do a 1.8.3 release, mostly just for us for this issue. Parquet 1.10 has already been released, and includes this fix. Upgrading Spark trunk to that version is the subject of SPARK-23972. > Parquet MR bug can lead to incorrect SQL results > > > Key: SPARK-23852 > URL: https://issues.apache.org/jira/browse/SPARK-23852 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Blocker > Labels: correctness > > Parquet MR 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, that means that > pushing certain predicates to Parquet scanners can return fewer results than > they should. > The bug triggers in Spark when: > * The Parquet file being scanner has stats for the null count, but not the > max or min on the column with the predicate (Apache Impala writes files like > this). > * The vectorized Parquet reader path is not taken, and the parquet-mr reader > is used. > * A suitable <, <=, > or >= predicate is pushed down to Parquet. > The bug is that the parquet-mr interprets the max and min of a row-group's > column as 0 in the absence of stats. So {{col > 0}} will filter all results, > even if some are > 0. > There is no upstream release of Parquet that contains the fix for > PARQUET-1217, although a 1.10 release is planned. > The least impactful workaround is to set the Parquet configuration > {{parquet.filter.stats.enabled}} to {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
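The failure mode described above can be condensed into a small pruning sketch. This is illustrative Python, not parquet-mr's actual code; it contrasts safe row-group pruning with the PARQUET-1217 behaviour of treating a missing max statistic as 0 for a pushed-down `col > threshold` predicate:

```python
def can_drop_row_group(stats, threshold):
    """Safe pruning for a `col > threshold` predicate: a row group may be
    skipped only when its max is known and provably <= threshold."""
    if stats.get("max") is None:  # no min/max stats: must read the group
        return False
    return stats["max"] <= threshold

def buggy_can_drop(stats, threshold):
    # The bug: a missing max is interpreted as 0, so `col > 0` wrongly
    # prunes row groups that do contain values greater than 0.
    return (stats.get("max") or 0) <= threshold

# Stats as Impala may write them: null count present, min/max absent.
stats = {"null_count": 3, "min": None, "max": None}
print(can_drop_row_group(stats, 0))  # False: the group must be scanned
print(buggy_can_drop(stats, 0))      # True: matching rows are silently lost
```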
[jira] [Resolved] (SPARK-24056) Make consumer creation lazy in Kafka source for Structured streaming
[ https://issues.apache.org/jira/browse/SPARK-24056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-24056. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21134 [https://github.com/apache/spark/pull/21134] > Make consumer creation lazy in Kafka source for Structured streaming > > > Key: SPARK-24056 > URL: https://issues.apache.org/jira/browse/SPARK-24056 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > Fix For: 3.0.0 > > > Currently, the driver side of the Kafka source (i.e. KafkaMicroBatchReader) > eagerly creates a consumer as soon as the Kafk aMicroBatchReader is created. > However, we create dummy KafkaMicroBatchReader to get the schema and > immediately stop it. Its better to make the consumer creation lazy, it will > be created on the first attempt to fetch offsets using the KafkaOffsetReader. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
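The change boils down to standard lazy initialization: delay creating the consumer until offsets are actually fetched, so a throwaway reader built only to get the schema never opens a connection. The Python sketch below illustrates the pattern with made-up names; it is not the actual KafkaMicroBatchReader/KafkaOffsetReader code:

```python
class LazyOffsetReader:
    """Creates its consumer on first use, not at construction time."""

    def __init__(self, consumer_factory):
        self._factory = consumer_factory
        self._consumer = None

    @property
    def consumer(self):
        if self._consumer is None:  # first offset fetch triggers creation
            self._consumer = self._factory()
        return self._consumer

    def fetch_latest_offsets(self):
        return self.consumer.position()

created = []

class FakeConsumer:
    def __init__(self):
        created.append(self)

    def position(self):
        return 42

reader = LazyOffsetReader(FakeConsumer)
print(len(created))                   # 0: constructing opens no connection
print(reader.fetch_latest_offsets())  # 42: consumer created on demand
print(len(created))                   # 1: and only once, even if reused
```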
[jira] [Commented] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451241#comment-16451241 ] Marco Gaido commented on SPARK-24051: - [~hvanhovell] I am not sure that the analysis barriers are the root cause. The issue is that, because the very same list of cols is used in the two selects, the two different Aliases end up with the same exprId. I am not sure whether this case was supposed to be handled in ResolveReferences (or elsewhere) and broke with the introduction of AnalysisBarrier, or whether it was simply never supposed to be possible. > Incorrect results for certain queries using Java and Python APIs on Spark > 2.3.0 > --- > > Key: SPARK-24051 > URL: https://issues.apache.org/jira/browse/SPARK-24051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Emlyn Corrin >Priority: Major > > I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) > query, demonstrated by the Java program below. It was simplified from a much > more complex query, but I'm having trouble simplifying it further without > removing the erroneous behaviour. 
> {code:java} > package sparktest; > import org.apache.spark.SparkConf; > import org.apache.spark.sql.*; > import org.apache.spark.sql.expressions.Window; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import java.util.Arrays; > public class Main { > public static void main(String[] args) { > SparkConf conf = new SparkConf() > .setAppName("SparkTest") > .setMaster("local[*]"); > SparkSession session = > SparkSession.builder().config(conf).getOrCreate(); > Row[] arr1 = new Row[]{ > RowFactory.create(1, 42), > RowFactory.create(2, 99)}; > StructType sch1 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty()), > new StructField("b", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1); > ds1.show(); > Row[] arr2 = new Row[]{ > RowFactory.create(3)}; > StructType sch2 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2) > .withColumn("b", functions.lit(0)); > ds2.show(); > Column[] cols = new Column[]{ > new Column("a"), > new Column("b").as("b"), > functions.count(functions.lit(1)) > .over(Window.partitionBy()) > .as("n")}; > Dataset ds = ds1 > .select(cols) > .union(ds2.select(cols)) > .where(new Column("n").geq(1)) > .drop("n"); > ds.show(); > //ds.explain(true); > } > } > {code} > It just calculates the union of 2 datasets, > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > +---+---+ > {code} > with > {code:java} > +---+---+ > | a| b| > +---+---+ > | 3| 0| > +---+---+ > {code} > The expected result is: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > | 3| 0| > +---+---+ > {code} > but instead it prints: > {code:java} > +---+---+ > | a| b| > 
+---+---+ > | 1| 0| > | 2| 0| > | 3| 0| > +---+---+ > {code} > Notice how the value in column b is always zero, overriding the original > values in rows 1 and 2. > Making seemingly trivial changes, such as replacing {{new > Column("b").as("b"),}} with just {{new Column("b"),}} or removing the > {{where}} clause after the union, makes it behave correctly again.
[jira] [Assigned] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20114: - Assignee: Weichen Xu > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Weichen Xu >Priority: Major > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. The PrefixSpanModel is only used to provide > access for frequent sequential patterns. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel. > The rules extracted are of the form X–> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. > # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth, ERMiner. 
The rules are X -> Y where X is unordered and Y is > unordered, but X must appear before Y, which is more general and can work > better in practice for prediction. > I'd like to hear more from the users to see which kind of sequential rules > are more practical.
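The rule shape described above (X -> Y with X and Y each unordered, but all of X occurring before all of Y) can be made concrete with a small matcher. This is a hand-written sketch of those semantics for a single sequence, not an implementation of RuleGrowth or ERMiner themselves:

```python
def rule_matches(sequence, x_items, y_items):
    """True if every item of x_items can occur before every item of
    y_items in sequence; order inside x_items and y_items is ignored."""
    positions = {}
    for i, item in enumerate(sequence):
        positions.setdefault(item, []).append(i)
    if not all(it in positions for it in x_items | y_items):
        return False
    # Earliest point by which all of X has been seen at least once.
    cutoff = max(positions[it][0] for it in x_items)
    # Every Y item needs an occurrence strictly after that point.
    return all(any(p > cutoff for p in positions[it]) for it in y_items)

seq = ["a", "b", "c", "a", "d"]
print(rule_matches(seq, {"a", "b"}, {"c", "d"}))  # True: a,b all precede c,d
print(rule_matches(seq, {"c"}, {"b"}))            # False: b never follows c
```

A predictor built on such rules would fire X -> Y whenever X has been observed, regardless of the order of X's items, which is the noise tolerance the description argues for.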
[jira] [Commented] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450449#comment-16450449 ] Apache Spark commented on SPARK-23654: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/21146 > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superflous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour we're willing to maintain. > JetS3t was wonderful when it came out, but now the amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23654: Assignee: Apache Spark > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superflous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour we're willing to maintain. > JetS3t was wonderful when it came out, but now the amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20114: -- Shepherd: Joseph K. Bradley > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Major > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. The PrefixSpanModel is only used to provide > access for frequent sequential patterns. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel. > The rules extracted are of the form X–> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. > # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth, ERMiner. 
The rules are X -> Y where X is unordered and Y is > unordered, but X must appear before Y; this is more general and can work > better in practice for prediction. > I'd like to hear more from users about which kind of sequential rules > is more practical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
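The rule shape described above (an unordered antecedent X that must occur before an unordered consequent Y) is easy to illustrate without any mining library. The sketch below is a hypothetical helper written for this discussion, not Spark or SPMF code: it checks whether a single sequence supports a rule X -> Y by looking for a split point with all of X in the prefix and all of Y in the suffix.

```java
import java.util.*;

// Illustrative sketch (hypothetical helper, not a Spark API): a sequence
// supports the rule X -> Y when every item of X occurs in some prefix and
// every item of Y occurs in the remaining suffix; the items inside X and
// inside Y are themselves unordered.
public class SeqRuleDemo {
    public static boolean supports(List<String> seq, Set<String> x, Set<String> y) {
        for (int i = 1; i < seq.size(); i++) {
            Set<String> prefix = new HashSet<>(seq.subList(0, i));
            Set<String> suffix = new HashSet<>(seq.subList(i, seq.size()));
            if (prefix.containsAll(x) && suffix.containsAll(y)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> s = Arrays.asList("a", "c", "b", "d");
        // {a, b} -> {d}: a and b both occur (in any order) before d
        System.out.println(supports(s, Set.of("a", "b"), Set.of("d"))); // true
        // {d} -> {a}: d never precedes a in this sequence
        System.out.println(supports(s, Set.of("d"), Set.of("a"))); // false
    }
}
```

Counting how many sequences in a dataset support a rule this way gives the rule's support, which is what RuleGrowth/ERMiner-style algorithms compute efficiently.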
[jira] [Assigned] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23654: Assignee: (was: Apache Spark) > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft: > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superfluous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour of the S3A connector, which we're willing to maintain. > JetS3t was wonderful when it came out, but now the Amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20114: -- Target Version/s: 2.4.0 > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Major > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and providing a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which are not well suited to direct use for prediction on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add a > new column to the input DataSet. The PrefixSpanModel is only used to provide > access to frequent sequential patterns. > #* Add the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform, as FPGrowthModel does. > The rules extracted are of the form X -> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. > # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth and ERMiner. 
The rules are X -> Y where X is unordered and Y is > unordered, but X must appear before Y; this is more general and can work > better in practice for prediction. > I'd like to hear more from users about which kind of sequential rules > is more practical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24073) DataSourceV2: Rename DataReaderFactory back to ReadTask.
[ https://issues.apache.org/jira/browse/SPARK-24073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450443#comment-16450443 ] Apache Spark commented on SPARK-24073: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21145 > DataSourceV2: Rename DataReaderFactory back to ReadTask. > > > Key: SPARK-24073 > URL: https://issues.apache.org/jira/browse/SPARK-24073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > Just before 2.3.0, SPARK-23219 renamed ReadTask to DataReaderFactory. The > intent was to make the read and write API match (write side uses > DataWriterFactory), but the underlying problem is that the two classes are > not equivalent. > ReadTask/DataReader function as Iterable/Iterator. ReadTask is specific to > a read task, in contrast to DataWriterFactory where the same factory instance > is used in all write tasks. ReadTask's purpose is to manage the lifecycle of > DataReader with an explicit create operation to mirror the close operation. > This is no longer clear from the API, where DataReaderFactory appears to be > more generic than it is, and it isn't clear why a set of them is produced for > a read. > We should rename DataReaderFactory back to ReadTask, which correctly conveys > the purpose and use of the class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
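The Iterable/Iterator analogy in the description can be sketched with hypothetical simplified interfaces (these are illustrative only, not Spark's actual DataSourceV2 API): a ReadTask behaves like an Iterable for one read task, owning the create/close lifecycle of its DataReader (the Iterator), which is why a factory name misleads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified types, not Spark's actual API, sketching the
// asymmetry: one ReadTask per partition creates its own DataReader and owns
// its lifecycle, with createDataReader() mirroring DataReader.close().
public class ReadTaskDemo {
    interface DataReader<T> extends AutoCloseable {
        boolean next();
        T get();
    }

    // One instance per read task, analogous to an Iterable over one partition.
    static class RangeReadTask {
        private final int start, end;
        RangeReadTask(int start, int end) { this.start = start; this.end = end; }

        DataReader<Integer> createDataReader() {
            return new DataReader<Integer>() {
                private int i = start - 1;
                public boolean next() { return ++i < end; }
                public Integer get() { return i; }
                public void close() { /* release task-scoped resources here */ }
            };
        }
    }

    // Explicit create ... close bracket around the read, as described above.
    public static List<Integer> readAll(RangeReadTask task) {
        List<Integer> out = new ArrayList<>();
        try (DataReader<Integer> r = task.createDataReader()) {
            while (r.next()) out.add(r.get());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new RangeReadTask(0, 3))); // [0, 1, 2]
    }
}
```

A DataWriterFactory, by contrast, would be a single shared instance handed a partition id per task, which is why the same name does not fit the read side.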
[jira] [Assigned] (SPARK-24073) DataSourceV2: Rename DataReaderFactory back to ReadTask.
[ https://issues.apache.org/jira/browse/SPARK-24073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24073: Assignee: Apache Spark > DataSourceV2: Rename DataReaderFactory back to ReadTask. > > > Key: SPARK-24073 > URL: https://issues.apache.org/jira/browse/SPARK-24073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.0 > > > Just before 2.3.0, SPARK-23219 renamed ReadTask to DataReaderFactory. The > intent was to make the read and write API match (write side uses > DataWriterFactory), but the underlying problem is that the two classes are > not equivalent. > ReadTask/DataReader function as Iterable/Iterator. ReadTask is specific to > a read task, in contrast to DataWriterFactory where the same factory instance > is used in all write tasks. ReadTask's purpose is to manage the lifecycle of > DataReader with an explicit create operation to mirror the close operation. > This is no longer clear from the API, where DataReaderFactory appears to be > more generic than it is, and it isn't clear why a set of them is produced for > a read. > We should rename DataReaderFactory back to ReadTask, which correctly conveys > the purpose and use of the class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24073) DataSourceV2: Rename DataReaderFactory back to ReadTask.
[ https://issues.apache.org/jira/browse/SPARK-24073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24073: Assignee: (was: Apache Spark) > DataSourceV2: Rename DataReaderFactory back to ReadTask. > > > Key: SPARK-24073 > URL: https://issues.apache.org/jira/browse/SPARK-24073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > Just before 2.3.0, SPARK-23219 renamed ReadTask to DataReaderFactory. The > intent was to make the read and write API match (write side uses > DataWriterFactory), but the underlying problem is that the two classes are > not equivalent. > ReadTask/DataReader function as Iterable/Iterator. ReadTask is specific to > a read task, in contrast to DataWriterFactory where the same factory instance > is used in all write tasks. ReadTask's purpose is to manage the lifecycle of > DataReader with an explicit create operation to mirror the close operation. > This is no longer clear from the API, where DataReaderFactory appears to be > more generic than it is, and it isn't clear why a set of them is produced for > a read. > We should rename DataReaderFactory back to ReadTask, which correctly conveys > the purpose and use of the class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-23654: --- Summary: Cut jets3t as a dependency of spark-core (was: Cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module as incompatible) > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft: > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superfluous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour of the S3A connector, which we're willing to maintain. > JetS3t was wonderful when it came out, but now the Amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24073) DataSourceV2: Rename DataReaderFactory back to ReadTask.
Ryan Blue created SPARK-24073: - Summary: DataSourceV2: Rename DataReaderFactory back to ReadTask. Key: SPARK-24073 URL: https://issues.apache.org/jira/browse/SPARK-24073 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Ryan Blue Fix For: 2.4.0 Just before 2.3.0, SPARK-23219 renamed ReadTask to DataReaderFactory. The intent was to make the read and write API match (write side uses DataWriterFactory), but the underlying problem is that the two classes are not equivalent. ReadTask/DataReader function as Iterable/Iterator. ReadTask is specific to a read task, in contrast to DataWriterFactory where the same factory instance is used in all write tasks. ReadTask's purpose is to manage the lifecycle of DataReader with an explicit create operation to mirror the close operation. This is no longer clear from the API, where DataReaderFactory appears to be more generic than it is, and it isn't clear why a set of them is produced for a read. We should rename DataReaderFactory back to ReadTask, which correctly conveys the purpose and use of the class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450426#comment-16450426 ] Apache Spark commented on SPARK-24043: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/21144 > InterpretedPredicate.eval fails if expression tree contains Nondeterministic > expressions > > > Key: SPARK-24043 > URL: https://issues.apache.org/jira/browse/SPARK-24043 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > When whole-stage codegen and predicate codegen both fail, FilterExec falls > back to using InterpretedPredicate. If the predicate's expression contains > any non-deterministic expressions, the evaluation throws an error: > {noformat} > scala> val df = Seq((1)).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: int] > scala> df.filter('a > 0).show // this works fine > 2018-04-21 20:39:26 WARN FilterExec:66 - Codegen disabled for this > expression: > (value#1 > 0) > +---+ > | a| > +---+ > | 1| > +---+ > scala> df.filter('a > rand(7)).show // this will throw an error > 2018-04-21 20:39:40 WARN FilterExec:66 - Codegen disabled for this > expression: > (cast(value#1 as double) > rand(7)) > 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > {noformat} > This is because no code initializes the Nondeterministic expressions before > eval is called on them. 
> This is a low impact issue, since it would require both whole-stage codegen > and predicate codegen to fail before FilterExec would fall back to using > InterpretedPredicate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
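The initialize-before-eval contract in the report can be sketched with hypothetical classes (illustrative only, not Spark's Catalyst internals): a nondeterministic expression's per-partition random generator only exists once a task has started, so calling eval() before initialize() must fail.

```java
import java.util.Random;

// Simplified sketch (hypothetical classes, not Spark's Catalyst internals) of
// why a nondeterministic expression must be initialized before eval(): the
// per-partition random generator is created at task start, not at construction.
public class NondetDemo {
    static class RandExpr {
        private final long seed;
        private Random rng;  // null until initialize() runs for a partition

        RandExpr(long seed) { this.seed = seed; }

        void initialize(int partitionIndex) {
            rng = new Random(seed + partitionIndex);
        }

        double eval() {
            if (rng == null) {
                throw new IllegalStateException(
                    "Nondeterministic expression should be initialized before eval.");
            }
            return rng.nextDouble();
        }
    }

    public static void main(String[] args) {
        RandExpr e = new RandExpr(7L);
        try {
            e.eval();  // mirrors the reported failure: eval before initialize
        } catch (IllegalStateException ex) {
            System.out.println("caught: " + ex.getMessage());
        }
        e.initialize(0);
        System.out.println(e.eval() >= 0.0);  // succeeds after initialization
    }
}
```

The fix in the linked PR follows the same shape: the interpreted code path has to call the initialization step for every nondeterministic expression before evaluating the predicate.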
[jira] [Assigned] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24043: Assignee: Apache Spark > InterpretedPredicate.eval fails if expression tree contains Nondeterministic > expressions > > > Key: SPARK-24043 > URL: https://issues.apache.org/jira/browse/SPARK-24043 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > When whole-stage codegen and predicate codegen both fail, FilterExec falls > back to using InterpretedPredicate. If the predicate's expression contains > any non-deterministic expressions, the evaluation throws an error: > {noformat} > scala> val df = Seq((1)).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: int] > scala> df.filter('a > 0).show // this works fine > 2018-04-21 20:39:26 WARN FilterExec:66 - Codegen disabled for this > expression: > (value#1 > 0) > +---+ > | a| > +---+ > | 1| > +---+ > scala> df.filter('a > rand(7)).show // this will throw an error > 2018-04-21 20:39:40 WARN FilterExec:66 - Codegen disabled for this > expression: > (cast(value#1 as double) > rand(7)) > 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > {noformat} > This is because no code initializes the Nondeterministic expressions before > eval is called on them. > This is a low impact issue, since it would require both whole-stage codegen > and predicate codegen to fail before FilterExec would fall back to using > InterpretedPredicate. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24043: Assignee: (was: Apache Spark) > InterpretedPredicate.eval fails if expression tree contains Nondeterministic > expressions > > > Key: SPARK-24043 > URL: https://issues.apache.org/jira/browse/SPARK-24043 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > When whole-stage codegen and predicate codegen both fail, FilterExec falls > back to using InterpretedPredicate. If the predicate's expression contains > any non-deterministic expressions, the evaluation throws an error: > {noformat} > scala> val df = Seq((1)).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: int] > scala> df.filter('a > 0).show // this works fine > 2018-04-21 20:39:26 WARN FilterExec:66 - Codegen disabled for this > expression: > (value#1 > 0) > +---+ > | a| > +---+ > | 1| > +---+ > scala> df.filter('a > rand(7)).show // this will throw an error > 2018-04-21 20:39:40 WARN FilterExec:66 - Codegen disabled for this > expression: > (cast(value#1 as double) > rand(7)) > 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > {noformat} > This is because no code initializes the Nondeterministic expressions before > eval is called on them. > This is a low impact issue, since it would require both whole-stage codegen > and predicate codegen to fail before FilterExec would fall back to using > InterpretedPredicate. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24072) clearly define pushed filters
[ https://issues.apache.org/jira/browse/SPARK-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24072: Summary: clearly define pushed filters (was: remove unused DataSourceV2Relation.pushedFilters) > clearly define pushed filters > - > > Key: SPARK-24072 > URL: https://issues.apache.org/jira/browse/SPARK-24072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23990) Instruments logging improvements - ML regression package
[ https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23990. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21078 [https://github.com/apache/spark/pull/21078] > Instruments logging improvements - ML regression package > > > Key: SPARK-23990 > URL: https://issues.apache.org/jira/browse/SPARK-23990 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 > Environment: Instruments logging improvements - ML regression package >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > Original Estimate: 120h > Remaining Estimate: 120h > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450293#comment-16450293 ] Herman van Hovell commented on SPARK-24051: --- [~mgaido] do you have any idea why this is failing in Spark 2.3 specifically? Does it have something to do with the introduction of analysis barriers? > Incorrect results for certain queries using Java and Python APIs on Spark > 2.3.0 > --- > > Key: SPARK-24051 > URL: https://issues.apache.org/jira/browse/SPARK-24051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Emlyn Corrin >Priority: Major > > I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) > query, demonstrated by the Java program below. It was simplified from a much > more complex query, but I'm having trouble simplifying it further without > removing the erroneous behaviour. > {code:java} > package sparktest; > import org.apache.spark.SparkConf; > import org.apache.spark.sql.*; > import org.apache.spark.sql.expressions.Window; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import java.util.Arrays; > public class Main { > public static void main(String[] args) { > SparkConf conf = new SparkConf() > .setAppName("SparkTest") > .setMaster("local[*]"); > SparkSession session = > SparkSession.builder().config(conf).getOrCreate(); > Row[] arr1 = new Row[]{ > RowFactory.create(1, 42), > RowFactory.create(2, 99)}; > StructType sch1 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty()), > new StructField("b", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1); > ds1.show(); > Row[] arr2 = new Row[]{ > RowFactory.create(3)}; > StructType sch2 = new StructType(new StructField[]{ > 
new StructField("a", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2) > .withColumn("b", functions.lit(0)); > ds2.show(); > Column[] cols = new Column[]{ > new Column("a"), > new Column("b").as("b"), > functions.count(functions.lit(1)) > .over(Window.partitionBy()) > .as("n")}; > Dataset<Row> ds = ds1 > .select(cols) > .union(ds2.select(cols)) > .where(new Column("n").geq(1)) > .drop("n"); > ds.show(); > //ds.explain(true); > } > } > {code} > It just calculates the union of 2 datasets, > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > +---+---+ > {code} > with > {code:java} > +---+---+ > | a| b| > +---+---+ > | 3| 0| > +---+---+ > {code} > The expected result is: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > | 3| 0| > +---+---+ > {code} > but instead it prints: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 0| > | 2| 0| > | 3| 0| > +---+---+ > {code} > notice how the value in column b is always zero, overriding the original > values in rows 1 and 2. > Making seemingly trivial changes, such as replacing {{new > Column("b").as("b"),}} with just {{new Column("b"),}} or removing the > {{where}} clause after the union, makes it behave correctly again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23455) Default Params in ML should be saved separately
[ https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23455. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20633 [https://github.com/apache/spark/pull/20633] > Default Params in ML should be saved separately > --- > > Key: SPARK-23455 > URL: https://issues.apache.org/jira/browse/SPARK-23455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > We save ML's user-supplied params and default params as one entity in JSON. > When loading saved models, we set all the loaded params on the created ML > model instances as user-supplied params. > It causes some problems, e.g., if we strictly disallow some params to be set > at the same time, a default param can fail the param check because it is > treated as a user-supplied param after loading. > The loaded default params should not be set as user-supplied params. We > should save ML default params separately in JSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24072) remove unused DataSourceV2Relation.pushedFilters
[ https://issues.apache.org/jira/browse/SPARK-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24072: Assignee: Wenchen Fan (was: Apache Spark) > remove unused DataSourceV2Relation.pushedFilters > > > Key: SPARK-24072 > URL: https://issues.apache.org/jira/browse/SPARK-24072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24072) remove unused DataSourceV2Relation.pushedFilters
[ https://issues.apache.org/jira/browse/SPARK-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24072: Assignee: Apache Spark (was: Wenchen Fan) > remove unused DataSourceV2Relation.pushedFilters > > > Key: SPARK-24072 > URL: https://issues.apache.org/jira/browse/SPARK-24072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24072) remove unused DataSourceV2Relation.pushedFilters
[ https://issues.apache.org/jira/browse/SPARK-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450273#comment-16450273 ] Apache Spark commented on SPARK-24072: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21143 > remove unused DataSourceV2Relation.pushedFilters > > > Key: SPARK-24072 > URL: https://issues.apache.org/jira/browse/SPARK-24072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24072) remove unused DataSourceV2Relation.pushedFilters
Wenchen Fan created SPARK-24072: --- Summary: remove unused DataSourceV2Relation.pushedFilters Key: SPARK-24072 URL: https://issues.apache.org/jira/browse/SPARK-24072 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450262#comment-16450262 ] Kazuaki Ishizaki commented on SPARK-23933: -- cc [~smilegator] > High-order function: map(array, array) → map > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
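The Presto semantics quoted above can be mimicked in plain Java (illustrative only; this is not the proposed Spark implementation): zip a key array with a value array of the same length into a map.

```java
import java.util.*;

// Illustrative only: builds a map from parallel key/value lists, mirroring
// Presto's map(ARRAY[1,3], ARRAY[2,4]) -> {1 -> 2, 3 -> 4}.
public class MapFromArrays {
    public static <K, V> Map<K, V> mapFromArrays(List<K> keys, List<V> values) {
        if (keys.size() != values.size()) {
            throw new IllegalArgumentException(
                "key and value arrays must have the same length");
        }
        Map<K, V> m = new LinkedHashMap<>();  // preserves key order for display
        for (int i = 0; i < keys.size(); i++) {
            m.put(keys.get(i), values.get(i));
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(mapFromArrays(List.of(1, 3), List.of(2, 4))); // {1=2, 3=4}
    }
}
```

The SQL function would additionally have to define behavior for null entries and duplicate keys, which is where most of the design discussion tends to be.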
[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450243#comment-16450243 ] Xiao Li commented on SPARK-24070: - cc [~maropu] > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24071) Micro-benchmark of Parquet Filter Pushdown
Xiao Li created SPARK-24071: --- Summary: Micro-benchmark of Parquet Filter Pushdown Key: SPARK-24071 URL: https://issues.apache.org/jira/browse/SPARK-24071 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Xiao Li Need a micro-benchmark suite for Parquet filter pushdown -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
Xiao Li created SPARK-24070: --- Summary: TPC-DS Performance Tests for Parquet 1.10.0 Upgrade Key: SPARK-24070 URL: https://issues.apache.org/jira/browse/SPARK-24070 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Xiao Li TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23807. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20923 [https://github.com/apache/spark/pull/20923] > Add Hadoop 3 profile with relevant POM fix ups > -- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.0 > > > Hadoop 3, and in particular Hadoop 3.1, adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supersedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for Spark. > * Lots of other features and fixes. > The basic work of building Spark with Hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously. 
> To use the cloud features, the Spark hadoop-3 profile should declare that the spark-hadoop-cloud module depends on (and only on) the hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud storage, plus a source package which is only built and tested when building against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23807: -- Assignee: Steve Loughran > Add Hadoop 3 profile with relevant POM fix ups > -- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.0 > > > Hadoop 3, and particular Hadoop 3.1 adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supercedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for spark. > * Lots of other features and fixes. > The basic work of building spark with hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously.. > To use the cloud features spark the hadoop-3 profile should declare that the > spark-hadoop-cloud module depends on —and only on— the > hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud > storage, and a source package which is only built and tested when build > against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450186#comment-16450186 ] Julien Cuquemelle commented on SPARK-22683: --- Thanks for all your comments and proposals :) > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Assignee: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands,
[jira] [Resolved] (SPARK-24052) Support spark version showing on environment page
[ https://issues.apache.org/jira/browse/SPARK-24052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24052. --- Resolution: Not A Problem > Support spark version showing on environment page > - > > Key: SPARK-24052 > URL: https://issues.apache.org/jira/browse/SPARK-24052 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: zhoukang >Priority: Major > Attachments: environment-page.png > > > Since we may have multiple version in production cluster,we can showing some > information on environment page like below: > !environment-page.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23975) Allow Clustering to take Arrays of Double as input features
[ https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450161#comment-16450161 ] Joseph K. Bradley commented on SPARK-23975: --- I merged https://github.com/apache/spark/pull/21081 for KMeans, and [~lu.DB] will follow up for the other algs. > Allow Clustering to take Arrays of Double as input features > --- > > Key: SPARK-23975 > URL: https://issues.apache.org/jira/browse/SPARK-23975 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Lu Wang >Priority: Major > > Clustering algorithms should accept Arrays in addition to Vectors as input > features. The python interface should also be changed so that it would make > PySpark a lot easier to use. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23975) Allow Clustering to take Arrays of Double as input features
[ https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23975: - Assignee: Lu Wang > Allow Clustering to take Arrays of Double as input features > --- > > Key: SPARK-23975 > URL: https://issues.apache.org/jira/browse/SPARK-23975 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Lu Wang >Assignee: Lu Wang >Priority: Major > > Clustering algorithms should accept Arrays in addition to Vectors as input > features. The python interface should also be changed so that it would make > PySpark a lot easier to use. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24069) Add array_max / array_min functions
[ https://issues.apache.org/jira/browse/SPARK-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24069: Assignee: (was: Apache Spark) > Add array_max / array_min functions > --- > > Key: SPARK-24069 > URL: https://issues.apache.org/jira/browse/SPARK-24069 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Add R versions of SPARK-23918 and SPARK-23917 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24069) Add array_max / array_min functions
[ https://issues.apache.org/jira/browse/SPARK-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450142#comment-16450142 ] Apache Spark commented on SPARK-24069: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21142 > Add array_max / array_min functions > --- > > Key: SPARK-24069 > URL: https://issues.apache.org/jira/browse/SPARK-24069 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Add R versions of SPARK-23918 and SPARK-23917 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24069) Add array_max / array_min functions
[ https://issues.apache.org/jira/browse/SPARK-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24069: Assignee: Apache Spark > Add array_max / array_min functions > --- > > Key: SPARK-24069 > URL: https://issues.apache.org/jira/browse/SPARK-24069 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Add R versions of SPARK-23918 and SPARK-23917 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24069) Add array_max / array_min functions
Hyukjin Kwon created SPARK-24069: Summary: Add array_max / array_min functions Key: SPARK-24069 URL: https://issues.apache.org/jira/browse/SPARK-24069 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.4.0 Reporter: Hyukjin Kwon Add R versions of SPARK-23918 and SPARK-23917 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-22683: - Assignee: Julien Cuquemelle > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Assignee: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450131#comment-16450131 ] Thomas Graves commented on SPARK-22683: --- Note this added a new config spark.dynamicAllocation.executorAllocationRatio, default to 1.0 which is the same behavior as existing releases. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. 
> - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubs
[jira] [Resolved] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-22683. --- Resolution: Fixed Fix Version/s: 2.4.0 > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
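The sizing rule proposed above (each task slot expected to run N tasks, instead of one slot per task) can be sketched in a few lines. This is an illustrative Python model, not Spark's actual ExecutorAllocationManager code; note that the config ultimately merged, spark.dynamicAllocation.executorAllocationRatio, expresses the same idea as a ratio multiplier rather than a tasks-per-slot divisor.

```python
import math

def target_executors(pending_tasks, executor_cores, task_cpus, tasks_per_slot=1):
    """Illustrative model of the proposal: size the executor target so each
    task slot is expected to run `tasks_per_slot` tasks, instead of one."""
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(pending_tasks / (slots_per_executor * tasks_per_slot))

# 1000 pending tasks, spark.executor.cores=5, spark.task.cpus=1:
assert target_executors(1000, 5, 1) == 200                   # default: 1 task per slot
assert target_executors(1000, 5, 1, tasks_per_slot=6) == 34  # ~6x fewer containers
```

With tasks_per_slot=6 a job trades some latency (tasks queue behind each other in a slot) for far fewer allocated-then-idle containers, which is where the reported ~30% vcore-hour reduction comes from.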
[jira] [Commented] (SPARK-24000) S3A: Create Table should fail on invalid AK/SK
[ https://issues.apache.org/jira/browse/SPARK-24000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450024#comment-16450024 ] Steve Loughran commented on SPARK-24000: We could consider whether or not to raise an AccessDeniedException in S3A in this situation, so we fail fast if you can't read the bucket. Interesting issue: you'd need to extend the current assumed-role tests to see if there's a risk that you could be granted access to paths under a bucket yet still have this test fail with 403. # Created HADOOP-15409 about S3A at least failing fast on invalid credentials # Is that enough? If Spark adds a getFileStatus() call on the path and catches all FNFEs, then it would handle the auth problems, but would also include other failure modes, e.g. network connectivity. Good or bad?
> S3A: Create Table should fail on invalid AK/SK
> --
>
> Key: SPARK-24000
> URL: https://issues.apache.org/jira/browse/SPARK-24000
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Affects Versions: 2.3.0
> Reporter: Brahma Reddy Battula
> Priority: Major
>
> Currently, when we pass an *invalid AK/SK*, *create table* still *succeeds*.
> When the S3AFileSystem is initialized, *verifyBucketExists*() is called, which returns *true* because status code 403 (*BUCKET_ACCESS_FORBIDDEN_STATUS_CODE*) is treated as "bucket exists" in the following:
> {code:java}
> public boolean doesBucketExist(String bucketName)
>     throws AmazonClientException, AmazonServiceException {
>   try {
>     headBucket(new HeadBucketRequest(bucketName));
>     return true;
>   } catch (AmazonServiceException ase) {
>     // A redirect error or a forbidden error means the bucket exists. So
>     // returning true.
>     if ((ase.getStatusCode() == Constants.BUCKET_REDIRECT_STATUS_CODE)
>         || (ase.getStatusCode() == Constants.BUCKET_ACCESS_FORBIDDEN_STATUS_CODE)) {
>       return true;
>     }
>     if (ase.getStatusCode() == Constants.NO_SUCH_BUCKET_STATUS_CODE) {
>       return false;
>     }
>     throw ase;
>   }
> }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
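For readers skimming the SDK snippet above: the reason an invalid AK/SK slips through is that verifyBucketExists() treats a 403 the same as success. A minimal Python distillation of that decision table (status-code constants hard-coded here for illustration; the real values live in the AWS SDK's Constants class):

```python
BUCKET_REDIRECT_STATUS_CODE = 301
BUCKET_ACCESS_FORBIDDEN_STATUS_CODE = 403
NO_SUCH_BUCKET_STATUS_CODE = 404

def does_bucket_exist(status_code):
    """Mirrors the quoted doesBucketExist() logic: a redirect or a forbidden
    response still counts as "the bucket exists", so bad credentials (403)
    pass the existence check instead of failing fast."""
    if status_code in (BUCKET_REDIRECT_STATUS_CODE, BUCKET_ACCESS_FORBIDDEN_STATUS_CODE):
        return True
    if status_code == NO_SUCH_BUCKET_STATUS_CODE:
        return False
    raise RuntimeError(f"unexpected HEAD bucket status: {status_code}")

assert does_bucket_exist(403) is True   # the problem: forbidden == "exists"
assert does_bucket_exist(404) is False
```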
[jira] [Commented] (SPARK-23852) Parquet MR bug can lead to incorrect SQL results
[ https://issues.apache.org/jira/browse/SPARK-23852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450018#comment-16450018 ] Eric Maynard commented on SPARK-23852: -- >There is no upstream release of Parquet that contains the fix for PARQUET-1217, although a 1.10 release is planned. PARQUET-1217 seems to have been merged into Parquet 1.8.3 today. > Parquet MR bug can lead to incorrect SQL results > > > Key: SPARK-23852 > URL: https://issues.apache.org/jira/browse/SPARK-23852 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Blocker > Labels: correctness > > Parquet MR 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, that means that > pushing certain predicates to Parquet scanners can return fewer results than > they should. > The bug triggers in Spark when: > * The Parquet file being scanned has stats for the null count, but not the > max or min, on the column with the predicate (Apache Impala writes files like > this). > * The vectorized Parquet reader path is not taken, and the parquet-mr reader > is used. > * A suitable <, <=, > or >= predicate is pushed down to Parquet. > The bug is that parquet-mr interprets the max and min of a row-group's > column as 0 in the absence of stats. So {{col > 0}} will filter out all results, > even if some are > 0. > There is no upstream release of Parquet that contains the fix for > PARQUET-1217, although a 1.10 release is planned. > The least impactful workaround is to set the Parquet configuration > {{parquet.filter.stats.enabled}} to {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
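The pruning mistake described in the report can be distilled into a toy model (hypothetical code, not parquet-mr's actual row-group filter): when a row group has a null count but no min/max, a reader that defaults the missing max to 0 drops row groups that a correct reader must keep.

```python
def can_drop_buggy(stats_max, lower_bound):
    """Row-group pruning for a `col > lower_bound` predicate, with the
    PARQUET-1217 flaw: a missing max silently becomes 0."""
    effective_max = stats_max if stats_max is not None else 0  # the bug
    return effective_max <= lower_bound

def can_drop_fixed(stats_max, lower_bound):
    """Correct behaviour: without min/max stats, the group cannot be pruned."""
    return stats_max is not None and stats_max <= lower_bound

# Row group actually containing values up to 42, but written (e.g. by Impala)
# with only a null count and no min/max; the pushed-down predicate is `col > 0`:
assert can_drop_buggy(None, 0) is True    # wrongly pruned -> rows go missing
assert can_drop_fixed(None, 0) is False   # kept, as it must be
```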
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449948#comment-16449948 ] Eric Maynard commented on SPARK-23519: -- Why does the fact that you dynamically generate the statement mean that you can't alias the columns in your select statement? You can generate aliases as well. This seems like a non-issue. > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1. Create and populate a Hive table. I did this in a Hive CLI session [not that this matters] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. Create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
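To make the comment's suggestion concrete: since the statement is generated dynamically, the alias list can be generated in the same pass, so the SELECT list never contains duplicate output names. A minimal sketch in Python (the `aliased_select` helper is hypothetical, not a Spark API; table and view names are taken from the report):

```python
def aliased_select(columns, aliases):
    """Pair possibly-duplicate source columns with unique aliases so the
    view's output schema has no duplicate column names."""
    if len(columns) != len(aliases):
        raise ValueError("need exactly one alias per column")
    return ", ".join(f"{c} AS {a}" for c, a in zip(columns, aliases))

select_list = aliased_select(["col1", "col1"], ["int1", "int2"])
# select_list == "col1 AS int1, col1 AS int2"
ddl = f"CREATE VIEW default.aview AS SELECT {select_list} FROM atable"
```

The resulting DDL string can then be passed to spark.sql(...) without tripping the duplicate-column assertion, because each duplicate source column is selected under a unique alias.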
[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449935#comment-16449935 ] Li Jin edited comment on SPARK-22947 at 4/24/18 2:16 PM: - I came across this blog today: [https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html] And realized the Ad Monetization example in the blog pretty much described the asof join case in streaming mode. was (Author: icexelloss): I came across this blog today: [https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html] And realized the Ad Monetization problem pretty much described asof join case in streaming mode. > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin >Priority: Major > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > libraries have implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. 
Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. > * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive). User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... 
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This > is an important illustration of the time context - it is able to expand the > context to 20151231 on dfB because of the 1 day lookback. > h4. Use Case B (join with key) > To join on time and another key (for instance, id), we use “by” to specify > the key. > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = > JoinSpec(timeContext).on("time").by("id").tolerance("-1day") > result = dfA.aso
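The as-of join semantics of Use Case A above can be illustrated with a small self-contained sketch over in-memory lists (plain Python for illustration; the `asof_join` helper is hypothetical and is not the proposed Spark API):

```python
from datetime import date, timedelta

def asof_join(left, right, lookback_days):
    """For each (time, quantity) row on the left, attach the value of the most
    recent right row whose time is <= the left time and within the lookback
    window; None if no such row exists. Toy model of an as-of join."""
    out = []
    for t, qty in left:
        candidates = [(rt, p) for rt, p in right
                      if rt <= t and (t - rt) <= timedelta(days=lookback_days)]
        # max() on (date, price) tuples picks the latest matching right row
        price = max(candidates)[1] if candidates else None
        out.append((t, qty, price))
    return out

df_a = [(date(2016, 1, 1), 100), (date(2016, 1, 2), 50),
        (date(2016, 1, 4), -50), (date(2016, 1, 5), 100)]
df_b = [(date(2015, 12, 31), 100.0), (date(2016, 1, 4), 105.0),
        (date(2016, 1, 5), 102.0)]

result = asof_join(df_a, df_b, lookback_days=1)
# joined prices: [100.0, None, 105.0, 102.0], matching the example output above
```

Note how the 20160101 row picks up the 20151231 price through the one-day lookback, which is exactly the "expanded" time context the proposal describes.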
[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449935#comment-16449935 ] Li Jin commented on SPARK-22947: I came across this blog today: [https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html] And realized the Ad Monetization problem pretty much described asof join case in streaming mode. > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin >Priority: Major > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > library has implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. 
> * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive). User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... 
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This > is an important illustration of the time context - it is able to expand the > context to 20151231 on dfB because of the 1 day lookback. > h4. Use Case B (join with key) > To join on time and another key (for instance, id), we use “by” to specify > the key. > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = > JoinSpec(timeContext).on("time").by("id").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, id, quantity > 20160101, 1, 100 > 20160101, 2, 50 > 20160102, 1, -50 > 20160102, 2, 50 > dfB: > time, id, price > 20151231, 1, 100.0 > 20150102, 1, 105.0 > 20150102, 2, 195.0 > Output: > time, id, quantity, price > 20160101, 1, 100, 100.0 >
[jira] [Created] (SPARK-24068) CSV schema inferring doesn't work for compressed files
Maxim Gekk created SPARK-24068: -- Summary: CSV schema inferring doesn't work for compressed files Key: SPARK-24068 URL: https://issues.apache.org/jira/browse/SPARK-24068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk Here is a simple csv file compressed by lzo {code} $ cat ./test.csv col1,col2 a,1 $ lzop ./test.csv $ ls test.csv test.csv.lzo {code} Reading test.csv.lzo with LZO codec (see https://github.com/twitter/hadoop-lzo, for example): {code:scala} scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo") ds: org.apache.spark.sql.DataFrame = [�LZO?: string] scala> ds.printSchema root |-- �LZO: string (nullable = true) scala> ds.show +-+ |�LZO| +-+ |a| +-+ {code} but the file can be read if the schema is specified: {code} scala> import org.apache.spark.sql.types._ scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType) scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo") scala> ds.show +++ |col1|col2| +++ | a| 1| +++ {code} Just in case, schema inferring works for the original uncompressed file: {code:scala} scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema root |-- col1: string (nullable = true) |-- col2: integer (nullable = true) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Petrov updated SPARK-23182: - Affects Version/s: 2.2.2 > Allow enabling of TCP keep alive for master RPC connections > --- > > Key: SPARK-23182 > URL: https://issues.apache.org/jira/browse/SPARK-23182 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2, 2.4.0 >Reporter: Petar Petrov >Priority: Major > > We rely heavily on preemptible worker machines in GCP/GCE. These machines > disappear without closing the TCP connections to the master which increases > the number of established connections and new workers can not connect because > of "Too many open files" on the master. > To solve the problem we need to enable TCP keep alive for the RPC connections > to the master but it's not possible to do so via configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
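For context, the "TCP keep alive" requested above is the standard socket option: with it enabled, the OS periodically probes idle peers and eventually tears down connections to machines that vanished without a FIN (such as preempted VMs). A minimal sketch of what enabling it looks like at the socket level (plain Python; Spark's RPC layer is Netty-based, where the equivalent would be the SO_KEEPALIVE channel option):

```python
import socket

# Create a TCP socket and enable OS-level keep-alive probes on it.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
keepalive_enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

Probe interval and retry counts are governed by OS-level settings (e.g. net.ipv4.tcp_keepalive_* on Linux), which is why exposing only the on/off flag through Spark configuration would already be useful here.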
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449857#comment-16449857 ] Steve Loughran commented on SPARK-18673: It's a big hive patch, but most of it is hbase related. I'm going to offer to help cherrypick the relevant hive changes to the org.sparkproject hive JAR, but I'm going to place a dependency on SPARK-23807 first, so that you can actually build against hadoop-3 properly in both maven and SBT. > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Hereth updated SPARK-24067: --- Description: SPARK-17147 fixes a problem with non-consecutive Kafka offsets. The [PR|https://github.com/apache/spark/pull/20572] was merged to {{master}}. This should be backported to 2.3. Original Description from SPARK-17147 : When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be just an increment of 1 above the previous offset. I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from: {{nextOffset = offset + 1}} to: {{nextOffset = record.offset + 1}} and changed KafkaRDD from: {{requestOffset += 1}} to: {{requestOffset = r.offset() + 1}} (I also had to change some assert logic in CachedKafkaConsumer). There's a strong possibility that I have misconstrued how to use the streaming kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction) I am also happy to contribute a PR. was: Spark 2.3 Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction) When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be just an increment of 1 above the previous offset. 
I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from: {{nextOffset = offset + 1}} to: {{nextOffset = record.offset + 1}} and changed KafkaRDD from: {{requestOffset += 1}} to: {{requestOffset = r.offset() + 1}} (I also had to change some assert logic in CachedKafkaConsumer). There's a strong possibility that I have misconstrued how to use the streaming kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction) I am also happy to contribute a PR. > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > > SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The [PR > w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This > should be backported to 2.3. > > Original Description from SPARK-17147 : > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
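The gap-tolerant offset advance described above can be sketched with a toy model of a compacted partition (plain Python with hypothetical names; this is an illustration of the idea, not the actual CachedKafkaConsumer code):

```python
# Toy model of a compacted Kafka partition: offsets 2, 3, 5 and 6 were
# compacted away, so records are NOT at consecutive offsets.
log = [(0, "a"), (1, "b"), (4, "c"), (7, "d")]  # (offset, value)

def read_all(records):
    """Advance the expected offset from the record actually returned
    (the SPARK-17147 fix), so gaps left by compaction are tolerated."""
    request_offset = 0
    values = []
    for offset, value in records:
        # The old logic effectively asserted offset == request_offset and
        # bumped by 1, which breaks at the first gap; >= tolerates gaps.
        assert offset >= request_offset
        values.append(value)
        request_offset = offset + 1  # next fetch position from the record itself
    return values
```

With the old `request_offset += 1` style advance, the consumer would expect offset 2 after reading offset 1 and fail at the gap; deriving the next position from the returned record reads the whole compacted log cleanly.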
[jira] [Updated] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Hereth updated SPARK-24067: --- Affects Version/s: (was: 2.0.0) 2.3.0 > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > > Spark 2.3 Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Hereth updated SPARK-24067: --- Fix Version/s: (was: 2.4.0) > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > > Spark 2.3 Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org