[jira] [Commented] (SPARK-23514) Replace spark.sparkContext.hadoopConfiguration by spark.sessionState.newHadoopConf()
[ https://issues.apache.org/jira/browse/SPARK-23514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375978#comment-16375978 ] Xiao Li commented on SPARK-23514: - cc [~dongjoon] Do you want to give it a try? > Replace spark.sparkContext.hadoopConfiguration by > spark.sessionState.newHadoopConf() > > > Key: SPARK-23514 > URL: https://issues.apache.org/jira/browse/SPARK-23514 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.0 > Reporter: Xiao Li > Priority: Major > > Check all the places where we directly use > {{spark.sparkContext.hadoopConfiguration}}. Instead, in some scenarios, it > makes more sense to call {{spark.sessionState.newHadoopConf()}}, which blends > in settings from SQLConf. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23405) The task will hang up when a small table left semi join a big table
[ https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-23405: -- Description: I run a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is a small table with a single row. The `catalog_sales` table is a big table with 10 billion rows. The task hangs: !taskhang up.png! And the SQL page is: !SQL.png! was: I run a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is a small table with a single row. The `catalog_sales` table is a big table with 10 billion rows. The task hangs: !taskhang up.png! And the SQL page is: !SQL.png! > The task will hang up when a small table left semi join a big table > --- > > Key: SPARK-23405 > URL: https://issues.apache.org/jira/browse/SPARK-23405 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.1 > Reporter: KaiXinXIaoLei > Priority: Major > Attachments: SQL.png, taskhang up.png > > > I run a SQL query: `select ls.cs_order_number from ls left semi join > catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table > is a small table with a single row. The `catalog_sales` table is a big > table with 10 billion rows. The task hangs: > !taskhang up.png! > And the SQL page is: > !SQL.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
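For reference, the reported query restated as DataFrame code — a sketch assuming both tables are already registered in the catalog; it reproduces the shape of the workload (tiny preserved side of a LEFT SEMI join against a huge table), not the hang itself:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Tables from the report: `ls` has a single row, `catalog_sales` ~10 billion rows.
val ls = spark.table("ls")
val cs = spark.table("catalog_sales")

// The reported query: the tiny table is the *left* (preserved) side of a LEFT SEMI join.
val result = ls.join(cs, Seq("cs_order_number"), "left_semi")
result.show()

// Equivalent SQL, as written in the report:
spark.sql(
  """select ls.cs_order_number from ls
    |left semi join catalog_sales cs
    |on ls.cs_order_number = cs.cs_order_number""".stripMargin).show()
```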
[jira] [Created] (SPARK-23514) Replace spark.sparkContext.hadoopConfiguration by spark.sessionState.newHadoopConf()
Xiao Li created SPARK-23514: --- Summary: Replace spark.sparkContext.hadoopConfiguration by spark.sessionState.newHadoopConf() Key: SPARK-23514 URL: https://issues.apache.org/jira/browse/SPARK-23514 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Check all the places where we directly use {{spark.sparkContext.hadoopConfiguration}}. Instead, in some scenarios, it makes more sense to call {{spark.sessionState.newHadoopConf()}} which blends in settings from SQLConf. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
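To illustrate the distinction the ticket draws, a minimal sketch (assuming a local `SparkSession`; note that `sessionState` is package-private, so the suggested swap applies inside Spark's own SQL code paths rather than end-user code). The config key below is an arbitrary example, not one named by the ticket:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A session-scoped setting, e.g. set via SET in SQL or spark.conf.set
// (arbitrary example key):
spark.conf.set("fs.s3a.connection.maximum", "200")

// Context-wide Hadoop Configuration: shared by all sessions and NOT updated
// with session-scoped SQLConf settings.
val shared = spark.sparkContext.hadoopConfiguration

// A fresh Configuration with SQLConf entries blended in -- the per-session
// view that SQL code paths usually want.
val perSession = spark.sessionState.newHadoopConf()
// perSession should now contain the session-scoped key; `shared` does not.
```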
[jira] [Commented] (SPARK-23405) The task will hang up when a small table left semi join a big table
[ https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375973#comment-16375973 ] Apache Spark commented on SPARK-23405: -- User 'KaiXinXiaoLei' has created a pull request for this issue: https://github.com/apache/spark/pull/20670 > The task will hang up when a small table left semi join a big table > --- > > Key: SPARK-23405 > URL: https://issues.apache.org/jira/browse/SPARK-23405 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.1 > Reporter: KaiXinXIaoLei > Priority: Major > Attachments: SQL.png, taskhang up.png > > > I run a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales > cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is a small > table with a single row. The `catalog_sales` table is a big table with > 10 billion rows. The task hangs: > !taskhang up.png! > And the SQL page is: > !SQL.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23405) The task will hang up when a small table left semi join a big table
[ https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23405: Assignee: Apache Spark > The task will hang up when a small table left semi join a big table > --- > > Key: SPARK-23405 > URL: https://issues.apache.org/jira/browse/SPARK-23405 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.1 > Reporter: KaiXinXIaoLei > Assignee: Apache Spark > Priority: Major > Attachments: SQL.png, taskhang up.png > > > I run a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales > cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is a small > table with a single row. The `catalog_sales` table is a big table with > 10 billion rows. The task hangs: > !taskhang up.png! > And the SQL page is: > !SQL.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23405) The task will hang up when a small table left semi join a big table
[ https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23405: Assignee: (was: Apache Spark) > The task will hang up when a small table left semi join a big table > --- > > Key: SPARK-23405 > URL: https://issues.apache.org/jira/browse/SPARK-23405 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.1 > Reporter: KaiXinXIaoLei > Priority: Major > Attachments: SQL.png, taskhang up.png > > > I run a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales > cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is a small > table with a single row. The `catalog_sales` table is a big table with > 10 billion rows. The task hangs: > !taskhang up.png! > And the SQL page is: > !SQL.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23513) java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit error
Rawia created SPARK-23513: -- Summary: java.io.IOException: Expected 12 fields, but got 5 for row: Spark submit error Key: SPARK-23513 URL: https://issues.apache.org/jira/browse/SPARK-23513 Project: Spark Issue Type: Bug Components: EC2, Examples, Input/Output, Java API Affects Versions: 2.2.0, 1.4.0 Reporter: Rawia Hello, I'm trying to run a Spark application (distributedWekaSpark), but when I use the spark-submit command I get this error: {quote}ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.io.IOException: Expected 12 fields, but got 5 for row: outlook,temperature,humidity,windy,play{quote} I tried with other datasets, but the same error always appeared (always 12 fields expected). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22324) Upgrade Arrow to version 0.8.0 and upgrade Netty to 4.1.17
[ https://issues.apache.org/jira/browse/SPARK-22324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22324: Summary: Upgrade Arrow to version 0.8.0 and upgrade Netty to 4.1.17 (was: Upgrade Arrow to version 0.8.0) > Upgrade Arrow to version 0.8.0 and upgrade Netty to 4.1.17 > --- > > Key: SPARK-22324 > URL: https://issues.apache.org/jira/browse/SPARK-22324 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL > Affects Versions: 2.3.0 > Reporter: Bryan Cutler > Assignee: Bryan Cutler > Priority: Major > Fix For: 2.3.0 > > > Arrow version 0.8.0 is slated for release in early November, but I'd like to > start discussing now to help get all the work that's being done synced up. > Along with upgrading the Arrow Java artifacts, pyarrow on our Jenkins test > envs will need to be upgraded as well, which will take a fair amount of work > and planning. > One topic I'd like to discuss is whether pyarrow should be an installation > requirement for pyspark, i.e. when a user pip installs pyspark, it will also > install pyarrow. If not, then is there a minimum version that needs to be > supported? We currently have 0.4.1 installed on Jenkins. > There are a number of improvements and cleanups in the current code that can > happen depending on what we decide (I'll link them all here later, but off > the top of my head): > * Decimal bug fix and improved support > * Improved internal casting between pyarrow and pandas (can clean up some > workarounds); this will also verify data bounds if the user specifies a type > and the data overflows. See > https://github.com/apache/spark/pull/19459#discussion_r146421804 > * Better type checking when converting Spark types to Arrow > * Timestamp conversion to microseconds (for Spark internal format) > * Full support for using validity mask with 'object' types > https://github.com/apache/spark/pull/18664#discussion_r146567335 > * VectorSchemaRoot can call close more than once to simplify listener > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala#L90 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers
[ https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23207: Fix Version/s: (was: 2.4.0) > Shuffle+Repartition on an DataFrame could lead to incorrect answers > --- > > Key: SPARK-23207 > URL: https://issues.apache.org/jira/browse/SPARK-23207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Fix For: 2.3.0 > > > Currently shuffle repartition uses RoundRobinPartitioning, the generated > result is nondeterministic since the sequence of input rows are not > determined. > The bug can be triggered when there is a repartition call following a shuffle > (which would lead to non-deterministic row ordering), as the pattern shows > below: > upstream stage -> repartition stage -> result stage > (-> indicate a shuffle) > When one of the executors process goes down, some tasks on the repartition > stage will be retried and generate inconsistent ordering, and some tasks of > the result stage will be retried generating different data. > The following code returns 931532, instead of 100: > {code} > import scala.sys.process._ > import org.apache.spark.TaskContext > val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x => > x > }.repartition(200).map { x => > if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) { > throw new Exception("pkill -f java".!!) > } > x > } > res.distinct().count() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
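The repro quoted above kills executors so that retried tasks replay their input in a different order. The order-sensitivity itself can be seen from the two partitioning modes below (a sketch of my own, safe to run locally; it illustrates the mechanism rather than reproducing the bug):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[4]").getOrCreate()
val df = spark.range(0, 1000 * 1000)

// repartition(n) uses RoundRobinPartitioning: a row's target partition depends on
// the row's *position* within its input partition, so a retried upstream task that
// replays the same rows in a different order can route them to different partitions.
val roundRobin = df.repartition(200)

// repartition(n, col) hash-partitions by the key: the target partition is a pure
// function of the row's value, independent of arrival order, so retries are safe.
val byKey = df.repartition(200, col("id"))
```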
[jira] [Created] (SPARK-23512) Complex operations on Dataframe corrupts data
Nazarii Bardiuk created SPARK-23512: --- Summary: Complex operations on Dataframe corrupts data Key: SPARK-23512 URL: https://issues.apache.org/jira/browse/SPARK-23512 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.1 Reporter: Nazarii Bardiuk The following code demonstrates a sequence of transformations on a DataFrame that corrupts data:
{code}
from pyspark import SparkContext, SQLContext, Row
from pyspark.sql import Window
from pyspark.sql.functions import explode, lit, count, row_number, col, countDistinct

ss = SQLContext(SparkContext('local', 'pyspark'))
diffs = ss.createDataFrame([
    Row(id="1", a=["1"], b=["2"], t="2"),
    Row(id="2", a=["2"], b=["1"], t="1"),
    Row(id="3", a=["1"], b=["4", "3"], t="3"),
    Row(id="3", a=["1"], b=["4", "3"], t="4"),
    Row(id="4", a=["1"], b=["4", "3"], t="3"),
    Row(id="4", a=["1"], b=["4", "3"], t="4")])

a = diffs.select("id", explode("a").alias("l"), "t").withColumn("problem", lit("a"))
b = diffs.select("id", explode("b").alias("l"), "t").withColumn("problem", lit("b")) \
    .filter(col("t") != col("l"))

all = a.union(b)
grouped = all \
    .groupBy("l", "t", "problem").agg(count("id").alias("count")) \
    .withColumn("rn", row_number().over(Window.partitionBy("l", "problem").orderBy(col("count").desc()))) \
    .withColumn("f", (col("rn") < 2) & (col("count") > 1)) \
    .cache()  # the change that broke the test

keep = grouped.filter("f").select("l", "t", "problem", "count")
agg = all.join(grouped.filter(~col("f")), ["l", "t", "problem"]) \
    .withColumn("t", lit(None)) \
    .groupBy("l", "t", "problem").agg(countDistinct("id").alias("count"))

keep.union(agg).show()  # corrupts column "problem"
agg.union(keep).show()  # as expected
{code}
Expected: the data in the "problem" column of both unions is the same. Actual: the "problem" column loses data.
{code}
keep.union(agg).show()  # corrupts column "problem"
+---+----+-------+-----+
|  l|   t|problem|count|
+---+----+-------+-----+
|  3|   4|      a|    2|
|  4|   3|      a|    2|
|  1|   4|      a|    2|
|  1|null|      a|    3|
|  2|null|      a|    1|
+---+----+-------+-----+

agg.union(keep).show()  # as expected
+---+----+-------+-----+
|  l|   t|problem|count|
+---+----+-------+-----+
|  1|null|      a|    3|
|  2|null|      a|    1|
|  3|   4|      b|    2|
|  4|   3|      b|    2|
|  1|   4|      a|    2|
+---+----+-------+-----+
{code}
Note the cache() call: it was the tipping point that broke our code; without it, everything works as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
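The corruption above involves `cache()` plus `union`. One general observation worth keeping in mind when debugging such unions (my note, not a conclusion from this ticket): `union` resolves columns by position, not by name, so two plans whose column order diverges can silently swap values. `unionByName`, added in Spark 2.3, resolves by name instead. A sketch in Scala:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val keepLike = Seq(("3", "4", "a", 2L)).toDF("l", "t", "problem", "count")
// Same schema, but with "problem" and "t" in swapped positions:
val aggLike = Seq(("1", "a", "5", 3L)).toDF("l", "problem", "t", "count")

// union matches purely by position: "problem" values land in "t" and vice versa.
keepLike.union(aggLike).show()

// unionByName (Spark 2.3+) matches columns by name, avoiding the silent swap.
keepLike.unionByName(aggLike).show()
```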
[jira] [Assigned] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
[ https://issues.apache.org/jira/browse/SPARK-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22839: Assignee: Apache Spark > Refactor Kubernetes code for configuring driver/executor pods to use > consistent and cleaner abstraction > --- > > Key: SPARK-22839 > URL: https://issues.apache.org/jira/browse/SPARK-22839 > Project: Spark > Issue Type: Improvement > Components: Kubernetes > Affects Versions: 2.3.0 > Reporter: Yinan Li > Assignee: Apache Spark > Priority: Major > > As discussed in https://github.com/apache/spark/pull/19954, the current code > for configuring the driver pod and the code for configuring the executor pods > are not using the same abstraction. Besides that, the current code leaves a > lot to be desired in terms of the level and cleanliness of abstraction. For > example, the current code passes many pieces of information around > different class hierarchies, which makes code review and maintenance > challenging. We need some thorough refactoring of the current code to achieve > better, cleaner, and more consistent abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
[ https://issues.apache.org/jira/browse/SPARK-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375859#comment-16375859 ] Apache Spark commented on SPARK-22839: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/20669 > Refactor Kubernetes code for configuring driver/executor pods to use > consistent and cleaner abstraction > --- > > Key: SPARK-22839 > URL: https://issues.apache.org/jira/browse/SPARK-22839 > Project: Spark > Issue Type: Improvement > Components: Kubernetes > Affects Versions: 2.3.0 > Reporter: Yinan Li > Priority: Major > > As discussed in https://github.com/apache/spark/pull/19954, the current code > for configuring the driver pod and the code for configuring the executor pods > are not using the same abstraction. Besides that, the current code leaves a > lot to be desired in terms of the level and cleanliness of abstraction. For > example, the current code passes many pieces of information around > different class hierarchies, which makes code review and maintenance > challenging. We need some thorough refactoring of the current code to achieve > better, cleaner, and more consistent abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction
[ https://issues.apache.org/jira/browse/SPARK-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22839: Assignee: (was: Apache Spark) > Refactor Kubernetes code for configuring driver/executor pods to use > consistent and cleaner abstraction > --- > > Key: SPARK-22839 > URL: https://issues.apache.org/jira/browse/SPARK-22839 > Project: Spark > Issue Type: Improvement > Components: Kubernetes > Affects Versions: 2.3.0 > Reporter: Yinan Li > Priority: Major > > As discussed in https://github.com/apache/spark/pull/19954, the current code > for configuring the driver pod and the code for configuring the executor pods > are not using the same abstraction. Besides that, the current code leaves a > lot to be desired in terms of the level and cleanliness of abstraction. For > example, the current code passes many pieces of information around > different class hierarchies, which makes code review and maintenance > challenging. We need some thorough refactoring of the current code to achieve > better, cleaner, and more consistent abstraction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23511) Catalyst: Implement GetField
Nadav Samet created SPARK-23511: --- Summary: Catalyst: Implement GetField Key: SPARK-23511 URL: https://issues.apache.org/jira/browse/SPARK-23511 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.2.1 Reporter: Nadav Samet Similar to Invoke, StaticInvoke, and NewInstance, it would be nice to have GetStaticField(expression, fieldName). My use case is invoking a method on a companion object given the class of the companion object itself; it turns out its methods are not static. I'd like to be able to do something like this: Invoke(GetStaticField(cls, "MODULE$", ...), "someMethod", ...) My workaround is passing the companion object to Invoke() directly via Literal.fromObject, but I think having a general solution for calling methods on companion objects would be better. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
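For background on why the field access is needed: a Scala companion object's singleton instance is stored in a static field named MODULE$ on the generated `Foo$` class, which is exactly what the proposed GetStaticField would read before Invoke calls an instance method on it. A plain-reflection sketch of those two steps (illustrative only; `Greeter` is a made-up example, assumed to be a compiled top-level object, and the real feature would emit this access in Catalyst-generated code):

```scala
object Greeter {
  def greet(name: String): String = s"Hello, $name"
}

// Step 1 -- the equivalent of GetStaticField(cls, "MODULE$"):
val cls    = Class.forName("Greeter$")            // companion class of `object Greeter`
val module = cls.getField("MODULE$").get(null)    // static field holding the singleton

// Step 2 -- the equivalent of Invoke(<module>, "someMethod", ...):
val out = cls.getMethod("greet", classOf[String]).invoke(module, "Catalyst")
```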
[jira] [Comment Edited] (SPARK-16996) Hive ACID delta files not seen
[ https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375758#comment-16375758 ] Frédéric ESCANDELL edited comment on SPARK-16996 at 2/24/18 8:16 PM: - On HDP 2.6, I confirm that the steps described by [~maver1ck] work. [~ste...@apache.org], why did Hortonworks integrate Spark 2 with an older version of Hive 1.2 than the one distributed in HDP? was (Author: fescandell): On HDP 2.6, I confirm that the steps described by @Maciej Bryński work. @Steve Loughran, why did Hortonworks integrate Spark 2 with an older version of Hive 1.2 than the one distributed in HDP? > Hive ACID delta files not seen > -- > > Key: SPARK-16996 > URL: https://issues.apache.org/jira/browse/SPARK-16996 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0 > Environment: Hive 1.2.1, Spark 1.5.2 > Reporter: Benjamin BONNET > Priority: Critical > > spark-sql seems not to see data stored as delta files in an ACID Hive table.
> Actually I encountered the same problem as described here: > http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp > For example, create an ACID table with the Hive CLI and insert a row: > {code} > set hive.support.concurrency=true; > set hive.enforce.bucketing=true; > set hive.exec.dynamic.partition.mode=nonstrict; > set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; > set hive.compactor.initiator.on=true; > set hive.compactor.worker.threads=1; > CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 > BUCKETS > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > TBLPROPERTIES ('transactional'='true'); > INSERT INTO deltas VALUES("a","a"); > {code} > Then run a query with the spark-sql CLI: > {code} > SELECT * FROM deltas; > {code} > That query gets no result and there are no errors in the logs.
> If you go to HDFS to inspect the table files, you find only deltas: > {code} > ~>hdfs dfs -ls /apps/hive/warehouse/deltas > Found 1 items > drwxr-x--- - me hdfs 0 2016-08-10 14:03 > /apps/hive/warehouse/deltas/delta_0020943_0020943 > {code} > Then if you run compaction on that table (in the Hive CLI): > {code} > ALTER TABLE deltas COMPACT 'MAJOR'; > {code} > As a result, the delta will be compacted into a base file: > {code} > ~>hdfs dfs -ls /apps/hive/warehouse/deltas > Found 1 items > drwxrwxrwx - me hdfs 0 2016-08-10 15:25 > /apps/hive/warehouse/deltas/base_0020943 > {code} > Go back to spark-sql and the same query gets a result: > {code} > SELECT * FROM deltas; > a a > Time taken: 0.477 seconds, Fetched 1 row(s) > {code} > But the next time you make an insert into the Hive table: > {code} > INSERT INTO deltas VALUES("b","b"); > {code} > spark-sql will immediately see the changes: > {code} > SELECT * FROM deltas; > a a > b b > Time taken: 0.122 seconds, Fetched 2 row(s) > {code} > Yet there was no other compaction, but spark-sql "sees" the base AND the > delta file: > {code} > ~> hdfs dfs -ls /apps/hive/warehouse/deltas > Found 2 items > drwxrwxrwx - valdata hdfs 0 2016-08-10 15:25 > /apps/hive/warehouse/deltas/base_0020943 > drwxr-x--- - valdata hdfs 0 2016-08-10 15:31 > /apps/hive/warehouse/deltas/delta_0020956_0020956 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16996) Hive ACID delta files not seen
[ https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375758#comment-16375758 ] Frédéric ESCANDELL edited comment on SPARK-16996 at 2/24/18 8:15 PM: - On HDP 2.6, I confirm that the steps described by @Maciej Bryński work. @Steve Loughran, why did Hortonworks integrate Spark 2 with an older version of Hive 1.2 than the one distributed in HDP? was (Author: fescandell): On HDP 2.6, I confirm that the steps described by Maciej Bryński work. Steve Loughran, why did Hortonworks integrate Spark 2 with an older version of Hive 1.2 than the one distributed in HDP? > Hive ACID delta files not seen > -- > > Key: SPARK-16996 > URL: https://issues.apache.org/jira/browse/SPARK-16996 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0 > Environment: Hive 1.2.1, Spark 1.5.2 > Reporter: Benjamin BONNET > Priority: Critical > > spark-sql seems not to see data stored as delta files in an ACID Hive table.
> Actually I encountered the same problem as describe here : > http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp > For example, create an ACID table with HiveCLI and insert a row : > {code} > set hive.support.concurrency=true; > set hive.enforce.bucketing=true; > set hive.exec.dynamic.partition.mode=nonstrict; > set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; > set hive.compactor.initiator.on=true; > set hive.compactor.worker.threads=1; > CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 > BUCKETS > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > TBLPROPERTIES ('transactional'='true'); > INSERT INTO deltas VALUES("a","a"); > {code} > Then make a query with spark-sql CLI : > {code} > SELECT * FROM deltas; > {code} > That query gets no result and there are no errors in logs. 
> If you go to HDFS to inspect table files, you find only deltas > {code} > ~>hdfs dfs -ls /apps/hive/warehouse/deltas > Found 1 items > drwxr-x--- - me hdfs 0 2016-08-10 14:03 > /apps/hive/warehouse/deltas/delta_0020943_0020943 > {code} > Then if you run compaction on that table (in HiveCLI) : > {code} > ALTER TABLE deltas COMPACT 'MAJOR'; > {code} > As a result, the delta will be compute into a base file : > {code} > ~>hdfs dfs -ls /apps/hive/warehouse/deltas > Found 1 items > drwxrwxrwx - me hdfs 0 2016-08-10 15:25 > /apps/hive/warehouse/deltas/base_0020943 > {code} > Go back to spark-sql and the same query gets a result : > {code} > SELECT * FROM deltas; > a a > Time taken: 0.477 seconds, Fetched 1 row(s) > {code} > But next time you make an insert into Hive table : > {code} > INSERT INTO deltas VALUES("b","b"); > {code} > spark-sql will immediately see changes : > {code} > SELECT * FROM deltas; > a a > b b > Time taken: 0.122 seconds, Fetched 2 row(s) > {code} > Yet there was no other compaction, but spark-sql "sees" the base AND the > delta file : > {code} > ~> hdfs dfs -ls /apps/hive/warehouse/deltas > Found 2 items > drwxrwxrwx - valdata hdfs 0 2016-08-10 15:25 > /apps/hive/warehouse/deltas/base_0020943 > drwxr-x--- - valdata hdfs 0 2016-08-10 15:31 > /apps/hive/warehouse/deltas/delta_0020956_0020956 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16996) Hive ACID delta files not seen
[ https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375758#comment-16375758 ] Frédéric ESCANDELL commented on SPARK-16996: On Hdp 2.6, i confirm that the steps described by Maciej Bryński work. Steve Loughran, why did hortonworks integrate Spark 2 with an older version of Hive 1.2 than the one distributed in HDP ? > Hive ACID delta files not seen > -- > > Key: SPARK-16996 > URL: https://issues.apache.org/jira/browse/SPARK-16996 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0 > Environment: Hive 1.2.1, Spark 1.5.2 >Reporter: Benjamin BONNET >Priority: Critical > > spark-sql seems not to see data stored as delta files in an ACID Hive table. > Actually I encountered the same problem as describe here : > http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp > For example, create an ACID table with HiveCLI and insert a row : > {code} > set hive.support.concurrency=true; > set hive.enforce.bucketing=true; > set hive.exec.dynamic.partition.mode=nonstrict; > set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; > set hive.compactor.initiator.on=true; > set hive.compactor.worker.threads=1; > CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 > BUCKETS > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > TBLPROPERTIES ('transactional'='true'); > INSERT INTO deltas VALUES("a","a"); > {code} > Then make a query with spark-sql CLI : > {code} > SELECT * FROM deltas; > {code} > That query gets no result and there are no errors in logs. 
> If you go to HDFS to inspect table files, you find only deltas > {code} > ~>hdfs dfs -ls /apps/hive/warehouse/deltas > Found 1 items > drwxr-x--- - me hdfs 0 2016-08-10 14:03 > /apps/hive/warehouse/deltas/delta_0020943_0020943 > {code} > Then if you run compaction on that table (in HiveCLI) : > {code} > ALTER TABLE deltas COMPACT 'MAJOR'; > {code} > As a result, the delta will be compute into a base file : > {code} > ~>hdfs dfs -ls /apps/hive/warehouse/deltas > Found 1 items > drwxrwxrwx - me hdfs 0 2016-08-10 15:25 > /apps/hive/warehouse/deltas/base_0020943 > {code} > Go back to spark-sql and the same query gets a result : > {code} > SELECT * FROM deltas; > a a > Time taken: 0.477 seconds, Fetched 1 row(s) > {code} > But next time you make an insert into Hive table : > {code} > INSERT INTO deltas VALUES("b","b"); > {code} > spark-sql will immediately see changes : > {code} > SELECT * FROM deltas; > a a > b b > Time taken: 0.122 seconds, Fetched 2 row(s) > {code} > Yet there was no other compaction, but spark-sql "sees" the base AND the > delta file : > {code} > ~> hdfs dfs -ls /apps/hive/warehouse/deltas > Found 2 items > drwxrwxrwx - valdata hdfs 0 2016-08-10 15:25 > /apps/hive/warehouse/deltas/base_0020943 > drwxr-x--- - valdata hdfs 0 2016-08-10 15:31 > /apps/hive/warehouse/deltas/delta_0020956_0020956 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23458) Flaky test: OrcQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23458: -- Issue Type: Bug (was: Task) > Flaky test: OrcQuerySuite > -- > > Key: SPARK-23458 > URL: https://issues.apache.org/jira/browse/SPARK-23458 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 > Environment: AMPLab Jenkins >Reporter: Marco Gaido >Priority: Major > > Sometimes we have UT failures with the following stacktrace: > {code:java} > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01396221801 seconds. Last failure message: There are 1 possibly leaked > file streams.. > at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) > at > org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308) > at > org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45) > at > org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114) > at > org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.afterEach(OrcQuerySuite.scala:583) > at > org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) > at > org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379) > at > org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375) > at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454) > at org.scalatest.Status$class.withAfterEffect(Status.scala:375) > at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232) > at > 
org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.runTest(OrcQuerySuite.scala:583) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are > 1 possibly leaked file streams. > at > org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:54) > at > org.apache.spark.sql.test.SharedSparkSession$$anonfun$afterEach$1.apply$mcV$sp(SharedSparkSession.scala:115) > at >
[jira] [Comment Edited] (SPARK-23458) Flaky test: OrcQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375721#comment-16375721 ] Dongjoon Hyun edited comment on SPARK-23458 at 2/24/18 6:37 PM: I updated the title because the reported URL is OrcQuerySuite, and added a link to `ParquetQuerySuite` because the OrcQuerySuite test `Enabling/disabling ignoreCorruptFiles` comes from `ParquetQuerySuite`. I'm looking at the following three together. - ParquetQuerySuite - OrcQuerySuite - FileBasedDataSourceSuite was (Author: dongjoon): I added a link to `ParquetQuerySuite` because the OrcQuerySuite test `Enabling/disabling ignoreCorruptFiles` comes from `ParquetQuerySuite`. I'm looking at the following three together. - ParquetQuerySuite - OrcQuerySuite - FileBasedDataSourceSuite
[jira] [Updated] (SPARK-23458) Flaky test: OrcQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23458: -- Component/s: Tests
[jira] [Updated] (SPARK-23458) Flaky test: OrcQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23458: -- Summary: Flaky test: OrcQuerySuite (was: OrcSuite flaky test)
[jira] [Comment Edited] (SPARK-23458) OrcSuite flaky test
[ https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375721#comment-16375721 ] Dongjoon Hyun edited comment on SPARK-23458 at 2/24/18 6:34 PM: I added a link to `ParquetQuerySuite` because the OrcQuerySuite test `Enabling/disabling ignoreCorruptFiles` comes from `ParquetQuerySuite`. I'm looking at the following three together. - ParquetQuerySuite - OrcQuerySuite - FileBasedDataSourceSuite was (Author: dongjoon): I added a link to `ParquetQuerySuite` because the OrcQuerySuite test `Enabling/disabling ignoreCorruptFiles` comes from `ParquetQuerySuite`.
[jira] [Commented] (SPARK-23458) OrcSuite flaky test
[ https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375721#comment-16375721 ] Dongjoon Hyun commented on SPARK-23458: --- I added a link to `ParquetQuerySuite` because the OrcQuerySuite test `Enabling/disabling ignoreCorruptFiles` comes from `ParquetQuerySuite`.
[jira] [Commented] (SPARK-23510) Support read data from Hive 2.2 and Hive 2.3 metastore
[ https://issues.apache.org/jira/browse/SPARK-23510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375660#comment-16375660 ] Yuming Wang commented on SPARK-23510: - [~JPMoresmau] Can you try https://github.com/apache/spark/pull/20668? > Support read data from Hive 2.2 and Hive 2.3 metastore > -- > > Key: SPARK-23510 > URL: https://issues.apache.org/jira/browse/SPARK-23510 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23510) Support read data from Hive 2.2 and Hive 2.3 metastore
[ https://issues.apache.org/jira/browse/SPARK-23510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375636#comment-16375636 ] Apache Spark commented on SPARK-23510: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20668
[jira] [Assigned] (SPARK-23510) Support read data from Hive 2.2 and Hive 2.3 metastore
[ https://issues.apache.org/jira/browse/SPARK-23510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23510: Assignee: Apache Spark
[jira] [Assigned] (SPARK-23510) Support read data from Hive 2.2 and Hive 2.3 metastore
[ https://issues.apache.org/jira/browse/SPARK-23510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23510: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-23510) Support read data from Hive 2.2 and Hive 2.3 metastore
Yuming Wang created SPARK-23510: --- Summary: Support read data from Hive 2.2 and Hive 2.3 metastore Key: SPARK-23510 URL: https://issues.apache.org/jira/browse/SPARK-23510 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang
[jira] [Commented] (SPARK-20411) New features for expression.scalalang.typed
[ https://issues.apache.org/jira/browse/SPARK-20411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375559#comment-16375559 ] Diego Fanesi commented on SPARK-20411: -- In SPARK-20890, new default aggregators for Long and Double are being added. I think this is still not enough. Spark should provide a trait that requires the sum and compare operators to be defined, and the default aggregators sum(), min(), and max() should work on every type that extends the trait. We could also make a separate trait per operator so we don't force the developer to implement both operators when only one is needed. This would let any developer use the default aggregators for any custom type without having to define a custom aggregator for every new case class. > New features for expression.scalalang.typed > --- > > Key: SPARK-20411 > URL: https://issues.apache.org/jira/browse/SPARK-20411 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Loic Descotte >Priority: Minor > > In Spark 2 it is possible to use typed expressions for aggregation methods: > {code} > import org.apache.spark.sql.expressions.scalalang._ > dataset.groupByKey(_.productId).agg(typed.sum[Token](_.score)).toDF("productId", > "sum").orderBy('productId).show > {code} > It seems that only avg, count and sum are defined: > https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/expressions/scalalang/typed.html > It is very nice to be able to use a typesafe DSL, but it would be good to > have more possibilities, like min and max functions.
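The trait-based design proposed in the comment above can be illustrated outside Spark. The following is a minimal, hypothetical sketch — none of these names (`Summable`, `Aggregators`, `Score`) exist in Spark's API; it only shows how a type class could let generic sum/min/max work for any case class that opts in:

```scala
// Hypothetical sketch (not Spark API): a type class that supplies the
// operators the default aggregators need.
trait Summable[T] {
  def plus(a: T, b: T): T
  def zero: T
}

object Aggregators {
  // Generic sum for any type with a Summable instance in scope.
  def sum[T](xs: Seq[T])(implicit s: Summable[T]): T =
    xs.foldLeft(s.zero)(s.plus)

  // min/max only need a comparison, modeled here with a separate
  // requirement (Ordering), as the comment suggests.
  def min[T](xs: Seq[T])(implicit ord: Ordering[T]): T = xs.min
  def max[T](xs: Seq[T])(implicit ord: Ordering[T]): T = xs.max
}

// A user-defined case class opts in once, in its companion object...
case class Score(value: Double)

object Score {
  implicit val summable: Summable[Score] = new Summable[Score] {
    def plus(a: Score, b: Score): Score = Score(a.value + b.value)
    def zero: Score = Score(0.0)
  }
  implicit val ordering: Ordering[Score] = Ordering.by[Score, Double](_.value)
}
```

With this in place, `Aggregators.sum(Seq(Score(1.0), Score(2.0)))` yields `Score(3.0)` without a per-type aggregator; a Spark version of the idea would wrap the same trait inside a typed `Aggregator`.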
[jira] [Updated] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1
[ https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PandaMonkey updated SPARK-23509: Description: Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively introduces commons-net:3.1. At the same time, Spark-core directly depends on an older version, commons-net:2.2. Looking further into the source code, these two versions of commons-net differ in many features. This dependency conflict brings a high risk of "NoClassDefFoundError" or "NoSuchMethodError" issues at runtime. Please notice this problem. Upgrading commons-net from 2.2 to 3.1 may be a good choice. Hope this report can help you. Thanks! Regards, Panda was: Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively introduces commons-net:3.1. At the same time, Spark-core directly depends on an older version, commons-net:2.2. Looking further into the source code, these two versions of commons-net differ in many features. This dependency conflict brings a high risk of "NoClassDefFoundError" or "NoSuchMethodError" issues at runtime. Please notice this problem. Upgrading commons-net from 2.2 to 3.1 may be a good choice. Please notice this problem. Hope this report can help you. Thanks! Regards, Panda > Upgrade commons-net from 2.2 to 3.1 > --- > > Key: SPARK-23509 > URL: https://issues.apache.org/jira/browse/SPARK-23509 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PandaMonkey >Priority: Major > Fix For: 2.4.0 > > Attachments: spark.txt > > > Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core > depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively > introduces commons-net:3.1. At the same time, Spark-core directly depends on > an older version, commons-net:2.2. Looking further into the source code, these > two versions of commons-net differ in many features. This dependency conflict > brings a high risk of "NoClassDefFoundError" or "NoSuchMethodError" issues at > runtime. Please notice this problem. Upgrading commons-net from 2.2 to 3.1 may > be a good choice. Hope this report can help you. Thanks! > > Regards, > Panda
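For readers hitting the same kind of conflict, the usual Maven diagnosis is `mvn dependency:tree -Dincludes=commons-net`, and one common resolution — sketched here as an illustration, not the actual Spark patch — is to exclude the transitive copy and declare the wanted version directly:

```xml
<!-- Illustrative POM fragment only; the coordinates match the report above,
     but this is not the change Spark ultimately applied. Excluding the
     transitive commons-net and declaring 3.1 directly pins one version. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.6.5</version>
  <exclusions>
    <exclusion>
      <groupId>commons-net</groupId>
      <artifactId>commons-net</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>commons-net</groupId>
  <artifactId>commons-net</artifactId>
  <version>3.1</version>
</dependency>
```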
[jira] [Created] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1
PandaMonkey created SPARK-23509: --- Summary: Upgrade commons-net from 2.2 to 3.1 Key: SPARK-23509 URL: https://issues.apache.org/jira/browse/SPARK-23509 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 2.4.0 Reporter: PandaMonkey Fix For: 2.4.0 Attachments: spark.txt Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively introduces commons-net:3.1. At the same time, Spark-core directly depends on an older version, commons-net:2.2. Looking further into the source code, these two versions of commons-net differ in many features. This dependency conflict brings a high risk of "NoClassDefFoundError" or "NoSuchMethodError" issues at runtime. Please notice this problem. Upgrading commons-net from 2.2 to 3.1 may be a good choice. Hope this report can help you. Thanks! Regards, Panda
[jira] [Updated] (SPARK-23509) Upgrade commons-net from 2.2 to 3.1
[ https://issues.apache.org/jira/browse/SPARK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PandaMonkey updated SPARK-23509: Attachment: spark.txt > Upgrade commons-net from 2.2 to 3.1 > --- > > Key: SPARK-23509 > URL: https://issues.apache.org/jira/browse/SPARK-23509 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: PandaMonkey >Priority: Major > Fix For: 2.4.0 > > Attachments: spark.txt > > > Hi, after analyzing spark-master\core\pom.xml, we found that Spark-core > depends on org.apache.hadoop:hadoop-client:2.6.5, which transitively > introduces commons-net:3.1. At the same time, Spark-core directly depends on > an older version, commons-net:2.2. Looking further into the source code, these > two versions of commons-net differ in many features. This dependency conflict > brings a high risk of "NoClassDefFoundError" or "NoSuchMethodError" issues at > runtime. Please notice this problem. Upgrading commons-net from 2.2 to 3.1 may > be a good choice. Hope this report can help you. Thanks! > > Regards, > Panda
[jira] [Commented] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom
[ https://issues.apache.org/jira/browse/SPARK-23508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375492#comment-16375492 ] Apache Spark commented on SPARK-23508: -- User 'caneGuy' has created a pull request for this issue: https://github.com/apache/spark/pull/20667 > blockManagerIdCache in BlockManagerId may cause oom > --- > > Key: SPARK-23508 > URL: https://issues.apache.org/jira/browse/SPARK-23508 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.1.1, 2.2.1 >Reporter: zhoukang >Priority: Major > Attachments: elepahnt-oom1.png, elephant-oom.png > > > blockManagerIdCache in BlockManagerId does not remove old values, which may > cause an OOM > {code:java} > val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, > BlockManagerId]() > {code} > Whenever we apply for a new BlockManagerId, it is put into this map. > Below is a jmap snapshot: > !elepahnt-oom1.png! > !elephant-oom.png!
[jira] [Assigned] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom
[ https://issues.apache.org/jira/browse/SPARK-23508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23508: Assignee: (was: Apache Spark) > blockManagerIdCache in BlockManagerId may cause oom > --- > > Key: SPARK-23508 > URL: https://issues.apache.org/jira/browse/SPARK-23508 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.1.1, 2.2.1 >Reporter: zhoukang >Priority: Major > Attachments: elepahnt-oom1.png, elephant-oom.png > > > blockManagerIdCache in BlockManagerId does not remove old values, which may > cause an OOM > {code:java} > val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, > BlockManagerId]() > {code} > Whenever we apply for a new BlockManagerId, it is put into this map. > Below is a jmap snapshot: > !elepahnt-oom1.png! > !elephant-oom.png!
[jira] [Assigned] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom
[ https://issues.apache.org/jira/browse/SPARK-23508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23508: Assignee: Apache Spark > blockManagerIdCache in BlockManagerId may cause oom > --- > > Key: SPARK-23508 > URL: https://issues.apache.org/jira/browse/SPARK-23508 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.1.1, 2.2.1 >Reporter: zhoukang >Assignee: Apache Spark >Priority: Major > Attachments: elepahnt-oom1.png, elephant-oom.png > > > blockManagerIdCache in BlockManagerId does not remove old values, which may > cause an OOM > {code:java} > val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, > BlockManagerId]() > {code} > Whenever we apply for a new BlockManagerId, it is put into this map. > Below is a jmap snapshot: > !elepahnt-oom1.png! > !elephant-oom.png!
[jira] [Updated] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom
[ https://issues.apache.org/jira/browse/SPARK-23508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23508: - Description: blockManagerIdCache in BlockManagerId does not remove old values, which may cause an OOM {code:java} val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]() {code} Whenever we apply for a new BlockManagerId, it is put into this map. Below is a jmap snapshot: !elepahnt-oom1.png! !elephant-oom.png! was: blockManagerIdCache in BlockManagerId does not remove old values, which may cause an OOM {code:java} val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]() {code} Whenever we apply for a new BlockManagerId, it is put into this map. !elephant-oom.png! > blockManagerIdCache in BlockManagerId may cause oom > --- > > Key: SPARK-23508 > URL: https://issues.apache.org/jira/browse/SPARK-23508 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.1.1, 2.2.1 >Reporter: zhoukang >Priority: Major > Attachments: elepahnt-oom1.png, elephant-oom.png > > > blockManagerIdCache in BlockManagerId does not remove old values, which may > cause an OOM > {code:java} > val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, > BlockManagerId]() > {code} > Whenever we apply for a new BlockManagerId, it is put into this map. > Below is a jmap snapshot: > !elepahnt-oom1.png! > !elephant-oom.png!
[jira] [Updated] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom
[ https://issues.apache.org/jira/browse/SPARK-23508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23508: - Description: blockManagerIdCache in BlockManagerId does not remove old values, which may cause an OOM {code:java} val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]() {code} Whenever we apply for a new BlockManagerId, it is put into this map. !elephant-oom.png! was: blockManagerIdCache in BlockManagerId does not remove old values, which may cause an OOM {code:java} val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]() {code} Whenever we apply for a new BlockManagerId, it is put into this map. > blockManagerIdCache in BlockManagerId may cause oom > --- > > Key: SPARK-23508 > URL: https://issues.apache.org/jira/browse/SPARK-23508 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.1.1, 2.2.1 >Reporter: zhoukang >Priority: Major > Attachments: elepahnt-oom1.png, elephant-oom.png > > > blockManagerIdCache in BlockManagerId does not remove old values, which may > cause an OOM > {code:java} > val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, > BlockManagerId]() > {code} > Whenever we apply for a new BlockManagerId, it is put into this map. > !elephant-oom.png!
[jira] [Updated] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom
[ https://issues.apache.org/jira/browse/SPARK-23508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-23508: - Attachment: elephant-oom.png elepahnt-oom1.png > blockManagerIdCache in BlockManagerId may cause oom > --- > > Key: SPARK-23508 > URL: https://issues.apache.org/jira/browse/SPARK-23508 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 2.1.1, 2.2.1 >Reporter: zhoukang >Priority: Major > Attachments: elepahnt-oom1.png, elephant-oom.png > > > blockManagerIdCache in BlockManagerId does not remove old values, which may > cause an OOM > {code:java} > val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, > BlockManagerId]() > {code} > Whenever we apply for a new BlockManagerId, it is put into this map.
[jira] [Created] (SPARK-23508) blockManagerIdCache in BlockManagerId may cause oom
zhoukang created SPARK-23508: Summary: blockManagerIdCache in BlockManagerId may cause oom Key: SPARK-23508 URL: https://issues.apache.org/jira/browse/SPARK-23508 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 2.2.1, 2.1.1 Reporter: zhoukang blockManagerIdCache in BlockManagerId does not remove old values, which may cause an OOM {code:java} val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]() {code} Whenever we apply for a new BlockManagerId, it is put into this map.
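One possible mitigation for an unbounded interning cache like the one above — sketched here as an illustration, not necessarily what the pull request for this issue does — is a size-bounded Guava cache, so stale entries can be evicted instead of accumulating until OOM. The BlockManagerId case class below is a simplified stand-in for Spark's real class:

```scala
// Sketch only: replace the unbounded ConcurrentHashMap with a bounded
// Guava LoadingCache. The 10000-entry cap is an illustrative choice.
import com.google.common.cache.{CacheBuilder, CacheLoader}

// Simplified stand-in for org.apache.spark.storage.BlockManagerId.
case class BlockManagerId(executorId: String, host: String, port: Int)

val blockManagerIdCache = CacheBuilder.newBuilder()
  .maximumSize(10000)
  .build(new CacheLoader[BlockManagerId, BlockManagerId]() {
    // Interning: the id itself is the canonical value to cache.
    override def load(id: BlockManagerId): BlockManagerId = id
  })

// Lookup returns the canonical instance; old entries are evicted
// once the cache exceeds its maximum size.
def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId =
  blockManagerIdCache.get(id)
```

The interning behavior is preserved (equal ids map to one shared instance), but memory use is now bounded by the cache size rather than by the lifetime of the driver.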
[jira] [Commented] (SPARK-23448) Dataframe returns wrong result when column don't respect datatype
[ https://issues.apache.org/jira/browse/SPARK-23448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375439#comment-16375439 ] Liang-Chi Hsieh commented on SPARK-23448: - In fact this is exactly the JSON parser's behavior, not a bug. We don't allow partial results for corrupted records. Except for the field configured by {{columnNameOfCorruptRecord}}, all fields will be set to {{null}}. > Dataframe returns wrong result when column don't respect datatype > - > > Key: SPARK-23448 > URL: https://issues.apache.org/jira/browse/SPARK-23448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: Local >Reporter: Ahmed ZAROUI >Priority: Major > > I have the following JSON file that contains some noisy data (a String instead > of an Array): > > {code:java} > {"attr1":"val1","attr2":"[\"val2\"]"} > {"attr1":"val1","attr2":["val2"]} > {code} > And I need to specify the schema programmatically like this: > > {code:java} > implicit val spark = SparkSession > .builder() > .master("local[*]") > .config("spark.ui.enabled", false) > .config("spark.sql.caseSensitive", "True") > .getOrCreate() > import spark.implicits._ > val schema = StructType( > Seq(StructField("attr1", StringType, true), > StructField("attr2", ArrayType(StringType, true), true))) > spark.read.schema(schema).json(input).collect().foreach(println) > {code} > The result given by this code is: > {code:java} > [null,null] > [val1,WrappedArray(val2)] > {code} > Instead of putting null only in the corrupted column, all columns of the first > message are null.
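The behavior described in the comment can be made visible by adding the corrupt-record column to the schema. The sketch below uses `_corrupt_record`, the default value of `spark.sql.columnNameOfCorruptRecord`; the exact output layout may vary across Spark versions:

```scala
// Sketch: in PERMISSIVE mode a malformed record nulls out every field
// except the corrupt-record column, which receives the raw input text.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("attr1", StringType, true),
  StructField("attr2", ArrayType(StringType, true), true),
  // Default name from spark.sql.columnNameOfCorruptRecord.
  StructField("_corrupt_record", StringType, true)))

val input = Seq(
  """{"attr1":"val1","attr2":"[\"val2\"]"}""", // attr2 is a String, not an Array
  """{"attr1":"val1","attr2":["val2"]}""").toDS()

spark.read.schema(schema).json(input).show(false)
// The first row should carry the raw JSON in _corrupt_record with the
// other columns null; the second row parses normally.
```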
[jira] [Created] (SPARK-23507) Migrate file-based data sources to data source v2
Gengliang Wang created SPARK-23507: -- Summary: Migrate file-based data sources to data source v2 Key: SPARK-23507 URL: https://issues.apache.org/jira/browse/SPARK-23507 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Gengliang Wang Migrate file-based data sources to data source v2, including: # Parquet # ORC # Json # CSV # JDBC # Text