[jira] [Updated] (SPARK-27669) Refactor DataFrameWriter to resolve datasources in a command
[ https://issues.apache.org/jira/browse/SPARK-27669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Liang updated SPARK-27669:
-------------------------------
    Summary: Refactor DataFrameWriter to resolve datasources in a command  (was: Refactor DataFrameWriter to always go through Catalyst for analysis)

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27669) Refactor DataFrameWriter to always go through Catalyst for analysis
Eric Liang created SPARK-27669:
-------------------------------

             Summary: Refactor DataFrameWriter to always go through Catalyst for analysis
                 Key: SPARK-27669
                 URL: https://issues.apache.org/jira/browse/SPARK-27669
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.3
            Reporter: Eric Liang

Currently, DataFrameWriter.save() does a large amount of ad-hoc analysis (e.g., loading data source classes, validating options, and so on) before executing the command. This code runs outside the scope of any SQL execution, which is unfortunate: it is untracked by Spark (e.g., in the Spark UI), and df.write operations cannot be manipulated by custom Catalyst rules prior to execution. These issues can be largely resolved by creating a command that represents df.write.save()/saveAsTable().
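A minimal sketch of what "resolve datasources in a command" could look like. The command name and fields below are assumptions for illustration, not the final Spark API; the point is that the resolution and validation logic moves from DataFrameWriter.save() into a logical command that executes inside a tracked SQL execution.

```scala
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.command.RunnableCommand

// Illustrative sketch only: the command name and fields are assumptions,
// not the final Spark API.
case class ResolveAndSaveCommand(
    query: LogicalPlan,
    provider: String,
    options: Map[String, String],
    mode: SaveMode) extends RunnableCommand {

  override def run(sparkSession: SparkSession): Seq[Row] = {
    // Loading the data source class and validating options would happen
    // here, inside a tracked SQL execution (visible in the Spark UI) and
    // after custom Catalyst rules have had a chance to transform the plan.
    // (Actual resolution and write dispatch are elided in this sketch.)
    Seq.empty[Row]
  }
}
```

Because the command is a logical plan node, a custom rule injected via the experimental extension points can now see and rewrite df.write operations before they run.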
[jira] [Created] (SPARK-27392) TestHive test tables should be placed in shared test state, not per session
Eric Liang created SPARK-27392:
-------------------------------

             Summary: TestHive test tables should be placed in shared test state, not per session
                 Key: SPARK-27392
                 URL: https://issues.apache.org/jira/browse/SPARK-27392
             Project: Spark
          Issue Type: Test
          Components: SQL
    Affects Versions: 2.4.1
            Reporter: Eric Liang

Otherwise, tests that access the same table from multiple sessions will run into issues. The correct location is in the shared test state.
[jira] [Created] (SPARK-23971) Should not leak Spark sessions across test suites
Eric Liang created SPARK-23971:
-------------------------------

             Summary: Should not leak Spark sessions across test suites
                 Key: SPARK-23971
                 URL: https://issues.apache.org/jira/browse/SPARK-23971
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Eric Liang

Many suites currently leak Spark sessions (sometimes with stopped SparkContexts) via the thread-local active Spark session and default Spark session. We should attempt to clean these up, and detect when this happens, to improve the reproducibility of tests.
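One way to do both the cleanup and the detection is a shared ScalaTest trait that suites mix in. This is a sketch (the trait name is illustrative); the SparkSession companion-object methods it calls are the real public API.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, Suite}

// Sketch of a shared cleanup/detection trait; the name SessionHygiene
// is illustrative, not an actual Spark test utility.
trait SessionHygiene extends BeforeAndAfterEach { self: Suite =>
  override protected def afterEach(): Unit = {
    try {
      // Fail fast if this suite leaked a session, instead of letting the
      // leak silently poison whichever suite happens to run next.
      assert(SparkSession.getActiveSession.isEmpty, "leaked active session")
      assert(SparkSession.getDefaultSession.isEmpty, "leaked default session")
    } finally {
      SparkSession.clearActiveSession()
      SparkSession.clearDefaultSession()
      super.afterEach()
    }
  }
}
```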
[jira] [Created] (SPARK-23809) Active SparkSession should be set by getOrCreate
Eric Liang created SPARK-23809:
-------------------------------

             Summary: Active SparkSession should be set by getOrCreate
                 Key: SPARK-23809
                 URL: https://issues.apache.org/jira/browse/SPARK-23809
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Eric Liang

Currently, the active SparkSession is set inconsistently (e.g., in createDataFrame, prior to query execution). Many places in Spark also incorrectly query the active session when they should call activeSession.getOrElse(defaultSession). The semantics here can be cleaned up if we also set the active session whenever the default session is set.
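A sketch of the proposed semantics. The object and method bodies below are illustrative (they mirror, but are not, the real SparkSession companion internals): setting the default session also sets the thread-local active session, and lookups fall back from active to default.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative model of the proposed lookup semantics; not the actual
// SparkSession companion-object implementation.
object SessionLookup {
  private val activeSession =
    new InheritableThreadLocal[Option[SparkSession]] {
      override def initialValue(): Option[SparkSession] = None
    }
  @volatile private var defaultSession: Option[SparkSession] = None

  def setDefault(s: SparkSession): Unit = {
    defaultSession = Some(s)
    // Proposed fix: setting the default also sets the active session,
    // so the two can no longer silently diverge.
    activeSession.set(Some(s))
  }

  // Call sites should use this fallback chain instead of reading the
  // active session directly.
  def current: Option[SparkSession] = activeSession.get.orElse(defaultSession)
}
```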
[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987966#comment-15987966 ]

Eric Liang commented on SPARK-18727:
------------------------------------

+1 for supporting ALTER TABLE REPLACE COLUMNS
[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Liang updated SPARK-18727:
-------------------------------

The common case we see is users having a complete schema (e.g. the output of an ETL pipeline) and wanting to update/merge it in an automated job. In this case it's actually more work to alter the columns one at a time, rather than all at once.
[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Liang updated SPARK-18727:
-------------------------------

Can we add ALTER TABLE SCHEMA to update the entire schema? That would cover any edge cases.
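For concreteness, the two variants under discussion might look like the following. The syntax is indicative only: REPLACE COLUMNS follows the Hive-style form, while a whole-schema ALTER TABLE ... SCHEMA statement is hypothetical and does not exist in Spark.

```scala
// Assumes an active SparkSession named `spark` and a table `events`
// (both names are illustrative).

// Hive-style REPLACE COLUMNS: swaps out the full column list in one statement,
// rather than one ALTER per column.
spark.sql("""
  ALTER TABLE events
  REPLACE COLUMNS (user_id BIGINT, event_time TIMESTAMP, payload STRING)
""")

// Hypothetical whole-schema variant suggested in the comment above;
// this statement is not real Spark SQL:
//   ALTER TABLE events SCHEMA (user_id BIGINT, event_time TIMESTAMP, payload STRING)
```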
[jira] [Commented] (SPARK-20450) Unexpected first-query schema inference cost with 2.1.1 RC
[ https://issues.apache.org/jira/browse/SPARK-20450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981753#comment-15981753 ]

Eric Liang commented on SPARK-20450:
------------------------------------

I'm not sure what you mean by new issue, but it's only in the latest 2.1.1 release afaik. It was not present in the original 2.1.
[jira] [Comment Edited] (SPARK-20450) Unexpected first-query schema inference cost with 2.1.1 RC
[ https://issues.apache.org/jira/browse/SPARK-20450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981753#comment-15981753 ]

Eric Liang edited comment on SPARK-20450 at 4/24/17 7:40 PM:
-------------------------------------------------------------

I'm not sure what you mean by new issue, but it's only in the latest 2.1.1 RC afaik. It was not present in the original 2.1.

was (Author: ekhliang):
I'm not sure what you mean by new issue, but it's only in the latest 2.1.1 release afaik. It was not present in the original 2.1.
[jira] [Created] (SPARK-20450) Unexpected first-query schema inference cost with 2.1.1 RC
Eric Liang created SPARK-20450:
-------------------------------

             Summary: Unexpected first-query schema inference cost with 2.1.1 RC
                 Key: SPARK-20450
                 URL: https://issues.apache.org/jira/browse/SPARK-20450
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.1
            Reporter: Eric Liang
            Priority: Blocker

https://issues.apache.org/jira/browse/SPARK-19611 fixes a regression from 2.0 where Spark silently fails to read case-sensitive fields when the case-sensitive schema is missing from the table properties. The fix is to detect this situation, infer the schema, and write the case-sensitive schema into the metastore.

However, this can incur an unexpected performance hit the first time such a problematic table is queried (and there is a high false-positive rate here, since most tables don't actually have case-sensitive fields).
[jira] [Created] (SPARK-20398) range() operator should include cancellation reason when killed
Eric Liang created SPARK-20398:
-------------------------------

             Summary: range() operator should include cancellation reason when killed
                 Key: SPARK-20398
                 URL: https://issues.apache.org/jira/browse/SPARK-20398
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Eric Liang

https://issues.apache.org/jira/browse/SPARK-19820 adds a reason field for why tasks were killed. However, for backwards compatibility it left the old TaskKilledException constructor, which defaults to "unknown reason". The range() operator should use the constructor that fills in the reason rather than dropping it on task kill.
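The compatibility trap can be sketched as follows. This is a simplified model, not the real org.apache.spark.TaskKilledException source; the onKill helper is hypothetical.

```scala
// Simplified model of the two constructors described above.
class TaskKilledException(val reason: String) extends RuntimeException(reason) {
  // Kept for backwards compatibility; silently drops the real reason.
  def this() = this("unknown reason")
}

// What the range() kill path should do: propagate the actual reason
// instead of falling back to the no-arg constructor.
// (onKill is a hypothetical helper for illustration.)
def onKill(killReason: Option[String]): Nothing =
  throw new TaskKilledException(killReason.getOrElse("unknown reason"))
```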
[jira] [Created] (SPARK-20358) Executors failing stage on interrupted exception thrown by cancelled tasks
Eric Liang created SPARK-20358:
-------------------------------

             Summary: Executors failing stage on interrupted exception thrown by cancelled tasks
                 Key: SPARK-20358
                 URL: https://issues.apache.org/jira/browse/SPARK-20358
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Eric Liang

https://issues.apache.org/jira/browse/SPARK-20217 introduced a regression where interrupted exceptions now cause a task to fail on cancellation. This is because NonFatal(e) does not match when e is an InterruptedException.
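The root cause is visible in how scala.util.control.NonFatal is defined: it deliberately treats InterruptedException as fatal. So a handler written as `case NonFatal(e) => ...` lets the interrupt from a cancelled task fall through to a more severe failure path. A minimal illustration:

```scala
import scala.util.control.NonFatal

// NonFatal matches ordinary exceptions but NOT InterruptedException
// (nor VirtualMachineError, LinkageError, ControlThrowable, ...).
def classify(t: Throwable): String = t match {
  case NonFatal(_) => "non-fatal"
  // A cancelled (interrupted) task falls through to here, so code that
  // only handles NonFatal ends up reporting the kill as a stage failure.
  case _ => "fatal"
}

classify(new RuntimeException("boom"))  // matched by NonFatal
classify(new InterruptedException())    // NOT matched by NonFatal
```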
[jira] [Updated] (SPARK-20217) Executor should not fail stage if killed task throws non-interrupted exception
[ https://issues.apache.org/jira/browse/SPARK-20217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Liang updated SPARK-20217:
-------------------------------
    Description:
This is reproducible as follows. Run the following, and then use SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will fail, since we threw a RuntimeException instead of InterruptedException.

We should probably unconditionally return TaskKilled instead of TaskFailed if the task was killed by the driver, regardless of the actual exception thrown.

{code}
spark.range(100).repartition(100).foreach { i =>
  try {
    Thread.sleep(1000)
  } catch {
    case t: InterruptedException =>
      throw new RuntimeException(t)
  }
}
{code}

Based on the code in TaskSetManager, I think this also affects kills of speculative tasks. However, since the number of speculated tasks is small, and you usually need to fail a task a few times before the stage is cancelled, probably no one noticed this in production.

  was:
This is reproducible as follows. Run the following, and then use SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will fail, since we threw a RuntimeException instead of InterruptedException.

We should probably unconditionally return TaskKilled instead of TaskFailed if the task was killed by the driver, regardless of the actual exception thrown.

{code}
spark.range(100).repartition(100).foreach { i =>
  try {
    Thread.sleep(1000)
  } catch {
    case t: InterruptedException =>
      throw new RuntimeException(t)
  }
}
{code}
[jira] [Created] (SPARK-20217) Executor should not fail stage if killed task throws non-interrupted exception
Eric Liang created SPARK-20217:
-------------------------------

             Summary: Executor should not fail stage if killed task throws non-interrupted exception
                 Key: SPARK-20217
                 URL: https://issues.apache.org/jira/browse/SPARK-20217
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Eric Liang

This is reproducible as follows. Run the following, and then use SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will fail, since we threw a RuntimeException instead of InterruptedException.

We should probably unconditionally return TaskKilled instead of TaskFailed if the task was killed by the driver, regardless of the actual exception thrown.

{code}
spark.range(100).repartition(100).foreach { i =>
  try {
    Thread.sleep(1000)
  } catch {
    case t: InterruptedException =>
      throw new RuntimeException(t)
  }
}
{code}
[jira] [Created] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages
Eric Liang created SPARK-20148:
-------------------------------

             Summary: Extend the file commit interface to allow subscribing to task commit messages
                 Key: SPARK-20148
                 URL: https://issues.apache.org/jira/browse/SPARK-20148
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Eric Liang
            Priority: Minor

The internal FileCommitProtocol interface returns all task commit messages in bulk to the implementation when a job finishes. However, it is sometimes useful to access those messages before the job completes, so that the driver gets incremental progress updates before the job finishes.
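One way to expose this is a driver-side callback that fires as each task commits. The sketch below is illustrative: the method name mirrors the proposal, but the surrounding types are simplified and the exact signature should be treated as an assumption.

```scala
// Simplified sketch of a subscription hook on the commit protocol;
// not the actual org.apache.spark.internal.io.FileCommitProtocol source.
abstract class FileCommitProtocol {
  type TaskCommitMessage // opaque per-task commit payload

  /** Called on the driver as each task's commit message arrives, before
    * the job completes, enabling incremental progress reporting. */
  def onTaskCommit(taskCommit: TaskCommitMessage): Unit = {
    // Default: no-op, so existing implementations are unaffected.
  }
}
```

Making the hook a no-op by default keeps the change backwards compatible: only implementations that care about incremental progress need to override it.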
[jira] [Created] (SPARK-19820) Allow reason to be specified for task kill
Eric Liang created SPARK-19820:
-------------------------------

             Summary: Allow reason to be specified for task kill
                 Key: SPARK-19820
                 URL: https://issues.apache.org/jira/browse/SPARK-19820
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Eric Liang
            Priority: Minor
[jira] [Created] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API
Eric Liang created SPARK-19183:
-------------------------------

             Summary: Add deleteWithJob hook to internal commit protocol API
                 Key: SPARK-19183
                 URL: https://issues.apache.org/jira/browse/SPARK-19183
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Eric Liang

Currently in SQL we implement overwrites by calling fs.delete() directly on the original data. This is not ideal, since the original files end up deleted even if the job aborts. We should extend the commit protocol to allow file overwrites to be managed as well.
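A sketch of such a hook. The method name matches the summary above, but the default semantics shown here are an assumption: the eager fs.delete() stays as the default, while transactional implementations can override the hook to stage deletes until job commit.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative sketch, not the actual commit-protocol source.
abstract class FileCommitProtocol {
  /** Requests deletion of `path` as part of this job. Returns true if the
    * path was (or will be) deleted. The default preserves the old eager
    * behavior; a transactional implementation can instead record the path
    * and only delete it at job commit, so an aborted job loses nothing. */
  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean =
    fs.delete(path, recursive)
}
```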
[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32
[ https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15737148#comment-15737148 ]

Eric Liang commented on SPARK-18814:
------------------------------------

It seems that the references of an Alias expression should include the referenced attribute, so I would expect #39 to still show up. I could be misunderstanding the behavior of Alias though.
[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32
[ https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15737068#comment-15737068 ]

Eric Liang commented on SPARK-18814:
------------------------------------

[~rxin]
[jira] [Created] (SPARK-18814) CheckAnalysis rejects TPCDS query 32
Eric Liang created SPARK-18814:
-------------------------------

             Summary: CheckAnalysis rejects TPCDS query 32
                 Key: SPARK-18814
                 URL: https://issues.apache.org/jira/browse/SPARK-18814
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Eric Liang
            Priority: Blocker

It seems the CheckAnalysis rule introduced by SPARK-18504 incorrectly rejects this TPCDS query, which ran fine in Spark 2.0. There doesn't seem to be any obvious error in the query or the check rule, though: in the plan below, the scalar subquery's condition field is "scalar-subquery#24 [(cs_item_sk#39#111 = i_item_sk#59)]", which should reference cs_item_sk#39. Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by the scalar subquery predicates.

analysis error:

{code}
== Query: q32-v1.4 ==
Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause in a scalar correlated subquery cannot contain non-correlated columns: cs_item_sk#39;;
GlobalLimit 100
+- LocalLimit 100
   +- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
      +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
         :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS cs_item_sk#39#111]
         :     +- Aggregate [cs_item_sk#39], [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
         :        +- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
         :           +- Join Inner
         :              :- SubqueryAlias catalog_sales
         :              :  +- Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,... 10 more fields] parquet
         :              +- SubqueryAlias date_dim
         :                 +- Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,... 4 more fields] parquet
         +- Join Inner
            :- Join Inner
            :  :- SubqueryAlias catalog_sales
            :  :  +- Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,... 10 more fields] parquet
            :  +- SubqueryAlias item
            :     +- Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80] parquet
            +- SubqueryAlias date_dim
               +- Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,... 4 more fields] parquet
{code}

query text:

{code}
select sum(cs_ext_discount_amt) as `excess discount amount`
from catalog_sales, item, date_dim
where
[jira] [Created] (SPARK-18727) Support schema evolution as new files are inserted into table
Eric Liang created SPARK-18727: -- Summary: Support schema evolution as new files are inserted into table Key: SPARK-18727 URL: https://issues.apache.org/jira/browse/SPARK-18727 Project: Spark Issue Type: Bug Affects Versions: 2.1.0 Reporter: Eric Liang Priority: Critical Now that we have pushed partition management of all tables to the catalog, one issue for scalable partition handling remains: handling schema updates. Currently, a schema update requires dropping and recreating the entire table, which does not scale well with the size of the table. We should support updating the schema of the table, either via ALTER TABLE, or automatically as new files with compatible schemas are appended into the table. cc [~rxin] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
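The schema update described above can be sketched as a merge of "compatible" schemas. This is an illustrative sketch only, not Spark's implementation; `Field` and `mergeSchemas` are hypothetical names, with columns modeled as simple name/type pairs: columns present in both schemas must agree on type, and genuinely new columns from incoming files are appended.

```scala
// Hypothetical sketch of schema evolution on append (not Spark's code):
// shared columns must keep their type; new columns are appended.
case class Field(name: String, dataType: String)

def mergeSchemas(table: Seq[Field], incoming: Seq[Field]): Seq[Field] = {
  val byName = table.map(f => f.name -> f).toMap
  incoming.foreach { f =>
    byName.get(f.name).foreach { existing =>
      // A shared column with a different type is an incompatible update.
      require(existing.dataType == f.dataType,
        s"incompatible type for column ${f.name}: ${existing.dataType} vs ${f.dataType}")
    }
  }
  // Keep the table's columns in order, then append genuinely new ones.
  table ++ incoming.filterNot(f => byName.contains(f.name))
}
```

Whether a type change (e.g. int to long) also counts as "compatible" widening is a policy decision left open here.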
[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table
[ https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18727: --- Component/s: SQL
[jira] [Created] (SPARK-18726) Filesystem unnecessarily scanned twice during creation of non-catalog table
Eric Liang created SPARK-18726: -- Summary: Filesystem unnecessarily scanned twice during creation of non-catalog table Key: SPARK-18726 URL: https://issues.apache.org/jira/browse/SPARK-18726 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Eric Liang It seems that for non-catalog tables (e.g. spark.read.parquet(...)), we scan the filesystem twice: once for schema inference, and again to create a FileIndex class for the relation. It would be better to combine these scans, since listing is the most costly step of creating a table. This is a follow-up ticket to https://github.com/apache/spark/pull/16090. cc [~cloud_fan]
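One way to combine the two scans is to memoize the listing so schema inference and the FileIndex consume the same result. The sketch below is illustrative only; `SharedListing`, `listFiles`, and `inferSchema` are hypothetical names, not Spark internals:

```scala
// Illustrative sketch: share one filesystem listing between schema
// inference and the file index instead of scanning twice.
var scanCount = 0

def listFiles(path: String): Seq[String] = {
  scanCount += 1  // each call models one full filesystem scan
  Seq(s"$path/part-00000.parquet", s"$path/part-00001.parquet")
}

class SharedListing(path: String) {
  // Computed once, on first use, then reused by every consumer.
  lazy val files: Seq[String] = listFiles(path)
  def inferSchema(): String = s"inferred from ${files.size} files"
  def fileIndex(): Seq[String] = files
}
```

With this shape, calling both `inferSchema()` and `fileIndex()` triggers a single scan.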
[jira] [Created] (SPARK-18725) Creating a datasource table with schema should not scan all files for table
Eric Liang created SPARK-18725: -- Summary: Creating a datasource table with schema should not scan all files for table Key: SPARK-18725 URL: https://issues.apache.org/jira/browse/SPARK-18725 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Eric Liang When creating an unpartitioned Datasource table, Spark will scan the table location for all files to infer the schema, even when the user already provides one. This is fixed for partitioned tables by https://github.com/apache/spark/pull/16090, but the unpartitioned case was left as a separate follow-up. Since unpartitioned tables are typically not large, this can probably be left out of the 2.1 release. cc [~cloud_fan]
[jira] [Updated] (SPARK-18679) Regression in file listing performance
[ https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18679: --- Component/s: SQL
[jira] [Created] (SPARK-18679) Regression in file listing performance
Eric Liang created SPARK-18679: -- Summary: Regression in file listing performance Key: SPARK-18679 URL: https://issues.apache.org/jira/browse/SPARK-18679 Project: Spark Issue Type: Bug Reporter: Eric Liang Priority: Blocker In Spark 2.1, ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). It seems there is a performance regression: we no longer perform listing in parallel for non-root directories. This forces file listing to be completely serial when resolving datasource tables that are not backed by an external catalog.
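The intended behavior can be sketched with a recursive listing that fans out over child directories concurrently. This is a local stand-in only: the real InMemoryFileIndex can also distribute listing across the cluster, and `listLeafFiles` here is a hypothetical name using plain Futures for the parallelism.

```scala
import java.io.File
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Sketch: recurse into child directories concurrently instead of serially,
// so deep partition trees are not listed one directory at a time.
def listLeafFiles(root: File): Seq[File] = {
  val children = Option(root.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  val (dirs, files) = children.partition(_.isDirectory)
  val nested = dirs.map(d => Future(listLeafFiles(d)))  // parallel recursion
  files ++ nested.flatMap(f => Await.result(f, 1.minute))
}
```

All child-directory futures are launched before any is awaited, so sibling directories are listed in parallel.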
[jira] [Updated] (SPARK-18679) Regression in file listing performance
[ https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18679: --- Affects Version/s: 2.1.0
[jira] [Created] (SPARK-18661) Creating a partitioned datasource table should not scan all files in filesystem
Eric Liang created SPARK-18661: -- Summary: Creating a partitioned datasource table should not scan all files in filesystem Key: SPARK-18661 URL: https://issues.apache.org/jira/browse/SPARK-18661 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Eric Liang Priority: Blocker Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason. We should avoid doing this when the user specifies a schema.
[jira] [Updated] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table
[ https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18661: --- Summary: Creating a partitioned datasource table should not scan all files for table (was: Creating a partitioned datasource table should not scan all files in filesystem)
[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18659: --- Description: The first three test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one deletes too many files due to a partition case resolution failure. (was: The following test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one deletes too many files due to a partition case resolution failure.) The four reproduction test cases are unchanged.
[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18659: --- Description: The following test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one deletes too many files due to a partition case resolution failure. (was: The following test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one crashes due to a partition case resolution failure.) The four reproduction test cases are unchanged.
[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables
[ https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18659: --- Summary: Incorrect behaviors in overwrite table for datasource tables (was: Crash in overwrite table partitions due to hive metastore integration)
[jira] [Created] (SPARK-18659) Crash in overwrite table partitions due to hive metastore integration
Eric Liang created SPARK-18659: -- Summary: Crash in overwrite table partitions due to hive metastore integration Key: SPARK-18659 URL: https://issues.apache.org/jira/browse/SPARK-18659 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Eric Liang Priority: Blocker The following test cases fail due to a crash in hive client when dropping partitions that don't contain files. The last one crashes due to a partition case resolution failure.

{code}
test("foo") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("bar") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (a, b) select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("baz") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (A, B) select id, id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 1)
  }
}

test("qux") {
  withTable("test") {
    spark.range(10)
      .selectExpr("id", "id as A", "'x' as B")
      .write.partitionBy("A", "B").mode("overwrite")
      .saveAsTable("test")
    spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' from range(1)")
    assert(spark.sql("select * from test").count() == 10)
  }
}
{code}
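The case-resolution part of the bug (lowercase `partition (a, b)` against columns `A`, `B`) comes down to a missing resolution step. The sketch below is hypothetical, not Spark's code: it matches a user-written partition spec against the table's actual partition columns case-insensitively and rewrites keys to the table's canonical casing.

```scala
// Hypothetical sketch: resolve a user-supplied partition spec against the
// table's partition columns case-insensitively, instead of comparing
// names exactly and silently mismatching.
def resolvePartitionSpec(
    spec: Map[String, Option[String]],
    partitionColumns: Seq[String]): Map[String, Option[String]] = {
  spec.map { case (key, value) =>
    val canonical = partitionColumns
      .find(_.equalsIgnoreCase(key))
      .getOrElse(throw new IllegalArgumentException(
        s"$key is not a partition column in [${partitionColumns.mkString(", ")}]"))
    canonical -> value  // rewrite the key to the table's casing
  }
}
```

For example, `partition (a=1, b)` against columns `A`, `B` resolves to a spec keyed by `A` and `B`, so the overwrite targets the right directories.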
[jira] [Updated] (SPARK-18635) Partition name/values not escaped correctly in some cases
[ https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18635: --- Target Version/s: 2.1.0 Priority: Critical (was: Major)
[jira] [Created] (SPARK-18635) Partition name/values not escaped correctly in some cases
Eric Liang created SPARK-18635: -- Summary: Partition name/values not escaped correctly in some cases Key: SPARK-18635 URL: https://issues.apache.org/jira/browse/SPARK-18635 Project: Spark Issue Type: Sub-task Reporter: Eric Liang For example, the following command does not insert data properly into the table {code} spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy") {code}
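The escaping in question follows the Hive convention of %-encoding characters that are unsafe in a partition path component. The sketch below is illustrative only: the character set is abridged and `escapePathName` is a hypothetical stand-in for Hive's FileUtils.escapePathName, but it shows how a value like `A$\=%` can round-trip through a directory name.

```scala
// Sketch of Hive-style partition-path escaping (abridged character set,
// not Spark/Hive's actual code): unsafe characters and control characters
// are encoded as %XX so partition values survive as directory names.
val unsafe: Set[Char] = "\"#%'*/:=?\\{}[]^".toSet

def escapePathName(value: String): String =
  value.flatMap { c =>
    if (unsafe.contains(c) || c < ' ') f"%%${c.toInt}%02X" else c.toString
  }
```

With this scheme the partition value from the report, `A$\=%`, becomes `A$%5C%3D%25` (backslash, `=`, and `%` encoded), which is safe as a path component.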
[jira] [Updated] (SPARK-18545) Verify number of hive client RPCs in PartitionedTablePerfStatsSuite
[ https://issues.apache.org/jira/browse/SPARK-18545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18545: --- Issue Type: Sub-task (was: Test) Parent: SPARK-17861 > Verify number of hive client RPCs in PartitionedTablePerfStatsSuite > --- > > Key: SPARK-18545 > URL: https://issues.apache.org/jira/browse/SPARK-18545 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > To avoid performance regressions like > https://issues.apache.org/jira/browse/SPARK-18507 in the future, we should > add a metric for the number of Hive client RPCs issued and check it in the > perf stats suite. > cc [~cloud_fan]
[jira] [Updated] (SPARK-18507) Major performance regression in SHOW PARTITIONS on partitioned Hive tables
[ https://issues.apache.org/jira/browse/SPARK-18507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18507: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > Major performance regression in SHOW PARTITIONS on partitioned Hive tables > -- > > Key: SPARK-18507 > URL: https://issues.apache.org/jira/browse/SPARK-18507 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michael Allman >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.1.0 > > > Commit {{ccb11543048dccd4cc590a8db1df1d9d5847d112}} > (https://github.com/apache/spark/commit/ccb11543048dccd4cc590a8db1df1d9d5847d112) > appears to have introduced a major regression in the performance of the Hive > {{SHOW PARTITIONS}} command. Running that command on a Hive table with 17,337 > partitions in the {{spark-sql}} shell with the parent commit of {{ccb1154}} > takes approximately 7.3 seconds. Running the same command with commit > {{ccb1154}} takes approximately 250 seconds. > I have not had the opportunity to complete a thorough investigation, but I > suspect the problem lies in the diff hunk beginning at > https://github.com/apache/spark/commit/ccb11543048dccd4cc590a8db1df1d9d5847d112#diff-159191585e10542f013cb3a714f26075L675. > If that's the case, this performance issue should manifest itself in other > areas as this programming pattern was used elsewhere in this commit.
[jira] [Updated] (SPARK-18544) Append with df.saveAsTable writes data to wrong location
[ https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18544: --- Description: When using saveAsTable in append mode, data will be written to the wrong location for non-managed Datasource tables. The following example illustrates this. It seems we somehow pass the wrong table path to InsertIntoHadoopFsRelation from DataFrameWriter. Also, we should probably remove the repair table call at the end of saveAsTable in DataFrameWriter. That shouldn't be needed in either the Hive or Datasource case.

{code}
scala> spark.sqlContext.range(100).selectExpr("id", "id as A", "id as B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test")
scala> sql("create table test (id long, A int, B int) USING parquet OPTIONS (path '/tmp/test') PARTITIONED BY (A, B)")
scala> sql("msck repair table test")
scala> sql("select * from test where A = 1").count
res6: Long = 1
scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as B").write.partitionBy("A", "B").mode("append").saveAsTable("test")
scala> sql("select * from test where A = 1").count
res8: Long = 1
{code}

(was: the same report, with an earlier reproduction against /tmp/test_10k that appended via .parquet rather than saveAsTable)
[jira] [Commented] (SPARK-18544) Append with df.saveAsTable writes data to wrong location
[ https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687884#comment-15687884 ] Eric Liang commented on SPARK-18544: cc [~yhuai] [~cloud_fan] I'll try to look at this today but might not get to it.
[jira] [Created] (SPARK-18545) Verify number of hive client RPCs in PartitionedTablePerfStatsSuite
Eric Liang created SPARK-18545: -- Summary: Verify number of hive client RPCs in PartitionedTablePerfStatsSuite Key: SPARK-18545 URL: https://issues.apache.org/jira/browse/SPARK-18545 Project: Spark Issue Type: Test Components: SQL Reporter: Eric Liang Priority: Minor To avoid performance regressions like https://issues.apache.org/jira/browse/SPARK-18507 in the future, we should add a metric for the number of Hive client RPC issued and check it in the perf stats suite. cc [~cloud_fan] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
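The metric proposed above could be wired in by funneling every metastore call through a counting wrapper. The sketch below is hypothetical (the class and method names are made up, not Spark's real Hive client API) and only illustrates the counting idea the perf-stats suite would assert on.

```python
# Hypothetical sketch: wrap a Hive metastore client so every RPC bumps a
# counter that a test like PartitionedTablePerfStatsSuite could check.
class CountingHiveClient:
    """Toy stand-in for a Hive metastore client; not Spark's real class."""

    def __init__(self):
        self.rpc_count = 0

    def _rpc(self, result):
        # Every metastore call funnels through here and increments the counter.
        self.rpc_count += 1
        return result

    def get_table(self, name):
        return self._rpc({"name": name})

    def get_partitions(self, table):
        return self._rpc([])
```

A regression test would then perform a query and assert the counter stayed within an expected bound, catching accidental extra round trips like those in SPARK-18507.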
[jira] [Created] (SPARK-18544) Append with df.saveAsTable writes data to wrong location
Eric Liang created SPARK-18544: -- Summary: Append with df.saveAsTable writes data to wrong location Key: SPARK-18544 URL: https://issues.apache.org/jira/browse/SPARK-18544 Project: Spark Issue Type: Bug Components: SQL Reporter: Eric Liang
When using saveAsTable in append mode, data will be written to the wrong location for non-managed Datasource tables. The following example illustrates this.
It seems to somehow pass the wrong table path to InsertIntoHadoopFsRelation from DataFrameWriter. Also, we should probably remove the repair table call at the end of saveAsTable in DataFrameWriter. That shouldn't be needed in either the Hive or Datasource case.
{code}
scala> spark.sqlContext.range(1).selectExpr("id", "id as A", "id as B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test_10k")
scala> sql("msck repair table test_10k")
scala> sql("select * from test_10k where A = 1").count
res6: Long = 1
scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as B").write.partitionBy("A", "B").mode("append").parquet("/tmp/test_10k")
scala> sql("select * from test_10k where A = 1").count
res8: Long = 1
{code}
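The symptom above (the count staying at 1 after the append) is consistent with the writer resolving a different destination path on append than the one recorded for the table. The following is a toy model of that hypothesis, NOT Spark's actual implementation; the warehouse path and function names are assumptions for illustration only.

```python
# Toy model of the suspected bug: on append, the destination is derived from
# the default warehouse location instead of the path stored in the table's
# metadata, so appended rows land where the table never reads from.
WAREHOUSE = "/user/hive/warehouse"  # assumed default warehouse root

def buggy_append_path(table_name, table_location):
    # Suspected behavior: the table's custom location is ignored on append.
    return f"{WAREHOUSE}/{table_name}"

def correct_write_path(table_name, table_location):
    # Desired behavior: always honor the location stored in the metastore.
    return table_location
```

Under this model, a table created with `OPTIONS (path '/tmp/test')` would receive appended data under the warehouse root instead of `/tmp/test`, which matches the repro's unchanged count.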
[jira] [Created] (SPARK-18393) DataFrame pivot output column names should respect aliases
Eric Liang created SPARK-18393: -- Summary: DataFrame pivot output column names should respect aliases Key: SPARK-18393 URL: https://issues.apache.org/jira/browse/SPARK-18393 Project: Spark Issue Type: Improvement Components: SQL Reporter: Eric Liang Priority: Minor
For example
{code}
val df = spark.range(100).selectExpr("id % 5 as x", "id % 2 as a", "id as b")
df
  .groupBy('x)
  .pivot("a", Seq(0, 1))
  .agg(expr("sum(b)").as("blah"), expr("count(b)").as("foo"))
  .show()
+---+--------------------+---------------------+--------------------+---------------------+
|  x|0_sum(`b`) AS `blah`|0_count(`b`) AS `foo`|1_sum(`b`) AS `blah`|1_count(`b`) AS `foo`|
+---+--------------------+---------------------+--------------------+---------------------+
|  0|                 450|                   10|                 500|                   10|
|  1|                 510|                   10|                 460|                   10|
|  3|                 530|                   10|                 480|                   10|
|  2|                 470|                   10|                 520|                   10|
|  4|                 490|                   10|                 540|                   10|
+---+--------------------+---------------------+--------------------+---------------------+
{code}
The column names here are quite hard to read. Ideally we would respect the aliases and generate column names like 0_blah, 0_foo, 1_blah, 1_foo instead.
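The requested naming rule can be sketched as a small pure function (an assumption about the proposed behavior, not the final implementation): use the aggregate's alias as the suffix when one exists, and fall back to the full expression text otherwise.

```python
# Sketch of the proposed pivot column naming: "<pivotValue>_<alias>" when the
# aggregate carries an alias, "<pivotValue>_<expression text>" otherwise.
def pivot_column_name(pivot_value, agg_expr, alias=None):
    suffix = alias if alias is not None else agg_expr
    return f"{pivot_value}_{suffix}"
```

With aliases this yields the readable names the issue asks for (0_blah, 0_foo, 1_blah, 1_foo), while unaliased aggregates keep the current expression-based names.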
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652414#comment-15652414 ] Eric Liang commented on SPARK-17916: In our case, a user wants the empty string (whether actually missing, e.g. ,, or quoted ,""), to resolve as the empty string. It should only turn into null if nullValue is set to "". There doesn't currently appear to be any option combination that allows this.
> CSV data source treats empty string as null no matter what nullValue option is
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is
[ https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649039#comment-15649039 ] Eric Liang commented on SPARK-17916: We're hitting this as a regression from 2.0 as well. Ideally, we don't want the empty string to be treated specially in any scenario. The only logic that converts it to nulls should be due to the nullValue option. > CSV data source treats empty string as null no matter what nullValue option is > -- > > Key: SPARK-17916 > URL: https://issues.apache.org/jira/browse/SPARK-17916 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Hossein Falaki > > When user configures {{nullValue}} in CSV data source, in addition to those > values, all empty string values are also converted to null. > {code} > data: > col1,col2 > 1,"-" > 2,"" > {code} > {code} > spark.read.format("csv").option("nullValue", "-") > {code} > We will find a null in both rows. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
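The semantics requested in these comments can be summarized in one rule (an assumption about the desired behavior, not Spark's current code): a field becomes null only when it equals the configured nullValue, and the empty string receives no special treatment of its own.

```python
# Sketch of the requested conversion rule: only an exact match against the
# configured nullValue produces null; "" stays "" unless nullValue is "".
def parse_field(raw, null_value=None):
    if null_value is not None and raw == null_value:
        return None  # matched the explicit null token
    return raw       # everything else, including "", kept as-is
```

Under this rule the example data (`1,"-"` and `2,""`) with `nullValue = "-"` would yield null only in the first row, and a user who wants empty strings nulled out would set `nullValue = ""` explicitly.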
[jira] [Updated] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names
[ https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17990: --- Target Version/s: 2.1.0 > ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition > column names > --- > > Key: SPARK-17990 > URL: https://issues.apache.org/jira/browse/SPARK-17990 > Project: Spark > Issue Type: Sub-task > Components: SQL > Environment: Linux > Mac OS with a case-sensitive filesystem >Reporter: Michael Allman > > Writing partition data to an external table's file location and then adding > those as table partition metadata is a common use case. However, for tables > with partition column names with upper case letters, the SQL command {{ALTER > TABLE ... ADD PARTITION}} does not work, as illustrated in the following > example: > {code} > scala> sql("create external table mixed_case_partitioning (a bigint) > PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION > '/tmp/mixed_case_partitioning'") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sqlContext.range(10).selectExpr("id as a", "id as > partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning") > {code} > At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} > produces the following: > {code} > [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning > Found 11 items > -rw-r--r-- 3 msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/_SUCCESS > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=0 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=1 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=2 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=3 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=4 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > 
/tmp/mixed_case_partitioning/partCol=5 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=6 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=7 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=8 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=9 > {code} > Returning to the Spark shell, we execute the following to add the partition > metadata: > {code} > scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add > partition(partCol=$p)") } > {code} > Examining the HDFS file listing again, we see: > {code} > [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning > Found 21 items > -rw-r--r-- 3 msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/_SUCCESS > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=0 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=1 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=2 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=3 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=4 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=5 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=6 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=7 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=8 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:52 > /tmp/mixed_case_partitioning/partCol=9 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=0 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=1 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > 
/tmp/mixed_case_partitioning/partcol=2 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=3 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=4 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=5 > drwxr-xr-x - msa supergroup 0 2016-10-18 17:53 > /tmp/mixed_case_partitioning/partcol=6 > drwxr-xr-x - msa supergroup 0 2016-10-18
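The duplicated `partCol=N` / `partcol=N` directories above come from the partition spec being lower-cased before it reaches the filesystem. One plausible fix, sketched below as an assumption (simplified, not actual Spark code), is to resolve the user-supplied spec against the table's partition schema case-insensitively so the schema's casing wins.

```python
# Simplified sketch: map each key of a user-supplied partition spec onto the
# matching partition column name from the table schema, compared
# case-insensitively, so "partcol=0" resolves to "partCol=0".
def normalize_spec(spec, partition_columns):
    by_lower = {c.lower(): c for c in partition_columns}
    # Unknown keys are left untouched here; real code would raise an error.
    return {by_lower.get(k.lower(), k): v for k, v in spec.items()}
```

With this normalization, `ALTER TABLE ... ADD PARTITION (partcol=0)` would register metadata for the existing `partCol=0` directory instead of creating a parallel lower-cased one.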
[jira] [Updated] (SPARK-18333) Revert hacks in parquet and orc reader to support case insensitive resolution
[ https://issues.apache.org/jira/browse/SPARK-18333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18333: --- Target Version/s: 2.1.0 > Revert hacks in parquet and orc reader to support case insensitive resolution > - > > Key: SPARK-18333 > URL: https://issues.apache.org/jira/browse/SPARK-18333 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > These are no longer needed after > https://issues.apache.org/jira/browse/SPARK-17183
[jira] [Updated] (SPARK-18145) Update documentation for hive partition management in 2.1
[ https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18145: --- Target Version/s: 2.1.0 > Update documentation for hive partition management in 2.1 > - > > Key: SPARK-18145 > URL: https://issues.apache.org/jira/browse/SPARK-18145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18333) Revert hacks in parquet and orc reader to support case insensitive resolution
Eric Liang created SPARK-18333: -- Summary: Revert hacks in parquet and orc reader to support case insensitive resolution Key: SPARK-18333 URL: https://issues.apache.org/jira/browse/SPARK-18333 Project: Spark Issue Type: Sub-task Reporter: Eric Liang These are no longer needed after https://issues.apache.org/jira/browse/SPARK-17183
[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18185: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18185: --- Target Version/s: 2.1.0 > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645238#comment-15645238 ] Eric Liang commented on SPARK-18185: I'm currently working on this. > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18185: --- Description: As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the updated partitions as in Hive. It also doesn't respect custom partition locations. We should delete only the proper partitions, scan the metastore for affected partitions with custom locations, and ensure that deletes/writes go to the right locations for those as well. was: As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the updated partitions as in Hive. This is non-trivial to fix in 2.1, so we should throw an exception here. > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
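The intended overwrite scope described above can be sketched as a selection problem: given the table's existing partitions (each with a possibly custom location) and the set of partition specs the dynamic write actually produced, only the intersection should be deleted and rewritten. The sketch below is an assumption based on the issue description, not actual Spark code; partition specs are modeled as tuples of (column, value) pairs so they can live in a set.

```python
# Sketch: with dynamic partitions, INSERT OVERWRITE should delete only the
# partitions the write produced, using each partition's own (possibly custom)
# location, rather than wiping the whole table directory.
def partitions_to_overwrite(existing, written_specs):
    """existing: list of (spec, location) pairs, where spec is a tuple of
    (column, value) pairs; written_specs: set of such spec tuples."""
    return [(spec, loc) for spec, loc in existing if spec in written_specs]
```

Deletes would then be issued against the returned locations, which covers the custom-location case the description calls out.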
[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18185: --- Summary: Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions (was: Should disallow INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions) > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. > This is non-trivial to fix in 2.1, so we should throw an exception here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18185) Should disallow INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
Eric Liang created SPARK-18185: -- Summary: Should disallow INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions Key: SPARK-18185 URL: https://issues.apache.org/jira/browse/SPARK-18185 Project: Spark Issue Type: Bug Components: SQL Reporter: Eric Liang As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the updated partitions as in Hive. This is non-trivial to fix in 2.1, so we should throw an exception here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations
Eric Liang created SPARK-18184: -- Summary: INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations Key: SPARK-18184 URL: https://issues.apache.org/jira/browse/SPARK-18184 Project: Spark Issue Type: Bug Components: SQL Reporter: Eric Liang As part of https://issues.apache.org/jira/browse/SPARK-17861 we should support this case as well now that Datasource table partitions can have custom locations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
[ https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18183: --- Component/s: SQL > INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource > table instead of just the specified partition > --- > > Key: SPARK-18183 > URL: https://issues.apache.org/jira/browse/SPARK-18183 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > In Hive, this will only overwrite the specified partition. We should fix this > for Datasource tables to be more in line with the Hive behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition
Eric Liang created SPARK-18183: -- Summary: INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition Key: SPARK-18183 URL: https://issues.apache.org/jira/browse/SPARK-18183 Project: Spark Issue Type: Bug Reporter: Eric Liang In Hive, this will only overwrite the specified partition. We should fix this for Datasource tables to be more in line with the Hive behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18167) Flaky test when hive partition pruning is enabled
Eric Liang created SPARK-18167: -- Summary: Flaky test when hive partition pruning is enabled Key: SPARK-18167 URL: https://issues.apache.org/jira/browse/SPARK-18167 Project: Spark Issue Type: Bug Components: SQL Reporter: Eric Liang org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive partition pruning is enabled. Based on the stack traces, it seems to be an old issue where Hive fails to cast a numeric partition column ("Invalid character string format for type DECIMAL"). There are two possibilities here: either we are somehow corrupting the partition table to have non-decimal values in that column, or there is a transient issue with Derby. {code} Error Message java.lang.reflect.InvocationTargetException: null Stacktrace sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null at sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702) at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769) at org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73) at org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:234) at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:170) at org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$76$$anonfun$apply$mcV$sp$24.apply$mcV$sp(SQLQuerySuite.scala:1559)
[jira] [Created] (SPARK-18146) Avoid using Union to chain together create table and repair partition commands
Eric Liang created SPARK-18146: -- Summary: Avoid using Union to chain together create table and repair partition commands Key: SPARK-18146 URL: https://issues.apache.org/jira/browse/SPARK-18146 Project: Spark Issue Type: Sub-task Reporter: Eric Liang Priority: Minor The behavior of union is not well defined here. We should add an internal command to execute these commands sequentially.
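The proposed internal command amounts to a node that runs its children strictly in order. The sketch below uses a hypothetical name and shape (not the actual Spark implementation) to show the contract: left-to-right execution, which Union does not guarantee for side-effecting commands.

```python
# Hypothetical sketch of the proposed internal command: execute child
# commands strictly left to right instead of relying on Union's unspecified
# evaluation order.
class SequentialCommand:
    def __init__(self, children):
        self.children = children  # callables, executed in list order

    def run(self):
        # A list comprehension evaluates in order, so "create table" always
        # completes before "repair table" starts.
        return [child() for child in self.children]
```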
[jira] [Updated] (SPARK-18145) Update documentation
[ https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18145: --- Issue Type: Sub-task (was: Documentation) Parent: SPARK-17861 > Update documentation > > > Key: SPARK-18145 > URL: https://issues.apache.org/jira/browse/SPARK-18145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18145) Update documentation for hive partition management in 2.1
[ https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18145: --- Component/s: SQL > Update documentation for hive partition management in 2.1 > - > > Key: SPARK-18145 > URL: https://issues.apache.org/jira/browse/SPARK-18145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18145) Update documentation for hive partition management in 2.1
[ https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18145: --- Summary: Update documentation for hive partition management in 2.1 (was: Update documentation) > Update documentation for hive partition management in 2.1 > - > > Key: SPARK-18145 > URL: https://issues.apache.org/jira/browse/SPARK-18145 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18145) Update documentation
Eric Liang created SPARK-18145: -- Summary: Update documentation Key: SPARK-18145 URL: https://issues.apache.org/jira/browse/SPARK-18145 Project: Spark Issue Type: Documentation Reporter: Eric Liang
[jira] [Created] (SPARK-18103) Rename *FileCatalog to *FileProvider
Eric Liang created SPARK-18103: -- Summary: Rename *FileCatalog to *FileProvider Key: SPARK-18103 URL: https://issues.apache.org/jira/browse/SPARK-18103 Project: Spark Issue Type: Improvement Reporter: Eric Liang Priority: Minor In the SQL component there are too many different components called some variant of *Catalog, which is quite confusing. We should rename the subclasses of FileCatalog to avoid this.
[jira] [Updated] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields
[ https://issues.apache.org/jira/browse/SPARK-18101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18101: --- Issue Type: Sub-task (was: Test) Parent: SPARK-17861 > ExternalCatalogSuite should test with mixed case fields > --- > > Key: SPARK-18101 > URL: https://issues.apache.org/jira/browse/SPARK-18101 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > Currently, it uses field names such as "a" and "b" which are not useful for > testing case preservation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields
Eric Liang created SPARK-18101: -- Summary: ExternalCatalogSuite should test with mixed case fields Key: SPARK-18101 URL: https://issues.apache.org/jira/browse/SPARK-18101 Project: Spark Issue Type: Test Components: SQL Reporter: Eric Liang Currently, it uses field names such as "a" and "b" which are not useful for testing case preservation.
[jira] [Updated] (SPARK-17183) put hive serde table schema to table properties like data source table
[ https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17183: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > put hive serde table schema to table properties like data source table > -- > > Key: SPARK-17183 > URL: https://issues.apache.org/jira/browse/SPARK-17183 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Created] (SPARK-18087) Optimize insert to not require REPAIR TABLE
Eric Liang created SPARK-18087: -- Summary: Optimize insert to not require REPAIR TABLE Key: SPARK-18087 URL: https://issues.apache.org/jira/browse/SPARK-18087 Project: Spark Issue Type: Sub-task Reporter: Eric Liang
[jira] [Updated] (SPARK-18026) should not always lowercase partition columns of partition spec in parser
[ https://issues.apache.org/jira/browse/SPARK-18026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18026: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > should not always lowercase partition columns of partition spec in parser > - > > Key: SPARK-18026 > URL: https://issues.apache.org/jira/browse/SPARK-18026 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Updated] (SPARK-17970) Use metastore for managing filesource table partitions as well
[ https://issues.apache.org/jira/browse/SPARK-17970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17970: --- Summary: Use metastore for managing filesource table partitions as well (was: store partition spec in metastore for data source table) > Use metastore for managing filesource table partitions as well > -- > > Key: SPARK-17970 > URL: https://issues.apache.org/jira/browse/SPARK-17970 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Updated] (SPARK-17994) Add back a file status cache for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17994: --- Description: In SPARK-16980, we removed the full in-memory cache of table partitions in favor of loading only needed partitions from the metastore. This greatly improves the initial latency of queries that only read a small fraction of table partitions. However, since the metastore does not store file statistics, we need to discover those from remote storage. With the loss of the in-memory file status cache this has to happen on each query, increasing the latency of repeated queries over the same partitions. The proposal is to add back a per-table cache of partition contents, i.e. Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can be invalidated through refreshTable() and refreshByPath(). Unlike the prior cache, it can be incrementally updated as new partitions are read. cc [~michael] was: In SPARK-16980, we removed the full in-memory cache of table partitions in favor of loading only needed partitions from the metastore. This greatly improves the initial latency of queries that only read a small fraction of table partitions. However, since the metastore does not store file statistics, we need to discover those from remote storage. With the loss of the in-memory file status cache this has to happen on each query, increasing the latency of repeated queries over the same partitions. The proposal is to add back a per-table cache of partition contents, i.e. Map[Path, Array[FileStatus]]. This cache would be retained per-table, and would be invalidated through refreshTable() and refreshByPath(). Unlike the prior cache, it can be incrementally updated as new partitions are read. 
cc [~michael] > Add back a file status cache for catalog tables > --- > > Key: SPARK-17994 > URL: https://issues.apache.org/jira/browse/SPARK-17994 > Project: Spark > Issue Type: Sub-task >Reporter: Eric Liang > > In SPARK-16980, we removed the full in-memory cache of table partitions in > favor of loading only needed partitions from the metastore. This greatly > improves the initial latency of queries that only read a small fraction of > table partitions. > However, since the metastore does not store file statistics, we need to > discover those from remote storage. With the loss of the in-memory file > status cache this has to happen on each query, increasing the latency of > repeated queries over the same partitions. > The proposal is to add back a per-table cache of partition contents, i.e. > Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can > be invalidated through refreshTable() and refreshByPath(). Unlike the prior > cache, it can be incrementally updated as new partitions are read. > cc [~michael]
[jira] [Updated] (SPARK-17994) Add back a file status cache for catalog tables
[ https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17994: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > Add back a file status cache for catalog tables > --- > > Key: SPARK-17994 > URL: https://issues.apache.org/jira/browse/SPARK-17994 > Project: Spark > Issue Type: Sub-task >Reporter: Eric Liang > > In SPARK-16980, we removed the full in-memory cache of table partitions in > favor of loading only needed partitions from the metastore. This greatly > improves the initial latency of queries that only read a small fraction of > table partitions. > However, since the metastore does not store file statistics, we need to > discover those from remote storage. With the loss of the in-memory file > status cache this has to happen on each query, increasing the latency of > repeated queries over the same partitions. > The proposal is to add back a per-table cache of partition contents, i.e. > Map[Path, Array[FileStatus]]. This cache would be retained per-table, and > would be invalidated through refreshTable() and refreshByPath(). Unlike the > prior cache, it can be incrementally updated as new partitions are read. > cc [~michael]
[jira] [Created] (SPARK-17994) Add back a file status cache for catalog tables
Eric Liang created SPARK-17994: -- Summary: Add back a file status cache for catalog tables Key: SPARK-17994 URL: https://issues.apache.org/jira/browse/SPARK-17994 Project: Spark Issue Type: Improvement Reporter: Eric Liang In SPARK-16980, we removed the full in-memory cache of table partitions in favor of loading only needed partitions from the metastore. This greatly improves the initial latency of queries that only read a small fraction of table partitions. However, since the metastore does not store file statistics, we need to discover those from remote storage. With the loss of the in-memory file status cache this has to happen on each query, increasing the latency of repeated queries over the same partitions. The proposal is to add back a per-table cache of partition contents, i.e. Map[Path, Array[FileStatus]]. This cache would be retained per-table, and would be invalidated through refreshTable() and refreshByPath(). Unlike the prior cache, it can be incrementally updated as new partitions are read. cc [~michael]
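The proposed cache can be sketched as follows. This is an illustrative model only: the class and method names (FileStatusCache, list_files) are hypothetical, not Spark's actual internals.

```python
class FileStatusCache:
    """Per-table cache of partition contents: table -> {path: [file status, ...]}."""

    def __init__(self):
        self._cache = {}

    def list_files(self, table, path, list_from_storage):
        # Only hit remote storage on a cache miss; repeated queries over the
        # same partitions reuse cached statuses (the cache fills incrementally
        # as new partitions are read).
        per_table = self._cache.setdefault(table, {})
        if path not in per_table:
            per_table[path] = list_from_storage(path)
        return per_table[path]

    def invalidate(self, table):
        # Models refreshTable(); refreshByPath() would drop only the entries
        # whose key falls under the given path prefix.
        self._cache.pop(table, None)
```

The key design point, per the proposal, is that invalidation is explicit (refreshTable/refreshByPath) rather than wholesale, so a refresh of one table leaves other tables' cached statuses intact.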
[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586476#comment-15586476 ] Eric Liang commented on SPARK-17983: Since we already store the original (case-sensitive) schema of datasource tables in the metastore as a table property, I think it also makes sense to do this for hive tables created through Spark. This would resolve this issue for both types of tables. The one issue I can see with this is that hive tables created through previous versions of Spark will stop working in 2.1 if they have mixed-case files. We would need a workaround for this situation. > Can't filter over mixed case parquet columns of converted Hive tables > - > > Key: SPARK-17983 > URL: https://issues.apache.org/jira/browse/SPARK-17983 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > We should probably revive https://github.com/apache/spark/pull/14750 in order > to fix this issue and related classes of issues. > The only other alternatives are (1) reconciling on-disk schemas with > metastore schema at planning time, which seems pretty messy, and (2) fixing > all the datasources to support case-insensitive matching, which also has > issues. 
> Reproduction: > {code} > private def setupPartitionedTable(tableName: String, dir: File): Unit = { > spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as > partCol2").write > .partitionBy("partCol1", "partCol2") > .mode("overwrite") > .parquet(dir.getAbsolutePath) > spark.sql(s""" > |create external table $tableName (normalCol long) > |partitioned by (partCol1 int, partCol2 int) > |stored as parquet > |location "${dir.getAbsolutePath}" > """.stripMargin) > spark.sql(s"msck repair table $tableName") > } > test("filter by mixed case col") { > withTable("test") { > withTempDir { dir => > setupPartitionedTable("test", dir) > val df = spark.sql("select * from test where normalCol = 3") > assert(df.count() == 1) > } > } > } > {code} > cc [~cloud_fan]
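Alternative (2) above, case-insensitive matching inside the datasource, boils down to reconciling the metastore's lowercased column names with the mixed-case names in the parquet footer. A minimal sketch of that reconciliation; the function name and input shapes are hypothetical:

```python
def reconcile_columns(metastore_cols, parquet_cols):
    # The metastore stores lowercased names ("normalcol") while the parquet
    # footer keeps the original case ("normalCol"); match case-insensitively
    # so a filter on either spelling resolves to the right physical column.
    by_lower = {c.lower(): c for c in parquet_cols}
    resolved = []
    for col in metastore_cols:
        physical = by_lower.get(col.lower())
        if physical is None:
            raise ValueError("column %r not found in file schema" % col)
        resolved.append(physical)
    return resolved
```

Storing the original case-sensitive schema as a table property (the approach the comment above advocates) avoids this lookup entirely, at the cost of a migration story for tables created by older Spark versions.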
[jira] [Updated] (SPARK-17991) Enable metastore partition pruning for unconverted hive tables by default
[ https://issues.apache.org/jira/browse/SPARK-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17991: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > Enable metastore partition pruning for unconverted hive tables by default > - > > Key: SPARK-17991 > URL: https://issues.apache.org/jira/browse/SPARK-17991 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > Now that we use the metastore for pruning converted hive tables, we should > enable it for unconverted tables as well. Previously this was turned off due > to test flakiness, but this does not seem to be the case currently.
[jira] [Created] (SPARK-17991) Enable metastore partition pruning for unconverted hive tables by default
Eric Liang created SPARK-17991: -- Summary: Enable metastore partition pruning for unconverted hive tables by default Key: SPARK-17991 URL: https://issues.apache.org/jira/browse/SPARK-17991 Project: Spark Issue Type: Improvement Components: SQL Reporter: Eric Liang Now that we use the metastore for pruning converted hive tables, we should enable it for unconverted tables as well. Previously this was turned off due to test flakiness, but this does not seem to be the case currently.
[jira] [Updated] (SPARK-17980) Fix refreshByPath for converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17980: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-17861 > Fix refreshByPath for converted Hive tables > --- > > Key: SPARK-17980 > URL: https://issues.apache.org/jira/browse/SPARK-17980 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor > > There is a small bug introduced in https://github.com/apache/spark/pull/14690 > which broke refreshByPath with converted hive tables (though, it turns out it > was very difficult to refresh converted hive tables anyways, since you had to > specify the exact path of one of the partitions).
[jira] [Commented] (SPARK-17862) Feature flag SPARK-16980
[ https://issues.apache.org/jira/browse/SPARK-17862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586210#comment-15586210 ] Eric Liang commented on SPARK-17862: Yes, this is the flag: {code} val HIVE_FILESOURCE_PARTITION_PRUNING = SQLConfigBuilder("spark.sql.hive.filesourcePartitionPruning") .doc("When true, enable metastore partition pruning for file source tables as well. " + "This is currently implemented for converted Hive tables only.") .booleanConf .createWithDefault(true) {code} > Feature flag SPARK-16980 > > > Key: SPARK-17862 > URL: https://issues.apache.org/jira/browse/SPARK-17862 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Eric Liang >Priority: Trivial >
[jira] [Updated] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables
[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17983: --- Description: We should probably revive https://github.com/apache/spark/pull/14750 in order to fix this issue and related classes of issues. The only other alternatives are (1) reconciling on-disk schemas with metastore schema at planning time, which seems pretty messy, and (2) fixing all the datasources to support case-insensitive matching, which also has issues. Reproduction: {code} private def setupPartitionedTable(tableName: String, dir: File): Unit = { spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as partCol2").write .partitionBy("partCol1", "partCol2") .mode("overwrite") .parquet(dir.getAbsolutePath) spark.sql(s""" |create external table $tableName (normalCol long) |partitioned by (partCol1 int, partCol2 int) |stored as parquet |location "${dir.getAbsolutePath}" """.stripMargin) spark.sql(s"msck repair table $tableName") } test("filter by mixed case col") { withTable("test") { withTempDir { dir => setupPartitionedTable("test", dir) val df = spark.sql("select * from test where normalCol = 3") assert(df.count() == 1) } } } {code} cc [~cloud_fan] was: We should probably revive https://github.com/apache/spark/pull/14750 in order to fix this issue and related classes of issues. The only other alternatives are (1) reconciling on-disk schemas with metastore schema at planning time, which seems pretty messy, and (2) fixing all the datasources to support case-insensitive matching, which also has issues.
cc [~cloud_fan] > Can't filter over mixed case parquet columns of converted Hive tables > - > > Key: SPARK-17983 > URL: https://issues.apache.org/jira/browse/SPARK-17983 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Critical > > We should probably revive https://github.com/apache/spark/pull/14750 in order > to fix this issue and related classes of issues. > The only other alternatives are (1) reconciling on-disk schemas with > metastore schema at planning time, which seems pretty messy, and (2) fixing > all the datasources to support case-insensitive matching, which also has > issues. > Reproduction: > {code} > private def setupPartitionedTable(tableName: String, dir: File): Unit = { > spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as > partCol2").write > .partitionBy("partCol1", "partCol2") > .mode("overwrite") > .parquet(dir.getAbsolutePath) > spark.sql(s""" > |create external table $tableName (normalCol long) > |partitioned by (partCol1 int, partCol2 int) > |stored as parquet > |location "${dir.getAbsolutePath}" > """.stripMargin) > spark.sql(s"msck repair table $tableName") > } > test("filter by mixed case col") { > withTable("test") { > withTempDir { dir => > setupPartitionedTable("test", dir) > val df = spark.sql("select * from test where normalCol = 3") > assert(df.count() == 1) > } > } > } > {code} > cc [~cloud_fan]
[jira] [Created] (SPARK-17980) Fix refreshByPath for converted Hive tables
Eric Liang created SPARK-17980: -- Summary: Fix refreshByPath for converted Hive tables Key: SPARK-17980 URL: https://issues.apache.org/jira/browse/SPARK-17980 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Eric Liang Priority: Minor There is a small bug introduced in https://github.com/apache/spark/pull/14690 which broke refreshByPath with converted hive tables (though, it turns out it was very difficult to refresh converted hive tables anyways, since you had to specify the exact path of one of the partitions).
[jira] [Updated] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17974: --- Affects Version/s: 2.1.0 > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables.
[jira] [Created] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
Eric Liang created SPARK-17974: -- Summary: Refactor FileCatalog classes to simplify the inheritance tree Key: SPARK-17974 URL: https://issues.apache.org/jira/browse/SPARK-17974 Project: Spark Issue Type: Improvement Reporter: Eric Liang Priority: Minor This is a follow-up item for https://github.com/apache/spark/pull/14690 which adds support for metastore partition pruning of converted hive tables.
[jira] [Updated] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree
[ https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17974: --- Component/s: SQL > Refactor FileCatalog classes to simplify the inheritance tree > - > > Key: SPARK-17974 > URL: https://issues.apache.org/jira/browse/SPARK-17974 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Liang >Priority: Minor > > This is a follow-up item for https://github.com/apache/spark/pull/14690 which > adds support for metastore partition pruning of converted hive tables.
[jira] [Created] (SPARK-17740) Spark tests should mock / interpose HDFS to ensure that streams are closed
Eric Liang created SPARK-17740: -- Summary: Spark tests should mock / interpose HDFS to ensure that streams are closed Key: SPARK-17740 URL: https://issues.apache.org/jira/browse/SPARK-17740 Project: Spark Issue Type: Test Components: Spark Core Reporter: Eric Liang As a followup to SPARK-17666, we should add a test to ensure filesystem connections are not leaked at least in unit tests.
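One way to interpose a filesystem in tests is to wrap every stream it hands out and check at teardown that all of them were closed. A language-agnostic sketch of the idea; the names here are illustrative, and Spark's actual tests would interpose Hadoop's FileSystem rather than local file handles:

```python
class TrackedStream:
    """Delegating wrapper that stays registered until close() is called."""

    def __init__(self, inner, registry):
        self._inner = inner
        self._registry = registry
        registry.add(self)

    def read(self, *args):
        return self._inner.read(*args)

    def close(self):
        self._registry.discard(self)
        self._inner.close()


class TrackingFileSystem:
    """Hands out tracked streams so a test harness can assert none leaked."""

    def __init__(self):
        self._open = set()

    def open(self, path):
        return TrackedStream(open(path, "rb"), self._open)

    def assert_all_closed(self):
        # Called at test teardown: any stream still registered was leaked.
        assert not self._open, "leaked %d stream(s)" % len(self._open)
```

A test suite would route all filesystem access through the tracking layer and call assert_all_closed() after each test, turning a silent connection leak into an immediate failure.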
[jira] [Created] (SPARK-17713) Move row-datasource related tests out of JDBCSuite
Eric Liang created SPARK-17713: -- Summary: Move row-datasource related tests out of JDBCSuite Key: SPARK-17713 URL: https://issues.apache.org/jira/browse/SPARK-17713 Project: Spark Issue Type: Test Components: SQL Reporter: Eric Liang Priority: Trivial As a followup for https://github.com/apache/spark/pull/15273 we should move non-JDBC specific tests out of that suite.
[jira] [Created] (SPARK-17701) Refactor DataSourceScanExec so its sameResult call does not compare strings
Eric Liang created SPARK-17701: -- Summary: Refactor DataSourceScanExec so its sameResult call does not compare strings Key: SPARK-17701 URL: https://issues.apache.org/jira/browse/SPARK-17701 Project: Spark Issue Type: Improvement Reporter: Eric Liang Currently, RowDataSourceScanExec and FileSourceScanExec rely on a "metadata" string map to implement equality comparison, since the RDDs they depend on cannot be directly compared. This has resulted in a number of correctness bugs around exchange reuse, e.g. SPARK-17673 and SPARK-16818. To make these comparisons less brittle, we should refactor these classes to compare constructor parameters directly instead of relying on the metadata map.
[jira] [Updated] (SPARK-17701) Refactor DataSourceScanExec so its sameResult call does not compare strings
[ https://issues.apache.org/jira/browse/SPARK-17701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17701: --- Component/s: SQL > Refactor DataSourceScanExec so its sameResult call does not compare strings > --- > > Key: SPARK-17701 > URL: https://issues.apache.org/jira/browse/SPARK-17701 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang > > Currently, RowDataSourceScanExec and FileSourceScanExec rely on a "metadata" > string map to implement equality comparison, since the RDDs they depend on > cannot be directly compared. This has resulted in a number of correctness > bugs around exchange reuse, e.g. SPARK-17673 and SPARK-16818. > To make these comparisons less brittle, we should refactor these classes to > compare constructor parameters directly instead of relying on the metadata > map.
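To illustrate why comparing a rendered metadata map is brittle: two scans over different relations can render (or truncate) to identical metadata strings and wrongly compare equal, so the planner may reuse one exchange for both. Comparing structured constructor parameters distinguishes them. A hypothetical sketch, not Spark's actual classes:

```python
from dataclasses import dataclass


def same_result_by_metadata(a_meta, b_meta):
    # Fragile: equality depends on how (and how much of) each value was
    # rendered into the string map, not on what the scans actually read.
    return a_meta == b_meta


@dataclass(frozen=True)
class ScanKey:
    """Structured stand-in for a scan's constructor parameters."""
    relation_id: int
    output_columns: tuple
    pushed_filters: tuple


def same_result_by_fields(a, b):
    # Robust: compares the actual parameters, not their string rendering.
    return a == b
```

This is the class of false positive behind the reused-exchange bug in SPARK-17673 below: two different aggregation inputs compared equal, so one exchange was substituted for the other.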
[jira] [Commented] (SPARK-17673) Reused Exchange Aggregations Produce Incorrect Results
[ https://issues.apache.org/jira/browse/SPARK-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15528145#comment-15528145 ] Eric Liang commented on SPARK-17673: Russell, could you try applying this patch (wip) to see if it resolves the issue? https://github.com/apache/spark/pull/15273/files It fixes equality comparison for row datasource scans to take into account the output schema of the scan. > Reused Exchange Aggregations Produce Incorrect Results > -- > > Key: SPARK-17673 > URL: https://issues.apache.org/jira/browse/SPARK-17673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Russell Spitzer >Priority: Blocker > Labels: correctness > > https://datastax-oss.atlassian.net/browse/SPARKC-429 > Was brought to my attention where the following code produces incorrect > results > {code} > val data = List(TestData("A", 1, 7)) > val frame = > session.sqlContext.createDataFrame(session.sparkContext.parallelize(data)) > frame.createCassandraTable( > keySpaceName, > table, > partitionKeyColumns = Some(Seq("id"))) > frame > .write > .format("org.apache.spark.sql.cassandra") > .mode(SaveMode.Append) > .options(Map("table" -> table, "keyspace" -> keySpaceName)) > .save() > val loaded = sparkSession.sqlContext > .read > .format("org.apache.spark.sql.cassandra") > .options(Map("table" -> table, "keyspace" -> ks)) > .load() > .select("id", "col1", "col2") > val min1 = loaded.groupBy("id").agg(min("col1").as("min")) > val min2 = loaded.groupBy("id").agg(min("col2").as("min")) > min1.union(min2).show() > /* prints: > +---+---+ > | id|min| > +---+---+ > | A| 1| > | A| 1| > +---+---+ > Should be > | A| 1| > | A| 7| > */ > {code} > I looked into the explain pattern and saw > {code} > Union > :- *HashAggregate(keys=[id#93], functions=[min(col1#94)]) > : +- Exchange hashpartitioning(id#93, 200) > : +- *HashAggregate(keys=[id#93], functions=[partial_min(col1#94)]) > :+- *Scan > 
org.apache.spark.sql.cassandra.CassandraSourceRelation@7ec20844 > [id#93,col1#94] > +- *HashAggregate(keys=[id#93], functions=[min(col2#95)]) >+- ReusedExchange [id#93, min#153], Exchange hashpartitioning(id#93, 200) > {code} > Which was different than using a parallelized collection as the DF backing. > So I tested the same code with a Parquet backed DF and saw the same results. > {code} > frame.write.parquet("garbagetest") > val parquet = sparkSession.read.parquet("garbagetest").select("id", > "col1", "col2") > println("PDF") > parquetmin1.union(parquetmin2).explain() > parquetmin1.union(parquetmin2).show() > /* prints: > +---+---+ > | id|min| > +---+---+ > | A| 1| > | A| 1| > +---+---+ > */ > {code} > Which leads me to believe there is something wrong with the reused exchange.
[jira] [Commented] (SPARK-17673) Reused Exchange Aggregations Produce Incorrect Results
[ https://issues.apache.org/jira/browse/SPARK-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527339#comment-15527339 ] Eric Liang commented on SPARK-17673: I'm looking at this now. > Reused Exchange Aggregations Produce Incorrect Results > -- > > Key: SPARK-17673 > URL: https://issues.apache.org/jira/browse/SPARK-17673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Russell Spitzer >Priority: Blocker > Labels: correctness > > https://datastax-oss.atlassian.net/browse/SPARKC-429 > Was brought to my attention where the following code produces incorrect > results > {code} > val data = List(TestData("A", 1, 7)) > val frame = > session.sqlContext.createDataFrame(session.sparkContext.parallelize(data)) > frame.createCassandraTable( > keySpaceName, > table, > partitionKeyColumns = Some(Seq("id"))) > frame > .write > .format("org.apache.spark.sql.cassandra") > .mode(SaveMode.Append) > .options(Map("table" -> table, "keyspace" -> keySpaceName)) > .save() > val loaded = sparkSession.sqlContext > .read > .format("org.apache.spark.sql.cassandra") > .options(Map("table" -> table, "keyspace" -> ks)) > .load() > .select("id", "col1", "col2") > val min1 = loaded.groupBy("id").agg(min("col1").as("min")) > val min2 = loaded.groupBy("id").agg(min("col2").as("min")) > min1.union(min2).show() > /* prints: > +---+---+ > | id|min| > +---+---+ > | A| 1| > | A| 1| > +---+---+ > Should be > | A| 1| > | A| 7| > */ > {code} > I looked into the explain pattern and saw > {code} > Union > :- *HashAggregate(keys=[id#93], functions=[min(col1#94)]) > : +- Exchange hashpartitioning(id#93, 200) > : +- *HashAggregate(keys=[id#93], functions=[partial_min(col1#94)]) > :+- *Scan > org.apache.spark.sql.cassandra.CassandraSourceRelation@7ec20844 > [id#93,col1#94] > +- *HashAggregate(keys=[id#93], functions=[min(col2#95)]) >+- ReusedExchange [id#93, min#153], Exchange hashpartitioning(id#93, 200) > {code} > Which was different 
than using a parallelized collection as the DF backing. > So I tested the same code with a Parquet backed DF and saw the same results. > {code} > frame.write.parquet("garbagetest") > val parquet = sparkSession.read.parquet("garbagetest").select("id", > "col1", "col2") > println("PDF") > parquetmin1.union(parquetmin2).explain() > parquetmin1.union(parquetmin2).show() > /* prints: > +---+---+ > | id|min| > +---+---+ > | A| 1| > | A| 1| > +---+---+ > */ > {code} > Which leads me to believe there is something wrong with the reused exchange.
[jira] [Created] (SPARK-17472) Better error message for serialization failures of large objects in Python
Eric Liang created SPARK-17472: -- Summary: Better error message for serialization failures of large objects in Python Key: SPARK-17472 URL: https://issues.apache.org/jira/browse/SPARK-17472 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Liang Priority: Minor
{code}
def run():
    import numpy.random as nr
    b = nr.bytes(8 * 10)
    sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()

run()
{code}
Gives you the following error from pickle:
{code}
error: 'i' format requires -2147483648 <= number <= 2147483647
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input> in <module>()
      4     sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
      5
----> 6 run()

<ipython-input> in run()
      2     import numpy.random as nr
      3     b = nr.bytes(8 * 10)
----> 4     sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
      5
      6 run()

/databricks/spark/python/pyspark/rdd.pyc in count(self)
   1002         3
   1003         """
-> 1004         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1005
   1006     def stats(self):

/databricks/spark/python/pyspark/rdd.pyc in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996
    997     def count(self):

/databricks/spark/python/pyspark/rdd.pyc in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871

/databricks/spark/python/pyspark/rdd.pyc in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773

/databricks/spark/python/pyspark/rdd.pyc in _jrdd(self)
   2377         command = (self.func, profiler, self._prev_jrdd_deserializer,
   2378                    self._jrdd_deserializer)
-> 2379         pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self)
   2380         python_rdd = self.ctx._jvm.PythonRDD(self._prev_jrdd.rdd(),
   2381                                              bytearray(pickled_cmd),

/databricks/spark/python/pyspark/rdd.pyc in _prepare_for_python_RDD(sc, command, obj)
   2297     # the serialized command will be compressed by broadcast
   2298     ser = CloudPickleSerializer()
-> 2299     pickled_command = ser.dumps(command)
   2300     if len(pickled_command) > (1 << 20):  # 1M
   2301         # The broadcast will have same life cycle as created PythonRDD

/databricks/spark/python/pyspark/serializers.pyc in dumps(self, obj)
    426
    427     def dumps(self, obj):
--> 428         return cloudpickle.dumps(obj, 2)
    429
    430

/databricks/spark/python/pyspark/cloudpickle.pyc in dumps(obj, protocol)
    655
    656     cp = CloudPickler(file, protocol)
--> 657     cp.dump(obj)
    658
    659     return file.getvalue()

/databricks/spark/python/pyspark/cloudpickle.pyc in dump(self, obj)
    105         self.inject_addons()
    106         try:
--> 107             return Pickler.dump(self, obj)
    108         except RuntimeError as e:
    109             if 'recursion' in e.args[0]:

/usr/lib/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226

/usr/lib/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj)  # Call unbound method with explicit self
    287             return
    288

/usr/lib/python2.7/pickle.pyc in save_tuple(self, obj)
    560             write(MARK)
    561             for element in obj:
--> 562                 save(element)
    563
    564         if id(obj) in memo:

/usr/lib/python2.7/pickle.pyc in save(self, obj)
    284         f = self.dispatch.get(t)
    285         if f:
--> 286             f(self, obj)  # Call unbound method with explicit self
    287             return
    288

/databricks/spark/python/pyspark/cloudpickle.pyc in save_function(self, obj, name)
    202             klass = getattr(themodule, name, None)
    203             if klass is None or klass is not obj:
--> 204                 self.save_function_tuple(obj)
    205                 return
    206

/databricks/spark/python/pyspark/cloudpickle.pyc in save_function_tuple(self, func)
    239         # create a skeleton function object and memoize it
    240
{code}
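The error itself is not Spark-specific: pickle protocol 2 stores string and bytes lengths in 32-bit signed fields, so any serialized payload past 2 GiB overflows them. A minimal illustration of the underlying limit (plain Python, no Spark required):

```python
import struct

# pickle protocol 2 writes collection/string lengths as 32-bit signed ints.
# A pickled command larger than 2 GiB (e.g. a closure capturing a huge
# numpy buffer) overflows that field, which is where the cryptic
# "'i' format requires ..." struct error originates.
struct.pack('i', 2 ** 31 - 1)  # INT_MAX: still fits in the 4-byte field

try:
    struct.pack('i', 2 ** 31)  # one past INT_MAX: what an oversized length hits
except struct.error as e:
    print("overflow:", e)
```

In practice the usual workaround is to ship large data to executors with sc.broadcast() rather than capturing it in the task closure, which is also the behavior this issue's error message should point users toward.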
[jira] [Updated] (SPARK-17370) Shuffle service files not invalidated when a slave is lost
[ https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17370: --- Component/s: Spark Core > Shuffle service files not invalidated when a slave is lost > -- > > Key: SPARK-17370 > URL: https://issues.apache.org/jira/browse/SPARK-17370 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Eric Liang > > DAGScheduler invalidates shuffle files when an executor loss event occurs, > but not when the external shuffle service is enabled. This is because when > shuffle service is on, the shuffle file lifetime can exceed the executor > lifetime. > However, it doesn't invalidate shuffle files when the shuffle service itself > is lost (due to whole slave loss). This can cause long hangs when slaves are > lost since the file loss is not detected until a subsequent stage attempts to > read the shuffle files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17371) Resubmitted stage outputs deleted by zombie map tasks on stop()
Eric Liang created SPARK-17371: -- Summary: Resubmitted stage outputs deleted by zombie map tasks on stop() Key: SPARK-17371 URL: https://issues.apache.org/jira/browse/SPARK-17371 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Eric Liang It seems that old shuffle map tasks that hang around after a stage resubmit will delete the intended shuffle output files on stop(), causing downstream stages to fail even after the resubmitted stage completes successfully. This can happen easily if the prior map task is waiting on a network timeout when its stage is resubmitted. The result is unnecessary stage resubmits, sometimes several in a row, and very confusing FetchFailure messages that report shuffle index files missing from the local disk. Given that IndexShuffleBlockResolver commits data atomically, it seems unnecessary to ever delete committed task output: even in the rare case that a task is marked failed after it finishes committing its shuffle output, it should be safe to retain that output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
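The atomicity argument above can be sketched as the usual write-to-temp-then-rename pattern (a hypothetical helper for illustration, not the actual IndexShuffleBlockResolver code):

```python
import os
import tempfile

def commit_shuffle_output(data, final_path):
    # Write to a temp file in the same directory, then rename into place.
    # The rename is atomic on POSIX filesystems, so a task killed mid-write
    # never leaves a partial committed file -- and once the rename has
    # happened, the output is complete and safe to retain even if the task
    # is later marked as failed, which is the report's point: there is no
    # need for a zombie task's stop() path to delete it.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.rename(tmp, final_path)
```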
[jira] [Created] (SPARK-17370) Shuffle service files not invalidated when a slave is lost
Eric Liang created SPARK-17370: -- Summary: Shuffle service files not invalidated when a slave is lost Key: SPARK-17370 URL: https://issues.apache.org/jira/browse/SPARK-17370 Project: Spark Issue Type: Bug Reporter: Eric Liang DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime. However, it doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
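The invalidation rule the report asks for can be sketched as follows (method names are hypothetical, not the real DAGScheduler/MapOutputTracker API):

```python
def handle_executor_lost(tracker, executor_id, host, host_lost,
                         shuffle_service_enabled):
    # Without an external shuffle service, shuffle files live and die with
    # the executor, so executor loss invalidates its map outputs. With the
    # service enabled, files outlive executors -- but they should still be
    # invalidated when the whole slave (and its shuffle service) is lost,
    # which is the case the current code misses.
    if not shuffle_service_enabled:
        tracker.unregister_executor(executor_id)
    elif host_lost:
        tracker.unregister_host(host)
```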
[jira] [Commented] (SPARK-17042) Repl-defined classes cannot be replicated
[ https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431616#comment-15431616 ] Eric Liang commented on SPARK-17042: Yeah, my bad. I was trying to split this up but it turns out to be unnecessary. > Repl-defined classes cannot be replicated > - > > Key: SPARK-17042 > URL: https://issues.apache.org/jira/browse/SPARK-17042 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Eric Liang > > A simple fix is to erase the classTag when using the default serializer, > since it's not needed in that case, and the classTag was failing to > deserialize on the remote end. > The proper fix is actually to use the right classloader when deserializing > the classtags, but that is a much more invasive change for 2.0. > The following test can be added to ReplSuite to reproduce the bug: > {code} > test("replicating blocks of object with class defined in repl") { > val output = runInterpreter("local-cluster[2,1,1024]", > """ > |import org.apache.spark.storage.StorageLevel._ > |case class Foo(i: Int) > |val ret = sc.parallelize((1 to 100).map(Foo), > 10).persist(MEMORY_ONLY_2) > |ret.count() > |sc.getExecutorStorageStatus.map(s => > s.rddBlocksById(ret.id).size).sum > """.stripMargin) > assertDoesNotContain("error:", output) > assertDoesNotContain("Exception", output) > assertContains(": Int = 20", output) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
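The "erase the classTag" workaround described above can be sketched generically (a plain-Python stand-in for the Scala serializer path; all names are hypothetical):

```python
import pickle

def serialize_block(value, class_tag, using_default_serializer):
    # When the default serializer is used, the classTag is not needed to
    # reconstruct the block, so erase it before serialization: shipping a
    # tag that names a REPL-defined class is exactly what fails to
    # deserialize on the remote end under the wrong classloader.
    tag = None if using_default_serializer else class_tag
    return pickle.dumps((tag, value))

def deserialize_block(payload):
    tag, value = pickle.loads(payload)
    return tag, value
```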
[jira] [Created] (SPARK-17162) Range does not support SQL generation
Eric Liang created SPARK-17162: -- Summary: Range does not support SQL generation Key: SPARK-17162 URL: https://issues.apache.org/jira/browse/SPARK-17162 Project: Spark Issue Type: Bug Reporter: Eric Liang Priority: Minor
{code}
scala> sql("create view a as select * from range(100)")
16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as select * from range(100)
java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, splits=8)
  at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212)
  at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
  at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
  at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
  at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
  at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
  at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
  at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97)
  at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174)
  at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
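One plausible fix direction (a hedged sketch only; the actual SQLBuilder change may differ) is to render a Range logical plan using the range() table-valued function proposed in SPARK-17069, instead of throwing:

```python
def range_plan_to_sql(start, end, step=1):
    # Hypothetical rendering of a Range(start, end, step) logical plan:
    # emit the equivalent range() table-valued function call rather than
    # raising UnsupportedOperationException in SQLBuilder.toSQL.
    return "SELECT `id` FROM range({}, {}, {})".format(start, end, step)
```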
[jira] [Created] (SPARK-17069) Expose spark.range() as table-valued function in SQL
Eric Liang created SPARK-17069: -- Summary: Expose spark.range() as table-valued function in SQL Key: SPARK-17069 URL: https://issues.apache.org/jira/browse/SPARK-17069 Project: Spark Issue Type: New Feature Components: SQL Reporter: Eric Liang Priority: Minor The idea here is to create the spark.range(x) equivalent in SQL, so we can do something like {noformat} select count(*) from range(1) {noformat} This would be useful for SQL-only testing and benchmarks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
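The semantics of the proposed range() table-valued function can be sketched in plain Python (illustrative only; it mirrors spark.range(start, end, step)):

```python
def sql_range(end, start=0, step=1):
    # A one-column table of integers over [start, end) with the given step,
    # matching spark.range(). With this, `select count(*) from range(1)`
    # is just the row count of sql_range(1).
    return [(i,) for i in range(start, end, step)]

assert len(sql_range(1)) == 1          # select count(*) from range(1)
assert sql_range(3) == [(0,), (1,), (2,)]
```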
[jira] [Updated] (SPARK-17042) Repl-defined classes cannot be replicated
[ https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17042: --- Description: A simple fix is to erase the classTag when using the default serializer, since it's not needed in that case, and the classTag was failing to deserialize on the remote end. The proper fix is actually to use the right classloader when deserializing the classtags, but that is a much more invasive change for 2.0. The following test can be added to ReplSuite to reproduce the bug: {code} test("replicating blocks of object with class defined in repl") { val output = runInterpreter("local-cluster[2,1,1024]", """ |import org.apache.spark.storage.StorageLevel._ |case class Foo(i: Int) |val ret = sc.parallelize((1 to 100).map(Foo), 10).persist(MEMORY_ONLY_2) |ret.count() |sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum """.stripMargin) assertDoesNotContain("error:", output) assertDoesNotContain("Exception", output) assertContains(": Int = 20", output) } {code} was: The following test can be added to ReplSuite to reproduce the bug: {code} test("replicating blocks of object with class defined in repl") { val output = runInterpreter("local-cluster[2,1,1024]", """ |import org.apache.spark.storage.StorageLevel._ |case class Foo(i: Int) |val ret = sc.parallelize((1 to 100).map(Foo), 10).persist(MEMORY_ONLY_2) |ret.count() |sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum """.stripMargin) assertDoesNotContain("error:", output) assertDoesNotContain("Exception", output) assertContains(": Int = 20", output) } {code} > Repl-defined classes cannot be replicated > - > > Key: SPARK-17042 > URL: https://issues.apache.org/jira/browse/SPARK-17042 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Eric Liang > > A simple fix is to erase the classTag when using the default serializer, > since it's not needed in that case, and the classTag was 
failing to > deserialize on the remote end. > The proper fix is actually to use the right classloader when deserializing > the classtags, but that is a much more invasive change for 2.0. > The following test can be added to ReplSuite to reproduce the bug: > {code} > test("replicating blocks of object with class defined in repl") { > val output = runInterpreter("local-cluster[2,1,1024]", > """ > |import org.apache.spark.storage.StorageLevel._ > |case class Foo(i: Int) > |val ret = sc.parallelize((1 to 100).map(Foo), > 10).persist(MEMORY_ONLY_2) > |ret.count() > |sc.getExecutorStorageStatus.map(s => > s.rddBlocksById(ret.id).size).sum > """.stripMargin) > assertDoesNotContain("error:", output) > assertDoesNotContain("Exception", output) > assertContains(": Int = 20", output) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17042) Repl-defined classes cannot be replicated
Eric Liang created SPARK-17042: -- Summary: Repl-defined classes cannot be replicated Key: SPARK-17042 URL: https://issues.apache.org/jira/browse/SPARK-17042 Project: Spark Issue Type: Sub-task Reporter: Eric Liang The following test can be added to ReplSuite to reproduce the bug: {code} test("replicating blocks of object with class defined in repl") { val output = runInterpreter("local-cluster[2,1,1024]", """ |import org.apache.spark.storage.StorageLevel._ |case class Foo(i: Int) |val ret = sc.parallelize((1 to 100).map(Foo), 10).persist(MEMORY_ONLY_2) |ret.count() |sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum """.stripMargin) assertDoesNotContain("error:", output) assertDoesNotContain("Exception", output) assertContains(": Int = 20", output) } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file
[ https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-16884: --- Issue Type: Improvement (was: Bug) > Move DataSourceScanExec out of ExistingRDD.scala file > - > > Key: SPARK-16884 > URL: https://issues.apache.org/jira/browse/SPARK-16884 > Project: Spark > Issue Type: Improvement >Reporter: Eric Liang >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org