[jira] [Updated] (SPARK-27669) Refactor DataFrameWriter to resolve datasources in a command

2019-05-09 Thread Eric Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-27669:
---
Summary: Refactor DataFrameWriter to resolve datasources in a command  
(was: Refactor DataFrameWriter to always go through Catalyst for analysis)

> Refactor DataFrameWriter to resolve datasources in a command
> 
>
> Key: SPARK-27669
> URL: https://issues.apache.org/jira/browse/SPARK-27669
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Eric Liang
>Priority: Major
>
> Currently, DataFrameWriter.save() does a large amount of ad-hoc analysis 
> (e.g., loading data source classes, validating options, and so on) before 
> executing the command.
> The execution of this code falls outside the scope of any SQL execution, 
> which is unfortunate since it means it's untracked by Spark (e.g., in the 
> Spark UI), and also means df.write ops cannot be manipulated by custom 
> Catalyst rules prior to execution.
> These issues can be largely resolved by creating a command that represents 
> df.write.save/saveAsTable().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27669) Refactor DataFrameWriter to always go through Catalyst for analysis

2019-05-09 Thread Eric Liang (JIRA)
Eric Liang created SPARK-27669:
--

 Summary: Refactor DataFrameWriter to always go through Catalyst 
for analysis
 Key: SPARK-27669
 URL: https://issues.apache.org/jira/browse/SPARK-27669
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Eric Liang


Currently, DataFrameWriter.save() does a large amount of ad-hoc analysis (e.g., 
loading data source classes, validating options, and so on) before executing 
the command.

The execution of this code falls outside the scope of any SQL execution, which 
is unfortunate since it means it's untracked by Spark (e.g., in the Spark UI), 
and also means df.write ops cannot be manipulated by custom Catalyst rules 
prior to execution.

These issues can be largely resolved by creating a command that represents 
df.write.save/saveAsTable().
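
For illustration, a small snippet showing the eager resolution described above 
(the exception type and message vary by Spark version, so the catch here is 
deliberately broad):

{code}
// The format lookup and option validation happen inside DataFrameWriter.save()
// itself, before any SQL execution is started, so this failure (and, for a
// successful write, the datasource resolution) is not tied to an execution
// that shows up in the Spark UI or that Catalyst rules could rewrite.
try {
  spark.range(10).write.format("no.such.datasource").save("/tmp/out")
} catch {
  case e: Exception => println(s"resolved eagerly, outside any SQL execution: $e")
}
{code}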



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27392) TestHive test tables should be placed in shared test state, not per session

2019-04-04 Thread Eric Liang (JIRA)
Eric Liang created SPARK-27392:
--

 Summary: TestHive test tables should be placed in shared test 
state, not per session
 Key: SPARK-27392
 URL: https://issues.apache.org/jira/browse/SPARK-27392
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.1
Reporter: Eric Liang


If the test tables are registered per session, tests that use tables from 
multiple sessions will run into issues when they access the same table. The 
correct location is the shared test state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23971) Should not leak Spark sessions across test suites

2018-04-12 Thread Eric Liang (JIRA)
Eric Liang created SPARK-23971:
--

 Summary: Should not leak Spark sessions across test suites
 Key: SPARK-23971
 URL: https://issues.apache.org/jira/browse/SPARK-23971
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Eric Liang


Many suites currently leak Spark sessions (sometimes with stopped 
SparkContexts) via the thread-local active Spark session and default Spark 
session. We should attempt to clean these up and detect when this happens to 
improve the reproducibility of tests.
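
A minimal cleanup sketch for a suite-level teardown hook, using the public 
clearActiveSession/clearDefaultSession methods on the SparkSession companion 
object:

{code}
import org.apache.spark.sql.SparkSession

// Run in afterAll()/afterEach(): drop the thread-local active session and the
// global default session so a (possibly stopped) session cannot leak into the
// next suite.
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
{code}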



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23809) Active SparkSession should be set by getOrCreate

2018-03-28 Thread Eric Liang (JIRA)
Eric Liang created SPARK-23809:
--

 Summary: Active SparkSession should be set by getOrCreate
 Key: SPARK-23809
 URL: https://issues.apache.org/jira/browse/SPARK-23809
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Eric Liang


Currently, the active SparkSession is set inconsistently (e.g., in 
createDataFrame, prior to query execution). Many places in Spark also 
incorrectly query the active session when they should be calling 
activeSession.getOrElse(defaultSession).

The semantics here can be cleaned up if we also set the active session when the 
default session is set.
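
For reference, a sketch of the lookup order callers should be using 
(getActiveSession and getDefaultSession are the public accessors on the 
SparkSession companion object):

{code}
import org.apache.spark.sql.SparkSession

// Prefer the thread-local active session; fall back to the default session.
val session = SparkSession.getActiveSession
  .orElse(SparkSession.getDefaultSession)
  .getOrElse(throw new IllegalStateException("no active or default SparkSession"))
{code}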



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987966#comment-15987966
 ] 

Eric Liang commented on SPARK-18727:


+1 for supporting ALTER TABLE REPLACE COLUMNS

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18727:
---

The common case we see is users having a complete schema (e.g., the output of
an ETL pipeline) and wanting to update/merge it in an automated job. In this
case it's actually more work to alter the columns one at a time rather
than all at once.
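
For reference, replacing the whole column list in one statement would look 
roughly like the following Hive-style DDL (table and column names are made up; 
whether Spark accepts this for a given table type depends on the version):

{code}
// Replace the entire non-partition column list at once instead of issuing one
// ALTER TABLE ... ADD/CHANGE COLUMN per column.
spark.sql("""
  ALTER TABLE events REPLACE COLUMNS (
    id BIGINT,
    payload STRING,
    ingested_at TIMESTAMP
  )
""")
{code}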




> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table

2017-04-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18727:
---

Can we add ALTER TABLE SCHEMA to update the entire schema? That would cover
any edge cases.




> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20450) Unexpected first-query schema inference cost with 2.1.1 RC

2017-04-24 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981753#comment-15981753
 ] 

Eric Liang commented on SPARK-20450:


I'm not sure what you mean by new issue, but it's only in the latest 2.1.1 
release afaik. It was not present in the original 2.1

> Unexpected first-query schema inference cost with 2.1.1 RC
> --
>
> Key: SPARK-20450
> URL: https://issues.apache.org/jira/browse/SPARK-20450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Eric Liang
>
> https://issues.apache.org/jira/browse/SPARK-19611 fixes a regression from 2.0 
> where Spark silently fails to read case-sensitive fields when the table 
> properties are missing a case-sensitive schema. The fix is to detect this 
> situation, infer the schema, and write the case-sensitive schema into the 
> metastore.
> However, this can incur an unexpected performance hit the first time such a 
> problematic table is queried (and there is a high false-positive rate here, 
> since most tables don't actually have case-sensitive fields).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20450) Unexpected first-query schema inference cost with 2.1.1 RC

2017-04-24 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981753#comment-15981753
 ] 

Eric Liang edited comment on SPARK-20450 at 4/24/17 7:40 PM:
-

I'm not sure what you mean by new issue, but it's only in the latest 2.1.1 RC 
afaik. It was not present in the original 2.1


was (Author: ekhliang):
I'm not sure what you mean by new issue, but it's only in the latest 2.1.1 
release afaik. It was not present in the original 2.1

> Unexpected first-query schema inference cost with 2.1.1 RC
> --
>
> Key: SPARK-20450
> URL: https://issues.apache.org/jira/browse/SPARK-20450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Eric Liang
>
> https://issues.apache.org/jira/browse/SPARK-19611 fixes a regression from 2.0 
> where Spark silently fails to read case-sensitive fields when the table 
> properties are missing a case-sensitive schema. The fix is to detect this 
> situation, infer the schema, and write the case-sensitive schema into the 
> metastore.
> However, this can incur an unexpected performance hit the first time such a 
> problematic table is queried (and there is a high false-positive rate here, 
> since most tables don't actually have case-sensitive fields).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20450) Unexpected first-query schema inference cost with 2.1.1 RC

2017-04-24 Thread Eric Liang (JIRA)
Eric Liang created SPARK-20450:
--

 Summary: Unexpected first-query schema inference cost with 2.1.1 RC
 Key: SPARK-20450
 URL: https://issues.apache.org/jira/browse/SPARK-20450
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Eric Liang
Priority: Blocker


https://issues.apache.org/jira/browse/SPARK-19611 fixes a regression from 2.0 
where Spark silently fails to read case-sensitive fields when the table 
properties are missing a case-sensitive schema. The fix is to detect this 
situation, infer the schema, and write the case-sensitive schema into the 
metastore.

However, this can incur an unexpected performance hit the first time such a 
problematic table is queried (and there is a high false-positive rate here, 
since most tables don't actually have case-sensitive fields).
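
As a reference point, SPARK-19611 also added a config controlling this 
behavior; if the tables are known not to have case-sensitive fields, inference 
can be disabled entirely (config name per SPARK-19611; availability depends on 
the exact release):

{code}
// Skip the first-query schema inference for Hive tables that have no
// case-sensitive schema stored in the table properties.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
{code}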



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20398) range() operator should include cancellation reason when killed

2017-04-19 Thread Eric Liang (JIRA)
Eric Liang created SPARK-20398:
--

 Summary: range() operator should include cancellation reason when 
killed
 Key: SPARK-20398
 URL: https://issues.apache.org/jira/browse/SPARK-20398
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Eric Liang


https://issues.apache.org/jira/browse/SPARK-19820 adds a reason field for why 
tasks were killed. However, for backwards compatibility it kept the old 
TaskKilledException constructor, which defaults to "unknown reason".

The range() operator should use the constructor that fills in the reason rather 
than dropping it when a task is killed.
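
A sketch of the intended call site (getKillReason() and the reason-taking 
TaskKilledException constructor come from SPARK-19820; the exact place this 
check lives in the range() operator's generated code is not shown here):

{code}
import org.apache.spark.{TaskContext, TaskKilledException}

// Inside a long-running loop on the executor: surface the driver-provided kill
// reason instead of falling back to the no-arg constructor's "unknown reason".
val ctx = TaskContext.get()
if (ctx != null && ctx.isInterrupted()) {
  throw new TaskKilledException(ctx.getKillReason().getOrElse("unknown reason"))
}
{code}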



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20358) Executors failing stage on interrupted exception thrown by cancelled tasks

2017-04-17 Thread Eric Liang (JIRA)
Eric Liang created SPARK-20358:
--

 Summary: Executors failing stage on interrupted exception thrown 
by cancelled tasks
 Key: SPARK-20358
 URL: https://issues.apache.org/jira/browse/SPARK-20358
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Eric Liang


https://issues.apache.org/jira/browse/SPARK-20217 introduced a regression where 
interrupted exceptions now cause a task to fail on cancellation. This is 
because NonFatal(e) does not match when e is an InterruptedException.
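
For context, NonFatal treats InterruptedException as fatal, so a handler 
written as "case NonFatal(e) => ..." never sees the interrupt; a 
cancellation-aware handler has to match it explicitly, e.g.:

{code}
import scala.util.control.NonFatal

try {
  Thread.sleep(1000)
} catch {
  case _: InterruptedException =>
    // Task cancellation: should be reported as TaskKilled, not a stage failure.
    println("task interrupted")
  case NonFatal(e) =>
    // Genuine task failure.
    println(s"task failed: $e")
}
{code}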



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20217) Executor should not fail stage if killed task throws non-interrupted exception

2017-04-04 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-20217:
---
Description: 
This is reproducible as follows. Run the following, and then use 
SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will 
fail since we threw a RuntimeException instead of InterruptedException.

We should probably unconditionally return TaskKilled instead of TaskFailed if 
the task was killed by the driver, regardless of the actual exception thrown.

{code}
spark.range(100).repartition(100).foreach { i =>
  try {
Thread.sleep(1000)
  } catch {
case t: InterruptedException =>
  throw new RuntimeException(t)
  }
}
{code}

Based on the code in TaskSetManager, I think this also affects kills of 
speculative tasks. However, since the number of speculated tasks is few, and 
usually you need to fail a task a few times before the stage is cancelled, 
probably no-one noticed this in production.

  was:
This is reproducible as follows. Run the following, and then use 
SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will 
fail since we threw a RuntimeException instead of InterruptedException.

We should probably unconditionally return TaskKilled instead of TaskFailed if 
the task was killed by the driver, regardless of the actual exception thrown.

{code}
spark.range(100).repartition(100).foreach { i =>
  try {
Thread.sleep(1000)
  } catch {
case t: InterruptedException =>
  throw new RuntimeException(t)
  }
}
{code}



> Executor should not fail stage if killed task throws non-interrupted exception
> --
>
> Key: SPARK-20217
> URL: https://issues.apache.org/jira/browse/SPARK-20217
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Eric Liang
>
> This is reproducible as follows. Run the following, and then use 
> SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will 
> fail since we threw a RuntimeException instead of InterruptedException.
> We should probably unconditionally return TaskKilled instead of TaskFailed if 
> the task was killed by the driver, regardless of the actual exception thrown.
> {code}
> spark.range(100).repartition(100).foreach { i =>
>   try {
> Thread.sleep(1000)
>   } catch {
> case t: InterruptedException =>
>   throw new RuntimeException(t)
>   }
> }
> {code}
> Based on the code in TaskSetManager, I think this also affects kills of 
> speculative tasks. However, since the number of speculated tasks is few, and 
> usually you need to fail a task a few times before the stage is cancelled, 
> probably no-one noticed this in production.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20217) Executor should not fail stage if killed task throws non-interrupted exception

2017-04-04 Thread Eric Liang (JIRA)
Eric Liang created SPARK-20217:
--

 Summary: Executor should not fail stage if killed task throws 
non-interrupted exception
 Key: SPARK-20217
 URL: https://issues.apache.org/jira/browse/SPARK-20217
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Eric Liang


This is reproducible as follows. Run the following, and then use 
SparkContext.killTaskAttempt to kill one of the tasks. The entire stage will 
fail since we threw a RuntimeException instead of InterruptedException.

We should probably unconditionally return TaskKilled instead of TaskFailed if 
the task was killed by the driver, regardless of the actual exception thrown.

{code}
spark.range(100).repartition(100).foreach { i =>
  try {
Thread.sleep(1000)
  } catch {
case t: InterruptedException =>
  throw new RuntimeException(t)
  }
}
{code}
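
Illustrative only (not the actual Executor code path): the proposed behavior is 
to classify the failure by whether the driver marked the task as killed, 
regardless of which exception the user code threw.

{code}
import org.apache.spark.TaskContext

// If the task was killed by the driver, report TaskKilled even though user
// code wrapped the interrupt in a RuntimeException; otherwise it is a real
// TaskFailed.
def classifyFailure(t: Throwable): String = {
  val ctx = TaskContext.get()
  if (ctx != null && ctx.isInterrupted()) "TaskKilled" else s"TaskFailed: $t"
}
{code}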




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20148) Extend the file commit interface to allow subscribing to task commit messages

2017-03-29 Thread Eric Liang (JIRA)
Eric Liang created SPARK-20148:
--

 Summary: Extend the file commit interface to allow subscribing to 
task commit messages
 Key: SPARK-20148
 URL: https://issues.apache.org/jira/browse/SPARK-20148
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Eric Liang
Priority: Minor


The internal FileCommitProtocol interface returns all task commit messages in 
bulk to the implementation when a job finishes. However, it is sometimes useful 
to access those messages as tasks commit, so that the driver can report 
incremental progress before the job finishes.
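
A sketch of what such a hook could look like on top of the existing 
HadoopMapReduceCommitProtocol; the onTaskCommit name and signature are what 
this ticket proposes adding, so treat them as assumptions about the eventual 
API:

{code}
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Observe each task's commit message on the driver as it arrives, instead of
// only receiving all of them in bulk when the job finishes.
class ProgressReportingCommitProtocol(jobId: String, path: String)
    extends HadoopMapReduceCommitProtocol(jobId, path) {

  override def onTaskCommit(taskCommit: TaskCommitMessage): Unit = {
    // e.g. push an incremental progress update to the driver side.
    println(s"task committed: $taskCommit")
  }
}
{code}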



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19820) Allow reason to be specified for task kill

2017-03-04 Thread Eric Liang (JIRA)
Eric Liang created SPARK-19820:
--

 Summary: Allow reason to be specified for task kill
 Key: SPARK-19820
 URL: https://issues.apache.org/jira/browse/SPARK-19820
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19183) Add deleteWithJob hook to internal commit protocol API

2017-01-11 Thread Eric Liang (JIRA)
Eric Liang created SPARK-19183:
--

 Summary: Add deleteWithJob hook to internal commit protocol API
 Key: SPARK-19183
 URL: https://issues.apache.org/jira/browse/SPARK-19183
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Eric Liang


Currently in SQL we implement overwrites by calling fs.delete() directly on the 
original data. This is not ideal, since the original files end up deleted 
even if the job aborts. We should extend the commit protocol to allow file 
overwrites to be managed as well.
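
A sketch of the proposed hook, overriding the default eager delete so that 
deletes can be deferred until the job commits; the deleteWithJob name and 
signature follow this ticket's proposal and should be treated as assumptions:

{code}
import scala.collection.mutable.ArrayBuffer

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Record paths to delete instead of deleting them immediately; a real
// implementation would perform the deletes in commitJob() and drop them in
// abortJob(), so an aborted overwrite leaves the original files intact.
class DeferredDeleteCommitProtocol(jobId: String, path: String)
    extends HadoopMapReduceCommitProtocol(jobId, path) {

  private val pendingDeletes = ArrayBuffer[Path]()

  override def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
    pendingDeletes += path
    true
  }
}
{code}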



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737148#comment-15737148
 ] 

Eric Liang commented on SPARK-18814:


It seems that the references of an Alias expression should include the
referenced attribute, so I would expect #39 to still show up. I could be
misunderstanding the behavior of Alias though.




> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> 

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737068#comment-15737068
 ] 

Eric Liang commented on SPARK-18814:


[~rxin]

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> 

[jira] [Created] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-09 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18814:
--

 Summary: CheckAnalysis rejects TPCDS query 32
 Key: SPARK-18814
 URL: https://issues.apache.org/jira/browse/SPARK-18814
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Blocker


It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem to 
be any obvious error in the query or the check rule though: in the plan below, 
the scalar subquery's condition field is "scalar-subquery#24 
[(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by the 
scalar subquery predicates.

analysis error:
{code}
== Query: q32-v1.4 ==
 Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
in a scalar correlated subquery cannot contain non-correlated columns: 
cs_item_sk#39;;
GlobalLimit 100
+- LocalLimit 100
   +- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
  +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = cs_item_sk#39)) 
&& ((d_date#83 >= 2000-01-27) && (d_date#83 <= cast(cast(cast(cast(2000-01-27 
as date) as timestamp) + interval 12 weeks 6 days as date) as string && 
((d_date_sk#81 = cs_sold_date_sk#58) && (cast(cs_ext_discount_amt#46 as 
decimal(14,7)) > cast(scalar-subquery#24 [(cs_item_sk#39#111 = i_item_sk#59)] 
as decimal(14,7)
 :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
cs_item_sk#39#111]
 : +- Aggregate [cs_item_sk#39], 
[CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
 :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
 :   +- Join Inner
 :  :- SubqueryAlias catalog_sales
 :  :  +- 
Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
 10 more fields] parquet
 :  +- SubqueryAlias date_dim
 : +- 
Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
 4 more fields] parquet
 +- Join Inner
:- Join Inner
:  :- SubqueryAlias catalog_sales
:  :  +- 
Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
 10 more fields] parquet
:  +- SubqueryAlias item
: +- 
Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
 parquet
+- SubqueryAlias date_dim
   +- 
Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
 4 more fields] parquet

{code}

query text:
{code}
select sum(cs_ext_discount_amt) as `excess discount amount`
 from
catalog_sales, item, date_dim
 where
   

[jira] [Created] (SPARK-18727) Support schema evolution as new files are inserted into table

2016-12-05 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18727:
--

 Summary: Support schema evolution as new files are inserted into 
table
 Key: SPARK-18727
 URL: https://issues.apache.org/jira/browse/SPARK-18727
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Critical


Now that we have pushed partition management of all tables to the catalog, one 
issue for scalable partition handling remains: handling schema updates.

Currently, a schema update requires dropping and recreating the entire table, 
which does not scale well with the size of the table.

We should support updating the schema of the table, either via ALTER TABLE, or 
automatically as new files with compatible schemas are appended into the table.

cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18727) Support schema evolution as new files are inserted into table

2016-12-05 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18727:
---
Component/s: SQL

> Support schema evolution as new files are inserted into table
> -
>
> Key: SPARK-18727
> URL: https://issues.apache.org/jira/browse/SPARK-18727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> Now that we have pushed partition management of all tables to the catalog, 
> one issue for scalable partition handling remains: handling schema updates.
> Currently, a schema update requires dropping and recreating the entire table, 
> which does not scale well with the size of the table.
> We should support updating the schema of the table, either via ALTER TABLE, 
> or automatically as new files with compatible schemas are appended into the 
> table.
> cc [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18726) Filesystem unnecessarily scanned twice during creation of non-catalog table

2016-12-05 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18726:
--

 Summary: Filesystem unnecessarily scanned twice during creation of 
non-catalog table
 Key: SPARK-18726
 URL: https://issues.apache.org/jira/browse/SPARK-18726
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang


It seems that for non-catalog tables (e.g., spark.read.parquet(...)), we scan 
the filesystem twice: once for schema inference, and again to create a 
FileIndex class for the relation.

It would be better to combine these scans somehow, since this is the most 
costly step of creating a table. This is a follow-up ticket to 
https://github.com/apache/spark/pull/16090.

cc [~cloud_fan]
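
Until the two passes are combined, a workaround is to supply the schema 
explicitly so only the listing needed for the FileIndex remains (path and 
schema below are illustrative):

{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// With a user-specified schema there is no inference pass over the files.
val df = spark.read.schema(schema).parquet("/data/events")
{code}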



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18725) Creating a datasource table with schema should not scan all files for table

2016-12-05 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18725:
--

 Summary: Creating a datasource table with schema should not scan 
all files for table
 Key: SPARK-18725
 URL: https://issues.apache.org/jira/browse/SPARK-18725
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang


When creating an unpartitioned datasource table, Spark will scan the table 
location for all files to infer the schema, even if the user already provides a 
schema.

This is fixed for partitioned tables by 
https://github.com/apache/spark/pull/16090, but the unpartitioned case was left 
as a separate follow-up. Since unpartitioned tables are typically not large, 
this can probably be left out of the 2.1 release.

cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18679:
---
Component/s: SQL

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here where we no longer perform 
> listing in parallel for non-root directories. This forces file 
> listing to be completely serial when resolving datasource tables that are not 
> backed by an external catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18679:
--

 Summary: Regression in file listing performance
 Key: SPARK-18679
 URL: https://issues.apache.org/jira/browse/SPARK-18679
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang
Priority: Blocker


In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
InMemoryFileIndex).

It seems there is a performance regression here where we no longer perform 
listing in parallel for non-root directories. This forces file listing to be 
completely serial when resolving datasource tables that are not backed by an 
external catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18679) Regression in file listing performance

2016-12-01 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18679:
---
Affects Version/s: 2.1.0

> Regression in file listing performance
> --
>
> Key: SPARK-18679
> URL: https://issues.apache.org/jira/browse/SPARK-18679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to 
> InMemoryFileIndex).
> It seems there is a performance regression here where we no longer perform 
> listing in parallel for non-root directories. This forces file 
> listing to be completely serial when resolving datasource tables that are not 
> backed by an external catalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18661) Creating a partitioned datasource table should not scan all files in filesystem

2016-11-30 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18661:
--

 Summary: Creating a partitioned datasource table should not scan 
all files in filesystem
 Key: SPARK-18661
 URL: https://issues.apache.org/jira/browse/SPARK-18661
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Blocker


Even though in 2.1 creating a partitioned datasource table will not populate 
the partition data by default (until the user issues MSCK REPAIR TABLE), it 
seems we still scan the filesystem for no good reason.

We should avoid doing this when the user specifies a schema.
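
For reference, the intended 2.1 flow (table name and path are made up): the 
CREATE should not need any file scan when a schema is given, and partition 
metadata is recovered only when explicitly requested.

{code}
spark.sql("""
  CREATE TABLE logs (id BIGINT, value STRING, dt STRING)
  USING parquet
  OPTIONS (path '/data/logs')
  PARTITIONED BY (dt)
""")

// Partition discovery happens only when the user asks for it:
spark.sql("MSCK REPAIR TABLE logs")
{code}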



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18661) Creating a partitioned datasource table should not scan all files for table

2016-11-30 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18661:
---
Summary: Creating a partitioned datasource table should not scan all files 
for table  (was: Creating a partitioned datasource table should not scan all 
files in filesystem)

> Creating a partitioned datasource table should not scan all files for table
> ---
>
> Key: SPARK-18661
> URL: https://issues.apache.org/jira/browse/SPARK-18661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> Even though in 2.1 creating a partitioned datasource table will not populate 
> the partition data by default (until the user issues MSCK REPAIR TABLE), it 
> seems we still scan the filesystem for no good reason.
> We should avoid doing this when the user specifies a schema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18659:
---
Description: 
The first three test cases fail due to a crash in hive client when dropping 
partitions that don't contain files. The last one deletes too many files due to 
a partition case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}

  was:
The following test cases fail due to a crash in hive client when dropping 
partitions that don't contain files. The last one deletes too many files due to 
a partition case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}


> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The first three test cases fail due to a crash in hive client when dropping 
> partitions that don't contain files. The last one deletes too many files due 
> to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")

[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18659:
---
Description: 
The following test cases fail due to a crash in hive client when dropping 
partitions that don't contain files. The last one deletes too many files due to 
a partition case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}

  was:
The following test cases fail due to a crash in hive client when dropping 
partitions that don't contain files. The last one crashes due to a partition 
case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}


> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The following test cases fail due to a crash in hive client when dropping 
> partitions that don't contain files. The last one deletes too many files due 
> to a partition case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> 

[jira] [Updated] (SPARK-18659) Incorrect behaviors in overwrite table for datasource tables

2016-11-30 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18659:
---
Summary: Incorrect behaviors in overwrite table for datasource tables  
(was: Crash in overwrite table partitions due to hive metastore integration)

> Incorrect behaviors in overwrite table for datasource tables
> 
>
> Key: SPARK-18659
> URL: https://issues.apache.org/jira/browse/SPARK-18659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Blocker
>
> The following test cases fail due to a crash in hive client when dropping 
> partitions that don't contain files. The last one crashes due to a partition 
> case resolution failure.
> {code}
>   test("foo") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test select id, id, 'x' from 
> range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("bar") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a, b) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("baz") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (A, B) select id, id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 1)
> }
>   }
>   test("qux") {
> withTable("test") {
>   spark.range(10)
> .selectExpr("id", "id as A", "'x' as B")
> .write.partitionBy("A", "B").mode("overwrite")
> .saveAsTable("test")
>   spark.sql("insert overwrite table test partition (a=1, b) select id, 
> 'x' from range(1)")
>   assert(spark.sql("select * from test").count() == 10)
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18659) Crash in overwrite table partitions due to hive metastore integration

2016-11-30 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18659:
--

 Summary: Crash in overwrite table partitions due to hive metastore 
integration
 Key: SPARK-18659
 URL: https://issues.apache.org/jira/browse/SPARK-18659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Blocker


The following test cases fail due to a crash in the Hive client when dropping 
partitions that don't contain files. The last one crashes due to a partition 
case resolution failure.

{code}
  test("foo") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test select id, id, 'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("bar") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a, b) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("baz") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (A, B) select id, id, 
'x' from range(1)")
  assert(spark.sql("select * from test").count() == 1)
}
  }

  test("qux") {
withTable("test") {
  spark.range(10)
.selectExpr("id", "id as A", "'x' as B")
.write.partitionBy("A", "B").mode("overwrite")
.saveAsTable("test")
  spark.sql("insert overwrite table test partition (a=1, b) select id, 'x' 
from range(1)")
  assert(spark.sql("select * from test").count() == 10)
}
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18635) Partition name/values not escaped correctly in some cases

2016-11-29 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18635:
---
Target Version/s: 2.1.0
Priority: Critical  (was: Major)

> Partition name/values not escaped correctly in some cases
> -
>
> Key: SPARK-18635
> URL: https://issues.apache.org/jira/browse/SPARK-18635
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Priority: Critical
>
> For example, the following command does not insert data properly into the 
> table
> {code}
> spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as 
> B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18635) Partition name/values not escaped correctly in some cases

2016-11-29 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18635:
--

 Summary: Partition name/values not escaped correctly in some cases
 Key: SPARK-18635
 URL: https://issues.apache.org/jira/browse/SPARK-18635
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang


For example, the following command does not insert data properly into the table

{code}
spark.sqlContext.range(10).selectExpr("id", "id as A", "'A$\\=%' as 
B").write.partitionBy("A", "B").mode("overwrite").saveAsTable("testy")
{code}
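
The underlying problem is that the partition value ends up in a directory name, 
so characters such as {{\}}, {{=}} and {{%}} need Hive-style escaping. A purely 
illustrative sketch of that kind of escaping follows; the character set below is 
a representative subset, not the exact set Hive escapes.

{code}
// Illustrative sketch only: percent-encode characters that are unsafe in a
// partition directory name. The "unsafe" set is a representative subset.
def escapePartitionValue(value: String): String = {
  val unsafe = "\"#%'*/:=?\\{}[]^".toSet
  value.flatMap { c =>
    if (c < ' ' || unsafe.contains(c)) f"%%${c.toInt}%02X" else c.toString
  }
}

escapePartitionValue("A$\\=%")  // e.g. "A$%5C%3D%25" under this escape set
{code}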



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18545) Verify number of hive client RPCs in PartitionedTablePerfStatsSuite

2016-11-29 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18545:
---
Issue Type: Sub-task  (was: Test)
Parent: SPARK-17861

> Verify number of hive client RPCs in PartitionedTablePerfStatsSuite
> ---
>
> Key: SPARK-18545
> URL: https://issues.apache.org/jira/browse/SPARK-18545
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> To avoid performance regressions like 
> https://issues.apache.org/jira/browse/SPARK-18507 in the future, we should 
> add a metric for the number of Hive client RPCs issued and check it in the 
> perf stats suite.
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18507) Major performance regression in SHOW PARTITIONS on partitioned Hive tables

2016-11-29 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18507:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Major performance regression in SHOW PARTITIONS on partitioned Hive tables
> --
>
> Key: SPARK-18507
> URL: https://issues.apache.org/jira/browse/SPARK-18507
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Allman
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.1.0
>
>
> Commit {{ccb11543048dccd4cc590a8db1df1d9d5847d112}} 
> (https://github.com/apache/spark/commit/ccb11543048dccd4cc590a8db1df1d9d5847d112)
>  appears to have introduced a major regression in the performance of the Hive 
> {{SHOW PARTITIONS}} command. Running that command on a Hive table with 17,337 
> partitions in the {{spark-sql}} shell with the parent commit of {{ccb1154}} 
> takes approximately 7.3 seconds. Running the same command with commit 
> {{ccb1154}} takes approximately 250 seconds.
> I have not had the opportunity to complete a thorough investigation, but I 
> suspect the problem lies in the diff hunk beginning at 
> https://github.com/apache/spark/commit/ccb11543048dccd4cc590a8db1df1d9d5847d112#diff-159191585e10542f013cb3a714f26075L675.
>  If that's the case, this performance issue should manifest itself in other 
> areas as this programming pattern was used elsewhere in this commit.
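
For illustration, the kind of pattern that produces this slowdown is one 
metastore round trip per partition instead of a single batched request. A 
sketch, with a hypothetical client interface (not the actual Spark or Hive 
APIs):

{code}
// Hypothetical metastore client interface, for illustration only.
trait MetastoreClient {
  def listPartitionNames(table: String): Seq[String]
  def getPartition(table: String, name: String): AnyRef
  def getPartitionsByNames(table: String, names: Seq[String]): Seq[AnyRef]
}

def showPartitions(client: MetastoreClient, table: String): Seq[AnyRef] = {
  val names = client.listPartitionNames(table)   // ~17,000 names in the report above

  // Slow pattern: one metastore RPC per partition.
  // val partitions = names.map(n => client.getPartition(table, n))

  // Preferred pattern: a single batched RPC.
  client.getPartitionsByNames(table, names)
}
{code}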



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18544) Append with df.saveAsTable writes data to wrong location

2016-11-22 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18544:
---
Description: 
When using saveAsTable in append mode, data will be written to the wrong 
location for non-managed Datasource tables. The following example illustrates 
this.

It seems DataFrameWriter somehow passes the wrong table path to 
InsertIntoHadoopFsRelation. Also, we should probably remove the repair table 
call at the end of saveAsTable in DataFrameWriter; that shouldn't be needed in 
either the Hive or Datasource case.

{code}
scala> spark.sqlContext.range(100).selectExpr("id", "id as A", "id as 
B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test")

scala> sql("create table test (id long, A int, B int) USING parquet OPTIONS 
(path '/tmp/test') PARTITIONED BY (A, B)")

scala> sql("msck repair table test")

scala> sql("select * from test where A = 1").count
res6: Long = 1

scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as 
B").write.partitionBy("A", "B").mode("append").saveAsTable("test")

scala> sql("select * from test where A = 1").count
res8: Long = 1
{code}

  was:
When using saveAsTable in append mode, data will be written to the wrong 
location for non-managed Datasource tables. The following example illustrates 
this.

It seems DataFrameWriter somehow passes the wrong table path to 
InsertIntoHadoopFsRelation. Also, we should probably remove the repair table 
call at the end of saveAsTable in DataFrameWriter; that shouldn't be needed in 
either the Hive or Datasource case.

{code}
scala> spark.sqlContext.range(1).selectExpr("id", "id as A", "id as 
B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test_10k")

scala> sql("msck repair table test_10k")

scala> sql("select * from test_10k where A = 1").count
res6: Long = 1

scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as 
B").write.partitionBy("A", "B").mode("append").parquet("/tmp/test_10k")

scala> sql("select * from test_10k where A = 1").count
res8: Long = 1
{code}


> Append with df.saveAsTable writes data to wrong location
> 
>
> Key: SPARK-18544
> URL: https://issues.apache.org/jira/browse/SPARK-18544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Priority: Blocker
>
> When using saveAsTable in append mode, data will be written to the wrong 
> location for non-managed Datasource tables. The following example illustrates 
> this.
> It seems DataFrameWriter somehow passes the wrong table path to 
> InsertIntoHadoopFsRelation. Also, we should probably remove the repair table 
> call at the end of saveAsTable in DataFrameWriter; that shouldn't be needed in 
> either the Hive or Datasource case.
> {code}
> scala> spark.sqlContext.range(100).selectExpr("id", "id as A", "id as 
> B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test")
> scala> sql("create table test (id long, A int, B int) USING parquet OPTIONS 
> (path '/tmp/test') PARTITIONED BY (A, B)")
> scala> sql("msck repair table test")
> scala> sql("select * from test where A = 1").count
> res6: Long = 1
> scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as 
> B").write.partitionBy("A", "B").mode("append").saveAsTable("test")
> scala> sql("select * from test where A = 1").count
> res8: Long = 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18544) Append with df.saveAsTable writes data to wrong location

2016-11-22 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687884#comment-15687884
 ] 

Eric Liang commented on SPARK-18544:


cc [~yhuai] [~cloud_fan] I'll try to look at this today but might not get to it.

> Append with df.saveAsTable writes data to wrong location
> 
>
> Key: SPARK-18544
> URL: https://issues.apache.org/jira/browse/SPARK-18544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> When using saveAsTable in append mode, data will be written to the wrong 
> location for non-managed Datasource tables. The following example illustrates 
> this.
> It seems DataFrameWriter somehow passes the wrong table path to 
> InsertIntoHadoopFsRelation. Also, we should probably remove the repair table 
> call at the end of saveAsTable in DataFrameWriter; that shouldn't be needed in 
> either the Hive or Datasource case.
> {code}
> scala> spark.sqlContext.range(1).selectExpr("id", "id as A", "id as 
> B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test_10k")
> scala> sql("msck repair table test_10k")
> scala> sql("select * from test_10k where A = 1").count
> res6: Long = 1
> scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as 
> B").write.partitionBy("A", "B").mode("append").parquet("/tmp/test_10k")
> scala> sql("select * from test_10k where A = 1").count
> res8: Long = 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18545) Verify number of hive client RPCs in PartitionedTablePerfStatsSuite

2016-11-22 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18545:
--

 Summary: Verify number of hive client RPCs in 
PartitionedTablePerfStatsSuite
 Key: SPARK-18545
 URL: https://issues.apache.org/jira/browse/SPARK-18545
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Eric Liang
Priority: Minor


To avoid performance regressions like 
https://issues.apache.org/jira/browse/SPARK-18507 in the future, we should add 
a metric for the number of Hive client RPCs issued and check it in the perf 
stats suite.

cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18544) Append with df.saveAsTable writes data to wrong location

2016-11-22 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18544:
--

 Summary: Append with df.saveAsTable writes data to wrong location
 Key: SPARK-18544
 URL: https://issues.apache.org/jira/browse/SPARK-18544
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Eric Liang


When using saveAsTable in append mode, data will be written to the wrong 
location for non-managed Datasource tables. The following example illustrates 
this.

It seems DataFrameWriter somehow passes the wrong table path to 
InsertIntoHadoopFsRelation. Also, we should probably remove the repair table 
call at the end of saveAsTable in DataFrameWriter; that shouldn't be needed in 
either the Hive or Datasource case.

{code}
scala> spark.sqlContext.range(1).selectExpr("id", "id as A", "id as 
B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test_10k")

scala> sql("msck repair table test_10k")

scala> sql("select * from test_10k where A = 1").count
res6: Long = 1

scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as 
B").write.partitionBy("A", "B").mode("append").parquet("/tmp/test_10k")

scala> sql("select * from test_10k where A = 1").count
res8: Long = 1
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18393) DataFrame pivot output column names should respect aliases

2016-11-09 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18393:
--

 Summary: DataFrame pivot output column names should respect aliases
 Key: SPARK-18393
 URL: https://issues.apache.org/jira/browse/SPARK-18393
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Eric Liang
Priority: Minor


For example

{code}
val df = spark.range(100).selectExpr("id % 5 as x", "id % 2 as a", "id as b")
df
  .groupBy('x)
  .pivot("a", Seq(0, 1))
  .agg(expr("sum(b)").as("blah"), expr("count(b)").as("foo"))
  .show()
+---+--------------------+---------------------+--------------------+---------------------+
|  x|0_sum(`b`) AS `blah`|0_count(`b`) AS `foo`|1_sum(`b`) AS `blah`|1_count(`b`) AS `foo`|
+---+--------------------+---------------------+--------------------+---------------------+
|  0|                 450|                   10|                 500|                   10|
|  1|                 510|                   10|                 460|                   10|
|  3|                 530|                   10|                 480|                   10|
|  2|                 470|                   10|                 520|                   10|
|  4|                 490|                   10|                 540|                   10|
+---+--------------------+---------------------+--------------------+---------------------+
{code}

The column names here are quite hard to read. Ideally we would respect the 
aliases and generate column names like 0_blah, 0_foo, 1_blah, 1_foo instead.
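
In the meantime, a possible user-side workaround is to rename the columns after 
the pivot. The sketch below is illustrative only and assumes the current 
"<pivotValue>_<agg> AS `<alias>`" naming shown above:

{code}
import org.apache.spark.sql.functions._

val df = spark.range(100).selectExpr("id % 5 as x", "id % 2 as a", "id as b")
val pivoted = df.groupBy("x").pivot("a", Seq(0, 1))
  .agg(expr("sum(b)").as("blah"), expr("count(b)").as("foo"))

// Rewrite e.g. "0_sum(`b`) AS `blah`" to "0_blah" by keeping only the pivot
// value and the alias; leave non-matching columns (like "x") untouched.
val aliasPattern = """^(\d+)_.* AS `(.+)`$""".r
val renamed = pivoted.toDF(pivoted.columns.map {
  case aliasPattern(value, alias) => s"${value}_$alias"
  case other => other
}: _*)
renamed.show()
{code}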



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-09 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652414#comment-15652414
 ] 

Eric Liang commented on SPARK-17916:


In our case, a user wants an empty string (whether actually missing, e.g. {{,,}}, 
or quoted {{,""}}) to resolve to the empty string. It should only turn into null 
if nullValue is set to "". There currently doesn't appear to be any option 
combination that allows this.
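
To make the desired behavior concrete, a sketch (the file path and its contents 
are hypothetical; this shows what we would like to get, not what 2.0.1 does 
today):

{code}
// /tmp/example.csv is assumed to contain:
//   col1,col2
//   1,"-"
//   2,""
val df = spark.read
  .option("header", "true")
  .option("nullValue", "-")
  .csv("/tmp/example.csv")

// Desired: only the "-" cell becomes null; the quoted "" in the second row
// should stay an empty string rather than also being converted to null.
df.show()
{code}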

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-08 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649039#comment-15649039
 ] 

Eric Liang commented on SPARK-17916:


We're hitting this as a regression from 2.0 as well.

Ideally, we don't want the empty string to be treated specially in any 
scenario. The only logic that converts it to null should be the nullValue 
option.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17990) ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition column names

2016-11-07 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17990:
---
Target Version/s: 2.1.0

> ALTER TABLE ... ADD PARTITION does not play nice with mixed-case partition 
> column names
> ---
>
> Key: SPARK-17990
> URL: https://issues.apache.org/jira/browse/SPARK-17990
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
> Environment: Linux
> Mac OS with a case-sensitive filesystem
>Reporter: Michael Allman
>
> Writing partition data to an external table's file location and then adding 
> those as table partition metadata is a common use case. However, for tables 
> with partition column names with upper case letters, the SQL command {{ALTER 
> TABLE ... ADD PARTITION}} does not work, as illustrated in the following 
> example:
> {code}
> scala> sql("create external table mixed_case_partitioning (a bigint) 
> PARTITIONED BY (partCol bigint) STORED AS parquet LOCATION 
> '/tmp/mixed_case_partitioning'")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sqlContext.range(10).selectExpr("id as a", "id as 
> partCol").write.partitionBy("partCol").mode("overwrite").parquet("/tmp/mixed_case_partitioning")
> {code}
> At this point, doing a {{hadoop fs -ls /tmp/mixed_case_partitioning}} 
> produces the following:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 11 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> {code}
> Returning to the Spark shell, we execute the following to add the partition 
> metadata:
> {code}
> scala> (0 to 9).foreach { p => sql(s"alter table mixed_case_partitioning add 
> partition(partCol=$p)") }
> {code}
> Examining the HDFS file listing again, we see:
> {code}
> [msa@jupyter ~]$ hadoop fs -ls /tmp/mixed_case_partitioning
> Found 21 items
> -rw-r--r--   3 msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/_SUCCESS
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=7
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=8
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:52 
> /tmp/mixed_case_partitioning/partCol=9
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=0
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=1
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=2
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=3
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=4
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=5
> drwxr-xr-x   - msa supergroup  0 2016-10-18 17:53 
> /tmp/mixed_case_partitioning/partcol=6
> drwxr-xr-x   - msa supergroup  0 2016-10-18 

[jira] [Updated] (SPARK-18333) Revert hacks in parquet and orc reader to support case insensitive resolution

2016-11-07 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18333:
---
Target Version/s: 2.1.0

> Revert hacks in parquet and orc reader to support case insensitive resolution
> -
>
> Key: SPARK-18333
> URL: https://issues.apache.org/jira/browse/SPARK-18333
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> These are no longer needed after 
> https://issues.apache.org/jira/browse/SPARK-17183



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-11-07 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18145:
---
Target Version/s: 2.1.0

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18333) Revert hacks in parquet and orc reader to support case insensitive resolution

2016-11-07 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18333:
--

 Summary: Revert hacks in parquet and orc reader to support case 
insensitive resolution
 Key: SPARK-18333
 URL: https://issues.apache.org/jira/browse/SPARK-18333
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang


These are no longer needed after 
https://issues.apache.org/jira/browse/SPARK-17183



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-11-07 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18185:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
> --
>
> Key: SPARK-18185
> URL: https://issues.apache.org/jira/browse/SPARK-18185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
> Datasource table will overwrite the entire table instead of only the updated 
> partitions as in Hive. It also doesn't respect custom partition locations.
> We should delete only the proper partitions, scan the metastore for affected 
> partitions with custom locations, and ensure that deletes/writes go to the 
> right locations for those as well.
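
To make the intended semantics concrete, a sketch with an illustrative table and 
data (showing the Hive-like behavior described above, not what 2.1 currently 
does):

{code}
sql("create table t (id bigint, p int) using parquet partitioned by (p)")
sql("insert into t select id, id % 10 as p from range(100)")

// Dynamic-partition overwrite should replace only the partitions produced by
// the query (here p = 0), leaving p = 1..9 untouched, as Hive does.
sql("insert overwrite table t partition (p) select id, 0 as p from range(5)")
sql("select count(distinct p) from t").show()   // desired result: 10, not 1
{code}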



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-11-07 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18185:
---
Target Version/s: 2.1.0

> Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
> --
>
> Key: SPARK-18185
> URL: https://issues.apache.org/jira/browse/SPARK-18185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
> Datasource table will overwrite the entire table instead of only the updated 
> partitions as in Hive. It also doesn't respect custom partition locations.
> We should delete only the proper partitions, scan the metastore for affected 
> partitions with custom locations, and ensure that deletes/writes go to the 
> right locations for those as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-11-07 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15645238#comment-15645238
 ] 

Eric Liang commented on SPARK-18185:


I'm currently working on this.

> Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
> --
>
> Key: SPARK-18185
> URL: https://issues.apache.org/jira/browse/SPARK-18185
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
> Datasource table will overwrite the entire table instead of only the updated 
> partitions as in Hive. It also doesn't respect custom partition locations.
> We should delete only the proper partitions, scan the metastore for affected 
> partitions with custom locations, and ensure that deletes/writes go to the 
> right locations for those as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-11-03 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18185:
---
Description: 
As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
Datasource table will overwrite the entire table instead of only the updated 
partitions as in Hive. It also doesn't respect custom partition locations.

We should delete only the proper partitions, scan the metastore for affected 
partitions with custom locations, and ensure that deletes/writes go to the 
right locations for those as well.

  was:
As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
Datasource table will overwrite the entire table instead of only the updated 
partitions as in Hive.

This is non-trivial to fix in 2.1, so we should throw an exception here.


> Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
> --
>
> Key: SPARK-18185
> URL: https://issues.apache.org/jira/browse/SPARK-18185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
> Datasource table will overwrite the entire table instead of only the updated 
> partitions as in Hive. It also doesn't respect custom partition locations.
> We should delete only the proper partitions, scan the metastore for affected 
> partitions with custom locations, and ensure that deletes/writes go to the 
> right locations for those as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-11-03 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18185:
---
Summary: Should fix INSERT OVERWRITE TABLE of Datasource tables with 
dynamic partitions  (was: Should disallow INSERT OVERWRITE TABLE of Datasource 
tables with dynamic partitions)

> Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
> --
>
> Key: SPARK-18185
> URL: https://issues.apache.org/jira/browse/SPARK-18185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
> Datasource table will overwrite the entire table instead of only the updated 
> partitions as in Hive.
> This is non-trivial to fix in 2.1, so we should throw an exception here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18185) Should disallow INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-10-31 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18185:
--

 Summary: Should disallow INSERT OVERWRITE TABLE of Datasource 
tables with dynamic partitions
 Key: SPARK-18185
 URL: https://issues.apache.org/jira/browse/SPARK-18185
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Eric Liang


As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
Datasource table will overwrite the entire table instead of only the updated 
partitions as in Hive.

This is non-trivial to fix in 2.1, so we should throw an exception here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18184) INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables cannot handle partitions with custom locations

2016-10-31 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18184:
--

 Summary: INSERT [INTO|OVERWRITE] TABLE ... PARTITION for 
Datasource tables cannot handle partitions with custom locations
 Key: SPARK-18184
 URL: https://issues.apache.org/jira/browse/SPARK-18184
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Eric Liang


As part of https://issues.apache.org/jira/browse/SPARK-17861 we should support 
this case as well now that Datasource table partitions can have custom 
locations.
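
For illustration, the case in question looks roughly like this (hypothetical 
table name and path):

{code}
sql("create table pt (id bigint, p int) using parquet partitioned by (p)")
sql("alter table pt add partition (p = 1) location '/tmp/custom_p1'")

// Inserts into p = 1 should read and write /tmp/custom_p1 rather than the
// default <table location>/p=1 directory.
sql("insert overwrite table pt partition (p = 1) select id from range(10)")
{code}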



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition

2016-10-31 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18183:
---
Component/s: SQL

> INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource 
> table instead of just the specified partition
> ---
>
> Key: SPARK-18183
> URL: https://issues.apache.org/jira/browse/SPARK-18183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> In Hive, this will only overwrite the specified partition. We should fix this 
> for Datasource tables to be more in line with the Hive behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18183) INSERT OVERWRITE TABLE ... PARTITION will overwrite the entire Datasource table instead of just the specified partition

2016-10-31 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18183:
--

 Summary: INSERT OVERWRITE TABLE ... PARTITION will overwrite the 
entire Datasource table instead of just the specified partition
 Key: SPARK-18183
 URL: https://issues.apache.org/jira/browse/SPARK-18183
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang


In Hive, this will only overwrite the specified partition. We should fix this 
for Datasource tables to be more in line with the Hive behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18167) Flaky test when hive partition pruning is enabled

2016-10-28 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18167:
--

 Summary: Flaky test when hive partition pruning is enabled
 Key: SPARK-18167
 URL: https://issues.apache.org/jira/browse/SPARK-18167
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Eric Liang


org.apache.spark.sql.hive.execution.SQLQuerySuite is flaking when hive 
partition pruning is enabled.

Based on the stack traces, it seems to be an old issue where Hive fails to cast 
a numeric partition column ("Invalid character string format for type 
DECIMAL"). There are two possibilities here: either we are somehow corrupting 
the partition table to have non-decimal values in that column, or there is a 
transient issue with Derby.

{code}
Error Message  java.lang.reflect.InvocationTargetException: null Stacktrace  
sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null  at 
sun.reflect.GeneratedMethodAccessor263.invoke(Unknown Source)at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497) at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:588)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:544)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:542)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:542)
  at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:702)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:686)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
   at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:686)
  at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:769)
at 
org.apache.spark.sql.execution.datasources.TableFileCatalog.filterPartitions(TableFileCatalog.scala:67)
  at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:59)
at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:26)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)  
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
  at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:25)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)  
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
 at scala.collection.immutable.List.foreach(List.scala:381)  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) 
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73)
  at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73)
 at 
org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:234)  at 
org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:170)  at 
org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$76$$anonfun$apply$mcV$sp$24.apply$mcV$sp(SQLQuerySuite.scala:1559)
   

[jira] [Created] (SPARK-18146) Avoid using Union to chain together create table and repair partition commands

2016-10-27 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18146:
--

 Summary: Avoid using Union to chain together create table and 
repair partition commands
 Key: SPARK-18146
 URL: https://issues.apache.org/jira/browse/SPARK-18146
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang
Priority: Minor


The behavior of union is not well defined here. We should add an internal 
command to execute these commands sequentially.
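
A minimal sketch of what such an internal command could look like (the name and 
structure here are hypothetical, not necessarily what gets implemented):

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.command.RunnableCommand

// Runs its child commands one after another, e.g. a create-table command
// followed by a repair-partitions command, with well-defined ordering.
case class SequentialCommands(commands: Seq[RunnableCommand]) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    commands.foreach(_.run(sparkSession))
    Seq.empty[Row]
  }
}
{code}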



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18145) Update documentation

2016-10-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18145:
---
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-17861

> Update documentation
> 
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-10-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18145:
---
Component/s: SQL

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18145) Update documentation for hive partition management in 2.1

2016-10-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18145:
---
Summary: Update documentation for hive partition management in 2.1  (was: 
Update documentation)

> Update documentation for hive partition management in 2.1
> -
>
> Key: SPARK-18145
> URL: https://issues.apache.org/jira/browse/SPARK-18145
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18145) Update documentation

2016-10-27 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18145:
--

 Summary: Update documentation
 Key: SPARK-18145
 URL: https://issues.apache.org/jira/browse/SPARK-18145
 Project: Spark
  Issue Type: Documentation
Reporter: Eric Liang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18103) Rename *FileCatalog to *FileProvider

2016-10-25 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18103:
--

 Summary: Rename *FileCatalog to *FileProvider
 Key: SPARK-18103
 URL: https://issues.apache.org/jira/browse/SPARK-18103
 Project: Spark
  Issue Type: Improvement
Reporter: Eric Liang
Priority: Minor


In the SQL component there are too many different components called some 
variant of *Catalog, which is quite confusing. We should rename the subclasses 
of FileCatalog to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields

2016-10-25 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18101:
---
Issue Type: Sub-task  (was: Test)
Parent: SPARK-17861

> ExternalCatalogSuite should test with mixed case fields
> ---
>
> Key: SPARK-18101
> URL: https://issues.apache.org/jira/browse/SPARK-18101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> Currently, it uses field names such as "a" and "b" which are not useful for 
> testing case preservation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields

2016-10-25 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18101:
--

 Summary: ExternalCatalogSuite should test with mixed case fields
 Key: SPARK-18101
 URL: https://issues.apache.org/jira/browse/SPARK-18101
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Eric Liang


Currently, it uses field names such as "a" and "b" which are not useful for 
testing case preservation.
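
For example (illustrative only), schemas in the suite could use mixed-case names 
so that case preservation is actually exercised:

{code}
import org.apache.spark.sql.types._

// Mixed-case names surface case-handling bugs that "a"/"b" never would.
val schema = new StructType()
  .add("col1", IntegerType)
  .add("partCol1", StringType)
  .add("PartCol2", StringType)
{code}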



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17183) put hive serde table schema to table properties like data source table

2016-10-24 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17183:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-17861

> put hive serde table schema to table properties like data source table
> --
>
> Key: SPARK-17183
> URL: https://issues.apache.org/jira/browse/SPARK-17183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-24 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18087:
--

 Summary: Optimize insert to not require REPAIR TABLE
 Key: SPARK-18087
 URL: https://issues.apache.org/jira/browse/SPARK-18087
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18026) should not always lowercase partition columns of partition spec in parser

2016-10-24 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18026:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-17861

> should not always lowercase partition columns of partition spec in parser
> -
>
> Key: SPARK-18026
> URL: https://issues.apache.org/jira/browse/SPARK-18026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17970) Use metastore for managing filesource table partitions as well

2016-10-24 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17970:
---
Summary: Use metastore for managing filesource table partitions as well  
(was: store partition spec in metastore for data source table)

> Use metastore for managing filesource table partitions as well
> --
>
> Key: SPARK-17970
> URL: https://issues.apache.org/jira/browse/SPARK-17970
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17994) Add back a file status cache for catalog tables

2016-10-18 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17994:
---
Description: 
In SPARK-16980, we removed the full in-memory cache of table partitions in 
favor of loading only needed partitions from the metastore. This greatly 
improves the initial latency of queries that only read a small fraction of 
table partitions.

However, since the metastore does not store file statistics, we need to 
discover those from remote storage. With the loss of the in-memory file status 
cache this has to happen on each query, increasing the latency of repeated 
queries over the same partitions.

The proposal is to add back a per-table cache of partition contents, i.e. 
Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can 
be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
cache, it can be incrementally updated as new partitions are read.

cc [~michael]

  was:
In SPARK-16980, we removed the full in-memory cache of table partitions in 
favor of loading only needed partitions from the metastore. This greatly 
improves the initial latency of queries that only read a small fraction of 
table partitions.

However, since the metastore does not store file statistics, we need to 
discover those from remote storage. With the loss of the in-memory file status 
cache this has to happen on each query, increasing the latency of repeated 
queries over the same partitions.

The proposal is to add back a per-table cache of partition contents, i.e. 
Map[Path, Array[FileStatus]]. This cache would be retained per-table, and would 
be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
cache, it can be incrementally updated as new partitions are read.

cc [~michael]


> Add back a file status cache for catalog tables
> ---
>
> Key: SPARK-17994
> URL: https://issues.apache.org/jira/browse/SPARK-17994
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Eric Liang
>
> In SPARK-16980, we removed the full in-memory cache of table partitions in 
> favor of loading only needed partitions from the metastore. This greatly 
> improves the initial latency of queries that only read a small fraction of 
> table partitions.
> However, since the metastore does not store file statistics, we need to 
> discover those from remote storage. With the loss of the in-memory file 
> status cache this has to happen on each query, increasing the latency of 
> repeated queries over the same partitions.
> The proposal is to add back a per-table cache of partition contents, i.e. 
> Map[Path, Array[FileStatus]]. This cache would be retained per-table, and can 
> be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
> cache, it can be incrementally updated as new partitions are read.
> cc [~michael]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17994) Add back a file status cache for catalog tables

2016-10-18 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17994:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-17861

> Add back a file status cache for catalog tables
> ---
>
> Key: SPARK-17994
> URL: https://issues.apache.org/jira/browse/SPARK-17994
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Eric Liang
>
> In SPARK-16980, we removed the full in-memory cache of table partitions in 
> favor of loading only needed partitions from the metastore. This greatly 
> improves the initial latency of queries that only read a small fraction of 
> table partitions.
> However, since the metastore does not store file statistics, we need to 
> discover those from remote storage. With the loss of the in-memory file 
> status cache this has to happen on each query, increasing the latency of 
> repeated queries over the same partitions.
> The proposal is to add back a per-table cache of partition contents, i.e. 
> Map[Path, Array[FileStatus]]. This cache would be retained per-table, and 
> would be invalidated through refreshTable() and refreshByPath(). Unlike the 
> prior cache, it can be incrementally updated as new partitions are read.
> cc [~michael]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17994) Add back a file status cache for catalog tables

2016-10-18 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17994:
--

 Summary: Add back a file status cache for catalog tables
 Key: SPARK-17994
 URL: https://issues.apache.org/jira/browse/SPARK-17994
 Project: Spark
  Issue Type: Improvement
Reporter: Eric Liang


In SPARK-16980, we removed the full in-memory cache of table partitions in 
favor of loading only needed partitions from the metastore. This greatly 
improves the initial latency of queries that only read a small fraction of 
table partitions.

However, since the metastore does not store file statistics, we need to 
discover those from remote storage. With the loss of the in-memory file status 
cache this has to happen on each query, increasing the latency of repeated 
queries over the same partitions.

The proposal is to add back a per-table cache of partition contents, i.e. 
Map[Path, Array[FileStatus]]. This cache would be retained per-table, and would 
be invalidated through refreshTable() and refreshByPath(). Unlike the prior 
cache, it can be incrementally updated as new partitions are read.

cc [~michael]
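
A minimal sketch of such a per-table cache (illustrative only; a real 
implementation would also need size bounds, metrics, and sharing semantics):

{code}
import java.util.concurrent.ConcurrentHashMap
import org.apache.hadoop.fs.{FileStatus, Path}

class TableFileStatusCache {
  private val cache = new ConcurrentHashMap[Path, Array[FileStatus]]()

  // Incrementally populated: only partitions that are actually read get listed.
  def listFiles(partitionDir: Path, list: Path => Array[FileStatus]): Array[FileStatus] = {
    val cached = cache.get(partitionDir)
    if (cached != null) {
      cached
    } else {
      val listed = list(partitionDir)
      cache.putIfAbsent(partitionDir, listed)
      listed
    }
  }

  // Hooked up to refreshTable() / refreshByPath() style invalidation.
  def invalidateAll(): Unit = cache.clear()
}
{code}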



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-18 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586476#comment-15586476
 ] 

Eric Liang commented on SPARK-17983:


Since we already store the original (case-sensitive) schema of datasource 
tables in the metastore as a table property, I think it also makes sense to do 
this for Hive tables created through Spark. This would resolve this issue for 
both types of tables.

The one issue I can see with this is that Hive tables created through previous 
versions of Spark would stop working in 2.1 if their files have mixed-case 
column names. We would need a workaround for this situation.

> Can't filter over mixed case parquet columns of converted Hive tables
> -
>
> Key: SPARK-17983
> URL: https://issues.apache.org/jira/browse/SPARK-17983
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> We should probably revive https://github.com/apache/spark/pull/14750 in order 
> to fix this issue and related classes of issues.
> The only other alternatives are (1) reconciling on-disk schemas with 
> metastore schema at planning time, which seems pretty messy, and (2) fixing 
> all the datasources to support case-insensitive matching, which also has 
> issues.
> Reproduction:
> {code}
>   private def setupPartitionedTable(tableName: String, dir: File): Unit = {
> spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as 
> partCol2").write
>   .partitionBy("partCol1", "partCol2")
>   .mode("overwrite")
>   .parquet(dir.getAbsolutePath)
> spark.sql(s"""
>   |create external table $tableName (normalCol long)
>   |partitioned by (partCol1 int, partCol2 int)
>   |stored as parquet
>   |location "${dir.getAbsolutePath}.stripMargin)
> spark.sql(s"msck repair table $tableName")
>   }
>   test("filter by mixed case col") {
> withTable("test") {
>   withTempDir { dir =>
> setupPartitionedTable("test", dir)
> val df = spark.sql("select * from test where normalCol = 3")
> assert(df.count() == 1)
>   }
> }
>   }
> {code}
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17991) Enable metastore partition pruning for unconverted hive tables by default

2016-10-18 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17991:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-17861

> Enable metastore partition pruning for unconverted hive tables by default
> -
>
> Key: SPARK-17991
> URL: https://issues.apache.org/jira/browse/SPARK-17991
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> Now that we use the metastore for pruning converted Hive tables, we should 
> enable it for unconverted tables as well. Previously this was turned off due 
> to test flakiness, but that flakiness no longer appears to occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17991) Enable metastore partition pruning for unconverted hive tables by default

2016-10-18 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17991:
--

 Summary: Enable metastore partition pruning for unconverted hive 
tables by default
 Key: SPARK-17991
 URL: https://issues.apache.org/jira/browse/SPARK-17991
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Eric Liang


Now that we use the metastore for pruning converted Hive tables, we should 
enable it for unconverted tables as well. Previously this was turned off due to 
test flakiness, but that flakiness no longer appears to occur.
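
Until that default changes, the behavior can be enabled per session via the 
existing flag (a usage sketch, assuming spark.sql.hive.metastorePartitionPruning 
is the flag whose default would flip; the table name is illustrative):

{code}
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

// With pruning enabled, a filter on a partition column lets Spark ask the
// metastore for only the matching partitions instead of listing all of them.
spark.sql("select * from some_partitioned_hive_table where partCol = 3").explain()
{code}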



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17980) Fix refreshByPath for converted Hive tables

2016-10-18 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17980:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-17861

> Fix refreshByPath for converted Hive tables
> ---
>
> Key: SPARK-17980
> URL: https://issues.apache.org/jira/browse/SPARK-17980
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>
> There is a small bug introduced in https://github.com/apache/spark/pull/14690 
> which broke refreshByPath with converted Hive tables (though it turns out it 
> was very difficult to refresh converted Hive tables anyway, since you had to 
> specify the exact path of one of the partitions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17862) Feature flag SPARK-16980

2016-10-18 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586210#comment-15586210
 ] 

Eric Liang commented on SPARK-17862:


Yes, this is the flag:

{code}
  val HIVE_FILESOURCE_PARTITION_PRUNING =
    SQLConfigBuilder("spark.sql.hive.filesourcePartitionPruning")
      .doc("When true, enable metastore partition pruning for file source tables as well. " +
        "This is currently implemented for converted Hive tables only.")
      .booleanConf
      .createWithDefault(true)
{code}

> Feature flag SPARK-16980
> 
>
> Key: SPARK-17862
> URL: https://issues.apache.org/jira/browse/SPARK-17862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Eric Liang
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-17 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17983:
---
Description: 
We should probably revive https://github.com/apache/spark/pull/14750 in order 
to fix this issue and related classes of issues.

The only other alternatives are (1) reconciling on-disk schemas with metastore 
schema at planning time, which seems pretty messy, and (2) fixing all the 
datasources to support case-insensitive matching, which also has issues.

Reproduction:
{code}
  private def setupPartitionedTable(tableName: String, dir: File): Unit = {
    spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as partCol2").write
      .partitionBy("partCol1", "partCol2")
      .mode("overwrite")
      .parquet(dir.getAbsolutePath)

    spark.sql(s"""
      |create external table $tableName (normalCol long)
      |partitioned by (partCol1 int, partCol2 int)
      |stored as parquet
      |location "${dir.getAbsolutePath}"""".stripMargin)
    spark.sql(s"msck repair table $tableName")
  }

  test("filter by mixed case col") {
    withTable("test") {
      withTempDir { dir =>
        setupPartitionedTable("test", dir)
        val df = spark.sql("select * from test where normalCol = 3")
        assert(df.count() == 1)
      }
    }
  }
{code}
cc [~cloud_fan]

  was:
We should probably revive https://github.com/apache/spark/pull/14750 in order 
to fix this issue and related classes of issues.

The only other alternatives are (1) reconciling on-disk schemas with metastore 
schema at planning time, which seems pretty messy, and (2) fixing all the 
datasources to support case-insensitive matching, which also has issues.

cc [~cloud_fan]


> Can't filter over mixed case parquet columns of converted Hive tables
> -
>
> Key: SPARK-17983
> URL: https://issues.apache.org/jira/browse/SPARK-17983
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> We should probably revive https://github.com/apache/spark/pull/14750 in order 
> to fix this issue and related classes of issues.
> The only other alternatives are (1) reconciling on-disk schemas with 
> metastore schema at planning time, which seems pretty messy, and (2) fixing 
> all the datasources to support case-insensitive matching, which also has 
> issues.
> Reproduction:
> {code}
>   private def setupPartitionedTable(tableName: String, dir: File): Unit = {
> spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as 
> partCol2").write
>   .partitionBy("partCol1", "partCol2")
>   .mode("overwrite")
>   .parquet(dir.getAbsolutePath)
> spark.sql(s"""
>   |create external table $tableName (normalCol long)
>   |partitioned by (partCol1 int, partCol2 int)
>   |stored as parquet
> |location "${dir.getAbsolutePath}"""".stripMargin)
> spark.sql(s"msck repair table $tableName")
>   }
>   test("filter by mixed case col") {
> withTable("test") {
>   withTempDir { dir =>
> setupPartitionedTable("test", dir)
> val df = spark.sql("select * from test where normalCol = 3")
> assert(df.count() == 1)
>   }
> }
>   }
> {code}
> cc [~cloud_fan]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17980) Fix refreshByPath for converted Hive tables

2016-10-17 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17980:
--

 Summary: Fix refreshByPath for converted Hive tables
 Key: SPARK-17980
 URL: https://issues.apache.org/jira/browse/SPARK-17980
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Eric Liang
Priority: Minor


There is a small bug introduced in https://github.com/apache/spark/pull/14690 
which broke refreshByPath with converted Hive tables (though it turns out it was 
very difficult to refresh converted Hive tables anyway, since you had to specify 
the exact path of one of the partitions).
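
For context, a quick usage sketch of the API in question (paths and table names 
are illustrative):

{code}
// Invalidate cached metadata for everything under a path. Before this fix,
// passing the table's root location did not reliably refresh converted Hive
// tables; only the exact path of a partition would.
spark.catalog.refreshByPath("/warehouse/db/my_table")

// Refreshing by table name remains the usual alternative.
spark.catalog.refreshTable("db.my_table")
{code}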



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17974:
---
Affects Version/s: 2.1.0

> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17974:
--

 Summary: Refactor FileCatalog classes to simplify the inheritance 
tree
 Key: SPARK-17974
 URL: https://issues.apache.org/jira/browse/SPARK-17974
 Project: Spark
  Issue Type: Improvement
Reporter: Eric Liang
Priority: Minor


This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
adds support for metastore partition pruning of converted hive tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17974) Refactor FileCatalog classes to simplify the inheritance tree

2016-10-17 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17974:
---
Component/s: SQL

> Refactor FileCatalog classes to simplify the inheritance tree
> -
>
> Key: SPARK-17974
> URL: https://issues.apache.org/jira/browse/SPARK-17974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Minor
>
> This is a follow-up item for https://github.com/apache/spark/pull/14690 which 
> adds support for metastore partition pruning of converted hive tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17740) Spark tests should mock / interpose HDFS to ensure that streams are closed

2016-09-29 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17740:
--

 Summary: Spark tests should mock / interpose HDFS to ensure that 
streams are closed
 Key: SPARK-17740
 URL: https://issues.apache.org/jira/browse/SPARK-17740
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Reporter: Eric Liang


As a follow-up to SPARK-17666, we should add a test to ensure that filesystem 
connections are not leaked, at least in unit tests.
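
One possible shape for the interposition, as a sketch (the class, object, and 
method names below are hypothetical, not an existing Spark test utility): wrap 
the local filesystem so that it counts the streams it hands out and decrements 
the count as they are closed.

{code}
import java.util.concurrent.atomic.AtomicInteger

import org.apache.hadoop.fs.{FSDataInputStream, Path, RawLocalFileSystem}

// Hypothetical tracking filesystem for tests.
class TrackingFileSystem extends RawLocalFileSystem {
  override def open(f: Path, bufferSize: Int): FSDataInputStream = {
    val underlying = super.open(f, bufferSize)
    TrackingFileSystem.openStreams.incrementAndGet()
    // Wrap the stream so that closing it updates the counter.
    new FSDataInputStream(underlying) {
      override def close(): Unit = {
        super.close()
        TrackingFileSystem.openStreams.decrementAndGet()
      }
    }
  }
}

object TrackingFileSystem {
  val openStreams = new AtomicInteger(0)

  // Called from a suite's afterEach()/afterAll() to fail the test on leaks.
  def assertNoLeakedStreams(): Unit =
    assert(openStreams.get() == 0,
      s"${openStreams.get()} input stream(s) were never closed")
}
{code}

Test suites would then point the Hadoop fs.file.impl setting at the wrapper so 
that local file access in tests goes through it.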



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17713) Move row-datasource related tests out of JDBCSuite

2016-09-28 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17713:
--

 Summary: Move row-datasource related tests out of JDBCSuite
 Key: SPARK-17713
 URL: https://issues.apache.org/jira/browse/SPARK-17713
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Eric Liang
Priority: Trivial


As a follow-up to https://github.com/apache/spark/pull/15273, we should move 
non-JDBC-specific tests out of that suite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17701) Refactor DataSourceScanExec so its sameResult call does not compare strings

2016-09-27 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17701:
--

 Summary: Refactor DataSourceScanExec so its sameResult call does 
not compare strings
 Key: SPARK-17701
 URL: https://issues.apache.org/jira/browse/SPARK-17701
 Project: Spark
  Issue Type: Improvement
Reporter: Eric Liang


Currently, RowDataSourceScanExec and FileSourceScanExec rely on a "metadata" 
string map to implement equality comparison, since the RDDs they depend on 
cannot be directly compared. This has resulted in a number of correctness bugs 
around exchange reuse, e.g. SPARK-17673 and SPARK-16818.

To make these comparisons less brittle, we should refactor these classes to 
compare constructor parameters directly instead of relying on the metadata map.
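
To illustrate the direction (a sketch with made-up field names, not the actual 
plan node definitions): equality should be derived from the structured inputs to 
the scan rather than from a rendered string map.

{code}
// Hypothetical, simplified stand-in for the information a scan is built from.
case class ScanFingerprint(
    relationId: Long,            // identity of the underlying relation/RDD
    output: Seq[String],         // output attributes, including their order
    pushedFilters: Seq[String])  // filters pushed into the source

// Comparing structured fields directly avoids two different scans looking
// "equal" just because their human-readable metadata strings render the same.
def sameResult(a: ScanFingerprint, b: ScanFingerprint): Boolean = a == b
{code}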



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17701) Refactor DataSourceScanExec so its sameResult call does not compare strings

2016-09-27 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17701:
---
Component/s: SQL

> Refactor DataSourceScanExec so its sameResult call does not compare strings
> ---
>
> Key: SPARK-17701
> URL: https://issues.apache.org/jira/browse/SPARK-17701
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>
> Currently, RowDataSourceScanExec and FileSourceScanExec rely on a "metadata" 
> string map to implement equality comparison, since the RDDs they depend on 
> cannot be directly compared. This has resulted in a number of correctness 
> bugs around exchange reuse, e.g. SPARK-17673 and SPARK-16818.
> To make these comparisons less brittle, we should refactor these classes to 
> compare constructor parameters directly instead of relying on the metadata 
> map.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17673) Reused Exchange Aggregations Produce Incorrect Results

2016-09-27 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15528145#comment-15528145
 ] 

Eric Liang commented on SPARK-17673:


Russell, could you try applying this patch (wip) to see if it resolves the 
issue? https://github.com/apache/spark/pull/15273/files

It fixes equality comparison for row datasource scans to take into account the 
output schema of the scan.

> Reused Exchange Aggregations Produce Incorrect Results
> --
>
> Key: SPARK-17673
> URL: https://issues.apache.org/jira/browse/SPARK-17673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Russell Spitzer
>Priority: Blocker
>  Labels: correctness
>
> https://datastax-oss.atlassian.net/browse/SPARKC-429
> It was brought to my attention that the following code produces incorrect 
> results:
> {code}
>  val data = List(TestData("A", 1, 7))
> val frame = 
> session.sqlContext.createDataFrame(session.sparkContext.parallelize(data))
> frame.createCassandraTable(
>   keySpaceName,
>   table,
>   partitionKeyColumns = Some(Seq("id")))
> frame
>   .write
>   .format("org.apache.spark.sql.cassandra")
>   .mode(SaveMode.Append)
>   .options(Map("table" -> table, "keyspace" -> keySpaceName))
>   .save()
> val loaded = sparkSession.sqlContext
>   .read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> table, "keyspace" -> ks))
>   .load()
>   .select("id", "col1", "col2")
> val min1 = loaded.groupBy("id").agg(min("col1").as("min"))
> val min2 = loaded.groupBy("id").agg(min("col2").as("min"))
>  min1.union(min2).show()
> /* prints:
>   +---+---+
>   | id|min|
>   +---+---+
>   |  A|  1|
>   |  A|  1|
>   +---+---+
>  Should be 
>   | A| 1|
>   | A| 7|
>  */
> {code}
> I looked into the explain pattern and saw 
> {code}
> Union
> :- *HashAggregate(keys=[id#93], functions=[min(col1#94)])
> :  +- Exchange hashpartitioning(id#93, 200)
> : +- *HashAggregate(keys=[id#93], functions=[partial_min(col1#94)])
> :+- *Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@7ec20844 
> [id#93,col1#94]
> +- *HashAggregate(keys=[id#93], functions=[min(col2#95)])
>+- ReusedExchange [id#93, min#153], Exchange hashpartitioning(id#93, 200)
> {code}
> This was different from using a parallelized collection as the DF backing, so 
> I tested the same code with a Parquet-backed DF and saw the same results.
> {code}
> frame.write.parquet("garbagetest")
> val parquet = sparkSession.read.parquet("garbagetest").select("id", 
> "col1", "col2")
> println("PDF")
> parquetmin1.union(parquetmin2).explain()
> parquetmin1.union(parquetmin2).show()
> /* prints:
>   +---+---+
>   | id|min|
>   +---+---+
>   |  A|  1|
>   |  A|  1|
>   +---+---+
> */
> {code}
> This leads me to believe there is something wrong with the reused exchange. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17673) Reused Exchange Aggregations Produce Incorrect Results

2016-09-27 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527339#comment-15527339
 ] 

Eric Liang commented on SPARK-17673:


I'm looking at this now.

> Reused Exchange Aggregations Produce Incorrect Results
> --
>
> Key: SPARK-17673
> URL: https://issues.apache.org/jira/browse/SPARK-17673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Russell Spitzer
>Priority: Blocker
>  Labels: correctness
>
> https://datastax-oss.atlassian.net/browse/SPARKC-429
> It was brought to my attention that the following code produces incorrect 
> results:
> {code}
>  val data = List(TestData("A", 1, 7))
> val frame = 
> session.sqlContext.createDataFrame(session.sparkContext.parallelize(data))
> frame.createCassandraTable(
>   keySpaceName,
>   table,
>   partitionKeyColumns = Some(Seq("id")))
> frame
>   .write
>   .format("org.apache.spark.sql.cassandra")
>   .mode(SaveMode.Append)
>   .options(Map("table" -> table, "keyspace" -> keySpaceName))
>   .save()
> val loaded = sparkSession.sqlContext
>   .read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> table, "keyspace" -> ks))
>   .load()
>   .select("id", "col1", "col2")
> val min1 = loaded.groupBy("id").agg(min("col1").as("min"))
> val min2 = loaded.groupBy("id").agg(min("col2").as("min"))
>  min1.union(min2).show()
> /* prints:
>   +---+---+
>   | id|min|
>   +---+---+
>   |  A|  1|
>   |  A|  1|
>   +---+---+
>  Should be 
>   | A| 1|
>   | A| 7|
>  */
> {code}
> I looked into the explain pattern and saw 
> {code}
> Union
> :- *HashAggregate(keys=[id#93], functions=[min(col1#94)])
> :  +- Exchange hashpartitioning(id#93, 200)
> : +- *HashAggregate(keys=[id#93], functions=[partial_min(col1#94)])
> :+- *Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@7ec20844 
> [id#93,col1#94]
> +- *HashAggregate(keys=[id#93], functions=[min(col2#95)])
>+- ReusedExchange [id#93, min#153], Exchange hashpartitioning(id#93, 200)
> {code}
> This was different from using a parallelized collection as the DF backing, so 
> I tested the same code with a Parquet-backed DF and saw the same results.
> {code}
> frame.write.parquet("garbagetest")
> val parquet = sparkSession.read.parquet("garbagetest").select("id", 
> "col1", "col2")
> println("PDF")
> parquetmin1.union(parquetmin2).explain()
> parquetmin1.union(parquetmin2).show()
> /* prints:
>   +---+---+
>   | id|min|
>   +---+---+
>   |  A|  1|
>   |  A|  1|
>   +---+---+
> */
> {code}
> This leads me to believe there is something wrong with the reused exchange. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17472) Better error message for serialization failures of large objects in Python

2016-09-09 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17472:
--

 Summary: Better error message for serialization failures of large 
objects in Python
 Key: SPARK-17472
 URL: https://issues.apache.org/jira/browse/SPARK-17472
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Eric Liang
Priority: Minor


{code}
def run():
  import numpy.random as nr
  b = nr.bytes(8 * 10)
  sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()

run()
{code}

Gives you the following error from pickle
{code}
error: 'i' format requires -2147483648 <= number <= 2147483647

---
error Traceback (most recent call last)
 in ()
  4   sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
  5 
> 6 run()

 in run()
  2   import numpy.random as nr
  3   b = nr.bytes(8 * 10)
> 4   sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
  5 
  6 run()

/databricks/spark/python/pyspark/rdd.pyc in count(self)
   1002 3
   1003 """
-> 1004 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1005 
   1006 def stats(self):

/databricks/spark/python/pyspark/rdd.pyc in sum(self)
993 6.0
994 """
--> 995 return self.mapPartitions(lambda x: [sum(x)]).fold(0, 
operator.add)
996 
997 def count(self):

/databricks/spark/python/pyspark/rdd.pyc in fold(self, zeroValue, op)
867 # zeroValue provided to each partition is unique from the one 
provided
868 # to the final reduce call
--> 869 vals = self.mapPartitions(func).collect()
870 return reduce(op, vals, zeroValue)
871 

/databricks/spark/python/pyspark/rdd.pyc in collect(self)
769 """
770 with SCCallSiteSync(self.context) as css:
--> 771 port = 
self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
772 return list(_load_from_socket(port, self._jrdd_deserializer))
773 

/databricks/spark/python/pyspark/rdd.pyc in _jrdd(self)
   2377 command = (self.func, profiler, self._prev_jrdd_deserializer,
   2378self._jrdd_deserializer)
-> 2379 pickled_cmd, bvars, env, includes = 
_prepare_for_python_RDD(self.ctx, command, self)
   2380 python_rdd = self.ctx._jvm.PythonRDD(self._prev_jrdd.rdd(),
   2381  bytearray(pickled_cmd),

/databricks/spark/python/pyspark/rdd.pyc in _prepare_for_python_RDD(sc, 
command, obj)
   2297 # the serialized command will be compressed by broadcast
   2298 ser = CloudPickleSerializer()
-> 2299 pickled_command = ser.dumps(command)
   2300 if len(pickled_command) > (1 << 20):  # 1M
   2301 # The broadcast will have same life cycle as created PythonRDD

/databricks/spark/python/pyspark/serializers.pyc in dumps(self, obj)
426 
427 def dumps(self, obj):
--> 428 return cloudpickle.dumps(obj, 2)
429 
430 

/databricks/spark/python/pyspark/cloudpickle.pyc in dumps(obj, protocol)
655 
656 cp = CloudPickler(file,protocol)
--> 657 cp.dump(obj)
658 
659 return file.getvalue()

/databricks/spark/python/pyspark/cloudpickle.pyc in dump(self, obj)
105 self.inject_addons()
106 try:
--> 107 return Pickler.dump(self, obj)
108 except RuntimeError as e:
109 if 'recursion' in e.args[0]:

/usr/lib/python2.7/pickle.pyc in dump(self, obj)
222 if self.proto >= 2:
223 self.write(PROTO + chr(self.proto))
--> 224 self.save(obj)
225 self.write(STOP)
226 

/usr/lib/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288 

/usr/lib/python2.7/pickle.pyc in save_tuple(self, obj)
560 write(MARK)
561 for element in obj:
--> 562 save(element)
563 
564 if id(obj) in memo:

/usr/lib/python2.7/pickle.pyc in save(self, obj)
284 f = self.dispatch.get(t)
285 if f:
--> 286 f(self, obj) # Call unbound method with explicit self
287 return
288 

/databricks/spark/python/pyspark/cloudpickle.pyc in save_function(self, obj, 
name)
202 klass = getattr(themodule, name, None)
203 if klass is None or klass is not obj:
--> 204 self.save_function_tuple(obj)
205 return
206 

/databricks/spark/python/pyspark/cloudpickle.pyc in save_function_tuple(self, 
func)
239 # create a skeleton function object and memoize it
240 

[jira] [Updated] (SPARK-17370) Shuffle service files not invalidated when a slave is lost

2016-09-01 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17370:
---
Component/s: Spark Core

> Shuffle service files not invalidated when a slave is lost
> --
>
> Key: SPARK-17370
> URL: https://issues.apache.org/jira/browse/SPARK-17370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Eric Liang
>
> DAGScheduler invalidates shuffle files when an executor loss event occurs, 
> but not when the external shuffle service is enabled. This is because when 
> shuffle service is on, the shuffle file lifetime can exceed the executor 
> lifetime.
> However, it doesn't invalidate shuffle files when the shuffle service itself 
> is lost (due to whole slave loss). This can cause long hangs when slaves are 
> lost since the file loss is not detected until a subsequent stage attempts to 
> read the shuffle files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17371) Resubmitted stage outputs deleted by zombie map tasks on stop()

2016-09-01 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17371:
--

 Summary: Resubmitted stage outputs deleted by zombie map tasks on 
stop()
 Key: SPARK-17371
 URL: https://issues.apache.org/jira/browse/SPARK-17371
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Eric Liang


It seems that old shuffle map tasks hanging around after a stage resubmit will 
delete the intended shuffle output files on stop(), causing downstream stages to 
fail even after the resubmitted stage completes successfully. This can happen 
easily if the prior map task is waiting for a network timeout when its stage is 
resubmitted.

This can cause unnecessary stage resubmits, sometimes multiple times, and very 
confusing FetchFailure messages that report shuffle index files missing from 
the local disk.

Given that IndexShuffleBlockResolver commits data atomically, it seems 
unnecessary to ever delete committed task output: even in the rare case that a 
task is failed after it finishes committing shuffle output, it should be safe 
to retain that output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17370) Shuffle service files not invalidated when a slave is lost

2016-09-01 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17370:
--

 Summary: Shuffle service files not invalidated when a slave is lost
 Key: SPARK-17370
 URL: https://issues.apache.org/jira/browse/SPARK-17370
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang


DAGScheduler invalidates shuffle files when an executor loss event occurs, but 
not when the external shuffle service is enabled. This is because when shuffle 
service is on, the shuffle file lifetime can exceed the executor lifetime.

However, it doesn't invalidate shuffle files when the shuffle service itself is 
lost (due to whole slave loss). This can cause long hangs when slaves are lost 
since the file loss is not detected until a subsequent stage attempts to read 
the shuffle files.
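
The intended decision, written out as a sketch (a standalone helper with 
hypothetical parameters, not the actual DAGScheduler code):

{code}
// Returns true when shuffle output registered on the lost executor should be
// unregistered so the corresponding map stages get re-run.
def shouldInvalidateShuffleFiles(
    externalShuffleServiceEnabled: Boolean,
    hostLost: Boolean): Boolean = {
  if (!externalShuffleServiceEnabled) {
    // Shuffle files live and die with the executor, so any executor loss
    // invalidates them (this is what DAGScheduler already does).
    true
  } else {
    // With the external shuffle service, files outlive executors but not the
    // slave itself; this issue is about also invalidating on host loss.
    hostLost
  }
}
{code}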



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-22 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431616#comment-15431616
 ] 

Eric Liang commented on SPARK-17042:


Yeah, my bad. I was trying to split this up but it turns out to be unnecessary.

> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17162) Range does not support SQL generation

2016-08-19 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17162:
--

 Summary: Range does not support SQL generation
 Key: SPARK-17162
 URL: https://issues.apache.org/jira/browse/SPARK-17162
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang
Priority: Minor


{code}
scala> sql("create view a as select * from range(100)")
16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as select 
* from range(100)
java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, 
splits=8)

  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212)
  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
  at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
  at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
  at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17069) Expose spark.range() as table-valued function in SQL

2016-08-15 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17069:
--

 Summary: Expose spark.range() as table-valued function in SQL
 Key: SPARK-17069
 URL: https://issues.apache.org/jira/browse/SPARK-17069
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Eric Liang
Priority: Minor


The idea here is to create the spark.range( x ) equivalent in SQL, so we can do 
something like

{noformat}
select count(*) from range(1)
{noformat}

This would be useful for sql-only testing and benchmarks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-12 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17042:
---
Description: 
A simple fix is to erase the classTag when using the default serializer, since 
it's not needed in that case, and the classTag was failing to deserialize on 
the remote end.

The proper fix is actually to use the right classloader when deserializing the 
classtags, but that is a much more invasive change for 2.0.

The following test can be added to ReplSuite to reproduce the bug:

{code}
  test("replicating blocks of object with class defined in repl") {
val output = runInterpreter("local-cluster[2,1,1024]",
  """
|import org.apache.spark.storage.StorageLevel._
|case class Foo(i: Int)
|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_ONLY_2)
|ret.count()
|sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
  """.stripMargin)
assertDoesNotContain("error:", output)
assertDoesNotContain("Exception", output)
assertContains(": Int = 20", output)
  }
{code}
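
A sketch of the "erase the classTag" workaround described above (illustrative; 
not the actual patch):

{code}
import scala.reflect.ClassTag
import org.apache.spark.serializer.{JavaSerializer, Serializer}

// Java serialization does not need the element ClassTag, so sending a plain
// ClassTag[Any] avoids deserializing a REPL-defined tag on the remote end.
def tagForReplication[T](serializer: Serializer, elementTag: ClassTag[T]): ClassTag[_] =
  if (serializer.isInstanceOf[JavaSerializer]) ClassTag.Any else elementTag
{code}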

  was:
The following test can be added to ReplSuite to reproduce the bug:

{code}
  test("replicating blocks of object with class defined in repl") {
val output = runInterpreter("local-cluster[2,1,1024]",
  """
|import org.apache.spark.storage.StorageLevel._
|case class Foo(i: Int)
|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_ONLY_2)
|ret.count()
|sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
  """.stripMargin)
assertDoesNotContain("error:", output)
assertDoesNotContain("Exception", output)
assertContains(": Int = 20", output)
  }
{code}


> Repl-defined classes cannot be replicated
> -
>
> Key: SPARK-17042
> URL: https://issues.apache.org/jira/browse/SPARK-17042
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Eric Liang
>
> A simple fix is to erase the classTag when using the default serializer, 
> since it's not needed in that case, and the classTag was failing to 
> deserialize on the remote end.
> The proper fix is actually to use the right classloader when deserializing 
> the classtags, but that is a much more invasive change for 2.0.
> The following test can be added to ReplSuite to reproduce the bug:
> {code}
>   test("replicating blocks of object with class defined in repl") {
> val output = runInterpreter("local-cluster[2,1,1024]",
>   """
> |import org.apache.spark.storage.StorageLevel._
> |case class Foo(i: Int)
> |val ret = sc.parallelize((1 to 100).map(Foo), 
> 10).persist(MEMORY_ONLY_2)
> |ret.count()
> |sc.getExecutorStorageStatus.map(s => 
> s.rddBlocksById(ret.id).size).sum
>   """.stripMargin)
> assertDoesNotContain("error:", output)
> assertDoesNotContain("Exception", output)
> assertContains(": Int = 20", output)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17042) Repl-defined classes cannot be replicated

2016-08-12 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17042:
--

 Summary: Repl-defined classes cannot be replicated
 Key: SPARK-17042
 URL: https://issues.apache.org/jira/browse/SPARK-17042
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang


The following test can be added to ReplSuite to reproduce the bug:

{code}
  test("replicating blocks of object with class defined in repl") {
val output = runInterpreter("local-cluster[2,1,1024]",
  """
|import org.apache.spark.storage.StorageLevel._
|case class Foo(i: Int)
|val ret = sc.parallelize((1 to 100).map(Foo), 
10).persist(MEMORY_ONLY_2)
|ret.count()
|sc.getExecutorStorageStatus.map(s => s.rddBlocksById(ret.id).size).sum
  """.stripMargin)
assertDoesNotContain("error:", output)
assertDoesNotContain("Exception", output)
assertContains(": Int = 20", output)
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16884) Move DataSourceScanExec out of ExistingRDD.scala file

2016-08-03 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-16884:
---
Issue Type: Improvement  (was: Bug)

> Move DataSourceScanExec out of ExistingRDD.scala file
> -
>
> Key: SPARK-16884
> URL: https://issues.apache.org/jira/browse/SPARK-16884
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


