[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316017#comment-15316017
 ] 

Xiao Li commented on SPARK-15691:
-

Sure, will follow your suggestions. Thanks!

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move the use of table properties to store data source options into 
> HiveExternalCatalog (a sketch of this idea follows the description below).
>   -- Potentially more.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.
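As a rough illustration of the "store data source options in table properties" 
item above, here is a minimal sketch. The {{TableSpec}} shape and the property 
key names are made-up stand-ins, not the actual Spark API; the point is only 
the round trip from user options to plain string properties and back.

{code:scala}
// Hypothetical, simplified sketch -- not the actual Spark API.
// It shows the idea behind "use table properties to store data source options":
// the provider name and user-supplied options are flattened into plain string
// properties that a Hive metastore can persist, and recovered on the way back.

case class TableSpec(
    name: String,
    provider: String,                 // e.g. "parquet", "orc", "json"
    options: Map[String, String],     // data source options supplied by the user
    properties: Map[String, String])  // what actually gets stored in the metastore

object DataSourcePropertyCodec {
  // Property keys are invented for this example.
  private val ProviderKey = "example.sources.provider"
  private val OptionPrefix = "example.sources.option."

  /** Flatten provider + options into string properties before persisting. */
  def encode(spec: TableSpec): TableSpec = {
    val encoded = Map(ProviderKey -> spec.provider) ++
      spec.options.map { case (k, v) => (OptionPrefix + k, v) }
    spec.copy(properties = spec.properties ++ encoded)
  }

  /** Recover provider + options from the persisted properties. */
  def decode(spec: TableSpec): (String, Map[String, String]) = {
    val provider = spec.properties.getOrElse(ProviderKey, "hive")
    val options = spec.properties.collect {
      case (k, v) if k.startsWith(OptionPrefix) => k.stripPrefix(OptionPrefix) -> v
    }
    (provider, options)
  }
}
{code}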






[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-05 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316010#comment-15316010
 ] 

Reynold Xin commented on SPARK-15691:
-

[~smilegator] -- do you mind using Google Docs for the design doc? It'd be 
easier to leave comments inline.

One piece of high-level feedback: the current doc is very bottom-up -- it talks 
about functions, APIs, and code to move from one place to another. Those 
details are useful, but it'd be great to start the doc with something high 
level, e.g. what components/classes we should have in an ideal end state.

Another high-level comment (you might be thinking about some of this already, 
but it is not clear from the current doc): often there are multiple 
alternatives, and it is good to discuss their tradeoffs (or whether one 
dominates the other). For example, for parquet/orc conversion I can think of 
two ways to do it: one is to put it in HiveExternalCatalog, and the other is 
to move it into the more general handling in SessionCatalog. The two have 
their own pros and cons.
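
To make the parquet/orc conversion concrete, here is a hypothetical sketch of 
it as a plan rewrite over toy plan nodes; the class and node names below are 
assumptions, not Catalyst's actual types. The shape of the rewrite stays the 
same whichever of the two places it ends up living in.

{code:scala}
// Toy plan nodes standing in for Catalyst's logical plans -- illustrative only.
sealed trait Plan
case class HiveTableRelation(name: String, serde: String) extends Plan  // read through Hive's SerDe path
case class NativeFileScan(name: String, format: String) extends Plan    // read through Spark's own reader
case class Project(columns: Seq[String], child: Plan) extends Plan

// The "conversion" is a recursive rewrite: Hive tables whose storage format
// is Parquet/ORC are replaced by native file scans. The open question above
// is where this rewrite should live, not what it does.
object ConvertParquetOrcTables {
  def apply(plan: Plan): Plan = plan match {
    case HiveTableRelation(name, serde) if serde.toLowerCase.contains("parquet") =>
      NativeFileScan(name, "parquet")
    case HiveTableRelation(name, serde) if serde.toLowerCase.contains("orc") =>
      NativeFileScan(name, "orc")
    case Project(columns, child) =>
      Project(columns, apply(child))
    case other =>
      other
  }
}
{code}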





[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315964#comment-15315964
 ] 

Xiao Li commented on SPARK-15691:
-

In the PDF document, all the underlined text consists of hyperlinks that point 
to the related content.




[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315960#comment-15315960
 ] 

Xiao Li commented on SPARK-15691:
-

For refactoring {{HiveMetastoreCatalog.scala}}, I just finished the design doc. 
The PDF version is available at the following link: 
https://www.dropbox.com/s/tsaoq2joegkdh1h/2016.06.05.HiveMetastoreCatalog.scala%20Refactoring.pdf?dl=0
The original Markdown file can be downloaded via 
https://www.dropbox.com/s/uita63wkdrmuqr2/2016.06.05.HiveMetastoreCatalog.scala%20Refactoring.md?dl=0

Please let me know if this is the right direction, and correct me if anything 
is not appropriate. [~rxin]

Thank you very much!




[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-04 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315368#comment-15315368
 ] 

Xiao Li commented on SPARK-15691:
-

Thanks!




[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315365#comment-15315365
 ] 

Reynold Xin commented on SPARK-15691:
-

Great. Looking forward to it!







[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15315357#comment-15315357
 ] 

Xiao Li commented on SPARK-15691:
-

Finally, {{HiveMetastoreCatalog}} has been cleaned up in my local branch. It is 
now a pure data source table cache, and the LOC is down to 90. We need a new 
name!
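
As a rough sketch of what such a pure cache could look like (hypothetical names 
and toy types, not the actual implementation):

{code:scala}
import scala.collection.concurrent.TrieMap

// Hypothetical stand-ins for the real table identifier and resolved plan types.
case class QualifiedTableName(database: String, table: String)
case class ResolvedRelation(provider: String, options: Map[String, String])

// Once everything else has moved out, the class is little more than a map from
// table name to an already-resolved relation, plus invalidation hooks.
class DataSourceTableCache(resolve: QualifiedTableName => ResolvedRelation) {
  private val cache = TrieMap.empty[QualifiedTableName, ResolvedRelation]

  /** Return the cached relation, resolving and caching it on first access. */
  def getOrResolve(name: QualifiedTableName): ResolvedRelation =
    cache.getOrElseUpdate(name, resolve(name))

  /** Drop one entry, e.g. after ALTER TABLE or REFRESH TABLE. */
  def invalidate(name: QualifiedTableName): Unit = {
    cache.remove(name)
  }

  /** Drop everything. */
  def invalidateAll(): Unit = cache.clear()
}
{code}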

The four Hive-specific {{Analyzer}} rules have been moved to 
{{HiveStrategies.scala}}. This mirrors {{DataSourceStrategy.scala}}, which 
contains both {{Analyzer}} rules and {{SparkPlanner}} strategies.

I also combined duplicate code and refactored a few other places.

I will write and upload a design doc tomorrow to document the changes in detail 
for further review. Thanks!




[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-02 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311878#comment-15311878
 ] 

Xiao Li commented on SPARK-15691:
-

IMO, this is the first component we need to refactor, but it is a very 
interesting one. Many concepts are mixed into the same class: {{SparkSession}}, 
{{SessionState}}, {{DataSource}}, the parser, Hive-specific analyzer rules, the 
cache, {{MetastoreRelation}}, {{MetaStorePartitionedTableFileCatalog}}, and so 
on. I am still trying to split it up in a clean way.




[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309487#comment-15309487
 ] 

Reynold Xin commented on SPARK-15691:
-

Updated.





[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-01 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15309438#comment-15309438
 ] 

Yin Huai commented on SPARK-15691:
--

I'd add removing HiveMetastoreCatalog as part of the work that moves 
Hive-specific catalog logic into HiveExternalCatalog.

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
> - Move the Hive specific catalog logic (e.g. using properties to store data 
> source options) into HiveExternalCatalog.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.


