[ https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909549#action_12909549 ]
John Sichi commented on HIVE-1546:
----------------------------------

I think we can start off with this interface:

{noformat}
/**
 * HiveSemanticAnalyzerHook allows Hive to be extended with custom
 * logic for semantic analysis of QL statements.  This interface
 * and any Hive internals it exposes are currently
 * "limited private and evolving" (unless otherwise stated elsewhere)
 * and intended mainly for use by the Howl project.
 *
 *<p>
 *
 * Note that the lifetime of an instantiated hook object is scoped to
 * the analysis of a single statement; hook instances are never reused.
 */
public interface HiveSemanticAnalyzerHook {
  /**
   * Invoked before Hive performs its own semantic analysis on
   * a statement.  The implementation may inspect the statement AST and
   * prevent its execution by throwing a SemanticException.
   * Optionally, it may also augment/rewrite the AST, but must produce
   * a form equivalent to one which could have
   * been returned directly from Hive's own parser.
   *
   * @param context context information for semantic analysis
   *
   * @param ast AST being analyzed and optionally rewritten
   *
   * @return replacement AST (typically the same as the original AST
   * unless the entire tree had to be replaced; must not be null)
   */
  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException;

  /**
   * Invoked after Hive performs its own semantic analysis on a
   * statement (including optimization).
   * Hive calls postAnalyze on the same hook object
   * as preAnalyze, so the hook can maintain state across the calls.
   *
   * @param context context information for semantic analysis
   *
   * @param rootTasks root tasks produced by semantic analysis;
   * the hook is free to modify this list or its contents
   */
  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException;
}
{noformat}

Plus a companion context interface to be passed in from Hive:

{noformat}
/**
 * Context information provided by Hive to implementations of
 * HiveSemanticAnalyzerHook.
 */
public interface HiveSemanticAnalyzerHookContext {
  /**
   * @return the Hive db instance; hook implementations can use this
   * for purposes such as getting configuration information or making
   * metastore calls
   */
  public Hive getHive();
}
{noformat}

The reason for the context object is that later, if we need to make more information available to the hook, we can just add new getters to the context (rather than adding new parameters to methods such as preAnalyze, which would break existing hook implementations).

Unlike the pre/post exec hooks, we will only allow one hook to be configured (rather than a list). If someone really wants to run more than one, they can write a list hook implementation which delegates to multiple. The conf variable used to load the hook class will be hive.semantic.analyzer.hook, since it's no longer a factory.

We also need an insulation class:

{noformat}
public abstract class AbstractSemanticAnalyzerHook
  implements HiveSemanticAnalyzerHook {

  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException {
    return ast;
  }

  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException {
  }
}
{noformat}

Hook implementations should extend this class to avoid breaking when new methods are added to the HiveSemanticAnalyzerHook interface later.

By using the hook approach, we limit the dependency exposure: only ASTNode, Task, and the org.apache.hadoop.hive.ql.metadata.Hive class can be accessed for now. If we decide to open it up more later, that will be via an agreed-upon decision in a new patch (rather than via ad hoc dependency creep).
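The "list hook implementation which delegates to multiple" idea above could look something like the following sketch. This is purely illustrative and not part of the proposal: Hive's ASTNode and Task types are replaced with simple stand-ins (String for the AST, Object for a task) so the file is self-contained; a real composite would extend AbstractSemanticAnalyzerHook and use the real signatures.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a composite ("list") hook.  The single configured
// hook threads the (possibly rewritten) AST through each delegate in order,
// then fans out postAnalyze the same way.  All type names here are made up.
public class DelegatingHookSketch {

  // Stand-in for HiveSemanticAnalyzerHook.
  interface Hook {
    String preAnalyze(String ast);            // may rewrite the AST or throw
    void postAnalyze(List<Object> rootTasks);
  }

  // Stand-in for AbstractSemanticAnalyzerHook: default no-op postAnalyze.
  static abstract class BaseHook implements Hook {
    public void postAnalyze(List<Object> rootTasks) {
    }
  }

  // The composite hook itself.
  static class DelegatingHook implements Hook {
    private final List<Hook> delegates;

    DelegatingHook(List<Hook> delegates) {
      this.delegates = delegates;
    }

    public String preAnalyze(String ast) {
      // Each delegate sees the output of the previous one, mirroring the
      // "replacement AST" contract of the real preAnalyze.
      for (Hook h : delegates) {
        ast = h.preAnalyze(ast);
      }
      return ast;
    }

    public void postAnalyze(List<Object> rootTasks) {
      for (Hook h : delegates) {
        h.postAnalyze(rootTasks);
      }
    }
  }

  // Demo: two toy delegates, each rewriting the "AST" string.
  static String demoPreAnalyze(String ast) {
    Hook parenthesize = new BaseHook() {
      public String preAnalyze(String a) { return "(" + a + ")"; }
    };
    Hook suffix = new BaseHook() {
      public String preAnalyze(String a) { return a + " -- checked"; }
    };
    Hook composite = new DelegatingHook(Arrays.asList(parenthesize, suffix));
    return composite.preAnalyze(ast);
  }

  public static void main(String[] args) {
    System.out.println(demoPreAnalyze("create table t"));
    // prints: (create table t) -- checked
  }
}
```

Since only one hook class is named in hive.semantic.analyzer.hook, a composite like this keeps the Hive-side configuration surface at a single class while still allowing layered behavior.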
Instead of invoking the hook from many different places inside of SemanticAnalyzer, we'll start by invoking it only pre- and post- the call to sem.analyze. More invocation points (e.g. pre-optimization) can be added later on an as-needed basis.

Howl's table property additions can be saved during the preAnalyze call and then applied to the Task during the postAnalyze call (since the hook is allowed to maintain state); this stays very close to the approach in the current patch. Another way to do this is to skip postAnalyze entirely and instead rewrite the AST to splice in the additional TBLPROPERTIES up front. Both ways are a little messy, but either should be fine.

For handleGenericFileFormat, I think we can deal with it by having Howl edit the AST to delete the STORED AS clause entirely during preAnalyze (in cases where it accepts the storage format). That way Hive can continue to reject it unconditionally, and we avoid adding a third interface method. (Likewise, Howl should delete INPUTDRIVER and OUTPUTDRIVER, and Hive should reject them when it sees them, since they aren't currently meaningful within Hive by itself.)

The only other SemanticAnalyzer dependency I see is getColumns, which can be dealt with by moving it to become a static utility method on ParseUtils. Let me know if I missed any other cases that need to be dealt with.
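The stateful pre/post pattern described above (remember extra table properties in preAnalyze, apply them in postAnalyze on the same hook instance) can be sketched as follows. Again this is a hypothetical illustration, not Howl's actual code: the AST is a String, a task is modeled as a mutable Map of table properties, and the property name is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a stateful hook.  The hook object lives for exactly
// one statement, so fields set in preAnalyze are still present when Hive
// calls postAnalyze on the same instance.
public class StatefulHookSketch {

  // Properties decided on during preAnalyze, applied during postAnalyze.
  private final Map<String, String> savedProps = new HashMap<String, String>();

  // Stand-in for preAnalyze: record the extra TBLPROPERTIES this hook wants
  // to add, then return the AST (a String here) unchanged.
  public String preAnalyze(String ast) {
    savedProps.put("example.prop", "true");  // made-up property name
    return ast;
  }

  // Stand-in for postAnalyze: apply the remembered properties to each root
  // task produced by Hive's own analysis (tasks modeled as property maps).
  public void postAnalyze(List<Map<String, String>> rootTasks) {
    for (Map<String, String> taskProps : rootTasks) {
      taskProps.putAll(savedProps);
    }
  }

  public static void main(String[] args) {
    StatefulHookSketch hook = new StatefulHookSketch();
    hook.preAnalyze("create table t (i int)");

    List<Map<String, String>> tasks = new ArrayList<Map<String, String>>();
    tasks.add(new HashMap<String, String>());
    hook.postAnalyze(tasks);

    System.out.println(tasks.get(0).get("example.prop"));
    // prints: true
  }
}
```

The alternative mentioned above (splicing TBLPROPERTIES into the AST during preAnalyze) would make postAnalyze unnecessary for this case, at the cost of constructing AST nodes by hand.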
> Ability to plug custom Semantic Analyzers for Hive Grammar
> ----------------------------------------------------------
>
> Key: HIVE-1546
> URL: https://issues.apache.org/jira/browse/HIVE-1546
> Project: Hadoop Hive
> Issue Type: Improvement
> Components: Metastore
> Affects Versions: 0.7.0
> Reporter: Ashutosh Chauhan
> Assignee: Ashutosh Chauhan
> Fix For: 0.7.0
>
> Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, hive-1546_2.patch, Howl_Semantic_Analysis.txt
>
> It will be useful if Semantic Analysis phase is made pluggable such that other projects can do custom analysis of hive queries before doing metastore operations on them.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.