[ https://issues.apache.org/jira/browse/HIVE-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909549#action_12909549 ]

John Sichi commented on HIVE-1546:
----------------------------------

I think we can start off with this interface:

{noformat}
/**
 * HiveSemanticAnalyzerHook allows Hive to be extended with custom
 * logic for semantic analysis of QL statements.  This interface
 * and any Hive internals it exposes are currently 
 * "limited private and evolving" (unless otherwise stated elsewhere)
 * and intended mainly for use by the Howl project.
 *
 *<p>
 *
 * Note that the lifetime of an instantiated hook object is scoped to
 * the analysis of a single statement; hook instances are never reused.
 */
public interface HiveSemanticAnalyzerHook {
  /**
   * Invoked before Hive performs its own semantic analysis on
   * a statement.  The implementation may inspect the statement AST and
   * prevent its execution by throwing a SemanticException.
   * Optionally, it may also augment/rewrite the AST, but must produce
   * a form equivalent to one which could have
   * been returned directly from Hive's own parser.
   *
   * @param context context information for semantic analysis
   *
   * @param ast AST being analyzed and optionally rewritten
   *
   * @return replacement AST (typically the same as the original AST unless the
   * entire tree had to be replaced; must not be null)
   */
  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException;

  /**
   * Invoked after Hive performs its own semantic analysis on a
   * statement (including optimization).  
   * Hive calls postAnalyze on the same hook object
   * as preAnalyze, so the hook can maintain state across the calls.
   *
   * @param context context information for semantic analysis
   *
   * @param rootTasks root tasks produced by semantic analysis;
   * the hook is free to modify this list or its contents
   */
  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException;
}
{noformat}

Plus companion context interface to be passed in from Hive:

{noformat}
/**
 * Context information provided by Hive to implementations of
 * HiveSemanticAnalyzerHook.
 */
public interface HiveSemanticAnalyzerHookContext {
  /**
   * @return the Hive db instance; hook implementations can use this for
   * purposes such as getting configuration information or making metastore calls
   */
  public Hive getHive();
}
{noformat}

The reason for the context parameter is that if we later need to make more
information available to the hook, we can just add new getters to the context
object, rather than adding new parameters to methods such as preAnalyze
(which would break existing hook implementations).
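
For example, if we later decide the hook needs access to the conf, we could grow
the context like this (getConf here is purely a hypothetical future addition, not
part of this proposal):

{noformat}
public interface HiveSemanticAnalyzerHookContext {
  public Hive getHive();

  // hypothetical future addition:  existing hook implementations keep
  // compiling because only the context grows, not the signatures of
  // preAnalyze/postAnalyze
  public HiveConf getConf();
}
{noformat}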

Unlike pre/post exec hooks, we will only allow one hook to be
configured (rather than a list).  If someone really wants to run
more than one, they can write a list hook implementation which delegates
to multiple.  The conf variable used to load the hook class will be
hive.semantic.analyzer.hook since it's no longer a factory.
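
For illustration, such a delegating hook could look roughly like this (sketch
only; DelegatingSemanticAnalyzerHook is not part of the patch, and since hooks
are instantiated reflectively by class name, a real one would need to read its
delegate class names from configuration rather than take them via a
constructor):

{noformat}
import java.io.Serializable;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.SemanticException;

public class DelegatingSemanticAnalyzerHook implements HiveSemanticAnalyzerHook {

  private final List<HiveSemanticAnalyzerHook> hooks;

  public DelegatingSemanticAnalyzerHook(List<HiveSemanticAnalyzerHook> hooks) {
    this.hooks = hooks;
  }

  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException
  {
    // each delegate sees the (possibly rewritten) AST returned by the previous one
    for (HiveSemanticAnalyzerHook hook : hooks) {
      ast = hook.preAnalyze(context, ast);
    }
    return ast;
  }

  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException
  {
    for (HiveSemanticAnalyzerHook hook : hooks) {
      hook.postAnalyze(context, rootTasks);
    }
  }
}
{noformat}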

We also need an insulation class:

{noformat}
public abstract class AbstractSemanticAnalyzerHook 
  implements HiveSemanticAnalyzerHook {

  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException
  {
    return ast;
  }

  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException
  {
  }
}
{noformat}

Hook implementations should extend this to avoid breaking when we add
new methods to the HiveSemanticAnalyzerHook interface later.
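
For example, a trivial hook which vetoes DROP TABLE statements could be written
like this (illustrative only; I'm assuming the HiveParser.TOK_DROPTABLE constant
for the check):

{noformat}
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.HiveParser;
import org.apache.hadoop.hive.ql.parse.SemanticException;

public class DisallowDropTableHook extends AbstractSemanticAnalyzerHook {
  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException
  {
    // reject the statement before Hive's own semantic analysis runs
    if (ast.getType() == HiveParser.TOK_DROPTABLE) {
      throw new SemanticException("DROP TABLE is not permitted here");
    }
    return ast;
  }
}
{noformat}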

By using the hook approach, we limit the dependency exposure: only ASTNodes,
Tasks, and the org.apache.hadoop.hive.ql.metadata.Hive class can be accessed
for now.  If we decide to open it up more later, that will be via an
agreed-upon decision in a new patch (rather than via ad hoc dependency
creep).

Instead of invoking the hook from many different places inside of
SemanticAnalyzer, we'll start by invoking it only before and after
the call to sem.analyze.  More invocation points
(e.g. pre-optimization) can be added later on an as-needed basis.
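
The call sequence on the Hive side would be roughly the following (hypothetical
fragment of Driver.compile, exception handling omitted;
HiveSemanticAnalyzerHookContextImpl is an assumed Hive-side implementation of
the context interface):

{noformat}
// load the hook (if any) named by hive.semantic.analyzer.hook
String hookName = conf.get("hive.semantic.analyzer.hook");
HiveSemanticAnalyzerHook hook = null;
if ((hookName != null) && !hookName.trim().equals("")) {
  hook = (HiveSemanticAnalyzerHook) Class.forName(
    hookName, true, JavaUtils.getClassLoader()).newInstance();
}

BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, tree);
if (hook == null) {
  sem.analyze(tree, ctx);
} else {
  HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
  tree = hook.preAnalyze(hookCtx, tree);     // hook may reject or rewrite the AST
  sem.analyze(tree, ctx);
  hook.postAnalyze(hookCtx, sem.getRootTasks());
}
{noformat}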

Howl's table property additions can be saved during the preAnalyze
call and then applied to the Task during the postAnalyze call (since
the hook is allowed to maintain state); this stays very close to the
approach in the current patch.  Another way to do this is to skip
postAnalyze entirely and instead rewrite the AST to splice in the
additional TBLPROPERTIES up front.  Both ways are a little messy,
but either should be fine.
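
A rough sketch of the first (stateful) approach, with the Howl-specific details
left as comments and a hypothetical property name:

{noformat}
import java.io.Serializable;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hive.ql.exec.Task;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.SemanticException;

public class HowlCreateTableHook extends AbstractSemanticAnalyzerHook {

  // state captured in preAnalyze and consumed in postAnalyze; this is safe
  // because a hook instance is scoped to the analysis of a single statement
  private final Map<String, String> extraTblProps = new HashMap<String, String>();

  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException
  {
    // inspect the CREATE TABLE AST and record whatever Howl-specific
    // table properties should be added (property name is hypothetical)
    extraTblProps.put("howl.example.property", "value-derived-from-ast");
    return ast;
  }

  public void postAnalyze(
    HiveSemanticAnalyzerHookContext context,
    List<Task<? extends Serializable>> rootTasks) throws SemanticException
  {
    // walk rootTasks, locate the DDL task carrying the table definition, and
    // merge extraTblProps into its table properties (lookup details omitted)
  }
}
{noformat}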

For handleGenericFileFormat, I think we can deal with it by having
Howl edit the AST to delete the STORED AS clause entirely during
preAnalyze (in cases where it accepts the storage format).  That way
Hive can continue to reject it unconditionally, and we can avoid
adding a third interface method.
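
The AST surgery in Howl's preAnalyze could look something like this (sketch
only; I'm assuming HiveParser.TOK_FILEFORMAT_GENERIC is the node produced for a
generic STORED AS clause, and validateStorageFormat is a hypothetical Howl-side
check):

{noformat}
import org.antlr.runtime.tree.Tree;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.HiveParser;
import org.apache.hadoop.hive.ql.parse.SemanticException;

public class HowlStorageFormatHook extends AbstractSemanticAnalyzerHook {
  public ASTNode preAnalyze(
    HiveSemanticAnalyzerHookContext context,
    ASTNode ast) throws SemanticException
  {
    if (ast.getType() != HiveParser.TOK_CREATETABLE) {
      return ast;
    }
    for (int i = ast.getChildCount() - 1; i >= 0; i--) {
      Tree child = ast.getChild(i);
      // token name assumed; adjust to whatever the grammar actually emits
      if (child.getType() == HiveParser.TOK_FILEFORMAT_GENERIC) {
        validateStorageFormat(child);   // reject unsupported formats here
        ast.deleteChild(i);             // Hive itself never sees the clause
      }
    }
    return ast;
  }

  private void validateStorageFormat(Tree formatNode) throws SemanticException {
    // ... Howl-specific validation of the requested storage format ...
  }
}
{noformat}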

(Likewise, Howl should delete INPUTDRIVER and OUTPUTDRIVER, and Hive
should reject them when it sees them, since they aren't currently
meaningful within Hive by itself.)

The only other SemanticAnalyzer dependency I see is getColumns, and
that can be dealt with by moving it to become a static utility method
on ParseUtils.

Let me know if I missed any other cases that need to be dealt with.


> Ability to plug custom Semantic Analyzers for Hive Grammar
> ----------------------------------------------------------
>
>                 Key: HIVE-1546
>                 URL: https://issues.apache.org/jira/browse/HIVE-1546
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 0.7.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Ashutosh Chauhan
>             Fix For: 0.7.0
>
>         Attachments: hive-1546-3.patch, hive-1546-4.patch, hive-1546.patch, 
> hive-1546_2.patch, Howl_Semantic_Analysis.txt
>
>
> It will be useful if the Semantic Analysis phase is made pluggable so that 
> other projects can do custom analysis of Hive queries before doing metastore 
> operations on them. 
