Jin Shang created ARROW-17668:
---------------------------------

             Summary: [C++][Gandiva]Add parser frontend for Gandiva
                 Key: ARROW-17668
                 URL: https://issues.apache.org/jira/browse/ARROW-17668
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++ - Gandiva
            Reporter: Jin Shang
            Assignee: Jin Shang


This issue is a feature proposal to add a parser frontend for Gandiva.
h2. Background

My team uses an expression computation library for our C++ feature engineering 
pipeline. We currently use [Exprtk]([https://github.com/ArashPartow/exprtk]). 
We recently tried out Gandiva and wrote some benchmarks. We discovered that 
Gandiva is several times faster than Exprtk in our use cases. Therefore we 
decided to switch to Gandiva for computing expressions. 
h2. Objective

As of current, due to its lack of a frontend, we need to manually construct an 
AST to use Gandiva. This is inconvenient and requires extra learning cost. We 
would like to implement a parser frontend for Gandiva, so that Gandiva becomes 
a standalone complete expression compiler and evaluator, and a drop-in 
replacement for the existing libraries like like 
[Exprtk]([https://github.com/ArashPartow/exprtk]) and 
[TinyExpr]([https://github.com/codeplea/tinyexpr]). The goal is to enable the 
following functionality:

 
{code:cpp}
// Create schema for gandiva
auto field_x = arrow::field(x, arrow::int32());
auto field_y = arrow::field(y, arrow::float64());
auto field_z = arrow::field(z, arrow::boolean());
auto schema = arrow::scehma({field_x, field_y, field_z});

// Use the Parser to generate an ExpressionPtr
std::string expr_str = "if(z, castFloat8(x), y * 1000.0)";
auto parser = gandiva::Parser();
std::shared_ptr<gandiva::ExpressionPtr> expr_ptr;
auto status = parser.Parse(schema, expr_str, &expr_ptr);

// The rest is normal usage of Gandiva projector
std::shared_ptr<gandiva::Projector> projector;
auto status = gandiva::Projector::Make(schema, {expr_ptr}, &projector);
auto in_batch = arrow::RecordBatch::Make(schema, size, input_arr);
auto* pool = arrow::default_memory_pool();
arrow::ArrayVector outputs;
auto status = projector->Evaluate(*in_batch, pool, &outputs);
{code}
 
h2. Syntax

The goal is to design a succinct and intuitive grammar for both schema and 
Gandiva expressions. We will need a corresponding grammar for each Node type in 
Gandiva.
 - Literals: We find Rust’s literal 
representation([https://doc.rust-lang.org/rust-by-example/types/literals.html]([https://doc.rust-lang.org/rust-by-example/types/literals.html]))
 very intuitive. We’ll support suffix such as `i32`, `u64`, `f32` to denote a 
literal’s type. The types of unsuffixed literals are inferred by their usage. 
Otherwise, the default type for integers is `int32` and `float32` for floating 
points. String and binary literals are wrapped with single or double quotes. 
Decimal128 literals will not be supported in the first version.
 - Fields: Just their names as defined in the schema. To avoid conflicts with 
other node types, field names must start with alphabetical letters.
 - Functions: `<function_name>(<param1>, <param2>, ...)`. For functions with 
multiple overloads, their return type is inferred from input types. For 
commonly used functions, we would also like to support infix forms. They 
include:
    - Comparisons: equal(==), not equal(!=), greater than(>), greater than or 
equal to(>=), less than(<), less than or equal to(<=)
    - Arithmetics: add( +), substract( -), multiply( *), divide( /), modulo(%), 
power({^}), bitwise and(&), bitwise or(|), bitwise xor({^}), bitwise not(~)
    
    Function alias with spaces in their names won’t be supported such as “is 
not false” are not supported.
    
 - If: We could like to support two grammars for if expressions:
    - `if(<cond>, <then>, <else>)` for its simplicity and functional feel;
    - `if(<cond>) \{ <then> } else \{ <else> }` since it’s the C++ `if` grammar 
and has better formatting for complex expressions.
 - Boolean: We would like to support both `&& ||` and `and or` keywords same as 
C++.
 - InExpression: `<eval> in (<member1>, <member2>, ...)` . Its type is also 
inferred.

The grammar can be roughly represented as:
{code:cpp}
exp := literal | field | function | if | boolean | inexpression
literal := INT_LITERAL | INT_LITERAL_SUFFIX | FLOAT_LITERAL | 
FLOAT_LITERAL_SUFFIX | "-" literal
field := IDENTIFIER
function := IDENTIFIER(arguments) | infixfunc
infixfunc := exp "+" exp | exp "-" exp | ...
arguments := exp | argument "," exp
if := IF "(" exp "," exp "," exp ")" | "if" "(" exp ")" "{" exp "}" ELSE "{" 
exp "}"
boolean := and_boolean | or_boolean
and_boolean := exp AND exp | exp "&&" exp
or_boolean := exp OR exp | exp "||" exp
inexpression := exp IN "(" arguments ")"{code}
lower cases are non-terminals and upper cases are tokens.
h2. Implementation
h3. Lexing and Parsing

We would like to use flex and bison for lexing and parsing. They have several 
advantages compared to other options like Antlr4 and Boost::Spirit.
 # They are the most classical and popular parsing library in the cpp world.
 # They allow us to precompile the parser codes and have no runtime 
dependencies.

Flex&bison takes a string and outputs a `gandiva::node` tree. The tree may 
contain untyped nodes, e.g., unsuffixed literals and functions. 
h3. Type Inference

We’ll have a TypeInference class that inherits node visitor, implementing a 
simple DFS algorithm to do type inference. 

For functions, 
 # Get all available signatures of this function from the function registry.
 # Find all candidate signatures that match the current known argument and 
return type.
 ## If none of them matches, there is a type error.
 ## If only one matches, this is the signature. From the signature we also know 
the type of each argument. Visit each argument node carrying their return type.
 ## If more than one matches, visit each argument node first. Then we do 
matching again. It is guaranteed to have zero or one matches because no 
function has overloads that differ only on their return types, because at this 
point each argument’s type is known (because untyped literals are given default 
types upon visit, see below). 

The process is similar for `if`, `boolean`and `in expression` , only their 
“signatures” are manually constructed. For example `if` has `(bool, <type>, 
<type>)-><type>`and `boolean` has `(bool, bool)->(bool)`.

For untyped literals,
 # If a type is passed from its parent, the type is known.
 # Otherwise set it to the default type.

Any feedback is appreciated.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to