Re: Filter push

Julian Hyde Mon, 13 Oct 2014 21:40:09 -0700

Vladimir,

I appreciate the critique of the design, and I agree with just about all of 
your points. However I think we need to help people like Dan, developing the 
RocksDB adapter, get to runnable code faster. An example, as part of the 
optiq-csv demo project, and not requiring any rules or code generation, would 
achieve that.

See detailed comments inline.

Julian

On Oct 13, 2014, at 4:59 PM, Vladimir Sitnikov wrote:

>> * ProjectableCursorableTable goes further, and allows Calcite to
>> specify a list of projected fields and a list of filters. The cursor
>> must implement the projects, but it can choose which filters it is
>> able to implement.
> 
> I am against such interfaces.
> I would be happy to be proven wrong.
> 
> This looks like a rabbit hole: it is a powerful feature, however
> 1) It seems hard to make it fast
> Effectively, it forces engine to interpret the whole thing since Calcite
> won't know if some of the filters are implemented by the table or not.
> We'll have to double-check if the list returned from "projectFilterScan" is
> valid (e.g. it does not contain completely new filters).

Adapters are not allowed to do that. Calcite would throw if an adapter returned 
filters that (based on ==) were not in the original list.

We would know at planning time which of the filters the table can handle. The 
remaining filters can be handled as they are today.
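
A minimal sketch of that check, for illustration only. FilterPushCheck and 
its remaining method are hypothetical names, not Calcite API, and filters are 
modeled as plain Objects rather than RexNodes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Calcite.
final class FilterPushCheck {
  /**
   * Returns the offered filters that the adapter did not accept, in their
   * original order. Throws if the adapter returns a filter that is not
   * (by ==) one of the offered filters.
   */
  static List<Object> remaining(List<Object> offered, List<Object> accepted) {
    // Identity-based sets: membership is decided by ==, never by equals().
    Set<Object> offeredSet =
        Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
    offeredSet.addAll(offered);
    Set<Object> acceptedSet =
        Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
    for (Object filter : accepted) {
      if (!offeredSet.contains(filter)) {
        throw new IllegalArgumentException(
            "adapter returned a filter that was not offered: " + filter);
      }
      acceptedSet.add(filter);
    }
    // Whatever the adapter did not take is evaluated by the engine as today.
    List<Object> rest = new ArrayList<Object>();
    for (Object filter : offered) {
      if (!acceptedSet.contains(filter)) {
        rest.add(filter);
      }
    }
    return rest;
  }
}
```

Because the comparison is by identity, an adapter cannot hand back a rewritten 
copy of a filter, even one that is equal by equals().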

> 2) It does not look to scale well: tomorrow you'll want
> ProjectableCursorableIndexScanThenAccessTable once you realize some of the
> filtering can be checked against just the index contents. E.g. range scan
> of the key, then some fuzzy filter logic on the key itself, then table
> access for the rest with some more filters.

I agree, it doesn't scale well. I am following the mantra that simple things 
should be simple.

Pushing down projects and filters is by far the most common optimization.

If someone builds an adapter to a particular data source and their users are 
telling them that they need (say) pushdown aggregation, they have already 
validated that Calcite is a useful technology and they will not mind 
re-implementing filters and projects using rules.

> 3) I am not sure if those kind of interfaces would solve more complex
> cases: complex RexNodes (e.g. RexOver over RexOver over Rex..).
> Ideally, filters should be split to the ones that "can be implemented at
> storage and the ones that can not". I guess this has to be in some rule and
> "CursorableTable" is just a tiny bit. The logic to split the filters is not
> yet automagically solved by Calcite.

Good point. If the filters can be decomposed, then Calcite should do it before 
passing the candidate filters to the adapter. The same goes for other 
transformations such as constant folding. 

That said, it should be OK to pass complex filters to the adapter. The adapter 
can just say no if it doesn't understand the filter.
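
As an illustration of the decomposition step, here is a rough sketch that 
splits a conjunction into its individual conjuncts before they are offered to 
an adapter. FilterExpr, AndExpr, LeafExpr and ConjunctSplitter are 
hypothetical stand-ins for RexNode and friends, not Calcite classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical expression model standing in for RexNode.
abstract class FilterExpr {}

final class AndExpr extends FilterExpr {
  final FilterExpr left;
  final FilterExpr right;

  AndExpr(FilterExpr left, FilterExpr right) {
    this.left = left;
    this.right = right;
  }
}

final class LeafExpr extends FilterExpr {
  final String text;

  LeafExpr(String text) {
    this.text = text;
  }
}

final class ConjunctSplitter {
  /** Flattens nested ANDs into a left-to-right list of non-AND expressions. */
  static List<FilterExpr> split(FilterExpr e) {
    List<FilterExpr> out = new ArrayList<FilterExpr>();
    collect(e, out);
    return out;
  }

  private static void collect(FilterExpr e, List<FilterExpr> out) {
    if (e instanceof AndExpr) {
      AndExpr and = (AndExpr) e;
      collect(and.left, out);
      collect(and.right, out);
    } else {
      // Anything that is not an AND is offered to the adapter as one filter,
      // however complex it is; the adapter is free to decline it.
      out.add(e);
    }
  }
}
```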

> 
>> * CursorableTable is an optional interface that can be implemented by
>> any Table that allows you to get the results directly, without code
>> generation, and without creating a TableAccessRel or similar.
> 
> How is that better than AbstractQueryableTable?
> There is no need to do code generation if you need just a table scan.
> There is no need to create separate TableAccessRel either.
> 
> Here's the example:
> table definition:
> https://github.com/vlsi/optiq-mat-plugin/blob/master/mat-plugin/src/com/github/vlsi/mat/optiq/HeapSchema.java#L40
> table implementation:
> https://github.com/vlsi/optiq-mat-plugin/blob/master/mat-plugin/src/com/github/vlsi/mat/optiq/InstanceByClassTable.java#L27

Yes, I realized that later. In the prototype I am developing, I am considering 
going back to AbstractQueryableTable. One thing against that approach is that 
Queryable is a big and confusing interface, even with the help of 
AbstractQueryableTable.

I further thought of having the adapter writer override the where and select 
methods of the Queryable, but that requires way too many lambda-style classes, 
and without flagging interfaces it is not clear to Calcite at planning time 
whether a table is capable of implementing filter and project.

> 
>> It returns a Cursor, which is similar to a JDBC ResultSet but much
>> simpler to implement,
> 
> We might just want "cursor convention", however it is a separate issue
> (e.g. getElementType -> Cursor.class | Object[].class |
> CustomDefinedPOJO.class)
> I would not like "cursorable" to be a feature of a "Cursorable" table.
> This will confuse users since "different kind of tables will have subtle
> differences and it would be impossible to pick the right one".

Yes. I came to the same conclusion. I am now thinking of the result type being 
Enumerable<Object[]> or Queryable<Object[]>, which is basically what we have 
today.

>> and is more efficient than an Iterator or
>> Enumerable.
> 
> Can you please elaborate why Cursor would be so much better?
> I see nothing specific to Cursor that would make it more efficient.

When I looked at net.hydromatic.avatica.Cursor, I saw that it was not so easy 
to implement. And, since you need to pass values via a Getter, every value has 
to be boxed. I sketched out a simpler Cursor interface:

interface Cursor2 {
  // Every getter takes a column ordinal.
  boolean getBoolean(int column);
  byte getByte(int column);
  short getShort(int column);
  char getChar(int column);
  int getInt(int column);
  long getLong(int column);
  float getFloat(int column);
  double getDouble(int column);
  Object getObject(int column);
  boolean isNull(int column);
  boolean moveNext();
  void close();
}

You can implement Cursor2 without doing any per-row memory allocation. In my 
experience that is really important for high-performance data processing.

Note also that isNull(int) takes a column ordinal, whereas Cursor.wasNull() 
requires the cursor implementation to remember which column was referenced most 
recently.
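
To make the allocation point concrete, here is a rough sketch of a cursor 
over primitive arrays. MiniCursor is a trimmed-down Cursor2 with only the 
methods the example needs, and ArrayCursor with its two-column layout is made 
up for illustration:

```java
// Trimmed-down Cursor2: only the methods this example uses.
interface MiniCursor {
  int getInt(int column);
  long getLong(int column);
  boolean isNull(int column);
  boolean moveNext();
  void close();
}

// Hypothetical cursor over two columns held in primitive arrays:
// column 0 is INTEGER NOT NULL, column 1 is a nullable BIGINT.
final class ArrayCursor implements MiniCursor {
  private final int[] ids;
  private final long[] amounts;
  private final boolean[] amountIsNull;
  private int row = -1; // positioned before the first row

  ArrayCursor(int[] ids, long[] amounts, boolean[] amountIsNull) {
    this.ids = ids;
    this.amounts = amounts;
    this.amountIsNull = amountIsNull;
  }

  @Override public int getInt(int column) {
    if (column != 0) {
      throw new IllegalArgumentException("column " + column + " is not INTEGER");
    }
    return ids[row]; // primitive read, no boxing
  }

  @Override public long getLong(int column) {
    if (column != 1) {
      throw new IllegalArgumentException("column " + column + " is not BIGINT");
    }
    return amountIsNull[row] ? 0L : amounts[row];
  }

  @Override public boolean isNull(int column) {
    // Stateless: the ordinal says which column is meant, so the cursor does
    // not have to remember the most recently read column.
    return column == 1 && amountIsNull[row];
  }

  @Override public boolean moveNext() {
    if (row < ids.length) {
      row++; // advancing is an index increment; nothing is allocated per row
    }
    return row < ids.length;
  }

  @Override public void close() {}
}
```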

> The downside of Cursor is the requirement to convert the values to suit
> each and every getter (30+ methods in Cursor$Accessor interface).
> For instance, the data might be stored internally as "int", and Calcite
> will use getString for some reason (who stops that?)
> This might be not that efficient and it even might surprise the developer
> who implements the Cursor.

The contract in Cursor2 would be that if the table declares a column as an 
INTEGER, then Calcite would only call getInt() (possibly following up with a 
call to isNull() if the column is nullable and getInt() returned 0).

And similarly for other types.

So the developer who implements Cursor2 would only have to implement one 
getter per column: the one that matches the column's declared type.
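
From the caller's side, the contract might look roughly like the sketch 
below. IntCursor is a trimmed-down Cursor2 and IntColumnReader is a 
hypothetical consumer; the sketch assumes that a null INTEGER value comes 
back from getInt() as 0:

```java
// Trimmed-down Cursor2, limited to what a nullable INTEGER column needs.
interface IntCursor {
  int getInt(int column);
  boolean isNull(int column);
  boolean moveNext();
}

// Hypothetical consumer, standing in for the code Calcite would run.
final class IntColumnReader {
  /** Sums a nullable INTEGER column, skipping null values. */
  static long sum(IntCursor cursor, int column) {
    long total = 0;
    while (cursor.moveNext()) {
      int value = cursor.getInt(column);
      // Assumption: nulls are returned as 0, so only a 0 needs the extra
      // isNull() call; any other value is known to be non-null.
      if (value == 0 && cursor.isNull(column)) {
        continue;
      }
      total += value;
    }
    return total;
  }
}
```

For a column declared NOT NULL, the isNull() call would be dropped entirely.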

> I bet no one would be able to implement Date/Timestamp kind of fields from
> the first and even the second try (especially getting all the getters
> right).

Yeah, JDBC is excruciating to implement for datetime values. In Cursor2, time 
and date values would be represented by int, and timestamp by long, 
milliseconds since the zoneless epoch.
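
For illustration, an adapter could produce those values along the lines of 
the sketch below. ZonelessEncoding is a hypothetical helper, and the units 
chosen for DATE (days since 1970-01-01) and TIME (milliseconds since 
midnight) are my assumption for the example, not something fixed above:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Hypothetical helper: converts zoneless date/time values to primitives.
final class ZonelessEncoding {
  /** DATE as int. Assumed unit: days since 1970-01-01. */
  static int encodeDate(LocalDate date) {
    return Math.toIntExact(date.toEpochDay());
  }

  /** TIME as int. Assumed unit: milliseconds since midnight. */
  static int encodeTime(LocalTime time) {
    return (int) (time.toNanoOfDay() / 1_000_000L);
  }

  /** TIMESTAMP as long: milliseconds since the zoneless epoch. */
  static long encodeTimestamp(LocalDateTime timestamp) {
    // "Zoneless": the wall-clock value is read as if it were UTC, so no
    // time-zone or daylight-saving adjustment is applied.
    return timestamp.toInstant(ZoneOffset.UTC).toEpochMilli();
  }
}
```

With this encoding, each date, time or timestamp column needs only one getter 
(getInt or getLong).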

That said, I decided to represent rows as Object[] for now. I might come back 
to Cursor2 if we need an efficient interface to other data sources.

Julian
