Hi all,

I'd like to propose an API change to the data source v2.

One design goal of data source v2 is API type safety. The FileFormat API is
a bad example: it asks the implementation to return InternalRow even when the
data is actually a ColumnarBatch. In data source v2 we add a type parameter to
DataReader/WriterFactory and DataReader/Writer, so that data sources
supporting columnar scan can return ColumnarBatch at the API level.
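
For reference, here is a much-simplified sketch of the shape of that typed API
(stand-in traits, not the exact signatures):

// Simplified stand-ins for the current v2 reader API, showing the type
// parameter that carries the output format in the types.
trait DataReader[T] {
  def next(): Boolean
  def get(): T
}

trait DataReaderFactory[T] {
  def createDataReader(): DataReader[T]
}

// A columnar source can then declare its output type at the API level, e.g.
// class MyColumnarFactory extends DataReaderFactory[ColumnarBatch] { ... }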

However, we ran into some problems when migrating the streaming and file-based
data sources to data source v2.

For the streaming side, we need a variant of DataReader/WriterFactory to
add streaming-specific concepts like epoch id and offset. For details please
see ContinuousDataReaderFactory and
https://docs.google.com/document/d/1PJYfb68s2AG7joRWbhrgpEWhrsPqbhyRwUVl9V1wPOE/edit#

But this conflicts with the special format mixin traits like
SupportsScanColumnarBatch. We have to make the streaming variant of
DataReader/WriterFactory extend the original DataReader/WriterFactory and do
type casts at runtime, which is unnecessary and violates type safety.
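
To make the conflict concrete, here is a rough sketch of the kind of cast
involved, using the simplified stand-in traits above (the names and the offset
type are illustrative, not the exact streaming API):

// The streaming variant has to extend the generic factory so it can be
// returned from the format mixin traits (illustrative, simplified).
trait ContinuousDataReaderFactory[T] extends DataReaderFactory[T] {
  def createDataReaderWithOffset(offset: Long): DataReader[T] // offset type simplified
}

// The streaming engine only gets DataReaderFactory[T] back from the mixin
// traits, so it must cast at runtime to reach the streaming-specific method:
def startPartitionReader[T](factory: DataReaderFactory[T], offset: Long): DataReader[T] =
  factory.asInstanceOf[ContinuousDataReaderFactory[T]].createDataReaderWithOffset(offset)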

For the file-based data sources, we have a problem with code duplication.
Let's take the ORC data source as an example. To support both unsafe row and
columnar batch scans, we need something like

// A lot of parameters to carry to the executor side
class OrcUnsafeRowFactory(...) extends DataReaderFactory[UnsafeRow] {
  def createDataReader ...
}

class OrcColumnarBatchFactory(...) extends DataReaderFactory[ColumnarBatch] {
  def createDataReader ...
}

class OrcDataSourceReader extends DataSourceReader {
  // logic to prepare the parameters and create factories
  def createUnsafeRowFactories = ...

  // logic to prepare the parameters and create factories
  def createColumnarBatchFactories = ...
}

You can see that we have duplicated logic for preparing parameters and
defining the factory.

Here I propose to remove all the special format mixin traits and change the
factory interface to

public enum DataFormat {
  ROW,
  INTERNAL_ROW,
  UNSAFE_ROW,
  COLUMNAR_BATCH
}

interface DataReaderFactory {
  DataFormat dataFormat();

  default DataReader<Row> createRowDataReader() {
    throw new IllegalStateException();
  }

  default DataReader<InternalRow> createInternalRowDataReader() {
    throw new IllegalStateException();
  }

  default DataReader<UnsafeRow> createUnsafeRowDataReader() {
    throw new IllegalStateException();
  }

  default DataReader<ColumnarBatch> createColumnarBatchDataReader() {
    throw new IllegalStateException();
  }
}

Spark will look at the dataFormat and decide which of the createXxxDataReader
methods to call.
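
For illustration, the dispatch on the Spark side could look roughly like this
(a sketch against the proposed interface above, not an actual implementation):

// Spark picks the create method based on the format the factory reports.
def createReader(factory: DataReaderFactory): DataReader[_] = factory.dataFormat() match {
  case DataFormat.ROW            => factory.createRowDataReader()
  case DataFormat.INTERNAL_ROW   => factory.createInternalRowDataReader()
  case DataFormat.UNSAFE_ROW     => factory.createUnsafeRowDataReader()
  case DataFormat.COLUMNAR_BATCH => factory.createColumnarBatchDataReader()
}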

Now we don't have the problem on the streaming side, as the special format
mixin traits go away. And the ORC data source can also be simplified to

class OrcReaderFactory(...) extends DataReaderFactory {
  def createUnsafeRowDataReader ...

  def createColumnarBatchDataReader ...
}

class OrcDataSourceReader extends DataSourceReader {
  // logic to prepare the parameters and create factories
  def createReadFactories = ...
}

There is also a potential benefit of supporting hybrid storage data sources,
which may keep real-time data in row format and historical data in columnar
format. Such a source can then have some DataReaderFactory instances output
InternalRow and others output ColumnarBatch.

Thoughts?
