Hi,
I am trying to understand the state of datasource v2, and I'm a bit lost.
On one hand, it is supposed to be a more flexible approach, as described, for
example, here:
https://www.slideshare.net/databricks/apache-spark-data-source-v2-with-wenchen-fan-and-gengliang-wang
On the other hand, it appears that both the Parquet and ORC file readers still
do not use the v2 interface. There's an umbrella issue to address that:
https://issues.apache.org/jira/browse/SPARK-23507
but it does not have any sub-issue for Parquet, and the issue for
ORC:
https://issues.apache.org/jira/browse/SPARK-23817
includes this text: "Not supported( due to limitation of data source V2):
(1) Read multiple file path (2) Read bucketed file.".
Is there any up-to-date information on whether datasource v2 will indeed
become the primary datasource, whether the Parquet reader
will be converted to v2, and whether the limitations above will be fixed?
Thanks in advance,
--
Vladimir Prus
http://vladimirprus.com