It changed from 2.0 to 2.1 to 2.2 ...
Not by much, but it still changed. I do agree that this is still manageable
> On 3. May 2018, at 16:46, Wenchen Fan wrote:
>
> Hi Jakub,
>
> Yea I think data source would be the most elegant way to solve your problem.
> Unfortunately
Hi Wenchen,
Thanks for your reply! We will have a look at the FileFormat.
Actually, looking at the V2 APIs I still don’t see how you can take existing
datasources (like Parquet and HBase) and wrap them up in another one.
Imagine you would like to load some files from parquet and load some tables
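One way such wrapping could look with the stable V1 API is sketched below. This is a hedged illustration, not NXCALS code: the class name `DefaultSource`, the option keys (`parquetPath`, `hbaseTable`), and the HBase connector format string are all assumptions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical V1 data source that hides a Parquet path and an HBase table
// behind a single format name. Option keys are illustrative.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val spark = sqlContext.sparkSession
    // Delegate to the built-in Parquet reader for the file part...
    val fromParquet = spark.read.parquet(parameters("parquetPath"))
    // ...and to an HBase connector for the table part (connectors and their
    // options vary; this format string is a placeholder).
    val fromHBase = spark.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", parameters("hbaseTable"))
      .load()
    val combined = fromParquet.unionByName(fromHBase)

    val outer = sqlContext
    new BaseRelation with TableScan {
      override def sqlContext: SQLContext = outer
      override def schema: StructType = combined.schema
      override def buildScan(): RDD[Row] = combined.rdd
    }
  }
}
```

Note that a V1 TableScan hands data back as RDD[Row], losing the columnar fast paths, which is part of why V1 is described elsewhere in this thread as unsuitable for a high-performance data source.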
Hi Jakub,
Yea I think data source would be the most elegant way to solve your
problem. Unfortunately in Spark 2.3 the only stable data source API is data
source v1, which can't be used to implement a high-performance data source.
Data source v2 is still a preview version in Spark 2.3 and may change
Hello,
Thanks a lot for your answers.
We normally look for stability, so relying on internal APIs that are subject
to change without warning is somewhat questionable.
As to the approach of putting this functionality on top of Spark instead of a
datasource - this works but poses a
A note on the internal API: it used to change with each release, which was
quite annoying because other data sources (Avro, HadoopOffice etc.) had to
keep up with the changes. In the end it is an internal API and thus comes with
no stability guarantees. If you want to have something stable you have to
At some point Spark started using, for the formats shipped with Spark (e.g.
Parquet), an internal API that is not the data source API. You can look at how
this is implemented for Parquet and co. in the Spark source code.
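For reference, that internal hook is the FileFormat trait in Spark's execution package, which ParquetFileFormat itself implements. The skeleton below is a sketch against Spark 2.x; since the trait is internal, these signatures may differ between releases, and the `???` bodies are placeholders.

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

// Hypothetical skeleton of a file-based source built on Spark's *internal*
// FileFormat trait (the same mechanism the built-in Parquet format uses).
class MyFileFormat extends FileFormat {
  // Derive the schema by inspecting the given files.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = ???

  // Configure the Hadoop write job and return a writer factory.
  override def prepareWrite(
      sparkSession: SparkSession,
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = ???
}
```

The trade-off discussed in this thread applies: this path is faster than V1 but, being internal, can break with any Spark release.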
Maybe this is the issue you are facing?
Have you tried to put your
Hello,
At CERN we are developing a Big Data system called NXCALS that uses Spark as
its extraction API.
We have implemented a custom datasource that wraps 2 existing ones
(Parquet and HBase) in order to hide the implementation details (location of
the Parquet files, HBase tables, etc.) and to