Re: Custom datasource as a wrapper for existing ones?

2018-05-03 Thread Jörn Franke
It changed from 2.0 to 2.1 to 2.2 ... not much, but it still changed. I somewhat agree that this is still manageable

Re: Custom datasource as a wrapper for existing ones?

2018-05-03 Thread Jakub Wozniak
Hi Wenchen, Thanks for your reply! We will have a look at the FileFormat. Actually, looking at the V2 APIs, I still don’t see how you can take existing datasources (like Parquet + HBase) and wrap them up in another one. Imagine you would like to load some files from parquet and load some tables
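
For illustration, a minimal sketch of such a wrapper against the stable V1 API (RelationProvider plus TableScan), delegating to the built-in Parquet reader; the package, class names and option handling below are hypothetical, not NXCALS code, and an HBase side would plug in the same way through its own connector:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical wrapper datasource; names are illustrative only.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' option is required"))
    new WrappingRelation(sqlContext, path)
  }
}

class WrappingRelation(val sqlContext: SQLContext, path: String)
    extends BaseRelation with TableScan {
  // Delegate to the built-in Parquet reader; an HBase relation could be
  // unioned in here via its connector before exposing the result.
  private lazy val delegate = sqlContext.sparkSession.read.parquet(path)

  override def schema: StructType = delegate.schema
  override def buildScan(): RDD[Row] = delegate.rdd
}

Loaded with spark.read.format("com.example.wrapper").option("path", "/some/dir").load(), this hides the storage layout, but going through RDD[Row] gives up the pushdown optimizations a native source gets, which is the V1 performance limitation mentioned below.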

Re: Custom datasource as a wrapper for existing ones?

2018-05-03 Thread Wenchen Fan
Hi Jakub, Yea I think a data source would be the most elegant way to solve your problem. Unfortunately, in Spark 2.3 the only stable data source API is data source v1, which can't be used to implement a high-performance data source. Data source v2 is still a preview version in Spark 2.3 and may change
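
For reference, the 2.3 preview of the v2 read path looked roughly like the sketch below (schema and partitioning hard-coded for brevity); these very interfaces were renamed in later releases (e.g. DataReaderFactory became InputPartition in 2.4), which is exactly the kind of change being discussed:

import java.util
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Toy v2 source against the Spark 2.3 preview API: two partitions of longs.
class DemoSourceV2 extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader = new DemoReader
}

class DemoReader extends DataSourceReader {
  override def readSchema(): StructType = StructType(Seq(StructField("id", LongType)))
  override def createDataReaderFactories(): util.List[DataReaderFactory[Row]] =
    util.Arrays.asList[DataReaderFactory[Row]](new DemoFactory(0, 5), new DemoFactory(5, 10))
}

class DemoFactory(start: Long, end: Long) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new DemoDataReader(start, end)
}

class DemoDataReader(start: Long, end: Long) extends DataReader[Row] {
  private var current = start - 1
  override def next(): Boolean = { current += 1; current < end }
  override def get(): Row = Row(current)
  override def close(): Unit = ()
}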

Re: Custom datasource as a wrapper for existing ones?

2018-05-02 Thread Jakub Wozniak
Hello, Thanks a lot for your answers. We normally look for some stability, so the use of internal APIs that are subject to change without warning is somewhat questionable. As to the approach of putting this functionality on top of Spark instead of a datasource - this works but poses a
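
For completeness, the "on top of Spark" alternative can be a plain facade that hides the storage layout behind an ordinary method; a rough sketch with hypothetical paths, and an HBase format string and options that depend on the connector in use:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical facade layered on top of Spark instead of a custom datasource.
// Paths, the HBase format string and its options are illustrative placeholders.
object ExtractionApi {
  def loadSystemData(spark: SparkSession, system: String): DataFrame = {
    // Historical data archived as Parquet files.
    val archived = spark.read.parquet(s"/data/archive/$system")
    // Recent data served from HBase; format and options depend on the connector.
    val recent = spark.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", system)
      .load()
    // Assumes both sides expose the same column names and types.
    archived.unionByName(recent)
  }
}

This works, but callers now depend on a library API rather than on a format name they can pass to spark.read, which is presumably part of the trade-off being weighed here.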

Re: Custom datasource as a wrapper for existing ones?

2018-05-02 Thread Jörn Franke
Some note on the internal API - it used to change with each release, which was quite annoying because other data sources (Avro, HadoopOffice, etc.) had to keep up with the changes. In the end it is an internal API and thus is not guaranteed to be stable. If you want to have something stable you have to

Re: Custom datasource as a wrapper for existing ones?

2018-05-02 Thread Jörn Franke
Spark at some point in time used, for the formats shipped with Spark (e.g. Parquet), an internal API that is not the data source API. You can look at how this is implemented for Parquet and co. in the Spark source code. Maybe this is the issue you are facing? Have you tried to put your

Custom datasource as a wrapper for existing ones?

2018-05-02 Thread jwozniak
Hello, At CERN we are developing a Big Data system called NXCALS that uses Spark as its Extraction API. We have implemented a custom datasource that wraps 2 existing ones (Parquet and HBase) in order to hide the implementation details (location of the Parquet files, HBase tables, etc.) and to