Sure. My point was that Delta Lake is itself a 3rd party library, so there
is no way for Apache Spark to add such a method. Delta Lake has its own
user group, and this request is better raised there.

On Mon, Oct 5, 2020 at 9:54 PM Enrico Minack <m...@enrico.minack.dev> wrote:

> Though spark.read.<format> refers to "built-in" data sources, there is
> nothing that prevents 3rd party libraries from "extending" spark.read in
> Scala or Python. As users know the Spark way to read built-in data sources,
> it feels natural to hook 3rd party data sources into the same scheme, to
> give users a holistic and integrated feel.
>
> One Scala example (
> https://github.com/G-Research/spark-dgraph-connector#spark-dgraph-connector
> ):
>
> import uk.co.gresearch.spark.dgraph.connector._
> val triples = spark.read.dgraph.triples("localhost:9080")
>
> and in Python:
>
> from gresearch.spark.dgraph.connector import *
> triples = spark.read.dgraph.triples("localhost:9080")
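>
> For illustration, here is a rough Python sketch (not the connector's or
> Delta Lake's actual code) of how a library can attach such a shorthand to
> the reader at import time:
>
> from pyspark.sql.readwriter import DataFrameReader
>
> def _read_delta(reader, path):
>     # Delegate to the official, format-agnostic API.
>     return reader.format("delta").load(path)
>
> # Attach the shorthand; spark.read.delta(path) then reads
> # like the built-in spark.read.parquet(path).
> DataFrameReader.delta = _read_delta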
>
> I agree that 3rd parties should also support the official
> spark.read.format() and the new catalog approaches.
>
> Enrico
>
> On 05.10.2020 at 14:03, Jungtaek Lim wrote:
>
> Hi,
>
> "spark.read.<format>" is a "shorthand" for "built-in" data sources, not
> for external data sources. spark.read.format() is still an official way to
> use it. Delta Lake is not included in Apache Spark so that is indeed not
> possible for Spark to refer to.
>
> Starting from Spark 3.0, the concept of a "catalog" was introduced: you
> can simply refer to a table through the catalog (if the external data
> source provides a catalog implementation), with no need to specify the
> format explicitly (as the catalog knows about it).
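>
> For example (the catalog name and settings below are just an illustration;
> check the connector's documentation for the exact configuration):
>
> from pyspark.sql import SparkSession
>
> # Register an external catalog, e.g. the Cassandra connector's implementation:
> spark = SparkSession.builder \
>     .config("spark.sql.catalog.mycat",
>             "com.datastax.spark.connector.datasource.CassandraCatalog") \
>     .getOrCreate()
>
> # A table can then be referenced through the catalog, no format() needed:
> df = spark.table("mycat.my_keyspace.my_table")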
>
> This session explains the catalog and how the Cassandra connector
> leverages it. I see some external data sources starting to support
> catalogs, and within Spark itself there is an effort to support a catalog
> for JDBC.
>
> https://databricks.com/fr/session_na20/datasource-v2-and-cassandra-a-whole-new-world
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Mon, Oct 5, 2020 at 8:53 PM Moser, Michael <
> michael.mo...@siemens-healthineers.com> wrote:
>
>> Hi there,
>>
>>
>>
>> I’m just wondering if there are any plans to implement read/write
>> methods in DataFrameReader/DataFrameWriter for delta, similar to e.g.
>> parquet?
>>
>>
>>
>> For example, using PySpark, “spark.read.parquet” is available, but
>> “spark.read.delta” is not (same for write).
>>
>> In my opinion, “spark.read.delta” feels cleaner and more pythonic than
>> “spark.read.format(‘delta’).load()”, especially when more options come
>> into play, like “mode”.
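>>
>> For example (the shorthand below is hypothetical, mirroring what the
>> parquet reader and writer already offer):
>>
>> # what works today:
>> df = spark.read.format("delta").load("/data/events")
>> df.write.format("delta").mode("overwrite").save("/data/events")
>>
>> # what I have in mind (hypothetical):
>> df = spark.read.delta("/data/events")
>> df.write.delta("/data/events", mode="overwrite")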
>>
>>
>>
>> Can anyone explain the reasoning behind this? Is it due to the Java
>> nature of Spark?
>>
>> From a pythonic point of view, I could also imagine a single read/write
>> method, with the format as an arg and kwargs related to the different file
>> format options.
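>>
>> Something along these lines, just a sketch of the idea (read_any is not
>> an existing API):
>>
>> def read_any(spark, path, format="parquet", **options):
>>     # Single entry point: format as an argument, format-specific options as kwargs.
>>     return spark.read.format(format).options(**options).load(path)
>>
>> # e.g. read_any(spark, "/data/events", format="delta", versionAsOf="1")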
>>
>>
>>
>> Best,
>>
>> Michael
>>
>>
>>
>>
>>
>
