Re: [DISCUSS] SPIP: XML data source support

Hyukjin Kwon Wed, 19 Jul 2023 02:03:22 -0700

Here are the benefits of having it as a built-in source:

   - We can leverage the community to improve Spark XML (rather than
   within Databricks repositories).
   - We can share the same core for XML expressions (e.g., from_xml and
   to_xml, like from_csv, from_json, etc.).
   - It embraces a commonly used data source, just like the existing
   built-in data sources we have.
   - Users wouldn't have to set the jars or Maven coordinates. For now, if
   they have network problems, etc., the package is harder to use by
   default.
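
To illustrate the point about jars and Maven coordinates, here is a minimal
sketch of how the launch command differs. The helper name
`spark_submit_command`, the file `job.py`, and the package version in the
coordinates are assumptions for illustration only. Only the `--packages`
flag of spark-submit is taken as given.

```python
import shlex

# Assumed coordinates for the external package; the version is illustrative only.
SPARK_XML_COORDS = "com.databricks:spark-xml_2.12:0.16.0"


def spark_submit_command(app_path, packages=()):
    """Build a spark-submit argv list (hypothetical helper).

    Passing --packages makes Spark resolve the listed Maven coordinates
    at launch, which requires network access to a repository.
    """
    argv = ["spark-submit"]
    if packages:
        argv += ["--packages", ",".join(packages)]
    argv.append(app_path)
    return argv


# External library: the coordinates must be supplied explicitly.
external = spark_submit_command("job.py", [SPARK_XML_COORDS])

# Built-in source: nothing extra to resolve or download.
builtin = spark_submit_command("job.py")

print(shlex.join(external))
print(shlex.join(builtin))
```

With a built-in source, the second form would be enough, so a job would not
depend on resolving the package at launch.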

XML is arguably more widely used than CSV, which is already a built-in source; see
e.g., https://insights.stackoverflow.com/trends?tags=xml%2Cjson%2Ccsv and
https://www.reddit.com/r/programming/comments/bak5qt/a_comparison_of_serialization_formats_csv_json/


On Wed, 19 Jul 2023 at 17:51, Martin Andersson <martin.anders...@kambi.com>
wrote:

> How much of an effort is it to use the spark-xml library today? What's the
> drawback to keeping this as an external library as-is?
>
> Best Regards, Martin
> ------------------------------
> *From:* Hyukjin Kwon <gurwls...@apache.org>
> *Sent:* Wednesday, July 19, 2023 01:27
> *To:* Sandip Agarwala <sandip.agarw...@databricks.com>
> *Cc:* dev@spark.apache.org <dev@spark.apache.org>
> *Subject:* Re: [DISCUSS] SPIP: XML data source support
>
> Yeah, I support this. XML is a pretty outdated format TBH, but it is still
> used in many legacy systems. For example, the Wikipedia dump is one case.
>
> Even when you look at stats for CSV vs XML vs JSON, some show that XML is
> more widely used than CSV.
>
> On Wed, Jul 19, 2023 at 12:58 AM Sandip Agarwala <
> sandip.agarw...@databricks.com> wrote:
>
> Dear Spark community,
>
> I would like to start a discussion on "XML data source support".
>
> XML is a widely used data format. An external spark-xml package (
> https://github.com/databricks/spark-xml) is available to read and write
> XML data in Spark. Making spark-xml built-in will provide a better user
> experience for Spark SQL and Structured Streaming. The proposal is to
> inline code from the spark-xml package.
> I am collaborating with Hyukjin Kwon, who is the original author of
> spark-xml, for this effort.
>
> SPIP link:
>
> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>
> JIRA:
> https://issues.apache.org/jira/browse/SPARK-44265
>
> Looking forward to your feedback.
> Thanks, Sandip
>
>
