Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Jungtaek Lim
If I understand correctly, you'll just want to package your implementation
with your build tool of choice (Maven, sbt, etc.), have it register your
dialect implementation with JdbcDialects, and ship the jar for end users to
load. That takes care of everything: they can use VerticaDialect without
applying a custom patch to Spark. That's how third-party plugins work.
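
For example, a minimal sketch (the dialect name here is illustrative;
JdbcDialects.registerDialect and JdbcDialect are the existing hooks in
org.apache.spark.sql.jdbc):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical dialect shipped in the third-party jar; only canHandle is
// required, type overrides are added as needed.
object MyVerticaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")
}

// Called once from user code (e.g. in spark-shell or the app's init) after
// the jar is on the classpath via --jars; no patch to Spark itself is needed.
JdbcDialects.registerDialect(MyVerticaDialect)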

On Thu, Dec 12, 2019 at 12:58 AM Bryan Herger 
wrote:

> It kind of already is.  I was able to build the VerticaDialect as a sort
> of plugin as follows:
>
>
>
> Check out apache/spark tree
>
> Copy in VerticaDialect.scala
>
> Build with “mvn -DskipTests compile”
>
> package the compiled class plus companion object into a JAR
>
> Copy JAR to jars folder in Spark binary installation (optional, probably
> can set path in an extra --jars argument instead)
>
>
>
> Then run the following test in spark-shell after creating Vertica table
> and sample data:
>
>
>
>
> org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(org.apache.spark.sql.jdbc.VerticaDialect)
>
> val jdbcDF = spark.read.format("jdbc").option("url",
> "jdbc:vertica://hpbox:5433/docker").option("dbtable",
> "test_alltypes").option("user", "dbadmin").option("password",
> "Vertica1!").load()
>
> jdbcDF.show()
>
> jdbcDF.write.mode("append").format("jdbc").option("url",
> "jdbc:vertica://hpbox:5433/docker").option("dbtable",
> "test_alltypes").option("user", "dbadmin").option("password",
> "Vertica1!").save()
>
> JdbcDialects.unregisterDialect(org.apache.spark.sql.jdbc.VerticaDialect)
>
>
>
> If it would be preferable to write documentation describing the above, I
> can do that instead.  The hard part is checking out the matching
> apache/spark tree then copying to the Spark cluster – I can install master
> branch and latest binary and apply patches since I have root on all my test
> boxes, but customers may not be able to.  Still, this provides another
> route to support new JDBC dialects.
>
>
>
> BryanH
>
>
>
> *From:* Wenchen Fan [mailto:cloud0...@gmail.com]
> *Sent:* Wednesday, December 11, 2019 10:48 AM
> *To:* Xiao Li 
> *Cc:* Bryan Herger ; Sean Owen <
> sro...@gmail.com>; dev@spark.apache.org
> *Subject:* Re: I would like to add JDBCDialect to support Vertica database
>
>
>
> Can we make the JDBCDialect a public API that users can plug in? It looks
> like an endless job to make sure the Spark JDBC source supports all databases.
>
>
>
> On Wed, Dec 11, 2019 at 11:41 PM Xiao Li  wrote:
>
> You can follow how we test the other JDBC dialects. All JDBC dialects
> require the docker integration tests.
> https://github.com/apache/spark/tree/master/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc
>
>
>
>
>
> On Wed, Dec 11, 2019 at 7:33 AM Bryan Herger 
> wrote:
>
> Hi, to answer both questions raised:
>
>
>
> Though Vertica is derived from Postgres, Vertica does not recognize type
> names TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently
> enough to cause issues.  The major changes are to use type names and date
> format supported by Vertica.
>
>
>
> For testing, I have a SQL script plus Scala and PySpark scripts, but these
> require a Vertica database to connect, so automated testing on a build
> server wouldn’t work.  It’s possible to include my test scripts and
> directions to run manually, but not sure where in the repo that would go.
> If automated testing is required, I can ask our engineers whether there
> exists something like a mockito that could be included.
>
>
>
> Thanks, Bryan H
>
>
>
> *From:* Xiao Li [mailto:lix...@databricks.com]
> *Sent:* Wednesday, December 11, 2019 10:13 AM
> *To:* Sean Owen 
> *Cc:* Bryan Herger ; dev@spark.apache.org
> *Subject:* Re: I would like to add JDBCDialect to support Vertica database
>
>
>
> How can the dev community test it?
>
>
>
> Xiao
>
>
>
> On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:
>
> It's probably OK, IMHO. The overhead of another dialect is small. Are
> there differences that require a new dialect? I assume so and might
> just be useful to summarize them if you open a PR.
>
> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>  wrote:
> >
> > Hi, I am a Vertica support engineer, and we have open support requests
> around NULL values and SQL type conversion with DataFrame read/write over
> JDBC when connecting to a Vertica database.  The stack traces point to
> issues with the generic JDBCDialect in Spark-SQL.
> >
> > I saw that other vendors (Teradata, DB2...) have contributed a
> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
> for Vertica.
> >
> > The changeset is on my fork of apache/spark at
> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
> >
> > I have tested this against Vertica 9.3 and found that this changeset
> addresses both issues reported to us (issue with NULL values - setNull() -
> for valid java.sql.Types, and String to VARCHAR conversion)
> >
> > Is this an acceptable change?

Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Takeshi Yamamuro
Not sure, too.
Can't you use Spark Packages for your scenario?
https://spark-packages.org/


On Thu, Dec 12, 2019 at 9:46 AM Hyukjin Kwon  wrote:

> I am not so sure about it either. I think it is enough to expose JDBCDialect
> as an API (which it already seems to be).
> It brings some overhead to dev (e.g., testing and reviewing PRs related to
> another third party).
> Without a strong reason, such third-party integration is better off as a
> third-party library.
>
> 2019년 12월 12일 (목) 오전 12:58, Bryan Herger 님이
> 작성:
>
>> It kind of already is.  I was able to build the VerticaDialect as a sort
>> of plugin as follows:
>>
>>
>>
>> Check out apache/spark tree
>>
>> Copy in VerticaDialect.scala
>>
>> Build with “mvn -DskipTests compile”
>>
>> package the compiled class plus companion object into a JAR
>>
>> Copy JAR to jars folder in Spark binary installation (optional, probably
>> can set path in an extra --jars argument instead)
>>
>>
>>
>> Then run the following test in spark-shell after creating Vertica table
>> and sample data:
>>
>>
>>
>>
>> org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(org.apache.spark.sql.jdbc.VerticaDialect)
>>
>> val jdbcDF = spark.read.format("jdbc").option("url",
>> "jdbc:vertica://hpbox:5433/docker").option("dbtable",
>> "test_alltypes").option("user", "dbadmin").option("password",
>> "Vertica1!").load()
>>
>> jdbcDF.show()
>>
>> jdbcDF.write.mode("append").format("jdbc").option("url",
>> "jdbc:vertica://hpbox:5433/docker").option("dbtable",
>> "test_alltypes").option("user", "dbadmin").option("password",
>> "Vertica1!").save()
>>
>> JdbcDialects.unregisterDialect(org.apache.spark.sql.jdbc.VerticaDialect)
>>
>>
>>
>> If it would be preferable to write documentation describing the above, I
>> can do that instead.  The hard part is checking out the matching
>> apache/spark tree then copying to the Spark cluster – I can install master
>> branch and latest binary and apply patches since I have root on all my test
>> boxes, but customers may not be able to.  Still, this provides another
>> route to support new JDBC dialects.
>>
>>
>>
>> BryanH
>>
>>
>>
>> *From:* Wenchen Fan [mailto:cloud0...@gmail.com]
>> *Sent:* Wednesday, December 11, 2019 10:48 AM
>> *To:* Xiao Li 
>> *Cc:* Bryan Herger ; Sean Owen <
>> sro...@gmail.com>; dev@spark.apache.org
>> *Subject:* Re: I would like to add JDBCDialect to support Vertica
>> database
>>
>>
>>
>> Can we make the JDBCDialect a public API that users can plug in? It looks
>> like an endless job to make sure the Spark JDBC source supports all databases.
>>
>>
>>
>> On Wed, Dec 11, 2019 at 11:41 PM Xiao Li  wrote:
>>
>> You can follow how we test the other JDBC dialects. All JDBC dialects
>> require the docker integration tests.
>> https://github.com/apache/spark/tree/master/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc
>>
>>
>>
>>
>>
>> On Wed, Dec 11, 2019 at 7:33 AM Bryan Herger 
>> wrote:
>>
>> Hi, to answer both questions raised:
>>
>>
>>
>> Though Vertica is derived from Postgres, Vertica does not recognize type
>> names TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently
>> enough to cause issues.  The major changes are to use type names and date
>> format supported by Vertica.
>>
>>
>>
>> For testing, I have a SQL script plus Scala and PySpark scripts, but
>> these require a Vertica database to connect, so automated testing on a
>> build server wouldn’t work.  It’s possible to include my test scripts and
>> directions to run manually, but not sure where in the repo that would go.
>> If automated testing is required, I can ask our engineers whether there
>> exists something like a mockito that could be included.
>>
>>
>>
>> Thanks, Bryan H
>>
>>
>>
>> *From:* Xiao Li [mailto:lix...@databricks.com]
>> *Sent:* Wednesday, December 11, 2019 10:13 AM
>> *To:* Sean Owen 
>> *Cc:* Bryan Herger ; dev@spark.apache.org
>> *Subject:* Re: I would like to add JDBCDialect to support Vertica
>> database
>>
>>
>>
>> How can the dev community test it?
>>
>>
>>
>> Xiao
>>
>>
>>
>> On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:
>>
>> It's probably OK, IMHO. The overhead of another dialect is small. Are
>> there differences that require a new dialect? I assume so and might
>> just be useful to summarize them if you open a PR.
>>
>> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>>  wrote:
>> >
>> > Hi, I am a Vertica support engineer, and we have open support requests
>> around NULL values and SQL type conversion with DataFrame read/write over
>> JDBC when connecting to a Vertica database.  The stack traces point to
>> issues with the generic JDBCDialect in Spark-SQL.
>> >
>> > I saw that other vendors (Teradata, DB2...) have contributed a
>> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
>> for Vertica.
>> >
>> > The changeset is on my fork of apache/spark at
>> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
>> >
>> > I have tested this against Vertica 

Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Hyukjin Kwon
I am not so sure about it either. I think it is enough to expose JDBCDialect
as an API (which it already seems to be).
It brings some overhead to dev (e.g., testing and reviewing PRs related to
another third party).
Without a strong reason, such third-party integration is better off as a
third-party library.

2019년 12월 12일 (목) 오전 12:58, Bryan Herger 님이 작성:

> It kind of already is.  I was able to build the VerticaDialect as a sort
> of plugin as follows:
>
>
>
> Check out apache/spark tree
>
> Copy in VerticaDialect.scala
>
> Build with “mvn -DskipTests compile”
>
> package the compiled class plus companion object into a JAR
>
> Copy JAR to jars folder in Spark binary installation (optional, probably
> can set path in an extra --jars argument instead)
>
>
>
> Then run the following test in spark-shell after creating Vertica table
> and sample data:
>
>
>
>
> org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(org.apache.spark.sql.jdbc.VerticaDialect)
>
> val jdbcDF = spark.read.format("jdbc").option("url",
> "jdbc:vertica://hpbox:5433/docker").option("dbtable",
> "test_alltypes").option("user", "dbadmin").option("password",
> "Vertica1!").load()
>
> jdbcDF.show()
>
> jdbcDF.write.mode("append").format("jdbc").option("url",
> "jdbc:vertica://hpbox:5433/docker").option("dbtable",
> "test_alltypes").option("user", "dbadmin").option("password",
> "Vertica1!").save()
>
> JdbcDialects.unregisterDialect(org.apache.spark.sql.jdbc.VerticaDialect)
>
>
>
> If it would be preferable to write documentation describing the above, I
> can do that instead.  The hard part is checking out the matching
> apache/spark tree then copying to the Spark cluster – I can install master
> branch and latest binary and apply patches since I have root on all my test
> boxes, but customers may not be able to.  Still, this provides another
> route to support new JDBC dialects.
>
>
>
> BryanH
>
>
>
> *From:* Wenchen Fan [mailto:cloud0...@gmail.com]
> *Sent:* Wednesday, December 11, 2019 10:48 AM
> *To:* Xiao Li 
> *Cc:* Bryan Herger ; Sean Owen <
> sro...@gmail.com>; dev@spark.apache.org
> *Subject:* Re: I would like to add JDBCDialect to support Vertica database
>
>
>
> Can we make the JDBCDialect a public API that users can plug in? It looks
> like an endless job to make sure the Spark JDBC source supports all databases.
>
>
>
> On Wed, Dec 11, 2019 at 11:41 PM Xiao Li  wrote:
>
> You can follow how we test the other JDBC dialects. All JDBC dialects
> require the docker integration tests.
> https://github.com/apache/spark/tree/master/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc
>
>
>
>
>
> On Wed, Dec 11, 2019 at 7:33 AM Bryan Herger 
> wrote:
>
> Hi, to answer both questions raised:
>
>
>
> Though Vertica is derived from Postgres, Vertica does not recognize type
> names TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently
> enough to cause issues.  The major changes are to use type names and date
> format supported by Vertica.
>
>
>
> For testing, I have a SQL script plus Scala and PySpark scripts, but these
> require a Vertica database to connect, so automated testing on a build
> server wouldn’t work.  It’s possible to include my test scripts and
> directions to run manually, but not sure where in the repo that would go.
> If automated testing is required, I can ask our engineers whether there
> exists something like a mockito that could be included.
>
>
>
> Thanks, Bryan H
>
>
>
> *From:* Xiao Li [mailto:lix...@databricks.com]
> *Sent:* Wednesday, December 11, 2019 10:13 AM
> *To:* Sean Owen 
> *Cc:* Bryan Herger ; dev@spark.apache.org
> *Subject:* Re: I would like to add JDBCDialect to support Vertica database
>
>
>
> How can the dev community test it?
>
>
>
> Xiao
>
>
>
> On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:
>
> It's probably OK, IMHO. The overhead of another dialect is small. Are
> there differences that require a new dialect? I assume so and might
> just be useful to summarize them if you open a PR.
>
> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>  wrote:
> >
> > Hi, I am a Vertica support engineer, and we have open support requests
> around NULL values and SQL type conversion with DataFrame read/write over
> JDBC when connecting to a Vertica database.  The stack traces point to
> issues with the generic JDBCDialect in Spark-SQL.
> >
> > I saw that other vendors (Teradata, DB2...) have contributed a
> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
> for Vertica.
> >
> > The changeset is on my fork of apache/spark at
> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
> >
> > I have tested this against Vertica 9.3 and found that this changeset
> addresses both issues reported to us (issue with NULL values - setNull() -
> for valid java.sql.Types, and String to VARCHAR conversion)
> >
> > Is this an acceptable change?  If so, how should I go about submitting a
> pull request?
> >
> > Thanks, Bryan Herger
> > Vertica 

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Jungtaek Lim
> Is this something that would be exposed/relevant to the Python API? Or is
this just for people implementing their own Spark data source?

It's the latter, and it also helps simplify the built-in data sources as well
(I found the need while working on
https://github.com/apache/spark/pull/26845).
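
To make the intent concrete, here is a rough, self-contained sketch of the
pattern (the trait and class names are illustrative, not Spark's actual
DataWriter API): cleanup lives only in close(), and the caller guarantees it
runs exactly once via try/finally.

import java.io.Closeable

trait SketchDataWriter[T] extends Closeable {
  def write(record: T): Unit
  def commit(): Unit
  def abort(): Unit
}

class FileBackedWriter(path: String) extends SketchDataWriter[String] {
  private val out = new java.io.PrintWriter(path)
  private var closed = false
  override def write(record: String): Unit = out.println(record)
  override def commit(): Unit = out.flush()    // no resource cleanup here
  override def abort(): Unit = ()              // nor here
  override def close(): Unit = if (!closed) { closed = true; out.close() }
}

// Caller side: abort() on failure, close() always, exactly once.
def runTask(writer: SketchDataWriter[String], records: Iterator[String]): Unit = {
  try {
    records.foreach(writer.write)
    writer.commit()
  } catch {
    case t: Throwable => writer.abort(); throw t
  } finally {
    writer.close()
  }
}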

On Thu, Dec 12, 2019 at 3:53 AM Nicholas Chammas 
wrote:

> Is this something that would be exposed/relevant to the Python API? Or is
> this just for people implementing their own Spark data source?
>
> On Wed, Dec 11, 2019 at 12:35 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Hi devs,
>>
>> I'd like to propose to add close() on DataWriter explicitly, which is the
>> place for resource cleanup.
>>
>> The rationale for the proposal comes from the lifecycle of DataWriter. If
>> the scaladoc of DataWriter is correct, the lifecycle of a DataWriter
>> instance ends at either commit() or abort(). That leads data source
>> implementors to feel they can place resource cleanup on either side, but
>> abort() can be called when commit() fails, so they have to ensure they
>> don't do double cleanup if cleanup is not idempotent.
>>
>> I've checked some callers to see whether they can apply
>> "try-catch-finally" to ensure close() is called at the end of the
>> DataWriter lifecycle, and it looks like they can, but I might be missing
>> something.
>>
>> What do you think? It would be a backward-incompatible change, but given
>> that the interface is marked as Evolving and we're making
>> backward-incompatible changes in Spark 3.0 anyway, I feel it may not matter.
>>
>> Would love to hear your thoughts.
>>
>> Thanks in advance,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>>


Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Jungtaek Lim
Nice, thanks for the answer! I'll craft a PR soon. Thanks again.

On Thu, Dec 12, 2019 at 3:32 AM Ryan Blue  wrote:

> Sounds good to me, too.
>
> On Wed, Dec 11, 2019 at 1:18 AM Jungtaek Lim 
> wrote:
>
>> Thanks for the quick response, Wenchen!
>>
>> I'll leave this thread for early tomorrow so that someone in US timezone
>> can chime in, and craft a patch if no one objects.
>>
>> On Wed, Dec 11, 2019 at 4:41 PM Wenchen Fan  wrote:
>>
>>> PartitionReader extends Closeable, so it seems reasonable to me to do the same
>>> for DataWriter.
>>>
>>> On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Hi devs,

 I'd like to propose to add close() on DataWriter explicitly, which is
 the place for resource cleanup.

 The rationale for the proposal comes from the lifecycle of DataWriter. If
 the scaladoc of DataWriter is correct, the lifecycle of a DataWriter
 instance ends at either commit() or abort(). That leads data source
 implementors to feel they can place resource cleanup on either side, but
 abort() can be called when commit() fails, so they have to ensure they
 don't do double cleanup if cleanup is not idempotent.

 I've checked some callers to see whether they can apply
 "try-catch-finally" to ensure close() is called at the end of the
 DataWriter lifecycle, and it looks like they can, but I might be missing
 something.

 What do you think? It would be a backward-incompatible change, but given
 that the interface is marked as Evolving and we're making
 backward-incompatible changes in Spark 3.0 anyway, I feel it may not matter.

 Would love to hear your thoughts.

 Thanks in advance,
 Jungtaek Lim (HeartSaVioR)



>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-11 Thread Dongjoon Hyun
Thank you all. I'll make a PR to Apache Spark website.

Bests,
Dongjoon.

On Tue, Dec 10, 2019 at 11:43 PM Wenchen Fan  wrote:

> Sounds good. Thanks for bringing this up!
>
> On Wed, Dec 11, 2019 at 3:18 PM Takeshi Yamamuro 
> wrote:
>
>> That looks nice, thanks!
>> I checked the previous v2.4.4 release; it has around 130 commits (from
>> 2.4.3 to 2.4.4), so
>> I think branch-2.4 already has enough commits for the next release.
>>
>> A commit list from 2.4.3 to 2.4.4;
>>
>> https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...7955b3962ac46b89564e0613db7bea98a1478bf2
>>
>> Bests,
>> Takeshi
>>
>> On Tue, Dec 10, 2019 at 3:32 AM Sean Owen  wrote:
>>
>>> Sure, seems fine. The release cadence slows down in a branch over time
>>> as there is probably less to fix, so Jan-Feb 2020 for 2.4.5 and
>>> something like mid-2020 or Q3 2020 for 2.4.6 is a reasonable
>>> expectation. It might plausibly be the last 2.4.x release but who
>>> knows.
>>>
>>> On Mon, Dec 9, 2019 at 12:29 PM Dongjoon Hyun 
>>> wrote:
>>> >
>>> > Hi, All.
>>> >
>>> > Along with the discussion on 3.0.0, I'd like to discuss about the next
>>> releases on `branch-2.4`.
>>> >
>>> > As we know, `branch-2.4` is our LTS branch, and there are also some open
>>> questions about the release plans. More releases are important not only for
>>> the latest K8s version support, but also for delivering important bug fixes
>>> regularly (at least until 3.x becomes dominant).
>>> >
>>> > In short, I'd like to propose the followings.
>>> >
>>> > 1. Apache Spark 2.4.5 release (2020 January)
>>> > 2. Apache Spark 2.4.6 release (2020 July)
>>> >
>>> > Of course, we can adjust the schedule.
>>> > This aims to set a pre-defined cadence in order to give release
>>> managers time to prepare.
>>> >
>>> > Bests,
>>> > Dongjoon.
>>> >
>>> > PS. As of now, `branch-2.4` has 135 additional patches after `2.4.4`.
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Ryan Blue
Sounds good to me, too.

On Wed, Dec 11, 2019 at 1:18 AM Jungtaek Lim 
wrote:

> Thanks for the quick response, Wenchen!
>
> I'll leave this thread for early tomorrow so that someone in US timezone
> can chime in, and craft a patch if no one objects.
>
> On Wed, Dec 11, 2019 at 4:41 PM Wenchen Fan  wrote:
>
>> PartitionReader extends Closeable, so it seems reasonable to me to do the same
>> for DataWriter.
>>
>> On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi devs,
>>>
>>> I'd like to propose to add close() on DataWriter explicitly, which is
>>> the place for resource cleanup.
>>>
>>> The rationale for the proposal comes from the lifecycle of DataWriter. If
>>> the scaladoc of DataWriter is correct, the lifecycle of a DataWriter
>>> instance ends at either commit() or abort(). That leads data source
>>> implementors to feel they can place resource cleanup on either side, but
>>> abort() can be called when commit() fails, so they have to ensure they
>>> don't do double cleanup if cleanup is not idempotent.
>>>
>>> I've checked some callers to see whether they can apply
>>> "try-catch-finally" to ensure close() is called at the end of the
>>> DataWriter lifecycle, and it looks like they can, but I might be missing
>>> something.
>>>
>>> What do you think? It would be a backward-incompatible change, but given
>>> that the interface is marked as Evolving and we're making
>>> backward-incompatible changes in Spark 3.0 anyway, I feel it may not matter.
>>>
>>> Would love to hear your thoughts.
>>>
>>> Thanks in advance,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Enabling fully disaggregated shuffle on Spark

2019-12-11 Thread Ben Sidhom
Recapping today's sync on the wider dev list for visibility:

The original proposals here can be refactored into 3 distinct changes which
could be integrated iteratively. In order of decreasing priority:

   1. Allow MapStatus to take an arbitrary/opaque payload and rip out hard
   references to executor ids, etc. This lets shuffle implementations
   customize, e.g., block location specs and decouples shuffle results from
   executors/specific machines.
   2. Allow MapStatus to be dynamically updated by inserting RPC hooks in
   strategic places. Shuffle managers can then hook into these and, for
   example, invalidate shuffle data on external failure or notify the
   MapStatus tracker that asynchronous backups are ready. This replaces the
   scheduler changes proposed above.
   3. Deterministic/sort-consistent serializer APIs that allow key-wise
   aggregation/sorting server-side.

Point 1 is really a prerequisite for point 2, since dynamic updates are only
useful to shuffle managers if they have the necessary data available. Point
3 is independent but lower priority: it can be considered a performance
optimization, and it may require invasive changes to Spark (and user code)
to actually work.
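
As a rough illustration of point 1 only (hypothetical names, not a concrete
API proposal), the idea is a non-sealed map status whose location is an
opaque, URL-like payload rather than a BlockManagerId:

// All names here are illustrative; the real MapStatus is a sealed
// private[spark] trait bound to BlockManagerId.
trait GeneralizedMapStatus extends Serializable {
  // Opaque location payload: "executor://host:port/mapId",
  // "dfs://path/to/data", "myshuffle://host:port/dataId", ...
  def location: String
  def getSizeForBlock(reduceId: Int): Long
  def mapId: Long
}

// A remote-shuffle implementation could then carry multiple replicas.
case class RemoteShuffleMapStatus(
    replicas: Seq[String],
    blockSizes: Array[Long],
    mapId: Long) extends GeneralizedMapStatus {
  override def location: String = replicas.head
  override def getSizeForBlock(reduceId: Int): Long = blockSizes(reduceId)
}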

The tentative plan is to separate these efforts into 3 separate proposal
docs (possibly with discussion doc(s) while the details gel).

On Fri, Dec 6, 2019 at 7:53 AM Li Hao  wrote:

> Agree with Bo's idea that MapStatus could be a more generalized concept,
> not necessarily bound to the BlockManager/executor.
>
> As I understand it, MapStatus is used to track/record the output data
> location of a map task; it is created by the shuffle writer and used by the
> shuffle reader to find and read the shuffle data. So, if we want to keep
> using MapStatus to provide the same functionality across different shuffle
> implementations, it should be more generalized, so that different shuffle
> writers can encapsulate their own specific data-location info into a
> MapStatus object and, similarly, different shuffle readers can retrieve
> their info from the MapStatus object.
>
> In my view there are two ways to make MapStatus more generalized:
> 1. Make MapStatus extendable (as Bo mentioned above, by making MapStatus a
> public non-sealed trait), so that each shuffle implementation can have its
> own MapStatus implementation.
> 2. Make the location in MapStatus a more general data-location identifier
> (as mentioned in Ben's proposal), maybe something like a URL, for example
> executor://host:port:mapid, dfs://path/to/data (which is the case in Baidu's
> disaggregated shuffle implementation), s3://path/to/data, or
> xxshuffleserver://host:port:dataid, so that different shuffle writers can
> encode their output data location into this URL and the reader can
> understand what the URL means and finally find and read the shuffle data.
>
> These two ways are not in conflict. We could use the second way to make
> MapStatus a more generalized concept that accommodates the various
> data-location representations of different shuffle implementations, and
> also use the first way to provide extensibility, so that shuffle writers
> can encapsulate more of their own output info into MapStatus: not just the
> data location, reduce sizes, and mapId in the current MapStatus trait, but
> also any other info the reduce/shuffle-reader side needs.
>
> Best regards,
> Li Hao
>
> On Thu, 5 Dec 2019 at 12:15, bo yang  wrote:
>
>> Thanks guys for the discussion in the email and also this afternoon!
>>
>> From our experience, we do not need to change the Spark DAG scheduler to
>> implement a remote shuffle service. The current Spark shuffle manager
>> interfaces are pretty good and easy to implement. But we do feel the need
>> to modify MapStatus to make it more generic.
>>
>> The current limitation with MapStatus is that it assumes *a map output only
>> exists on a single executor* (see below). One easy update could be
>> making MapStatus support the scenario where *a map output could be on
>> multiple remote servers*.
>>
>> private[spark] sealed trait MapStatus {
>> def location: BlockManagerId
>> }
>>
>> class BlockManagerId private {
>> private var executorId_ : String,
>> private var host_ : String,
>> private var port_ : Int,
>> }
>>
>> Also, MapStatus is a sealed trait, so our ShuffleManager plugin cannot
>> extend it with our own implementation. How about *making MapStatus a
>> public non-sealed trait*, so that different ShuffleManager plugins can
>> implement their own MapStatus classes?
>>
>> Best,
>> Bo
>>
>> On Wed, Dec 4, 2019 at 3:27 PM Ben Sidhom 
>> wrote:
>>
>>> Hey Imran (and everybody who made it to the sync today):
>>>
>>> Thanks for the comments. Responses below:
>>>
>>> Scheduling and re-executing tasks
> Allow coordination between the service and the Spark DAG scheduler as
> to whether a given block/partition needs to be recomputed when a task 
> fails
> 

Re: Spark 2.4.4 with which version of Hadoop?

2019-12-11 Thread Sean Owen
My moderately informed take is that the "Hadoop 2.7" build is really a
"Hadoop 2.x" build and AFAIK should work with 2.8 and 2.9, but I
certainly haven't tested it, nor have the PR builders. Just use the
"Hadoop provided" build on your env. Of course, you might well want to
use Hadoop 3.x (3.2.x specifically) with Spark 3, which is tested.
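
As an aside, one way to confirm which Hadoop version actually ends up on
Spark's classpath (useful after switching to the "Hadoop provided" build) is
to ask hadoop-common itself from spark-shell:

// VersionInfo ships with hadoop-common, so this reflects whatever Hadoop
// jars your deployment actually provides to Spark.
import org.apache.hadoop.util.VersionInfo
println(VersionInfo.getVersion)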

On Wed, Dec 11, 2019 at 11:43 AM JeffK  wrote:
>
> Hi,
>
> We've been considering using the download package Spark 2.4.4 that's
> pre-built for Hadoop 2.7 with Hadoop 2.7.7.
>
> When used with Spark, Hadoop 2.7 is often quoted as the most stable.
>
> However, Hadoop 2.7.7 is End Of Life. The most recent Hadoop vulnerabilities
> have only been fixed in versions 2.8.5 and above.
>
> We've searched the Spark user forum and have also been following discussions
> on the development forum, and it's still unclear which version of Hadoop
> should be used. Discussions about Spark 3.0.0 currently want to leave Hadoop
> 2.7 as the default; given the known vulnerabilities, this is a concern.
>
> Which versions of Hadoop 2.x are supported, and which should we be using?
>
> Thanks
>
> Jeff
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Spark 2.4.4 with which version of Hadoop?

2019-12-11 Thread JeffK
Hi,

We've been considering using the download package Spark 2.4.4 that's
pre-built for Hadoop 2.7 with Hadoop 2.7.7.

When used with Spark, Hadoop 2.7 is often quoted as the most stable.

However, Hadoop 2.7.7 is End Of Life. The most recent Hadoop vulnerabilities
have only been fixed in versions 2.8.5 and above.

We've searched the Spark user forum and have also been following discussions
on the development forum, and it's still unclear which version of Hadoop
should be used. Discussions about Spark 3.0.0 currently want to leave Hadoop
2.7 as the default; given the known vulnerabilities, this is a concern.

Which versions of Hadoop 2.x are supported, and which should we be using?

Thanks

Jeff



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



RE: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Bryan Herger
It kind of already is.  I was able to build the VerticaDialect as a sort of 
plugin as follows:

Check out apache/spark tree
Copy in VerticaDialect.scala
Build with “mvn -DskipTests compile”
Package the compiled class plus companion object into a JAR
Copy JAR to jars folder in Spark binary installation (optional, probably can 
set path in an extra --jars argument instead)

Then run the following test in spark-shell after creating Vertica table and 
sample data:

org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(org.apache.spark.sql.jdbc.VerticaDialect)
val jdbcDF = spark.read.format("jdbc").option("url", 
"jdbc:vertica://hpbox:5433/docker").option("dbtable", 
"test_alltypes").option("user", "dbadmin").option("password", 
"Vertica1!").load()
jdbcDF.show()
jdbcDF.write.mode("append").format("jdbc").option("url", 
"jdbc:vertica://hpbox:5433/docker").option("dbtable", 
"test_alltypes").option("user", "dbadmin").option("password", 
"Vertica1!").save()
JdbcDialects.unregisterDialect(org.apache.spark.sql.jdbc.VerticaDialect)

If it would be preferable to write documentation describing the above, I can do 
that instead.  The hard part is checking out the matching apache/spark tree 
then copying to the Spark cluster – I can install master branch and latest 
binary and apply patches since I have root on all my test boxes, but customers 
may not be able to.  Still, this provides another route to support new JDBC 
dialects.
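
For what it's worth, an alternative I haven't tried would be a tiny standalone
project that compiles the dialect against Spark as a provided dependency,
rather than building inside the apache/spark tree. A sketch of the build.sbt
(names and versions are illustrative):

name := "vertica-dialect"
scalaVersion := "2.11.12"  // match the Scala version of the target Spark build

// spark-sql provides the JdbcDialect / JdbcDialects APIs; "provided" keeps
// Spark itself out of the packaged jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4" % "provided"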

BryanH

From: Wenchen Fan [mailto:cloud0...@gmail.com]
Sent: Wednesday, December 11, 2019 10:48 AM
To: Xiao Li 
Cc: Bryan Herger ; Sean Owen ; 
dev@spark.apache.org
Subject: Re: I would like to add JDBCDialect to support Vertica database

Can we make the JDBCDialect a public API that users can plug in? It looks like
an endless job to make sure the Spark JDBC source supports all databases.

On Wed, Dec 11, 2019 at 11:41 PM Xiao Li 
mailto:lix...@databricks.com>> wrote:
You can follow how we test the other JDBC dialects. All JDBC dialects require 
the docker integration tests. 
https://github.com/apache/spark/tree/master/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc


On Wed, Dec 11, 2019 at 7:33 AM Bryan Herger 
mailto:bryan.her...@microfocus.com>> wrote:
Hi, to answer both questions raised:

Though Vertica is derived from Postgres, Vertica does not recognize type names 
TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently enough to 
cause issues.  The major changes are to use type names and date format 
supported by Vertica.

For testing, I have a SQL script plus Scala and PySpark scripts, but these 
require a Vertica database to connect, so automated testing on a build server 
wouldn’t work.  It’s possible to include my test scripts and directions to run 
manually, but not sure where in the repo that would go.  If automated testing 
is required, I can ask our engineers whether there exists something like a 
mockito that could be included.

Thanks, Bryan H

From: Xiao Li [mailto:lix...@databricks.com]
Sent: Wednesday, December 11, 2019 10:13 AM
To: Sean Owen mailto:sro...@gmail.com>>
Cc: Bryan Herger 
mailto:bryan.her...@microfocus.com>>; 
dev@spark.apache.org
Subject: Re: I would like to add JDBCDialect to support Vertica database

How can the dev community test it?

Xiao

On Wed, Dec 11, 2019 at 6:52 AM Sean Owen 
mailto:sro...@gmail.com>> wrote:
It's probably OK, IMHO. The overhead of another dialect is small. Are
there differences that require a new dialect? I assume so and might
just be useful to summarize them if you open a PR.

On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
mailto:bryan.her...@microfocus.com>> wrote:
>
> Hi, I am a Vertica support engineer, and we have open support requests around 
> NULL values and SQL type conversion with DataFrame read/write over JDBC when 
> connecting to a Vertica database.  The stack traces point to issues with the 
> generic JDBCDialect in Spark-SQL.
>
> I saw that other vendors (Teradata, DB2...) have contributed a JDBCDialect 
> class to address JDBC compatibility, so I wrote up a dialect for Vertica.
>
> The changeset is on my fork of apache/spark at 
> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
>
> I have tested this against Vertica 9.3 and found that this changeset 
> addresses both issues reported to us (issue with NULL values - setNull() - 
> for valid java.sql.Types, and String to VARCHAR conversion)
>
> Is this an acceptable change?  If so, how should I go about submitting a pull
> request?
>
> Thanks, Bryan Herger
> Vertica Solution Engineer
>
>
> -
> To unsubscribe e-mail: 
> dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org
--

Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Wenchen Fan
Can we make the JDBCDialect a public API that users can plug in? It looks
like an endless job to make sure the Spark JDBC source supports all databases.

On Wed, Dec 11, 2019 at 11:41 PM Xiao Li  wrote:

> You can follow how we test the other JDBC dialects. All JDBC dialects
> require the docker integration tests.
> https://github.com/apache/spark/tree/master/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc
>
>
> On Wed, Dec 11, 2019 at 7:33 AM Bryan Herger 
> wrote:
>
>> Hi, to answer both questions raised:
>>
>>
>>
>> Though Vertica is derived from Postgres, Vertica does not recognize type
>> names TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently
>> enough to cause issues.  The major changes are to use type names and date
>> format supported by Vertica.
>>
>>
>>
>> For testing, I have a SQL script plus Scala and PySpark scripts, but
>> these require a Vertica database to connect, so automated testing on a
>> build server wouldn’t work.  It’s possible to include my test scripts and
>> directions to run manually, but not sure where in the repo that would go.
>> If automated testing is required, I can ask our engineers whether there
>> exists something like a mockito that could be included.
>>
>>
>>
>> Thanks, Bryan H
>>
>>
>>
>> *From:* Xiao Li [mailto:lix...@databricks.com]
>> *Sent:* Wednesday, December 11, 2019 10:13 AM
>> *To:* Sean Owen 
>> *Cc:* Bryan Herger ; dev@spark.apache.org
>> *Subject:* Re: I would like to add JDBCDialect to support Vertica
>> database
>>
>>
>>
>> How can the dev community test it?
>>
>>
>>
>> Xiao
>>
>>
>>
>> On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:
>>
>> It's probably OK, IMHO. The overhead of another dialect is small. Are
>> there differences that require a new dialect? I assume so and might
>> just be useful to summarize them if you open a PR.
>>
>> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>>  wrote:
>> >
>> > Hi, I am a Vertica support engineer, and we have open support requests
>> around NULL values and SQL type conversion with DataFrame read/write over
>> JDBC when connecting to a Vertica database.  The stack traces point to
>> issues with the generic JDBCDialect in Spark-SQL.
>> >
>> > I saw that other vendors (Teradata, DB2...) have contributed a
>> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
>> for Vertica.
>> >
>> > The changeset is on my fork of apache/spark at
>> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
>> >
>> > I have tested this against Vertica 9.3 and found that this changeset
>> addresses both issues reported to us (issue with NULL values - setNull() -
>> for valid java.sql.Types, and String to VARCHAR conversion)
>> >
>> > Is this an acceptable change?  If so, how should I go about submitting a
>> pull request?
>> >
>> > Thanks, Bryan Herger
>> > Vertica Solution Engineer
>> >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
>>
>>
>
>
> --
>


Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Xiao Li
You can follow how we test the other JDBC dialects. All JDBC dialects
require the docker integration tests.
https://github.com/apache/spark/tree/master/external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc


On Wed, Dec 11, 2019 at 7:33 AM Bryan Herger 
wrote:

> Hi, to answer both questions raised:
>
>
>
> Though Vertica is derived from Postgres, Vertica does not recognize type
> names TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently
> enough to cause issues.  The major changes are to use type names and date
> format supported by Vertica.
>
>
>
> For testing, I have a SQL script plus Scala and PySpark scripts, but these
> require a Vertica database to connect, so automated testing on a build
> server wouldn’t work.  It’s possible to include my test scripts and
> directions to run manually, but not sure where in the repo that would go.
> If automated testing is required, I can ask our engineers whether there
> exists something like a mockito that could be included.
>
>
>
> Thanks, Bryan H
>
>
>
> *From:* Xiao Li [mailto:lix...@databricks.com]
> *Sent:* Wednesday, December 11, 2019 10:13 AM
> *To:* Sean Owen 
> *Cc:* Bryan Herger ; dev@spark.apache.org
> *Subject:* Re: I would like to add JDBCDialect to support Vertica database
>
>
>
> How can the dev community test it?
>
>
>
> Xiao
>
>
>
> On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:
>
> It's probably OK, IMHO. The overhead of another dialect is small. Are
> there differences that require a new dialect? I assume so and might
> just be useful to summarize them if you open a PR.
>
> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>  wrote:
> >
> > Hi, I am a Vertica support engineer, and we have open support requests
> around NULL values and SQL type conversion with DataFrame read/write over
> JDBC when connecting to a Vertica database.  The stack traces point to
> issues with the generic JDBCDialect in Spark-SQL.
> >
> > I saw that other vendors (Teradata, DB2...) have contributed a
> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
> for Vertica.
> >
> > The changeset is on my fork of apache/spark at
> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
> >
> > I have tested this against Vertica 9.3 and found that this changeset
> addresses both issues reported to us (issue with NULL values - setNull() -
> for valid java.sql.Types, and String to VARCHAR conversion)
> >
> > Is this an acceptable change?  If so, how should I go about submitting a
> pull request?
> >
> > Thanks, Bryan Herger
> > Vertica Solution Engineer
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
>
>


-- 



RE: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Bryan Herger
Hi, to answer both questions raised:

Though Vertica is derived from Postgres, Vertica does not recognize type names 
TEXT, NVARCHAR, BYTEA, ARRAY, and also handles DATETIME differently enough to 
cause issues.  The major changes are to use type names and date format 
supported by Vertica.

For testing, I have a SQL script plus Scala and PySpark scripts, but these
require a Vertica database to connect to, so automated testing on a build server
wouldn’t work.  It’s possible to include my test scripts and directions to run
them manually, but I'm not sure where in the repo those would go.  If automated
testing is required, I can ask our engineers whether there is something like a
Mockito-style mock that could be included.

Thanks, Bryan H

From: Xiao Li [mailto:lix...@databricks.com]
Sent: Wednesday, December 11, 2019 10:13 AM
To: Sean Owen 
Cc: Bryan Herger ; dev@spark.apache.org
Subject: Re: I would like to add JDBCDialect to support Vertica database

How can the dev community test it?

Xiao

On Wed, Dec 11, 2019 at 6:52 AM Sean Owen 
mailto:sro...@gmail.com>> wrote:
It's probably OK, IMHO. The overhead of another dialect is small. Are
there differences that require a new dialect? I assume so and might
just be useful to summarize them if you open a PR.

On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
mailto:bryan.her...@microfocus.com>> wrote:
>
> Hi, I am a Vertica support engineer, and we have open support requests around 
> NULL values and SQL type conversion with DataFrame read/write over JDBC when 
> connecting to a Vertica database.  The stack traces point to issues with the 
> generic JDBCDialect in Spark-SQL.
>
> I saw that other vendors (Teradata, DB2...) have contributed a JDBCDialect 
> class to address JDBC compatibility, so I wrote up a dialect for Vertica.
>
> The changeset is on my fork of apache/spark at 
> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
>
> I have tested this against Vertica 9.3 and found that this changeset 
> addresses both issues reported to us (issue with NULL values - setNull() - 
> for valid java.sql.Types, and String to VARCHAR conversion)
>
> Is this an acceptable change?  If so, how should I go about submitting a pull
> request?
>
> Thanks, Bryan Herger
> Vertica Solution Engineer
>
>
> -
> To unsubscribe e-mail: 
> dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: 
dev-unsubscr...@spark.apache.org
--


Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Xiao Li
How can the dev community test it?

Xiao

On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:

> It's probably OK, IMHO. The overhead of another dialect is small. Are
> there differences that require a new dialect? I assume so and might
> just be useful to summarize them if you open a PR.
>
> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>  wrote:
> >
> > Hi, I am a Vertica support engineer, and we have open support requests
> around NULL values and SQL type conversion with DataFrame read/write over
> JDBC when connecting to a Vertica database.  The stack traces point to
> issues with the generic JDBCDialect in Spark-SQL.
> >
> > I saw that other vendors (Teradata, DB2...) have contributed a
> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
> for Vertica.
> >
> > The changeset is on my fork of apache/spark at
> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
> >
> > I have tested this against Vertica 9.3 and found that this changeset
> addresses both issues reported to us (issue with NULL values - setNull() -
> for valid java.sql.Types, and String to VARCHAR conversion)
> >
> > Is this an acceptable change?  If so, how should I go about submitting a
> pull request?
> >
> > Thanks, Bryan Herger
> > Vertica Solution Engineer
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --



Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Sean Owen
It's probably OK, IMHO. The overhead of another dialect is small. Are
there differences that require a new dialect? I assume so, and it might
just be useful to summarize them if you open a PR.

On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
 wrote:
>
> Hi, I am a Vertica support engineer, and we have open support requests around 
> NULL values and SQL type conversion with DataFrame read/write over JDBC when 
> connecting to a Vertica database.  The stack traces point to issues with the 
> generic JDBCDialect in Spark-SQL.
>
> I saw that other vendors (Teradata, DB2...) have contributed a JDBCDialect 
> class to address JDBC compatibility, so I wrote up a dialect for Vertica.
>
> The changeset is on my fork of apache/spark at 
> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
>
> I have tested this against Vertica 9.3 and found that this changeset 
> addresses both issues reported to us (issue with NULL values - setNull() - 
> for valid java.sql.Types, and String to VARCHAR conversion)
>
> Is this an acceptable change?  If so, how should I go about submitting a pull
> request?
>
> Thanks, Bryan Herger
> Vertica Solution Engineer
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Jungtaek Lim
Thanks for the quick response, Wenchen!

I'll leave this thread open until early tomorrow so that someone in a US
timezone can chime in, and I'll craft a patch if no one objects.

On Wed, Dec 11, 2019 at 4:41 PM Wenchen Fan  wrote:

> PartitionReader extends Closeable, so it seems reasonable to me to do the same
> for DataWriter.
>
> On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim 
> wrote:
>
>> Hi devs,
>>
>> I'd like to propose to add close() on DataWriter explicitly, which is the
>> place for resource cleanup.
>>
>> The rationale for the proposal comes from the lifecycle of DataWriter. If
>> the scaladoc of DataWriter is correct, the lifecycle of a DataWriter
>> instance ends at either commit() or abort(). That leads data source
>> implementors to feel they can place resource cleanup on either side, but
>> abort() can be called when commit() fails, so they have to ensure they
>> don't do double cleanup if cleanup is not idempotent.
>>
>> I've checked some callers to see whether they can apply
>> "try-catch-finally" to ensure close() is called at the end of the
>> DataWriter lifecycle, and it looks like they can, but I might be missing
>> something.
>>
>> What do you think? It would be a backward-incompatible change, but given
>> that the interface is marked as Evolving and we're making
>> backward-incompatible changes in Spark 3.0 anyway, I feel it may not matter.
>>
>> Would love to hear your thoughts.
>>
>> Thanks in advance,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>>