[ https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132780#comment-17132780 ]

Jeremy Bloom edited comment on SPARK-25186 at 6/10/20, 11:18 PM:
-----------------------------------------------------------------

I'm not sure whether I'm in the right place, but I want to discuss my 
experience using the DataSource V2 API (using Spark 2.4.5). To be clear, I'm 
not a Spark contributor, but I am using Spark in several applications. If this 
is not the appropriate place for my comments, please direct me to the right 
place.

My current project is part of an open source program called COIN-OR (I can 
provide links if you like). Among other things, we are building a data exchange 
standard for mathematical optimization solvers, called MOSDEX. I am using Spark 
SQL as the backbone of this application. I work mostly in Java, some in 
Python, and not at all in Scala.

I came to this point because a critical link in my application is to load data 
from JSON into a Spark Dataset. For various reasons, the standard JSON reader 
in Spark is not adequate for my purposes. I want to use a Java Stream to load 
the data into a Spark dataframe. It turns out that this is remarkably hard to 
do.

SparkSession provides a createDataFrame method whose argument is a Java List. 
However, because I am dealing with potentially very large data sets, I do not 
want to materialize a list in memory. My first attempt was to wrap the stream in 
a "virtual list" that fulfilled most of the List "contract" but did not actually 
create a list object. The catch with this approach is that it cannot count the 
elements in the stream until the stream has been read, so the size of the list 
cannot be determined until that happens. Using the virtual list in the 
createDataFrame method sometimes succeeds but fails at other times, when some 
component of Spark needs to call the size method on the list.
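
For concreteness, the call I am trying to feed looks roughly like this; the 
schema, field names, and values are just placeholders for my own data:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class CreateDataFrameSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("mosdex-sketch").master("local[*]").getOrCreate();

            StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("value", DataTypes.DoubleType, false)));

            // This works, but only because the whole List is materialized in memory.
            List<Row> rows = Arrays.asList(
                RowFactory.create("x1", 1.0),
                RowFactory.create("x2", 2.0));
            Dataset<Row> df = spark.createDataFrame(rows, schema);
            df.show();

            // What I actually want is to pass a java.util.stream.Stream<Row> here.
            // A List facade over the stream breaks whenever Spark calls size()
            // before the stream has been consumed.
            spark.stop();
        }
    }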

I raised this question on Stack Overflow, and someone suggested using DataSource 
V2. There isn't much out there, but I did find a couple of examples, and based 
on them I was able to build a reader and a writer for Java Streams. However, a 
number of issues arose that forced me to use some "unsavory" tricks in my code 
that I am uncomfortable with (they "smell" in Fowler's sense). Here are some of 
the issues I found.

1) DataSource V2 still relies on the calling sequences of DataFrameReader 
and DataFrameWriter from V1. The reader is obtained from the SparkSession 
read method, while the writer is obtained from the Dataset write method. The 
separation is confusing. Why not provide a static factory on Dataset for 
reading as well?
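
Concretely, assuming an existing SparkSession named spark, the two entry points 
are split like this (the format string is just a placeholder for my own source 
class):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    // Reading starts from the session ...
    Dataset<Row> input = spark.read()
        .format("my.package.StreamDataSource")    // placeholder class name
        .load();

    // ... but writing starts from the Dataset itself.
    input.write()
        .format("my.package.StreamDataSource")
        .mode(SaveMode.Append)
        .save();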

2) The actual calls to the reader and writer are made through the 
DataFrameReader and DataFrameWriter format methods. Using "format" for the name 
of this method is misleading, but once you realize what it is doing, it makes 
even less sense. The parameter of the format method is a string, which isn't 
documented but turns out to be a class name; format then creates an instance of 
that class using reflection. Use of reflection in this circumstance appears to 
be unnecessary and restrictive. Why not make the parameter an instance of an 
appropriate interface, such as ReadSupport or WriteSupport?
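
For reference, my understanding of the 2.4 interfaces is that the string passed 
to format names a class roughly like the following, which Spark then 
instantiates through its no-argument constructor:

    import org.apache.spark.sql.sources.v2.DataSourceOptions;
    import org.apache.spark.sql.sources.v2.DataSourceV2;
    import org.apache.spark.sql.sources.v2.ReadSupport;
    import org.apache.spark.sql.sources.v2.reader.DataSourceReader;

    public class StreamDataSource implements DataSourceV2, ReadSupport {

        public StreamDataSource() { }   // reflection requires a no-arg constructor

        @Override
        public DataSourceReader createReader(DataSourceOptions options) {
            // The actual reader implementation is omitted from this sketch.
            throw new UnsupportedOperationException("reader omitted");
        }
    }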

3) Using reflection means that the reader or writer will be created by a 
parameterless constructor, so it is not possible to pass any information to 
the reader or writer. In my case, I want to pass a Java Stream and a Spark 
schema to the reader, and a Spark Dataset to the writer. The only way I have 
found to do that is to declare static fields in the reader and writer classes 
and set them before calling the format method that creates the reader and 
writer objects. That is what I mean by "unsavory" programming.
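
Extending the sketch above, the workaround looks roughly like this, where 
StreamDataSourceReader is a hypothetical name for my own reader class:

    import java.util.stream.Stream;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.sources.v2.DataSourceOptions;
    import org.apache.spark.sql.sources.v2.DataSourceV2;
    import org.apache.spark.sql.sources.v2.ReadSupport;
    import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
    import org.apache.spark.sql.types.StructType;

    public class StreamDataSource implements DataSourceV2, ReadSupport {

        // "Unsavory": the caller must set these before spark.read().format(...).load(),
        // because reflection gives the source no constructor arguments.
        public static Stream<Row> pendingStream;
        public static StructType pendingSchema;

        @Override
        public DataSourceReader createReader(DataSourceOptions options) {
            // My reader (hypothetical, not shown) pulls its input from the static fields.
            return new StreamDataSourceReader(pendingStream, pendingSchema);
        }
    }

    // Caller side:
    //   StreamDataSource.pendingStream = rowStream;   // my Stream<Row>
    //   StreamDataSource.pendingSchema = schema;      // my StructType
    //   Dataset<Row> df = spark.read().format("my.package.StreamDataSource").load();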

4) There are other, less critical issues that are still annoying. Links to 
external files in the option method are undocumented strings, though they look 
like paths. Since the Path class has been part of Java for just about forever, 
why not use it? And when reading from or writing to a file, why not use the 
Java InputStream and OutputStream classes, which have also been around forever?
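
For example, the only way I see to point a source at a file is something like 
this, where "path" is just a conventional key and the value is an arbitrary 
string:

    Dataset<Row> df = spark.read()
        .format("my.package.StreamDataSource")
        .option("path", "/data/mosdex/input.json")   // nothing enforces that this is a valid path
        .load();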

I realize that programming style in Scala might differ from Java, but I can't 
believe that the design of DataSource V2 requires these contortions even in 
Scala. Given the power available in Spark, it seems to me that basic input and 
output should be much better structured than they are.

Thanks for your attention on this matter.


> Stabilize Data Source V2 API 
> -----------------------------
>
>                 Key: SPARK-25186
>                 URL: https://issues.apache.org/jira/browse/SPARK-25186
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Wenchen Fan
>            Priority: Major
>



