brianrice2 opened a new pull request, #655:
URL: https://github.com/apache/incubator-sedona/pull/655

   
   ## Did you read the Contributor Guide?
   
   Yes, I have read [Contributor 
Rules](https://sedona.apache.org/community/rule/) and [Contributor Development 
Guide](https://sedona.apache.org/community/develop/)
   
   ## Is this PR related to a JIRA ticket?
   
   Yes, the URL of the associated JIRA ticket is 
https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the 
format `[SEDONA-XXX] my subject`.
   
   [Link to original ticket](https://issues.apache.org/jira/browse/SEDONA-133).
   
   ## What changes were proposed in this PR?
   
   This expands the Adapter API so that users can convert both SpatialRDD and 
JavaPairRDD to DataFrames with a user-supplied schema.
   
   User data is still stored as strings, so the new methods parse each string 
and cast it to the data type requested in the schema. This is similar to Spark's 
[UnivocityParser](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L168-L285),
 which Spark uses to parse CSV files. Unfortunately, that functionality is not 
exposed publicly, so I wrote a barebones version here. It does not cover _all_ 
data types, only the key ones. [This 
page](https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/sql/Encoders.html)
 lists the encoders that Spark uses and may help in choosing the appropriate 
data types for conversion. We could expand coverage to more data types later, 
or I'm open to doing it now if you request it.
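
   As a rough illustration of the parse/cast idea described above, here is a 
minimal, Spark-free sketch in Java. It is not the code in this PR: the class 
`FieldCaster`, the enum `TargetType`, and the methods `cast` and `castRow` are 
hypothetical names invented for this example, and the set of supported types is 
an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Sedona or Spark.
class FieldCaster {

    // Assumed subset of target types; the PR's actual coverage may differ.
    enum TargetType { STRING, INTEGER, LONG, DOUBLE, BOOLEAN }

    // Parse one raw string into the requested type.
    // A null input, or a blank input for a non-string type, becomes null.
    static Object cast(String raw, TargetType type) {
        if (raw == null) {
            return null;
        }
        if (type == TargetType.STRING) {
            return raw;
        }
        String s = raw.trim();
        if (s.isEmpty()) {
            return null;
        }
        switch (type) {
            case INTEGER:
                return Integer.valueOf(s);
            case LONG:
                return Long.valueOf(s);
            case DOUBLE:
                return Double.valueOf(s);
            case BOOLEAN:
                // Boolean.valueOf is lenient: anything other than "true"
                // (case-insensitive) yields false.
                return Boolean.valueOf(s);
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    // Cast every string field of a record according to a positional schema.
    static List<Object> castRow(String[] fields, TargetType[] schema) {
        if (fields.length != schema.length) {
            throw new IllegalArgumentException(
                "Expected " + schema.length + " fields but got " + fields.length);
        }
        List<Object> row = new ArrayList<>(fields.length);
        for (int i = 0; i < fields.length; i++) {
            row.add(cast(fields[i], schema[i]));
        }
        return row;
    }

    public static void main(String[] args) {
        TargetType[] schema = {
            TargetType.STRING, TargetType.INTEGER, TargetType.DOUBLE
        };
        System.out.println(castRow(new String[] {"a", "7", "2.5"}, schema));
    }
}
```

   In this sketch, malformed numeric input raises `NumberFormatException` 
instead of being silently replaced; whether to fail or fall back to null is a 
design choice that would need to be made for the real methods.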
   
   This also adds some helper private methods and refactors a few common 
operations into their own functions.
   
   ## How was this patch tested?
   
   Added unit tests to confirm that SpatialRDD/JavaPairRDD -> DataFrame 
conversion works as expected.
   
   **Note**: It doesn't seem to be common practice to test private methods, so 
I didn't add unit tests for the private methods I introduced. Their behavior is 
tested implicitly by the public functions.
   
   ## Did this PR include necessary documentation updates?
   
   Yes, I am adding a new API. I am using the [current SNAPSHOT version 
number](https://github.com/apache/incubator-sedona/blob/master/pom.xml#L29) in 
`since vX.Y.Z` format.
   
   **Note**: I don't see another appropriate place to change documentation. 
Please let me know if I missed this!
   
   ## Questions
   
   ### 1.
   
   Is the following behavior intentional? I don't have a strong geospatial 
background.
   
   In the JavaPairRDD -> DataFrame test case (called "can convert JavaPairRDD 
to DataFrame with user-supplied schema"), you may notice that the left and 
right sides appear swapped in the output. The SpatialJoinQuery has `pointRDD` 
on the left and `polygonRDD` on the right, but the final output has 
`leftGeometry` of type POLYGON, followed by the `polygonRDD` user data fields, 
and `rightGeometry` of type POINT, followed by the `pointRDD` user data (null).
   

