[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled

GitBox Thu, 01 Apr 2021 18:25:18 -0700


sadhen commented on a change in pull request #32026:
URL: https://github.com/apache/spark/pull/32026#discussion_r606026938




##########
File path: python/pyspark/sql/pandas/conversion.py
##########
@@ -452,24 +457,27 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
                 struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
             schema = struct
 
-        # Determine arrow types to coerce data when creating batches
+        # Determine data types to coerce data when creating batches
         if isinstance(schema, StructType):
-            arrow_types = [to_arrow_type(f.dataType) for f in schema.fields]
+            data_types = [f.dataType for f in schema.fields]
         elif isinstance(schema, DataType):
             raise ValueError("Single data type %s is not supported with Arrow" % str(schema))
         else:
             # Any timestamps must be coerced to be compatible with Spark
-            arrow_types = [to_arrow_type(TimestampType())
-                           if is_datetime64_dtype(t) or is_datetime64tz_dtype(t) else None
-                           for t in pdf.dtypes]
+            data_types = [to_arrow_type(TimestampType())
+                          if is_datetime64_dtype(t) or is_datetime64tz_dtype(t) else None
+                          for t in pdf.dtypes]
 
         # Slice the DataFrame to be batched
         step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
         pdf_slices = (pdf.iloc[start:start + step] for start in range(0, len(pdf), step))
 
         # Create list of Arrow (columns, type) for serializer dump_stream
-        arrow_data = [[(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)]
-                      for pdf_slice in pdf_slices]
+        # Type can be Spark SQL Data Type or Arrow Data Type
+        arrow_data_with_t = [

Review comment:
       Well, I should use `adt` or `padt` for the PyArrow data type and `pdt` for the pandas data type.
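
       As an aside on the slicing step in the hunk above, here is a minimal standalone sketch of the `-(-n // k)` ceiling-division idiom and the slice boundaries it produces. The helper name `slice_bounds` and the empty-input guard are illustrative assumptions, not part of this PR; the actual code slices a pandas DataFrame with `pdf.iloc` and reads the parallelism from `self.sparkContext.defaultParallelism`.

```python
def slice_bounds(n_rows, parallelism):
    """Return (start, stop) pairs that cover range(n_rows).

    Hypothetical helper for illustration only; it mirrors the arithmetic
    in the hunk above but works on plain integers instead of a DataFrame.
    """
    if n_rows == 0:
        # Assumption made for this sketch: return no slices for empty input,
        # since a step of 0 would make range() raise ValueError.
        return []
    # Negating twice around floor division rounds the quotient up.
    step = -(-n_rows // parallelism)
    return [(start, min(start + step, n_rows))
            for start in range(0, n_rows, step)]


bounds = slice_bounds(10, 4)
print(bounds)
```

       With 10 rows and a parallelism of 4, the step is 3, so this gives four slices and the last one is shorter than the others.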