GitHub user rahil-c edited a comment on the discussion: RFC-99: Hudi Type System

Based on the poll you shared https://github.com/apache/hudi/discussions/13894 
it seems like Python via PySpark is one of the main interfaces through which 
users work with Hudi, often mixing Spark DataFrame and Spark SQL operations 
within a single job. From my understanding, the SQL standard does not map 1:1 
to the most popular DataFrame libraries 
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html
 so I'm not sure we need to closely match the SQL standard for the AI/ML 
engineers in this space who will actually be interfacing with a mix of 
DataFrames and SQL.

Actually, the main question on my mind is: even if Hudi offers types such as 
BLOB or VECTOR, if the underlying engine such as Spark or Flink does not give 
the user a way to specify the type, then how do we hold the mapping? 
For example, Spark SQL has no native blob or vector type 
https://spark.apache.org/docs/latest/sql-ref-datatypes.html, which is why some 
connectors currently represent vector as ARRAY<FLOAT> 
https://lance.org/integrations/spark/operations/ddl/create-table/#vector-columns
 and blob as BINARY 
https://lance.org/integrations/spark/operations/ddl/create-table/#blob-columns
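
To make the mapping question concrete, here is a minimal sketch (plain Python, with hypothetical names, not an actual Hudi or Spark API) of what "holding the mapping" could look like when the engine only understands its own physical types. The logical Hudi type would have to live out-of-band, e.g. in table or column metadata, because the DDL the user writes can only carry the physical type:

```python
# Hypothetical sketch: a connector-side table of logical Hudi types that
# Spark SQL cannot express natively, mapped to the physical Spark SQL types
# some connectors use today (vector -> ARRAY<FLOAT>, blob -> BINARY).
LOGICAL_TO_SPARK_PHYSICAL = {
    "VECTOR": "ARRAY<FLOAT>",
    "BLOB": "BINARY",
}

def physical_type(logical: str) -> str:
    """Return the Spark SQL type used to store a logical Hudi type.

    Types Spark already supports pass through unchanged.
    """
    return LOGICAL_TO_SPARK_PHYSICAL.get(logical, logical)

# The DDL a user writes carries only physical types; the logical intent
# ("this ARRAY<FLOAT> is a vector") is lost unless the connector records it
# separately, e.g. as a column-level table property.
ddl = (
    "CREATE TABLE embeddings ("
    f"id BIGINT, emb {physical_type('VECTOR')}, doc {physical_type('BLOB')}"
    ") USING hudi"
)
```

The open question is exactly where that side-channel metadata would live and which component (Hudi, the engine, or the connector) owns it.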

Thoughts @danny0405 @balaji-varadarajan-ai @vinothchandar @yihua 
@the-other-tim-brown @cshuo ?

GitHub link: 
https://github.com/apache/hudi/discussions/14253#discussioncomment-14959818

----
This is an automatically sent email for [email protected].