GitHub user rahil-c edited a comment on the discussion: RFC-99: Hudi Type System
Based on the poll you shared (https://github.com/apache/hudi/discussions/13894), it seems Python via PySpark is one of the main interfaces through which users work with Hudi, often mixing Spark DataFrame and Spark SQL operations within a single job. From my understanding, the SQL standard is probably not a 1:1 match for the most popular DataFrame libraries (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html), so I'm not sure we need to closely match the SQL standard for the AI/ML engineers in this space, who will actually be interfacing with a mix of DataFrames and SQL.

Actually, the main question on my mind is this: even if Hudi offers types such as BLOB or VECTOR, if the underlying engine (e.g. Spark or Flink) does not give the user a way to specify the type, then how are we holding the mapping? For example, Spark SQL has no native blob or vector type (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), which is why some connectors currently represent vector as ARRAY<FLOAT> (https://lance.org/integrations/spark/operations/ddl/create-table/#vector-columns) and blob as BINARY (https://lance.org/integrations/spark/operations/ddl/create-table/#blob-columns).

Thoughts @danny0405 @balaji-varadarajan-ai @vinothchandar @yihua @the-other-tim-brown @cshuo ?

GitHub link: https://github.com/apache/hudi/discussions/14253#discussioncomment-14959818
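To make the "how are we holding the mapping" question concrete, here is a minimal sketch (not Hudi's actual design, and all names like `hudi.logical_type` are hypothetical) of one common approach: the connector stores the column with a physical type the engine supports and carries the original logical type alongside as field-level metadata, so it can round-trip even though the engine's type system never sees VECTOR or BLOB.

```python
# Illustrative sketch only: a connector-side mapping from logical types the
# engine lacks (VECTOR, BLOB) to physical types it supports, with the logical
# type preserved as field metadata. The metadata key "hudi.logical_type" is a
# made-up name for illustration, not an actual Hudi property.

LOGICAL_TO_PHYSICAL = {
    "VECTOR": "array<float>",  # e.g. how the Lance Spark connector represents vectors
    "BLOB": "binary",          # e.g. how the Lance Spark connector represents blobs
}

def to_engine_field(name: str, logical_type: str) -> dict:
    """Build an engine-facing field description: a physical type the engine
    understands, plus metadata recording the original logical type."""
    physical = LOGICAL_TO_PHYSICAL.get(logical_type, logical_type.lower())
    return {
        "name": name,
        "type": physical,
        "metadata": {"hudi.logical_type": logical_type},
    }

def from_engine_field(field: dict) -> str:
    """Recover the logical type when reading the schema back."""
    return field["metadata"].get("hudi.logical_type", field["type"])

embedding = to_engine_field("embedding", "VECTOR")
print(embedding["type"])            # array<float>
print(from_engine_field(embedding)) # VECTOR
```

The same idea maps onto Spark directly, since `StructField` already supports a free-form `metadata` dict; the open question in the thread is whether every engine offers an equivalent place to stash the logical type.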
