[jira] [Comment Edited] (HUDI-7117) Functional index creation not working when table is created using datasource writer

Vinaykumar Bhat (Jira) Tue, 26 Mar 2024 07:55:06 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830973#comment-17830973
 ]


Vinaykumar Bhat edited comment on HUDI-7117 at 3/26/24 2:54 PM:
----------------------------------------------------------------

This is likely not an issue, but a gap in understanding the feature.

 

The problem is that 
{{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} 
creates a temporary view (similar to one created with {{CREATE TEMPORARY 
VIEW ...}}), which is neither a table nor a Hudi managed table. Hence the 
subsequent {{CREATE INDEX ...}} statement to create a functional index fails, 
because the object on which the index is being created is not a Hudi managed 
table.
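
To make the distinction concrete, here is a minimal sketch that registers a 
temporary view and then inspects it through the Spark catalog. It assumes a 
spark-shell session with the Hudi bundle on the classpath and an existing Hudi 
table at the given path; {{basePath}} and {{viewName}} are hypothetical 
placeholders, not names from the report.
{code:scala}
// Hypothetical placeholders, for illustration only.
val basePath = "/tmp/temp_table_path/"
val viewName = "hudi_temp_view"

// Registers a session-scoped temporary view over the Hudi files.
// No table is created in the catalog.
spark.read.format("hudi").load(basePath).createOrReplaceTempView(viewName)

// Show how the catalog reports the object (see the tableType and isTemporary columns).
spark.catalog.listTables().filter(t => t.name == viewName).show(false)

// A temporary view reports isTemporary = true.
println(spark.catalog.getTable(viewName).isTemporary)
{code}
An object that the catalog reports as temporary is a view, so it is not a 
valid target for {{CREATE INDEX}}.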

Instead of creating a temporary view, one can use the {{saveAsTable(...)}} 
method on the DataFrameWriter object to create a Hudi managed table and then 
create a functional index on that table. An example follows:
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieTableType
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Name under which the table is registered in the Spark catalog.
val tableName = "temp_table"

val columns = Seq("ts", "transaction_id", "rider", "driver", "price", "location")

val data = Seq(
  (1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
  (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
  (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
  (1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
  (1695115999911L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"))

val inserts = spark.createDataFrame(data).toDF(columns: _*)

// saveAsTable (rather than save) registers the result as a table in the catalog.
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TBL_NAME.key(), tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable(tableName)

// Confirm that the table is now listed in the catalog.
spark.catalog.listTables().show(false)

// Create the functional index on the catalog table, not on a temporary view.
spark.sql(s"CREATE INDEX hudi_table_func_index_datestr ON $tableName USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')")
{code}
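
If the data has already been written to a path with {{save(...)}} instead of 
{{saveAsTable(...)}}, one possible alternative is to register the existing 
path in the catalog with Hudi's Spark SQL DDL and create the index on the 
registered table. This is a sketch that has not been verified against this 
issue. It assumes the Hudi Spark SQL extensions are enabled in the session; 
the table name, index name and path below are hypothetical.
{code:scala}
// Hypothetical location of a Hudi table previously written with save(path).
val existingPath = "/tmp/existing_hudi_table_path/"

// Register the existing Hudi table in the catalog; the schema is read from the table itself.
spark.sql(s"CREATE TABLE hudi_existing_tbl USING hudi LOCATION '$existingPath'")

// The index is now created on a catalog table instead of a temporary view.
spark.sql("CREATE INDEX hudi_existing_tbl_func_index_datestr ON hudi_existing_tbl USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')")

// List the indexes defined on the table.
spark.sql("SHOW INDEXES FROM hudi_existing_tbl").show(false)
{code}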
 


was (Author: JIRAUSER303569):
This is likely not an issue, but a gap in understanding the feature.

 

The issue is that 
{{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} 
creates a temporary view (similar to the one that is created using {{CREATE 
TEMPORARY VIEW ...}}) and it is neither a table nor a Hudi managed table. 
Hence the following {{CREATE INDEX ...}} statement to create a functional 
index fails as the object on which the index is being created is not a Hudi 
managed table.

Instead of creating a temporary view, one can use {{saveAsTable(...)}} method 
on the DataFrameWriter object to create a hudi managed table and then create 
functional index on those tables. An example follows:
val columns = Seq("ts", "transaction_id", "rider", "driver", "price", 
"location")
val data =
  Seq((1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", 
"driver-K", 19.10, "san_francisco"),
    (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", 
"driver-M", 27.70, "san_francisco"),
    (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", 
"driver-L", 33.90, "san_francisco"),
    (1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", 
"driver-P", 34.15, "sao_paulo"),
    (1695115999911L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", 
"driver-T", 17.85, "chennai"));

var inserts = spark.createDataFrame(data).toDF(columns: _*)
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", 
HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")
spark.catalog.listTables().show(false)
spark.sql(s"select from_unixtime(ts, 'yyyy-MM-dd') as datestr FROM temp_table").show()
spark.sql(s"CREATE INDEX hudi_table_func_index_datestr ON temp_table USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')")

> Functional index creation not working when table is created using datasource 
> writer
> -----------------------------------------------------------------------------------
>
>                 Key: HUDI-7117
>                 URL: https://issues.apache.org/jira/browse/HUDI-7117
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Aditya Goenka
>            Assignee: Vinaykumar Bhat
>            Priority: Blocker
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
>
>
> Details and reproducible code are in the GitHub issue: 
> [https://github.com/apache/hudi/issues/10110]
>  



