[ https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830973#comment-17830973 ]
Vinaykumar Bhat edited comment on HUDI-7117 at 3/26/24 2:54 PM:
----------------------------------------------------------------

This is likely not a bug, but a gap in understanding of the feature. {{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} creates a temporary view (similar to the one created by {{CREATE TEMPORARY VIEW ...}}), which is neither a table nor a Hudi-managed table. The subsequent {{CREATE INDEX ...}} statement that creates the functional index therefore fails, because the object on which the index is being created is not a Hudi-managed table.

Instead of creating a temporary view, one can use the {{saveAsTable(...)}} method on the DataFrameWriter object to create a Hudi-managed table and then create the functional index on that table. An example follows:
{code:scala}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieTableType
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

// Assumed value: tableName is not defined in the original snippet.
val tableName = "temp_table"

val columns = Seq("ts", "transaction_id", "rider", "driver", "price", "location")
val data = Seq(
  (1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
  (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
  (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
  (1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
  (1695115999911L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"))

val inserts = spark.createDataFrame(data).toDF(columns: _*)

// saveAsTable registers a Hudi-managed table in the catalog,
// unlike createOrReplaceTempView, which only registers a temporary view.
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")

// List the catalog entries to confirm that temp_table is registered.
spark.catalog.listTables().show(false)

// Create the functional index on the Hudi-managed table.
spark.sql("CREATE INDEX hudi_table_func_index_datestr ON temp_table USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')")
{code}


was (Author: JIRAUSER303569):
This is likely not an issue, but a gap in understanding the feature. The issue is that {{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} creates a temporary view (similar to the one that is created using {{CREATE TEMPORARY VIEW ...}}) and it is neither a table nor a hudi managed table. Hence the following {{CREATE INDEX ...}} statement to create a functional index fails as the object on which the index is being created is not a hudi managed table.

Instead of creating a temporary view, one can use {{saveAsTable(...)}} method on the DataFrameWriter object to create a hudi managed table and then create functional index on those tables. An example follows:

val columns = Seq("ts", "transaction_id", "rider", "driver", "price", "location")
val data = Seq((1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
  (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
  (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
  (1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
  (1695115999911L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"));
var inserts = spark.createDataFrame(data).toDF(columns: _*)
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")
spark.catalog.listTables().show(false)
spark.sql(s"select from_unixtime(ts, 'yyyy-MM-dd') as datestr FROM temp_table").show()
spark.sql(s"CREATE INDEX hudi_table_func_index_datestr ON temp_table USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')")


> Functional index creation not working when table is created using datasource
> writer
> -----------------------------------------------------------------------------------
>
>                 Key: HUDI-7117
>                 URL: https://issues.apache.org/jira/browse/HUDI-7117
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Aditya Goenka
>            Assignee: Vinaykumar Bhat
>            Priority: Blocker
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
>
>
> Details and Reproducible code under Github Issue -
> [https://github.com/apache/hudi/issues/10110]
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)