xushiyan commented on code in PR #119:
URL: https://github.com/apache/hudi-rs/pull/119#discussion_r1759795309


##########
crates/core/Cargo.toml:
##########
@@ -63,8 +63,9 @@ tokio = { workspace = true }
 # datafusion
 datafusion = { workspace = true, optional = true }
 datafusion-expr = { workspace = true, optional = true }
-datafusion-common = { workspace = true, optional = true }
+datafusion-common = { workspace = true }
 datafusion-physical-expr = { workspace = true, optional = true }
+log = "0.4.22"

Review Comment:
   can this dep be added to workspace for centralized management?



##########
crates/core/src/table/mod.rs:
##########
@@ -306,6 +330,16 @@ impl Table {
             .read_file_slice_by_path_unchecked(relative_path)
             .await
     }
+
+    pub fn partition_filter_replace(&mut self, partition_filters: 
Vec<PartitionFilter>) {
+        if self.configs.get_or_default(IsHiveStylePartitioning).to() {
+            self.partition_filters = partition_filters;
+        }
+    }
+
+    pub fn reset(&mut self) {
+        self.file_system_view.reset()
+    }

Review Comment:
   these 2 are only intended for testing. let's minimize the mutability for 
table apis. see if you can refactor each test case with a new table instance 
without keep resetting them. maybe find the most popular lib for parameterized 
testing?



##########
crates/core/src/table/mod.rs:
##########
@@ -234,6 +239,23 @@ impl Table {
         self.timeline.get_latest_schema().await
     }
 
+    pub async fn get_partition_schema(&self) -> Result<Schema> {
+        let schema = self.get_schema().await;
+        let partition_str_vec = self

Review Comment:
   we don't need to use different local var names thanks to strict scope 
control in rust, it's actually preferable to use the same var names to increase 
readability. we just need a var named `partition_fields`



##########
crates/core/src/table/fs_view.rs:
##########
@@ -70,7 +73,31 @@ impl FileSystemView {
         if partition_paths.is_empty() {
             partition_paths.push("".to_string())
         }
-        Ok(partition_paths)
+        if partition_filters.is_empty() {
+            return Ok(partition_paths);
+        }
+        let field_and_data_type: HashMap<_, _> = partition_schema
+            .fields()
+            .iter()
+            .map(|field| (field.name().to_string(), field.data_type().clone()))
+            .collect();
+
+        Ok(partition_paths

Review Comment:
   all partition paths were listed and then filtered - a more optimized 
approach is to filter out partitions during the listing process, will reduce a 
lot of scanning for multi-level partitions. we can optimize this in a separate 
PR



##########
crates/core/src/table/mod.rs:
##########
@@ -234,6 +239,23 @@ impl Table {
         self.timeline.get_latest_schema().await
     }
 
+    pub async fn get_partition_schema(&self) -> Result<Schema> {
+        let schema = self.get_schema().await;
+        let partition_str_vec = self
+            .configs
+            .get_or_default(PartitionFields)
+            .to::<Vec<String>>();
+        let partition_set: HashSet<String> = 
partition_str_vec.into_iter().collect();
+        let partition_fields: Vec<Arc<Field>> = schema
+            .unwrap()

Review Comment:
   in production code we avoid unwrap() as much as possible. this function 
returns a result already



##########
crates/core/src/table/partitions.rs:
##########
@@ -0,0 +1,362 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+use anyhow::anyhow;
+use anyhow::Result;
+use arrow_schema::DataType;
+use datafusion_common::ScalarValue;
+use serde::{Serialize, Serializer};
+use std::borrow::Cow;
+use std::collections::HashMap;
+
+/// A special value used in Hive to represent the null partition in 
partitioned tables
+pub const NULL_PARTITION_VALUE_DATA_PATH: &str = "__HIVE_DEFAULT_PARTITION__";
+
+/// A Enum used for selecting the partition value operation when filtering a 
HudiTable partition.
+#[derive(Clone, Debug, PartialEq, Eq)]
+pub enum PartitionValue {
+    /// The partition value with the equal operator
+    Equal(ScalarValue),

Review Comment:
   echoing comment at the beginning - can we remove the dependency of 
ScalarValue from datafusion? wouldn't we be ok to keep strings in the model and 
convert using DataType for comparison when matching partitions? use something 
from arrow if suitable



##########
crates/core/src/config/table.rs:
##########
@@ -75,6 +75,7 @@ impl ConfigParser for HudiTableConfig {
             Self::DatabaseName => 
Some(HudiConfigValue::String("default".to_string())),
             Self::DropsPartitionFields => 
Some(HudiConfigValue::Boolean(false)),
             Self::PopulatesMetaFields => Some(HudiConfigValue::Boolean(true)),
+            Self::PartitionFields => 
Some(HudiConfigValue::List(vec!["".to_string()])),

Review Comment:
   follow alphabetical order, can you move this up 1 line?



##########
crates/core/Cargo.toml:
##########
@@ -63,8 +63,9 @@ tokio = { workspace = true }
 # datafusion
 datafusion = { workspace = true, optional = true }
 datafusion-expr = { workspace = true, optional = true }
-datafusion-common = { workspace = true, optional = true }
+datafusion-common = { workspace = true }

Review Comment:
   hudi-core should be optionally dependent on datafusion - only when 
datafusion feature is enabled. let's keep this optional



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to