Re: [I] Hive metastore does not update metadata during commit. [iceberg]

2024-04-07 Thread via GitHub


Chaho12 commented on issue #10101:
URL: https://github.com/apache/iceberg/issues/10101#issuecomment-2041913419

   @pvary yeah, it is a similar issue.





Re: [I] Hive metastore does not update metadata during commit. [iceberg]

2024-04-07 Thread via GitHub


pvary commented on issue #10101:
URL: https://github.com/apache/iceberg/issues/10101#issuecomment-2041885473

   Is this similar to #9753?





Re: [I] Implement `Closable` interface for class `HiveCatalog` and `HiveClientPool` [iceberg]

2024-04-07 Thread via GitHub


pvary commented on issue #10100:
URL: https://github.com/apache/iceberg/issues/10100#issuecomment-2041879370

   This comes up from time to time.
   Like: https://github.com/apache/iceberg/issues/1502#issuecomment-698635234
   
   The original intention was to keep the HMS client pool open between catalog 
creations, since opening the connection has a significant overhead, and a 
typical user uses only a single pool.
   
   What is your use case that needs this behavior change?
   
   Thanks, Peter





Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-07 Thread via GitHub


nk1506 commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555225734


##
aws/src/integration/java/org/apache/iceberg/aws/dynamodb/TestDynamoDbCatalog.java:
##
@@ -99,22 +100,23 @@ public void testCreateNamespace() {
 .tableName(catalogTableName)
 .key(DynamoDbCatalog.namespacePrimaryKey(namespace))
 .build());
-Assert.assertTrue("namespace must exist", response.hasItem());
-Assert.assertEquals(
-"namespace must be stored in DynamoDB",
-namespace.toString(),
-response.item().get("namespace").s());
-Assertions.assertThatThrownBy(() -> catalog.createNamespace(namespace))
+assertThat(response.hasItem()).as("namespace must exist").isTrue();
+assertThat(response.item())
+.as("namespace must be stored in DynamoDB")
+.hasEntrySatisfying(
+"namespace",
+attributeValue -> 
assertThat(attributeValue.s()).isEqualTo(namespace.toString()));

Review Comment:
   For a single-key match, what about a simple assertion like
   ```
   assertThat(response.item().get("namespace").s())
       .as("namespace must be stored in DynamoDB")
       .isEqualTo(namespace.toString());
   ```
   For multiple-key validation it looks good to use `hasEntrySatisfying`. WDYT?






Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-07 Thread via GitHub


tomtongue commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555216313


##
aws/src/integration/java/org/apache/iceberg/aws/TestDefaultAwsClientFactory.java:
##
@@ -39,7 +40,7 @@ public void testGlueEndpointOverride() {
 properties.put(AwsProperties.GLUE_CATALOG_ENDPOINT, "https://unknown:1234");
 AwsClientFactory factory = AwsClientFactories.from(properties);
 GlueClient glueClient = factory.glue();
-Assertions.assertThatThrownBy(
+assertThatThrownBy(

Review Comment:
   I believe the changes for removing `Assertions` are very small in this 
package, and they avoid the situation where two forms like 
`Assertions.assertThat` and `assertThat` exist in the same file, so it won't be a 
big problem. If there's any reason to keep `Assertions`, please let me know.
   The [Iceberg contribute guide](https://iceberg.apache.org/contribute/#assertj) 
also shows the tests without `Assertions`.






Re: [PR] #9073 Junit 4 tests switched to JUnit 5 [iceberg]

2024-04-07 Thread via GitHub


igoradulian commented on PR #9793:
URL: https://github.com/apache/iceberg/pull/9793#issuecomment-2041857202

   @nastra I updated the PR; please review it and let me know if further changes 
are needed. BTW, some tests don't pass, and I didn't touch tests that are 
dependent on tests and classes from other modules which haven't been updated 
yet. Thanks.





Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-07 Thread via GitHub


tomtongue commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555216313


##
aws/src/integration/java/org/apache/iceberg/aws/TestDefaultAwsClientFactory.java:
##
@@ -39,7 +40,7 @@ public void testGlueEndpointOverride() {
 properties.put(AwsProperties.GLUE_CATALOG_ENDPOINT, "https://unknown:1234");
 AwsClientFactory factory = AwsClientFactories.from(properties);
 GlueClient glueClient = factory.glue();
-Assertions.assertThatThrownBy(
+assertThatThrownBy(

Review Comment:
   I believe the changes for removing `Assertions` are very small in this 
package, and they avoid the situation where two forms like 
`Assertions.assertThat` and `assertThat` exist in the same file, so it won't be a 
big problem. Is there any reason to keep `Assertions`?
   The [Iceberg contribute guide](https://iceberg.apache.org/contribute/#assertj) 
also shows the tests without `Assertions`.






Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-07 Thread via GitHub


tomtongue commented on PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#issuecomment-2041851203

   @nastra All AWS classes are migrated to JUnit5 + AssertJ. Could you review 
this PR?





Re: [PR] Migrate AWS tests to JUnit5 [iceberg]

2024-04-07 Thread via GitHub


nk1506 commented on code in PR #10086:
URL: https://github.com/apache/iceberg/pull/10086#discussion_r1555209833


##
aws/src/integration/java/org/apache/iceberg/aws/TestDefaultAwsClientFactory.java:
##
@@ -39,7 +40,7 @@ public void testGlueEndpointOverride() {
 properties.put(AwsProperties.GLUE_CATALOG_ENDPOINT, "https://unknown:1234");
 AwsClientFactory factory = AwsClientFactories.from(properties);
 GlueClient glueClient = factory.glue();
-Assertions.assertThatThrownBy(
+assertThatThrownBy(

Review Comment:
   Can't we avoid these cleanups in this PR? IMO, this PR should contain only the 
migration from JUnit 4 to JUnit 5. WDYT?






[I] Hive metastore does not update metadata during commit. [iceberg]

2024-04-07 Thread via GitHub


Chaho12 opened a new issue, #10101:
URL: https://github.com/apache/iceberg/issues/10101

   ### Apache Iceberg version
   
   1.4.3
   
   ### Query engine
   
   Hive
   
   ### Please describe the bug 
   
   As we all know, during the commit process Iceberg registers the most 
up-to-date metadata file path in the table properties in the Hive metastore.
   However, in some cases I notice that even though Iceberg registered the path of 
the metadata file in the Hive metastore and received a normal response, the path 
was `not registered in the hive metastore`, causing the issue.
   
   From the perspective of Flink Iceberg, it is correct to proceed to the 
next checkpoint because HMS responded normally, so this unregistered metadata 
gets lost.
   
   Does anyone have any ideas on fixing this?
   
   ```
   2024-03-19 03:31:02,535 INFO  
org.apache.iceberg.flink.sink.IcebergFilesCommitter  [] - Start to 
flush snapshot state to state backend, table: hive.custom_schema.public_hms, 
checkpointId: 36569
   2024-03-19 03:31:02,712 INFO  org.apache.hadoop.io.compress.CodecPool
  [] - Got brand-new compressor [.zstd]
   2024-03-19 03:31:03,127 INFO  
org.apache.iceberg.flink.sink.IcebergFilesCommitter  [] - Committing 
append for checkpoint 36569 to table hive.custom_schema.public_hms branch main 
with summary: CommitSummary{dataFilesCount=10, dataFilesRecordCount=374, 
dataFilesByteCount=72923, deleteFilesCount=0, deleteFilesRecordCount=0, 
deleteFilesByteCount=0}
   2024-03-19 03:31:03,506 INFO  org.apache.hadoop.io.compress.CodecPool
  [] - Got brand-new compressor [.zstd]
   2024-03-19 03:31:03,966 INFO  org.apache.iceberg.hive.HiveTableOperations
  [] - Committed to table hive.custom_schema.public_hms with the 
new metadata location 
hdfs://.../user/user_hive/warehouse/custom_schema.db/public_hms/metadata/36767-0e5f399a-867c-49de-90e1-0806482ea2fa.metadata.json
   2024-03-19 03:31:04,042 INFO  
org.apache.iceberg.BaseMetastoreTableOperations  [] - Successfully 
committed to table hive.custom_schema.public_hms in 611 ms
   2024-03-19 03:31:04,042 INFO  org.apache.iceberg.SnapshotProducer
  [] - Committed snapshot 517916272951067890 (MergeAppend)
   2024-03-19 03:31:04,196 WARN  org.apache.iceberg.SnapshotProducer
  [] - Failed to load committed snapshot, skipping manifest clean-up
   2024-03-19 03:31:04,329 WARN  org.apache.iceberg.MergingSnapshotProducer 
  [] - Failed to load committed snapshot: omitting sequence number 
from notifications
   2024-03-19 03:31:04,329 INFO  
org.apache.iceberg.metrics.LoggingMetricsReporter[] - Received 
metrics report: CommitReport{tableName=hive.custom_schema.public_hms, 
snapshotId=s, sequenceNumber=-1, operation=append, 
commitMetrics=CommitMetricsResult{totalDuration=TimerResult{timeUnit=NANOSECONDS,
 totalDuration=PT1.069239305S, count=1}, attempts=CounterResult{unit=COUNT, 
value=1}, addedDataFiles=CounterResult{unit=COUNT, value=10}, 
removedDataFiles=null, totalDataFiles=null, addedDeleteFiles=null, 
addedEqualityDeleteFiles=null, addedPositionalDeleteFiles=null, 
removedDeleteFiles=null, removedEqualityDeleteFiles=null, 
removedPositionalDeleteFiles=null, totalDeleteFiles=null, 
addedRecords=CounterResult{unit=COUNT, value=374}, removedRecords=null, 
totalRecords=null, addedFilesSizeInBytes=CounterResult{unit=BYTES, 
value=72923}, removedFilesSizeInBytes=null, totalFilesSizeInBytes=null, 
addedPositionalDeletes=null, removedPositionalDeletes=null, 
totalPositionalDeletes=null, add
 edEqualityDeletes=null, removedEqualityDeletes=null, 
totalEqualityDeletes=null}, metadata={engine-version=1.15.2, engine-name=flink, 
iceberg-version=Apache Iceberg 1.2.0 (commit 
e340ad5be04e902398c576f431810c3dfa4fe717)}}
   ```
   
   3.1.0
   3.1.2
   3.1.3





Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-07 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1555123970


##
crates/iceberg/src/arrow/reader.rs:
##
@@ -186,4 +216,399 @@ impl ArrowReader {
 Ok(ProjectionMask::leaves(parquet_schema, indices))
 }
 }
+
+fn get_row_filter(, parquet_schema: ) -> 
Result> {
+if let Some(predicates) =  {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let mut collector = CollectFieldIdVisitor { field_ids: vec![] };
+visit_predicate( collector, predicates).unwrap();
+let column_indices = collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id not found 
in schema")
+})
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut converter = PredicateConverter {
+columns: _indices,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
column_indices.clone()),
+parquet_schema,
+column_map: _id_map,
+};
+let arrow_predicate = visit_predicate( converter, predicates)?;
+Ok(Some(RowFilter::new(vec![arrow_predicate])))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(, parquet_schema: ) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not( self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true( self) -> Result {
+Ok(())
+}
+
+fn visit_always_false( self) -> Result {
+Ok(())
+}
+
+fn visit_unary( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary( self, predicate: ) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference( self, reference: ) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+/// A visitor to convert Iceberg bound predicates to Arrow predicates.
+struct PredicateConverter<'a> {
+/// The leaf column indices used in the predicates.
+pub columns: &'a Vec,
+/// The projection mask for the Arrow predicates.
+pub projection_mask: ProjectionMask,
+/// The Parquet schema descriptor.
+pub parquet_schema: &'a SchemaDescriptor,
+/// The map between field id and leaf column index in Parquet schema.
+pub column_map: &'a HashMap,
+}
+
+fn get_arrow_datum(datum: ) -> Result> {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Ok(Box::new(BooleanArray::new_scalar(*value))),
+PrimitiveLiteral::Int(value) => 
Ok(Box::new(Int32Array::new_scalar(*value))),
+PrimitiveLiteral::Long(value) => 
Ok(Box::new(Int64Array::new_scalar(*value))),
+PrimitiveLiteral::Float(value) => 
Ok(Box::new(Float32Array::new_scalar(value.as_f32(,
+PrimitiveLiteral::Double(value) => 

Re: [PR] Change DataScan to accept Metadata and io [iceberg-python]

2024-04-07 Thread via GitHub


HonahX commented on code in PR #581:
URL: https://github.com/apache/iceberg-python/pull/581#discussion_r1555108664


##
pyiceberg/io/pyarrow.py:
##
@@ -1089,7 +1091,7 @@ def project_table(
 deletes_per_file.get(task.file.file_path),
 case_sensitive,
 limit,
-table.name_mapping(),
+None,

Review Comment:
   I opened a PR in your repo containing the change I proposed and tests that 
can reproduce the issue: https://github.com/Fokko/iceberg-python/pull/1 :D






Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-07 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1555120078


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(, parquet_schema: ) -> 
Result> {
+if let Some(predicates) =  {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: _id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(, parquet_schema: ) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not( self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true( self) -> Result {
+Ok(())
+}
+
+fn visit_always_false( self) -> Result {
+Ok(())
+}
+
+fn visit_unary( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary( self, predicate: ) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference( self, reference: ) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+struct PredicateConverter<'a> {
+pub columns: &'a Vec,
+pub projection_mask: ProjectionMask,
+pub parquet_schema: &'a SchemaDescriptor,
+pub column_map: &'a HashMap,
+}
+
+fn get_arrow_datum(datum: ) -> Box {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Box::new(BooleanArray::new_scalar(*value)),
+PrimitiveLiteral::Int(value) => 
Box::new(Int32Array::new_scalar(*value)),
+PrimitiveLiteral::Long(value) => 
Box::new(Int64Array::new_scalar(*value)),
+PrimitiveLiteral::Float(value) => 

Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-07 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1555114291


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(, parquet_schema: ) -> 
Result> {
+if let Some(predicates) =  {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: _id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(, parquet_schema: ) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not( self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true( self) -> Result {
+Ok(())
+}
+
+fn visit_always_false( self) -> Result {
+Ok(())
+}
+
+fn visit_unary( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary( self, predicate: ) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference( self, reference: ) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+struct PredicateConverter<'a> {
+pub columns: &'a Vec,
+pub projection_mask: ProjectionMask,
+pub parquet_schema: &'a SchemaDescriptor,
+pub column_map: &'a HashMap,
+}
+
+fn get_arrow_datum(datum: ) -> Box {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Box::new(BooleanArray::new_scalar(*value)),
+PrimitiveLiteral::Int(value) => 
Box::new(Int32Array::new_scalar(*value)),
+PrimitiveLiteral::Long(value) => 
Box::new(Int64Array::new_scalar(*value)),
+PrimitiveLiteral::Float(value) => 

Re: [I] Replace table shouldn't clear table history [iceberg]

2024-04-07 Thread via GitHub


github-actions[bot] closed issue #2233: Replace table shouldn't clear table 
history
URL: https://github.com/apache/iceberg/issues/2233





Re: [I] Flink schema validation always has checkNullability and checkOrdering set true. Any particular reason ? [iceberg]

2024-04-07 Thread via GitHub


github-actions[bot] commented on issue #2235:
URL: https://github.com/apache/iceberg/issues/2235#issuecomment-2041658540

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] Replace table shouldn't clear table history [iceberg]

2024-04-07 Thread via GitHub


github-actions[bot] commented on issue #2233:
URL: https://github.com/apache/iceberg/issues/2233#issuecomment-2041658532

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'





Re: [I] Flink schema validation always has checkNullability and checkOrdering set true. Any particular reason ? [iceberg]

2024-04-07 Thread via GitHub


github-actions[bot] closed issue #2235: Flink schema validation always has 
checkNullability and checkOrdering set true. Any particular reason ?
URL: https://github.com/apache/iceberg/issues/2235





Re: [PR] Change DataScan to accept Metadata and io [iceberg-python]

2024-04-07 Thread via GitHub


HonahX commented on PR #581:
URL: https://github.com/apache/iceberg-python/pull/581#issuecomment-2041645715

   Other than that, the refactor looks good to me. Thanks @Fokko !





Re: [PR] Change DataScan to accept Metadata and io [iceberg-python]

2024-04-07 Thread via GitHub


HonahX commented on code in PR #581:
URL: https://github.com/apache/iceberg-python/pull/581#discussion_r1555108664


##
pyiceberg/io/pyarrow.py:
##
@@ -1089,7 +1091,7 @@ def project_table(
 deletes_per_file.get(task.file.file_path),
 case_sensitive,
 limit,
-table.name_mapping(),
+None,

Review Comment:
   I opened a PR in your repo containing the change I proposed and the test 
that can reproduce the issue: https://github.com/Fokko/iceberg-python/pull/1



##
pyiceberg/io/pyarrow.py:
##
@@ -1089,7 +1091,7 @@ def project_table(
 deletes_per_file.get(task.file.file_path),
 case_sensitive,
 limit,
-table.name_mapping(),
+None,

Review Comment:
   I opened a PR in your repo containing the change I proposed and tests that 
can reproduce the issue: https://github.com/Fokko/iceberg-python/pull/1






Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-07 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1555107443


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(, parquet_schema: ) -> 
Result> {
+if let Some(predicates) =  {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: _id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(, parquet_schema: ) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not( self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true( self) -> Result {
+Ok(())
+}
+
+fn visit_always_false( self) -> Result {
+Ok(())
+}
+
+fn visit_unary( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary( self, predicate: ) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference( self, reference: ) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+struct PredicateConverter<'a> {
+pub columns: &'a Vec,
+pub projection_mask: ProjectionMask,
+pub parquet_schema: &'a SchemaDescriptor,
+pub column_map: &'a HashMap,
+}
+
+fn get_arrow_datum(datum: ) -> Box {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Box::new(BooleanArray::new_scalar(*value)),
+PrimitiveLiteral::Int(value) => 
Box::new(Int32Array::new_scalar(*value)),
+PrimitiveLiteral::Long(value) => 
Box::new(Int64Array::new_scalar(*value)),
+PrimitiveLiteral::Float(value) => 

Re: [PR] feat: Convert predicate to arrow filter and push down to parquet reader [iceberg-rust]

2024-04-07 Thread via GitHub


viirya commented on code in PR #295:
URL: https://github.com/apache/iceberg-rust/pull/295#discussion_r1555106245


##
crates/iceberg/src/arrow.rs:
##
@@ -113,6 +143,405 @@ impl ArrowReader {
 // TODO: full implementation
 ProjectionMask::all()
 }
+
+fn get_row_filter(, parquet_schema: ) -> 
Result> {
+if let Some(predicates) =  {
+let field_id_map = self.build_field_id_map(parquet_schema)?;
+
+// Collect Parquet column indices from field ids
+let column_indices = predicates
+.iter()
+.map(|predicate| {
+let mut collector = CollectFieldIdVisitor { field_ids: 
vec![] };
+collector.visit_predicate(predicate).unwrap();
+collector
+.field_ids
+.iter()
+.map(|field_id| {
+field_id_map.get(field_id).cloned().ok_or_else(|| {
+Error::new(ErrorKind::DataInvalid, "Field id 
not found in schema")
+})
+})
+.collect::>>()
+})
+.collect::>>()?;
+
+// Convert BoundPredicates to ArrowPredicates
+let mut arrow_predicates = vec![];
+for (predicate, columns) in 
predicates.iter().zip(column_indices.iter()) {
+let mut converter = PredicateConverter {
+columns,
+projection_mask: ProjectionMask::leaves(parquet_schema, 
columns.clone()),
+parquet_schema,
+column_map: _id_map,
+};
+let arrow_predicate = converter.visit_predicate(predicate)?;
+arrow_predicates.push(arrow_predicate);
+}
+Ok(Some(RowFilter::new(arrow_predicates)))
+} else {
+Ok(None)
+}
+}
+
+/// Build the map of field id to Parquet column index in the schema.
+fn build_field_id_map(, parquet_schema: ) -> 
Result> {
+let mut column_map = HashMap::new();
+for (idx, field) in parquet_schema.columns().iter().enumerate() {
+let field_type = field.self_type();
+match field_type {
+ParquetType::PrimitiveType { basic_info, .. } => {
+if !basic_info.has_id() {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column {:?} in schema doesn't have 
field id",
+field_type
+),
+));
+}
+column_map.insert(basic_info.id(), idx);
+}
+ParquetType::GroupType { .. } => {
+return Err(Error::new(
+ErrorKind::DataInvalid,
+format!(
+"Leave column in schema should be primitive type 
but got {:?}",
+field_type
+),
+));
+}
+};
+}
+
+Ok(column_map)
+}
+}
+
+/// A visitor to collect field ids from bound predicates.
+struct CollectFieldIdVisitor {
+field_ids: Vec,
+}
+
+impl BoundPredicateVisitor for CollectFieldIdVisitor {
+type T = ();
+type U = ();
+
+fn and( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn or( self, _predicates: Vec) -> Result {
+Ok(())
+}
+
+fn not( self, _predicate: Self::T) -> Result {
+Ok(())
+}
+
+fn visit_always_true( self) -> Result {
+Ok(())
+}
+
+fn visit_always_false( self) -> Result {
+Ok(())
+}
+
+fn visit_unary( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_binary( self, predicate: ) 
-> Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn visit_set( self, predicate: ) -> 
Result {
+self.bound_reference(predicate.term())?;
+Ok(())
+}
+
+fn bound_reference( self, reference: ) -> 
Result {
+self.field_ids.push(reference.field().id);
+Ok(())
+}
+}
+
+struct PredicateConverter<'a> {
+pub columns: &'a Vec,
+pub projection_mask: ProjectionMask,
+pub parquet_schema: &'a SchemaDescriptor,
+pub column_map: &'a HashMap,
+}
+
+fn get_arrow_datum(datum: ) -> Box {
+match datum.literal() {
+PrimitiveLiteral::Boolean(value) => 
Box::new(BooleanArray::new_scalar(*value)),
+PrimitiveLiteral::Int(value) => 
Box::new(Int32Array::new_scalar(*value)),
+PrimitiveLiteral::Long(value) => 
Box::new(Int64Array::new_scalar(*value)),
+PrimitiveLiteral::Float(value) => 

Re: [PR] Change DataScan to accept Metadata and io [iceberg-python]

2024-04-07 Thread via GitHub


HonahX commented on code in PR #581:
URL: https://github.com/apache/iceberg-python/pull/581#discussion_r1555104577


##
pyiceberg/io/pyarrow.py:
##
@@ -1089,7 +1091,7 @@ def project_table(
 deletes_per_file.get(task.file.file_path),
 case_sensitive,
 limit,
-table.name_mapping(),
+None,

Review Comment:
   I think we need to move the `name_mapping()` method to `table_metadata` and call 
it here:
   ```python
   table_metadata.name_mapping()
   ```
   Otherwise, we no longer use the name-mapping feature during reading.
   
   For example, if I add
   ```
   print(tbl.scan().to_pandas())
   ```
   to the end of this test: 
https://github.com/apache/iceberg-python/blob/main/tests/integration/test_add_files.py#L214
   `project_table` will complain that the field-id is not found in the added 
parquet file, even though the table does have name-mapping configured.
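   
   For illustration, a minimal sketch of that idea, assuming the default name-mapping 
is stored under the `schema.name-mapping.default` table property and parsed with 
PyIceberg's `parse_mapping_from_json`; the helper name here is made up:
   ```python
   from typing import Optional

   from pyiceberg.table.metadata import TableMetadata
   from pyiceberg.table.name_mapping import NameMapping, parse_mapping_from_json


   def name_mapping_from_metadata(table_metadata: TableMetadata) -> Optional[NameMapping]:
       """Return the default name-mapping stored in the table properties, if any."""
       if mapping_json := table_metadata.properties.get("schema.name-mapping.default"):
           return parse_mapping_from_json(mapping_json)
       return None
   ```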






Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]

2024-04-07 Thread via GitHub


HonahX merged PR #543:
URL: https://github.com/apache/iceberg-python/pull/543





Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]

2024-04-07 Thread via GitHub


HonahX commented on PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#issuecomment-2041611486

   Merged, Thanks!





Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub


viirya commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041597213

   > - calling the writer to write the DataFile
   > - create an instance of MergingSnapshotProducer -> responsible for writing the 
manifest, manifest_list, snapshot_update
   > - commit -> update_table() on the Catalog with TableUpdate & TableRequirements
   
   If any error happens while generating metadata-related info like the manifest 
etc., since the writer has already written the DataFiles, should we go and delete 
the written DataFiles?
   
   > I think your understanding is correct - and I agree if the writer API 
already does the conversion from RecordBatch to DataFile, the Transaction 
shouldn't be concerned with this issue, since it is a higher-level API. 
However, the Transaction calls the writer that writes the actual DataFile, 
which seems reasonable.
   
   I think this is also what the Python implementation does. In 
`Transaction.append`, it calls `_dataframe_to_data_files` to generate DataFiles 
based on the `pa.Table`.
   
   > we create a Transaction that basically does two things:
   > 2.1. It creates a _MergingSnapshotProducer which is (on a high-level) 
responsible for writing a new ManifestList, creating a new Snapshot (returned 
as AddSnapshotUpdate)
   
   Yeah, specifically, it is a `FastAppendFiles` for appending files, although 
the manifest commit logic is actually implemented in `_MergingSnapshotProducer`.
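   
   For reference, a minimal sketch of how that flow looks from the caller's side in 
PyIceberg; the catalog name, table identifier, and sample data are placeholders:
   ```python
   import pyarrow as pa
   from pyiceberg.catalog import load_catalog

   catalog = load_catalog("default")        # placeholder catalog name
   table = catalog.load_table("db.events")  # placeholder table identifier

   batch = pa.table({"id": [1, 2], "value": ["a", "b"]})
   # append() writes the data files and then commits a new snapshot via a transaction
   table.append(batch)
   ```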
   
   
   
   





Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

2024-04-07 Thread via GitHub


Fokko commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2041585527

   Oof, this is a big one. Thanks for reporting this @gwindes and thanks 
@kevinjqliu for jumping on this, and getting to the bottom of it. I'm also 
looping in @HonahX here since we want to include this fix in 0.6.1.
   
   > @Fokko Are you familiar with this behavior? I can't find any 
documentations on it. The original PR by Ryan 
(https://github.com/apache/iceberg/pull/601) suggests that this was done to be 
compatible with Avro, since [Avro spec does not allow special 
characters](https://avro.apache.org/docs/1.11.1/specification/#names).
   
   I don't think Avro is the issue; we reference the fields using the field-id.
   
   - For writing, we want to have the same behavior as Java.
   - For reading, we have an additional step in PyIceberg: when 
reading, we read the Parquet files using the original column names, and we 
rename the fields [afterward in this 
visitor](https://github.com/apache/iceberg-python/blob/4148edb5e28ae88024a55e0b112238e65b873957/pyiceberg/io/pyarrow.py#L1137).
 We could correct it there, but we want to make sure that we don't write any 
invalid Parquet filenames in the first place.
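   
   Illustrative only (the file and column names below are made up): the same 
read-then-rename pattern with plain pyarrow:
   ```python
   import pyarrow.parquet as pq

   # read the Parquet file with its original column names
   table = pq.read_table("part-00000.parquet")
   # rename the columns afterwards, analogous to the post-read rename step above
   renamed = table.rename_columns(["sale_total", "zone"])
   ```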





Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]

2024-04-07 Thread via GitHub


Fokko commented on code in PR #585:
URL: https://github.com/apache/iceberg-python/pull/585#discussion_r1555048186


##
pyiceberg/catalog/hive.py:
##
@@ -199,6 +184,7 @@ def _annotate_namespace(database: HiveDatabase, properties: 
Properties) -> HiveD
 DateType: "date",
 TimeType: "string",
 TimestampType: "timestamp",
+TimestamptzType: "timestamp",

Review Comment:
   Thanks for the context; it has been a while since I worked with the Hive catalog.
   
   > How about we default to "timestamp with local time zone" and add a 
HiveCatalog property (e.g. hive.hive2-compatible-mode) to use "timestamp" here 
when it is set to true?
   
   I think that would be a great solution.
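   
   A minimal sketch of that proposal, assuming the `hive.hive2-compatible-mode` 
property name from the quote above; the helper is illustrative, not the actual 
PyIceberg code:
   ```python
   def hive_type_for_timestamptz(catalog_properties: dict) -> str:
       """Pick the Hive column type used for an Iceberg TimestamptzType column."""
       if str(catalog_properties.get("hive.hive2-compatible-mode", "false")).lower() == "true":
           return "timestamp"  # Hive 2.x has no "timestamp with local time zone"
       return "timestamp with local time zone"
   ```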






Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

2024-04-07 Thread via GitHub


kevinjqliu commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2041559077

   > Further research shows that when I use 
[daft](https://www.getdaft.io/projects/docs/en/latest/user_guide/integrations/iceberg.html#reading-a-table)
 that I'm able to read and use the to_arrow() functionality just fine. This is 
interesting especially because daft utilizes pyiceberg.
   
   The column name transformation behavior is part of the Java Iceberg spec 
when reading/writing parquet files. Specifically, the transformed schema is 
pushed down to parquet reader/writer. 
   I suspect this is happening since the Java parquet implementation supports 
both Avro and parquet schema (See [parquet 
cli](https://github.com/apache/parquet-mr/blob/db4183109d5b734ec5930d870cdae161e408ddba/parquet-cli/src/main/java/org/apache/parquet/cli/commands/SchemaCommand.java#L106-L111)).
 So to be compatible with both parquet and Avro schemas, this column name 
transformation behavior is used. 
   
   From what I've seen, libraries in other languages do not do this. This means 
these libraries can read/write parquet files having special characters in their 
column names.
   
   Daft uses the Rust Arrow library which can read parquet files with special 
characters in their column names. 
   Similarly, pyarrow can read it as well. 
   
   I checked major parquet libraries in Python, Rust, Golang and they can all 
support reading special characters in parquet column names.
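   
   A quick way to check that with plain pyarrow (the file path and column names 
below are made up):
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Parquet itself round-trips column names containing special characters
   table = pa.table({"total sales ($)": [1, 2, 3], "region/zone": ["a", "b", "c"]})
   pq.write_table(table, "/tmp/special_columns.parquet")
   print(pq.read_table("/tmp/special_columns.parquet").column_names)
   # ['total sales ($)', 'region/zone']
   ```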
   
   
   
   
   





Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

2024-04-07 Thread via GitHub


kevinjqliu commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2041557140

   @gwindes please take a look at #590, it worked for me locally using the 
[sample 
project](https://github.com/gwindes/pyiceberg-arrow-bug/blob/main/column_name_test.py)
 you provided above


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

2024-04-07 Thread via GitHub


kevinjqliu commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2041549075

   @fokko Are you familiar with this behavior? I can't find any documentation 
on it. The original PR by Ryan ([apache/iceberg 
#601](https://github.com/apache/iceberg/pull/601)) suggests that this was done 
to be compatible with Avro, since the [Avro spec does not allow special 
characters](https://avro.apache.org/docs/1.11.1/specification/#names).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub


marvinlanhenke commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041548391

   > I'm not sure whether my understanding is correct: The target of 
`table.append()` is to insert a batch of data into the table. It seems 
like a high-level API which will use two lower-level APIs:
   > 
   > 1. [writer API](https://github.com/apache/iceberg-rust/issues/34) for 
converting a RecordBatch to a DataFile
   > 2. [transaction 
API](https://github.com/apache/iceberg-rust/blob/ca9de89ac9d95683c8fe9191f72ab922dc4c7672/crates/iceberg/src/transaction.rs#L30)
  for committing the DataFile (updating the table metadata)
   > 
   > To separate these two interfaces, I think we don't need to delegate the 
conversion between `RecordBatch` and `DataFile` to the transaction.
   
   I think your understanding is correct, and I agree: if the writer API 
already does the conversion from RecordBatch to DataFile, the Transaction 
shouldn't be concerned with it, since the Transaction is a higher-level API. 
However, the Transaction still calls the writer that writes the actual 
DataFile, which seems reasonable. 
   
   So the Transaction `append` (if I understand the py impl correctly) does all 
of those things:
   - calling the writer to write the DataFile
   - create an instance of MergingSnapshotProducer -> responsible for writing 
the manifest, manifest_list, snapshot_update
   - commit -> update_table() on the Catalog with TableUpdate & 
TableRequirements
   
   @ZENOTME 
   Where would the writer API (which I only know from the design spec in #34) 
fit best here? Should a Transaction create a new writer every time a new 
transaction is created? Or should the Table itself hold a ref to a writer?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] [BUG] Valid column characters fail on to_arrow() or to_pandas() ArrowInvalid: No match for FieldRef.Name [iceberg-python]

2024-04-07 Thread via GitHub


kevinjqliu commented on issue #584:
URL: https://github.com/apache/iceberg-python/issues/584#issuecomment-2041546390

   This was a super interesting deep dive. 
   
   So Iceberg has an obscure behavior of transforming column names with special 
characters. As you see above, `TEST:A1B2.RAW.ABC-GG-1-A` is transformed into 
`TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA`. This is mentioned in #83 and 
refers to the 
[AvroSchemaUtil::makeCompatibleName](https://github.com/apache/iceberg/blob/ad602a379584512d1d96eda557c20cf2af21d1b2/core/src/main/java/org/apache/iceberg/avro/AvroSchemaUtil.java#L429)
 function.
   
   ### Java Iceberg Behavior
   When there is a special character in the column name, Iceberg will transform 
the column name first before writing to parquet. The resulting parquet file 
will have the transformed column name while Iceberg retains the original column 
name in the metadata. 
   When writing, Iceberg will write parquet files with the transformed column 
name. When reading, Iceberg will perform the transformation to read the 
transformed column name. This is done by matching the column id. 
   
   ### Python Iceberg Behavior
   The issue in PyIceberg here is not the read side, it's the write side! When 
an Iceberg table's column name has special characters, the parquet files should 
contain the transformed column name. Instead, PyIceberg writes the column name 
with the special characters. 
   
   That is the issue above: there is a mismatch between the expected column 
name (transformed, `TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA`) and the actual 
column name (untransformed, `TEST:A1B2.RAW.ABC-GG-1-A`).
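   For illustration, a rough Python approximation of what the Java 
`makeCompatibleName` sanitization does (escape any character that is not valid 
in an Avro name as `_x` plus its hex code); this is a sketch, not the exact 
Java implementation, and it skips edge cases such as names starting with a 
digit:

```python
import re

AVRO_NAME_CHAR = re.compile(r"[A-Za-z0-9_]")


def make_compatible_name(name: str) -> str:
    # Escape every character that is not a letter, digit, or underscore
    # as '_x' followed by its uppercase hex code point.
    return "".join(ch if AVRO_NAME_CHAR.match(ch) else f"_x{ord(ch):X}" for ch in name)


assert (
    make_compatible_name("TEST:A1B2.RAW.ABC-GG-1-A")
    == "TEST_x3AA1B2_x2ERAW_x2EABC_x2DGG_x2D1_x2DA"
)
```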
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Spark rewrite Files Action OOM [iceberg]

2024-04-07 Thread via GitHub


nk1506 commented on issue #10054:
URL: https://github.com/apache/iceberg/issues/10054#issuecomment-2041538414

   > @nk1506 Echoing Russell's comments, how many small files are there in your 
OOM case? How much memory do you set up?
   
   I didn't use the Spark engine for compaction; I was using the Java client 
API, so my queries might distract from the original problem. My requirement is 
to compact very large datasets (say 10K data files) in a single commit, and 
using 
[RewriteFiles](https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/RewriteFiles.java#L171)
 for that can always cause OOM. So I am looking for something that can help 
manage manifest files more intelligently. I think I will start a different 
thread to discuss the other problem. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub


ZENOTME commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041528557

   I'm not sure whether my understanding is correct: 
   The target of `table.append()` is to insert a batch of data into the 
table. It seems like a high-level API which will use two lower-level APIs:
   1. [writer API](https://github.com/apache/iceberg-rust/issues/34) for 
converting a RecordBatch to a DataFile
   2. [transaction 
API](https://github.com/apache/iceberg-rust/blob/ca9de89ac9d95683c8fe9191f72ab922dc4c7672/crates/iceberg/src/transaction.rs#L30)
  for committing the DataFile (updating the table metadata)
   
   To separate these two interfaces, I think we don't need to delegate the 
conversion between `RecordBatch` and `DataFile` to the transaction. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub


sdd commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041520090

   Thanks for spending the time thinking about this and putting your thoughts 
into words. I need to spend some time re-reading the associated parts of the 
spec and looking through the Java and possibly Python implementations before 
being able to comment. I should get a chance tomorrow. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] In case of Positional Deletes "file_path" in lowerbound & upperbound do not have full file_path Buffer [iceberg]

2024-04-07 Thread via GitHub


agrawalreetika commented on issue #10064:
URL: https://github.com/apache/iceberg/issues/10064#issuecomment-2041518960

   Thanks for your input @singhpk234. I will look into the write path on the 
Presto side to see where the truncation is happening as part of the write.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]

2024-04-07 Thread via GitHub


amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554993449


##
tests/io/test_fsspec.py:
##
@@ -586,6 +597,25 @@ def 
test_writing_avro_file_gcs(generated_manifest_entry_file: str, fsspec_fileio
 fsspec_fileio_gcs.delete(f"gs://warehouse/{filename}")
 
 
+@pytest.mark.gcs
+def test_fsspec_pickle_roundtrip_gcs(fsspec_fileio_gcs: FsspecFileIO) -> None:
+_test_fsspec_pickle_round_trip(fsspec_fileio_gcs, "gs://warehouse/foo.txt")
+
+
+def _test_fsspec_pickle_round_trip(fsspec_fileio: FsspecFileIO, location: str) 
-> None:
+serialized_file_io = pickle.dumps(fsspec_fileio)
+deserialized_file_io = pickle.loads(serialized_file_io)
+output_file = deserialized_file_io.new_output(location)
+with output_file.create() as f:
+f.write(b"foo")
+
+input_file = deserialized_file_io.new_input(location)
+with input_file.open() as f:
+data = f.read()
+assert data == b"foo"
+assert len(input_file) == 3
+

Review Comment:
   Good idea; tests in general should be re-runnable, and to do that we should 
clean up the resource at the end!
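   For readers following along, the general shape of the pickling pattern this 
PR is about looks roughly like the sketch below (names are illustrative, not 
the actual PyArrowFileIO/FsspecFileIO code): drop the unpicklable filesystem 
handle in `__getstate__` and rebuild it in `__setstate__`.

```python
import pickle


class ExampleFileIO:
    """Illustrative sketch only, not the PR's implementation."""

    def __init__(self, properties: dict):
        self.properties = properties
        self._client = self._create_client()

    def _create_client(self):
        # Stand-in for an fsspec / pyarrow filesystem handle that cannot be pickled.
        return object()

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_client", None)  # drop the unpicklable handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._client = self._create_client()  # rebuild it after unpickling


restored = pickle.loads(pickle.dumps(ExampleFileIO({"region": "us-east-1"})))
assert restored.properties == {"region": "us-east-1"}
```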



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]

2024-04-07 Thread via GitHub


amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554993337


##
tests/io/test_fsspec.py:
##
@@ -61,7 +62,7 @@ def test_fsspec_new_input_file(fsspec_fileio: FsspecFileIO) 
-> None:
 assert input_file.location == f"s3://warehouse/{filename}"
 
 
-@pytest.mark.s3
+@pytest.mark.s3fsspec_file_io

Review Comment:
   Ah good catch, I think this was a copy/paste bug (somehow pasted 
fsspec_file_io on this line by mistake)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]

2024-04-07 Thread via GitHub


amogh-jahagirdar commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554993080


##
tests/io/test_fsspec.py:
##
@@ -586,6 +597,25 @@ def 
test_writing_avro_file_gcs(generated_manifest_entry_file: str, fsspec_fileio
 fsspec_fileio_gcs.delete(f"gs://warehouse/{filename}")
 
 
+@pytest.mark.gcs
+def test_fsspec_pickle_roundtrip_gcs(fsspec_fileio_gcs: FsspecFileIO) -> None:
+_test_fsspec_pickle_round_trip(fsspec_fileio_gcs, "gs://warehouse/foo.txt")
+
+
+def _test_fsspec_pickle_round_trip(fsspec_fileio: FsspecFileIO, location: str) 
-> None:
+serialized_file_io = pickle.dumps(fsspec_fileio)

Review Comment:
   Agreed! I can take up renaming in a separate PR so it's easier to review.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Spark configuration for amazon access key and secret key with glue catalog for apache Iceberg is not honoring [iceberg]

2024-04-07 Thread via GitHub


AwasthiSomesh commented on issue #10078:
URL: https://github.com/apache/iceberg/issues/10078#issuecomment-2041494668

   @lxs360
do you have any solution for the above query?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Spark configuration for amazon access key and secret key with glue catalog for apache Iceberg is not honoring [iceberg]

2024-04-07 Thread via GitHub


AwasthiSomesh commented on issue #10078:
URL: https://github.com/apache/iceberg/issues/10078#issuecomment-2041494046

   @nastra  I checked everything. The Spark configuration for Amazon S3 does 
not seem to be honored for Iceberg, but
   
   System.setProperty("aws.region", "us-west-2");
   System.setProperty("aws.accessKeyId", "xxx");
   System.setProperty("aws.secretAccessKey", "x");
   
   the settings above work fine. I am not sure what configuration is needed for 
Spark; it always asks for the default credentials provider chain.
   
   Thanks,
   Somesh
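   For reference, a hedged PySpark sketch of how the catalog-scoped credential 
properties are usually passed (the catalog name, bucket, and keys are 
placeholders, and it assumes the Iceberg Spark runtime and AWS bundle jars are 
on the classpath; whether the Glue client honors the static keys or falls back 
to the default provider chain can depend on the Iceberg version in use):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-glue-example")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")
    .config("spark.sql.catalog.glue.client.region", "us-west-2")
    # Static credentials for S3FileIO; placeholders, do not hard-code real keys.
    .config("spark.sql.catalog.glue.s3.access-key-id", "<access-key>")
    .config("spark.sql.catalog.glue.s3.secret-access-key", "<secret-key>")
    .getOrCreate()
)
```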


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Update Roadmap / Close old Issues [iceberg-rust]

2024-04-07 Thread via GitHub


marvinlanhenke commented on issue #330:
URL: https://github.com/apache/iceberg-rust/issues/330#issuecomment-2041475658

   cc @liurenjie1024 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Update Roadmap / Close old Issues [iceberg-rust]

2024-04-07 Thread via GitHub


marvinlanhenke opened a new issue, #330:
URL: https://github.com/apache/iceberg-rust/issues/330

   I think we should update the roadmap, which I believe is really helpful not 
only to provide a general direction but also to attract more contributors. 
Also, we have some 'older' issues that seem to be done already or are no longer 
relevant. Perhaps we can do some grooming here (I'd be happy to help).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Discussion: Next Steps / Requirement to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub


marvinlanhenke opened a new issue, #329:
URL: https://github.com/apache/iceberg-rust/issues/329

   ...out of curiosity, I took a closer look at the pyiceberg impl and how 
`Table.append()` works.

   Now, I would like to pick your brain in order to understand and track the 
next steps we have to take to support `append` as well (since we should be 
getting close to having write support). The goal here is to extract and create 
actionable issues.
   
   Here is what I understand from the python impl so far (high-level):
   ---
   1. we call `append()` on the Table class with our DataFrame: pa.Table and 
the snapshot_properties: Dict[str, str]
   2. we create a `Transaction` that basically does two things:
   2.1. It creates a `_MergingSnapshotProducer` which is (at a high level) 
responsible for writing a new ManifestList and creating a new Snapshot 
(returned as an AddSnapshotUpdate)
   2.2 It calls `update_table` on the respective Catalog, which creates a new 
metadata.json and returns the new metadata as well as the new metadata_location
   
   
[pyiceberg-link](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L1314)
   
   Here is what I think we need to implement (rough sketch):
   ---
   1. 
[impl](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L1314)
 `fn append(...)` on `struct Table`:
   This should probably accept a RecordBatch as a param, create a new 
`Transaction`, and delegate further action to the transaction.
   2. 
[impl](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L362)
 `fn append(...)` on `struct Transaction`:
   Receives the RecordBatch and snapshot_properties. Performs validation 
checks. Converts the RecordBatch to a collection of `DataFiles` and creates a 
`_MergingSnapshotProducer` with the collection.
   3. 
[impl](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L2745)
 `_MergingSnapshotProducer`:
   :: write manifests (added, deleted, existing)
   :: get next_sequence_number from `TableMetadata`
   :: update snapshot summaries 
   :: generate manifest_list_path
   :: write manifest_list
   :: create a new Snapshot
   :: return TableUpdate: AddSnapshot
   4. impl `update_table` on the concrete Catalog implementations
   
   What could be the possible issues here?
   I think we need to start with the `_MergingSnapshotProducer` (possibly split 
into multiple parts) and work our way up the list.
   Once we have the MergingSnapshotProducer, we can implement the append 
function on Transaction, which basically orchestrates the rest.
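   To make the outline above concrete, here is a small runnable Python stub of 
the overall flow (the names mirror the pyiceberg implementation referenced 
above; none of this is an existing iceberg-rust API, it is only a sketch of the 
orchestration being discussed):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFile:
    path: str


@dataclass
class TableUpdate:
    kind: str
    payload: object = None


def write_data_files(record_batch) -> List[DataFile]:
    # writer API: RecordBatch -> DataFile(s) written as parquet
    return [DataFile(path="s3://bucket/data/00000.parquet")]


class MergingSnapshotProducer:
    def __init__(self, data_files: List[DataFile], snapshot_properties: dict):
        self.data_files = data_files
        self.snapshot_properties = snapshot_properties

    def commit(self) -> TableUpdate:
        # write manifests + manifest list, build the new Snapshot
        return TableUpdate(kind="add-snapshot", payload=self.data_files)


class Transaction:
    def __init__(self):
        self.updates: List[TableUpdate] = []

    def append(self, record_batch, snapshot_properties: Optional[dict] = None) -> "Transaction":
        data_files = write_data_files(record_batch)
        producer = MergingSnapshotProducer(data_files, snapshot_properties or {})
        self.updates.append(producer.commit())
        return self

    def commit_transaction(self) -> List[TableUpdate]:
        # catalog.update_table(updates, requirements) -> new metadata.json
        return self.updates


print(Transaction().append(record_batch=[{"id": 1}]).commit_transaction())
```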


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Discussion: Next Steps / Requirement to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub


marvinlanhenke commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041473484

   cc @liurenjie1024 @Xuanwo @viirya @sdd @ZENOTME @Fokko 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] chore(deps): Bump apache/skywalking-eyes from 0.5.0 to 0.6.0 [iceberg-rust]

2024-04-07 Thread via GitHub


dependabot[bot] opened a new pull request, #328:
URL: https://github.com/apache/iceberg-rust/pull/328

   Bumps [apache/skywalking-eyes](https://github.com/apache/skywalking-eyes) 
from 0.5.0 to 0.6.0.
   
   Changelog: Sourced from [apache/skywalking-eyes's 
changelog](https://github.com/apache/skywalking-eyes/blob/main/CHANGES.md).
   
   0.6.0
   
   Add instructions to fix header issues in markdown comment.
   Add Eclipse Foundation specific Apache 2.0 license header.
   Add support for OPA policy files, protobuf.
   Add weak-compatible check to dependency check.
   
   
   
   
   Commits
   - cd7b195 Draft release notes for 0.6.0 (#181)
   - 6753eae bump action/setup-go to v5 (#180)
   - 9753868 add instructions to fix header issues in markdown comment (#179)
   - e6d1ce4 add Eclipse Foundation specific Apache 2.0 license header (#178)
   - ed436a5 feature: add support for OPA policy files (#174)
   - ee81ff7 feature: add support for Protocol Buffer (#172)
   - a790ab8 feature: add weak-compatible to dependency check (#171)
   - 45c9d9c Correct the way of joining slack channels (#169)
   - e19b828 Add | as comment indicator (#168)
   - See full diff in [compare 
view](https://github.com/apache/skywalking-eyes/compare/v0.5.0...v0.6.0)
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=apache/skywalking-eyes=github_actions=0.5.0=0.6.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] chore(deps): Update volo-thrift requirement from 0.9.2 to 0.10.0 [iceberg-rust]

2024-04-07 Thread via GitHub


dependabot[bot] opened a new pull request, #326:
URL: https://github.com/apache/iceberg-rust/pull/326

   Updates the requirements on [volo-thrift](https://github.com/cloudwego/volo) 
to permit the latest version.
   
   Commits: See full diff in [compare 
view](https://github.com/cloudwego/volo/compare/volo-cli-0.9.2...volo-cli-0.9.2)
   
   
   
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] chore(deps): Update pilota requirement from 0.10.0 to 0.11.0 [iceberg-rust]

2024-04-07 Thread via GitHub


dependabot[bot] opened a new pull request, #327:
URL: https://github.com/apache/iceberg-rust/pull/327

   Updates the requirements on [pilota](https://github.com/cloudwego/pilota) to 
permit the latest version.
   
   Release notes: Sourced from [pilota's 
releases](https://github.com/cloudwego/pilota/releases).
   
   Pilota 0.10.0
   What's Changed
   - fix(pilota-build): cyclic gen code by @PureWhiteWu in cloudwego/pilota#205
   - fix(pilota-build): cyclic gen code by @PureWhiteWu in cloudwego/pilota#206
   - fix: missing i8 type by @junhaideng in cloudwego/pilota#183
   - fix(pilota-build): indirect cycle by @PureWhiteWu in cloudwego/pilota#207
   - fix: protobuf message by @PureWhiteWu in cloudwego/pilota#208
   - fix: remove useless feature by @PureWhiteWu in cloudwego/pilota#209
   - fix: default value path incorrect by @Millione in cloudwego/pilota#210
   - fix: escape keywords in path resolver by @Ggiggle in cloudwego/pilota#211
   - fix: escape keyword when resolving path in workspace by @Ggiggle in 
cloudwego/pilota#212
   - fix: escape keywords when generate code by @Ggiggle in cloudwego/pilota#213
   - fix: generate struct in common crate rather than its own crate by 
@Millione in cloudwego/pilota#214
   - feat: allow duplicate name by dedup_list by @Millione in 
cloudwego/pilota#215
   - fix: generate i8 for thrift byte by @Millione in cloudwego/pilota#216
   - feat(pilota-build): add feature ahash for choosing hasher by @wfly1998 in 
cloudwego/pilota#218
   - feat: use ahashmap instead of hashmap by @PureWhiteWu in 
cloudwego/pilota#219
   - chore: move test dep to dev-dep by @PureWhiteWu in cloudwego/pilota#220
   - chore: bump version to 0.10.0 by @PureWhiteWu in cloudwego/pilota#221
   
   New Contributors
   - @junhaideng made their first contribution in cloudwego/pilota#183
   - @Ggiggle made their first contribution in cloudwego/pilota#211
   - @wfly1998 made their first contribution in cloudwego/pilota#218
   
   Full Changelog: 
https://github.com/cloudwego/pilota/compare/pilota-0.9.0...pilota-0.10.0
   
   Commits: See full diff in [compare 
view](https://github.com/cloudwego/pilota/compare/pilota-0.10.0...pilota-0.10.0)
   
   
   
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show  ignore 

[I] Docs: Fix links of Get Started and Community sections in footer [iceberg]

2024-04-07 Thread via GitHub


wayneguow opened a new issue, #10099:
URL: https://github.com/apache/iceberg/issues/10099

   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   None
   
   ### Please describe the bug 
   
   When browsing non-home Iceberg pages, clicking the links in the `Get 
Started` and `Community` sections in the footer fails.
   For example:
   If the page you are currently browsing is 
https://iceberg.apache.org/docs/nightly/#open-standard and you click `Spark 
Quickstart`, the link points to 
https://iceberg.apache.org/docs/nightly/spark-quickstart and you will get a 404 
Not Found error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Move build configurations to project dirs [iceberg]

2024-04-07 Thread via GitHub


myskov commented on PR #10097:
URL: https://github.com/apache/iceberg/pull/10097#issuecomment-2041440695

   I accidentally copied build configurations from the old revision. This is 
now fixed, all tests passed. Please take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Docs: Fix links of `Get Started` and `Community` sections in footer [iceberg]

2024-04-07 Thread via GitHub


wayneguow opened a new pull request, #10098:
URL: https://github.com/apache/iceberg/pull/10098

   Fixed the issue by using absolute links instead of relative links.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] add `InclusiveProjection` Visitor [iceberg-rust]

2024-04-07 Thread via GitHub


sdd commented on PR #321:
URL: https://github.com/apache/iceberg-rust/pull/321#issuecomment-2041438110

   Updated to align with the latest `BoundPredicateVisitor` iteration in 
https://github.com/apache/iceberg-rust/pull/320
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Bump mkdocs-material from 9.5.15 to 9.5.17 [iceberg]

2024-04-07 Thread via GitHub


Fokko merged PR #10092:
URL: https://github.com/apache/iceberg/pull/10092


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]

2024-04-07 Thread via GitHub


sdd commented on PR #320:
URL: https://github.com/apache/iceberg-rust/pull/320#issuecomment-2041400457

   I've added a test for `visit_op`, as well as some doc comments.
   
   Also, based on our discussion in the comments, it seemed prudent to add the 
`non_exhaustive` marker to `PredicateOperator`. PTAL @liurenjie1024, 
@marvinlanhenke, @ZENOTME, as I think this is pretty much ready now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement __getstate__ and __setstate__ on PyArrowFileIO and FsSpecFileIO so that they can be pickled [iceberg-python]

2024-04-07 Thread via GitHub


HonahX commented on code in PR #543:
URL: https://github.com/apache/iceberg-python/pull/543#discussion_r1554859960


##
tests/io/test_fsspec.py:
##
@@ -586,6 +597,25 @@ def 
test_writing_avro_file_gcs(generated_manifest_entry_file: str, fsspec_fileio
 fsspec_fileio_gcs.delete(f"gs://warehouse/{filename}")
 
 
+@pytest.mark.gcs
+def test_fsspec_pickle_roundtrip_gcs(fsspec_fileio_gcs: FsspecFileIO) -> None:
+_test_fsspec_pickle_round_trip(fsspec_fileio_gcs, "gs://warehouse/foo.txt")
+
+
+def _test_fsspec_pickle_round_trip(fsspec_fileio: FsspecFileIO, location: str) 
-> None:
+serialized_file_io = pickle.dumps(fsspec_fileio)
+deserialized_file_io = pickle.loads(serialized_file_io)
+output_file = deserialized_file_io.new_output(location)
+with output_file.create() as f:
+f.write(b"foo")
+
+input_file = deserialized_file_io.new_input(location)
+with input_file.open() as f:
+data = f.read()
+assert data == b"foo"
+assert len(input_file) == 3
+

Review Comment:
   ```suggestion
   fsspec_fileio.delete(location)
   ```
   How about deleting the file at the end to make these tests re-runnable?



##
tests/io/test_fsspec.py:
##
@@ -61,7 +62,7 @@ def test_fsspec_new_input_file(fsspec_fileio: FsspecFileIO) 
-> None:
 assert input_file.location == f"s3://warehouse/{filename}"
 
 
-@pytest.mark.s3
+@pytest.mark.s3fsspec_file_io

Review Comment:
   This seems to be an unrelated change



##
tests/io/test_fsspec.py:
##
@@ -586,6 +597,25 @@ def 
test_writing_avro_file_gcs(generated_manifest_entry_file: str, fsspec_fileio
 fsspec_fileio_gcs.delete(f"gs://warehouse/{filename}")
 
 
+@pytest.mark.gcs
+def test_fsspec_pickle_roundtrip_gcs(fsspec_fileio_gcs: FsspecFileIO) -> None:
+_test_fsspec_pickle_round_trip(fsspec_fileio_gcs, "gs://warehouse/foo.txt")
+
+
+def _test_fsspec_pickle_round_trip(fsspec_fileio: FsspecFileIO, location: str) 
-> None:
+serialized_file_io = pickle.dumps(fsspec_fileio)

Review Comment:
   I just realized that we use both `fileio` and `file_io` in the codebase 
(e.g. `fsspec_fileio`, `load_file_io`). It would be good if we could 
consistently use one of them. This may be done in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]

2024-04-07 Thread via GitHub


marvinlanhenke commented on code in PR #320:
URL: https://github.com/apache/iceberg-rust/pull/320#discussion_r1554857281


##
crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs:
##
@@ -0,0 +1,260 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::expr::{BoundPredicate, BoundReference, PredicateOperator};
+use crate::spec::Datum;
+use crate::Result;
+use fnv::FnvHashSet;
+
+pub enum OpLiteral<'a> {
+None,

Review Comment:
   Sounds reasonable enough - thanks for looking into it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] [Bug Fix] Allow HiveCatalog to create table with TimestamptzType [iceberg-python]

2024-04-07 Thread via GitHub


HonahX commented on code in PR #585:
URL: https://github.com/apache/iceberg-python/pull/585#discussion_r1554855296


##
pyiceberg/catalog/hive.py:
##
@@ -199,6 +184,7 @@ def _annotate_namespace(database: HiveDatabase, properties: 
Properties) -> HiveD
 DateType: "date",
 TimeType: "string",
 TimestampType: "timestamp",
+TimestamptzType: "timestamp",

Review Comment:
   Theoretically, the content here should be constants in 
[serdeConstants](https://github.com/apache/hive/blob/master/serde/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/serde/serdeConstants.java#L76-L86),
 since Java uses these to [test the Hive schema 
util](https://github.com/apache/iceberg/blob/main/hive-metastore/src/test/java/org/apache/iceberg/hive/TestHiveSchemaUtil.java#L213).
   Among these types, `timestamp with local time zone` is a new one added in 
Hive 3: [ref](https://github.com/apache/iceberg/pull/1897).
   
   Interestingly, when doing integration tests (using pyiceberg to write and 
Spark to read) I found that the Hive server raises an error for `timestampabc` 
but works normally for `timestamp arbitrary string`. It seems that as long as 
the first word is within serdeConstants, there is no error loading the table. I 
am not sure if `timestamp arbitrary string` would cause errors in other Hive 
use cases.
   
   How about we default to "timestamp with local time zone" and add a 
HiveCatalog property (e.g. `hive.hive2-compatible-mode`) to use "timestamp" 
here when it is set to `true`?
   
   (Will add the integration test soon)
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Move build configurations to project dirs [iceberg]

2024-04-07 Thread via GitHub


myskov commented on PR #10097:
URL: https://github.com/apache/iceberg/pull/10097#issuecomment-2041380311

   I see a lot of CI failures; I will move the PR to draft for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: init iceberg writer [iceberg-rust]

2024-04-07 Thread via GitHub


ZENOTME commented on PR #275:
URL: https://github.com/apache/iceberg-rust/pull/275#issuecomment-2041377659

   > @liurenjie1024 @ZENOTME What's the current status on this PR - as it looks 
very promising as well as the outlined framework in #34 ?
   
   I think this PR is ready to go.
   
   > Since we have already completed some issues (or they are in progress) for 
read support, I think it would be beneficial to outline the next steps for 
implementing write support. Perhaps in another tracking issue (I don't think we 
have none yet?).
   > 
   > I think we have most of the writers in place (when this PR is ready), but 
have yet to 'orchestrate' them?
   
   Yes! Looks good to me. After this basic framework is in, we can create a 
tracking issue for more specific writers. cc @liurenjie1024  @Xuanwo  @Fokko 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add `BoundPredicateVisitor` trait [iceberg-rust]

2024-04-07 Thread via GitHub


sdd commented on code in PR #320:
URL: https://github.com/apache/iceberg-rust/pull/320#discussion_r1554847639


##
crates/iceberg/src/expr/visitors/bound_predicate_visitor.rs:
##
@@ -0,0 +1,260 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::expr::{BoundPredicate, BoundReference, PredicateOperator};
+use crate::spec::Datum;
+use crate::Result;
+use fnv::FnvHashSet;
+
+pub enum OpLiteral<'a> {
+None,

Review Comment:
   I've switched it to Option  
   
   Unfortunately regarding the Single / Set enums, yours is by-value and mine 
is by-reference. Mine only needs to be by-reference due to the use case, and 
refactoring it to by-value would incur unnecessary copies. Yours needs to be 
by-value I think, and a brief attempt to refactor to use my by-reference enum 
was more hassle than it was worth. Considering the simplicity of these enums I 
think it is acceptable to have a bit of duplication here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org