Re: [I] Correct names in the ManifestList [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on issue #354:
URL: https://github.com/apache/iceberg-rust/issues/354#issuecomment-2078683994

   > @Fokko, can you please assign this to me? Thanks!
   
   Done, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1580510837


##
crates/integrations/datafusion/src/catalog.rs:
##
@@ -0,0 +1,94 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, collections::HashMap, sync::Arc};
+
+use datafusion::catalog::{schema::SchemaProvider, CatalogProvider};
+use futures::future::try_join_all;
+use iceberg::{Catalog, NamespaceIdent, Result};
+
+use crate::schema::IcebergSchemaProvider;
+
+/// Provides an interface to manage and access multiple schemas
+/// within an Iceberg [`Catalog`].
+///
+/// Acts as a centralized catalog provider that aggregates
+/// multiple [`SchemaProvider`], each associated with distinct namespaces.
+pub struct IcebergCatalogProvider {
+/// A concurrent `HashMap` where keys are namespace names
+/// and values are dynamic references to objects implementing the
+/// [`SchemaProvider`] trait.
+schemas: HashMap>,
+}
+
+impl IcebergCatalogProvider {
+/// Asynchronously tries to construct a new [`IcebergCatalogProvider`]
+/// using the given client to fetch and initialize schema providers for
+/// each namespace in the Iceberg [`Catalog`].
+///
+/// This method retrieves the list of namespace names
+/// attempts to create a schema provider for each namespace, and
+/// collects these providers into a concurrent `HashMap`.
+pub async fn try_new(client: Arc) -> Result {
+let schema_names: Vec<_> = client
+.list_namespaces(None)
+.await?
+.iter()
+.flat_map(|ns| ns.as_ref().clone())
+.collect();
+
+let providers = try_join_all(
+schema_names
+.iter()
+.map(|name| {
+IcebergSchemaProvider::try_new(
+client.clone(),
+NamespaceIdent::new(name.clone()),
+)
+})
+.collect::>(),
+)
+.await?;
+
+let schemas: Vec<_> = schema_names
+.into_iter()
+.zip(providers.into_iter())
+.map(|(name, provider)| {
+let provider = Arc::new(provider) as Arc;
+(name, provider)
+})
+.collect();
+
+Ok(IcebergCatalogProvider {
+schemas: schemas.into_iter().collect(),
+})
+}
+}
+
+impl CatalogProvider for IcebergCatalogProvider {
+fn as_any() ->  Any {
+self
+}
+
+fn schema_names() -> Vec {
+self.schemas.keys().cloned().collect()

Review Comment:
   I'm thinking that this maybe incorrect since others processes may create new 
namespaces after creating the `try_new` method? We should the result of 
`list_namespaces` each time. For performance issue, we may create sth like 
[`CachingCatalog` in 
java](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/core/src/main/java/org/apache/iceberg/CachingCatalog.java#L46)
 to implement `Catalog` trait, what do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1580506460


##
crates/integrations/datafusion/src/catalog.rs:
##
@@ -0,0 +1,73 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, sync::Arc};
+
+use dashmap::DashMap;
+use datafusion::catalog::{schema::SchemaProvider, CatalogProvider};
+use futures::future::try_join_all;
+use iceberg::{Catalog, NamespaceIdent, Result};
+
+use crate::schema::IcebergSchemaProvider;
+
+pub struct IcebergCatalogProvider {
+schemas: DashMap>,
+}
+
+impl IcebergCatalogProvider {
+pub async fn try_new(client: Arc) -> Result {
+let schema_names: Vec = client
+.list_namespaces(None)
+.await?
+.iter()
+.flat_map(|ns| ns.as_ref().clone())

Review Comment:
   Sounds reasonable to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1580505826


##
crates/integrations/datafusion/src/table.rs:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::datasource::TableProvider;
+use iceberg::{table::Table, Catalog, NamespaceIdent, Result, TableIdent};
+
+pub(crate) struct IcebergTableProvider {
+_inner: Table,
+}
+
+impl IcebergTableProvider {
+pub(crate) async fn try_new(
+client: Arc,
+namespace: NamespaceIdent,
+name: impl Into,
+) -> Result {
+let name = name.into();
+let ident = TableIdent::new(namespace, name);
+let table = client.load_table().await?;
+
+Ok(IcebergTableProvider { _inner: table })
+}
+}
+
+impl TableProvider for IcebergTableProvider {
+fn as_any() ->  std::any::Any {
+self
+}
+
+fn schema() -> datafusion::arrow::datatypes::SchemaRef {
+todo!()
+}
+
+fn table_type() -> datafusion::datasource::TableType {
+todo!()
+}
+
+fn scan<'life0, 'life1, 'life2, 'life3, 'async_trait>(
+&'life0 self,
+_state: &'life1 datafusion::execution::context::SessionState,
+_projection: Option<&'life2 Vec>,
+_filters: &'life3 [datafusion::prelude::Expr],
+_limit: Option,
+) -> core::pin::Pin<
+Box<
+dyn core::future::Future<
+Output = datafusion::error::Result<
+Arc,
+>,
+> + core::marker::Send
++ 'async_trait,
+>,
+>
+where
+'life0: 'async_trait,
+'life1: 'async_trait,
+'life2: 'async_trait,
+'life3: 'async_trait,
+Self: 'async_trait,
+{
+todo!()

Review Comment:
   I took a look at the `ExecutionPlan` trait, and I think the main part if the 
`execute` method which could return an arrow record batch stream, which has 
been implemented by `TableScan::to_arrow`. Other parts, such as partitioning, 
ordering we can leave it as simple as possible. 
   I agree that this is not blocker and we can move it to next pr.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement manifest filtering in `TableScan` [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on PR #323:
URL: https://github.com/apache/iceberg-rust/pull/323#issuecomment-2078666082

   Thanks @sdd for this pr, and @Xuanwo @marvinlanhenke for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement manifest filtering in `TableScan` [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 merged PR #323:
URL: https://github.com/apache/iceberg-rust/pull/323


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-25 Thread via GitHub


rahil-c commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1580469059


##
core/src/test/java/org/apache/iceberg/rest/TestRESTViewCatalog.java:
##
@@ -144,6 +157,65 @@ public void closeCatalog() throws Exception {
 }
   }
 
+  @Test
+  public void testPaginationForListViews() {
+RESTCatalogAdapter adapter = Mockito.spy(new 
RESTCatalogAdapter(backendCatalog));
+RESTCatalog catalog =
+new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) 
-> adapter);
+catalog.initialize("test", 
ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10"));
+
+int numberOfItems = 20;
+int numberOfInvocations = 2;
+String namespaceName = "newdb";
+String viewName = "newview";
+
+// create initial namespace
+catalog().createNamespace(Namespace.of(namespaceName));
+
+// create several views under namespace, based off a table for listing and 
verify
+for (int i = 0; i < numberOfItems; i++) {
+  TableIdentifier viewIndentifier = TableIdentifier.of(namespaceName, 
viewName + i);
+  catalog
+  .buildView(viewIndentifier)
+  .withSchema(SCHEMA)
+  .withDefaultNamespace(viewIndentifier.namespace())
+  .withQuery("spark", "select * from ns.tbl")
+  .create();
+}
+List views = 
catalog.listViews(Namespace.of(namespaceName));
+assertThat(views).hasSize(numberOfItems);
+
+Mockito.verify(adapter)
+.execute(
+eq(HTTPMethod.GET),
+eq("v1/config"),
+any(),
+any(),
+eq(ConfigResponse.class),
+any(),
+any());
+
+Mockito.verify(adapter, times(numberOfItems))
+.execute(
+eq(HTTPMethod.POST),
+eq(String.format("v1/namespaces/%s/views", namespaceName)),
+any(),
+any(),
+eq(LoadViewResponse.class),
+any(),
+any());
+
+Mockito.verify(adapter, times(numberOfInvocations))
+.execute(
+eq(HTTPMethod.GET),
+eq(String.format("v1/namespaces/%s/views", namespaceName)),
+eq(ImmutableMap.of("pageToken", "10", "pageSize", "10")),

Review Comment:
   @nastra I am able to make two separate requests with `.handleRequest` 
instead of `execute`, and send the correct query params for each call, updated 
the pr with this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Add Pagination To List Apis [iceberg]

2024-04-25 Thread via GitHub


rahil-c commented on code in PR #9782:
URL: https://github.com/apache/iceberg/pull/9782#discussion_r1580469059


##
core/src/test/java/org/apache/iceberg/rest/TestRESTViewCatalog.java:
##
@@ -144,6 +157,65 @@ public void closeCatalog() throws Exception {
 }
   }
 
+  @Test
+  public void testPaginationForListViews() {
+RESTCatalogAdapter adapter = Mockito.spy(new 
RESTCatalogAdapter(backendCatalog));
+RESTCatalog catalog =
+new RESTCatalog(SessionCatalog.SessionContext.createEmpty(), (config) 
-> adapter);
+catalog.initialize("test", 
ImmutableMap.of(RESTSessionCatalog.REST_PAGE_SIZE, "10"));
+
+int numberOfItems = 20;
+int numberOfInvocations = 2;
+String namespaceName = "newdb";
+String viewName = "newview";
+
+// create initial namespace
+catalog().createNamespace(Namespace.of(namespaceName));
+
+// create several views under namespace, based off a table for listing and 
verify
+for (int i = 0; i < numberOfItems; i++) {
+  TableIdentifier viewIndentifier = TableIdentifier.of(namespaceName, 
viewName + i);
+  catalog
+  .buildView(viewIndentifier)
+  .withSchema(SCHEMA)
+  .withDefaultNamespace(viewIndentifier.namespace())
+  .withQuery("spark", "select * from ns.tbl")
+  .create();
+}
+List views = 
catalog.listViews(Namespace.of(namespaceName));
+assertThat(views).hasSize(numberOfItems);
+
+Mockito.verify(adapter)
+.execute(
+eq(HTTPMethod.GET),
+eq("v1/config"),
+any(),
+any(),
+eq(ConfigResponse.class),
+any(),
+any());
+
+Mockito.verify(adapter, times(numberOfItems))
+.execute(
+eq(HTTPMethod.POST),
+eq(String.format("v1/namespaces/%s/views", namespaceName)),
+any(),
+any(),
+eq(LoadViewResponse.class),
+any(),
+any());
+
+Mockito.verify(adapter, times(numberOfInvocations))
+.execute(
+eq(HTTPMethod.GET),
+eq(String.format("v1/namespaces/%s/views", namespaceName)),
+eq(ImmutableMap.of("pageToken", "10", "pageSize", "10")),

Review Comment:
   @nastra I am able to make two separate requests with `.handleRequest` 
instead of `execute`, and send the correct query params for each call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


ZENOTME commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1580446821


##
crates/e2e_test/tests/append_data_file_test.rs:
##
@@ -0,0 +1,212 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Integration tests for rest catalog.
+
+use arrow_array::{ArrayRef, BooleanArray, Int32Array, RecordBatch, 
StringArray};
+use futures::TryStreamExt;
+use iceberg::io::{S3_ACCESS_KEY_ID, S3_ENDPOINT, S3_REGION, 
S3_SECRET_ACCESS_KEY};
+use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
+use iceberg::transaction::Transaction;
+use iceberg::writer::base_writer::data_file_writer::{DataFileWriterBuilder, 
DataFileWriterConfig};
+use iceberg::writer::file_writer::location_generator::{
+DefaultFileNameGenerator, DefaultLocationGenerator,
+};
+use iceberg::writer::file_writer::ParquetWriterBuilder;
+use iceberg::writer::{IcebergWriter, IcebergWriterBuilder};
+use iceberg::{Catalog, Namespace, NamespaceIdent, TableCreation};
+use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};
+use iceberg_test_utils::docker::DockerCompose;
+use iceberg_test_utils::{normalize_test_name, set_up};
+use parquet::file::properties::WriterProperties;
+use port_scanner::scan_port_addr;
+use std::collections::HashMap;
+use std::sync::Arc;
+use tokio::time::sleep;
+
+const REST_CATALOG_PORT: u16 = 8181;
+
+struct TestFixture {
+_docker_compose: DockerCompose,
+rest_catalog: RestCatalog,
+}
+
+async fn set_test_fixture(func: ) -> TestFixture {
+set_up();
+let docker_compose = DockerCompose::new(
+normalize_test_name(format!("{}_{func}", module_path!())),
+format!("{}/testdata", env!("CARGO_MANIFEST_DIR")),
+);
+
+// Start docker compose
+docker_compose.run();
+
+let rest_catalog_ip = docker_compose.get_container_ip("rest");
+
+let read_port = format!("{}:{}", rest_catalog_ip, REST_CATALOG_PORT);
+loop {
+if !scan_port_addr(_port) {
+log::info!("Waiting for 1s rest catalog to ready...");
+sleep(std::time::Duration::from_millis(1000)).await;
+} else {
+break;
+}
+}
+
+let container_ip = docker_compose.get_container_ip("minio");
+let read_port = format!("{}:{}", container_ip, 9000);
+
+let config = RestCatalogConfig::builder()
+.uri(format!("http://{}:{};, rest_catalog_ip, REST_CATALOG_PORT))
+.props(HashMap::from([
+(S3_ENDPOINT.to_string(), format!("http://{};, read_port)),
+(S3_ACCESS_KEY_ID.to_string(), "admin".to_string()),
+(S3_SECRET_ACCESS_KEY.to_string(), "password".to_string()),
+(S3_REGION.to_string(), "us-east-1".to_string()),
+]))
+.build();
+let rest_catalog = RestCatalog::new(config).await.unwrap();
+
+TestFixture {
+_docker_compose: docker_compose,
+rest_catalog,
+}
+}
+
+#[tokio::test]
+async fn test_append_data_file() {
+let fixture = set_test_fixture("test_create_table").await;
+
+let ns = Namespace::with_properties(
+NamespaceIdent::from_strs(["apple", "ios"]).unwrap(),
+HashMap::from([
+("owner".to_string(), "ray".to_string()),
+("community".to_string(), "apache".to_string()),
+]),
+);
+
+fixture
+.rest_catalog
+.create_namespace(ns.name(), ns.properties().clone())
+.await
+.unwrap();
+
+let schema = Schema::builder()
+.with_schema_id(1)
+.with_identifier_field_ids(vec![2])
+.with_fields(vec![
+NestedField::optional(1, "foo", 
Type::Primitive(PrimitiveType::String)).into(),
+NestedField::required(2, "bar", 
Type::Primitive(PrimitiveType::Int)).into(),
+NestedField::optional(3, "baz", 
Type::Primitive(PrimitiveType::Boolean)).into(),
+])
+.build()
+.unwrap();
+
+let table_creation = TableCreation::builder()
+.name("t1".to_string())
+.schema(schema.clone())
+.build();
+
+let table = fixture
+.rest_catalog
+.create_table(ns.name(), table_creation)
+.await
+.unwrap();
+
+// Create the writer and write the data

Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


ZENOTME commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1580444775


##
crates/iceberg/src/transaction.rs:
##
@@ -121,6 +166,270 @@ impl<'a> Transaction<'a> {
 }
 }
 
+/// FastAppendAction is a transaction action for fast append data files to the 
table.
+pub struct FastAppendAction<'a> {
+tx: Transaction<'a>,
+
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+
+commit_uuid: String,
+manifest_id: i64,
+
+appended_data_files: Vec,
+}
+
+impl<'a> FastAppendAction<'a> {
+#[allow(clippy::too_many_arguments)]
+pub(crate) fn new(
+tx: Transaction<'a>,
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+commit_uuid: String,
+) -> Result {
+Ok(Self {
+tx,
+parent_snapshot_id,
+snapshot_id,
+schema,
+schema_id,
+format_version,
+partition_spec,
+key_metadata,
+commit_uuid,
+manifest_id: 0,
+appended_data_files: vec![],
+})
+}
+
+/// Add data files to the snapshot.
+pub fn add_data_files(
+ self,
+data_file: impl IntoIterator,
+) -> Result< Self> {
+self.appended_data_files.extend(data_file);
+Ok(self)
+}
+
+fn generate_manifest_file_path( self) -> String {
+let manifest_id = self.manifest_id;
+self.manifest_id += 1;
+format!(
+"{}/{}/{}-m{}.{}",
+self.tx.table.metadata().location(),
+META_ROOT_PATH,
+_uuid,
+manifest_id,
+DataFileFormat::Avro
+)
+}
+
+async fn manifest_from_parent_snapshot() -> Result> 
{
+if let Some(snapshot) = self.tx.table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(self.tx.table.file_io(), 
_ref())
+.await?;
+let mut manifest_files = 
Vec::with_capacity(manifest_list.entries().len());
+for entry in manifest_list.entries() {
+// From: 
https://github.com/apache/iceberg-python/blob/659a951d6397ab280cae80206fe6e8e4be2d3738/pyiceberg/table/__init__.py#L2921
+// Why we need this?
+if entry.added_snapshot_id == self.snapshot_id {
+continue;
+}
+let manifest = 
entry.load_manifest(self.tx.table.file_io()).await?;
+// Skip manifest with all delete entries.
+if manifest.entries().iter().all(|entry| !entry.is_alive()) {
+continue;
+}
+manifest_files.push(entry.clone());
+}
+Ok(manifest_files)
+} else {
+Ok(vec![])
+}
+}
+
+// Write manifest file for added data files and return the ManifestFile 
for ManifestList.
+async fn manifest_for_data_file( self) -> Result {
+let appended_data_files = std::mem::take( 
self.appended_data_files);
+let manifest_entries = appended_data_files
+.into_iter()
+.map(|data_file| {
+ManifestEntry::builder()
+.status(crate::spec::ManifestStatus::Added)
+.snapshot_id(self.snapshot_id)
+.data_file(data_file)
+.build()
+})
+.collect();
+let manifest_meta = ManifestMetadata::builder()
+.schema(self.schema.clone())
+.schema_id(self.schema_id)
+.format_version(self.format_version)
+.partition_spec(self.partition_spec.clone())
+.content(crate::spec::ManifestContentType::Data)
+.build();
+let manifest = Manifest::new(manifest_meta, manifest_entries);
+let writer = ManifestWriter::new(
+self.tx
+.table
+.file_io()
+.new_output(self.generate_manifest_file_path())?,
+self.snapshot_id,
+self.key_metadata.clone(),
+);
+writer.write(manifest).await
+}
+
+fn summary() -> Summary {
+Summary {
+operation: crate::spec::Operation::Append,
+other: HashMap::new(),
+}
+}
+
+/// Finished building the action and apply it to the transaction.
+pub async fn apply(mut self) -> Result> {
+let summary = self.summary();
+let manifest = self.manifest_for_data_file().await?;
+let existing_manifest_files = 
self.manifest_from_parent_snapshot().await?;
+
+let snapshot_produce_action 

Re: [I] byte and short types in spark no longer auto coerce to int32 [iceberg]

2024-04-25 Thread via GitHub


jkolash commented on issue #10225:
URL: https://github.com/apache/iceberg/issues/10225#issuecomment-2078535487

   Ok this is reproducing via the github actions build on my public fork
   https://github.com/jkolash/iceberg/actions/runs/8842101257/job/24280206652
   ```
   TestDataFrameWriterV2 > testByte FAILED
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 
(TID 7) (localhost executor driver): java.lang.ClassCastException: class 
java.lang.Byte cannot be cast to class java.lang.Integer (java.lang.Byte and 
java.lang.Integer are in module java.base of loader 'bootstrap')
at org.apache.iceberg.parquet.ColumnWriter$2.write(ColumnWriter.java:39)
at 
org.apache.iceberg.parquet.ParquetValueWriters$PrimitiveWriter.write(ParquetValueWriters.java:131)
at 
org.apache.iceberg.parquet.ParquetValueWriters$OptionWriter.write(ParquetValueWriters.java:356)
at 
org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:589)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement manifest filtering in `TableScan` [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on PR #323:
URL: https://github.com/apache/iceberg-rust/pull/323#issuecomment-2078490131

   Let's wait a moment to see if others have comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement manifest filtering in `TableScan` [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on code in PR #323:
URL: https://github.com/apache/iceberg-rust/pull/323#discussion_r1580334216


##
crates/iceberg/src/scan.rs:
##
@@ -158,8 +196,24 @@ impl TableScan {
 .await?;
 
 // Generate data file stream
-let mut entries = iter(manifest_list.entries());
-while let Some(entry) = entries.next().await {
+for entry in manifest_list.entries() {
+// If this scan has a filter, check the partition evaluator 
cache for an existing
+// PartitionEvaluator that matches this manifest's partition 
spec ID.
+// Use one from the cache if there is one. If not, create one, 
put it in
+// the cache, and take a reference to it.
+#[allow(clippy::map_entry)]
+if let Some(filter) = filter.as_ref() {
+if 
!manifest_evaluator_cache.contains_key(_spec_id) {

Review Comment:
   Oh, got it. Thanks for clarification.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Apply DeleteGranularity for writes [iceberg]

2024-04-25 Thread via GitHub


aokolnychyi commented on code in PR #10200:
URL: https://github.com/apache/iceberg/pull/10200#discussion_r1580294047


##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/sink/BaseDeltaTaskWriter.java:
##
@@ -56,7 +58,8 @@ abstract class BaseDeltaTaskWriter extends 
BaseTaskWriter {
   Schema schema,
   RowType flinkSchema,
   List equalityFieldIds,
-  boolean upsert) {
+  boolean upsert,
+  DeleteGranularity deleteGranularity) {

Review Comment:
   Are we worried about the increased metadata size? If so, I don't think it 
should be a problem as position delete files require much less metadata and 
highly doubt the difference will be noticed.



##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/sink/BaseDeltaTaskWriter.java:
##
@@ -56,7 +58,8 @@ abstract class BaseDeltaTaskWriter extends 
BaseTaskWriter {
   Schema schema,
   RowType flinkSchema,
   List equalityFieldIds,
-  boolean upsert) {
+  boolean upsert,
+  DeleteGranularity deleteGranularity) {

Review Comment:
   Are we worried about the increased metadata size? If so, I don't think it 
should be a problem as position delete files require much less metadata and I 
highly doubt the difference will be noticed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Apply DeleteGranularity for writes [iceberg]

2024-04-25 Thread via GitHub


aokolnychyi commented on PR #10200:
URL: https://github.com/apache/iceberg/pull/10200#issuecomment-2078403034

   > Also there is a behavioural change that the previous write rolled deletes
   
   @pvary, I don't think that threshold makes sense in this scenario. It was 
needed because we kept rows and positions in a set in the old writers. We are 
migrating to `SortingPositionOnlyDeleteWriter` that keeps track of positions in 
bitmaps. Even millions of position deletes would take KBs on disk.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Apply DeleteGranularity for writes [iceberg]

2024-04-25 Thread via GitHub


aokolnychyi commented on code in PR #10200:
URL: https://github.com/apache/iceberg/pull/10200#discussion_r1580291026


##
core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java:
##
@@ -109,18 +112,34 @@ protected abstract class BaseEqualityDeltaWriter 
implements Closeable {
 private final StructProjection structProjection;
 private RollingFileWriter dataWriter;
 private RollingEqDeleteWriter eqDeleteWriter;
-private SortedPosDeleteWriter posDeleteWriter;
+private FileWriter, DeleteWriteResult> posDeleteWriter;
 private Map insertedRowMap;
 
 protected BaseEqualityDeltaWriter(StructLike partition, Schema schema, 
Schema deleteSchema) {
+  this(partition, schema, deleteSchema, DeleteGranularity.PARTITION);
+}
+
+protected BaseEqualityDeltaWriter(
+StructLike partition,
+Schema schema,
+Schema deleteSchema,
+DeleteGranularity deleteGranularity) {
   Preconditions.checkNotNull(schema, "Iceberg table schema cannot be 
null.");
   Preconditions.checkNotNull(deleteSchema, "Equality-delete schema cannot 
be null.");
   this.structProjection = StructProjection.create(schema, deleteSchema);
 
   this.dataWriter = new RollingFileWriter(partition);
   this.eqDeleteWriter = new RollingEqDeleteWriter(partition);
   this.posDeleteWriter =
-  new SortedPosDeleteWriter<>(appenderFactory, fileFactory, format, 
partition);
+  new SortingPositionOnlyDeleteWriter<>(
+  () ->
+  appenderFactory.newPosDeleteWriter(
+  partition == null

Review Comment:
   If I am not mistaken, this line should check `spec.isUnpartitioned() || 
partition == null` to avoid weird issues we discovered in #7685. We also have 
to pass `spec` with `partition` below in `newOutputFile`.
   
   I'd also consider moving this logic into a separate method for better 
readability. Up to you, though.
   
   ```
   private EncryptedOutputFile newOutputFile(StructLike partition) {
 if (spec.isUnpartitioned() || partition == null) {
   return fileFactory.newOutputFile();
 } else {
   return fileFactory.newOutputFile(spec, partition);
 }
   }
   ```



##
core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java:
##
@@ -140,12 +159,18 @@ public void write(T row) throws IOException {
   PathOffset previous = insertedRowMap.put(copiedKey, pathOffset);
   if (previous != null) {
 // TODO attach the previous row if has a positional-delete row schema 
in appender factory.
-posDeleteWriter.delete(previous.path, previous.rowOffset, null);
+writePosDelete(previous);
   }
 
   dataWriter.write(row);
 }
 
+private void writePosDelete(PathOffset pathOffset) {
+  PositionDelete delete = PositionDelete.create();

Review Comment:
   We usually avoid creating an extra `PositionDelete` object for every delete, 
if possible. Instead, we initialize one object and reuse it. Take a look at 
`BasePositionDeltaWriter`.



##
core/src/main/java/org/apache/iceberg/deletes/SortingPositionOnlyDeleteWriter.java:
##
@@ -118,6 +118,10 @@ private DeleteWriteResult writeFileDeletes() throws 
IOException {
 
   @SuppressWarnings("CollectionUndefinedEquality")
   private DeleteWriteResult writeDeletes(Collection paths) 
throws IOException {
+if (paths.size() == 0) {

Review Comment:
   Shall we use `paths.isEmpty()`?



##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/sink/BaseDeltaTaskWriter.java:
##
@@ -56,7 +58,8 @@ abstract class BaseDeltaTaskWriter extends 
BaseTaskWriter {
   Schema schema,
   RowType flinkSchema,
   List equalityFieldIds,
-  boolean upsert) {
+  boolean upsert,
+  DeleteGranularity deleteGranularity) {

Review Comment:
   Are we worried about the increased metadata size?



##
core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java:
##
@@ -109,18 +112,34 @@ protected abstract class BaseEqualityDeltaWriter 
implements Closeable {
 private final StructProjection structProjection;
 private RollingFileWriter dataWriter;
 private RollingEqDeleteWriter eqDeleteWriter;
-private SortedPosDeleteWriter posDeleteWriter;
+private FileWriter, DeleteWriteResult> posDeleteWriter;
 private Map insertedRowMap;
 
 protected BaseEqualityDeltaWriter(StructLike partition, Schema schema, 
Schema deleteSchema) {
+  this(partition, schema, deleteSchema, DeleteGranularity.PARTITION);
+}
+
+protected BaseEqualityDeltaWriter(
+StructLike partition,
+Schema schema,
+Schema deleteSchema,
+DeleteGranularity deleteGranularity) {

Review Comment:
   I am not convinced we should make it configurable when writing equality 
deltas. Our original plan was to always generate file-scoped position deletes 
when writing Flink CDC. Take a look at 

[PR] [draft] Parquet page skipping (using filtered row groups) [iceberg]

2024-04-25 Thread via GitHub


wypoon opened a new pull request, #10228:
URL: https://github.com/apache/iceberg/pull/10228

   (no comment)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Hive documentation not updated on website [iceberg]

2024-04-25 Thread via GitHub


github-actions[bot] commented on issue #2715:
URL: https://github.com/apache/iceberg/issues/2715#issuecomment-2078371964

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Add support for TimeType, ListType, MapType and StructType in the ArrowReader [iceberg]

2024-04-25 Thread via GitHub


github-actions[bot] commented on issue #2485:
URL: https://github.com/apache/iceberg/issues/2485#issuecomment-2078371768

   This issue has been closed because it has not received any activity in the 
last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Add support for TimeType, ListType, MapType and StructType in the ArrowReader [iceberg]

2024-04-25 Thread via GitHub


github-actions[bot] closed issue #2485: Add support for TimeType, ListType, 
MapType and StructType in the ArrowReader
URL: https://github.com/apache/iceberg/issues/2485


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Build: Bump getdaft from 0.2.16 to 0.2.21 [iceberg-python]

2024-04-25 Thread via GitHub


dependabot[bot] opened a new pull request, #662:
URL: https://github.com/apache/iceberg-python/pull/662

   Bumps [getdaft](https://github.com/Eventual-Inc/Daft) from 0.2.16 to 0.2.21.
   
   Release notes
   Sourced from https://github.com/Eventual-Inc/Daft/releases;>getdaft's 
releases.
   
   v0.2.21
   Changes
   ✨ New Features
   
   [FEAT] Add S3Config.from_env functionality https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2137;>#2137)
   deltalake _delta_lake.py: Allow Glue catalog cross account access https://github.com/pang-wu;>@​pang-wu (https://redirect.github.com/Eventual-Inc/Daft/issues/2113;>#2113)
   [FEAT] Enable Ruff https://github.com/samster25;>@​samster25 (https://redirect.github.com/Eventual-Inc/Daft/issues/2121;>#2121)
   [FEAT] Implements other trigonometry expressions https://github.com/MeepoWin;>@​MeepoWin (https://redirect.github.com/Eventual-Inc/Daft/issues/2123;>#2123)
   [FEAT] exp expression implementation https://github.com/MeepoWin;>@​MeepoWin (https://redirect.github.com/Eventual-Inc/Daft/issues/2115;>#2115)
   [FEAT] sin/cos/tan expression implementation https://github.com/reswqa;>@​reswqa (https://redirect.github.com/Eventual-Inc/Daft/issues/2112;>#2112)
   [CHORE] Using uv in MakeFile https://github.com/MeepoWin;>@​MeepoWin (https://redirect.github.com/Eventual-Inc/Daft/issues/2114;>#2114)
   [FEAT] Add option to S3Config to force virtual addressing https://github.com/samster25;>@​samster25 (https://redirect.github.com/Eventual-Inc/Daft/issues/2106;>#2106)
   [FEAT] fill_null expression https://github.com/colin-ho;>@​colin-ho (https://redirect.github.com/Eventual-Inc/Daft/issues/2089;>#2089)
   [FEAT] Add basic list aggregations https://github.com/kevinzwang;>@​kevinzwang (https://redirect.github.com/Eventual-Inc/Daft/issues/2032;>#2032)
   [FEAT] Allow sql alchemy connection factory as input to read_sql https://github.com/colin-ho;>@​colin-ho (https://redirect.github.com/Eventual-Inc/Daft/issues/2071;>#2071)
   [FEAT] Add daft-sketch subcrate and arrow2 serialization functionality 
https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2090;>#2090)
   
    Bug Fixes
   
   [BUG] Fix reading partition key columns in DeltaLake https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2118;>#2118)
   
    Documentation
   
   [CHORE] Fix underlines in README https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2143;>#2143)
   [DOCS] Update iceberg integration docs to add writes https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2110;>#2110)
   [DOCS] Create CODE_OF_CONDUCT.md https://github.com/samster25;>@​samster25 (https://redirect.github.com/Eventual-Inc/Daft/issues/2101;>#2101)
   [CHORE] Skip deltalake notebooks for CI https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2097;>#2097)
   [CHORE] Add link to good first issues in readme https://github.com/colin-ho;>@​colin-ho (https://redirect.github.com/Eventual-Inc/Daft/issues/2088;>#2088)
   [DOCS] Fix docs typo https://github.com/avriiil;>@​avriiil (https://redirect.github.com/Eventual-Inc/Daft/issues/2075;>#2075)
   [DOCS] Typos in user guide https://github.com/avriiil;>@​avriiil (https://redirect.github.com/Eventual-Inc/Daft/issues/2079;>#2079)
   [DOCS] Fix typos on 10-min tutorial https://github.com/avriiil;>@​avriiil (https://redirect.github.com/Eventual-Inc/Daft/issues/2082;>#2082)
   [DOCS] Add ml batch inference tutorials https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2057;>#2057)
   [CHORE] Fix autolabeller CI step for forks https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2138;>#2138)
   
   藺 Maintenance
   
   [CHORE] Fix underlines in README https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2143;>#2143)
   [CHORE] Split labelling and update release CI steps https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2142;>#2142)
   [CHORE] Fix the labeller CI step which is not triggering https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2141;>#2141)
   [CHORE] Fixing readthedocs build https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2135;>#2135)
   [CHORE] Fix documentation build with uv https://github.com/jaychia;>@​jaychia (https://redirect.github.com/Eventual-Inc/Daft/issues/2134;>#2134)
   [CHORE] Fix build command https://github.com/MeepoWin;>@​MeepoWin (https://redirect.github.com/Eventual-Inc/Daft/issues/2126;>#2126)
   [CHORE] Rename virtual env folder to .venv https://github.com/MeepoWin;>@​MeepoWin (https://redirect.github.com/Eventual-Inc/Daft/issues/2122;>#2122)
   [CHORE] refactors for ruff [1/n] 

[PR] Build: Bump griffe from 0.40.1 to 0.44.0 [iceberg-python]

2024-04-25 Thread via GitHub


dependabot[bot] opened a new pull request, #660:
URL: https://github.com/apache/iceberg-python/pull/660

   Bumps [griffe](https://github.com/mkdocstrings/griffe) from 0.40.1 to 0.44.0.
   
   Release notes
   Sourced from https://github.com/mkdocstrings/griffe/releases;>griffe's 
releases.
   
   0.44.0
   https://github.com/mkdocstrings/griffe/releases/tag/0.44.0;>0.44.0 - 
2024-04-19
   https://github.com/mkdocstrings/griffe/compare/0.43.0...0.44.0;>Compare 
with 0.43.0
   Features
   
   Add resolved property on expression names, returning the 
corresponding Griffe object (https://github.com/mkdocstrings/griffe/commit/9b5ca4574250f847fd33a8cb92af56806db50c1b;>9b5ca45
 by Timothée Mazzucotelli).
   
   Bug Fixes
   
   Fix enumeration properties on expression names (https://github.com/mkdocstrings/griffe/commit/6f22256ad02439d961bce2bb1afa32d4e9e10b10;>6f22256
 by Timothée Mazzucotelli).
   
   0.43.0
   https://github.com/mkdocstrings/griffe/releases/tag/0.43.0;>0.43.0 - 
2024-04-18
   https://github.com/mkdocstrings/griffe/compare/0.42.2...0.43.0;>Compare 
with 0.42.2
   Features
   
   Add properties telling whether an expression name resolves to an 
enumeration class, instance or value (https://github.com/mkdocstrings/griffe/commit/fdb21d943f72fb10a4406930bf3e3bf7aceff6b0;>fdb21d9
 by Timothée Mazzucotelli). https://redirect.github.com/mkdocstrings/python/issues/124;>Issue-mkdocstrings/python#124
   
   0.42.2
   https://github.com/mkdocstrings/griffe/releases/tag/0.42.2;>0.42.2 - 
2024-04-15
   https://github.com/mkdocstrings/griffe/compare/0.42.1...0.42.2;>Compare 
with 0.42.1
   Bug Fixes
   
   Fix target path of aliases for multipart imports (import a.b.c as 
x) (https://github.com/mkdocstrings/griffe/commit/ee27ad97669a7321d18e6724e6c155cef601a289;>ee27ad9
 by Timothée Mazzucotelli). https://redirect.github.com/mkdocstrings/griffe/issues/259;>Issue-259
   
   0.42.1
   https://github.com/mkdocstrings/griffe/releases/tag/0.42.1;>0.42.1 - 
2024-03-19
   https://github.com/mkdocstrings/griffe/compare/0.42.0...0.42.1;>Compare 
with 0.42.0
   Bug Fixes
   
   Don't return class variables as parameters of dataclasses (https://github.com/mkdocstrings/griffe/commit/2729c22505d87b771ab7a70c91c9f8301275aa8c;>2729c22
 by Hassan Kibirige). https://redirect.github.com/mkdocstrings/griffe/pull/253;>PR-253
   Don't turn items annotated as InitVar into dataclass members (https://github.com/mkdocstrings/griffe/commit/6835ea361325a205c0af69acabc66ca5193156c5;>6835ea3
 by Hassan Kibirige). https://redirect.github.com/mkdocstrings/griffe/pull/252;>PR-252
   
   0.42.0
   https://github.com/mkdocstrings/griffe/releases/tag/0.42.0;>0.42.0 - 
2024-03-11
   https://github.com/mkdocstrings/griffe/compare/0.41.3...0.42.0;>Compare 
with 0.41.3
   Features
   
   Better support for dataclasses (https://github.com/mkdocstrings/griffe/commit/82a9d5798b2eebddfd640b918415a0e3de2ca739;>82a9d57
 by Timothée Mazzucotelli). https://redirect.github.com/mkdocstrings/griffe/issues/233;>Issue-33, 
https://redirect.github.com/mkdocstrings/griffe/issues/234;>Issue-34, 
https://redirect.github.com/mkdocstrings/griffe/issues/238;>Issue-38, 
https://redirect.github.com/mkdocstrings/griffe/issues/239;>Issue-39, 
https://redirect.github.com/mkdocstrings/griffe/pull/240;>PR-240
   
   
   
   ... (truncated)
   
   
   Changelog
   Sourced from https://github.com/mkdocstrings/griffe/blob/main/CHANGELOG.md;>griffe's 
changelog.
   
   https://github.com/mkdocstrings/griffe/releases/tag/0.44.0;>0.44.0 - 
2024-04-19
   https://github.com/mkdocstrings/griffe/compare/0.43.0...0.44.0;>Compare 
with 0.43.0
   Features
   
   Add resolved property on expression names, returning the 
corresponding Griffe object (https://github.com/mkdocstrings/griffe/commit/9b5ca4574250f847fd33a8cb92af56806db50c1b;>9b5ca45
 by Timothée Mazzucotelli).
   
   Bug Fixes
   
   Fix enumeration properties on expression names (https://github.com/mkdocstrings/griffe/commit/6f22256ad02439d961bce2bb1afa32d4e9e10b10;>6f22256
 by Timothée Mazzucotelli).
   
   https://github.com/mkdocstrings/griffe/releases/tag/0.43.0;>0.43.0 - 
2024-04-18
   https://github.com/mkdocstrings/griffe/compare/0.42.2...0.43.0;>Compare 
with 0.42.2
   Features
   
   Add properties telling whether an expression name resolves to an 
enumeration class, instance or value (https://github.com/mkdocstrings/griffe/commit/fdb21d943f72fb10a4406930bf3e3bf7aceff6b0;>fdb21d9
 by Timothée Mazzucotelli). https://redirect.github.com/mkdocstrings/python/issues/124;>Issue-mkdocstrings/python#124
   
   https://github.com/mkdocstrings/griffe/releases/tag/0.42.2;>0.42.2 - 
2024-04-15
   https://github.com/mkdocstrings/griffe/compare/0.42.1...0.42.2;>Compare 
with 0.42.1
   Bug Fixes
   
   Fix target path of aliases for multipart imports (import a.b.c as 
x) (https://github.com/mkdocstrings/griffe/commit/ee27ad97669a7321d18e6724e6c155cef601a289;>ee27ad9
 by Timothée Mazzucotelli). 

[PR] Build: Bump pyspark from 3.5.0 to 3.5.1 [iceberg-python]

2024-04-25 Thread via GitHub


dependabot[bot] opened a new pull request, #659:
URL: https://github.com/apache/iceberg-python/pull/659

   Bumps [pyspark](https://github.com/apache/spark) from 3.5.0 to 3.5.1.
   
   Commits
   
   https://github.com/apache/spark/commit/fd86f85e181fc2dc0f50a096855acf83a6cc5d9c;>fd86f85
 Preparing Spark release v3.5.1-rc2
   https://github.com/apache/spark/commit/9b4778fc1dc7047635c9ec19c187d4e75d182590;>9b4778f
 [SPARK-46906][INFRA][3.5] Bump python libraries (pandas, pyarrow) in Docker 
i...
   https://github.com/apache/spark/commit/ea6b25767fb86732c108c759fd5393caee22f129;>ea6b257
 Revert [SPARK-45396][PYTHON] Add doc entry for 
pyspark.ml.connect module, ...
   https://github.com/apache/spark/commit/a8c62d3f9a8de22f92e0e0ca1a5770f373b0b142;>a8c62d3
 [SPARK-47023][BUILD] Upgrade aircompressor to 1.26
   https://github.com/apache/spark/commit/d27bdbeae9c5b634702ad20a58b9d3c68ac9d39d;>d27bdbe
 Preparing development version 3.5.2-SNAPSHOT
   https://github.com/apache/spark/commit/08fe67b9ebf656b6ae7c44163bffba247061aa42;>08fe67b
 Preparing Spark release v3.5.1-rc1
   https://github.com/apache/spark/commit/4e4d9f07d0954357e85a6e2b0da47746a4b08501;>4e4d9f0
 [SPARK-47022][CONNECT][TESTS][3.5] Fix connect/client/jvm to have 
explicit ...
   https://github.com/apache/spark/commit/9700da7bfc1abb607f3cb916b96724d0fb8f2eba;>9700da7
 [SPARK-47021][BUILD][TESTS] Fix kvstore module to have explicit 
`commons-la...
   https://github.com/apache/spark/commit/7658f77a613c91364c4b6c986e1861c7bd5487db;>7658f77
 [SPARK-39910][SQL] Delegate path qualification to filesystem during 
DataSourc...
   https://github.com/apache/spark/commit/77f8b38a1091aa51af32dc790b61ae54ac47a2c2;>77f8b38
 [SPARK-46400][CORE][SQL][3.5] When there are corrupted files in the local 
mav...
   Additional commits viewable in https://github.com/apache/spark/compare/v3.5.0...v3.5.1;>compare 
view
   
   
   
   
   
   [![Dependabot compatibility 
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pyspark=pip=3.5.0=3.5.1)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show  ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Build: Bump mkdocstrings from 0.24.0 to 0.24.3 [iceberg-python]

2024-04-25 Thread via GitHub


dependabot[bot] opened a new pull request, #661:
URL: https://github.com/apache/iceberg-python/pull/661

   Bumps [mkdocstrings](https://github.com/mkdocstrings/mkdocstrings) from 
0.24.0 to 0.24.3.
   
   Release notes
   Sourced from https://github.com/mkdocstrings/mkdocstrings/releases;>mkdocstrings's 
releases.
   
   0.24.3
   https://github.com/mkdocstrings/mkdocstrings/releases/tag/0.24.3;>0.24.3
 - 2024-04-05
   https://github.com/mkdocstrings/mkdocstrings/compare/0.24.2...0.24.3;>Compare
 with 0.24.2
   Bug Fixes
   
   Support HTML toc labels with Python-Markdown 3.6+ (uncomment code...) 
(https://github.com/mkdocstrings/mkdocstrings/commit/7fe3e5f28239c08094fb656728369998f52630ea;>7fe3e5f
 by Timothée Mazzucotelli).
   
   0.24.2
   https://github.com/mkdocstrings/mkdocstrings/releases/tag/0.24.2;>0.24.2
 - 2024-04-02
   https://github.com/mkdocstrings/mkdocstrings/compare/0.24.1...0.24.2;>Compare
 with 0.24.1
   Bug Fixes
   
   Support HTML toc labels with Python-Markdown 3.6+ (https://github.com/mkdocstrings/mkdocstrings/commit/c0d009000678a2ccbfb0c8adfeff3dc83845ee41;>c0d0090
 by Timothée Mazzucotelli). https://redirect.github.com/mkdocstrings/python/issues/143;>Issue-mkdocstrings/python-143
   
   0.24.1
   https://github.com/mkdocstrings/mkdocstrings/releases/tag/0.24.1;>0.24.1
 - 2024-02-27
   https://github.com/mkdocstrings/mkdocstrings/compare/0.24.0...0.24.1;>Compare
 with 0.24.0
   Code Refactoring
   
   Support new pymdownx-highlight options (https://github.com/mkdocstrings/mkdocstrings/commit/a7a29079aebcd79be84ac38ce275797920e4c75e;>a7a2907
 by Timothée Mazzucotelli).
   Backup anchors with id and no href, for compatibility with autorefs' 
Markdown anchors (https://github.com/mkdocstrings/mkdocstrings/commit/b5236b4333ebde9648c84f6e4b0f4c2b10f3ecd4;>b5236b4
 by Timothée Mazzucotelli). [PR-https://redirect.github.com/mkdocstrings/mkdocstrings/issues/651;>#651](https://redirect.github.com/mkdocstrings/mkdocstrings/pull/651;>mkdocstrings/mkdocstrings#651),
 https://redirect.github.com/mkdocstrings/autorefs/pull/39;>Related-to-mkdocs-autorefs#39,
 Co-authored-by: Oleh Prypin mailto:o...@pryp.in;>o...@pryp.in
   
   
   
   
   Changelog
   Sourced from https://github.com/mkdocstrings/mkdocstrings/blob/main/CHANGELOG.md;>mkdocstrings's
 changelog.
   
   https://github.com/mkdocstrings/mkdocstrings/releases/tag/0.24.3;>0.24.3
 - 2024-04-05
   https://github.com/mkdocstrings/mkdocstrings/compare/0.24.2...0.24.3;>Compare
 with 0.24.2
   Bug Fixes
   
   Support HTML toc labels with Python-Markdown 3.6+ (uncomment code...) 
(https://github.com/mkdocstrings/mkdocstrings/commit/7fe3e5f28239c08094fb656728369998f52630ea;>7fe3e5f
 by Timothée Mazzucotelli).
   
   https://github.com/mkdocstrings/mkdocstrings/releases/tag/0.24.2;>0.24.2
 - 2024-04-02
   https://github.com/mkdocstrings/mkdocstrings/compare/0.24.1...0.24.2;>Compare
 with 0.24.1
   Bug Fixes
   
   Support HTML toc labels with Python-Markdown 3.6+ (https://github.com/mkdocstrings/mkdocstrings/commit/c0d009000678a2ccbfb0c8adfeff3dc83845ee41;>c0d0090
 by Timothée Mazzucotelli). https://redirect.github.com/mkdocstrings/python/issues/143;>Issue-mkdocstrings/python-143
   
   https://github.com/mkdocstrings/mkdocstrings/releases/tag/0.24.1;>0.24.1
 - 2024-02-27
   https://github.com/mkdocstrings/mkdocstrings/compare/0.24.0...0.24.1;>Compare
 with 0.24.0
   Code Refactoring
   
   Support new pymdownx-highlight options (https://github.com/mkdocstrings/mkdocstrings/commit/a7a29079aebcd79be84ac38ce275797920e4c75e;>a7a2907
 by Timothée Mazzucotelli).
   Backup anchors with id and no href, for compatibility with autorefs' 
Markdown anchors (https://github.com/mkdocstrings/mkdocstrings/commit/b5236b4333ebde9648c84f6e4b0f4c2b10f3ecd4;>b5236b4
 by Timothée Mazzucotelli). [PR-https://redirect.github.com/mkdocstrings/mkdocstrings/issues/651;>#651](https://redirect.github.com/mkdocstrings/mkdocstrings/pull/651;>mkdocstrings/mkdocstrings#651),
 https://redirect.github.com/mkdocstrings/autorefs/pull/39;>Related-to-mkdocs-autorefs#39,
 Co-authored-by: Oleh Prypin mailto:o...@pryp.in;>o...@pryp.in
   
   
   
   
   Commits
   
   https://github.com/mkdocstrings/mkdocstrings/commit/828bd5921dba610e0ce33be780ac16af0eb0bef6;>828bd59
 chore: Prepare release 0.24.3
   https://github.com/mkdocstrings/mkdocstrings/commit/7fe3e5f28239c08094fb656728369998f52630ea;>7fe3e5f
 fix: Support HTML toc labels with Python-Markdown 3.6+ (uncomment code...)
   https://github.com/mkdocstrings/mkdocstrings/commit/7b9827c97e396bd76f77315d40baa6596ee8e17e;>7b9827c
 chore: Prepare release 0.24.2
   https://github.com/mkdocstrings/mkdocstrings/commit/17bfc87a8d23de5585b4630fd8c2b4595ac88a36;>17bfc87
 chore: Use PEP 440 versioning scheme for changelog
   https://github.com/mkdocstrings/mkdocstrings/commit/024ac41024a19cbf45f4d127c75cb709134683db;>024ac41
 ci: Ignore mypy warning
   

[I] How to reinitialize spark catalog on an ongoing spark session with iceberg [iceberg]

2024-04-25 Thread via GitHub


vasudhamsr opened a new issue, #10227:
URL: https://github.com/apache/iceberg/issues/10227

   ### Query engine
   
   Spark
   
   ### Question
   
   We have a scenario where the catalog object keeps changing and the custom 
FileIO underneath as well. But the loadTable function 
(org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)) is 
picking up the older catalog impl class with stale FileIO(which doesn't hold 
access to new tables) and is throwing. From looks of it, it seems spark catalog 
is loading catalog impl class on initialization and even there are further 
changes, its not picking up. Is there a way to forceRefresh object of catalog 
in SparkCatalog?  Thank you in advance!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Prevent setting endTag/endSnapshotId for streaming source [iceberg]

2024-04-25 Thread via GitHub


stevenzwu commented on code in PR #10207:
URL: https://github.com/apache/iceberg/pull/10207#discussion_r1580202486


##
flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/source/ScanContext.java:
##
@@ -130,7 +131,9 @@ private ScanContext(
 this.watermarkColumn = watermarkColumn;
 this.watermarkColumnTimeUnit = watermarkColumnTimeUnit;
 
-validate();
+if (!skipValidate) {

Review Comment:
   > If yes, then we change the copyWithAppendsBetween and the  
copyWithSnapshotId to remove the streaming flag, as those are not a streaming 
scans anymore.
   
   that is also not correct, because `copy` should meant `copy`. The main 
problem is that `ScanContext` is used for both user intention (via source 
builder) and internal incremental scan. I agree that internal incremental scan 
shouldn't have the streaming setting. 
   
   note that `ScanContext` is not a public class. Users can't construct this 
object directly. Maybe the `validate()` method shouldn't be called by the 
constructor and only be called by the `ScanContext#Builder#build()` method? or 
move some more intrinsic validation to `IcebergSource` builder?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Prevent setting endTag/endSnapshotId for streaming source [iceberg]

2024-04-25 Thread via GitHub


stevenzwu commented on code in PR #10207:
URL: https://github.com/apache/iceberg/pull/10207#discussion_r1580202486


##
flink/v1.19/flink/src/main/java/org/apache/iceberg/flink/source/ScanContext.java:
##
@@ -130,7 +131,9 @@ private ScanContext(
 this.watermarkColumn = watermarkColumn;
 this.watermarkColumnTimeUnit = watermarkColumnTimeUnit;
 
-validate();
+if (!skipValidate) {

Review Comment:
   > If yes, then we change the copyWithAppendsBetween and the  
copyWithSnapshotId to remove the streaming flag, as those are not a streaming 
scans anymore.
   
   that is also not correct, because `copy` should meant `copy`. The main 
problem is that `ScanContext` is used for both user intention (via source 
builder) and internal incremental scan. I agree that internal incremental scan 
shouldn't have the streaming setting. 
   
   note that `ScanContext` is not a public class. Users can't construct this 
object directly. Maybe the `validate()` method shouldn't be called by the 
constructor and only be called by the `ScanContext#Builder#build()` method?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: FlinkFileIO implementation [iceberg]

2024-04-25 Thread via GitHub


stevenzwu commented on code in PR #10151:
URL: https://github.com/apache/iceberg/pull/10151#discussion_r1580188262


##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/FlinkFileIO.java:
##
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Map;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.stream.Collectors;
+import org.apache.flink.core.fs.FileStatus;
+import org.apache.flink.core.fs.FileSystem;
+import org.apache.flink.core.fs.Path;
+import org.apache.iceberg.io.BulkDeletionFailureException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.FileInfo;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.io.SupportsBulkOperations;
+import org.apache.iceberg.io.SupportsPrefixOperations;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.util.SerializableMap;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class FlinkFileIO implements FileIO, SupportsPrefixOperations, 
SupportsBulkOperations {
+  private static final Logger LOG = LoggerFactory.getLogger(FlinkFileIO.class);
+  private static final String DELETE_FILE_PARALLELISM = 
"iceberg.hadoop.delete-file-parallelism";
+  private static final String DELETE_FILE_POOL_NAME = 
"iceberg-hadoopfileio-delete";
+  private static final int DELETE_RETRY_ATTEMPTS = 3;
+  private static final int DEFAULT_DELETE_CORE_MULTIPLE = 4;
+  private static volatile ExecutorService executorService;
+  private SerializableMap properties = 
SerializableMap.copyOf(ImmutableMap.of());
+
+  @Override
+  public InputFile newInputFile(String path) {
+return new FlinkInputFile(new Path(path));
+  }
+
+  @Override
+  public InputFile newInputFile(String path, long length) {
+return new FlinkInputFile(new Path(path), length);
+  }
+
+  @Override
+  public OutputFile newOutputFile(String path) {
+return new FlinkOutputFile(new Path(path));
+  }
+
+  @Override
+  public void deleteFile(String path) {
+Path toDelete = new Path(path);
+try {
+  toDelete.getFileSystem().delete(toDelete, false /* not recursive */);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to delete file: 
%s", path), e);
+}
+  }
+
+  @Override
+  public Iterable listPrefix(String prefix) {
+LOG.debug("Listing {}", prefix);
+Path prefixToList = new Path(prefix);
+try {
+  return listPrefix(prefixToList.getFileSystem(), prefixToList);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to listing prefix: 
%s", prefix), e);
+}
+  }
+
+  @Override
+  public void deletePrefix(String prefix) {
+Path prefixToDelete = new Path(prefix);
+
+try {
+  prefixToDelete.getFileSystem().delete(prefixToDelete, true /* recursive 
*/);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to delete prefix: 
%s", prefix), e);
+}
+  }
+
+  @Override
+  public void deleteFiles(Iterable pathsToDelete) throws 
BulkDeletionFailureException {
+AtomicInteger failureCount = new AtomicInteger(0);
+Tasks.foreach(pathsToDelete)
+.executeWith(executorService())
+.retry(DELETE_RETRY_ATTEMPTS)
+.stopRetryOn(FileNotFoundException.class)
+.suppressFailureWhenFinished()
+.onFailure(
+(f, e) -> {
+  LOG.error("Failure during bulk delete on file: {} ", f, e);
+  failureCount.incrementAndGet();
+})
+.run(this::deleteFile);

Review Comment:
   I meant if underline storage is S3, `S3FileIO` should probably be used 
(instead of `HadoopFileIO`)



-- 
This is an automated message from the 

[I] AWS: Creating a Glue table with Lake Formation enabled fails [iceberg]

2024-04-25 Thread via GitHub


nickdelnano opened a new issue, #10226:
URL: https://github.com/apache/iceberg/issues/10226

   ### Apache Iceberg version
   
   1.4.3
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 
   
   Hi,
   
   I am observing issues when creating an Iceberg table with a Glue catalog 
configured to use Lake Formation.
   
   I see an integration test case for the issue I am experiencing so I will 
explain the issue through this. I will include details about my use case with 
Flink afterwards.
   
   This issue is very similar to https://github.com/apache/iceberg/issues/6523 
however I observe the issue is not fixed.
   
   ### TestLakeFormationMetadataOperations.java test testCreateTableSuccess
   [Link to 
test](https://github.com/apache/iceberg/blob/main/aws/src/integration/java/org/apache/iceberg/aws/lakeformation/TestLakeFormationMetadataOperations.java#L167)
   
   This test fails in my AWS account. I have walked through the code line by 
line in a debugger and believe that it would fail in any environment due to the 
below.
   
   The test fails on this 
[line](https://github.com/apache/iceberg/blob/main/aws/src/integration/java/org/apache/iceberg/aws/lakeformation/TestLakeFormationMetadataOperations.java#L182)
 because Lake Formation permissions cannot be granted on a table that does not 
exist. It first yields an exception from the call to 
`glueCatalogPrivilegedRole.createTable` but then proceeds to the `finally` 
block.
   
   As far as I can tell, the AWS integration tests are not run on opened PRs so 
I cannot easily demonstrate this in an issue or PR. If it is possible to do 
this please let me know how and I will create a PR that shows it.
   
   Previous work has been done to create an initial or "dummy" Glue table if 
Lake Formation is enabled and the table requested for creation does not exist 
yet ([1] https://github.com/apache/iceberg/pull/4423/files). However, if Lake 
Formation is enabled, [2] [GlueCatalog sets 
`put(S3FileIOProperties.PRELOAD_CLIENT_ENABLED, 
String.valueOf(true)`](https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java#L224-L232),
 which triggers the below code path and results in a call to `aws glue 
get-table` API before any table exists. This causes an uncaught exception and 
creating a table fails.
   - 
https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java#L374-L376
   - 
   
https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/lakeformation/LakeFormationAwsClientFactory.java#L78-L79
   - 
https://github.com/apache/iceberg/blob/main/aws/src/main/java/org/apache/iceberg/aws/lakeformation/LakeFormationAwsClientFactory.java#L107
   
   Please provide any advice or workaround for how a table can be created in a 
Glue catalog with Lake Formation enabled without encountering this issue.
   
   ### Error in my Flink environment
   I am using Flink on EC2 (not EMR) and using Iceberg, Glue and Lake Formation.
   
   Iceberg catalog configuration:
   ```
  "CREATE CATALOG glue_catalog WITH (
   'type'='iceberg',
   'warehouse'='s3://bucket'
   'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog'
   'io-impl'='org.apache.iceberg.aws.s3.S3FileIO'
   
'client.factory'='org.apache.iceberg.aws.lakeformation.LakeFormationAwsClientFactory'
   
'client.assume-role.arn'='arn:aws:iam:::role/'
   'glue.lakeformation-enabled'='true'
   
'client.assume-role.tags.LakeFormationAuthorizedCaller'=''
   'client.assume-role.region'='us-east-1'
   'glue.account-id'=''
   );
   ```
   
   The stacktrace confirms the behavior explained for the integration test:  in 
the call stack of creating a table, `S3FileIO` is initialized and 
`LakeFormationAwsClientFactory.isTableRegisteredWithLakeFormation` is called 
before any Glue table exists.
   
   Stacktrace:
   ```
   Caused by: 
software.amazon.awssdk.services.glue.model.EntityNotFoundException: Entity Not 
Found (Service: Glue, Status Code: 400, Request ID: 
efa126e5-e9d5-41f8-bb5a-c8d30bd166eb)
   at 
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
 ~[?:?]
   at 
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
 ~[?:?]
   at 
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
 ~[?:?]
   at 
software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:41)
 ~[?:?]
   at 
software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:40)
 ~[?:?]
   at 

Re: [I] byte and short types in spark no longer auto coerce to int32 [iceberg]

2024-04-25 Thread via GitHub


jkolash commented on issue #10225:
URL: https://github.com/apache/iceberg/issues/10225#issuecomment-2078200948

   hmm I think this may be related to the spark version we are using as I 
tested on spark-3.4.1 and didn't see the issue but see it on our 3.4.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] byte and short types in spark no longer auto coerce to int32 [iceberg]

2024-04-25 Thread via GitHub


jkolash commented on issue #10225:
URL: https://github.com/apache/iceberg/issues/10225#issuecomment-2078157080

   If/when there is a PR I can test it on my side. where I have exhaustive type 
testing.
   
   ```
   java.lang.ClassCastException: class java.lang.Byte cannot be cast to class 
java.lang.Integer (java.lang.Byte and java.lang.Integer are in module java.base 
of loader 'bootstrap')
at org.apache.iceberg.parquet.ColumnWriter$2.write(ColumnWriter.java:39)
at 
org.apache.iceberg.parquet.ParquetValueWriters$PrimitiveWriter.write(ParquetValueWriters.java:131)
at 
org.apache.iceberg.parquet.ParquetValueWriters$OptionWriter.write(ParquetValueWriters.java:375)
at 
org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:608)
   ```
   is the error I get


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] byte and short types in spark no longer auto coerce to int32 [iceberg]

2024-04-25 Thread via GitHub


jkolash commented on issue #10225:
URL: https://github.com/apache/iceberg/issues/10225#issuecomment-2078150463

   ```kotlin
   val df = spark.sql("""select inline(array(from_json('{"b":82}', 
'struct')))""")
   df.show()
   ```
   
   ```
   +---+
   |  b|
   +---+
   | 82|
   +---+
   ```
   
   
   ```kotlin
   df.writeTo("staging.iceberg_table_3")
   .using("iceberg")
   .createOrReplace()
   ```
   
   using this spark config
   ```
   conf.set("spark.sql.catalog.staging", 
"org.apache.iceberg.spark.SparkCatalog")
   .set("spark.sql.catalog.staging.type", "hadoop")
   .set("spark.sql.catalog.staging.warehouse", 
"/tmp/random_directory");
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Core: Avro writers use BlockingBinaryEncoder to enable array/map size calculations. [iceberg]

2024-04-25 Thread via GitHub


Fokko commented on PR #8625:
URL: https://github.com/apache/iceberg/pull/8625#issuecomment-2078143286

   Yes, this is still top of mind! I'm going to see what's needed and make sure 
that it will be included in the next Avro release!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Build: Test Apache Avro 1.11.2 [iceberg]

2024-04-25 Thread via GitHub


Fokko closed pull request #7922: Build: Test Apache Avro 1.11.2
URL: https://github.com/apache/iceberg/pull/7922


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: glue table creation with some docs on testing [iceberg-go]

2024-04-25 Thread via GitHub


wolfeidau commented on PR #59:
URL: https://github.com/apache/iceberg-go/pull/59#issuecomment-2078141567

   @zeroshade I will dig into this in the next few days, I have implemented the 
builder for tables but need a way to have a generic path builder. 
   
   Given I have time off next week I can dig into it more. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Support partial deletes [iceberg-python]

2024-04-25 Thread via GitHub


MehulBatra commented on code in PR #569:
URL: https://github.com/apache/iceberg-python/pull/569#discussion_r1580084812


##
pyiceberg/table/__init__.py:
##
@@ -235,6 +242,10 @@ class TableProperties:
 WRITE_PARTITION_SUMMARY_LIMIT = "write.summary.partition-limit"
 WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT = 0
 
+DELETE_MODE = "write.delete.mode"
+DELETE_MODE_COPY_ON_WRITE = "merge-on-read"
+DELETE_MODE_MERGE_ON_READ = "copy-on-write"

Review Comment:
   @Fokko I believe this is a mismatch, got written this way mistakenly, please 
correct me If I am overlooking it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Support partial deletes [iceberg-python]

2024-04-25 Thread via GitHub


MehulBatra commented on code in PR #569:
URL: https://github.com/apache/iceberg-python/pull/569#discussion_r1580084812


##
pyiceberg/table/__init__.py:
##
@@ -235,6 +242,10 @@ class TableProperties:
 WRITE_PARTITION_SUMMARY_LIMIT = "write.summary.partition-limit"
 WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT = 0
 
+DELETE_MODE = "write.delete.mode"
+DELETE_MODE_COPY_ON_WRITE = "merge-on-read"
+DELETE_MODE_MERGE_ON_READ = "copy-on-write"

Review Comment:
   @Fokko I believe this is a mismatch, got written this way mistakenly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Support partial deletes [iceberg-python]

2024-04-25 Thread via GitHub


MehulBatra commented on code in PR #569:
URL: https://github.com/apache/iceberg-python/pull/569#discussion_r1580084812


##
pyiceberg/table/__init__.py:
##
@@ -235,6 +242,10 @@ class TableProperties:
 WRITE_PARTITION_SUMMARY_LIMIT = "write.summary.partition-limit"
 WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT = 0
 
+DELETE_MODE = "write.delete.mode"
+DELETE_MODE_COPY_ON_WRITE = "merge-on-read"
+DELETE_MODE_MERGE_ON_READ = "copy-on-write"

Review Comment:
   @Fokko I believe this is a mismatch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] byte and short types in spark no longer auto coerce to int32 [iceberg]

2024-04-25 Thread via GitHub


jkolash commented on issue #10225:
URL: https://github.com/apache/iceberg/issues/10225#issuecomment-2078099689

   @Fokko Thanks for the quick response I will try to write up a code snippet 
reproducing the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] byte and short types in spark no longer auto coerce to int32 [iceberg]

2024-04-25 Thread via GitHub


Fokko commented on issue #10225:
URL: https://github.com/apache/iceberg/issues/10225#issuecomment-2078097905

   Hey @jkolash Thanks for reporting this. The behavior should stay the same, 
due to the logic here:
   
   
https://github.com/apache/iceberg/pull/9440/files#diff-8ac59cbdbcc60cc0c558051dfe8dcf9ffeb4c66379e48c49867a93ee43e27528R224-R236
   
   What's the error that you're seeing? This will help me to reproduce the 
issue on my end and see if we can come up with a fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement manifest filtering in `TableScan` [iceberg-rust]

2024-04-25 Thread via GitHub


sdd commented on code in PR #323:
URL: https://github.com/apache/iceberg-rust/pull/323#discussion_r1580049448


##
crates/iceberg/src/scan.rs:
##
@@ -158,8 +196,24 @@ impl TableScan {
 .await?;
 
 // Generate data file stream
-let mut entries = iter(manifest_list.entries());
-while let Some(entry) = entries.next().await {
+for entry in manifest_list.entries() {
+// If this scan has a filter, check the partition evaluator 
cache for an existing
+// PartitionEvaluator that matches this manifest's partition 
spec ID.
+// Use one from the cache if there is one. If not, create one, 
put it in
+// the cache, and take a reference to it.
+#[allow(clippy::map_entry)]
+if let Some(filter) = filter.as_ref() {
+if 
!manifest_evaluator_cache.contains_key(_spec_id) {

Review Comment:
   I also prefer the entry API, but `Self::create_manifest_evaluator` returns a 
`Result` rather than a `ManifestEvaluator`. The `?` operator 
can't be used inside the closure passed to the Entry API's 
[or_insert_with_key](https://doc.rust-lang.org/std/collections/hash_map/enum.Entry.html#method.or_insert_with_key)
 method, and this was the least ugly way I found of handling that. If you know 
of a nice way of using the `or_insert_with_key` with en entry_generating 
function that returns a `Result`, please share! :-) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement manifest filtering in `TableScan` [iceberg-rust]

2024-04-25 Thread via GitHub


sdd commented on code in PR #323:
URL: https://github.com/apache/iceberg-rust/pull/323#discussion_r1580039092


##
crates/iceberg/src/scan.rs:
##
@@ -186,6 +240,25 @@ impl TableScan {
 .boxed())
 }
 
+fn create_manifest_evaluator(
+id: i32,
+schema: SchemaRef,
+table_metadata: Arc,
+case_sensitive: bool,
+filter: ,
+) -> crate::Result {
+let bound_predicate = filter.bind(schema.clone(), 
case_sensitive).unwrap();

Review Comment:
   sorry, thought I'd pushed a fix for this but it is on the other PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Implement manifest filtering in `TableScan` [iceberg-rust]

2024-04-25 Thread via GitHub


sdd commented on code in PR #323:
URL: https://github.com/apache/iceberg-rust/pull/323#discussion_r1580038792


##
crates/iceberg/src/scan.rs:
##
@@ -186,6 +240,25 @@ impl TableScan {
 .boxed())
 }
 
+fn create_manifest_evaluator(
+id: i32,
+schema: SchemaRef,
+table_metadata: Arc,
+case_sensitive: bool,
+filter: ,
+) -> crate::Result {
+let bound_predicate = filter.bind(schema.clone(), 
case_sensitive).unwrap();
+
+let partition_spec = table_metadata.partition_spec_by_id(id).unwrap();

Review Comment:
   Sorry, old code here. Will push an update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] byte and short types in spark no longer auto coerce to int32 [iceberg]

2024-04-25 Thread via GitHub


jkolash opened a new issue, #10225:
URL: https://github.com/apache/iceberg/issues/10225

   ### Apache Iceberg version
   
   1.5.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 
   
   The removal of the code
   
   ```java
private static PrimitiveWriter ints(DataType type, ColumnDescriptor 
desc) {
   if (type instanceof ByteType) {
 return ParquetValueWriters.tinyints(desc);
   } else if (type instanceof ShortType) {
 return ParquetValueWriters.shorts(desc);
   }
   return ParquetValueWriters.ints(desc);
 }
   ```
   In this PR https://github.com/apache/iceberg/pull/9440/files
   
   broke this auto-coercion
   
   Is there a reason for the removal of byte short support auto coercing to 
int? before on iceberg 1.4.x we were able to materialize this into iceberg just 
fine but now on iceberg 1.5.x it doesn't work
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1580024955


##
crates/integrations/datafusion/src/table.rs:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::datasource::TableProvider;
+use iceberg::{table::Table, Catalog, NamespaceIdent, Result, TableIdent};
+
+pub(crate) struct IcebergTableProvider {
+_inner: Table,
+}
+
+impl IcebergTableProvider {
+pub(crate) async fn try_new(
+client: Arc,
+namespace: NamespaceIdent,
+name: impl Into,
+) -> Result {
+let name = name.into();
+let ident = TableIdent::new(namespace, name);
+let table = client.load_table().await?;
+
+Ok(IcebergTableProvider { _inner: table })
+}
+}
+
+impl TableProvider for IcebergTableProvider {
+fn as_any() ->  std::any::Any {
+self
+}
+
+fn schema() -> datafusion::arrow::datatypes::SchemaRef {
+todo!()
+}
+
+fn table_type() -> datafusion::datasource::TableType {
+todo!()
+}
+
+fn scan<'life0, 'life1, 'life2, 'life3, 'async_trait>(
+&'life0 self,
+_state: &'life1 datafusion::execution::context::SessionState,
+_projection: Option<&'life2 Vec>,
+_filters: &'life3 [datafusion::prelude::Expr],
+_limit: Option,
+) -> core::pin::Pin<
+Box<
+dyn core::future::Future<
+Output = datafusion::error::Result<
+Arc,
+>,
+> + core::marker::Send
++ 'async_trait,
+>,
+>
+where
+'life0: 'async_trait,
+'life1: 'async_trait,
+'life2: 'async_trait,
+'life3: 'async_trait,
+Self: 'async_trait,
+{
+todo!()

Review Comment:
   ...I think this has to be done (PTAL @liurenjie1024):
   
   - impl asyn fn scan() on IcebergTableProvider which has to return an 
`Arc`
 - therefore we need to: impl the trait ExecutionPlan for IcebergTableScan 
which is basically a wrapper around an actual PhysicalPlan e.g. ParquetExec 
   - therefore we need to: create an PhysicalPlan when building our 
IcebergTableScan
   
   I think creating the PhysicalPlan and providing all the input like 
FileScanConfig will be the majority of the work and I'm not sure we can provide 
all of this yet...but well see...
   
   reference from delta-rs:
   - 
[DeltaScan](https://github.com/delta-io/delta-rs/blob/main/crates/core/src/delta_datafusion/mod.rs#L790-L798)
   - [ExecutionPlan for 
DeltaScan](https://github.com/delta-io/delta-rs/blob/main/crates/core/src/delta_datafusion/mod.rs#L814-L881)
   - [building 
DeltaScan](https://github.com/delta-io/delta-rs/blob/main/crates/core/src/delta_datafusion/mod.rs#L507-L651)
 and the actual PhysicalPlan // 
[ParquetExec](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: FlinkFileIO implementation [iceberg]

2024-04-25 Thread via GitHub


pvary commented on code in PR #10151:
URL: https://github.com/apache/iceberg/pull/10151#discussion_r1580019899


##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/FlinkFileIO.java:
##
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Map;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.stream.Collectors;
+import org.apache.flink.core.fs.FileStatus;
+import org.apache.flink.core.fs.FileSystem;
+import org.apache.flink.core.fs.Path;
+import org.apache.iceberg.io.BulkDeletionFailureException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.FileInfo;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.io.SupportsBulkOperations;
+import org.apache.iceberg.io.SupportsPrefixOperations;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.util.SerializableMap;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class FlinkFileIO implements FileIO, SupportsPrefixOperations, 
SupportsBulkOperations {
+  private static final Logger LOG = LoggerFactory.getLogger(FlinkFileIO.class);
+  private static final String DELETE_FILE_PARALLELISM = 
"iceberg.hadoop.delete-file-parallelism";
+  private static final String DELETE_FILE_POOL_NAME = 
"iceberg-hadoopfileio-delete";
+  private static final int DELETE_RETRY_ATTEMPTS = 3;
+  private static final int DEFAULT_DELETE_CORE_MULTIPLE = 4;
+  private static volatile ExecutorService executorService;
+  private SerializableMap properties = 
SerializableMap.copyOf(ImmutableMap.of());
+
+  @Override
+  public InputFile newInputFile(String path) {
+return new FlinkInputFile(new Path(path));
+  }
+
+  @Override
+  public InputFile newInputFile(String path, long length) {
+return new FlinkInputFile(new Path(path), length);
+  }
+
+  @Override
+  public OutputFile newOutputFile(String path) {
+return new FlinkOutputFile(new Path(path));
+  }
+
+  @Override
+  public void deleteFile(String path) {
+Path toDelete = new Path(path);
+try {
+  toDelete.getFileSystem().delete(toDelete, false /* not recursive */);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to delete file: 
%s", path), e);
+}
+  }
+
+  @Override
+  public Iterable listPrefix(String prefix) {
+LOG.debug("Listing {}", prefix);
+Path prefixToList = new Path(prefix);
+try {
+  return listPrefix(prefixToList.getFileSystem(), prefixToList);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to listing prefix: 
%s", prefix), e);
+}
+  }
+
+  @Override
+  public void deletePrefix(String prefix) {
+Path prefixToDelete = new Path(prefix);
+
+try {
+  prefixToDelete.getFileSystem().delete(prefixToDelete, true /* recursive 
*/);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to delete prefix: 
%s", prefix), e);
+}
+  }
+
+  @Override
+  public void deleteFiles(Iterable pathsToDelete) throws 
BulkDeletionFailureException {
+AtomicInteger failureCount = new AtomicInteger(0);
+Tasks.foreach(pathsToDelete)
+.executeWith(executorService())
+.retry(DELETE_RETRY_ATTEMPTS)
+.stopRetryOn(FileNotFoundException.class)
+.suppressFailureWhenFinished()
+.onFailure(
+(f, e) -> {
+  LOG.error("Failure during bulk delete on file: {} ", f, e);
+  failureCount.incrementAndGet();
+})
+.run(this::deleteFile);

Review Comment:
   Why?
   For the record, the implementation for S3 is based on aws packages: 

Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


simonvandel commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579945951


##
crates/integrations/datafusion/src/catalog.rs:
##
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, sync::Arc};
+
+use dashmap::DashMap;
+use datafusion::catalog::{schema::SchemaProvider, CatalogProvider};
+use futures::future::try_join_all;
+use iceberg::{Catalog, NamespaceIdent, Result};
+
+use crate::schema::IcebergSchemaProvider;
+
+/// Provides an interface to manage and access multiple schemas
+/// within an Iceberg [`Catalog`].
+///
+/// Acts as a centralized catalog provider that aggregates
+/// multiple [`SchemaProvider`], each associated with distinct namespaces.
+pub struct IcebergCatalogProvider {
+/// A concurrent `HashMap` where keys are namespace names
+/// and values are dynamic references to objects implementing the
+/// [`SchemaProvider`] trait.
+schemas: DashMap>,

Review Comment:
   Cool, well then I think a std Hashmap should be fine, since no actual 
concurrent operations need to happen



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579940862


##
crates/integrations/datafusion/src/catalog.rs:
##
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, sync::Arc};
+
+use dashmap::DashMap;
+use datafusion::catalog::{schema::SchemaProvider, CatalogProvider};
+use futures::future::try_join_all;
+use iceberg::{Catalog, NamespaceIdent, Result};
+
+use crate::schema::IcebergSchemaProvider;
+
+/// Provides an interface to manage and access multiple schemas
+/// within an Iceberg [`Catalog`].
+///
+/// Acts as a centralized catalog provider that aggregates
+/// multiple [`SchemaProvider`], each associated with distinct namespaces.
+pub struct IcebergCatalogProvider {
+/// A concurrent `HashMap` where keys are namespace names
+/// and values are dynamic references to objects implementing the
+/// [`SchemaProvider`] trait.
+schemas: DashMap>,

Review Comment:
   > Will the map ever be mutated?
   > It might be, I was just not seeing it in this PR.
   
   No, I don't think it will be mutated after we've built the IcebergProvider.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


simonvandel commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579930294


##
crates/integrations/datafusion/src/catalog.rs:
##
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, sync::Arc};
+
+use dashmap::DashMap;
+use datafusion::catalog::{schema::SchemaProvider, CatalogProvider};
+use futures::future::try_join_all;
+use iceberg::{Catalog, NamespaceIdent, Result};
+
+use crate::schema::IcebergSchemaProvider;
+
+/// Provides an interface to manage and access multiple schemas
+/// within an Iceberg [`Catalog`].
+///
+/// Acts as a centralized catalog provider that aggregates
+/// multiple [`SchemaProvider`], each associated with distinct namespaces.
+pub struct IcebergCatalogProvider {
+/// A concurrent `HashMap` where keys are namespace names
+/// and values are dynamic references to objects implementing the
+/// [`SchemaProvider`] trait.
+schemas: DashMap>,

Review Comment:
   Will the map ever be mutated?
   It might be, I was just not seeing it in this PR.
   
   If not, then a std Hashmap seems sufficient to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579919751


##
crates/integrations/datafusion/src/catalog.rs:
##
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, sync::Arc};
+
+use dashmap::DashMap;
+use datafusion::catalog::{schema::SchemaProvider, CatalogProvider};
+use futures::future::try_join_all;
+use iceberg::{Catalog, NamespaceIdent, Result};
+
+use crate::schema::IcebergSchemaProvider;
+
+/// Provides an interface to manage and access multiple schemas
+/// within an Iceberg [`Catalog`].
+///
+/// Acts as a centralized catalog provider that aggregates
+/// multiple [`SchemaProvider`], each associated with distinct namespaces.
+pub struct IcebergCatalogProvider {
+/// A concurrent `HashMap` where keys are namespace names
+/// and values are dynamic references to objects implementing the
+/// [`SchemaProvider`] trait.
+schemas: DashMap>,

Review Comment:
   the datafusion docs provide a link to a reference implementation by 
[delta-rs](https://github.com/delta-io/delta-rs/blob/951436ecec476ce65b5ed3b58b50fb0846ca7b91/crates/deltalake-core/src/data_catalog/unity/datafusion.rs#L21),
 that uses a DashMap as well. 
   
   My understanding based on the 
[docs](https://docs.rs/datafusion/latest/datafusion/catalog/trait.CatalogProvider.html#implementing-remote-catalogs)
 is that datafusion might share the `CatalogProvider` on multiple threads 
inside an Arc; thus concurrently accessing our "cached" schemas, etc.? Perhaps 
someone else can clarify this?
   
   > [...] The pattern that DataFusion itself uses to plan SQL queries is to 
walk over the query to [find all schema / table references in an async 
function](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionState.html#method.resolve_table_references),
 performing required remote catalog in parallel, and then plans the query using 
that snapshot. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


simonvandel commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579897701


##
crates/integrations/datafusion/src/catalog.rs:
##
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, sync::Arc};
+
+use dashmap::DashMap;
+use datafusion::catalog::{schema::SchemaProvider, CatalogProvider};
+use futures::future::try_join_all;
+use iceberg::{Catalog, NamespaceIdent, Result};
+
+use crate::schema::IcebergSchemaProvider;
+
+/// Provides an interface to manage and access multiple schemas
+/// within an Iceberg [`Catalog`].
+///
+/// Acts as a centralized catalog provider that aggregates
+/// multiple [`SchemaProvider`], each associated with distinct namespaces.
+pub struct IcebergCatalogProvider {
+/// A concurrent `HashMap` where keys are namespace names
+/// and values are dynamic references to objects implementing the
+/// [`SchemaProvider`] trait.
+schemas: DashMap>,

Review Comment:
   Would a std Hashmap suffice? I don't see the concurrency properties of 
DashMap being used.
   
   I might be missing something though. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579883806


##
crates/integrations/datafusion/src/table.rs:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::datasource::TableProvider;
+use iceberg::{table::Table, Catalog, NamespaceIdent, Result, TableIdent};
+
+pub(crate) struct IcebergTableProvider {
+_inner: Table,
+}
+
+impl IcebergTableProvider {
+pub(crate) async fn try_new(
+client: Arc,
+namespace: NamespaceIdent,
+name: impl Into,
+) -> Result {
+let name = name.into();
+let ident = TableIdent::new(namespace, name);
+let table = client.load_table().await?;
+
+Ok(IcebergTableProvider { _inner: table })
+}
+}
+
+impl TableProvider for IcebergTableProvider {
+fn as_any() ->  std::any::Any {
+self
+}
+
+fn schema() -> datafusion::arrow::datatypes::SchemaRef {
+todo!()

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: Fix bounded source state restore record duplication [iceberg]

2024-04-25 Thread via GitHub


stevenzwu commented on code in PR #10208:
URL: https://github.com/apache/iceberg/pull/10208#discussion_r1579871219


##
flink/v1.19/flink/src/test/java/org/apache/iceberg/flink/source/TestIcebergSourceFailover.java:
##
@@ -40,24 +44,27 @@
 import org.apache.iceberg.FileFormat;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
+import org.apache.iceberg.TableProperties;
 import org.apache.iceberg.data.GenericAppenderHelper;
 import org.apache.iceberg.data.RandomGenericData;
 import org.apache.iceberg.data.Record;
 import org.apache.iceberg.flink.FlinkConfigOptions;
+import org.apache.iceberg.flink.FlinkReadOptions;
 import org.apache.iceberg.flink.HadoopTableResource;
 import org.apache.iceberg.flink.SimpleDataUtil;
 import org.apache.iceberg.flink.TestFixtures;
 import org.apache.iceberg.flink.sink.FlinkSink;
 import org.apache.iceberg.flink.source.assigner.SimpleSplitAssignerFactory;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.awaitility.Awaitility;
 import org.junit.ClassRule;
 import org.junit.Rule;
 import org.junit.Test;
 import org.junit.rules.TemporaryFolder;
 
 public class TestIcebergSourceFailover {
 
-  private static final int PARALLELISM = 4;
+  private static final int PARALLELISM = 2;

Review Comment:
   can we add the comment to the code? otherwise, reader may thought the 
parallelism is purely arbitrary (>1)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]

2024-04-25 Thread via GitHub


epgif commented on PR #9008:
URL: https://github.com/apache/iceberg/pull/9008#issuecomment-2077790169

   > @epgif can you please address the test failures?
   
   Done.  Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Flink: FlinkFileIO implementation [iceberg]

2024-04-25 Thread via GitHub


stevenzwu commented on code in PR #10151:
URL: https://github.com/apache/iceberg/pull/10151#discussion_r1579860068


##
flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/FlinkFileIO.java:
##
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.flink;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Map;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.stream.Collectors;
+import org.apache.flink.core.fs.FileStatus;
+import org.apache.flink.core.fs.FileSystem;
+import org.apache.flink.core.fs.Path;
+import org.apache.iceberg.io.BulkDeletionFailureException;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.io.FileInfo;
+import org.apache.iceberg.io.InputFile;
+import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.io.SupportsBulkOperations;
+import org.apache.iceberg.io.SupportsPrefixOperations;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
+import org.apache.iceberg.util.SerializableMap;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.util.ThreadPools;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class FlinkFileIO implements FileIO, SupportsPrefixOperations, 
SupportsBulkOperations {
+  private static final Logger LOG = LoggerFactory.getLogger(FlinkFileIO.class);
+  private static final String DELETE_FILE_PARALLELISM = 
"iceberg.hadoop.delete-file-parallelism";
+  private static final String DELETE_FILE_POOL_NAME = 
"iceberg-hadoopfileio-delete";
+  private static final int DELETE_RETRY_ATTEMPTS = 3;
+  private static final int DEFAULT_DELETE_CORE_MULTIPLE = 4;
+  private static volatile ExecutorService executorService;
+  private SerializableMap properties = 
SerializableMap.copyOf(ImmutableMap.of());
+
+  @Override
+  public InputFile newInputFile(String path) {
+return new FlinkInputFile(new Path(path));
+  }
+
+  @Override
+  public InputFile newInputFile(String path, long length) {
+return new FlinkInputFile(new Path(path), length);
+  }
+
+  @Override
+  public OutputFile newOutputFile(String path) {
+return new FlinkOutputFile(new Path(path));
+  }
+
+  @Override
+  public void deleteFile(String path) {
+Path toDelete = new Path(path);
+try {
+  toDelete.getFileSystem().delete(toDelete, false /* not recursive */);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to delete file: 
%s", path), e);
+}
+  }
+
+  @Override
+  public Iterable listPrefix(String prefix) {
+LOG.debug("Listing {}", prefix);
+Path prefixToList = new Path(prefix);
+try {
+  return listPrefix(prefixToList.getFileSystem(), prefixToList);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to listing prefix: 
%s", prefix), e);
+}
+  }
+
+  @Override
+  public void deletePrefix(String prefix) {
+Path prefixToDelete = new Path(prefix);
+
+try {
+  prefixToDelete.getFileSystem().delete(prefixToDelete, true /* recursive 
*/);
+} catch (IOException e) {
+  throw new UncheckedIOException(String.format("Failed to delete prefix: 
%s", prefix), e);
+}
+  }
+
+  @Override
+  public void deleteFiles(Iterable pathsToDelete) throws 
BulkDeletionFailureException {
+AtomicInteger failureCount = new AtomicInteger(0);
+Tasks.foreach(pathsToDelete)
+.executeWith(executorService())
+.retry(DELETE_RETRY_ATTEMPTS)
+.stopRetryOn(FileNotFoundException.class)
+.suppressFailureWhenFinished()
+.onFailure(
+(f, e) -> {
+  LOG.error("Failure during bulk delete on file: {} ", f, e);
+  failureCount.incrementAndGet();
+})
+.run(this::deleteFile);

Review Comment:
   if it is S3 storage, HadoopFileIO shouldn't be used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, 

Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579827548


##
crates/integrations/datafusion/Cargo.toml:
##
@@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-integrations-datafusion"
+version = { workspace = true }
+edition = { workspace = true }
+homepage = { workspace = true }
+rust-version = { workspace = true }
+
+categories = ["database"]
+description = "Apache Iceberg Integrations Datafusion"
+repository = { workspace = true }
+license = { workspace = true }
+keywords = ["iceberg", "integrations", "datfusion"]
+
+[features]
+default = []
+datafusion = ["dep:datafusion"]

Review Comment:
   > I believe we should have `integrations/datafusion` and 
`integrations/polars`. So this crate `iceberg-integrations-datafusion` should 
enable `datafusion` by default, and doesn't need switch.
   
   @Xuanwo 
   thank you - then I guess my last commit fixed it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


tshauck commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579818649


##
crates/integrations/datafusion/Cargo.toml:
##
@@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-integrations-datafusion"
+version = { workspace = true }
+edition = { workspace = true }
+homepage = { workspace = true }
+rust-version = { workspace = true }
+
+categories = ["database"]
+description = "Apache Iceberg Integrations Datafusion"
+repository = { workspace = true }
+license = { workspace = true }
+keywords = ["iceberg", "integrations", "datfusion"]
+
+[features]
+default = []
+datafusion = ["dep:datafusion"]

Review Comment:
   Sorry for muddying the waters. I tried to suggest not going the feature 
route based 
[here](https://github.com/apache/iceberg-rust/pull/324#discussion_r1562767139) 
(near the end). I was a little confused by this project having all dependencies 
in the workspace's Cargo.toml even if only used by one crate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#issuecomment-2077663647

   > I've left some comment to improve, but it looks great! I'll invite 
datafusion community to help review.
   
   @liurenjie1024 
   Thanks for the review.
   I fixed most of the basic issues regarding structure, naming, async-trait 
etc.
   
   I'll work on the missing impls over the next couple of days - and see how 
far we can get with the current state of iceberg-rust.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Update 1.5.1 release notes [iceberg]

2024-04-25 Thread via GitHub


nastra merged PR #10224:
URL: https://github.com/apache/iceberg/pull/10224


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Update 1.5.1 release notes [iceberg]

2024-04-25 Thread via GitHub


nastra commented on PR #10224:
URL: https://github.com/apache/iceberg/pull/10224#issuecomment-2077658431

   LGTM, thanks @amogh-jahagirdar 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579755826


##
crates/integrations/datafusion/src/catalog.rs:
##
@@ -0,0 +1,73 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::{any::Any, sync::Arc};
+
+use dashmap::DashMap;
+use datafusion::catalog::{schema::SchemaProvider, CatalogProvider};
+use futures::future::try_join_all;
+use iceberg::{Catalog, NamespaceIdent, Result};
+
+use crate::schema::IcebergSchemaProvider;
+
+pub struct IcebergCatalogProvider {
+schemas: DashMap>,
+}
+
+impl IcebergCatalogProvider {
+pub async fn try_new(client: Arc) -> Result {
+let schema_names: Vec = client
+.list_namespaces(None)
+.await?
+.iter()
+.flat_map(|ns| ns.as_ref().clone())

Review Comment:
   In this case; we provide the catalog (e.g. hive metastore) and we don't 
support nested namespaces. 
   
   Even if datafusion allows nested namespaces; once we hit our catalog 
provider we'd only return an empty list. 
   So I don't think we need to perform any checks here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-25 Thread via GitHub


viirya commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2077637850

   Thanks @liurenjie1024. The roadmaps doc looks good to me. I added a few 
items under DataFusion integration. Feel free to modify it. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


Xuanwo commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579739285


##
crates/integrations/datafusion/Cargo.toml:
##
@@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-integrations-datafusion"
+version = { workspace = true }
+edition = { workspace = true }
+homepage = { workspace = true }
+rust-version = { workspace = true }
+
+categories = ["database"]
+description = "Apache Iceberg Integrations Datafusion"
+repository = { workspace = true }
+license = { workspace = true }
+keywords = ["iceberg", "integrations", "datfusion"]
+
+[features]
+default = []
+datafusion = ["dep:datafusion"]

Review Comment:
   I believe we should have `integrations/datafusion` and 
`integrations/polars`. So this crate `iceberg-integrations-datafusion` should 
enable `datafusion` by default, and doesn't need switch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


viirya commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579728551


##
crates/integrations/datafusion/README.md:
##
@@ -0,0 +1,22 @@
+
+
+# Apache Iceberg Integrations
+
+This crate contains the official Native Rust implementation of Apache Iceberg 
Integrations.

Review Comment:
   For above refined one, maybe Apache DataFusion and Apache Iceberg



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


viirya commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579728551


##
crates/integrations/datafusion/README.md:
##
@@ -0,0 +1,22 @@
+
+
+# Apache Iceberg Integrations
+
+This crate contains the official Native Rust implementation of Apache Iceberg 
Integrations.

Review Comment:
   Apache DataFusion and Apache Iceberg



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579713512


##
crates/integrations/datafusion/Cargo.toml:
##
@@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-integrations-datafusion"
+version = { workspace = true }
+edition = { workspace = true }
+homepage = { workspace = true }
+rust-version = { workspace = true }
+
+categories = ["database"]
+description = "Apache Iceberg Integrations Datafusion"
+repository = { workspace = true }
+license = { workspace = true }
+keywords = ["iceberg", "integrations", "datfusion"]
+
+[features]
+default = []
+datafusion = ["dep:datafusion"]

Review Comment:
   Here I'm probably even more confused; I tried this based on the review by 
@tshauck - but I have no experience with this and I'm not sure how to structure 
the crate, in order to allow a possible end-user to download only the 
integration dependency he needs e.g. iceberg-datafusion or iceberg-polars. etc.
   
   So here I'm hoping for feedback and help.
   
   EDIT:
   I removed the feature flag for now. So the crate simply resides in 
/crates/integrations/datafusion
   We can figure out later how to deal with features - or whats best practice 
here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Update 1.5.1 release notes [iceberg]

2024-04-25 Thread via GitHub


amogh-jahagirdar commented on code in PR #10224:
URL: https://github.com/apache/iceberg/pull/10224#discussion_r1579724014


##
site/docs/releases.md:
##
@@ -73,6 +73,21 @@ To add a dependency on Iceberg in Maven, add the following 
to your `pom.xml`:
 
 ```
 
+### 1.5.1 release
+Apache Iceberg 1.5.1 was released on April 25, 2024.
+
+The 1.5.1 patch release contains fixes for JDBC Catalog, fixing a FileIO 
regression 
+where an extra head request was performed when reading manifests and REST 
client retries
+for 5xx failures. The release also includes fixes for system function pushdown 
for CoW tables
+in Spark 3.4 and 3.5.
+
+- Core: Fix FileIO regression where extra head request was performed when 
reading manifests ([\#10114](https://github.com/apache/iceberg/pull/10114))
+- Core: Mark 502 and 504 HTTP status codes as retryable in REST 
Client([\#10113](https://github.com/apache/iceberg/pull/10113))
+- Core: Fix JDBC Catalog table commits when migrating from V0 to V1 
schema([\#10152](https://github.com/apache/iceberg/pull/10152))
+- Core: Fix JDBC Catalog namespaces SQL to use the proper escape character 
which generalizes to different database backends like Postgres and 
MySQL([\#10167](https://github.com/apache/iceberg/pull/10167))
+- Spark: Fix system function pushdown in CoW row level commands for Spark 
3.5([\#9873](https://github.com/apache/iceberg/pull/9873))

Review Comment:
   Sure done! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Update 1.5.1 release notes [iceberg]

2024-04-25 Thread via GitHub


nastra commented on code in PR #10224:
URL: https://github.com/apache/iceberg/pull/10224#discussion_r1579716696


##
site/docs/releases.md:
##
@@ -73,6 +73,21 @@ To add a dependency on Iceberg in Maven, add the following 
to your `pom.xml`:
 
 ```
 
+### 1.5.1 release
+Apache Iceberg 1.5.1 was released on April 25, 2024.
+
+The 1.5.1 patch release contains fixes for JDBC Catalog, fixing a FileIO 
regression 
+where an extra head request was performed when reading manifests and REST 
client retries
+for 5xx failures. The release also includes fixes for system function pushdown 
for CoW tables
+in Spark 3.4 and 3.5.
+
+- Core: Fix FileIO regression where extra head request was performed when 
reading manifests ([\#10114](https://github.com/apache/iceberg/pull/10114))
+- Core: Mark 502 and 504 HTTP status codes as retryable in REST 
Client([\#10113](https://github.com/apache/iceberg/pull/10113))
+- Core: Fix JDBC Catalog table commits when migrating from V0 to V1 
schema([\#10152](https://github.com/apache/iceberg/pull/10152))
+- Core: Fix JDBC Catalog namespaces SQL to use the proper escape character 
which generalizes to different database backends like Postgres and 
MySQL([\#10167](https://github.com/apache/iceberg/pull/10167))
+- Spark: Fix system function pushdown in CoW row level commands for Spark 
3.5([\#9873](https://github.com/apache/iceberg/pull/9873))

Review Comment:
   minor: we might want to split this into `Core` / `Spark` sections, similar 
to it's done right in the section below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579715245


##
crates/integrations/datafusion/src/table.rs:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::datasource::TableProvider;
+use iceberg::{table::Table, Catalog, NamespaceIdent, Result, TableIdent};
+
+pub(crate) struct IcebergTableProvider {
+_inner: Table,
+}
+
+impl IcebergTableProvider {
+pub(crate) async fn try_new(
+client: Arc,
+namespace: NamespaceIdent,
+name: impl Into,
+) -> Result {
+let name = name.into();
+let ident = TableIdent::new(namespace, name);
+let table = client.load_table().await?;
+
+Ok(IcebergTableProvider { _inner: table })
+}
+}
+
+impl TableProvider for IcebergTableProvider {
+fn as_any() ->  std::any::Any {
+self
+}
+
+fn schema() -> datafusion::arrow::datatypes::SchemaRef {
+todo!()

Review Comment:
   yes; i believe we merged this 4 days ago; so i can take a look at this again 
now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579715245


##
crates/integrations/datafusion/src/table.rs:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::datasource::TableProvider;
+use iceberg::{table::Table, Catalog, NamespaceIdent, Result, TableIdent};
+
+pub(crate) struct IcebergTableProvider {
+_inner: Table,
+}
+
+impl IcebergTableProvider {
+pub(crate) async fn try_new(
+client: Arc,
+namespace: NamespaceIdent,
+name: impl Into,
+) -> Result {
+let name = name.into();
+let ident = TableIdent::new(namespace, name);
+let table = client.load_table().await?;
+
+Ok(IcebergTableProvider { _inner: table })
+}
+}
+
+impl TableProvider for IcebergTableProvider {
+fn as_any() ->  std::any::Any {
+self
+}
+
+fn schema() -> datafusion::arrow::datatypes::SchemaRef {
+todo!()

Review Comment:
   yes; i believe we merged this 2 days ago; so i can take a look at this again 
now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


marvinlanhenke commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579713512


##
crates/integrations/datafusion/Cargo.toml:
##
@@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-integrations-datafusion"
+version = { workspace = true }
+edition = { workspace = true }
+homepage = { workspace = true }
+rust-version = { workspace = true }
+
+categories = ["database"]
+description = "Apache Iceberg Integrations Datafusion"
+repository = { workspace = true }
+license = { workspace = true }
+keywords = ["iceberg", "integrations", "datfusion"]
+
+[features]
+default = []
+datafusion = ["dep:datafusion"]

Review Comment:
   Here I'm probably even more confused; I tried this based on the review by 
@tshauck - but I have no experience with this and I'm not sure how to structure 
the crate, in order to allow a possible end-user to download only the 
integration dependency he needs e.g. iceberg-datafusion or iceberg-polars. etc.
   
   So here I'm hoping for feedback and help.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579654225


##
crates/integrations/datafusion/src/table.rs:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::datasource::TableProvider;
+use iceberg::{table::Table, Catalog, NamespaceIdent, Result, TableIdent};
+
+pub(crate) struct IcebergTableProvider {
+_inner: Table,
+}
+
+impl IcebergTableProvider {
+pub(crate) async fn try_new(
+client: Arc,
+namespace: NamespaceIdent,
+name: impl Into,
+) -> Result {
+let name = name.into();
+let ident = TableIdent::new(namespace, name);
+let table = client.load_table().await?;
+
+Ok(IcebergTableProvider { _inner: table })
+}
+}
+
+impl TableProvider for IcebergTableProvider {
+fn as_any() ->  std::any::Any {
+self
+}
+
+fn schema() -> datafusion::arrow::datatypes::SchemaRef {
+todo!()
+}
+
+fn table_type() -> datafusion::datasource::TableType {
+todo!()
+}
+
+fn scan<'life0, 'life1, 'life2, 'life3, 'async_trait>(

Review Comment:
   You need to add #[async_trait] to top of implementation 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Basic Integration with Datafusion [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on code in PR #324:
URL: https://github.com/apache/iceberg-rust/pull/324#discussion_r1579601294


##
crates/integrations/datafusion/src/table.rs:
##
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::datasource::TableProvider;
+use iceberg::{table::Table, Catalog, NamespaceIdent, Result, TableIdent};
+
+pub(crate) struct IcebergTableProvider {
+_inner: Table,
+}
+
+impl IcebergTableProvider {
+pub(crate) async fn try_new(
+client: Arc,
+namespace: NamespaceIdent,
+name: impl Into,
+) -> Result {
+let name = name.into();
+let ident = TableIdent::new(namespace, name);
+let table = client.load_table().await?;
+
+Ok(IcebergTableProvider { _inner: table })
+}
+}
+
+impl TableProvider for IcebergTableProvider {
+fn as_any() ->  std::any::Any {
+self
+}
+
+fn schema() -> datafusion::arrow::datatypes::SchemaRef {
+todo!()
+}
+
+fn table_type() -> datafusion::datasource::TableType {
+todo!()
+}
+
+fn scan<'life0, 'life1, 'life2, 'life3, 'async_trait>(

Review Comment:
   We should remove these auto generate lifetimes



##
crates/integrations/datafusion/README.md:
##
@@ -0,0 +1,22 @@
+
+
+# Apache Iceberg Integrations
+
+This crate contains the official Native Rust implementation of Apache Iceberg 
Integrations.

Review Comment:
   ```suggestion
   This crate contains integration of apache datafusion and apache iceberg.
   ```



##
crates/integrations/datafusion/Cargo.toml:
##
@@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-integrations-datafusion"
+version = { workspace = true }
+edition = { workspace = true }
+homepage = { workspace = true }
+rust-version = { workspace = true }
+
+categories = ["database"]
+description = "Apache Iceberg Integrations Datafusion"
+repository = { workspace = true }
+license = { workspace = true }
+keywords = ["iceberg", "integrations", "datfusion"]
+
+[features]
+default = []
+datafusion = ["dep:datafusion"]

Review Comment:
   Sorry I'm a little confusing here, as a crate for integration with 
datafusion, why this is a feature rather enabled by default?



##
crates/integrations/datafusion/Cargo.toml:
##
@@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-integrations-datafusion"

Review Comment:
   ```suggestion
   name = "iceberg-datafusion"
   ```
   How about just this?



##
crates/integrations/datafusion/src/schema.rs:
##
@@ -0,0 +1,102 @@
+// Licensed to the Apache Software 

Re: [PR] Sql catalog [iceberg-rust]

2024-04-25 Thread via GitHub


liurenjie1024 commented on code in PR #229:
URL: https://github.com/apache/iceberg-rust/pull/229#discussion_r1579259139


##
crates/catalog/sql/Cargo.toml:
##
@@ -0,0 +1,53 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-catalog-sql"
+version = { workspace = true }
+edition = { workspace = true }
+homepage = { workspace = true }
+rust-version = { workspace = true }
+
+categories = ["database"]
+description = "Apache Iceberg Rust Sql Catalog"
+repository = { workspace = true }
+license = { workspace = true }
+keywords = ["iceberg", "sql", "catalog"]
+
+[dependencies]
+anyhow = { workspace = true }
+async-trait = { workspace = true }
+chrono = { workspace = true }
+dashmap = "5.5.3"
+futures = { workspace = true }
+iceberg = { workspace = true }
+log = { workspace = true }
+opendal = { workspace = true }
+serde = { workspace = true }
+serde_derive = { workspace = true }
+serde_json = { workspace = true }
+sqlx = { version = "0.7.4", features = ["tls-rustls", "any", "sqlite", 
"postgres", "mysql"], default-features = false }
+typed-builder = { workspace = true }
+url = { workspace = true }
+urlencoding = { workspace = true }
+uuid = { workspace = true, features = ["v4"] }
+
+[dev-dependencies]
+iceberg_test_utils = { path = "../../test_utils", features = ["tests"] }
+sqlx = { version = "0.7.4", features = ["tls-rustls", "runtime-tokio", "any", 
"sqlite", "postgres", "mysql","migrate"], default-features = false }

Review Comment:
   Ditto.



##
crates/catalog/sql/Cargo.toml:
##
@@ -0,0 +1,53 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "iceberg-catalog-sql"
+version = { workspace = true }
+edition = { workspace = true }
+homepage = { workspace = true }
+rust-version = { workspace = true }
+
+categories = ["database"]
+description = "Apache Iceberg Rust Sql Catalog"
+repository = { workspace = true }
+license = { workspace = true }
+keywords = ["iceberg", "sql", "catalog"]
+
+[dependencies]
+anyhow = { workspace = true }
+async-trait = { workspace = true }
+chrono = { workspace = true }
+dashmap = "5.5.3"
+futures = { workspace = true }
+iceberg = { workspace = true }
+log = { workspace = true }
+opendal = { workspace = true }
+serde = { workspace = true }
+serde_derive = { workspace = true }
+serde_json = { workspace = true }
+sqlx = { version = "0.7.4", features = ["tls-rustls", "any", "sqlite", 
"postgres", "mysql"], default-features = false }

Review Comment:
   Is it possible to add features for each db so that the binary doesn't 
contains all implementations for all dbs?



##
crates/catalog/sql/src/catalog.rs:
##
@@ -0,0 +1,517 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// 

Re: [PR] AWS: Fix TestGlueCatalogTable#testCreateTable [iceberg]

2024-04-25 Thread via GitHub


aajisaka commented on PR #10221:
URL: https://github.com/apache/iceberg/pull/10221#issuecomment-2077332098

   Thanks a lot!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579520268


##
crates/e2e_test/tests/append_data_file_test.rs:
##
@@ -0,0 +1,212 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Integration tests for rest catalog.
+
+use arrow_array::{ArrayRef, BooleanArray, Int32Array, RecordBatch, 
StringArray};
+use futures::TryStreamExt;
+use iceberg::io::{S3_ACCESS_KEY_ID, S3_ENDPOINT, S3_REGION, 
S3_SECRET_ACCESS_KEY};
+use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
+use iceberg::transaction::Transaction;
+use iceberg::writer::base_writer::data_file_writer::{DataFileWriterBuilder, 
DataFileWriterConfig};
+use iceberg::writer::file_writer::location_generator::{
+DefaultFileNameGenerator, DefaultLocationGenerator,
+};
+use iceberg::writer::file_writer::ParquetWriterBuilder;
+use iceberg::writer::{IcebergWriter, IcebergWriterBuilder};
+use iceberg::{Catalog, Namespace, NamespaceIdent, TableCreation};
+use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};
+use iceberg_test_utils::docker::DockerCompose;
+use iceberg_test_utils::{normalize_test_name, set_up};
+use parquet::file::properties::WriterProperties;
+use port_scanner::scan_port_addr;
+use std::collections::HashMap;
+use std::sync::Arc;
+use tokio::time::sleep;
+
+const REST_CATALOG_PORT: u16 = 8181;
+
+struct TestFixture {
+_docker_compose: DockerCompose,
+rest_catalog: RestCatalog,
+}
+
+async fn set_test_fixture(func: ) -> TestFixture {
+set_up();
+let docker_compose = DockerCompose::new(
+normalize_test_name(format!("{}_{func}", module_path!())),
+format!("{}/testdata", env!("CARGO_MANIFEST_DIR")),
+);
+
+// Start docker compose
+docker_compose.run();
+
+let rest_catalog_ip = docker_compose.get_container_ip("rest");
+
+let read_port = format!("{}:{}", rest_catalog_ip, REST_CATALOG_PORT);
+loop {
+if !scan_port_addr(_port) {
+log::info!("Waiting for 1s rest catalog to ready...");
+sleep(std::time::Duration::from_millis(1000)).await;
+} else {
+break;
+}
+}
+
+let container_ip = docker_compose.get_container_ip("minio");
+let read_port = format!("{}:{}", container_ip, 9000);
+
+let config = RestCatalogConfig::builder()
+.uri(format!("http://{}:{};, rest_catalog_ip, REST_CATALOG_PORT))
+.props(HashMap::from([
+(S3_ENDPOINT.to_string(), format!("http://{};, read_port)),
+(S3_ACCESS_KEY_ID.to_string(), "admin".to_string()),
+(S3_SECRET_ACCESS_KEY.to_string(), "password".to_string()),
+(S3_REGION.to_string(), "us-east-1".to_string()),
+]))
+.build();
+let rest_catalog = RestCatalog::new(config).await.unwrap();
+
+TestFixture {
+_docker_compose: docker_compose,
+rest_catalog,
+}
+}
+
+#[tokio::test]
+async fn test_append_data_file() {
+let fixture = set_test_fixture("test_create_table").await;
+
+let ns = Namespace::with_properties(
+NamespaceIdent::from_strs(["apple", "ios"]).unwrap(),
+HashMap::from([
+("owner".to_string(), "ray".to_string()),
+("community".to_string(), "apache".to_string()),
+]),
+);
+
+fixture
+.rest_catalog
+.create_namespace(ns.name(), ns.properties().clone())
+.await
+.unwrap();
+
+let schema = Schema::builder()
+.with_schema_id(1)
+.with_identifier_field_ids(vec![2])
+.with_fields(vec![
+NestedField::optional(1, "foo", 
Type::Primitive(PrimitiveType::String)).into(),
+NestedField::required(2, "bar", 
Type::Primitive(PrimitiveType::Int)).into(),
+NestedField::optional(3, "baz", 
Type::Primitive(PrimitiveType::Boolean)).into(),
+])
+.build()
+.unwrap();
+
+let table_creation = TableCreation::builder()
+.name("t1".to_string())
+.schema(schema.clone())
+.build();
+
+let table = fixture
+.rest_catalog
+.create_table(ns.name(), table_creation)
+.await
+.unwrap();
+
+// Create the writer and write the data
+  

Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579472707


##
crates/iceberg/src/transaction.rs:
##
@@ -121,6 +166,270 @@ impl<'a> Transaction<'a> {
 }
 }
 
+/// FastAppendAction is a transaction action for fast append data files to the 
table.
+pub struct FastAppendAction<'a> {
+tx: Transaction<'a>,
+
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+
+commit_uuid: String,
+manifest_id: i64,
+
+appended_data_files: Vec,
+}
+
+impl<'a> FastAppendAction<'a> {
+#[allow(clippy::too_many_arguments)]
+pub(crate) fn new(
+tx: Transaction<'a>,
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+commit_uuid: String,
+) -> Result {
+Ok(Self {
+tx,
+parent_snapshot_id,
+snapshot_id,
+schema,
+schema_id,
+format_version,
+partition_spec,
+key_metadata,
+commit_uuid,
+manifest_id: 0,
+appended_data_files: vec![],
+})
+}
+
+/// Add data files to the snapshot.
+pub fn add_data_files(
+ self,
+data_file: impl IntoIterator,
+) -> Result< Self> {
+self.appended_data_files.extend(data_file);
+Ok(self)
+}
+
+fn generate_manifest_file_path( self) -> String {
+let manifest_id = self.manifest_id;
+self.manifest_id += 1;
+format!(
+"{}/{}/{}-m{}.{}",
+self.tx.table.metadata().location(),
+META_ROOT_PATH,
+_uuid,
+manifest_id,
+DataFileFormat::Avro
+)
+}
+
+async fn manifest_from_parent_snapshot() -> Result> 
{
+if let Some(snapshot) = self.tx.table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(self.tx.table.file_io(), 
_ref())
+.await?;
+let mut manifest_files = 
Vec::with_capacity(manifest_list.entries().len());
+for entry in manifest_list.entries() {
+// From: 
https://github.com/apache/iceberg-python/blob/659a951d6397ab280cae80206fe6e8e4be2d3738/pyiceberg/table/__init__.py#L2921
+// Why we need this?
+if entry.added_snapshot_id == self.snapshot_id {
+continue;
+}
+let manifest = 
entry.load_manifest(self.tx.table.file_io()).await?;
+// Skip manifest with all delete entries.
+if manifest.entries().iter().all(|entry| !entry.is_alive()) {
+continue;
+}
+manifest_files.push(entry.clone());
+}
+Ok(manifest_files)
+} else {
+Ok(vec![])
+}
+}
+
+// Write manifest file for added data files and return the ManifestFile 
for ManifestList.
+async fn manifest_for_data_file( self) -> Result {
+let appended_data_files = std::mem::take( 
self.appended_data_files);
+let manifest_entries = appended_data_files
+.into_iter()
+.map(|data_file| {
+ManifestEntry::builder()
+.status(crate::spec::ManifestStatus::Added)
+.snapshot_id(self.snapshot_id)

Review Comment:
   This is different for V1 and V2. For V1 we do want to write this, for V2 
typically not. This is because when the commit fails due to a conflict, we have 
to rewrite the manifest-list but can-reuse the already written manifests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Kafka-connect: Handle namespace creation for auto table creation [iceberg]

2024-04-25 Thread via GitHub


bryanck commented on code in PR #10186:
URL: https://github.com/apache/iceberg/pull/10186#discussion_r1579467601


##
kafka-connect/kafka-connect/src/test/java/org/apache/iceberg/connect/data/IcebergWriterFactoryTest.java:
##
@@ -83,4 +90,26 @@ public void testAutoCreateTable(boolean partitioned) {
 assertThat(specCaptor.getValue().isPartitioned()).isEqualTo(partitioned);
 assertThat(propsCaptor.getValue()).containsKey("test-prop");
   }
+
+  @Test
+  public void testNamespaceCreation() throws IOException {
+TableIdentifier tableIdentifier =
+TableIdentifier.of(Namespace.of("foo1", "foo2", "foo3"), "bar");
+Schema schema = new Schema(Types.NestedField.required(1, "id", 
Types.StringType.get()));
+
+try (InMemoryCatalog catalog = new InMemoryCatalog()) {

Review Comment:
   I'd prefer using the ArgumentCaptor approach, rather than relying on 
InMemoryCatalog. This is consistent with the auto create test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579466191


##
crates/iceberg/src/transaction.rs:
##
@@ -121,6 +166,270 @@ impl<'a> Transaction<'a> {
 }
 }
 
+/// FastAppendAction is a transaction action for fast append data files to the 
table.
+pub struct FastAppendAction<'a> {
+tx: Transaction<'a>,
+
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+
+commit_uuid: String,
+manifest_id: i64,
+
+appended_data_files: Vec,
+}
+
+impl<'a> FastAppendAction<'a> {
+#[allow(clippy::too_many_arguments)]
+pub(crate) fn new(
+tx: Transaction<'a>,
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+commit_uuid: String,
+) -> Result {
+Ok(Self {
+tx,
+parent_snapshot_id,
+snapshot_id,
+schema,
+schema_id,
+format_version,
+partition_spec,
+key_metadata,
+commit_uuid,
+manifest_id: 0,
+appended_data_files: vec![],
+})
+}
+
+/// Add data files to the snapshot.
+pub fn add_data_files(
+ self,
+data_file: impl IntoIterator,
+) -> Result< Self> {
+self.appended_data_files.extend(data_file);
+Ok(self)
+}
+
+fn generate_manifest_file_path( self) -> String {
+let manifest_id = self.manifest_id;
+self.manifest_id += 1;
+format!(
+"{}/{}/{}-m{}.{}",
+self.tx.table.metadata().location(),
+META_ROOT_PATH,
+_uuid,
+manifest_id,
+DataFileFormat::Avro
+)
+}
+
+async fn manifest_from_parent_snapshot() -> Result> 
{
+if let Some(snapshot) = self.tx.table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(self.tx.table.file_io(), 
_ref())
+.await?;
+let mut manifest_files = 
Vec::with_capacity(manifest_list.entries().len());
+for entry in manifest_list.entries() {
+// From: 
https://github.com/apache/iceberg-python/blob/659a951d6397ab280cae80206fe6e8e4be2d3738/pyiceberg/table/__init__.py#L2921
+// Why we need this?
+if entry.added_snapshot_id == self.snapshot_id {
+continue;
+}
+let manifest = 
entry.load_manifest(self.tx.table.file_io()).await?;
+// Skip manifest with all delete entries.
+if manifest.entries().iter().all(|entry| !entry.is_alive()) {
+continue;
+}
+manifest_files.push(entry.clone());
+}
+Ok(manifest_files)
+} else {
+Ok(vec![])
+}
+}
+
+// Write manifest file for added data files and return the ManifestFile 
for ManifestList.
+async fn manifest_for_data_file( self) -> Result {
+let appended_data_files = std::mem::take( 
self.appended_data_files);
+let manifest_entries = appended_data_files
+.into_iter()
+.map(|data_file| {
+ManifestEntry::builder()
+.status(crate::spec::ManifestStatus::Added)
+.snapshot_id(self.snapshot_id)
+.data_file(data_file)
+.build()
+})
+.collect();
+let manifest_meta = ManifestMetadata::builder()
+.schema(self.schema.clone())
+.schema_id(self.schema_id)
+.format_version(self.format_version)
+.partition_spec(self.partition_spec.clone())

Review Comment:
   How do we know if the written data adheres to the partition spec?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] `field-id`'s missing in generated Avro files [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on issue #353:
URL: https://github.com/apache/iceberg-rust/issues/353#issuecomment-2077166551

   @s-akhtar-baig Certainly  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Correct names in the ManifestList [iceberg-rust]

2024-04-25 Thread via GitHub


s-akhtar-baig commented on issue #354:
URL: https://github.com/apache/iceberg-rust/issues/354#issuecomment-2077159735

   @Fokko, can you please assign this to me? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] `field-id`'s missing in generated Avro files [iceberg-rust]

2024-04-25 Thread via GitHub


s-akhtar-baig commented on issue #353:
URL: https://github.com/apache/iceberg-rust/issues/353#issuecomment-2077159481

   @Fokko, can you please assign this to me? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579449633


##
crates/iceberg/src/transaction.rs:
##
@@ -121,6 +166,270 @@ impl<'a> Transaction<'a> {
 }
 }
 
+/// FastAppendAction is a transaction action for fast append data files to the 
table.
+pub struct FastAppendAction<'a> {
+tx: Transaction<'a>,
+
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+
+commit_uuid: String,
+manifest_id: i64,
+
+appended_data_files: Vec,
+}
+
+impl<'a> FastAppendAction<'a> {
+#[allow(clippy::too_many_arguments)]
+pub(crate) fn new(
+tx: Transaction<'a>,
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+commit_uuid: String,
+) -> Result {
+Ok(Self {
+tx,
+parent_snapshot_id,
+snapshot_id,
+schema,
+schema_id,
+format_version,
+partition_spec,
+key_metadata,
+commit_uuid,
+manifest_id: 0,
+appended_data_files: vec![],
+})
+}
+
+/// Add data files to the snapshot.
+pub fn add_data_files(
+ self,
+data_file: impl IntoIterator,
+) -> Result< Self> {
+self.appended_data_files.extend(data_file);
+Ok(self)
+}
+
+fn generate_manifest_file_path( self) -> String {
+let manifest_id = self.manifest_id;
+self.manifest_id += 1;
+format!(
+"{}/{}/{}-m{}.{}",
+self.tx.table.metadata().location(),
+META_ROOT_PATH,
+_uuid,
+manifest_id,
+DataFileFormat::Avro
+)
+}
+
+async fn manifest_from_parent_snapshot() -> Result> 
{
+if let Some(snapshot) = self.tx.table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(self.tx.table.file_io(), 
_ref())
+.await?;
+let mut manifest_files = 
Vec::with_capacity(manifest_list.entries().len());
+for entry in manifest_list.entries() {
+// From: 
https://github.com/apache/iceberg-python/blob/659a951d6397ab280cae80206fe6e8e4be2d3738/pyiceberg/table/__init__.py#L2921
+// Why we need this?
+if entry.added_snapshot_id == self.snapshot_id {
+continue;
+}
+let manifest = 
entry.load_manifest(self.tx.table.file_io()).await?;
+// Skip manifest with all delete entries.
+if manifest.entries().iter().all(|entry| !entry.is_alive()) {
+continue;
+}
+manifest_files.push(entry.clone());
+}
+Ok(manifest_files)
+} else {
+Ok(vec![])
+}
+}
+
+// Write manifest file for added data files and return the ManifestFile 
for ManifestList.
+async fn manifest_for_data_file( self) -> Result {
+let appended_data_files = std::mem::take( 
self.appended_data_files);
+let manifest_entries = appended_data_files
+.into_iter()
+.map(|data_file| {
+ManifestEntry::builder()
+.status(crate::spec::ManifestStatus::Added)
+.snapshot_id(self.snapshot_id)
+.data_file(data_file)
+.build()
+})
+.collect();
+let manifest_meta = ManifestMetadata::builder()
+.schema(self.schema.clone())
+.schema_id(self.schema_id)
+.format_version(self.format_version)
+.partition_spec(self.partition_spec.clone())
+.content(crate::spec::ManifestContentType::Data)
+.build();
+let manifest = Manifest::new(manifest_meta, manifest_entries);
+let writer = ManifestWriter::new(
+self.tx
+.table
+.file_io()
+.new_output(self.generate_manifest_file_path())?,
+self.snapshot_id,
+self.key_metadata.clone(),
+);
+writer.write(manifest).await
+}
+
+fn summary() -> Summary {
+Summary {
+operation: crate::spec::Operation::Append,
+other: HashMap::new(),
+}
+}
+
+/// Finished building the action and apply it to the transaction.
+pub async fn apply(mut self) -> Result> {
+let summary = self.summary();
+let manifest = self.manifest_for_data_file().await?;
+let existing_manifest_files = 
self.manifest_from_parent_snapshot().await?;
+
+let snapshot_produce_action = 

Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579449633


##
crates/iceberg/src/transaction.rs:
##
@@ -121,6 +166,270 @@ impl<'a> Transaction<'a> {
 }
 }
 
+/// FastAppendAction is a transaction action for fast append data files to the 
table.
+pub struct FastAppendAction<'a> {
+tx: Transaction<'a>,
+
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+
+commit_uuid: String,
+manifest_id: i64,
+
+appended_data_files: Vec,
+}
+
+impl<'a> FastAppendAction<'a> {
+#[allow(clippy::too_many_arguments)]
+pub(crate) fn new(
+tx: Transaction<'a>,
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+commit_uuid: String,
+) -> Result {
+Ok(Self {
+tx,
+parent_snapshot_id,
+snapshot_id,
+schema,
+schema_id,
+format_version,
+partition_spec,
+key_metadata,
+commit_uuid,
+manifest_id: 0,
+appended_data_files: vec![],
+})
+}
+
+/// Add data files to the snapshot.
+pub fn add_data_files(
+ self,
+data_file: impl IntoIterator,
+) -> Result< Self> {
+self.appended_data_files.extend(data_file);
+Ok(self)
+}
+
+fn generate_manifest_file_path( self) -> String {
+let manifest_id = self.manifest_id;
+self.manifest_id += 1;
+format!(
+"{}/{}/{}-m{}.{}",
+self.tx.table.metadata().location(),
+META_ROOT_PATH,
+_uuid,
+manifest_id,
+DataFileFormat::Avro
+)
+}
+
+async fn manifest_from_parent_snapshot() -> Result> 
{
+if let Some(snapshot) = self.tx.table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(self.tx.table.file_io(), 
_ref())
+.await?;
+let mut manifest_files = 
Vec::with_capacity(manifest_list.entries().len());
+for entry in manifest_list.entries() {
+// From: 
https://github.com/apache/iceberg-python/blob/659a951d6397ab280cae80206fe6e8e4be2d3738/pyiceberg/table/__init__.py#L2921
+// Why we need this?
+if entry.added_snapshot_id == self.snapshot_id {
+continue;
+}
+let manifest = 
entry.load_manifest(self.tx.table.file_io()).await?;
+// Skip manifest with all delete entries.
+if manifest.entries().iter().all(|entry| !entry.is_alive()) {
+continue;
+}
+manifest_files.push(entry.clone());
+}
+Ok(manifest_files)
+} else {
+Ok(vec![])
+}
+}
+
+// Write manifest file for added data files and return the ManifestFile 
for ManifestList.
+async fn manifest_for_data_file( self) -> Result {
+let appended_data_files = std::mem::take( 
self.appended_data_files);
+let manifest_entries = appended_data_files
+.into_iter()
+.map(|data_file| {
+ManifestEntry::builder()
+.status(crate::spec::ManifestStatus::Added)
+.snapshot_id(self.snapshot_id)
+.data_file(data_file)
+.build()
+})
+.collect();
+let manifest_meta = ManifestMetadata::builder()
+.schema(self.schema.clone())
+.schema_id(self.schema_id)
+.format_version(self.format_version)
+.partition_spec(self.partition_spec.clone())
+.content(crate::spec::ManifestContentType::Data)
+.build();
+let manifest = Manifest::new(manifest_meta, manifest_entries);
+let writer = ManifestWriter::new(
+self.tx
+.table
+.file_io()
+.new_output(self.generate_manifest_file_path())?,
+self.snapshot_id,
+self.key_metadata.clone(),
+);
+writer.write(manifest).await
+}
+
+fn summary() -> Summary {
+Summary {
+operation: crate::spec::Operation::Append,
+other: HashMap::new(),
+}
+}
+
+/// Finished building the action and apply it to the transaction.
+pub async fn apply(mut self) -> Result> {
+let summary = self.summary();
+let manifest = self.manifest_for_data_file().await?;
+let existing_manifest_files = 
self.manifest_from_parent_snapshot().await?;
+
+let snapshot_produce_action = 

[I] `field-id`'s missing in generated Avro files [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko opened a new issue, #353:
URL: https://github.com/apache/iceberg-rust/issues/353

   Looks like the `field-id` is missing in the Avro metadata generated by 
iceberg-rust. This should conform the spec: 
https://iceberg.apache.org/spec/#avro
   
   ```
   avro-tools getschema snap-0-1-a0c0a37d-6828-47c6-80e8-64b0a013fe44.avro
   24/04/25 14:58:05 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   {
 "type" : "record",
 "name" : "manifest_file",
 "fields" : [ {
   "name" : "manifest_path",
   "type" : "string"
 }, {
   "name" : "manifest_length",
   "type" : "long"
 }, {
   "name" : "partition_spec_id",
   "type" : "int"
 }, {
   "name" : "content",
   "type" : "int"
 }, {
   "name" : "sequence_number",
   "type" : "long"
 }, {
   "name" : "min_sequence_number",
   "type" : "long"
 }, {
   "name" : "added_snapshot_id",
   "type" : "long"
 }, {
   "name" : "added_data_files_count",
   "type" : "int"
 }, {
   "name" : "existing_data_files_count",
   "type" : "int"
 }, {
   "name" : "deleted_data_files_count",
   "type" : "int"
 }, {
   "name" : "added_rows_count",
   "type" : "long"
 }, {
   "name" : "existing_rows_count",
   "type" : "long"
 }, {
   "name" : "deleted_rows_count",
   "type" : "long"
 }, {
   "name" : "partitions",
   "type" : [ "null", {
 "type" : "array",
 "items" : {
   "type" : "record",
   "name" : "r508",
   "fields" : [ {
 "name" : "contains_null",
 "type" : "boolean"
   }, {
 "name" : "contains_nan",
 "type" : [ "null", "boolean" ],
 "default" : null
   }, {
 "name" : "lower_bound",
 "type" : [ "null", "bytes" ],
 "default" : null
   }, {
 "name" : "upper_bound",
 "type" : [ "null", "bytes" ],
 "default" : null
   } ]
 }
   } ],
   "default" : null
 }, {
   "name" : "key_metadata",
   "type" : [ "null", "bytes" ],
   "default" : null
 } ]
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579426812


##
crates/e2e_test/testdata/docker-compose.yaml:
##
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+version: '3.8'
+
+services:
+  rest:
+image: tabulario/iceberg-rest:0.10.0
+environment:
+  - AWS_ACCESS_KEY_ID=admin
+  - AWS_SECRET_ACCESS_KEY=password
+  - AWS_REGION=us-east-1
+  - CATALOG_CATOLOG__IMPL=org.apache.iceberg.jdbc.JdbcCatalog
+  - CATALOG_URI=jdbc:sqlite:file:/tmp/iceberg_rest_mode=memory
+  - CATALOG_WAREHOUSE=s3://icebergdata/demo
+  - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
+  - CATALOG_S3_ENDPOINT=http://minio:9000
+depends_on:
+  - minio
+links:
+  - minio:icebergdata.minio
+expose:
+  - 8181
+
+  minio:

Review Comment:
   I'm not sure if the error is relevant. I'm able to run it locally now.
   
   > how about change `minio/minio:RELEASE.2024-03-07T00-43-48Z`, 
`minio/mc:RELEASE.2024-03-07T00-31-49Z` to `minio/minio:latest`, 
`minio/mc:latest`.
   
   +1 on that. I haven't encountered any issues with minio updates



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579420662


##
crates/iceberg/src/transaction.rs:
##
@@ -121,6 +166,270 @@ impl<'a> Transaction<'a> {
 }
 }
 
+/// FastAppendAction is a transaction action for fast append data files to the 
table.
+pub struct FastAppendAction<'a> {
+tx: Transaction<'a>,
+
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+
+commit_uuid: String,
+manifest_id: i64,
+
+appended_data_files: Vec,
+}
+
+impl<'a> FastAppendAction<'a> {
+#[allow(clippy::too_many_arguments)]
+pub(crate) fn new(
+tx: Transaction<'a>,
+parent_snapshot_id: Option,
+snapshot_id: i64,
+schema: Schema,
+schema_id: i32,
+format_version: FormatVersion,
+partition_spec: PartitionSpec,
+key_metadata: Vec,
+commit_uuid: String,
+) -> Result {
+Ok(Self {
+tx,
+parent_snapshot_id,
+snapshot_id,
+schema,
+schema_id,
+format_version,
+partition_spec,
+key_metadata,
+commit_uuid,
+manifest_id: 0,
+appended_data_files: vec![],
+})
+}
+
+/// Add data files to the snapshot.
+pub fn add_data_files(
+ self,
+data_file: impl IntoIterator,
+) -> Result< Self> {
+self.appended_data_files.extend(data_file);
+Ok(self)
+}
+
+fn generate_manifest_file_path( self) -> String {
+let manifest_id = self.manifest_id;
+self.manifest_id += 1;
+format!(
+"{}/{}/{}-m{}.{}",
+self.tx.table.metadata().location(),
+META_ROOT_PATH,
+_uuid,
+manifest_id,
+DataFileFormat::Avro
+)
+}
+
+async fn manifest_from_parent_snapshot() -> Result> 
{
+if let Some(snapshot) = self.tx.table.metadata().current_snapshot() {
+let manifest_list = snapshot
+.load_manifest_list(self.tx.table.file_io(), 
_ref())
+.await?;
+let mut manifest_files = 
Vec::with_capacity(manifest_list.entries().len());
+for entry in manifest_list.entries() {
+// From: 
https://github.com/apache/iceberg-python/blob/659a951d6397ab280cae80206fe6e8e4be2d3738/pyiceberg/table/__init__.py#L2921
+// Why we need this?
+if entry.added_snapshot_id == self.snapshot_id {
+continue;
+}
+let manifest = 
entry.load_manifest(self.tx.table.file_io()).await?;
+// Skip manifest with all delete entries.
+if manifest.entries().iter().all(|entry| !entry.is_alive()) {
+continue;
+}
+manifest_files.push(entry.clone());
+}
+Ok(manifest_files)
+} else {
+Ok(vec![])
+}
+}
+
+// Write manifest file for added data files and return the ManifestFile 
for ManifestList.
+async fn manifest_for_data_file( self) -> Result {
+let appended_data_files = std::mem::take( 
self.appended_data_files);
+let manifest_entries = appended_data_files
+.into_iter()
+.map(|data_file| {
+ManifestEntry::builder()
+.status(crate::spec::ManifestStatus::Added)
+.snapshot_id(self.snapshot_id)
+.data_file(data_file)
+.build()
+})
+.collect();
+let manifest_meta = ManifestMetadata::builder()
+.schema(self.schema.clone())
+.schema_id(self.schema_id)
+.format_version(self.format_version)
+.partition_spec(self.partition_spec.clone())
+.content(crate::spec::ManifestContentType::Data)
+.build();
+let manifest = Manifest::new(manifest_meta, manifest_entries);
+let writer = ManifestWriter::new(
+self.tx
+.table
+.file_io()
+.new_output(self.generate_manifest_file_path())?,
+self.snapshot_id,
+self.key_metadata.clone(),
+);
+writer.write(manifest).await
+}
+
+fn summary() -> Summary {
+Summary {
+operation: crate::spec::Operation::Append,
+other: HashMap::new(),

Review Comment:
   The summary generation is missing here. I'm okay with doing that in a 
separate PR, but we have to make sure that we add it before the release.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 

Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579412576


##
crates/iceberg/src/transaction.rs:
##
@@ -95,6 +104,42 @@ impl<'a> Transaction<'a> {
 Ok(self)
 }
 
+/// Creates a fast append action.
+pub fn fast_append(
+self,
+commit_uuid: Option,
+key_metadata: Vec,
+) -> Result> {
+let parent_snapshot_id = self
+.table
+.metadata()
+.current_snapshot()
+.map(|s| s.snapshot_id());
+let snapshot_id = parent_snapshot_id.map(|id| id + 1).unwrap_or(0);

Review Comment:
   The snapshot ID is a random int, and it should be checked if it hasn't been 
used before:
   
   
https://github.com/apache/iceberg-python/blob/f72e363b18baa181c998bbdef657982159a22d48/pyiceberg/table/metadata.py#L315-L326



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


ZENOTME commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579406252


##
crates/e2e_test/testdata/docker-compose.yaml:
##
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+version: '3.8'
+
+services:
+  rest:
+image: tabulario/iceberg-rest:0.10.0
+environment:
+  - AWS_ACCESS_KEY_ID=admin
+  - AWS_SECRET_ACCESS_KEY=password
+  - AWS_REGION=us-east-1
+  - CATALOG_CATOLOG__IMPL=org.apache.iceberg.jdbc.JdbcCatalog
+  - CATALOG_URI=jdbc:sqlite:file:/tmp/iceberg_rest_mode=memory
+  - CATALOG_WAREHOUSE=s3://icebergdata/demo
+  - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
+  - CATALOG_S3_ENDPOINT=http://minio:9000
+depends_on:
+  - minio
+links:
+  - minio:icebergdata.minio
+expose:
+  - 8181
+
+  minio:
+image: minio/minio:RELEASE.2024-03-07T00-43-48Z
+environment:
+  - MINIO_ROOT_USER=admin
+  - MINIO_ROOT_PASSWORD=password
+  - MINIO_DOMAIN=minio
+expose:
+  - 9001
+  - 9000

Review Comment:
   LGTM. I also love to expose to make it easier to debug.藍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[I] Empty snapshot ID should be `Null` instead of `-1` [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko opened a new issue, #352:
URL: https://github.com/apache/iceberg-rust/issues/352

   This is an old bug from Java. Where the Snapshot was set to -1 instead of 
`None`:
   
   
https://github.com/apache/iceberg-rust/blob/aba620900e99423bbd3fed969618e67e58a03a7b/crates/iceberg/src/spec/table_metadata.rs#L44
   
   From Java 1.5 and later this is fixed. For older version of Java, the 
current-snapshot-id is required. We solved this by setting a flag: 
https://github.com/apache/iceberg-python/pull/473


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579402269


##
crates/e2e_test/tests/append_data_file_test.rs:
##
@@ -0,0 +1,212 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Integration tests for rest catalog.
+
+use arrow_array::{ArrayRef, BooleanArray, Int32Array, RecordBatch, 
StringArray};
+use futures::TryStreamExt;
+use iceberg::io::{S3_ACCESS_KEY_ID, S3_ENDPOINT, S3_REGION, 
S3_SECRET_ACCESS_KEY};
+use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
+use iceberg::transaction::Transaction;
+use iceberg::writer::base_writer::data_file_writer::{DataFileWriterBuilder, 
DataFileWriterConfig};
+use iceberg::writer::file_writer::location_generator::{
+DefaultFileNameGenerator, DefaultLocationGenerator,
+};
+use iceberg::writer::file_writer::ParquetWriterBuilder;
+use iceberg::writer::{IcebergWriter, IcebergWriterBuilder};
+use iceberg::{Catalog, Namespace, NamespaceIdent, TableCreation};
+use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};
+use iceberg_test_utils::docker::DockerCompose;
+use iceberg_test_utils::{normalize_test_name, set_up};
+use parquet::file::properties::WriterProperties;
+use port_scanner::scan_port_addr;
+use std::collections::HashMap;
+use std::sync::Arc;
+use tokio::time::sleep;
+
+const REST_CATALOG_PORT: u16 = 8181;
+
+struct TestFixture {
+_docker_compose: DockerCompose,
+rest_catalog: RestCatalog,
+}
+
+async fn set_test_fixture(func: ) -> TestFixture {
+set_up();
+let docker_compose = DockerCompose::new(
+normalize_test_name(format!("{}_{func}", module_path!())),
+format!("{}/testdata", env!("CARGO_MANIFEST_DIR")),
+);
+
+// Start docker compose
+docker_compose.run();
+
+let rest_catalog_ip = docker_compose.get_container_ip("rest");
+
+let read_port = format!("{}:{}", rest_catalog_ip, REST_CATALOG_PORT);
+loop {
+if !scan_port_addr(_port) {
+log::info!("Waiting for 1s rest catalog to ready...");
+sleep(std::time::Duration::from_millis(1000)).await;
+} else {
+break;
+}
+}
+
+let container_ip = docker_compose.get_container_ip("minio");
+let read_port = format!("{}:{}", container_ip, 9000);
+
+let config = RestCatalogConfig::builder()
+.uri(format!("http://{}:{};, rest_catalog_ip, REST_CATALOG_PORT))
+.props(HashMap::from([
+(S3_ENDPOINT.to_string(), format!("http://{};, read_port)),
+(S3_ACCESS_KEY_ID.to_string(), "admin".to_string()),
+(S3_SECRET_ACCESS_KEY.to_string(), "password".to_string()),
+(S3_REGION.to_string(), "us-east-1".to_string()),
+]))
+.build();
+let rest_catalog = RestCatalog::new(config).await.unwrap();
+
+TestFixture {
+_docker_compose: docker_compose,
+rest_catalog,
+}
+}
+
+#[tokio::test]
+async fn test_append_data_file() {
+let fixture = set_test_fixture("test_create_table").await;
+
+let ns = Namespace::with_properties(
+NamespaceIdent::from_strs(["apple", "ios"]).unwrap(),
+HashMap::from([
+("owner".to_string(), "ray".to_string()),
+("community".to_string(), "apache".to_string()),
+]),
+);
+
+fixture
+.rest_catalog
+.create_namespace(ns.name(), ns.properties().clone())
+.await
+.unwrap();
+
+let schema = Schema::builder()
+.with_schema_id(1)
+.with_identifier_field_ids(vec![2])
+.with_fields(vec![
+NestedField::optional(1, "foo", 
Type::Primitive(PrimitiveType::String)).into(),
+NestedField::required(2, "bar", 
Type::Primitive(PrimitiveType::Int)).into(),
+NestedField::optional(3, "baz", 
Type::Primitive(PrimitiveType::Boolean)).into(),
+])
+.build()
+.unwrap();
+
+let table_creation = TableCreation::builder()
+.name("t1".to_string())
+.schema(schema.clone())
+.build();
+
+let table = fixture
+.rest_catalog
+.create_table(ns.name(), table_creation)
+.await
+.unwrap();
+
+// Create the writer and write the data
+  

Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


ZENOTME commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579401733


##
crates/e2e_test/testdata/docker-compose.yaml:
##
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+version: '3.8'
+
+services:
+  rest:
+image: tabulario/iceberg-rest:0.10.0
+environment:
+  - AWS_ACCESS_KEY_ID=admin
+  - AWS_SECRET_ACCESS_KEY=password
+  - AWS_REGION=us-east-1
+  - CATALOG_CATOLOG__IMPL=org.apache.iceberg.jdbc.JdbcCatalog
+  - CATALOG_URI=jdbc:sqlite:file:/tmp/iceberg_rest_mode=memory
+  - CATALOG_WAREHOUSE=s3://icebergdata/demo
+  - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
+  - CATALOG_S3_ENDPOINT=http://minio:9000
+depends_on:
+  - minio
+links:
+  - minio:icebergdata.minio
+expose:
+  - 8181
+
+  minio:

Review Comment:
   how about change `minio/minio:RELEASE.2024-03-07T00-43-48Z`, 
`minio/mc:RELEASE.2024-03-07T00-31-49Z` to `minio/minio:latest`, 
`minio/mc:latest`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579401183


##
crates/e2e_test/tests/append_data_file_test.rs:
##
@@ -0,0 +1,212 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Integration tests for rest catalog.
+
+use arrow_array::{ArrayRef, BooleanArray, Int32Array, RecordBatch, 
StringArray};
+use futures::TryStreamExt;
+use iceberg::io::{S3_ACCESS_KEY_ID, S3_ENDPOINT, S3_REGION, 
S3_SECRET_ACCESS_KEY};
+use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
+use iceberg::transaction::Transaction;
+use iceberg::writer::base_writer::data_file_writer::{DataFileWriterBuilder, 
DataFileWriterConfig};
+use iceberg::writer::file_writer::location_generator::{
+DefaultFileNameGenerator, DefaultLocationGenerator,
+};
+use iceberg::writer::file_writer::ParquetWriterBuilder;
+use iceberg::writer::{IcebergWriter, IcebergWriterBuilder};
+use iceberg::{Catalog, Namespace, NamespaceIdent, TableCreation};
+use iceberg_catalog_rest::{RestCatalog, RestCatalogConfig};
+use iceberg_test_utils::docker::DockerCompose;
+use iceberg_test_utils::{normalize_test_name, set_up};
+use parquet::file::properties::WriterProperties;
+use port_scanner::scan_port_addr;
+use std::collections::HashMap;
+use std::sync::Arc;
+use tokio::time::sleep;
+
+const REST_CATALOG_PORT: u16 = 8181;
+
+struct TestFixture {
+_docker_compose: DockerCompose,
+rest_catalog: RestCatalog,
+}
+
+async fn set_test_fixture(func: ) -> TestFixture {
+set_up();
+let docker_compose = DockerCompose::new(
+normalize_test_name(format!("{}_{func}", module_path!())),
+format!("{}/testdata", env!("CARGO_MANIFEST_DIR")),
+);
+
+// Start docker compose
+docker_compose.run();
+
+let rest_catalog_ip = docker_compose.get_container_ip("rest");
+
+let read_port = format!("{}:{}", rest_catalog_ip, REST_CATALOG_PORT);
+loop {
+if !scan_port_addr(_port) {
+log::info!("Waiting for 1s rest catalog to ready...");
+sleep(std::time::Duration::from_millis(1000)).await;
+} else {
+break;
+}
+}
+
+let container_ip = docker_compose.get_container_ip("minio");

Review Comment:
   I've exposed the ports, and this allows me to just point to 127.0.0.1  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] How to move Iceberg table from one location to another [iceberg]

2024-04-25 Thread via GitHub


cccs-jc commented on issue #3142:
URL: https://github.com/apache/iceberg/issues/3142#issuecomment-2077081125

   @ksmatharoo, thank you for providing your code to replace the paths within 
the iceberg metadata.
   
   I've implemented a similar solution using Python's fastavro to handle Avro 
files. In my approach, I read the Avro files, replace the paths, and then save 
the modified files.
   
   So far, everything seems to be working well. However, I noticed that your 
code is more sophisticated, particularly in how you manage lists of known 
paths. I'm curious if there's a specific reason for this. While my solution is 
straightforward, I'm wondering if there might be subtle issues that I haven't 
considered. Any insights you could provide would be appreciated!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579375646


##
crates/e2e_test/testdata/docker-compose.yaml:
##
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+version: '3.8'
+
+services:
+  rest:
+image: tabulario/iceberg-rest:0.10.0
+environment:
+  - AWS_ACCESS_KEY_ID=admin
+  - AWS_SECRET_ACCESS_KEY=password
+  - AWS_REGION=us-east-1
+  - CATALOG_CATOLOG__IMPL=org.apache.iceberg.jdbc.JdbcCatalog
+  - CATALOG_URI=jdbc:sqlite:file:/tmp/iceberg_rest_mode=memory
+  - CATALOG_WAREHOUSE=s3://icebergdata/demo
+  - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
+  - CATALOG_S3_ENDPOINT=http://minio:9000
+depends_on:
+  - minio
+links:
+  - minio:icebergdata.minio
+expose:
+  - 8181

Review Comment:
   I would love to run this from my IDE as well:
   ```suggestion
   ports:
 - 8181:8181
   expose:
 - 8181
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579372414


##
crates/e2e_test/testdata/docker-compose.yaml:
##
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+version: '3.8'
+
+services:
+  rest:
+image: tabulario/iceberg-rest:0.10.0
+environment:
+  - AWS_ACCESS_KEY_ID=admin
+  - AWS_SECRET_ACCESS_KEY=password
+  - AWS_REGION=us-east-1
+  - CATALOG_CATOLOG__IMPL=org.apache.iceberg.jdbc.JdbcCatalog
+  - CATALOG_URI=jdbc:sqlite:file:/tmp/iceberg_rest_mode=memory
+  - CATALOG_WAREHOUSE=s3://icebergdata/demo
+  - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
+  - CATALOG_S3_ENDPOINT=http://minio:9000
+depends_on:
+  - minio
+links:
+  - minio:icebergdata.minio
+expose:
+  - 8181
+
+  minio:
+image: minio/minio:RELEASE.2024-03-07T00-43-48Z
+environment:
+  - MINIO_ROOT_USER=admin
+  - MINIO_ROOT_PASSWORD=password
+  - MINIO_DOMAIN=minio
+expose:
+  - 9001
+  - 9000

Review Comment:
   How about also exposing the management console:
   ```suggestion
   ports:
 - 9001:9001
   expose:
 - 9001
 - 9000
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579369755


##
crates/e2e_test/testdata/docker-compose.yaml:
##
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+version: '3.8'
+
+services:
+  rest:
+image: tabulario/iceberg-rest:0.10.0
+environment:
+  - AWS_ACCESS_KEY_ID=admin
+  - AWS_SECRET_ACCESS_KEY=password
+  - AWS_REGION=us-east-1
+  - CATALOG_CATOLOG__IMPL=org.apache.iceberg.jdbc.JdbcCatalog
+  - CATALOG_URI=jdbc:sqlite:file:/tmp/iceberg_rest_mode=memory
+  - CATALOG_WAREHOUSE=s3://icebergdata/demo
+  - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
+  - CATALOG_S3_ENDPOINT=http://minio:9000
+depends_on:
+  - minio
+links:
+  - minio:icebergdata.minio
+expose:
+  - 8181
+
+  minio:

Review Comment:
   It doesn't boot on my end:
   
   ```
   ➜  iceberg-rust git:(tx_append) docker logs -f d7a12d1f9d30
   Formatting 1st pool, 1 set(s), 1 drives per set.
   WARNING: Host local has more than 0 drives of set. A host failure will 
result in data becoming unavailable.
   MinIO Object Storage Server
   Copyright: 2015-2024 MinIO, Inc.
   License: GNU AGPLv3 
   Version: RELEASE.2024-03-07T00-43-48Z (go1.21.8 linux/arm64)
   
   Status: 1 Online, 0 Offline. 
   S3-API: http://172.20.0.2:9000  http://127.0.0.1:9000 
   Console: http://172.20.0.2:9001 http://127.0.0.1:9001   
   
   Documentation: https://min.io/docs/minio/linux/index.html
   Warning: The standard parity is set to 0. This can lead to data loss.
   
   API: ListObjectsV2(bucket=icebergdata)
   Time: 12:11:34 UTC 04/25/2024
   DeploymentID: 0d2c88aa-2393-4c17-a28c-560e6cfe4b9b
   RequestID: 17C984C03C8E1AFA
   RemoteHost: 172.20.0.3
   Host: minio:9000
   UserAgent: MinIO (linux; arm64) minio-go/v7.0.67 
mc/RELEASE.2024-03-07T00-31-49Z
   Error: volume not found (cmd.StorageErr)
  7: internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()
  6: internal/logger/logonce.go:149:logger.LogOnceIf()
  5: 
cmd/erasure-server-pool.go:1516:cmd.(*erasureServerPools).ListObjects()
  4: 
cmd/erasure-server-pool.go:1275:cmd.(*erasureServerPools).ListObjectsV2()
  3: 
cmd/bucket-listobjects-handlers.go:210:cmd.objectAPIHandlers.listObjectsV2Handler()
  2: 
cmd/bucket-listobjects-handlers.go:156:cmd.objectAPIHandlers.ListObjectsV2Handler()
  1: net/http/server.go:2136:http.HandlerFunc.ServeHTTP()
   
You are running an older version of MinIO released 1 month before the 
latest release 
Update: Run `mc admin update ALIAS` 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] feat: support append data file and add e2e test [iceberg-rust]

2024-04-25 Thread via GitHub


Fokko commented on code in PR #349:
URL: https://github.com/apache/iceberg-rust/pull/349#discussion_r1579368639


##
crates/e2e_test/testdata/docker-compose.yaml:
##
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+version: '3.8'
+
+services:
+  rest:
+image: tabulario/iceberg-rest:0.10.0
+environment:
+  - AWS_ACCESS_KEY_ID=admin
+  - AWS_SECRET_ACCESS_KEY=password
+  - AWS_REGION=us-east-1
+  - CATALOG_CATOLOG__IMPL=org.apache.iceberg.jdbc.JdbcCatalog

Review Comment:
   There is a typo in this. I think this line can go.
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]

2024-04-25 Thread via GitHub


nastra commented on PR #9008:
URL: https://github.com/apache/iceberg/pull/9008#issuecomment-2077042155

   @epgif can you please address the test failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2024-04-25 Thread via GitHub


nastra commented on code in PR #9852:
URL: https://github.com/apache/iceberg/pull/9852#discussion_r1579326259


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -209,11 +262,63 @@ public boolean dropTable(TableIdentifier identifier, 
boolean purge) {
 }
   }
 
+  @Override
+  public boolean dropView(TableIdentifier identifier) {
+if (!isValidIdentifier(identifier)) {
+  return false;
+}
+
+try {
+  String database = identifier.namespace().level(0);
+  String viewName = identifier.name();
+  Table table = clients.run(client -> client.getTable(database, viewName));
+  HiveOperationsBase.validateTableIsIcebergView(

Review Comment:
   @nk1506 you should be fine removing 
   ```
   Table table = clients.run(client -> client.getTable(database, viewName));
 HiveOperationsBase.validateTableIsIcebergView(...)
   ```
   
   as that's what I meant in my comment above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



  1   2   >