[jira] [Created] (HUDI-7937) Fix handling of decimals in StreamSync and Clustering
Timothy Brown created HUDI-7937: --- Summary: Fix handling of decimals in StreamSync and Clustering Key: HUDI-7937 URL: https://issues.apache.org/jira/browse/HUDI-7937 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown When decimals use a small precision, we need to write them in the legacy format to ensure all Hudi components can read them back. -- This message was sent by Atlassian Jira (v8.20.10#820010)
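For context, the issue comes down to how Parquet encodes small-precision decimals: modern writers may pack them into INT32/INT64, while the legacy convention always uses a fixed-length byte array, and a reader expecting one layout can fail on the other. A minimal sketch of the encoding choice (illustrative only, not Hudi's actual code):

```python
def min_bytes_for_precision(precision: int) -> int:
    """Smallest byte count whose two's-complement range covers 10**precision - 1."""
    n = 1
    while 10 ** precision - 1 > 2 ** (8 * n - 1) - 1:
        n += 1
    return n

def decimal_physical_type(precision: int, legacy: bool) -> str:
    # Modern Parquet writers pack small decimals into plain integers;
    # the legacy convention always uses a fixed-length byte array.
    if legacy:
        return f"FIXED_LEN_BYTE_ARRAY({min_bytes_for_precision(precision)})"
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return f"FIXED_LEN_BYTE_ARRAY({min_bytes_for_precision(precision)})"
```

The divergence only exists for precisions up to 18; larger decimals use a byte array either way, which is why only small-precision decimals trip up readers.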
[jira] [Assigned] (HUDI-7937) Fix handling of decimals in StreamSync and Clustering
[ https://issues.apache.org/jira/browse/HUDI-7937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7937: --- Assignee: Timothy Brown > Fix handling of decimals in StreamSync and Clustering > - > > Key: HUDI-7937 > URL: https://issues.apache.org/jira/browse/HUDI-7937 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > When decimals use a small precision, we need to write them in the legacy > format to ensure all Hudi components can read them back. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7927) Secondary View should only initialize when required
[ https://issues.apache.org/jira/browse/HUDI-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7927: --- Assignee: Timothy Brown > Secondary View should only initialize when required > --- > > Key: HUDI-7927 > URL: https://issues.apache.org/jira/browse/HUDI-7927 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > In the PriorityBasedFileSystemView, the secondary view will be initialized > eagerly causing extra overhead including file listing. We should avoid this > to reduce the cost for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7927) Secondary View should only initialize when required
Timothy Brown created HUDI-7927: --- Summary: Secondary View should only initialize when required Key: HUDI-7927 URL: https://issues.apache.org/jira/browse/HUDI-7927 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown In the PriorityBasedFileSystemView, the secondary view will be initialized eagerly causing extra overhead including file listing. We should avoid this to reduce the cost for users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
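The intended behavior can be sketched as a fallback view whose secondary is constructed only when the primary actually fails; the names below are illustrative, not the actual PriorityBasedFileSystemView API:

```python
class PriorityBasedView:
    """Sketch: defer secondary-view construction until the primary fails."""

    def __init__(self, primary, secondary_factory):
        self.primary = primary
        # Hold a factory instead of an instance so no eager file listing happens.
        self._secondary_factory = secondary_factory
        self._secondary = None

    def _secondary_view(self):
        if self._secondary is None:
            self._secondary = self._secondary_factory()
        return self._secondary

    def get_latest_base_files(self, partition):
        try:
            return self.primary(partition)
        except Exception:
            # Only now pay the cost of initializing the secondary view.
            return self._secondary_view()(partition)
```

On the happy path the secondary factory is never invoked, so the extra overhead (including file listing) is avoided entirely.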
[jira] [Created] (HUDI-7826) hoodie.write.set.null.for.missing.columns results in invalid objects
Timothy Brown created HUDI-7826: --- Summary: hoodie.write.set.null.for.missing.columns results in invalid objects Key: HUDI-7826 URL: https://issues.apache.org/jira/browse/HUDI-7826 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown When setting `hoodie.write.set.null.for.missing.columns` a null value will get set for the fields missing in the incoming data set. If the column was non-nullable, then you will get an error at runtime. Instead, we should evolve the field to be nullable in the table's schema. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7821) Handle schema evolution in proto to avro conversion
Timothy Brown created HUDI-7821: --- Summary: Handle schema evolution in proto to avro conversion Key: HUDI-7821 URL: https://issues.apache.org/jira/browse/HUDI-7821 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown Users can encounter errors when a batch of data was written with an older schema and the new schema has fields that are not present in the old data. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7758) MDT Initialization Parses Non-Hudi files
[ https://issues.apache.org/jira/browse/HUDI-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7758: --- Assignee: Timothy Brown > MDT Initialization Parses Non-Hudi files > > > Key: HUDI-7758 > URL: https://issues.apache.org/jira/browse/HUDI-7758 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Right now the MDT initialization will parse files that do not belong to the > Hudi table -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7758) MDT Initialization Parses Non-Hudi files
Timothy Brown created HUDI-7758: --- Summary: MDT Initialization Parses Non-Hudi files Key: HUDI-7758 URL: https://issues.apache.org/jira/browse/HUDI-7758 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown Right now the MDT initialization will parse files that do not belong to the Hudi table -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7713) Schema Reconciliation should also re-order fields
[ https://issues.apache.org/jira/browse/HUDI-7713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7713: --- Assignee: Timothy Brown > Schema Reconciliation should also re-order fields > - > > Key: HUDI-7713 > URL: https://issues.apache.org/jira/browse/HUDI-7713 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > The schema reconciliation currently makes sure the incoming schema is > compatible with the target, but it can also be used to guarantee a consistent > ordering of fields in the schema between commits. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7713) Schema Reconciliation should also re-order fields
Timothy Brown created HUDI-7713: --- Summary: Schema Reconciliation should also re-order fields Key: HUDI-7713 URL: https://issues.apache.org/jira/browse/HUDI-7713 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown The schema reconciliation currently makes sure the incoming schema is compatible with the target, but it can also be used to guarantee a consistent ordering of fields in the schema between commits. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7689) Allow users to leverage HoodieTable and Engine Context in Compaction Strategy
Timothy Brown created HUDI-7689: --- Summary: Allow users to leverage HoodieTable and Engine Context in Compaction Strategy Key: HUDI-7689 URL: https://issues.apache.org/jira/browse/HUDI-7689 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7689) Allow users to leverage HoodieTable and Engine Context in Compaction Strategy
[ https://issues.apache.org/jira/browse/HUDI-7689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7689: --- Assignee: Timothy Brown > Allow users to leverage HoodieTable and Engine Context in Compaction Strategy > - > > Key: HUDI-7689 > URL: https://issues.apache.org/jira/browse/HUDI-7689 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-4732) Leverage Schema Registry for reading proto messages from kafka
[ https://issues.apache.org/jira/browse/HUDI-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-4732: --- Assignee: Timothy Brown > Leverage Schema Registry for reading proto messages from kafka > -- > > Key: HUDI-4732 > URL: https://issues.apache.org/jira/browse/HUDI-4732 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > > If you use the Confluent Schema Registry, they provide a way to deserialize > the kafka message value without providing the protobuf class name. The first > cut of ProtoKafkaSource requires users to specify a classname but we want to > allow users the flexibility to use this other method of deserializing the > message. > > Docs: > https://docs.confluent.io/platform/current/schema-registry/serdes-develop/serdes-protobuf.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7576) Avoid recomputing partition path in AbstractFileSystemView
[ https://issues.apache.org/jira/browse/HUDI-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-7576: Summary: Avoid recomputing partition path in AbstractFileSystemView (was: Add partitionPath to the HoodieBaseFile and HoodieLogFile objects) > Avoid recomputing partition path in AbstractFileSystemView > -- > > Key: HUDI-7576 > URL: https://issues.apache.org/jira/browse/HUDI-7576 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > Adding this field to the classes will allow us to avoid repeatedly computing > the partition path per file in other parts of the code. This can cut down on > the CPU overhead associated with creating the FS View. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7576) Avoid recomputing partition path in AbstractFileSystemView
[ https://issues.apache.org/jira/browse/HUDI-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-7576: Description: We have observed a non-negligible amount of CPU spent simply computing the partition paths of base and log files when building a file system view. We should aim to improve the efficiency of these calls and reduce the number of them. (was: Adding this field to the classes will allow us to avoid repeatedly computing the partition path per file in other parts of the code. This can cut down on the CPU overhead associated with creating the FS View.) > Avoid recomputing partition path in AbstractFileSystemView > -- > > Key: HUDI-7576 > URL: https://issues.apache.org/jira/browse/HUDI-7576 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > We have observed a non-negligible amount of CPU spent simply computing the > partition paths of base and log files when building a file system view. We > should aim to improve the efficiency of these calls and reduce the number of > them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7576) Add partitionPath to the HoodieBaseFile and HoodieLogFile objects
Timothy Brown created HUDI-7576: --- Summary: Add partitionPath to the HoodieBaseFile and HoodieLogFile objects Key: HUDI-7576 URL: https://issues.apache.org/jira/browse/HUDI-7576 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Adding this field to the classes will allow us to avoid repeatedly computing the partition path per file in other parts of the code. This can cut down on the CPU overhead associated with creating the FS View. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7575) Avoid recomputing list of pending replacecommits in FSView code
Timothy Brown created HUDI-7575: --- Summary: Avoid recomputing list of pending replacecommits in FSView code Key: HUDI-7575 URL: https://issues.apache.org/jira/browse/HUDI-7575 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown When checking if a base file is part of a pending clustering, the code will construct the same list repeatedly leading to unnecessary overhead. The class should gather this list once and persist it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7576) Add partitionPath to the HoodieBaseFile and HoodieLogFile objects
[ https://issues.apache.org/jira/browse/HUDI-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7576: --- Assignee: Timothy Brown > Add partitionPath to the HoodieBaseFile and HoodieLogFile objects > - > > Key: HUDI-7576 > URL: https://issues.apache.org/jira/browse/HUDI-7576 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Adding this field to the classes will allow us to avoid repeatedly computing > the partition path per file in other parts of the code. This can cut down on > the CPU overhead associated with creating the FS View. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7575) Avoid recomputing list of pending replacecommits in FSView code
[ https://issues.apache.org/jira/browse/HUDI-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7575: --- Assignee: Timothy Brown > Avoid recomputing list of pending replacecommits in FSView code > --- > > Key: HUDI-7575 > URL: https://issues.apache.org/jira/browse/HUDI-7575 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > When checking if a base file is part of a pending clustering, the code will > construct the same list repeatedly leading to unnecessary overhead. The class > should gather this list once and persist it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7551) Avoid loading all partitions into memory for cleaner planner
Timothy Brown created HUDI-7551: --- Summary: Avoid loading all partitions into memory for cleaner planner Key: HUDI-7551 URL: https://issues.apache.org/jira/browse/HUDI-7551 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown When the MDT is enabled, the clean planner can end up loading all partitions into memory, which adds more memory pressure on the driver than is required. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7464) JsonKafkaSource Metadata Bug
Timothy Brown created HUDI-7464: --- Summary: JsonKafkaSource Metadata Bug Key: HUDI-7464 URL: https://issues.apache.org/jira/browse/HUDI-7464 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown There are 2 potential issues with the Kafka Json Source: 1. A null key can produce an NPE and result in the offset and other metadata not being added to the row 2. The schema post processor can attempt to add fields to a source schema that may already contain those metadata fields. -- This message was sent by Atlassian Jira (v8.20.10#820010)
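The first issue can be illustrated with a null-safe enrichment step; the metadata field names below are illustrative assumptions, not necessarily the exact columns the JsonKafkaSource adds:

```python
def with_kafka_metadata(row: dict, record) -> dict:
    """Attach Kafka metadata to a row without letting a null key abort enrichment."""
    enriched = dict(row)
    enriched["_kafka_offset"] = record.offset
    enriched["_kafka_partition"] = record.partition
    # Guard the key: messages on un-keyed topics legitimately have key == None,
    # and dereferencing it unconditionally is what drops the offset and other
    # metadata from the row (the NPE described above).
    enriched["_kafka_key"] = record.key if record.key is not None else ""
    return enriched
```

With the guard in place, a null key degrades to an empty string and the offset/partition metadata still lands on the row.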
[jira] [Assigned] (HUDI-7464) JsonKafkaSource Metadata Bug
[ https://issues.apache.org/jira/browse/HUDI-7464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7464: --- Assignee: Timothy Brown > JsonKafkaSource Metadata Bug > > > Key: HUDI-7464 > URL: https://issues.apache.org/jira/browse/HUDI-7464 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > There are 2 potential issues with the Kafka Json Source: > 1. A null key can produce an NPE and result in the offset and other metadata > not being added to the row > 2. The schema post processor can attempt to add fields to a source schema > that may already contain those metadata fields. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7404) Bloom Filter Execution Improvements
Timothy Brown created HUDI-7404: --- Summary: Bloom Filter Execution Improvements Key: HUDI-7404 URL: https://issues.apache.org/jira/browse/HUDI-7404 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown # Avoid executing a countByKey that is only used by a single flow # Avoid intermediate collection on driver # Early exit when possible to avoid overhead of reader instantiation -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7323) Transformer schema inference uses stale schema
[ https://issues.apache.org/jira/browse/HUDI-7323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7323: --- Assignee: Timothy Brown > Transformer schema inference uses stale schema > -- > > Key: HUDI-7323 > URL: https://issues.apache.org/jira/browse/HUDI-7323 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > The `transformedSchema` interface for the Transformer class should use an > up-to-date schema instead of the schema at the time of object creation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7323) Transformer schema inference uses stale schema
Timothy Brown created HUDI-7323: --- Summary: Transformer schema inference uses stale schema Key: HUDI-7323 URL: https://issues.apache.org/jira/browse/HUDI-7323 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown The `transformedSchema` interface for the Transformer class should use an up-to-date schema instead of the schema at the time of object creation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7238) Ensure ExternalSpillableMaps are properly closed
Timothy Brown created HUDI-7238: --- Summary: Ensure ExternalSpillableMaps are properly closed Key: HUDI-7238 URL: https://issues.apache.org/jira/browse/HUDI-7238 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown There are a few places where the ExternalSpillableMap is used but the close method is not called. There are also cases where we create the underlying BitMap even when we have no need for it yet. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7237) Minor Improvements to Schema Handling in Delta Sync
[ https://issues.apache.org/jira/browse/HUDI-7237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-7237: Priority: Minor (was: Major) > Minor Improvements to Schema Handling in Delta Sync > --- > > Key: HUDI-7237 > URL: https://issues.apache.org/jira/browse/HUDI-7237 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Priority: Minor > Labels: pull-request-available > > There are two minor items that we have run into while running DeltaStreamer in > production. > 1. The number of times the schema is fetched is more than it needs to be and > can put unnecessary load on schema providers or increase file system reads > 2. SchemaProviders that return null target schemas on empty batches cause > null schema values in commits leading to unexpected issues later > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7237) Minor Improvements to Schema Handling in Delta Sync
Timothy Brown created HUDI-7237: --- Summary: Minor Improvements to Schema Handling in Delta Sync Key: HUDI-7237 URL: https://issues.apache.org/jira/browse/HUDI-7237 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown There are two minor items that we have run into while running DeltaStreamer in production. 1. The number of times the schema is fetched is more than it needs to be and can put unnecessary load on schema providers or increase file system reads 2. SchemaProviders that return null target schemas on empty batches cause null schema values in commits leading to unexpected issues later -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7223) Hudi Cleaner removing files still required for view N hours old
Timothy Brown created HUDI-7223: --- Summary: Hudi Cleaner removing files still required for view N hours old Key: HUDI-7223 URL: https://issues.apache.org/jira/browse/HUDI-7223 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown If a user is using a time-based cleaner policy, they will expect that they can query the table state as of N hours ago. This means the cleaner should not remove files simply because they are older than N hours, but rather files that were no longer relevant to the table as of N hours ago. -- This message was sent by Atlassian Jira (v8.20.10#820010)
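The distinction can be made concrete with a small sketch: for each file group, retain every version committed after the cutoff plus the latest version at or before it, since that one is still needed to reconstruct the table state as of N hours ago (illustrative logic, not the actual clean planner):

```python
def versions_to_retain(version_times, cutoff):
    """version_times: commit times of the file versions in one file group.

    Retain all versions newer than the cutoff, plus the single latest
    version at or before it -- a naive "older than N hours" rule would
    delete that version even though the as-of-cutoff view still needs it.
    """
    newer = [t for t in version_times if t > cutoff]
    older = [t for t in version_times if t <= cutoff]
    if older:
        newer.append(max(older))
    return sorted(newer)
```

For example, with versions committed at times 1, 2, 5, and 9 and a cutoff of 4, version 2 must survive because it was the live version at time 4, even though it is "older than the cutoff".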
[jira] [Assigned] (HUDI-7160) Avro Schema Properties are dropped when adding Hoodie Metadata columns
[ https://issues.apache.org/jira/browse/HUDI-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7160: --- Assignee: Timothy Brown > Avro Schema Properties are dropped when adding Hoodie Metadata columns > -- > > Key: HUDI-7160 > URL: https://issues.apache.org/jira/browse/HUDI-7160 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > When we add the metadata columns to an existing avro schema, the properties > set on that schema are dropped. We should allow these properties to be > carried through. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7160) Avro Schema Properties are dropped when adding Hoodie Metadata columns
Timothy Brown created HUDI-7160: --- Summary: Avro Schema Properties are dropped when adding Hoodie Metadata columns Key: HUDI-7160 URL: https://issues.apache.org/jira/browse/HUDI-7160 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown When we add the metadata columns to an existing avro schema, the properties set on that schema are dropped. We should allow these properties to be carried through. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7160) Avro Schema Properties are dropped when adding Hoodie Metadata columns
[ https://issues.apache.org/jira/browse/HUDI-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-7160: Priority: Minor (was: Major) > Avro Schema Properties are dropped when adding Hoodie Metadata columns > -- > > Key: HUDI-7160 > URL: https://issues.apache.org/jira/browse/HUDI-7160 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > > When we add the metadata columns to an existing avro schema, the properties > set on that schema are dropped. We should allow these properties to be > carried through. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7115) Add more options for BigQuery Sync
Timothy Brown created HUDI-7115: --- Summary: Add more options for BigQuery Sync Key: HUDI-7115 URL: https://issues.apache.org/jira/browse/HUDI-7115 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown There are options for requiring a partition filter and adding a BigLake connection ID to leverage some new access control features that users may want to use in their environment. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7112) Allow reuse of timeline server across tables
[ https://issues.apache.org/jira/browse/HUDI-7112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-7112: --- Assignee: Timothy Brown > Allow reuse of timeline server across tables > > > Key: HUDI-7112 > URL: https://issues.apache.org/jira/browse/HUDI-7112 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > When a user is running multiple writers in the same JVM, there will currently > be a timeline server created per table. This leads to unnecessary overhead > since the timeline server can support multiple basepaths. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7112) Allow reuse of timeline server across tables
Timothy Brown created HUDI-7112: --- Summary: Allow reuse of timeline server across tables Key: HUDI-7112 URL: https://issues.apache.org/jira/browse/HUDI-7112 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown When a user is running multiple writers in the same JVM, there will currently be a timeline server created per table. This leads to unnecessary overhead since the timeline server can support multiple basepaths. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6916) Fix excessive object creation in custom key generator
Timothy Brown created HUDI-6916: --- Summary: Fix excessive object creation in custom key generator Key: HUDI-6916 URL: https://issues.apache.org/jira/browse/HUDI-6916 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown The custom key generators are creating key generator objects per record/row instead of creating them once up front. -- This message was sent by Atlassian Jira (v8.20.10#820010)
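The fix amounts to hoisting the delegate construction out of the per-record path; a minimal sketch of the anti-pattern and the fix (names are illustrative, not Hudi's key generator API):

```python
class FieldKeyGen:
    """Trivial per-field key generator; counts constructions for illustration."""
    instances = 0

    def __init__(self, field):
        FieldKeyGen.instances += 1
        self.field = field

    def key(self, record):
        return str(record[self.field])


class CustomKeyGenSlow:
    """Anti-pattern: constructs a delegate generator for every record."""
    def __init__(self, fields):
        self.fields = fields

    def key(self, record):
        return ":".join(FieldKeyGen(f).key(record) for f in self.fields)


class CustomKeyGenFast:
    """Fix: build the delegates once up front and reuse them per record."""
    def __init__(self, fields):
        self.delegates = [FieldKeyGen(f) for f in fields]

    def key(self, record):
        return ":".join(d.key(record) for d in self.delegates)
```

The slow variant allocates one delegate per field per record, so object creation grows with the dataset; the fast variant allocates only one delegate per field, regardless of record count.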
[jira] [Assigned] (HUDI-6916) Fix excessive object creation in custom key generator
[ https://issues.apache.org/jira/browse/HUDI-6916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6916: --- Assignee: Timothy Brown > Fix excessive object creation in custom key generator > - > > Key: HUDI-6916 > URL: https://issues.apache.org/jira/browse/HUDI-6916 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > The custom key generators are creating key generator objects per record/row > instead of creating them once up front. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6898) Improve test stability by closing metadata writers, update logging
[ https://issues.apache.org/jira/browse/HUDI-6898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6898: --- Assignee: Timothy Brown > Improve test stability by closing metadata writers, update logging > -- > > Key: HUDI-6898 > URL: https://issues.apache.org/jira/browse/HUDI-6898 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Trivial > > Improve the test stability and performance by closing all metadata writers > created in the tests. > Also update logging to reduce the number of logs making it easier to find the > failures in the test output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6898) Improve test stability by closing metadata writers, update logging
[ https://issues.apache.org/jira/browse/HUDI-6898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-6898: Priority: Trivial (was: Major) > Improve test stability by closing metadata writers, update logging > -- > > Key: HUDI-6898 > URL: https://issues.apache.org/jira/browse/HUDI-6898 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Priority: Trivial > > Improve the test stability and performance by closing all metadata writers > created in the tests. > Also update logging to reduce the number of logs making it easier to find the > failures in the test output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6898) Improve test stability by closing metadata writers, update logging
Timothy Brown created HUDI-6898: --- Summary: Improve test stability by closing metadata writers, update logging Key: HUDI-6898 URL: https://issues.apache.org/jira/browse/HUDI-6898 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Improve the test stability and performance by closing all metadata writers created in the tests. Also update logging to reduce the number of logs making it easier to find the failures in the test output. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6871) BigQuery Sync Improvements
[ https://issues.apache.org/jira/browse/HUDI-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-6871: Priority: Minor (was: Major) > BigQuery Sync Improvements > -- > > Key: HUDI-6871 > URL: https://issues.apache.org/jira/browse/HUDI-6871 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > > # The manifest file writer is slow due to the overhead incurred per iteration > # Schemas with reserved keywords are failing in the create table statement -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6871) BigQuery Sync Improvements
[ https://issues.apache.org/jira/browse/HUDI-6871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6871: --- Assignee: Timothy Brown > BigQuery Sync Improvements > -- > > Key: HUDI-6871 > URL: https://issues.apache.org/jira/browse/HUDI-6871 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > # The manifest file writer is slow due to the overhead incurred per iteration > # Schemas with reserved keywords are failing in the create table statement -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6871) BigQuery Sync Improvements
Timothy Brown created HUDI-6871: --- Summary: BigQuery Sync Improvements Key: HUDI-6871 URL: https://issues.apache.org/jira/browse/HUDI-6871 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown # The manifest file writer is slow due to the overhead incurred per iteration # Schemas with reserved keywords are failing in the create table statement -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6857) Update Docs For BigQuerySyncTool
[ https://issues.apache.org/jira/browse/HUDI-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6857: --- Assignee: Timothy Brown > Update Docs For BigQuerySyncTool > > > Key: HUDI-6857 > URL: https://issues.apache.org/jira/browse/HUDI-6857 > Project: Apache Hudi > Issue Type: Task >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Trivial > > Update the docs to include references to the new manifest based approach -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6857) Update Docs For BigQuerySyncTool
Timothy Brown created HUDI-6857: --- Summary: Update Docs For BigQuerySyncTool Key: HUDI-6857 URL: https://issues.apache.org/jira/browse/HUDI-6857 Project: Apache Hudi Issue Type: Task Reporter: Timothy Brown Update the docs to include references to the new manifest based approach -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6839) Github Actions Workflow Improvements
[ https://issues.apache.org/jira/browse/HUDI-6839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6839: --- Assignee: Timothy Brown > Github Actions Workflow Improvements > > > Key: HUDI-6839 > URL: https://issues.apache.org/jira/browse/HUDI-6839 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > # Leverage maven cache option for build speed > # Use parallel build when packaging jars for tests > # Cancel inflight tests when updates to branches are pushed to save on costs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6839) Github Actions Workflow Improvements
Timothy Brown created HUDI-6839: --- Summary: Github Actions Workflow Improvements Key: HUDI-6839 URL: https://issues.apache.org/jira/browse/HUDI-6839 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown # Leverage maven cache option for build speed # Use parallel build when packaging jars for tests # Cancel inflight tests when updates to branches are pushed to save on costs -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6836) Shutdown metrics for metadata table writer in deltastreamer
Timothy Brown created HUDI-6836: --- Summary: Shutdown metrics for metadata table writer in deltastreamer Key: HUDI-6836 URL: https://issues.apache.org/jira/browse/HUDI-6836 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown When debugging some Deltastreamer tests, I noticed that there is still a running metrics instance for the metadata table path. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6836) Shutdown metrics for metadata table writer in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-6836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6836: --- Assignee: Timothy Brown > Shutdown metrics for metadata table writer in deltastreamer > --- > > Key: HUDI-6836 > URL: https://issues.apache.org/jira/browse/HUDI-6836 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > > When debugging some Deltastreamer tests, I noticed that there is still a > running metrics instance for the metadata table path. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6807) MoR Incremental count queries trigger full scan of files in table
Timothy Brown created HUDI-6807: --- Summary: MoR Incremental count queries trigger full scan of files in table Key: HUDI-6807 URL: https://issues.apache.org/jira/browse/HUDI-6807 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown While running the `TestMORDataSource` datasource tests, I saw that we eventually call `HoodiePruneFileSourcePartitions`, which lists all of the files in the table instead of only the files relevant to the incremental query. Ideally this would be limited to the files impacted by commits within the specified range. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6763) WriteStats are extracted twice in BaseSparkCommitActionExecutor
[ https://issues.apache.org/jira/browse/HUDI-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6763: --- Assignee: Timothy Brown > WriteStats are extracted twice in BaseSparkCommitActionExecutor > --- > > Key: HUDI-6763 > URL: https://issues.apache.org/jira/browse/HUDI-6763 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > > In BaseSparkCommitActionExecutor there are two places the same > `collectAsList` is called on an RDD. We can optimize this by only calling > this method once. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6763) WriteStats are extracted twice in BaseSparkCommitActionExecutor
Timothy Brown created HUDI-6763: --- Summary: WriteStats are extracted twice in BaseSparkCommitActionExecutor Key: HUDI-6763 URL: https://issues.apache.org/jira/browse/HUDI-6763 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown In BaseSparkCommitActionExecutor there are two places where the same `collectAsList` is called on an RDD. We can optimize this by calling the method only once. -- This message was sent by Atlassian Jira (v8.20.10#820010)
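The optimization this ticket describes boils down to materializing the collected result once and reusing it. A stdlib-only sketch of that pattern, with a plain `Supplier` standing in for the RDD's `collectAsList` (the class name here is illustrative, not the actual Hudi code):

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for an executor holding an expensive collectAsList() call.
class MemoizedCollect<T> {
    private final Supplier<List<T>> collector;
    private List<T> cached; // populated on first use

    MemoizedCollect(Supplier<List<T>> collector) {
        this.collector = collector;
    }

    // Runs the expensive collect at most once and reuses the result after that.
    synchronized List<T> get() {
        if (cached == null) {
            cached = collector.get();
        }
        return cached;
    }
}
```

Both call sites in the executor would then read from the same memoized result instead of triggering two Spark jobs.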
[jira] [Assigned] (HUDI-6741) Timeline server cannot handle multiple base paths when metadata table is enabled
[ https://issues.apache.org/jira/browse/HUDI-6741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6741: --- Assignee: Timothy Brown > Timeline server cannot handle multiple base paths when metadata table is > enabled > > > Key: HUDI-6741 > URL: https://issues.apache.org/jira/browse/HUDI-6741 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > The Timeline Server will take in a view manager to gather the information > about the tables. When the metadata table is enabled, there is a supplier > that will be called to get the > HoodieTableMetadata. That supplier is configured for a single base path but > the timeline server can be used for multiple tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6741) Timeline server cannot handle multiple base paths when metadata table is enabled
Timothy Brown created HUDI-6741: --- Summary: Timeline server cannot handle multiple base paths when metadata table is enabled Key: HUDI-6741 URL: https://issues.apache.org/jira/browse/HUDI-6741 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown The Timeline Server will take in a view manager to gather the information about the tables. When the metadata table is enabled, there is a supplier that will be called to get the HoodieTableMetadata. That supplier is configured for a single base path but the timeline server can be used for multiple tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
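One way the multi-table problem above could be addressed is to key the metadata lookup by base path rather than fixing it at construction time. A minimal stdlib-only sketch under that assumption (both class names are hypothetical stand-ins, not the actual Hudi API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for HoodieTableMetadata; only the base path matters here.
class TableMetadata {
    final String basePath;
    TableMetadata(String basePath) { this.basePath = basePath; }
}

// Keys the metadata lookup by base path so one server can serve many tables.
class MetadataRegistry {
    private final Map<String, TableMetadata> cache = new ConcurrentHashMap<>();
    private final Function<String, TableMetadata> factory;

    MetadataRegistry(Function<String, TableMetadata> factory) {
        this.factory = factory;
    }

    // Builds metadata for a table on first request and reuses it afterwards.
    TableMetadata forTable(String basePath) {
        return cache.computeIfAbsent(basePath, factory);
    }
}
```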
[jira] [Assigned] (HUDI-6731) Allow MoR Read-Optimized BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-6731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6731: --- Assignee: Timothy Brown > Allow MoR Read-Optimized BigQuery Sync > -- > > Key: HUDI-6731 > URL: https://issues.apache.org/jira/browse/HUDI-6731 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > Labels: pull-request-available > > Allow users to query their Hudi MoR tables with BigQuery in a read-optimized > manner by syncing the base files to BigQuery like we do for CoW tables today. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6731) Allow MoR Read-Optimized BigQuery Sync
Timothy Brown created HUDI-6731: --- Summary: Allow MoR Read-Optimized BigQuery Sync Key: HUDI-6731 URL: https://issues.apache.org/jira/browse/HUDI-6731 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Allow users to query their Hudi MoR tables with BigQuery in a read-optimized manner by syncing the base files to BigQuery like we do for CoW tables today. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6728) Add Schema Evolution Support to BigQuery Sync
[ https://issues.apache.org/jira/browse/HUDI-6728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6728: --- Assignee: Timothy Brown > Add Schema Evolution Support to BigQuery Sync > - > > Key: HUDI-6728 > URL: https://issues.apache.org/jira/browse/HUDI-6728 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Right now the BigQuery sync is using schema auto detection which will rely on > a single file for the schema. This can cause issues when users evolve their > schema since the file may not have the latest schema. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6728) Add Schema Evolution Support to BigQuery Sync
Timothy Brown created HUDI-6728: --- Summary: Add Schema Evolution Support to BigQuery Sync Key: HUDI-6728 URL: https://issues.apache.org/jira/browse/HUDI-6728 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Right now the BigQuery sync uses schema auto-detection, which relies on a single file for the schema. This can cause issues when users evolve their schema, since that file may not have the latest schema. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6672) BigQuery Sync updates while queries running cause failures
[ https://issues.apache.org/jira/browse/HUDI-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756221#comment-17756221 ] Timothy Brown commented on HUDI-6672: - Closing since there is a new manifest file based approach that does not have this issue. > BigQuery Sync updates while queries running cause failures > -- > > Key: HUDI-6672 > URL: https://issues.apache.org/jira/browse/HUDI-6672 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Issue was reported by the user here: > [https://github.com/apache/hudi/issues/9355] > > It looks like we are updating the underlying manifest file while there is a > query executing causing issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-6672) BigQuery Sync updates while queries running cause failures
[ https://issues.apache.org/jira/browse/HUDI-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown closed HUDI-6672. --- Resolution: Won't Fix > BigQuery Sync updates while queries running cause failures > -- > > Key: HUDI-6672 > URL: https://issues.apache.org/jira/browse/HUDI-6672 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Issue was reported by the user here: > [https://github.com/apache/hudi/issues/9355] > > It looks like we are updating the underlying manifest file while there is a > query executing causing issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6672) BigQuery Sync updates while queries running cause failures
Timothy Brown created HUDI-6672: --- Summary: BigQuery Sync updates while queries running cause failures Key: HUDI-6672 URL: https://issues.apache.org/jira/browse/HUDI-6672 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Issue was reported by the user here: [https://github.com/apache/hudi/issues/9355] It looks like we are updating the underlying manifest file while a query is executing, causing failures. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6672) BigQuery Sync updates while queries running cause failures
[ https://issues.apache.org/jira/browse/HUDI-6672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6672: --- Assignee: Timothy Brown > BigQuery Sync updates while queries running cause failures > -- > > Key: HUDI-6672 > URL: https://issues.apache.org/jira/browse/HUDI-6672 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Issue was reported by the user here: > [https://github.com/apache/hudi/issues/9355] > > It looks like we are updating the underlying manifest file while there is a > query executing causing issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6664) Fix Java Bulk Insert partitioner for all metadata table partitions
Timothy Brown created HUDI-6664: --- Summary: Fix Java Bulk Insert partitioner for all metadata table partitions Key: HUDI-6664 URL: https://issues.apache.org/jira/browse/HUDI-6664 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown The Java bulk insert partitioner was updated to handle the metadata table, but the change should be implemented in a cleaner way and validated to work when bootstrapping all of the metadata table partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6648) Allow creation of table with existing files when metadata table is enabled
[ https://issues.apache.org/jira/browse/HUDI-6648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6648: --- Assignee: Timothy Brown > Allow creation of table with existing files when metadata table is enabled > -- > > Key: HUDI-6648 > URL: https://issues.apache.org/jira/browse/HUDI-6648 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > With the metadata table we can store the information about the table and > shift away from relying directly on file names for information like commit > and fileID. Adding support for creating tables with existing files will allow > us to initialize tables from existing datasets. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6648) Allow creation of table with existing files when metadata table is enabled
Timothy Brown created HUDI-6648: --- Summary: Allow creation of table with existing files when metadata table is enabled Key: HUDI-6648 URL: https://issues.apache.org/jira/browse/HUDI-6648 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown With the metadata table we can store the information about the table and shift away from relying directly on file names for information like commit and fileID. Adding support for creating tables with existing files will allow us to initialize tables from existing datasets. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6647) Expand Hudi Java Client Functionality
Timothy Brown created HUDI-6647: --- Summary: Expand Hudi Java Client Functionality Key: HUDI-6647 URL: https://issues.apache.org/jira/browse/HUDI-6647 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown With recent improvements to the abstractions in the Hudi codebase, we can expand the functionality of the Java client with less effort by moving common code into the base client and table services. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6647) Expand Hudi Java Client Functionality
[ https://issues.apache.org/jira/browse/HUDI-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6647: --- Assignee: Timothy Brown > Expand Hudi Java Client Functionality > - > > Key: HUDI-6647 > URL: https://issues.apache.org/jira/browse/HUDI-6647 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > With recent improvements to the abstractions in the Hudi codebase we can > expand the functionality in the java client with a lower amount of effort by > moving common code into the base client and table services. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6628) Rely on HoodieBaseFile and HoodieLogFile methods over FsUtils
Timothy Brown created HUDI-6628: --- Summary: Rely on HoodieBaseFile and HoodieLogFile methods over FsUtils Key: HUDI-6628 URL: https://issues.apache.org/jira/browse/HUDI-6628 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Update the code to rely on the methods exposed by HoodieBaseFile and HoodieLogFile instead of FsUtils where possible, to start removing our reliance on parsing file paths for information like commit time and file ID throughout the codebase. -- This message was sent by Atlassian Jira (v8.20.10#820010)
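To illustrate the direction, here is a minimal sketch of parsing a base file name once and then exposing accessors, assuming the conventional `<fileId>_<writeToken>_<instantTime>.<ext>` naming; the real HoodieBaseFile is more involved and this class name is only a stand-in:

```java
// Sketch of the idea behind preferring object accessors over re-parsing paths:
// parse the file name once in the constructor, then use getters everywhere else.
// Assumes (hypothetically) the <fileId>_<writeToken>_<instantTime>.<ext> layout.
class BaseFileSketch {
    private final String fileId;
    private final String commitTime;

    BaseFileSketch(String fileName) {
        String stem = fileName.substring(0, fileName.lastIndexOf('.'));
        String[] parts = stem.split("_");
        this.fileId = parts[0];     // first segment: file ID
        this.commitTime = parts[2]; // third segment: instant (commit) time
    }

    String getFileId() { return fileId; }
    String getCommitTime() { return commitTime; }
}
```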
[jira] [Assigned] (HUDI-6628) Rely on HoodieBaseFile and HoodieLogFile methods over FsUtils
[ https://issues.apache.org/jira/browse/HUDI-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6628: --- Assignee: Timothy Brown > Rely on HoodieBaseFile and HoodieLogFile methods over FsUtils > - > > Key: HUDI-6628 > URL: https://issues.apache.org/jira/browse/HUDI-6628 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > > Update the code to rely on the methods exposed by the HoodieBaseFile and the > HoodieLogFile instead of using FsUtils when possible to start removing our > reliance on directly referencing file paths for information like commit time > and file ID throughout the codebase. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6618) Add Java implementation of HoodieBackedTableMetadataWriter
[ https://issues.apache.org/jira/browse/HUDI-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6618: --- Assignee: Timothy Brown > Add Java implementation of HoodieBackedTableMetadataWriter > -- > > Key: HUDI-6618 > URL: https://issues.apache.org/jira/browse/HUDI-6618 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Add an implementation of > HoodieBackedTableMetadataWriter to be used within the java write client -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6618) Add Java implementation of HoodieBackedTableMetadataWriter
Timothy Brown created HUDI-6618: --- Summary: Add Java implementation of HoodieBackedTableMetadataWriter Key: HUDI-6618 URL: https://issues.apache.org/jira/browse/HUDI-6618 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Add an implementation of HoodieBackedTableMetadataWriter to be used within the java write client -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6590) Improve BigQuery Sync Schema and Partition Handling
[ https://issues.apache.org/jira/browse/HUDI-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-6590: Summary: Improve BigQuery Sync Schema and Partition Handling (was: Improve BigQuery Sync Support) > Improve BigQuery Sync Schema and Partition Handling > --- > > Key: HUDI-6590 > URL: https://issues.apache.org/jira/browse/HUDI-6590 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > > Add features for Schema evolution and listing only required base files while > querying the table to cut down on BigQuery usage costs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-6590) Improve BigQuery Sync Support
[ https://issues.apache.org/jira/browse/HUDI-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6590: --- Assignee: Timothy Brown > Improve BigQuery Sync Support > - > > Key: HUDI-6590 > URL: https://issues.apache.org/jira/browse/HUDI-6590 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Minor > > Add features for Schema evolution and listing only required base files while > querying the table to cut down on BigQuery usage costs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6590) Improve BigQuery Sync Support
Timothy Brown created HUDI-6590: --- Summary: Improve BigQuery Sync Support Key: HUDI-6590 URL: https://issues.apache.org/jira/browse/HUDI-6590 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Add features for Schema evolution and listing only required base files while querying the table to cut down on BigQuery usage costs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-6168) Add source partition columns to rows in S3/GCS Sources
Timothy Brown created HUDI-6168: --- Summary: Add source partition columns to rows in S3/GCS Sources Key: HUDI-6168 URL: https://issues.apache.org/jira/browse/HUDI-6168 Project: Apache Hudi Issue Type: New Feature Reporter: Timothy Brown If the files read from an S3 or GCS source are themselves laid out with Hive-style partitioning, we should parse the partition values out as columns in the dataset that is then fed into the Delta Streamer. -- This message was sent by Atlassian Jira (v8.20.10#820010)
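Hive-style partitioning encodes values as `key=value` path segments, so extracting them is a matter of scanning the source file path. An illustrative sketch (not the actual source implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Extracts Hive-style key=value partition values from a source file path,
// e.g. "s3://bucket/events/ds=2024-01-01/hour=05/part-0.parquet".
class HivePartitionParser {
    static Map<String, String> parse(String path) {
        Map<String, String> partitions = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            // Keep only segments shaped like key=value with both sides non-empty.
            if (eq > 0 && eq < segment.length() - 1) {
                partitions.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return partitions;
    }
}
```

Each extracted entry would then become an extra column on the rows read from that file.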
[jira] [Assigned] (HUDI-6168) Add source partition columns to rows in S3/GCS Sources
[ https://issues.apache.org/jira/browse/HUDI-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-6168: --- Assignee: Timothy Brown > Add source partition columns to rows in S3/GCS Sources > -- > > Key: HUDI-6168 > URL: https://issues.apache.org/jira/browse/HUDI-6168 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > If the files read from an S3 or GCS source have a hive style partitioning > themselves, we should be able to parse that out as a column to return in the > dataset that is then fed into the delta streamer -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5532) Add a KeyGenerator to support a Keyless workflow
[ https://issues.apache.org/jira/browse/HUDI-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-5532: --- Assignee: Timothy Brown > Add a KeyGenerator to support a Keyless workflow > > > Key: HUDI-5532 > URL: https://issues.apache.org/jira/browse/HUDI-5532 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > If a Hudi user wants to be able to do append only inserts we should provide > the ability to auto configure keys for them so they don't need to set fields > for the record key -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5532) Add a KeyGenerator to support a Keyless workflow
Timothy Brown created HUDI-5532: --- Summary: Add a KeyGenerator to support a Keyless workflow Key: HUDI-5532 URL: https://issues.apache.org/jira/browse/HUDI-5532 Project: Apache Hudi Issue Type: New Feature Reporter: Timothy Brown If a Hudi user wants to do append-only inserts, we should auto-configure record keys for them so they don't need to set record key fields. -- This message was sent by Atlassian Jira (v8.20.10#820010)
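One plausible shape for such a key generator, sketched with hypothetical names: derive a unique key from an instant time, a task id, and a per-task counter, so no user-supplied fields are needed and keys stay unique across parallel writers.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of an auto-generated record key for append-only writes. The scheme
// (instantTime_taskId_rowId) is an illustrative assumption, not Hudi's exact one.
class KeylessKeyGenerator {
    private final String instantTime;
    private final int taskId;
    private final AtomicLong rowId = new AtomicLong();

    KeylessKeyGenerator(String instantTime, int taskId) {
        this.instantTime = instantTime;
        this.taskId = taskId;
    }

    // Each call yields a new key; uniqueness comes from the (time, task, row) triple.
    String nextKey() {
        return instantTime + "_" + taskId + "_" + rowId.getAndIncrement();
    }
}
```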
[jira] [Created] (HUDI-5370) Properly close file handles for Metadata writer
Timothy Brown created HUDI-5370: --- Summary: Properly close file handles for Metadata writer Key: HUDI-5370 URL: https://issues.apache.org/jira/browse/HUDI-5370 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5370) Properly close file handles for Metadata writer
[ https://issues.apache.org/jira/browse/HUDI-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-5370: --- Assignee: sivabalan narayanan > Properly close file handles for Metadata writer > --- > > Key: HUDI-5370 > URL: https://issues.apache.org/jira/browse/HUDI-5370 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: sivabalan narayanan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4904) Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider
[ https://issues.apache.org/jira/browse/HUDI-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-4904: Status: In Progress (was: Open) > Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider > --- > > Key: HUDI-4904 > URL: https://issues.apache.org/jira/browse/HUDI-4904 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > In proto we can have a schema that is recursive. We should limit the > "unraveling" of a schema to N levels and let the user specify that amount of > levels as a config. After hitting depth N in the recursion, we will create a > Record with a byte array and string. The remaining data for that branch of > the recursion will be written out as a proto byte array and we record the > descriptor string for context of what is in the byte array. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-4904) Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider
[ https://issues.apache.org/jira/browse/HUDI-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown resolved HUDI-4904. - > Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider > --- > > Key: HUDI-4904 > URL: https://issues.apache.org/jira/browse/HUDI-4904 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > In proto we can have a schema that is recursive. We should limit the > "unraveling" of a schema to N levels and let the user specify that amount of > levels as a config. After hitting depth N in the recursion, we will create a > Record with a byte array and string. The remaining data for that branch of > the recursion will be written out as a proto byte array and we record the > descriptor string for context of what is in the byte array. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-4905) Protobuf type handling improvements
[ https://issues.apache.org/jira/browse/HUDI-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown resolved HUDI-4905. - > Protobuf type handling improvements > --- > > Key: HUDI-4905 > URL: https://issues.apache.org/jira/browse/HUDI-4905 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > Two improvements have come out of discussions with others trying to use > protobuf and Hudi. > > # We can support uint64 as a decimal without losing precision and > representing the value in the lake as a positive value > # Proto Timestamps can be converted to long with LogicalType timestamp-micros > # Treat elements within a `oneof` as nullable -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-5198) add in minor perf wins in hudi-utilities and locking related tests
[ https://issues.apache.org/jira/browse/HUDI-5198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-5198: --- Assignee: Timothy Brown > add in minor perf wins in hudi-utilities and locking related tests > -- > > Key: HUDI-5198 > URL: https://issues.apache.org/jira/browse/HUDI-5198 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-5198) add in minor perf wins in hudi-utilities and locking related tests
Timothy Brown created HUDI-5198: --- Summary: add in minor perf wins in hudi-utilities and locking related tests Key: HUDI-5198 URL: https://issues.apache.org/jira/browse/HUDI-5198 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4926) Add documentation to the Hudi Site
Timothy Brown created HUDI-4926: --- Summary: Add documentation to the Hudi Site Key: HUDI-4926 URL: https://issues.apache.org/jira/browse/HUDI-4926 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4905) Protobuf type handling improvements
[ https://issues.apache.org/jira/browse/HUDI-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-4905: Description: Two improvements have come out of discussions with others trying to use protobuf and Hudi. # We can support uint64 as a decimal without losing precision and representing the value in the lake as a positive value # Proto Timestamps can be converted to long with LogicalType timestamp-micros # Treat elements within a `oneof` as nullable was: Two improvements have come out of discussions with others trying to use protobuf and Hudi. # We can support uint64 as a decimal without losing precision and representing the value in the lake as a positive value # Proto Timestamps can be converted to long with LogicalType timestamp-micros > Protobuf type handling improvements > --- > > Key: HUDI-4905 > URL: https://issues.apache.org/jira/browse/HUDI-4905 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Two improvements have come out of discussions with others trying to use > protobuf and Hudi. > > # We can support uint64 as a decimal without losing precision and > representing the value in the lake as a positive value > # Proto Timestamps can be converted to long with LogicalType timestamp-micros > # Treat elements within a `oneof` as nullable -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4905) Protobuf type handling improvements
[ https://issues.apache.org/jira/browse/HUDI-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-4905: Summary: Protobuf type handling improvements (was: Proto type handling improvements) > Protobuf type handling improvements > --- > > Key: HUDI-4905 > URL: https://issues.apache.org/jira/browse/HUDI-4905 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Two improvements have come out of discussions with others trying to use > protobuf and Hudi. > > # We can support uint64 as a decimal without losing precision and > representing the value in the lake as a positive value > # Proto Timestamps can be converted to long with LogicalType timestamp-micros -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4905) Proto type handling improvements
Timothy Brown created HUDI-4905: --- Summary: Proto type handling improvements Key: HUDI-4905 URL: https://issues.apache.org/jira/browse/HUDI-4905 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown Two improvements have come out of discussions with others trying to use protobuf and Hudi. # We can support uint64 as a decimal without losing precision and representing the value in the lake as a positive value # Proto Timestamps can be converted to long with LogicalType timestamp-micros -- This message was sent by Atlassian Jira (v8.20.10#820010)
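Both conversions listed above can be sketched with the standard library alone: a proto uint64 arrives in Java as a signed long, so it must be reinterpreted as unsigned, and a proto Timestamp's seconds/nanos pair collapses into a single timestamp-micros long. Illustrative only, not the actual Hudi converter code:

```java
import java.math.BigDecimal;

class ProtoTypeConversions {
    // uint64 -> non-negative decimal, preserving values above Long.MAX_VALUE
    // by reading the long's bits as an unsigned number.
    static BigDecimal uint64ToDecimal(long unsignedBits) {
        return new BigDecimal(Long.toUnsignedString(unsignedBits));
    }

    // Timestamp(seconds, nanos) -> microseconds since epoch, matching the
    // Avro timestamp-micros logical type (sub-microsecond precision is dropped).
    static long timestampToMicros(long seconds, int nanos) {
        return seconds * 1_000_000L + nanos / 1_000;
    }
}
```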
[jira] [Assigned] (HUDI-4905) Proto type handling improvements
[ https://issues.apache.org/jira/browse/HUDI-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-4905: --- Assignee: Timothy Brown > Proto type handling improvements > > > Key: HUDI-4905 > URL: https://issues.apache.org/jira/browse/HUDI-4905 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > Two improvements have come out of discussions with others trying to use > protobuf and Hudi. > > # We can support uint64 as a decimal without losing precision and > representing the value in the lake as a positive value > # Proto Timestamps can be converted to long with LogicalType timestamp-micros -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-4904) Handle Recursive Proto Schemas
[ https://issues.apache.org/jira/browse/HUDI-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-4904: --- Assignee: Timothy Brown > Handle Recursive Proto Schemas > -- > > Key: HUDI-4904 > URL: https://issues.apache.org/jira/browse/HUDI-4904 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > In proto we can have a schema that is recursive. We should limit the > "unraveling" of a schema to N levels and let the user specify that amount of > levels as a config. After hitting depth N in the recursion, we will create a > Record with a byte array and string. The remaining data for that branch of > the recursion will be written out as a proto byte array and we record the > descriptor string for context of what is in the byte array. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4904) Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider
[ https://issues.apache.org/jira/browse/HUDI-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-4904: Summary: Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider (was: Handle Recursive Proto Schemas) > Handle Recursive Proto Schemas in ProtoClassBasedSchemaProvider > --- > > Key: HUDI-4904 > URL: https://issues.apache.org/jira/browse/HUDI-4904 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > In proto we can have a schema that is recursive. We should limit the > "unraveling" of a schema to N levels and let the user specify that amount of > levels as a config. After hitting depth N in the recursion, we will create a > Record with a byte array and string. The remaining data for that branch of > the recursion will be written out as a proto byte array and we record the > descriptor string for context of what is in the byte array. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4904) Handle Recursive Proto Schemas
Timothy Brown created HUDI-4904: --- Summary: Handle Recursive Proto Schemas Key: HUDI-4904 URL: https://issues.apache.org/jira/browse/HUDI-4904 Project: Apache Hudi Issue Type: Improvement Reporter: Timothy Brown In proto, a schema can be recursive. We should limit the "unraveling" of a schema to N levels and let the user specify the number of levels via a config. After hitting depth N in the recursion, we will create a Record with a byte array and a string. The remaining data for that branch of the recursion will be written out as a proto byte array, and we will record the descriptor string for context on what is in the byte array. -- This message was sent by Atlassian Jira (v8.20.10#820010)
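The depth-limited "unraveling" described above can be sketched as follows. This is an illustrative model only, assuming hypothetical `Node` and `unravel` names (not Hudi or protobuf APIs): past depth N, a recursive branch collapses into a single bytes-plus-descriptor leaf.

```java
import java.util.ArrayList;
import java.util.List;

public class RecursiveSchemaSketch {
    // Minimal stand-in for a (possibly self-referencing) message descriptor.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Unravel the schema up to maxDepth levels. At or past that depth, the branch
    // is represented as one "<name>: bytes+descriptor" leaf, mirroring the idea of
    // storing remaining data as a proto byte array plus its descriptor string.
    static List<String> unravel(Node node, int depth, int maxDepth) {
        List<String> fields = new ArrayList<>();
        if (depth >= maxDepth) {
            fields.add(node.name + ": bytes+descriptor");
            return fields;
        }
        fields.add(node.name + ": record");
        for (Node child : node.children) {
            fields.addAll(unravel(child, depth + 1, maxDepth));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Build a self-recursive schema: tree -> tree -> tree -> ...
        Node tree = new Node("tree");
        tree.children.add(tree);
        // With maxDepth=2, the infinite recursion terminates at a bytes leaf.
        System.out.println(unravel(tree, 0, 2));
    }
}
```

The depth guard is what makes an otherwise infinite self-reference finite; the configurable N trades schema fidelity against schema size.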
[jira] [Updated] (HUDI-4796) Properly release MetricsReporter resources
[ https://issues.apache.org/jira/browse/HUDI-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-4796: Status: In Progress (was: Open) > Properly release MetricsReporter resources > -- > > Key: HUDI-4796 > URL: https://issues.apache.org/jira/browse/HUDI-4796 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > In > [Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65] > we are calling the close method on a class instead of the Reporter's `stop` > method. The `stop` method according to the Java docs "Should be used to stop > channels, streams and release resources." > For most reporters these two actions are equivalent but the > [JmxReportServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127] > has a more involved stop method that must be called. > > Relates to discussion > [here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4796) Properly release MetricsReporter resources
[ https://issues.apache.org/jira/browse/HUDI-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown updated HUDI-4796: Status: Patch Available (was: In Progress) > Properly release MetricsReporter resources > -- > > Key: HUDI-4796 > URL: https://issues.apache.org/jira/browse/HUDI-4796 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > In > [Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65] > we are calling the close method on a class instead of the Reporter's `stop` > method. The `stop` method according to the Java docs "Should be used to stop > channels, streams and release resources." > For most reporters these two actions are equivalent but the > [JmxReportServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127] > has a more involved stop method that must be called. > > Relates to discussion > [here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-4796) Properly release MetricsReporter resources
[ https://issues.apache.org/jira/browse/HUDI-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-4796: --- Assignee: Timothy Brown > Properly release MetricsReporter resources > -- > > Key: HUDI-4796 > URL: https://issues.apache.org/jira/browse/HUDI-4796 > Project: Apache Hudi > Issue Type: Bug >Reporter: Timothy Brown >Assignee: Timothy Brown >Priority: Major > > In > [Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65] > we are calling the close method on a class instead of the Reporter's `stop` > method. The `stop` method according to the Java docs "Should be used to stop > channels, streams and release resources." > For most reporters these two actions are equivalent but the > [JmxReportServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127] > has a more involved stop method that must be called. > > Relates to discussion > [here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-4796) Properly release MetricsReporter resources
Timothy Brown created HUDI-4796: --- Summary: Properly release MetricsReporter resources Key: HUDI-4796 URL: https://issues.apache.org/jira/browse/HUDI-4796 Project: Apache Hudi Issue Type: Bug Reporter: Timothy Brown In [Metrics.java|https://github.com/apache/hudi/blob/f5de4e434b33720d4846c6fe2450539a284ea14f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/Metrics.java#L63-L65] we are calling the close method on a class instead of the Reporter's `stop` method. The `stop` method, according to the Java docs, "Should be used to stop channels, streams and release resources." For most reporters these two actions are equivalent, but the [JmxReporterServer|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java#L127] has a more involved stop method that must be called. Relates to discussion [here|https://github.com/apache/hudi/issues/5249#issuecomment-1235020970] -- This message was sent by Atlassian Jira (v8.20.10#820010)
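The close-vs-stop distinction above can be sketched minimally. This is not the actual Hudi Metrics code; `JmxLikeReporter` is a hypothetical reporter whose `stop()` does strictly more than `close()`, like the JMX reporter server referenced in the issue.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ReporterShutdownSketch {
    // Hypothetical reporter: stop() tears down a server *and* closes resources,
    // while close() alone only releases local handles.
    static final class JmxLikeReporter implements AutoCloseable {
        final AtomicBoolean serverStopped = new AtomicBoolean(false);
        final AtomicBoolean closed = new AtomicBoolean(false);

        void stop() {
            serverStopped.set(true); // e.g. unregister MBeans, stop connector server
            close();
        }

        @Override
        public void close() { closed.set(true); }
    }

    public static void main(String[] args) {
        JmxLikeReporter viaClose = new JmxLikeReporter();
        viaClose.close(); // bug pattern: resources closed, but server never stopped
        System.out.println("close only -> serverStopped=" + viaClose.serverStopped.get());

        JmxLikeReporter viaStop = new JmxLikeReporter();
        viaStop.stop(); // fix: stop() releases everything
        System.out.println("stop -> serverStopped=" + viaStop.serverStopped.get());
    }
}
```

For reporters where `stop()` and `close()` coincide, nothing changes; for the JMX-style reporter, only the `stop()` path fully releases resources.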
[jira] [Created] (HUDI-4732) Leverage Schema Registry for reading proto messages from Kafka
Timothy Brown created HUDI-4732: --- Summary: Leverage Schema Registry for reading proto messages from Kafka Key: HUDI-4732 URL: https://issues.apache.org/jira/browse/HUDI-4732 Project: Apache Hudi Issue Type: New Feature Reporter: Timothy Brown If you use the Confluent Schema Registry, it provides a way to deserialize the Kafka message value without providing the protobuf class name. The first cut of ProtoKafkaSource requires users to specify a class name, but we want to give users the flexibility to use this other method of deserializing the message. Docs: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/serdes-protobuf.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
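A sketch of the consumer configuration this would imply, assuming Confluent's protobuf serde (property names below are standard Kafka/Confluent configuration keys, not Hudi configs):

```properties
bootstrap.servers=localhost:9092
schema.registry.url=http://localhost:8081
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=io.confluent.kafka.serializers.protobuf.KafkaProtobufDeserializer
# When no specific protobuf value type is configured, the deserializer returns a
# DynamicMessage, which is what allows reading without a user-provided class name.
```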
[jira] [Created] (HUDI-4727) Direct conversion from Proto Message to Row
Timothy Brown created HUDI-4727: --- Summary: Direct conversion from Proto Message to Row Key: HUDI-4727 URL: https://issues.apache.org/jira/browse/HUDI-4727 Project: Apache Hudi Issue Type: New Feature Reporter: Timothy Brown The initial implementation for the Proto source converts from Message to Avro to Row in the SourceFormatAdapter when the source needs to be read as a Dataset. Let's remove the intermediate Avro representation and convert directly from Message to Row. -- This message was sent by Atlassian Jira (v8.20.10#820010)
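The one-hop conversion proposed above can be illustrated with a toy model. None of these names are Hudi or Spark APIs: a `Map` stands in for a proto Message and `Object[]` for a Row, to show the direct path skipping any intermediate record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DirectRowSketch {
    // Direct path: read each field off the message once and place it into the
    // row slot, without materializing an intermediate Avro-style record first.
    static Object[] toRow(Map<String, Object> message, String[] fieldOrder) {
        Object[] row = new Object[fieldOrder.length];
        for (int i = 0; i < fieldOrder.length; i++) {
            row[i] = message.get(fieldOrder[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> message = new LinkedHashMap<>();
        message.put("id", 1L);
        message.put("name", "a");
        Object[] row = toRow(message, new String[] {"id", "name"});
        System.out.println(java.util.Arrays.toString(row)); // [1, a]
    }
}
```

The point of the issue is exactly this shape: one field-by-field pass from source message to target row, removing the Message-to-Avro hop and its extra allocation and schema translation.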
[jira] [Assigned] (HUDI-4441) Disable INFO level logs from tests
[ https://issues.apache.org/jira/browse/HUDI-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Brown reassigned HUDI-4441: --- Assignee: Timothy Brown > Disable INFO level logs from tests > -- > > Key: HUDI-4441 > URL: https://issues.apache.org/jira/browse/HUDI-4441 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Timothy Brown >Priority: Major > Labels: pull-request-available > > Since the log4j1-2 bridge upgrade, we have noticed that CI runs are logging > INFO level logs despite the minimum level being set to WARN in all > log4j-sure.properties. To reproduce the issue, just run any test locally and > you should see INFO level logs. This creates unnecessary noise and makes > failures painful to debug. We need to fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
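After the log4j1-to-2 bridge upgrade, the test logging threshold generally has to be declared in log4j2's own configuration format for it to take effect. A sketch of such a test-scoped properties file, using generic log4j2 configuration keys (the exact filename and layout in Hudi's surefire setup may differ):

```properties
status = warn
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = CONSOLE
appender.console.type = Console
appender.console.name = CONSOLE
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{HH:mm:ss} %-5level %c{1} - %msg%n
```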