[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219064378

## CI report:

* 72188dc38211a6a540256f168412d03e1cb86765 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10797)
* 76094582239330262bac8c06a78b59d8abf2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10801)

## Bot commands

@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219061939

## CI report:

* 72188dc38211a6a540256f168412d03e1cb86765 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10797)
* 76094582239330262bac8c06a78b59d8abf2 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
hudi-bot commented on PR #6386: URL: https://github.com/apache/hudi/pull/6386#issuecomment-1219059120

## CI report:

* 79ae2b395759dcc104cb5c95834f8494026ff04c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10788)
* 513b7d8c4d6209f62bd0075de4cf1a07113b0fb9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10800)
[GitHub] [hudi] YannByron closed pull request #6264: [HUDI-4503] support for parsing identifier with catalog
YannByron closed pull request #6264: [HUDI-4503] support for parsing identifier with catalog URL: https://github.com/apache/hudi/pull/6264
[GitHub] [hudi] hudi-bot commented on pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
hudi-bot commented on PR #6386: URL: https://github.com/apache/hudi/pull/6386#issuecomment-1219056857

## CI report:

* 79ae2b395759dcc104cb5c95834f8494026ff04c Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10788)
* 513b7d8c4d6209f62bd0075de4cf1a07113b0fb9 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6429: [HUDI-4636] Output preCombine fields of delete records when changelog disabled
hudi-bot commented on PR #6429: URL: https://github.com/apache/hudi/pull/6429#issuecomment-1219054346

## CI report:

* 126dd3e7994d7c8f270930f768eaa57188f2bdf3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10798)
[jira] [Updated] (HUDI-4586) Address S3 timeouts in Bloom Index with metadata table
[ https://issues.apache.org/jira/browse/HUDI-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-4586:
---------------------------------
    Labels: pull-request-available  (was: )

> Address S3 timeouts in Bloom Index with metadata table
> ------------------------------------------------------
>
>                 Key: HUDI-4586
>                 URL: https://issues.apache.org/jira/browse/HUDI-4586
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: Screen Shot 2022-08-15 at 17.39.01.png
>
>
> For partitioned tables, a significant number of S3 requests time out, causing upserts to fail when using Bloom Index with the metadata table.
> {code:java}
> Load meta index key ranges for file slices: hudi
> collect at HoodieSparkEngineContext.java:137
> org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
> org.apache.hudi.client.common.HoodieSparkEngineContext.flatMap(HoodieSparkEngineContext.java:137)
> org.apache.hudi.index.bloom.HoodieBloomIndex.loadColumnRangesFromMetaIndex(HoodieBloomIndex.java:213)
> org.apache.hudi.index.bloom.HoodieBloomIndex.getBloomIndexFileInfoForPartitions(HoodieBloomIndex.java:145)
> org.apache.hudi.index.bloom.HoodieBloomIndex.lookupIndex(HoodieBloomIndex.java:123)
> org.apache.hudi.index.bloom.HoodieBloomIndex.tagLocation(HoodieBloomIndex.java:89)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:49)
> org.apache.hudi.table.action.commit.HoodieWriteHelper.tag(HoodieWriteHelper.java:32)
> org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:53)
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
> org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:329)
> org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
> {code}
> {code:java}
> org.apache.hudi.exception.HoodieException: Exception when reading log file
>     at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:352)
>     at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:196)
>     at org.apache.hudi.metadata.HoodieMetadataMergedLogRecordReader.getRecordsByKeys(HoodieMetadataMergedLogRecordReader.java:124)
>     at org.apache.hudi.metadata.HoodieBackedTableMetadata.readLogRecords(HoodieBackedTableMetadata.java:266)
>     at org.apache.hudi.metadata.HoodieBackedTableMetadata.lambda$getRecordsByKeys$1(HoodieBackedTableMetadata.java:222)
>     at java.util.HashMap.forEach(HashMap.java:1290)
>     at org.apache.hudi.metadata.HoodieBackedTableMetadata.getRecordsByKeys(HoodieBackedTableMetadata.java:209)
>     at org.apache.hudi.metadata.BaseTableMetadata.getColumnStats(BaseTableMetadata.java:253)
>     at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadColumnRangesFromMetaIndex$cc8e7ca2$1(HoodieBloomIndex.java:224)
>     at org.apache.hudi.client.common.HoodieSparkEngineContext.lambda$flatMap$7d470b86$1(HoodieSparkEngineContext.java:137)
>     at org.apache.spark.api.java.JavaRDDLike.$anonfun$flatMap$1(JavaRDDLike.scala:125)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
>     at scala.collection.Iterator.foreach(Iterator.scala:943)
>     at scala.collection.Iterator.foreach$(Iterator.scala:943)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>     at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
>     at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
>     at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)
>     at
[GitHub] [hudi] yihua opened a new pull request, #6432: [WIP][HUDI-4586] Improve metadata fetching in bloom index
yihua opened a new pull request, #6432: URL: https://github.com/apache/hudi/pull/6432

### Change Logs

_Describe context and summary for this change. Highlight if any code was copied._

### Impact

_Describe any public API or user-facing feature change or any performance impact._

**Risk level: none | low | medium | high**

_Choose one. If medium or high, explain what verification was done to mitigate the risks._

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
[GitHub] [hudi] alexeykudinkin commented on pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
alexeykudinkin commented on PR #6386: URL: https://github.com/apache/hudi/pull/6386#issuecomment-1219045573

CI test failures are unrelated (Flink ITs failing).

https://user-images.githubusercontent.com/428277/185299067-e45900fd-3bf7-40ad-9c86-f0b6dd221c11.png
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
alexeykudinkin commented on code in PR #6386: URL: https://github.com/apache/hudi/pull/6386#discussion_r948653875

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/PulsarSource.java:

## @@ -0,0 +1,292 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.HoodieConversionUtils;
+import org.apache.hudi.common.config.ConfigProperty;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.util.Lazy;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.pulsar.client.api.Consumer;
+import org.apache.pulsar.client.api.MessageId;
+import org.apache.pulsar.client.api.PulsarClient;
+import org.apache.pulsar.client.api.PulsarClientException;
+import org.apache.pulsar.client.api.SubscriptionInitialPosition;
+import org.apache.pulsar.client.api.SubscriptionType;
+import org.apache.pulsar.client.impl.PulsarClientImpl;
+import org.apache.pulsar.common.naming.TopicName;
+import org.apache.pulsar.shade.io.netty.channel.EventLoopGroup;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.pulsar.JsonUtils;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+
+import static org.apache.hudi.common.util.ThreadUtils.collectActiveThreads;
+
+/**
+ * Source fetching data from Pulsar topics
+ */
+public class PulsarSource extends RowSource implements Closeable {
+
+  private static final Logger LOG = LogManager.getLogger(PulsarSource.class);
+
+  private static final String HUDI_PULSAR_CONSUMER_ID_FORMAT = "hudi-pulsar-consumer-%d";
+  private static final String[] PULSAR_META_FIELDS = new String[]{
+      "__key",
+      "__topic",
+      "__messageId",
+      "__publishTime",
+      "__eventTime",
+      "__messageProperties"
+  };
+
+  private final String topicName;
+
+  private final String serviceEndpointURL;
+  private final String adminEndpointURL;
+
+  // NOTE: We're keeping the client so that we can shut it down properly
+  private final Lazy<PulsarClient> pulsarClient;
+  private final Lazy<Consumer<byte[]>> pulsarConsumer;
+
+  public PulsarSource(TypedProperties props,
+                      JavaSparkContext sparkContext,
+                      SparkSession sparkSession,
+                      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+
+    DataSourceUtils.checkRequiredProperties(props,
+        Arrays.asList(
+            Config.PULSAR_SOURCE_TOPIC_NAME.key(),
+            Config.PULSAR_SOURCE_SERVICE_ENDPOINT_URL.key()));
+
+    // Converting to a descriptor allows us to canonicalize the topic's name properly
+    this.topicName = TopicName.get(props.getString(Config.PULSAR_SOURCE_TOPIC_NAME.key())).toString();
+
+    // TODO validate endpoints provided in the appropriate format
+    this.serviceEndpointURL = props.getString(Config.PULSAR_SOURCE_SERVICE_ENDPOINT_URL.key());
+    this.adminEndpointURL = props.getString(Config.PULSAR_SOURCE_ADMIN_ENDPOINT_URL.key());
+
+    this.pulsarClient = Lazy.lazily(this::initPulsarClient);
+    this.pulsarConsumer = Lazy.lazily(this::subscribeToTopic);
+  }
+
+  @Override
+  protected Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCheckpointStr, long sourceLimit) {
+    Pair<MessageId, MessageId> startingEndingOffsetsPair = computeOffsets(lastCheckpointStr, sourceLimit);
+
+    MessageId startingOffset = startingEndingOffsetsPair.getLeft();
+    MessageId endingOffset = startingEndingOffsetsPair.getRight();
+
+    String startingOffsetStr = convertToOffsetString(topicName, startingOffset);
+    String endingOffsetStr = convertToOffsetString(topicName, endingOffset);
+
+    Dataset<Row> sourceRows = sparkSession.read()
+        .format("pulsar")
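The `fetchNextBatch` method quoted above resolves an optional checkpoint string into a (starting, ending) offset pair before reading, and returns the new checkpoint alongside the batch. A minimal, self-contained sketch of that checkpoint-to-offset-range pattern, deliberately free of the Hudi/Pulsar types (plain `long` offsets instead of `MessageId`; all names below are illustrative, not Hudi's API):

```java
import java.util.Optional;

// Simplified stand-in for the checkpoint handling in a pull-based source:
// resolve the starting offset from an optional checkpoint string, cap the
// read at a source limit, and emit the next checkpoint from the end offset.
class OffsetRange {
  final long start; // inclusive
  final long end;   // exclusive

  private OffsetRange(long start, long end) {
    this.start = start;
    this.end = end;
  }

  // No checkpoint means "start from the beginning" in this sketch.
  static OffsetRange compute(Optional<String> lastCheckpoint, long available, long sourceLimit) {
    long start = lastCheckpoint.map(Long::parseLong).orElse(0L);
    long end = Math.min(available, start + sourceLimit);
    return new OffsetRange(start, end);
  }

  // The string persisted as the next run's checkpoint.
  String checkpoint() {
    return Long.toString(end);
  }
}
```

In the real source, the same round-trip happens through `convertToOffsetString` over Pulsar `MessageId`s rather than numeric offsets.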
[jira] [Assigned] (HUDI-4549) hive sync bundle causes class loader issue
[ https://issues.apache.org/jira/browse/HUDI-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit reassigned HUDI-4549:
---------------------------------
    Assignee: Sagar Sumit

> hive sync bundle causes class loader issue
> ------------------------------------------
>
>                 Key: HUDI-4549
>                 URL: https://issues.apache.org/jira/browse/HUDI-4549
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: dependencies
>            Reporter: Raymond Xu
>            Assignee: Sagar Sumit
>            Priority: Blocker
>             Fix For: 0.12.1
>
>
> A weird classpath issue I found: when testing deltastreamer using hudi-utilities-slim-bundle, if I put `--jars hudi-hive-sync-bundle.jar,hudi-spark-bundle.jar`, then I get this error when writing:
> {code:java}
> Caused by: java.lang.NoSuchMethodError: org.apache.hudi.avro.MercifulJsonConverter.convert(Ljava/lang/String;Lorg/apache/avro/Schema;)Lorg/apache/avro/generic/GenericRecord;
>     at org.apache.hudi.utilities.sources.helpers.AvroConvertor.fromJson(AvroConvertor.java:86)
>     at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
> {code}
> If I put the spark bundle before the hive sync bundle, there is no issue. Without hive-sync-bundle, also no issue. So hive-sync-bundle somehow messes up the classpath? Not sure why it reports a hudi-common API not found; perhaps caused by shading avro?
> The same behavior I observed with aws-bundle, which makes sense, as it's a superset of hive-sync-bundle.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
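The workaround described in the report above is purely about jar order on the classpath. As a hedged illustration (a config fragment, not a verified command: the jar paths and the final argument list are hypothetical placeholders), the working ordering might look like:

```shell
# Hypothetical jar paths; adjust to your build output.
# Listing hudi-spark-bundle BEFORE hudi-hive-sync-bundle lets its classes
# win on the classpath, which is the ordering the reporter found to work.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /opt/hudi/hudi-spark-bundle.jar,/opt/hudi/hudi-hive-sync-bundle.jar \
  /opt/hudi/hudi-utilities-slim-bundle.jar \
  --table-type COPY_ON_WRITE
```

Per the report, reversing the two `--jars` entries reproduces the `NoSuchMethodError`.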
[jira] [Assigned] (HUDI-4528) Diff tool to compare metadata across snapshots in a given time range
[ https://issues.apache.org/jira/browse/HUDI-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit reassigned HUDI-4528:
---------------------------------
    Assignee: Sagar Sumit

> Diff tool to compare metadata across snapshots in a given time range
> --------------------------------------------------------------------
>
>                 Key: HUDI-4528
>                 URL: https://issues.apache.org/jira/browse/HUDI-4528
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Major
>
>
> A tool that diffs two snapshots at table and partition level and can give info about what new file ids got created, deleted, updated and track other changes that are captured in write stats.
[jira] [Updated] (HUDI-4528) Diff tool to compare metadata across snapshots in a given time range
[ https://issues.apache.org/jira/browse/HUDI-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-4528:
------------------------------
    Sprint: 2022/08/22
[jira] [Updated] (HUDI-4626) Partitioning table by `_hoodie_partition_path` fails
[ https://issues.apache.org/jira/browse/HUDI-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit updated HUDI-4626:
------------------------------
    Sprint: 2022/08/22

> Partitioning table by `_hoodie_partition_path` fails
> ----------------------------------------------------
>
>                 Key: HUDI-4626
>                 URL: https://issues.apache.org/jira/browse/HUDI-4626
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>            Reporter: Alexey Kudinkin
>            Priority: Blocker
>
>
> Currently, creating a table partitioned by "_hoodie_partition_path" using Glue catalog fails with the following exception:
> {code:java}
> AnalysisException: Found duplicate column(s) in the data schema and the partition schema: _hoodie_partition_path
> {code}
> Using the following DDL:
> {code:java}
> CREATE EXTERNAL TABLE `active_storage_attachments`(
>   `_hoodie_commit_time` string COMMENT '',
>   `_hoodie_commit_seqno` string COMMENT '',
>   `_hoodie_record_key` string COMMENT '',
>   `_hoodie_file_name` string COMMENT '',
>   `_change_operation_type` string COMMENT '',
>   `_upstream_event_processed_ts_ms` bigint COMMENT '',
>   `db_shard_source_partition` string COMMENT '',
>   `_event_origin_ts_ms` bigint COMMENT '',
>   `_event_tx_id` bigint COMMENT '',
>   `_event_lsn` bigint COMMENT '',
>   `_event_xmin` bigint COMMENT '',
>   `id` bigint COMMENT '',
>   `name` string COMMENT '',
>   `record_type` string COMMENT '',
>   `record_id` bigint COMMENT '',
>   `blob_id` bigint COMMENT '',
>   `created_at` timestamp COMMENT '')
> PARTITIONED BY (`_hoodie_partition_path` string COMMENT '')
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'hoodie.query.as.ro.table'='false',
>   'path'='...')
> STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '...'
> TBLPROPERTIES (
>   'spark.sql.sources.provider'='hudi')
> {code}
[GitHub] [hudi] hudi-bot commented on pull request #6429: [HUDI-4636] Output preCombine fields of delete records when changelog disabled
hudi-bot commented on PR #6429: URL: https://github.com/apache/hudi/pull/6429#issuecomment-1219027206

## CI report:

* 126dd3e7994d7c8f270930f768eaa57188f2bdf3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10798)
[GitHub] [hudi] hudi-bot commented on pull request #6431: Shutdown CloudWatch reporter when query completes
hudi-bot commented on PR #6431: URL: https://github.com/apache/hudi/pull/6431#issuecomment-1219027215

## CI report:

* 7d755ad923370aa4ac6b46e75725883ee1434d23 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10799)
[GitHub] [hudi] hudi-bot commented on pull request #6431: Shutdown CloudWatch reporter when query completes
hudi-bot commented on PR #6431: URL: https://github.com/apache/hudi/pull/6431#issuecomment-1219025349

## CI report:

* 7d755ad923370aa4ac6b46e75725883ee1434d23 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6429: [HUDI-4636] Output preCombine fields of delete records when changelog disabled
hudi-bot commented on PR #6429: URL: https://github.com/apache/hudi/pull/6429#issuecomment-1219025328

## CI report:

* 126dd3e7994d7c8f270930f768eaa57188f2bdf3 UNKNOWN
[GitHub] [hudi] codope commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0
codope commented on code in PR #6417: URL: https://github.com/apache/hudi/pull/6417#discussion_r948635820

## website/releases/release-0.12.0.md:

## @@ -0,0 +1,143 @@
+---
+title: "Release 0.12.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2022-08-17T10:30:00+05:30
+---
+# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide))
+
+## Release Highlights
+
+### Presto-Hudi Connector
+
+Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table.
+It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector,
+please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+### Archival Beyond Savepoint
+
+Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write

Review Comment:
    Sets the context very nicely. Sounds much better. Thanks.
[GitHub] [hudi] codope commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0
codope commented on code in PR #6417: URL: https://github.com/apache/hudi/pull/6417#discussion_r948635421

## website/releases/release-0.12.0.md:

## @@ -0,0 +1,143 @@
+---
+title: "Release 0.12.0"
+sidebar_position: 2
+layout: releases
+toc: true
+last_modified_at: 2022-08-17T10:30:00+05:30
+---
+# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide))
+
+## Release Highlights
+
+### Presto-Hudi Connector
+
+Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table.
+It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector,
+please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html).
+
+### Archival Beyond Savepoint
+
+Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write
+configuration. This unlocks new opportunities for Hudi users. For example, one can retain commits for years, by adding
+one savepoint per day for older commits (say > 30 days old). And they can query hudi using `as.of.instant` semantics for
+old data. In previous versions, one would have to retain every commit and let archival stop at the first commit.
+
+:::note
+However, if this feature is enabled, restore cannot be supported. This limitation would be relaxed in a future release
+and the development of this feature can be tracked in [HUDI-4500](https://issues.apache.org/jira/browse/HUDI-4500).
+:::
+
+### Deltastreamer Termination Strategy
+
+Users can now configure a post-write termination strategy with deltastreamer `continuous` mode if need be. For instance,
+users can configure graceful shutdown if there is no new data from source for 5 consecutive times. Here is the interface
+for the termination strategy.
+```java
+/**
+ * Post write termination strategy for deltastreamer in continuous mode.
+ */
+public interface PostWriteTerminationStrategy {
+
+  /**
+   * Returns whether deltastreamer needs to be shutdown.
+   * @param scheduledCompactionInstantAndWriteStatuses optional pair of scheduled compaction instant and write statuses.
+   * @return true if deltastreamer has to be shutdown. false otherwise.
+   */
+  boolean shouldShutdown(Option<Pair<Option<String>, JavaRDD<WriteStatus>>> scheduledCompactionInstantAndWriteStatuses);
+
+}
+```
+
+Also, this might help in bootstrapping a new table. Instead of doing one bulk load or bulk_insert leveraging a large
+cluster for a large input of data, one could start deltastreamer on continuous mode and add a shutdown strategy to
+terminate, once all data has been bootstrapped. This way, each batch could be smaller and may not need a large cluster
+to bootstrap data. We have one concrete implementation out of the box, [NoNewDataTerminationStrategy](https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/NoNewDataTerminationStrategy.java).
+Users can feel free to implement their own strategy as they see fit.
+
+### Spark 3.3 Support
+
+Spark 3.3 support is added; users who are on Spark 3.3 can use `hudi-spark3.3-bundle` or `hudi-spark3-bundle`. Spark 3.2,
+Spark 3.1 and Spark 2.4 will continue to be supported. Please check the migration guide for [bundle updates](#bundle-updates).
+
+### Spark SQL Support Improvements
+
+- Support for upgrade, downgrade, bootstrap, clean, rollback and repair through `Call Procedure` command.
+- Support for `analyze table`.
+- Support for `Create/Drop/Show/Refresh Index` syntax through Spark SQL.
+
+### Flink 1.15 Support
+
+Flink 1.15.x is integrated with Hudi, use profile param `-Pflink1.15` when compiling the codes to adapt the version.
+Alternatively, use `hudi-flink1.15-bundle`. Flink 1.14 and Flink 1.13 will continue to be supported. Please check the
+migration guide for [bundle updates](#bundle-updates).
+
+### Flink Integration Improvements
+
+- **Data skipping** is supported for batch mode read, set up SQL option `metadata.enabled`, `hoodie.metadata.index.column.stats.enable` and `read.data.skipping.enabled` as true to enable it.
+- A **HMS-based Flink catalog** is added with catalog identifier as `hudi`. You can instantiate the catalog through API directly or use the `CREATE CATALOG` syntax to create it. Specifies catalog option `'mode' = 'hms'` to switch to the HMS catalog. By default, the catalog is in `dfs` mode.
+- **Async clustering** is supported for Flink `INSERT` operation, set up SQL option `clustering.schedule.enabled` and `clustering.async.enabled` as true to enable it. When enabling this feature, a clustering sub-pipeline is scheduled asynchronously continuously to merge the small files continuously into larger ones.
+
+### Performance Improvements
+
+This version brings more improvements to make Hudi the most performant lake storage format. Some notable improvements are:
+- Closed the
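The `PostWriteTerminationStrategy` interface quoted in the release notes above leaves the shutdown policy entirely to implementations. A minimal, Hudi-free sketch of the "no new data for N consecutive rounds" idea those notes describe (the class and method shapes below are illustrative assumptions, not Hudi's actual `NoNewDataTerminationStrategy`):

```java
// Simplified "shut down after N consecutive empty rounds" policy. It mirrors
// the counting logic a termination strategy could use, but takes a plain
// boolean per sync round instead of Hudi's Option/Pair/JavaRDD types.
class NoNewDataShutdownPolicy {
  private final int maxEmptyRounds;
  private int consecutiveEmptyRounds = 0;

  NoNewDataShutdownPolicy(int maxEmptyRounds) {
    this.maxEmptyRounds = maxEmptyRounds;
  }

  // Call once per sync round; any round with new data resets the counter.
  boolean shouldShutdown(boolean hasNewData) {
    consecutiveEmptyRounds = hasNewData ? 0 : consecutiveEmptyRounds + 1;
    return consecutiveEmptyRounds >= maxEmptyRounds;
  }
}
```

With `maxEmptyRounds = 5` this matches the "no new data from source for 5 consecutive times" example in the release notes.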
[GitHub] [hudi] codope commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0
codope commented on code in PR #6417: URL: https://github.com/apache/hudi/pull/6417#discussion_r948635344 ## website/releases/release-0.12.0.md: ## @@ -0,0 +1,143 @@ +--- +title: "Release 0.12.0" +sidebar_position: 2 +layout: releases +toc: true +last_modified_at: 2022-08-17T10:30:00+05:30 +--- +# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide)) + +## Release Highlights + +### Presto-Hudi Connector + +Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table. +It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector, +please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html). + +### Archival Beyond Savepoint + +Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write +configuration. This unlocks new opportunities for Hudi users. For example, one can retain commits for years, by adding +one savepoint per day for older commits (say > 30 days old). And they can query hudi using `as.of.instant` semantics for +old data. In previous versions, one would have to retain every commit and let archival stop at the first commit. + +:::note +However, if this feature is enabled, restore cannot be supported. This limitation would be relaxed in a future release +and the development of this feature can be tracked in [HUDI-4500](https://issues.apache.org/jira/browse/HUDI-4500). +::: + +### Deltastreamer Termination Strategy + +Users can now configure a post-write termination strategy with deltastreamer `continuous` mode if need be. For instance, +users can configure graceful shutdown if there is no new data from source for 5 consecutive times. Here is the interface +for the termination strategy. +```java +/** + * Post write termination strategy for deltastreamer in continuous mode. 
 */
public interface PostWriteTerminationStrategy {

  /**
   * Returns whether deltastreamer needs to be shutdown.
   * @param scheduledCompactionInstantAndWriteStatuses optional pair of scheduled compaction instant and write statuses.
   * @return true if deltastreamer has to be shutdown, false otherwise.
   */
  boolean shouldShutdown(Option<Pair<Option<String>, JavaRDD<WriteStatus>>> scheduledCompactionInstantAndWriteStatuses);

}
```

Also, this might help in bootstrapping a new table. Instead of doing one bulk load or bulk_insert leveraging a large cluster for a large input of data, one could start deltastreamer in continuous mode and add a shutdown strategy to terminate once all data has been bootstrapped. This way, each batch could be smaller and may not need a large cluster to bootstrap the data. One concrete implementation is available out of the box: [NoNewDataTerminationStrategy](https://github.com/apache/hudi/blob/0d0a4152cfd362185066519ae926ac4513c7a152/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/NoNewDataTerminationStrategy.java). Users are free to implement their own strategy as they see fit.

### Spark 3.3 Support

Spark 3.3 support is added; users who are on Spark 3.3 can use `hudi-spark3.3-bundle` or `hudi-spark3-bundle`. Spark 3.2, Spark 3.1 and Spark 2.4 will continue to be supported. Please check the migration guide for [bundle updates](#bundle-updates).

### Spark SQL Support Improvements

- Support for upgrade, downgrade, bootstrap, clean, rollback and repair through the `Call Procedure` command.
- Support for `analyze table`.
- Support for `Create/Drop/Show/Refresh Index` syntax through Spark SQL.

### Flink 1.15 Support

Flink 1.15.x is integrated with Hudi; use the profile param `-Pflink1.15` when compiling the code to adapt to the version. Alternatively, use `hudi-flink1.15-bundle`. Flink 1.14 and Flink 1.13 will continue to be supported. Please check the migration guide for [bundle updates](#bundle-updates).
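As an illustration of the `PostWriteTerminationStrategy` idea quoted above, here is a minimal, self-contained sketch of a shutdown policy that triggers after a fixed number of consecutive rounds without new data. It deliberately uses plain JDK types (`java.util.Optional` and a record count) instead of Hudi's `Option`/`JavaRDD` signature, so the class name and parameters here are illustrative, not Hudi's actual API.

```java
import java.util.Optional;

// Illustrative stand-in for a termination strategy: shut down once
// maxEmptyRounds consecutive write rounds produced no records.
class NoNewDataShutdownSketch {
    private final int maxEmptyRounds;
    private int emptyRounds = 0;

    NoNewDataShutdownSketch(int maxEmptyRounds) {
        this.maxEmptyRounds = maxEmptyRounds;
    }

    // writtenRecords is empty when the round wrote nothing at all.
    boolean shouldShutdown(Optional<Long> writtenRecords) {
        if (writtenRecords.isPresent() && writtenRecords.get() > 0) {
            emptyRounds = 0;   // fresh data arrived; reset the streak
        } else {
            emptyRounds++;     // another round with no new data
        }
        return emptyRounds >= maxEmptyRounds;
    }
}
```

With `maxEmptyRounds = 5`, this mirrors the "no new data from the source for five consecutive rounds" example from the release note.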
### Flink Integration Improvements

- **Data skipping** is supported for batch-mode reads; set the SQL options `metadata.enabled`, `hoodie.metadata.index.column.stats.enable` and `read.data.skipping.enabled` to true to enable it.
- A **HMS-based Flink catalog** is added with the catalog identifier `hudi`. You can instantiate the catalog through the API directly or use the `CREATE CATALOG` syntax to create it. Specify the catalog option `'mode' = 'hms'` to switch to the HMS catalog; by default, the catalog is in `dfs` mode.
- **Async clustering** is supported for the Flink `INSERT` operation; set the SQL options `clustering.schedule.enabled` and `clustering.async.enabled` to true to enable it. When this feature is enabled, a clustering sub-pipeline is scheduled asynchronously to continuously merge small files into larger ones.

### Performance Improvements

This version brings more improvements to make Hudi the most performant lake storage format. Some notable improvements are:
- Closed the
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219023427 ## CI report: * 72188dc38211a6a540256f168412d03e1cb86765 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10797) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] junyuc25 opened a new pull request, #6431: Shutdown CloudWatch reporter when query completes
junyuc25 opened a new pull request, #6431: URL: https://github.com/apache/hudi/pull/6431 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] QuChunhe opened a new issue, #6430: [SUPPORT] Flink SQL can't read complex-type data written by the Java client
QuChunhe opened a new issue, #6430: URL: https://github.com/apache/hudi/issues/6430

1. The Java client writes complex-type data, such as nested `ARRAY` types:

```java
private String baseFileFormat = "parquet";

Path path = new Path(tablePath);
FileSystem fs = FSUtils.getFs(tablePath, hadoopConf);
String schema = Util.getStringFromResource("/" + databasePrefix + "." + tableName + ".schema");
if (!fs.exists(path)) {
  HoodieTableMetaClient.withPropertyBuilder()
      .setBaseFileFormat(baseFileFormat)
      .setPartitionFields(partitionFields)
      .setHiveStylePartitioningEnable(false)
      .setTableCreateSchema(schema)
      .setTableType(HoodieTableType.MERGE_ON_READ)
      .setRecordKeyFields(recordKeyFields)
      .setPayloadClassName(DefaultHoodieRecordPayload.class.getName())
      .setTableName(tableName)
      .initTable(hadoopConf, tablePath);
}

// Create the write client to write some records in
HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
    .withPath(tablePath)
    .withAutoCommit(true)
    .withEmbeddedTimelineServerEnabled(false)
    .withRollbackUsingMarkers(false)
    .withBulkInsertParallelism(parallelism)
    .withSchema(schema)
    //.withSchemaEvolutionEnable(true)
    .withParallelism(parallelism, parallelism)
    .withDeleteParallelism(1)
    //.withEngineType(EngineType.SPARK)
    .forTable(tableName)
    .withMetadataConfig(
        HoodieMetadataConfig.newBuilder()
            .enable(true)
            .build())
    .withConsistencyGuardConfig(
        ConsistencyGuardConfig.newBuilder()
            .withEnableOptimisticConsistencyGuard(false)
            .build())
    .withIndexConfig(
        HoodieIndexConfig.newBuilder()
            .withIndexType(IndexType.BLOOM)
            .build())
    .withCompactionConfig(
        HoodieCompactionConfig.newBuilder()
            .withAutoArchive(true)
            .withAutoClean(true)
            .withCompactionLazyBlockReadEnabled(true)
            .withAsyncClean(true)
            .build())
    .build();
```

2. Flink SQL cannot read the data and throws the following errors:

```
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
  at java.util.ArrayList.rangeCheck(ArrayList.java:659) ~[?:1.8.0_332]
  at java.util.ArrayList.get(ArrayList.java:435) ~[?:1.8.0_332]
  at org.apache.parquet.schema.GroupType.getType(GroupType.java:216) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:514) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:480) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.createWritableColumnVector(ParquetSplitReaderUtil.java:406) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.createWritableVectors(ParquetColumnarRowSplitReader.java:216) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.vector.reader.ParquetColumnarRowSplitReader.<init>(ParquetColumnarRowSplitReader.java:156) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.cow.ParquetSplitReaderUtil.genPartColumnarRowReader(ParquetSplitReaderUtil.java:147) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.getReader(MergeOnReadInputFormat.java:306) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.open(MergeOnReadInputFormat.java:177) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.hudi.table.format.mor.MergeOnReadInputFormat.open(MergeOnReadInputFormat.java:81) ~[hudi-flink1.13-bundle-0.12.0.jar:0.12.0]
  at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:84) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
  at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:104) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
  at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:60) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
  at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:269) ~[flink-dist_2.12-1.13.6.jar:1.13.6]
```

2022-08-18 11:43:18,280 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Clearing
[jira] [Updated] (HUDI-4636) Output preCombine fields of delete records when changelog disabled
[ https://issues.apache.org/jira/browse/HUDI-4636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4636: - Labels: pull-request-available (was: ) > Output preCombine fields of delete records when changelog disabled > > > Key: HUDI-4636 > URL: https://issues.apache.org/jira/browse/HUDI-4636 > Project: Apache Hudi > Issue Type: Improvement >Reporter: yonghua jian >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] flashJd opened a new pull request, #6429: [HUDI-4636] Output preCombine fields of delete records when changelog disabled
flashJd opened a new pull request, #6429: URL: https://github.com/apache/hudi/pull/6429 ### Change Logs When changelog is disabled, Flink will output the pk columns only: https://github.com/apache/hudi/blob/master/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/mor/MergeOnReadInputFormat.java We also need the preCombine and partition fields, hence this pull request. ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
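As a self-contained illustration of the projection discussed in this PR (class and field names here are invented, not Hudi's actual API), the difference is which fields survive on an emitted delete record: if only the pk columns are kept, a downstream consumer cannot order the delete against earlier upserts by the preCombine field.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: project a full row down to the fields retained on a
// DELETE record. Keeping only the pk loses the preCombine/partition fields.
class DeleteRecordProjection {
    static Map<String, Object> project(Map<String, Object> row, List<String> keptFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : keptFields) {
            if (row.containsKey(field)) {
                out.put(field, row.get(field));
            }
        }
        return out;
    }
}
```

Before the change the kept-field list would be just the pk (e.g. `List.of("id")`); the PR's intent is to also keep the preCombine and partition fields (e.g. `List.of("id", "ts", "dt")`).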
[jira] [Created] (HUDI-4636) Output preCombine fields of delete records when changelog disabled
yonghua jian created HUDI-4636: -- Summary: Output preCombine fields of delete records when changelog disabled Key: HUDI-4636 URL: https://issues.apache.org/jira/browse/HUDI-4636 Project: Apache Hudi Issue Type: Improvement Reporter: yonghua jian -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] danny0405 commented on issue #6422: [SUPPORT]: hudi build failing for hudi-flink-client when no maven build option is provided
danny0405 commented on issue #6422: URL: https://github.com/apache/hudi/issues/6422#issuecomment-1219012429 What version did you use? It may be caused by the forced activation of the profile, which I recently removed from the master code: https://github.com/apache/hudi/pull/6415 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219002712 ## CI report: * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10796) * 72188dc38211a6a540256f168412d03e1cb86765 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10797) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1219000957 ## CI report: * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10796) * 72188dc38211a6a540256f168412d03e1cb86765 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218994371 ## CI report: * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10796) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on a diff in pull request #6385: [HUDI-4614] fix primary key extract of delete_record when complexKeyGen configured and ChangeLogDisabled
flashJd commented on code in PR #6385: URL: https://github.com/apache/hudi/pull/6385#discussion_r948606837

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java: @@ -73,21 +73,20 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
    */
   public static String[] extractRecordKeys(String recordKey) {
     String[] fieldKV = recordKey.split(",");
-    if (fieldKV.length == 1) {
-      return fieldKV;
-    } else {
-      // a complex key
-      return Arrays.stream(fieldKV).map(kv -> {
-        final String[] kvArray = kv.split(":");
-        if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-          return null;
-        } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-          return "";
-        } else {
-          return kvArray[1];
-        }
-      }).toArray(String[]::new);
-    }
+
+    return Arrays.stream(fieldKV).map(kv -> {
+      final String[] kvArray = kv.split(":");

Review Comment: @danny0405

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on a diff in pull request #6385: [HUDI-4614] fix primary key extract of delete_record when complexKeyGen configured and ChangeLogDisabled
flashJd commented on code in PR #6385: URL: https://github.com/apache/hudi/pull/6385#discussion_r948604910

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java: @@ -73,21 +73,20 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
    */
   public static String[] extractRecordKeys(String recordKey) {
     String[] fieldKV = recordKey.split(",");
-    if (fieldKV.length == 1) {
-      return fieldKV;
-    } else {
-      // a complex key
-      return Arrays.stream(fieldKV).map(kv -> {
-        final String[] kvArray = kv.split(":");
-        if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-          return null;
-        } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-          return "";
-        } else {
-          return kvArray[1];
-        }
-      }).toArray(String[]::new);
-    }
+
+    return Arrays.stream(fieldKV).map(kv -> {
+      final String[] kvArray = kv.split(":");

Review Comment:
> So why we configure a `COMPLEX` key generator while the key is just simple here?

Because Flink SQL's default logic chooses the COMPLEX key generator when `boolean complexHoodieKey = pks.length > 1 || partitions.length > 1;` holds: https://github.com/apache/hudi/blob/master/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java#L239

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a diff in pull request #6385: [HUDI-4614] fix primary key extract of delete_record when complexKeyGen configured and ChangeLogDisabled
danny0405 commented on code in PR #6385: URL: https://github.com/apache/hudi/pull/6385#discussion_r948601779

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java: @@ -73,21 +73,20 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
    */
   public static String[] extractRecordKeys(String recordKey) {
     String[] fieldKV = recordKey.split(",");
-    if (fieldKV.length == 1) {
-      return fieldKV;
-    } else {
-      // a complex key
-      return Arrays.stream(fieldKV).map(kv -> {
-        final String[] kvArray = kv.split(":");
-        if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-          return null;
-        } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-          return "";
-        } else {
-          return kvArray[1];
-        }
-      }).toArray(String[]::new);
-    }
+
+    return Arrays.stream(fieldKV).map(kv -> {
+      final String[] kvArray = kv.split(":");

Review Comment: So why do we configure a `COMPLEX` key generator while the key is just simple here?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] flashJd commented on a diff in pull request #6385: [HUDI-4614] fix primary key extract of delete_record when complexKeyGen configured and ChangeLogDisabled
flashJd commented on code in PR #6385: URL: https://github.com/apache/hudi/pull/6385#discussion_r948600424

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java: @@ -73,21 +73,20 @@ public static String getPartitionPathFromGenericRecord(GenericRecord genericReco
    */
   public static String[] extractRecordKeys(String recordKey) {
     String[] fieldKV = recordKey.split(",");
-    if (fieldKV.length == 1) {
-      return fieldKV;
-    } else {
-      // a complex key
-      return Arrays.stream(fieldKV).map(kv -> {
-        final String[] kvArray = kv.split(":");
-        if (kvArray[1].equals(NULL_RECORDKEY_PLACEHOLDER)) {
-          return null;
-        } else if (kvArray[1].equals(EMPTY_RECORDKEY_PLACEHOLDER)) {
-          return "";
-        } else {
-          return kvArray[1];
-        }
-      }).toArray(String[]::new);
-    }
+
+    return Arrays.stream(fieldKV).map(kv -> {
+      final String[] kvArray = kv.split(":");

Review Comment: @danny0405 I've explained it; looking forward to your reply.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
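For context on the review thread above, here is a self-contained sketch of the key-parsing behavior being discussed: a complex record key is serialized as `field1:val1,field2:val2`, and the quoted hunk removes the single-field short-circuit so a one-field key like `id:1` is also split into field and value. The placeholder strings and the extra length guard below are assumptions for illustration, not the exact Hudi code.

```java
import java.util.Arrays;

// Sketch of extractRecordKeys-style parsing: split "f1:v1,f2:v2" into values,
// mapping placeholder values back to null / empty string.
class RecordKeySketch {
    static final String NULL_PLACEHOLDER = "__null__";   // assumed placeholder values
    static final String EMPTY_PLACEHOLDER = "__empty__";

    static String[] extractRecordKeys(String recordKey) {
        String[] fieldKV = recordKey.split(",");
        return Arrays.stream(fieldKV).map(kv -> {
            final String[] kvArray = kv.split(":");
            if (kvArray.length == 1) {
                return kvArray[0];                 // no "field:" prefix at all
            } else if (kvArray[1].equals(NULL_PLACEHOLDER)) {
                return null;                       // record key field was null
            } else if (kvArray[1].equals(EMPTY_PLACEHOLDER)) {
                return "";                         // record key field was empty
            } else {
                return kvArray[1];
            }
        }).toArray(String[]::new);
    }
}
```

The thread's point is that with the short-circuit removed, a single-field key produced by the COMPLEX key generator (`id:1`) parses the same way as a multi-field one.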
[GitHub] [hudi] zhilinli123 commented on pull request #3771: [HUDI-2402] Add Kerberos configuration options to Hive Sync
zhilinli123 commented on PR #3771: URL: https://github.com/apache/hudi/pull/3771#issuecomment-1218940639 > It should be, we have supported the option `hive_sync.conf.dir` to support custom hive configurations, you can declare your kerb conf in the target dir Thank you! It would be nice to have a reference document. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] TengHuo commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
TengHuo commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1218935335 @yihua So sorry for the late reply; too many issues recently and I forgot this one. Let me check it today. Thanks for the reminder. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on pull request #3771: [HUDI-2402] Add Kerberos configuration options to Hive Sync
danny0405 commented on PR #3771: URL: https://github.com/apache/hudi/pull/3771#issuecomment-1218927034 > @test-wangxiaoyu @codope Will the new version support this feature It should be. We have supported the option `hive_sync.conf.dir` for custom Hive configurations; you can declare your Kerberos conf in the `hive-site.xml` file in the target dir. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] XuQianJin-Stars commented on pull request #4913: [HUDI-1517] create marker file for every log file
XuQianJin-Stars commented on PR #4913: URL: https://github.com/apache/hudi/pull/4913#issuecomment-1218917104 Hi @guanziyue, can you rebase this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] zhilinli123 commented on pull request #3771: [HUDI-2402] Add Kerberos configuration options to Hive Sync
zhilinli123 commented on PR #3771: URL: https://github.com/apache/hudi/pull/3771#issuecomment-1218909566 @test-wangxiaoyu @codope Will the new version support this feature? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6427: [MINOR] Improve code style of CLI Command classes
hudi-bot commented on PR #6427: URL: https://github.com/apache/hudi/pull/6427#issuecomment-1218906064 ## CI report: * 4a37120edf3a454a802c4ef63916b61b5547ab72 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10795) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] rohit-m-99 closed issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns
rohit-m-99 closed issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns URL: https://github.com/apache/hudi/issues/6335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] rohit-m-99 commented on issue #6335: [SUPPORT] Deltastreamer updates not supporting the addition of new columns
rohit-m-99 commented on issue #6335: URL: https://github.com/apache/hudi/issues/6335#issuecomment-1218892631 Option 2 worked for me! Set hoodie.metadata.enable to false in Deltastreamer and wait for a few commits so that metadata table is deleted completely (no .hoodie/metadata folder), and then re-enable the metadata table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218814176 ## CI report: * 1008d04f7a2bf12b058954ee8e842fc3e4120c7e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10791) * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10796) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218805623 ## CI report: * 1008d04f7a2bf12b058954ee8e842fc3e4120c7e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10791) * 1ff7079b85e0c0b98d8436d94bcbf02cfe431759 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6386: [HUDI-4616] Adding `PulsarSource` to `DeltaStreamer` to support ingesting from Apache Pulsar
nsivabalan commented on code in PR #6386: URL: https://github.com/apache/hudi/pull/6386#discussion_r948550727

## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/PulsarSource.java: @@ -0,0 +1,292 @@

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.utilities.sources;

import org.apache.hudi.DataSourceUtils;
import org.apache.hudi.HoodieConversionUtils;
import org.apache.hudi.common.config.ConfigProperty;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.collection.Pair;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.exception.HoodieIOException;
import org.apache.hudi.util.Lazy;
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionInitialPosition;
import org.apache.pulsar.client.api.SubscriptionType;
import org.apache.pulsar.client.impl.PulsarClientImpl;
import org.apache.pulsar.common.naming.TopicName;
import org.apache.pulsar.shade.io.netty.channel.EventLoopGroup;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.pulsar.JsonUtils;

import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;

import static org.apache.hudi.common.util.ThreadUtils.collectActiveThreads;

/**
 * Source fetching data from Pulsar topics
 */
public class PulsarSource extends RowSource implements Closeable {

  private static final Logger LOG = LogManager.getLogger(PulsarSource.class);

  private static final String HUDI_PULSAR_CONSUMER_ID_FORMAT = "hudi-pulsar-consumer-%d";
  private static final String[] PULSAR_META_FIELDS = new String[]{
      "__key",
      "__topic",
      "__messageId",
      "__publishTime",
      "__eventTime",
      "__messageProperties"
  };

  private final String topicName;

  private final String serviceEndpointURL;
  private final String adminEndpointURL;

  // NOTE: We're keeping the client so that we can shut it down properly
  private final Lazy<PulsarClient> pulsarClient;
  private final Lazy<Consumer<byte[]>> pulsarConsumer;

  public PulsarSource(TypedProperties props,
                      JavaSparkContext sparkContext,
                      SparkSession sparkSession,
                      SchemaProvider schemaProvider) {
    super(props, sparkContext, sparkSession, schemaProvider);

    DataSourceUtils.checkRequiredProperties(props,
        Arrays.asList(
            Config.PULSAR_SOURCE_TOPIC_NAME.key(),
            Config.PULSAR_SOURCE_SERVICE_ENDPOINT_URL.key()));

    // Converting to a descriptor allows us to canonicalize the topic's name properly
    this.topicName = TopicName.get(props.getString(Config.PULSAR_SOURCE_TOPIC_NAME.key())).toString();

    // TODO validate endpoints provided in the appropriate format
    this.serviceEndpointURL = props.getString(Config.PULSAR_SOURCE_SERVICE_ENDPOINT_URL.key());
    this.adminEndpointURL = props.getString(Config.PULSAR_SOURCE_ADMIN_ENDPOINT_URL.key());

    this.pulsarClient = Lazy.lazily(this::initPulsarClient);
    this.pulsarConsumer = Lazy.lazily(this::subscribeToTopic);
  }

  @Override
  protected Pair<Option<Dataset<Row>>, String> fetchNextBatch(Option<String> lastCheckpointStr, long sourceLimit) {
    Pair<MessageId, MessageId> startingEndingOffsetsPair = computeOffsets(lastCheckpointStr, sourceLimit);

    MessageId startingOffset = startingEndingOffsetsPair.getLeft();
    MessageId endingOffset = startingEndingOffsetsPair.getRight();

    String startingOffsetStr = convertToOffsetString(topicName, startingOffset);
    String endingOffsetStr = convertToOffsetString(topicName, endingOffset);

    Dataset<Row> sourceRows = sparkSession.read()
        .format("pulsar")
[GitHub] [hudi] bithw1 commented on issue #6425: [SUPPORT]Writing to MOR table seems not working as expected
bithw1 commented on issue #6425: URL: https://github.com/apache/hudi/issues/6425#issuecomment-1218713061 Thanks @nsivabalan for the helpful answer. When I first write to the hudi table, there are only parquet files there(**insert only**), and when I rerun the application(**update only**), log files appear. So, the reason should be the following: ``` if your workload will never have any updates, you may not see log files at all. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bithw1 closed issue #6425: [SUPPORT]Writing to MOR table seems not working as expected
bithw1 closed issue #6425: [SUPPORT]Writing to MOR table seems not working as expected URL: https://github.com/apache/hudi/issues/6425
[GitHub] [hudi] bithw1 commented on issue #6379: [SUPPORT]What's the reading behavior for MOR table?
bithw1 commented on issue #6379: URL: https://github.com/apache/hudi/issues/6379#issuecomment-1218706302 > yes, you are right. Thanks @nsivabalan for the confirmation!
[GitHub] [hudi] bithw1 closed issue #6379: [SUPPORT]What's the reading behavior for MOR table?
bithw1 closed issue #6379: [SUPPORT]What's the reading behavior for MOR table? URL: https://github.com/apache/hudi/issues/6379
[GitHub] [hudi] dyang108 opened a new issue, #6428: [SUPPORT] S3 Deltastreamer: Block has already been inflated
dyang108 opened a new issue, #6428: URL: https://github.com/apache/hudi/issues/6428

**Describe the problem you faced**

Deltastreamer with write output to S3 exits unexpectedly when running in continuous mode. It seems

**To Reproduce**

Steps to reproduce the behavior — I ran the following:

```
/etc/spark/bin/spark-submit --conf -Dconfig.file=/service.conf,spark.executor.extraJavaOptions=-Dlog4j.debug=true --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --jars /etc/spark/work-dir/* /etc/spark/work-dir/hudi-utilities-bundle_2.11-0.12.0.jar --props /mnt/mesos/sandbox/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider --source-class org.apache.hudi.utilities.sources.AvroKafkaSource --target-base-path "s3a://strava.scratch/tmp/derick/hudi" --target-table "aligned_activities" --op "UPSERT" --source-ordering-field "ts" --table-type "COPY_ON_WRITE" --source-limit 100 --continuous
```

/etc/spark/work-dir/ looks like this:

- aws-java-sdk-bundle-1.12.283.jar
- hadoop-aws-2.6.5.jar
- hudi-utilities-bundle_2.11-0.12.0.jar
- scala-library-2.11.12.jar
- spark-streaming-kafka-0-10_2.11-2.4.8.jar

**Expected behavior**

I don't expect there to be issues on compaction here.

**Environment Description**

* Hudi version : 0.12.0 (also tried 0.11.1)
* Spark version : 2.4.8
* Hive version :
* Hadoop version : 2.6.5
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : Yes, docker on Mesos

I'm reading from an Avro kafka topic.

**Additional context**

Add any other context about the problem here.
Reading Avro record from Kafka:

```
hoodie.datasource.write.recordkey.field=activityId
auto.offset.reset=latest
```

**Stacktrace**

```
22/08/17 23:07:26 ERROR HoodieAsyncService: Service shutdown with error
java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220817230714888
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:190)
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:187)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:557)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 20220817230714888
    at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:64)
    at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
    at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
    at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
    at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:155)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:588)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:335)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:687)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
```
[GitHub] [hudi] hudi-bot commented on pull request #6427: [MINOR] Improve code style of CLI Command classes
hudi-bot commented on PR #6427: URL: https://github.com/apache/hudi/pull/6427#issuecomment-1218684100 ## CI report: * 4a37120edf3a454a802c4ef63916b61b5547ab72 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10795) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[jira] [Created] (HUDI-4635) Update roadmap page based on H2 2022 plan
Ethan Guo created HUDI-4635: --- Summary: Update roadmap page based on H2 2022 plan Key: HUDI-4635 URL: https://issues.apache.org/jira/browse/HUDI-4635 Project: Apache Hudi Issue Type: Improvement Reporter: Ethan Guo -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4635: Priority: Blocker (was: Major) > Update roadmap page based on H2 2022 plan > - > > Key: HUDI-4635 > URL: https://issues.apache.org/jira/browse/HUDI-4635 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.13.0 > >
[jira] [Assigned] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo reassigned HUDI-4635: --- Assignee: Ethan Guo > Update roadmap page based on H2 2022 plan > - > > Key: HUDI-4635 > URL: https://issues.apache.org/jira/browse/HUDI-4635 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Major > Fix For: 0.13.0 > >
[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4635: Component/s: docs > Update roadmap page based on H2 2022 plan > - > > Key: HUDI-4635 > URL: https://issues.apache.org/jira/browse/HUDI-4635 > Project: Apache Hudi > Issue Type: Improvement > Components: docs >Reporter: Ethan Guo >Priority: Major > Fix For: 0.13.0 > >
[jira] [Updated] (HUDI-4635) Update roadmap page based on H2 2022 plan
[ https://issues.apache.org/jira/browse/HUDI-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-4635: Fix Version/s: 0.13.0 > Update roadmap page based on H2 2022 plan > - > > Key: HUDI-4635 > URL: https://issues.apache.org/jira/browse/HUDI-4635 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Ethan Guo >Priority: Major > Fix For: 0.13.0 > >
[GitHub] [hudi] hudi-bot commented on pull request #6427: [MINOR] Improve code style of CLI Command classes
hudi-bot commented on PR #6427: URL: https://github.com/apache/hudi/pull/6427#issuecomment-1218676855 ## CI report: * 4a37120edf3a454a802c4ef63916b61b5547ab72 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] parisni commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
parisni commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218625262 Yeah, KEEP_LATEST_COMMITS. Since cleaning never finds files to delete, it always falls back into getPartitionPathsForFullCleaning(). But that method looks for paths on disk, while it then looks for file groups to delete in the metadata table. Also, I guess there is a problem with using incremental cleaning together with KEEP_LATEST_COMMITS, which leads to never cleaning some partitions after a first clean, but I will open a separate issue for that one. Incremental cleaning should be used together with KEEP_LATEST_FILE_VERSIONS only.
[GitHub] [hudi] yihua opened a new pull request, #6427: [MINOR] Improve code style of CLI Command classes
yihua opened a new pull request, #6427: URL: https://github.com/apache/hudi/pull/6427 ### Change Logs As above for code style changes only. There is no logic change. ### Impact **Risk level: none** Only code reformatting. ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] nsivabalan commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
nsivabalan commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218565541 May I know what cleaning policy you are using? I see that for KEEP_LATEST_FILE_VERSIONS, we call getPartitionPathsForFullCleaning(), within which we use file-system-based listing and not metadata-table-based listing. And if you are using KEEP_LATEST_COMMITS with incremental clean mode enabled, if there is no prior clean ever, we trigger getPartitionPathsForFullCleaning() (within which we use file-system-based listing and not metadata-table-based listing). If not for these, we should be hitting only metadata-based listing. Can you confirm which one among the above is your case?
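The decision logic described in the comment above can be sketched in plain Java. This is a toy model of the rules as stated in the thread, with no Hudi dependency; the method name and boolean flags are illustrative, not actual Hudi internals:

```java
public class CleanListingSketch {

  // Mirrors the rules stated above:
  // - KEEP_LATEST_FILE_VERSIONS always triggers full file-system listing.
  // - KEEP_LATEST_COMMITS falls back to full listing when incremental clean
  //   is disabled or when no clean has ever completed; otherwise it can use
  //   the metadata-table-based (incremental) path.
  static boolean usesFullFileSystemListing(String policy,
                                           boolean incrementalCleanEnabled,
                                           boolean hasCompletedClean) {
    if ("KEEP_LATEST_FILE_VERSIONS".equals(policy)) {
      return true;
    }
    // KEEP_LATEST_COMMITS
    return !incrementalCleanEnabled || !hasCompletedClean;
  }

  public static void main(String[] args) {
    if (!usesFullFileSystemListing("KEEP_LATEST_FILE_VERSIONS", true, true)) {
      throw new AssertionError("file-versions policy should always list fully");
    }
    if (!usesFullFileSystemListing("KEEP_LATEST_COMMITS", true, false)) {
      throw new AssertionError("first-ever clean should list fully");
    }
    if (usesFullFileSystemListing("KEEP_LATEST_COMMITS", true, true)) {
      throw new AssertionError("steady-state incremental clean should not list fully");
    }
    System.out.println("ok");
  }
}
```

Under this model, parisni's report (full listing on every clean) matches the KEEP_LATEST_COMMITS path where the incremental branch is never reached.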
[GitHub] [hudi] parisni commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
parisni commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218559061 > But as you can imagine, this is going to result in huge no of file groups in general and puts lot of pressure on the system Do you mean pressure when cleaning, pressure when reading, or in general? Also, insert produces the same number of file groups, since I am in a case of an append-only table with no new data in the past. Anyway, cleaning is much faster w/o the metadata table, and it would help to be able to configure cleaning to work on disk only.
[GitHub] [hudi] nsivabalan commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
nsivabalan commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218558261 I am looking into the slowness in cleaning. will keep you posted.
[GitHub] [hudi] nsivabalan commented on issue #6373: [SUPPORT] Incremental cleaning never used during insert
nsivabalan commented on issue #6373: URL: https://github.com/apache/hudi/issues/6373#issuecomment-1218528854 wrt bulk_insert, I understand cleaning is not going to be of any use, because every new commit goes into new file groups. Hence there won't be any file groups with multiple file slices that might be eligible for cleaning. But as you can imagine, this is going to result in a huge number of file groups in general and puts a lot of pressure on the system.
[GitHub] [hudi] nsivabalan commented on issue #6379: [SUPPORT]What's the reading behavior for MOR table?
nsivabalan commented on issue #6379: URL: https://github.com/apache/hudi/issues/6379#issuecomment-1218526892 yes, you are right.
[GitHub] [hudi] nsivabalan commented on issue #6425: [SUPPORT]Writing to MOR table seems not working as expected
nsivabalan commented on issue #6425: URL: https://github.com/apache/hudi/issues/6425#issuecomment-1218526557 Looks like you are setting max commits to trigger compaction to 1. And so, after every delta commit, compaction is triggered, converting parquet + log files into new parquet files. Another reason could be: if your workload never has any updates, you may not see log files at all.
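The NUM_COMMITS-style trigger described above can be sketched in a few lines of plain Java. This is an illustrative model, not Hudi's actual scheduling code; the relevant write config is `hoodie.compact.inline.max.delta.commits`:

```java
public class CompactionTriggerSketch {

  // Schedule compaction once the number of delta commits since the last
  // compaction reaches the configured threshold.
  static boolean shouldScheduleCompaction(int deltaCommitsSinceLastCompaction, int maxDeltaCommits) {
    return deltaCommitsSinceLastCompaction >= maxDeltaCommits;
  }

  public static void main(String[] args) {
    // With the threshold set to 1, every delta commit triggers compaction,
    // so log files are merged into base parquet files almost immediately.
    if (!shouldScheduleCompaction(1, 1)) {
      throw new AssertionError();
    }
    // With a threshold of 5, log files accumulate across delta commits first.
    if (shouldScheduleCompaction(4, 5)) {
      throw new AssertionError();
    }
    System.out.println("ok");
  }
}
```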
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218514404 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * 2b9408365f959cba49d9aae5fe0a0340eaed2cc2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10794) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218504183 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) * 2b9408365f959cba49d9aae5fe0a0340eaed2cc2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10794) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218459979 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) * 2b9408365f959cba49d9aae5fe0a0340eaed2cc2 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218448062 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218444100 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6345: [HUDI-4552]: RFC-58: Integrate column stats index with all query engines
alexeykudinkin commented on code in PR #6345: URL: https://github.com/apache/hudi/pull/6345#discussion_r948358504 ## rfc/rfc-58/rfc-58.md: ## @@ -0,0 +1,69 @@ + +# RFC-58: Integrate column stats index with all query engines + + + +## Proposers + +- @pratyakshsharma + +## Approvers Review Comment: @prasannarajaperumal please add me to this forum as well
[GitHub] [hudi] the-other-tim-brown commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
the-other-tim-brown commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218430020 @hudi-bot run azure
[GitHub] [hudi] parisni commented on issue #6342: [SUPPORT] Reconcile schema fails when multiple fields missing
parisni commented on issue #6342: URL: https://github.com/apache/hudi/issues/6342#issuecomment-1218417408 Also, after more testing: the issue is not with multiple columns missing, but when the last column is missing. On August 15, 2022 5:42:26 PM UTC, Sivabalan Narayanan ***@***.***> wrote: >reconcilation of schemas work only if missing columns are in the end. > >For eg: >Commit1: >Schema: col1, col2, col3, col4 > >Commit2: >Schema: col1, col2, col3, col4, col5, col6 > >Commit3: old writer. >Schema: col1, col2, col3, col4 >Commit3 will succeed when you enable reconcile schema config. > >In general, adding new colunmns at the end of existing schema works. it may break if you try to add new columns inbetween. for eg, in commit2, if your schema as col1, col5, col6, col2, col3, col4, your write or read may break.
[GitHub] [hudi] parisni commented on issue #6342: [SUPPORT] Reconcile schema fails when multiple fields missing
parisni commented on issue #6342: URL: https://github.com/apache/hudi/issues/6342#issuecomment-1218406180 > reconcilation of schemas work only if missing columns are in the end @nsivabalan Do you mean https://hudi.apache.org/docs/configurations#hoodiedatasourcewritereconcileschema ? That's not the behavior I see: I'm able to write data with a missing column at any position, until only one is missing. Also, the doc doesn't mention such a limitation, nor does the source code. @xiarixiaoyao as the dev of the reconcile feature, do you have some insight?
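The constraint as stated in the quoted reply (a claim parisni disputes above) amounts to a suffix check: an incoming schema can be reconciled only if the columns it is missing form a suffix of the table schema. A toy sketch of that rule in plain Java, not Hudi's actual reconciliation code:

```java
import java.util.List;

public class ReconcileSchemaSketch {

  // Rule as stated: the incoming schema must match a prefix of the table
  // schema, i.e. any missing columns sit at the end (newly added columns).
  static boolean canReconcile(List<String> tableSchema, List<String> incomingSchema) {
    if (incomingSchema.size() > tableSchema.size()) {
      return false;
    }
    return incomingSchema.equals(tableSchema.subList(0, incomingSchema.size()));
  }

  public static void main(String[] args) {
    List<String> table = List.of("col1", "col2", "col3", "col4");
    // Older writer missing the trailing columns: reconcilable under the rule.
    if (!canReconcile(table, List.of("col1", "col2"))) {
      throw new AssertionError();
    }
    // Writer missing a middle column: not reconcilable under the rule.
    if (canReconcile(table, List.of("col1", "col3", "col4"))) {
      throw new AssertionError();
    }
    System.out.println("ok");
  }
}
```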
[GitHub] [hudi] yihua commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
yihua commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1218365693 @TengHuo any update on addressing the comment above? Do you need any help?
[GitHub] [hudi] nsivabalan commented on a diff in pull request #6417: [HUDI-4565] Release note for version 0.12.0
nsivabalan commented on code in PR #6417: URL: https://github.com/apache/hudi/pull/6417#discussion_r948281603 ## website/releases/release-0.12.0.md: ## @@ -0,0 +1,143 @@ +--- +title: "Release 0.12.0" +sidebar_position: 2 +layout: releases +toc: true +last_modified_at: 2022-08-17T10:30:00+05:30 +--- +# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide)) + +## Release Highlights + +### Presto-Hudi Connector + +Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table. +It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector, +please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html). + +### Archival Beyond Savepoint + +Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write Review Comment: Probably we can rephrase as below. ``` Hudi supports the savepoint and restore feature that users can use for disaster recovery scenarios. More info can be found [here](https://hudi.apache.org/docs/disaster_recovery). Until 0.12, archival for a given table will not make progress beyond the first savepointed commit. But there has been an ask from the community to relax this constraint so that we can retain some coarse-grained commits in the active timeline and execute point-in-time queries. So, with 0.12, users can now let archival proceed beyond savepointed commits by enabling `hoodie.archive.beyond.savepoint`. This unlocks new opportunities for Hudi users. For example, one can retain commits for years, by adding one savepoint per day for older commits (let's say > 30 days). And query the Hudi table using "as.of.instant" with any older savepointed commit. This way, Hudi does not need to retain every commit in the active timeline for older commits. ``` Let me know what you think. 
``` ## website/releases/release-0.12.0.md: Review Comment: do retain the "note" section in the end. ## website/releases/release-0.12.0.md: ## @@ -0,0 +1,143 @@ +--- +title: "Release 0.12.0" +sidebar_position: 2 +layout: releases +toc: true +last_modified_at: 2022-08-17T10:30:00+05:30 +--- +# [Release 0.12.0](https://github.com/apache/hudi/releases/tag/release-0.12.0) ([docs](/docs/quick-start-guide)) + +## Release Highlights + +### Presto-Hudi Connector + +Since version 0.275 of PrestoDB, users can now leverage native Hudi connector to query Hudi table. +It is on par with Hudi support in the Hive connector. To learn more about the usage of the connector, +please checkout [prestodb documentation](https://prestodb.io/docs/current/connector/hudi.html). + +### Archival Beyond Savepoint + +Users can now archive Hudi table beyond savepoint commit. Just enable `hoodie.archive.beyond.savepoint` write +configuration. This unlocks new opportunities for Hudi users. For example, one can retain commits for years, by adding +one savepoint per day for older commits (say > 30 days old). And they can query hudi using `as.of.instant` semantics for +old data. 
In previous versions, one would have to retain every commit and let archival stop at the first commit. + +:::note +However, if this feature is enabled, restore cannot be supported. This limitation would be relaxed in a future release +and the development of this feature can be tracked in [HUDI-4500](https://issues.apache.org/jira/browse/HUDI-4500). +::: + +### Deltastreamer Termination Strategy + +Users can now configure a post-write termination strategy with deltastreamer `continuous` mode if need be. For instance, +users can configure graceful shutdown if there is no new data from source for 5 consecutive times. Here is the interface +for the termination strategy. +```java +/** + * Post write termination strategy for deltastreamer in continuous mode. + */ +public interface PostWriteTerminationStrategy { + + /** + * Returns whether deltastreamer needs to be shutdown. + * @param scheduledCompactionInstantAndWriteStatuses optional pair of scheduled compaction instant and write statuses. + * @return true if
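The quoted `PostWriteTerminationStrategy` interface is cut off above; conceptually it is a predicate consulted after each write round. A minimal sketch of the "graceful shutdown after no new data for N consecutive rounds" policy mentioned in the text (illustrative Python, not Hudi's actual Java API):

```python
class NoNewDataTerminationStrategy:
    """Shut down after `max_empty_rounds` consecutive rounds with no new data.

    Illustrative sketch of a post-write termination policy; the real
    implementation is Hudi's Java PostWriteTerminationStrategy.
    """

    def __init__(self, max_empty_rounds=5):
        self.max_empty_rounds = max_empty_rounds
        self.empty_rounds = 0

    def should_shutdown(self, records_written):
        if records_written == 0:
            self.empty_rounds += 1
        else:
            self.empty_rounds = 0  # any progress resets the counter
        return self.empty_rounds >= self.max_empty_rounds


strategy = NoNewDataTerminationStrategy(max_empty_rounds=3)
stop = False
for n in [10, 0, 0, 0]:
    stop = strategy.should_shutdown(n)
print(stop)  # True after three consecutive empty rounds
```

The key design point is that the counter resets on any progress, so a quiet period only triggers shutdown if it is truly uninterrupted.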
[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5113: [HUDI-3625] [RFC-56] Optimized storage layout for Cloud Object Stores
alexeykudinkin commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r948277674 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,226 @@ + + +# RFC-56: Federated Storage Layer + +## Proposers +- @umehrot2 + +## Approvers +- @vinoth +- @shivnarayan + +## Status + +JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625) + +## Abstract + +As you scale your Apache Hudi workloads over Cloud object stores like Amazon S3, there is potential of hitting request +throttling limits which in-turn impacts performance. In this RFC, we are proposing to support an alternate storage +layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and +significantly reduce throttling. + +In addition, we are proposing an interface that would allow users to implement their own custom strategy to allow them +to distribute the data files across cloud stores, hdfs or on prem based on their specific use-cases. + +## Background + +Apache Hudi follows the traditional Hive storage layout while writing files on storage: +- Partitioned Tables: The files are distributed across multiple physical partition folders, under the table's base path. +- Non Partitioned Tables: The files are stored directly under the table's base path. + +While this storage layout scales well for HDFS, it increases the probability of hitting request throttle limits when +working with cloud object stores like Amazon S3 and others. This is because Amazon S3 and other cloud stores [throttle +requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). +Amazon S3 does scale based on request patterns for different prefixes and adds internal partitions (with their own request limits), +but there can be a 30 - 60 minute wait time before new partitions are created. 
Thus, all files/objects stored under the +same table path prefix could result in these request limits being hit for the table prefix, especially as workloads +scale, and there are several thousands of files being written/updated concurrently. This hurts performance due to +re-trying of failed requests affecting throughput, and results in occasional failures if the retries are not able to +succeed either and continue to be throttled. + +The traditional storage layout also tightly couples the partitions as folders under the table path. However, +some users want flexibility to be able to distribute files/partitions under multiple different paths across cloud stores, +hdfs etc. based on their specific needs. For example, customers have use cases to distribute files for each partition under +a separate S3 bucket with its individual encryption key. It is not possible to implement such use-cases with Hudi currently. + +The high level proposal here is to introduce a new storage layout strategy, where all files are distributed evenly across +multiple randomly generated prefixes under the Amazon S3 bucket, instead of being stored under a common table path/prefix. Review Comment: I think we need to compartmentalize this discussion in 2 actually: 1. Federated Storage support: being able to offload "partitions" to different buckets 2. Addressing the "common prefix" problem For the latter, the solution is not necessarily the former, and we already have a solution for it -- we eliminate physical partitioning altogether, store tables as non-partitioned ones and support "logical-partitioning" on top of it (which requires just 2 ingredients, partition-listing and partitioning constraints on what records are stored in files w/in the partition). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
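The layout proposed in the RFC above — spreading files evenly over randomly generated prefixes instead of one common table prefix — boils down to a stable mapping from file to prefix. A minimal sketch (illustrative Python; the prefix naming and count here are assumptions, not the RFC's actual strategy):

```python
import hashlib


def storage_prefix(file_id, prefixes):
    """Map a file id to one of several storage prefixes.

    A stable hash spreads objects evenly across the prefix set, so
    request load is not concentrated under a single table path prefix
    (the throttling problem described in the RFC background).
    """
    digest = hashlib.md5(file_id.encode("utf-8")).hexdigest()
    return prefixes[int(digest, 16) % len(prefixes)]


# Hypothetical prefix set; a real strategy would persist this mapping.
prefixes = [f"s3://bucket/{i:04x}" for i in range(16)]

# The same file id always maps to the same prefix...
assert storage_prefix("file-0001.parquet", prefixes) == storage_prefix("file-0001.parquet", prefixes)

# ...while many file ids spread across the whole prefix set.
used = {storage_prefix(f"file-{i}.parquet", prefixes) for i in range(1000)}
print(len(used))  # close to 16
```

Determinism matters here: since the prefix is derived from the file id alone, readers can locate a file without a central lookup, at the cost of losing the ability to list a whole table under one prefix — which is why the RFC pairs this with metadata-based file listing.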
[GitHub] [hudi] nsivabalan commented on a diff in pull request #5139: [WIP][HUDI-3579] Add timeline commands in hudi-cli
nsivabalan commented on code in PR #5139: URL: https://github.com/apache/hudi/pull/5139#discussion_r948228142 ## hudi-cli/src/main/java/org/apache/hudi/cli/commands/TimelineCommand.java: ## @@ -0,0 +1,410 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi.cli.commands; + +import org.apache.hudi.avro.model.HoodieRollbackMetadata; +import org.apache.hudi.avro.model.HoodieRollbackPlan; +import org.apache.hudi.cli.HoodieCLI; +import org.apache.hudi.cli.HoodiePrintHelper; +import org.apache.hudi.cli.HoodieTableHeaderField; +import org.apache.hudi.cli.TableHeader; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.timeline.TimelineMetadataUtils; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.metadata.HoodieTableMetadata; + +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.springframework.shell.core.CommandMarker; +import org.springframework.shell.core.annotation.CliCommand; +import org.springframework.shell.core.annotation.CliOption; +import org.springframework.stereotype.Component; + +import java.io.IOException; +import java.text.SimpleDateFormat; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Comparator; +import java.util.Date; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * CLI command to display timeline options. Review Comment: can we add some examples here.
[GitHub] [hudi] yihua commented on issue #4230: [SUPPORT] org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file
yihua commented on issue #4230: URL: https://github.com/apache/hudi/issues/4230#issuecomment-1218290284 @BenjMaq recently we fixed a bug in handling the timeline-server-based marker requests at the timeline server, which should resolve the problem of marker creation failure: #6383. If you get a chance, could you try the latest master with `hoodie.write.markers.type=TIMELINE_SERVER_BASED` (or using the default by not setting such config) and see if the problem goes away?
[GitHub] [hudi] Zouxxyy commented on issue #6397: [SUPPORT] spark history server - sql tab
Zouxxyy commented on issue #6397: URL: https://github.com/apache/hudi/issues/6397#issuecomment-1218248368 > Yes exactly that's what we're looking for. Did we miss some configurations? I didn't set any other parameters for it; I don't know why you can't display it. Maybe it's a web page rendering problem?
[GitHub] [hudi] Zouxxyy opened a new issue, #6426: [SUPPORT] Questions about the rollback mechanism of the MOR table
Zouxxyy opened a new issue, #6426: URL: https://github.com/apache/hudi/issues/6426 I'm a Hudi beginner and have some questions; I hope to get answers ^_^ 1. I would like to know if my understanding of the MOR table rollback mechanism is correct: the rollback logic of the MOR table is to append a new log block that records the rollback, and Hudi will not merge logs marked as rolled back in subsequent compactions. 2. Why not just delete the logs that need to be rolled back? Wouldn't that be more convenient? (Just like the COW table, which deletes the parquet files that need rollback.) Thank you very much!
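On question 1: the MOR rollback path appends a rollback command block to the log rather than rewriting it, and the merge/compaction path then skips data blocks from rolled-back instants. A minimal sketch of that merge rule (illustrative Python; the dict-based block format here is an assumption, not Hudi's actual log format):

```python
def merge_log(blocks):
    """Merge MOR log blocks, honoring rollback command blocks.

    Each block is a dict: {"type": "data", "instant": t, "records": [...]}
    or {"type": "rollback", "target": t}. Rather than rewriting the log,
    a rollback appends a command block; on read/compaction, data blocks
    whose instant was rolled back are simply skipped.
    """
    rolled_back = {b["target"] for b in blocks if b["type"] == "rollback"}
    records = []
    for b in blocks:
        if b["type"] == "data" and b["instant"] not in rolled_back:
            records.extend(b["records"])
    return records


log = [
    {"type": "data", "instant": "t1", "records": ["a", "b"]},
    {"type": "data", "instant": "t2", "records": ["c"]},  # failed write
    {"type": "rollback", "target": "t2"},                 # appended, not deleted
]
print(merge_log(log))  # ['a', 'b']
```

This append-only design also speaks to question 2: appending a small command block is generally cheaper and safer (especially with concurrent readers, and on storage where in-place edits are expensive) than locating and deleting earlier log data.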
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218189301 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218177009 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792) * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10793)
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218169779 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792) * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN * a71b8d7629b59bd969c3a10e5c90dfd020cf084f UNKNOWN
[GitHub] [hudi] bithw1 opened a new issue, #6425: [SUPPORT]Writing to MOR table seems not working as expected
bithw1 opened a new issue, #6425: URL: https://github.com/apache/hudi/issues/6425 Hi, I am working with Hudi 0.9.0, and I have the following code that writes 10 records to a MOR Hudi table (one record per Spark job). There are 11 commits in total. When I look at the files written to disk, there is `NO` log file, and there are 11 parquet files. It looks like I am writing to a COW table; I am not sure where the problem is.

```
package org.example

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
import org.apache.hudi.index.HoodieIndex
import org.apache.spark.sql.{SaveMode, SparkSession}

case class Order(name: String, price: String, creation_date: String)

object Hudi003_Demo {

  val spark = SparkSession.builder.appName(this.getClass.getSimpleName)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .enableHiveSupport()
    .master("local[1]")
    .getOrCreate()

  def write_data(i: Int): Unit = {
    val hudi_table_name = this.getClass.getSimpleName
    val base_path = "/data/hudi_demo/" + hudi_table_name
    import spark.implicits._
    val order = Order(name = s"order_$i", price = s"$i-11.3", creation_date = s"date-0")
    val insertData = spark.createDataset(Seq(order))

    // DataFrame write
    var writer = insertData.write.format("hudi")
      .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "name")
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "creation_date")
      .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
      .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
      .option("hoodie.insert.shuffle.parallelism", "1")
      .option("hoodie.upsert.shuffle.parallelism", "1")
      .option(HoodieWriteConfig.TABLE_NAME, hudi_table_name)
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "creation_date")
      // Write to MOR table
      .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
      .option("hoodie.compact.inline", "false")
      .option("hoodie.compact.inline.max.delta.commits", "1")

    writer.mode(SaveMode.Append)
      .save(base_path)
  }

  def test1(): Unit = {
    (0 to 10).foreach { i =>
      write_data(i)
    }
  }

  def main(args: Array[String]): Unit = {
    test1()
  }
}
```
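On the two compaction options in the snippet above: `hoodie.compact.inline` gates whether compaction is scheduled inline at all, and `hoodie.compact.inline.max.delta.commits` is the threshold of accumulated delta commits that triggers it. A minimal sketch of that trigger semantics (illustrative Python, not Hudi's actual scheduler):

```python
def compaction_due(delta_commits_since_last_compaction, max_delta_commits, inline_enabled):
    """Decide whether an inline compaction should be scheduled.

    Sketch of the `hoodie.compact.inline` /
    `hoodie.compact.inline.max.delta.commits` semantics: compaction is
    only considered when inline compaction is enabled, and fires once
    enough delta commits have accumulated since the last compaction.
    """
    return inline_enabled and delta_commits_since_last_compaction >= max_delta_commits


# With the settings in the issue, inline compaction never fires because
# `hoodie.compact.inline` is false, regardless of the threshold of 1:
print(compaction_due(11, 1, inline_enabled=False))  # False
print(compaction_due(1, 1, inline_enabled=True))    # True
```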
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218100166 ## CI report: * 1008d04f7a2bf12b058954ee8e842fc3e4120c7e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10791)
[GitHub] [hudi] hudi-bot commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
hudi-bot commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218093641 ## CI report: * 1008d04f7a2bf12b058954ee8e842fc3e4120c7e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10791)
[GitHub] [hudi] sufei2009 commented on issue #6411: Hudi Record Key Data Type Must be String
sufei2009 commented on issue #6411: URL: https://github.com/apache/hudi/issues/6411#issuecomment-1218092218 We were not getting any exceptions. However, the partitions were acting inconsistently. Data was only writing to the last partition most of the time. Once we changed the primary key type to string, the partitions worked correctly.
[GitHub] [hudi] dongkelun commented on pull request #6419: [HUDI-2057] CTAS Generate An External Table When Create Managed Table
dongkelun commented on PR #6419: URL: https://github.com/apache/hudi/pull/6419#issuecomment-1218087958 @hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218086389 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792) * 53c4195ab7d36e4b369b368584ec1cca55d2c2e7 UNKNOWN
[GitHub] [hudi] kylincode closed issue #6423: [SUPPORT] After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema
kylincode closed issue #6423: [SUPPORT] After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema URL: https://github.com/apache/hudi/issues/6423
[GitHub] [hudi] xxWSHxx opened a new issue, #6424: [SUPPORT] After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema
xxWSHxx opened a new issue, #6424: URL: https://github.com/apache/hudi/issues/6424 After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema.

Steps to reproduce the behavior:

1. Spark SQL create table t1: `create table t1 ( id int, name string, price double, ts long ) using hudi location '/tmp/t1' options ( type = 'mor', primaryKey = 'id', preCombineField = 'ts' );`
2. Insert data: `insert into t1 values(1,'Tom',0.9,1000);`
3. Drop the price column: `alter table t1 drop column price;`
4. Time travel query: `select * from t1 timestamp as of '20220817161104255'`

It is found that when time travel queries historical data, the results show the latest schema (only the id, name, ts columns) instead of the historical schema.

**Expected behavior** When time travel queries historical data, the results should show the schema as of the historical time point.

**Environment Description**

* Hudi version : 0.11.0
* Spark version : 3.2.2
* Storage (HDFS/S3/GCS..) : local mac os
* Running on Docker? (yes/no) : no
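For context on step 4 above: "timestamp as of T" resolves to the latest commit instant at or before T, and the issue is about which schema that snapshot is rendered with. A minimal sketch of the instant resolution itself (illustrative Python, not Hudi's implementation):

```python
def as_of_commit(commits, timestamp):
    """Resolve a time-travel timestamp to a commit instant.

    commits: sorted list of commit instant times (strings, oldest first;
    Hudi instants are lexicographically ordered timestamps). "timestamp
    as of T" reads the table state produced by the latest commit whose
    instant is <= T. Which *schema* that state is served with is exactly
    the question raised in this issue.
    """
    eligible = [c for c in commits if c <= timestamp]
    if not eligible:
        raise ValueError("no commit at or before " + timestamp)
    return eligible[-1]


# Hypothetical timeline: the insert commit, then the drop-column commit.
commits = ["20220817161104255", "20220817161300000"]
print(as_of_commit(commits, "20220817161104255"))  # 20220817161104255
```

Resolving the instant is the easy part; serving the snapshot with the schema that was active at that instant (rather than the table's latest schema) is what the reporter expected.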
[GitHub] [hudi] kylincode opened a new issue, #6423: [SUPPORT] After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema
kylincode opened a new issue, #6423: URL: https://github.com/apache/hudi/issues/6423 After schema evolution, when time travel queries the historical data, the results show the latest schema instead of the historical schema.

Steps to reproduce the behavior:

1. Spark SQL create table t1: `create table t1 ( id int, name string, price double, ts long ) using hudi location '/tmp/t1' options ( type = 'mor', primaryKey = 'id', preCombineField = 'ts' );`
2. Insert data: `insert into t1 values(1,'Tom',0.9,1000);`
3. Drop the price column: `alter table t1 drop column price;`
4. Time travel query: `select * from t1 timestamp as of '20220817161104255'`

It is found that when time travel queries historical data, the results show the latest schema (only the id, name, ts columns) instead of the historical schema.

**Expected behavior** When time travel queries historical data, the results should show the schema as of the historical time point.

**Environment Description**

* Hudi version : 0.11.0
* Spark version : 3.2.2
* Storage (HDFS/S3/GCS..) : local mac os
* Running on Docker? (yes/no) : no
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1218003704 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792)
[GitHub] [hudi] ToBeFinder closed issue #6371: [SUPPORT]when flink recovers from savepoint, there will be some data duplication in hudi
ToBeFinder closed issue #6371: [SUPPORT]when flink recovers from savepoint, there will be some data duplication in hudi URL: https://github.com/apache/hudi/issues/6371
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1217998066 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * b2d6b015aa283cba1949665771c8fb6caddb7b1f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10741) * ab07f137f867a065b5ae3ab422fc3a498d1e888d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10792)
[GitHub] [hudi] hudi-bot commented on pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
hudi-bot commented on PR #6170: URL: https://github.com/apache/hudi/pull/6170#issuecomment-1217992427 ## CI report: * 1de72127453499b362801d6e912f2be1fd566775 UNKNOWN * e9a0d9065b5d7af14e9bc969284842391a34ef62 UNKNOWN * b2d6b015aa283cba1949665771c8fb6caddb7b1f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10741) * ab07f137f867a065b5ae3ab422fc3a498d1e888d UNKNOWN
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6170: [HUDI-4441] Log4j2 configuration fixes and removal of log4j1 dependencies
the-other-tim-brown commented on code in PR #6170: URL: https://github.com/apache/hudi/pull/6170#discussion_r947882426 ## docker/demo/config/log4j2.properties: ## @@ -0,0 +1,60 @@ +### +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +### +status = warn +name = HudiConsoleLog + +# Set everything to be logged to the console +appender.console.type = Console +appender.console.name = CONSOLE +appender.console.layout.type = PatternLayout +appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n + +# Root logger level +rootLogger.level = warn +# Root logger referring to console appender +rootLogger.appenderRef.stdout.ref = CONSOLE + +# Set the default spark-shell log level to WARN. When running the spark-shell, the +# log level for this class is used to overwrite the root logger's log level, so that +# the user can have different defaults for the shell and regular Spark apps. 
+logger.apache_spark_repl.name = org.apache.spark.repl.Main +logger.apache_spark_repl.level = warn +# Set logging of integration testsuite to INFO level +logger.hudi_integ.name = org.apache.hudi.integ.testsuite +logger.hudi_integ.level = info +# Settings to quiet third party logs that are too verbose +logger.apache_spark_jetty.name = org.spark_project.jetty +logger.apache_spark_jetty.level = warn +logger.apache_spark_jett_lifecycle.name = org.spark_project.jetty.util.component.AbstractLifeCycle Review Comment: This allows for fine-grained control over the logging for the sake of the demo. Should we leave this one instance in the repo since it's for a Docker demo? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
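The log4j2.properties under review relies on hierarchical logger-name resolution: a logger inherits the level of its most specific configured ancestor, falling back to the root. Python's standard `logging` module resolves names the same way, so the behavior can be sketched with it (logger names taken from the config above; this is an analogy, not Log4j2 itself):

```python
import logging

# Mirror the properties file: WARN on the root, INFO for the integ testsuite.
logging.getLogger().setLevel(logging.WARNING)  # rootLogger.level = warn
logging.getLogger("org.apache.hudi.integ.testsuite").setLevel(logging.INFO)

# A child logger walks up the dotted name until it finds a configured level.
integ = logging.getLogger("org.apache.hudi.integ.testsuite.SomeJob")
other = logging.getLogger("org.apache.spark.repl.Main")

print(integ.isEnabledFor(logging.INFO))  # True: inherits INFO from the testsuite logger
print(other.isEnabledFor(logging.INFO))  # False: no configured ancestor, falls back to WARN root
```

This is why the per-logger entries in the demo config can quiet only the noisy Jetty/Spark packages while leaving the integration testsuite verbose.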
[GitHub] [hudi] prasannarajaperumal commented on a diff in pull request #5113: [HUDI-3625] [RFC-56] Optimized storage layout for Cloud Object Stores
prasannarajaperumal commented on code in PR #5113: URL: https://github.com/apache/hudi/pull/5113#discussion_r947841796 ## rfc/rfc-56/rfc-56.md: ## @@ -0,0 +1,226 @@ + +# RFC-56: Federated Storage Layer + +## Proposers +- @umehrot2 + +## Approvers +- @vinoth +- @shivnarayan + +## Status + +JIRA: [https://issues.apache.org/jira/browse/HUDI-3625](https://issues.apache.org/jira/browse/HUDI-3625) + +## Abstract + +As you scale your Apache Hudi workloads over cloud object stores like Amazon S3, there is a risk of hitting request +throttling limits, which in turn impacts performance. In this RFC, we are proposing to support an alternate storage +layout that is optimized for Amazon S3 and other cloud object stores, which helps achieve maximum throughput and +significantly reduce throttling. + +In addition, we are proposing an interface that would allow users to implement their own custom strategy to +distribute the data files across cloud stores, HDFS, or on-prem storage based on their specific use cases. + +## Background + +Apache Hudi follows the traditional Hive storage layout while writing files on storage: +- Partitioned Tables: The files are distributed across multiple physical partition folders, under the table's base path. +- Non Partitioned Tables: The files are stored directly under the table's base path. + +While this storage layout scales well for HDFS, it increases the probability of hitting request throttle limits when +working with cloud object stores like Amazon S3 and others. This is because Amazon S3 and other cloud stores [throttle +requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). +Amazon S3 does scale based on request patterns for different prefixes and adds internal partitions (with their own request limits), +but there can be a 30-60 minute wait time before new partitions are created. 
Thus, all files/objects stored under the +same table path prefix could result in these request limits being hit for the table prefix, especially as workloads +scale and several thousand files are being written/updated concurrently. This hurts performance, since retrying +failed requests reduces throughput, and can result in occasional failures if the retries cannot +succeed either and continue to be throttled. + +The traditional storage layout also tightly couples the partitions as folders under the table path. However, +some users want the flexibility to distribute files/partitions under multiple different paths across cloud stores, Review Comment: This is a nice abstraction to think about. HudiFile is a logical file path, and this gets annotated by a PhysicalLocation which contains the cloud provider creds/volume/bucket.
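The throttling problem described in the RFC comes from every object sharing the table's base-path prefix. One common mitigation, which the proposed federated layout generalizes, is to salt object keys with a stable hash so that writes fan out across many prefixes, each rate-limited independently. A minimal illustrative sketch (the function names and path scheme are hypothetical, not Hudi's actual layout):

```python
import hashlib

def hive_style_path(base: str, partition: str, file_name: str) -> str:
    """Traditional layout: every object shares the table base-path prefix."""
    return f"{base}/{partition}/{file_name}"

def hashed_prefix_path(base: str, partition: str, file_name: str) -> str:
    """Illustrative alternative: prepend a short, deterministic hash of the
    logical key, spreading objects across up to 256 distinct prefixes.
    Deterministic hashing keeps the physical path reconstructible from the
    logical (partition, file) pair without extra metadata."""
    digest = hashlib.md5(f"{partition}/{file_name}".encode()).hexdigest()
    return f"{base}/{digest[:2]}/{partition}/{file_name}"

print(hive_style_path("s3://bucket/table", "dt=2022-08-18", "f1.parquet"))
print(hashed_prefix_path("s3://bucket/table", "dt=2022-08-18", "f1.parquet"))
```

Listing a partition under such a layout requires either probing each hash bucket or consulting a metadata table that maps logical files to physical locations, which is one reason the RFC pairs the layout change with a pluggable storage-strategy interface.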