DevX, Test infra Rgdn

2020-08-30 Thread Sivabalan
As Hudi matures as a project, we need to get our devX and test infra rock
solid: availability of test utils and base classes for ease of writing more
tests, stable integration tests, ease of debuggability, micro benchmarks,
performance test infra, automated checkstyle formatting, nightly snapshot
builds, and so on.

We have identified and categorized these into different areas as below.

- Test fixes and some clean up. // There are a lot of Jira tickets
lying around in this section.
- Test refactoring. // For ease of development and to reduce clutter, we
need to refactor the test infra: more test utils, base classes, etc.
- More tests to improve coverage in some areas.
- CI stability and ease of debugging integration tests.
- Checkstyle, slf4j, warnings, spotless, etc.
- Micro benchmarks. // Add a benchmarking framework to Hudi, then
identify regressions on any key paths.
- Long running test suite
- Config clean ups in hudi client
- Perf test environment
- Nightly builds

As we plan out the work in each of these sections, we are looking for help
from the community in getting it done. The plan is to put together a few
umbrella tickets for each of these areas, each with a coordinator who has
expertise in that area of interest. The coordinator will plan out the work
in their respective area and will help drive the initiative with help from
the community, depending on who volunteers to help out.

I understand the list is huge. Some work areas are well defined and should
get done if we allocate enough time and resources, but some are exploratory
in nature and need an initial push to get the ball rolling.

Very likely some of the work items here will be well defined and easy for
new folks to contribute to. We do not have a target timeframe in mind (as
we had 1 month for the bug bash), but we would like to get the concrete
work items done in decent time and have the others ready by the next major
release (e.g., the perf test env), depending on resources.

Let us know if you would be interested in helping our community in this
regard.

-- 
Regards,
-Sivabalan


[ANNOUNCE] Hudi Community Weekly Update(2020-08-23 ~ 2020-08-30)

2020-08-30 Thread leesf
Dear community,

Nice to share the Hudi community weekly update for 2020-08-23 ~ 2020-08-30,
with updates on discussions, features, and bugfixes.

===
Discussion

[Release] Hudi 0.6.0 has been released; it contains many features and
bugfixes [1]


===
Features

[Writer Core] Add option to configure different path selector [2]
[Writer Core] Add back findInstantsAfterOrEquals to the HoodieTimeline
class. [3]
[Writer Core] Make timeline server timeout settings configurable [4]
[Writer Common] Add incremental meta client API to query partitions
modified in a time window [5]
[Writer Core] Tune buffer sizes for the disk-based external spillable map [6]
[Build] Specify version information for each component separately [7]
[Core] Add utility method to query extra metadata [8]


===
Bugs

[Writer Core] Fix "unable to parse input partition field :1" exception when
using TimestampBasedKeyGenerator [9]
[Writer Core] Fix ComplexKeyGenerator for non-partitioned tables [10]
[Release] Fix release validate script for rc_num and release_type [11]
[Core] Fix: Avro Date logical type not handled correctly when converting to
Spark Row [12]


===
Tests

[DOCS] Add java doc for the test classes of hudi test suite [13]


[1]
https://lists.apache.org/thread.html/rb62934ceff46fc15800afa1947b15fa6f62c15d90c48fd56940a874d%40%3Cdev.hudi.apache.org%3E
[2] https://issues.apache.org/jira/browse/HUDI-1137
[3] https://issues.apache.org/jira/browse/HUDI-1136
[4] https://issues.apache.org/jira/browse/HUDI-1135
[5] https://issues.apache.org/jira/browse/HUDI-1191
[6] https://issues.apache.org/jira/browse/HUDI-1131
[7] https://issues.apache.org/jira/browse/HUDI-978
[8] https://issues.apache.org/jira/browse/HUDI-1228
[9] https://issues.apache.org/jira/browse/HUDI-1150
[10] https://issues.apache.org/jira/browse/HUDI-1226
[11] https://issues.apache.org/jira/browse/HUDI-1056
[12] https://issues.apache.org/jira/browse/HUDI-1225
[13] https://issues.apache.org/jira/browse/HUDI-532


Best,
Leesf


Hudi Writer vs Spark Parquet Writer - Sync

2020-08-30 Thread Kizhakkel Jose, Felix
Hello All,

Hive has the bucketBy feature, and Spark is going to add support for
Hive-style bucketBy for data sources; once it is implemented, it will
largely benefit read performance. Since Hudi takes its own path while
writing parquet data, are we planning to add bucketBy functionality? Spark
keeps adding writer features that improve read performance, so with Hudi
having a different writer, are we keeping track of these new Spark
features, so that the Hudi writer does not greatly differ from the Spark
file (parquet) writer or lack features?
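For reference, the Spark writer API in question is DataFrameWriter#bucketBy,
which today only takes effect with saveAsTable. A minimal sketch of the
existing Spark behavior (the path, table, and column names are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BucketByDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bucketby-demo")
        .master("local[2]")
        .enableHiveSupport()          // bucketBy only works with saveAsTable, i.e. a catalog table
        .getOrCreate();

    // Illustrative input; the path and column names are made up.
    Dataset<Row> df = spark.read().parquet("/tmp/events");

    df.write()
        .bucketBy(8, "user_id")       // hash rows into 8 buckets by user_id
        .sortBy("user_id")            // sort within each bucket to speed up joins/reads
        .format("parquet")
        .saveAsTable("events_bucketed");

    spark.stop();
  }
}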

Regards,
Felix K Jose





[DISCUSS] Introduce incremental processing API in Hudi

2020-08-30 Thread vino yang
Hi everyone,


For a long time in the big data field, people have hoped that their tools
could better unlock the processing and analysis capabilities of big data.
At present, from an API perspective, Hudi mostly provides APIs related to
data ingestion and relies on the various big data query engines to unlock
capabilities on the query side, but it does not provide a convenient API
for processing the data after a transactional write.

Currently, if a user wants to process the incremental data of a commit
that has just completed, they need to go through three steps:

   1. Write data to a Hudi table;
   2. Query or check completion of the commit;
   3. After the data is committed, find the data via an incremental query,
      and then process it (see the sketch after this list).

If you want a quicker loop here, you can use Hudi's recently added write
commit callback to simplify this into two steps:

   1. Write data to a Hudi table;
   2. From the write commit callback, trigger an incremental query to find
      out the new data, and then perform the processing (a callback sketch
      follows this list).
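As a hedged sketch of step 2, assuming the HoodieWriteCommitCallback hook
that shipped in 0.6.0 (IncrementalJobLauncher is a made-up helper standing
in for the incremental query shown earlier):

import org.apache.hudi.callback.HoodieWriteCommitCallback;
import org.apache.hudi.callback.common.HoodieWriteCommitCallbackMessage;

// Registered via hoodie.write.commit.callback.class (with
// hoodie.write.commit.callback.on=true), this is invoked by the write
// client after each successful commit.
public class IncrementalQueryTriggerCallback implements HoodieWriteCommitCallback {
  @Override
  public void call(HoodieWriteCommitCallbackMessage message) {
    // IncrementalJobLauncher is hypothetical: it would run the incremental
    // read shown earlier, beginning from the commit that just completed.
    IncrementalJobLauncher.launch(message.getBasePath(), message.getCommitTime());
  }
}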


However, it is still quite cumbersome to split this into two steps for
scenarios that want more timely and efficient analysis on the data ingest
pipeline. Therefore, I propose to merge the entire process into one step
and provide a set of incremental (or, say, pipelined) processing APIs on
top of it:

Write the data to a Hudi table and, after obtaining the written data as a
JavaRDD, directly apply a user-defined function (UDF) to process it. The
processing behavior can be described via these two steps:


   1. Conventional conversions such as Map/Filter/Reduce;
   2. Aggregation calculation based on a fixed time window.


And these calculation functions should be engine independent. Therefore, I
plan to introduce some new APIs that allow users to directly define
incremental processing after each write operation; a sketch of the intended
function shape follows.
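To make "engine independent" concrete, here is a hedged sketch of what such
a function type might look like (HudiMapFunction appears in the signatures
below but does not exist in Hudi today):

import java.io.Serializable;

// Hypothetical engine-independent function contract: it depends only on
// the input/output record types, not on Spark or Flink classes, so each
// engine could wrap it in its own native function type.
public interface HudiMapFunction<I, O> extends Serializable {
  O map(I input) throws Exception;
}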

The preliminary idea is to introduce a tool class, for example named
IncrementalProcessingBuilder or PipelineBuilder, which could be used like
this:

IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();

builder.source()       // source table
    .transform()
    .sink()            // derived table
    .build();

IncrementalProcessingBuilder#mapAfterInsert(JavaRDD<HoodieRecord> records,
HudiMapFunction mapFunction);

IncrementalProcessingBuilder#mapAfterUpsert(JavaRDD<HoodieRecord> records,
HudiMapFunction mapFunction);

IncrementalProcessingBuilder#filterAfterInsert(JavaRDD<HoodieRecord> records,
HudiFilterFunction filterFunction);

// window function

IncrementalProcessingBuilder#aggregateAfterInsert(JavaRDD<HoodieRecord> records,
HudiAggregateFunction aggFunction);
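Putting the pieces together, a hedged usage sketch (CountPerWindowFunction
stands in for a hypothetical HudiAggregateFunction, and the argument shapes
are guesses, since the proposal leaves them open):

import java.time.Duration;

// Hypothetical end-to-end usage of the proposed API; none of these classes
// exist in Hudi yet. Each successful insert into the source table would
// trigger a five-minute fixed-window count over the freshly written
// records, with the result flowing into a derived table.
IncrementalProcessingBuilder builder = new IncrementalProcessingBuilder();
builder.source("source_table")
    .transform(new CountPerWindowFunction(Duration.ofMinutes(5)))
    .sink("derived_table")
    .build();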

This is suitable for scenarios where the commit interval (window) is
moderate and ingestion latency is not a major concern.


What do you think? Looking forward to your thoughts and opinions.


Best,

Vino