Re: Apache Hudi products available for purchase in Redbubble store

2022-06-27 Thread Rubens Rodrigues
Great, do you ship to other countries?

On Mon, Jun 27, 2022 at 7:06 PM Bhavani Sudha wrote:

> Dear Hudi users,
>
> Recently we added our logo to the Redbubble store. Check out the link
> https://www.redbubble.com/shop/ap/113207590 to shop for more Apache Hudi
> merchandise if you are interested :)
>
> Thanks,
> Sudha
>


Re: Regular minor/patch releases

2021-12-14 Thread Rubens Rodrigues
+1

I think it would be really good for users.

On Tue, Dec 14, 2021 at 10:01 PM Y Ethan Guo wrote:

> +1 on packing bug fixes (best effort) into minor releases.
>
> On Mon, Dec 13, 2021 at 12:06 PM Sivabalan  wrote:
>
> > +1 in general. But yeah, not sure if we have the resources to do this for
> > every major release.
> >
> > On Mon, Dec 13, 2021 at 10:01 AM Vinoth Chandar wrote:
> >
> >> Hi all,
> >>
> >> In the past we had plans for minor releases [1], but invariably we end up
> >> doing major ones, which also deliver the bug fixes.
> >>
> >> The reason was the cost involved in doing a release. We have made some
> >> good progress towards regression/integration testing, which prompts me
> >> to revive this.
> >>
> >> What does everyone think about a monthly bugfix release on the last
> >> major/minor version? (Not on every major release; we still don't have
> >> enough contributors to pull that off, IMO.) So we would be trying to do
> >> a 0.10.1 in early Jan, for example, in this model?
> >>
> >> [1] https://cwiki.apache.org/confluence/display/HUDI/Release+Management
> >>
> >> Thanks
> >> Vinoth
> >>
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [ANNOUNCE] Apache Hudi 0.9.0 released

2021-09-03 Thread Rubens Rodrigues
Hello

I'm from Brazil and I have been following Hudi since version 0.5.
Congratulations to everyone; Hudi's evolution in only one year is impressive.

My team and I are very happy to have chosen Hudi for our data lake.

Thank you so much for this wonderful work.

On Fri, Sep 3, 2021 at 10:57 PM Raymond Xu wrote:

> Congrats! Another awesome release.
>
> On Wed, Sep 1, 2021 at 11:49 AM Pratyaksh Sharma wrote:
>
> > Great news! This one really feels like a major release with so many good
> > features getting added. :)
> >
> > On Wed, Sep 1, 2021 at 7:19 AM Udit Mehrotra  wrote:
> >
> > > The Apache Hudi team is pleased to announce the release of Apache Hudi
> > > 0.9.0.
> > >
> > > This release comes almost 5 months after 0.8.0. It includes 387
> > > resolved issues, comprising new features as well as
> > > general improvements and bug-fixes. Here are a few quick highlights:
> > >
> > > *Spark SQL DML and DDL Support*
> > > We have added experimental support for DDL/DML using Spark SQL, taking
> > > a huge step towards making Hudi more easily accessible and operable by
> > > all personas (non-engineers, analysts, etc.). Users can now use SQL
> > > statements like "CREATE TABLE ... USING HUDI" and "CREATE TABLE .. AS
> > > SELECT" to create/manage tables in catalogs like Hive, and "INSERT",
> > > "INSERT OVERWRITE", "UPDATE", "MERGE INTO" and "DELETE" statements to
> > > manipulate data.
> > > For more information, check out our docs here, clicking on the SparkSQL
> > > tab.
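> > >
> > > For a concrete picture, here is a minimal sketch from a Scala
> > > spark-shell (the table and column names below are made up for
> > > illustration):
> > >
> > >   spark.sql("""
> > >     CREATE TABLE hudi_trips (
> > >       uuid STRING, rider STRING, fare DOUBLE, ts BIGINT
> > >     ) USING HUDI
> > >     OPTIONS (primaryKey = 'uuid', preCombineField = 'ts')
> > >   """)
> > >   spark.sql("INSERT INTO hudi_trips VALUES ('id-1', 'rider-A', 19.10, 1)")
> > >   spark.sql("UPDATE hudi_trips SET fare = 25.0 WHERE uuid = 'id-1'")
> > >   spark.sql("DELETE FROM hudi_trips WHERE uuid = 'id-1'")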
> > >
> > > *Query Side Improvements*
> > > Hudi tables are now registered with Hive as Spark datasource tables,
> > > meaning Spark SQL on these tables now uses the datasource as well,
> > > instead of relying on the Hive fallbacks within Spark, which are
> > > ill-maintained/cumbersome. This unlocks many optimizations such as the
> > > use of Hudi's own FileIndex
> > > <https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieFileIndex.scala#L46>
> > > implementation for optimized caching and the use of the Hudi metadata
> > > table, for faster listing of large tables. We have also added support
> > > for time travel queries for the Spark datasource.
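> > >
> > > As a rough sketch of a time travel read (the path and instant are
> > > placeholders, assuming the "as.of.instant" read option; adjust to your
> > > setup):
> > >
> > >   // Read the table as of a given commit instant (yyyyMMddHHmmss)
> > >   val asOfDf = spark.read.format("hudi").
> > >     option("as.of.instant", "20210901000000").
> > >     load("/data/hudi_trips")
> > >   asOfDf.show()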
> > >
> > > *Writer Side Improvements*
> > > This release has several major writer side improvements. Virtual key
> > > support has been added to avoid populating meta fields and to leverage
> > > existing fields to populate record keys and partition paths.
> > > The Bulk Insert operation using the row writer is now enabled by
> > > default for faster inserts.
> > > Hudi's automatic cleaning of uncommitted data has been enhanced to be
> > > performant over cloud stores. You can learn more about this new
> > > centrally coordinated marker mechanism in this blog.
> > > Async Clustering support has been added to both DeltaStreamer and the
> > > Spark Structured Streaming sink. More on this can be found in this blog.
> > > Users can choose to drop fields used to generate partition paths.
> > > Added support for a new write operation, "delete_partition", in Spark.
> > > Users can leverage this to delete older partitions in bulk, in addition
> > > to record level deletes.
> > > Added support for Huawei Cloud Object Storage, Baidu AFS storage, and
> > > Baidu BOS storage in Hudi.
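> > >
> > > A minimal sketch of such a bulk partition delete from Spark (the path,
> > > field names and partition value are hypothetical; double-check the
> > > option keys against the release configs):
> > >
> > >   import org.apache.spark.sql.SaveMode
> > >   // No records are read from the frame, so an empty one is enough
> > >   spark.emptyDataFrame.write.format("hudi").
> > >     option("hoodie.table.name", "hudi_trips").
> > >     option("hoodie.datasource.write.operation", "delete_partition").
> > >     option("hoodie.datasource.write.partitions.to.delete", "2021/08/01").
> > >     option("hoodie.datasource.write.recordkey.field", "uuid").
> > >     option("hoodie.datasource.write.precombine.field", "ts").
> > >     option("hoodie.datasource.write.partitionpath.field", "date").
> > >     mode(SaveMode.Append).
> > >     save("/data/hudi_trips")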
> > > A pre-commit validator framework
> > > <https://github.com/apache/hudi/blob/bf5a52e51bbeaa089995335a0a4c55884792e505/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SparkPreCommitValidator.java>
> > > has been added for the Spark engine, which can be used for
> > > DeltaStreamer and Spark Datasource writers. Users can leverage this to
> > > add any validations to be executed before committing writes to Hudi.
> > > A few out-of-the-box validators are available, like
> > > SqlQueryEqualityPreCommitValidator
> > > <https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryEqualityPreCommitValidator.java>,
> > > SqlQueryInequalityPreCommitValidator
> > > <https://github.com/apache/hudi/blob/release-0.9.0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQueryInequalityPreCommitValidator.java>
> > > and SqlQuerySingleResultPreCommitValidator
> > > <https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/validator/SqlQuerySingleResultPreCommitValidator.java>.
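> > >
> > > To sketch how one of these could be wired in (assuming an existing
> > > DataFrame df and the config keys below; verify them against the release
> > > docs before relying on this):
> > >
> > >   // Commit only if the post-write query returns the expected value;
> > >   // the "#0" suffix encodes the expected result of the query
> > >   df.write.format("hudi").
> > >     option("hoodie.table.name", "hudi_trips").
> > >     option("hoodie.datasource.write.recordkey.field", "uuid").
> > >     option("hoodie.datasource.write.precombine.field", "ts").
> > >     option("hoodie.precommit.validators",
> > >       "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
> > >     option("hoodie.precommit.validators.single.value.sql.queries",
> > >       "select count(*) from <TABLE_NAME> where fare < 0#0").
> > >     mode("append").
> > >     save("/data/hudi_trips")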
> > >
> > > *Flink Integration Improvements*
> > > The Flink writer now supports propagation of 

Re: Amazon Athena expands Apache Hudi Support

2021-07-16 Thread Rubens Rodrigues
Great news for the Hudi community.

On Fri, Jul 16, 2021 at 11:07 PM Udit Mehrotra wrote:

> Hi Folks,
>
> Happy to announce that Amazon Athena has now upgraded to the latest Hudi
> 0.8.0 release. In addition, Athena now supports two additional features:
>
> - Snapshot/Real time query support for Merge on Read tables
> - Query support for tables created with the *BOOTSTRAP* operation
>
> Following is the public documentation for the new features:
>
> - What’s new:
>   https://aws.amazon.com/about-aws/whats-new/2021/07/amazon-athena-expands-apache-hudi-support/
> - Updated Athena Hudi usage AWS doc:
>   https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html
>
> Thanks,
> Udit Mehrotra
> SDE | AWS EMR
>


Re: [DISCUSS] Hudi is the data lake platform

2021-04-13 Thread Rubens Rodrigues
Excellent, I agree

On Tue, Apr 13, 2021 at 7:23 AM vino yang wrote:

> +1 Excited by this new vision!
>
> Best,
> Vino
>
> On Tue, Apr 13, 2021 at 3:53 PM Dianjin Wang wrote:
>
> > +1. The new brand is straightforward, a better description of Hudi.
> >
> > Best,
> > Dianjin Wang
> >
> >
> > On Tue, Apr 13, 2021 at 1:41 PM Bhavani Sudha wrote:
> >
> > > +1. Cannot agree more. I think this makes total sense and will provide
> > > for a much better representation of the project.
> > >
> > > On Mon, Apr 12, 2021 at 10:30 PM Vinoth Chandar wrote:
> > >
> > > > Hello all,
> > > >
> > > > Reading one more article today positioning Hudi as just a table
> > > > format made me wonder if we have done enough justice in explaining
> > > > what we have built together here.
> > > > I tend to think of Hudi as the data lake platform, which has the
> > > > following components, of which one is a table format and one is a
> > > > transactional storage layer.
> > > > But the whole stack we have is definitely worth more than the sum of
> > > > all the parts IMO (speaking from my own experience from the past 10+
> > > > years of open source software dev).
> > > >
> > > > Here's what we have built so far.
> > > >
> > > > a) *table format*: something that stores table schema, a metadata
> > > > table that stores file listings today, and is being extended to store
> > > > column ranges and more in the future (RFC-27)
> > > > b) *aux metadata*: bloom filters and external record level indexes
> > > > today; bitmaps/interval trees and other advanced on-disk data
> > > > structures tomorrow
> > > > c) *concurrency control*: we always supported MVCC-based, log-based
> > > > concurrency (serialize writes into a time ordered log), and we now
> > > > also have OCC for batch merge workloads with 0.8.0. We will have
> > > > multi-table and fully non-blocking writers soon (see the future work
> > > > section of RFC-22)
> > > > d) *updates/deletes*: this is the bread-and-butter use-case for Hudi,
> > > > but we support primary/unique key constraints, and we could add
> > > > foreign keys as an extension once our transactions can span tables.
> > > > e) *table services*: a Hudi pipeline today is self-managing - it
> > > > sizes files, cleans, compacts, clusters data, and bootstraps existing
> > > > data - all these actions working off each other without blocking one
> > > > another (for the most part).
> > > > f) *data services*: we also have higher level functionality with
> > > > DeltaStreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > > callbacks; pre-commit validations are coming, and error tables have
> > > > been proposed. I could also envision us building towards streaming
> > > > egress and data monitoring.
> > > >
> > > > I also think we should build the following (subject to separate
> > > > DISCUSS/RFCs)
> > > >
> > > > g) *caching service*: a Hudi-specific caching service that can hold
> > > > mutable data and serve oft-queried data across engines.
> > > > h) *timeline metaserver*: we already run a metaserver in Spark
> > > > writers/drivers, backed by RocksDB & even Hudi's metadata table.
> > > > Let's turn it into a scalable, sharded metastore that all engines can
> > > > use to obtain any metadata.
> > > >
> > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> > > > opposed to "ingests & manages storage of large analytical datasets
> > > > over DFS (hdfs or cloud stores)", and convey the scope of our vision,
> > > > given we have already been building towards that. It would also
> > > > provide new contributors a good lens to look at the project from.
> > > >
> > > > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > > > system to an event streaming platform - with the addition of
> > > > MirrorMaker/Connect etc.)
> > > >
> > > > Please share your thoughts!
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>


Re: Community Sync Meeting

2021-02-11 Thread Rubens Rodrigues
Hi guys,

My English is not the best, but I will be happy to join as a listener at
the next meeting! :)

On Thu, Feb 11, 2021 at 10:47 PM Vinoth Chandar wrote:

> Makes sense. We were actually writing down the next meeting date on the
> wiki. But beyond that, a calendar is a great idea.
>
> Would love for more folks to chime in though, before we can make a call
>
> On Thu, Feb 11, 2021 at 2:44 AM Pratyaksh Sharma wrote:
>
> > Absolutely, I am fine with biweekly meetings. Was just sharing my
> > thoughts. :)
> >
> > On Thu, Feb 11, 2021 at 11:19 AM Raymond Xu wrote:
> >
> > > To clarify my thought, I still think biweekly is a good cadence, just
> > > that it can be made easier for people to track using some tools.
> > >
> > > >> On Feb 10, 2021, at 9:22 PM, Pratyaksh Sharma wrote:
> > > >
> > > > +1
> > > >
> > > > It was easier to attend the meetings when we had them on a regular
> > > > basis. Now if someone misses one meeting, they are prone to losing
> > > > track of when the next meeting is. :)
> > > >
> > > >> On Thu, Feb 11, 2021 at 12:44 AM Raymond Xu <xu.shiyan.raym...@gmail.com> wrote:
> > > >>
> > > >> Vinoth, I think this could be caused by the extra step of checking
> > > >> whether the current week has a meeting or not.
> > > >>
> > > >> It could be helpful to share a calendar where all the meetings are
> > > >> listed and which people can subscribe to via their calendar apps.
> > > >> Perhaps a scheduled reminder in the Slack channel to auto-notify
> > > >> people about this would also help.
> > > >>
> > > >>> On Tue, Feb 9, 2021 at 8:11 PM Vinoth Chandar wrote:
> > > >>>
> > > >>> Folks,
> > > >>>
> > > >>> As you know, there was a weekly community sync meeting that was
> > > >>> pretty well attended for over a year, until we switched to
> > > >>> bi-weekly due to covid/timings and whatnot. The past few meetings
> > > >>> have had low attendance.
> > > >>>
> > > >>> https://cwiki.apache.org/confluence/display/HUDI/Apache+Hudi+Community+Bi-Weekly+Sync
> > > >>>
> > > >>> We are not sure if the new cadence or Zoom (as opposed to Google
> > > >>> Meet before) is the issue here. Personally for me, the timings have
> > > >>> been pretty hard with COVID/kidcare. Not sure if others are in the
> > > >>> same boat. So should we change timings?
> > > >>>
> > > >>> What can we do to fix this? Or, if y'all don't see value in the
> > > >>> meeting, we can also scrap it.
> > > >>>
> > > >>> Please chime in with your thoughts.
> > > >>>
> > > >>> Thanks
> > > >>> Vinoth
> > > >>>
> > > >>
> > >
> >
>


Re: [DISCUSS] Improve data locality during ingestion

2021-02-09 Thread Rubens Rodrigues
Hi guys,

Talking about my use case...

I have datasets where ordering data by date makes a lot of sense, or
ordering by some id, to have fewer touched files on merge operations.
When I used Delta Lake, I would always bootstrap tables ordered by one of
these fields, and it helps a lot with file pruning.

Hudi clustering does this job, but I think it is an unnecessary extra step
to do after bulk insert, because all the data will need to be rewritten
again.



On Tue, Feb 9, 2021 at 9:53 PM Vinoth Chandar wrote:

> Hi Satish,
>
> I've been meaning to respond to this. I think I like the idea overall.
>
> Here's (hopefully) my understanding; let me know if I am getting this
> right.
>
> Predominantly, we are just talking about the problem of where we send the
> "inserts".
>
> Today the upsert partitioner does the file sizing/bin-packing etc. for
> inserts and then sends some inserts over to existing file groups to
> maintain file size.
> We can abstract all of this into strategies and some kind of pipeline
> abstractions, and have it also consider "affinity" to an existing file
> group based on, say, information stored in the metadata table?
>
> I think this is complementary to what we do today and can be helpful. The
> first thing, maybe, is to abstract the existing write pipeline as a series
> of "optimization" stages and bring things like file sizing under that.
>
> On bucketing, I am not against Hive bucketing or anything. But with record
> level indexes and the granular/micro partitions that we can achieve using
> clustering, is it still the most efficient design? That's a question I
> would love to find answers for. I never liked the static/hash partitioning
> based schemes in bucketing; they introduce a lot of manual data munging if
> things change.
>
> Thanks
> Vinoth
>
>
>
> On Wed, Feb 3, 2021 at 5:17 PM Satish Kotha wrote:
>
> > I got some feedback that this thread may be a bit complex to understand.
> > So I tried to simplify the proposal below:
> >
> > Users can already specify 'partitionpath' using this config
> > <https://hudi.apache.org/docs/configurations.html#PARTITIONPATH_FIELD_OPT_KEY>
> > when writing data. My proposal is that we also give users the ability to
> > identify (or hint at) the 'fileId' while writing the data. For example,
> > users can say 'locality.columns: session_id'. We deterministically map
> > every session_id to a specific fileGroup in hudi (using hash-modulo or
> > range-partitioning etc.). So all values for a session_id are co-located
> > in the same data/log file.
> >
> > Hopefully, this explains the idea better. Appreciate any feedback.
> >
> > On Mon, Feb 1, 2021 at 3:43 PM Satish Kotha wrote:
> >
> > > Hello,
> > >
> > > Clustering is a great feature for improving data locality. But it has
> > > a (relatively big) cost to rewrite the data after ingestion. I think
> > > there are other ways to improve data locality during ingestion. For
> > > example, we can add a new Index (or partitioner) that reads values for
> > > columns that are important from a data locality perspective. We could
> > > then compute a hash modulo on the value and use that to
> > > deterministically identify the file group that the record has to be
> > > written into.
> > >
> > > A more detailed example:
> > > Assume we introduce 2 new configs:
> > > hoodie.datasource.write.num.file.groups: "N" # Controls the total
> > > number of file Ids allowed (per partition).
> > >
> > > hoodie.datasource.write.locality.columns: "session_id,timestamp"
> > > # Identifies columns that are important for data locality.
> > >
> > > (I can come up with better names for the configs if the general idea
> > > sounds good.)
> > >
> > > During ingestion, we generate 'N' fileIds for each partition (if that
> > > partition already has K fileIds, we generate N-K new fileIds). Let's
> > > say these fileIds are stored in a fileIdList data structure. For each
> > > row, we compute 'hash(row.get(session_id)+row.get(timestamp)) % N'.
> > > This value is used as the index into the fileIdList data structure to
> > > deterministically identify the file group for the row.
> > >
> > > This improves data locality by ensuring rows with a given key value are
> > > stored in the same file. This hashing could be done in two places:
> > > 1) A custom index that tags the location for each row based on values
> > > for 'session_id+timestamp'.
> > > 2) A new partitioner that assigns buckets for each row based on values
> > > for 'session_id+timestamp' (see the sketch below).
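> > >
> > > To make the routing concrete, a minimal sketch of the hash-modulo step
> > > (an illustration only, not actual Hudi code):
> > >
> > >   // Deterministically route a record to one of the N file groups based
> > >   // on the configured locality column values.
> > >   def assignFileGroup(localityValues: Seq[String],
> > >                       fileIdList: IndexedSeq[String]): String = {
> > >     // floorMod keeps the bucket non-negative even for negative hashes
> > >     val bucket = Math.floorMod(localityValues.mkString("+").hashCode,
> > >                                fileIdList.size)
> > >     fileIdList(bucket)
> > >   }
> > >
> > >   // e.g. assignFileGroup(Seq(sessionId, ts.toString), fileIds)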
> > >
> > > *Advantages:*
> > > 1) No need to rewrite data to improve data locality.
> > > 2) Integrates well with Hive bucketing (Spark is also adding support
> > > for Hive bucketing).
> > > 3) This reduces scan cycles to find a particular key, because it
> > > ensures that the key is present in a certain fileId. Similarly, joining
> > > across multiple tables