Hi guys,

Talking about my use case: I have datasets where ordering data by date makes a lot of sense, or ordering by some id so that fewer files are touched on merge operations. With Delta Lake I used to bootstrap tables already ordered by one of these fields, which helps a lot with file pruning. Hudi clustering does this job, but I think it is an unnecessary extra step after a bulk insert, because all the data will need to be rewritten again.

On Tue, Feb 9, 2021, 21:53, Vinoth Chandar <vin...@apache.org> wrote:

> Hi Satish,
>
> Been meaning to respond to this. I like the idea overall.
>
> Here's my understanding; let me know if I am getting this right.
>
> Predominantly, we are just talking about the problem of: where do we send
> the "inserts"?
>
> Today the upsert partitioner does the file sizing/bin-packing etc. for
> inserts and then sends some inserts over to existing file groups to
> maintain file size. We can abstract all of this into strategies and some
> kind of pipeline abstractions, and have it also consider "affinity" to an
> existing file group based on, say, information stored in the metadata
> table?
>
> I think this is complementary to what we do today and can be helpful. The
> first thing may be to abstract the existing write pipeline as a series of
> "optimization" stages and bring things like file sizing under that.
>
> On bucketing, I am not against Hive bucketing or anything. But with record
> level indexes and the granular/micro partitions that we can achieve using
> clustering, is it still the most efficient design? That's a question I
> would love to find answers to. I never liked the static/hash partitioning
> based schemes in bucketing; they introduce a lot of manual data munging if
> things change.
>
> Thanks
> Vinoth
>
>
> On Wed, Feb 3, 2021 at 5:17 PM Satish Kotha <satishko...@uber.com.invalid>
> wrote:
>
> > I got some feedback that this thread may be a bit complex to understand.
> > So I tried to simplify the proposal to the following:
> >
> > Users can already specify 'partitionpath' using this config
> > <https://hudi.apache.org/docs/configurations.html#PARTITIONPATH_FIELD_OPT_KEY>
> > when writing data. My proposal is that we also give users the ability to
> > identify (or hint at) the 'fileId' while writing the data. For example,
> > users can say 'locality.columns: session_id'. We deterministically map
> > every session_id to a specific file group in Hudi (using hash-modulo or
> > range partitioning, etc.), so all values for a session_id are co-located
> > in the same data/log file.
> >
> > Hopefully this explains the idea better. Appreciate any feedback.
> >
> > On Mon, Feb 1, 2021 at 3:43 PM Satish Kotha <satishko...@uber.com>
> > wrote:
> >
> > > Hello,
> > >
> > > Clustering <https://hudi.apache.org/blog/hudi-clustering-intro/> is a
> > > great feature for improving data locality. But it has a (relatively
> > > big) cost: rewriting the data after ingestion. I think there are other
> > > ways to improve data locality during ingestion. For example, we can
> > > add a new index (or partitioner) that reads the values of columns that
> > > are important from a data locality perspective. We could then compute
> > > a hash modulo on the value and use that to deterministically identify
> > > the file group that the record has to be written into.
> > >
> > > More detailed example:
> > > Assume we introduce two new configs:
> > >
> > > hoodie.datasource.write.num.file.groups: "N" # Controls the total
> > > number of fileIds allowed (per partition).
> > >
> > > hoodie.datasource.write.locality.columns: "session_id,timestamp"
> > > # Identifies columns that are important for data locality.
> > >
> > > (I can come up with better names for the configs if the general idea
> > > sounds good.)
> > >
> > > During ingestion, we generate 'N' fileIds for each partition (if that
> > > partition already has K fileIds, we generate N-K new fileIds).
> > > Let's say these fileIds are stored in a fileIdList data structure.
> > > For each row, we compute
> > > 'hash(row.get(session_id) + row.get(timestamp)) % N'. This value is
> > > used as the index into the fileIdList data structure to
> > > deterministically identify the file group for the row.
> > >
> > > This improves data locality by ensuring that columns with a given
> > > value are stored in the same file. This hashing could be done in two
> > > places:
> > > 1) A custom index that tags the location for each row based on the
> > > values of 'session_id+timestamp'.
> > > 2) A new partitioner that assigns buckets for each row based on the
> > > values of 'session_id+timestamp'.
> > >
> > > *Advantages:*
> > > 1) No need to rewrite data to improve data locality.
> > > 2) Integrates well with Hive bucketing (Spark is also adding support
> > > for Hive bucketing <https://issues.apache.org/jira/browse/SPARK-19256>).
> > > 3) This reduces scan cycles to find a particular key, because it
> > > ensures that the key is present in a specific fileId. Similarly,
> > > joining across multiple tables would be efficient if they both choose
> > > the same 'locality.columns'.
> > >
> > > *Disadvantages:*
> > > 1) Users need to know the total number of file groups to generate per
> > > partition. This value is assumed to be static for all partitions, so
> > > if significant changes are expected in traffic volume, this may not
> > > partition the data well. (We could also consider making this
> > > configurable per partition, which adds complexity but is feasible.)
> > > 2) This may not be as efficient as clustering. For example, data for
> > > a given column value is guaranteed to be co-located in the same file,
> > > but it may not be in the same block (row group in Parquet), so more
> > > blocks need to be read by query engines.
> > >
> > > Clustering can still be useful for other use cases, such as stitching
> > > files, transforming data for efficiency, etc.
> > > Clustering can also be useful for a few sorting scenarios - e.g., if
> > > users cannot predict a good value for the number of file groups
> > > needed.
> > >
> > > Appreciate any feedback. Let me know if you have other ideas on
> > > improving data locality. If you are interested in this idea and want
> > > to collaborate, please reach out.
> > >
> > > Thanks
> > > Satish
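
The hash-modulo routing Satish describes above can be sketched roughly as
follows. This is a standalone illustration, not Hudi's actual API: the names
`file_id_list`, `locality_hash`, and `file_group_for` are hypothetical, and a
stable hash (MD5 here) stands in for whatever hash a real implementation would
use; N mirrors the proposed 'hoodie.datasource.write.num.file.groups' and the
key columns mirror 'hoodie.datasource.write.locality.columns'.

```python
import hashlib

# Hypothetical sketch of the proposed routing (not Hudi's actual API).
N = 4  # 'hoodie.datasource.write.num.file.groups' in the proposal
file_id_list = [f"file-group-{i}" for i in range(N)]  # the 'fileIdList'

def locality_hash(value: str) -> int:
    # A stable hash (MD5 of the bytes) so the mapping is deterministic
    # across runs and writers; Python's built-in hash() is salted per
    # process and would not give a consistent fileId assignment.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

def file_group_for(row: dict) -> str:
    # hash(row.get(session_id) + row.get(timestamp)) % N, as in the example
    key = str(row["session_id"]) + str(row["timestamp"])
    return file_id_list[locality_hash(key) % N]

# All records with the same session_id+timestamp land in the same file group:
r1 = {"session_id": "s-42", "timestamp": "2021-02-01", "payload": "a"}
r2 = {"session_id": "s-42", "timestamp": "2021-02-01", "payload": "b"}
assert file_group_for(r1) == file_group_for(r2)
```

Note that determinism is the whole point: because the mapping depends only on
the locality-column values and N, any writer (or a reader probing for a key)
can recompute the target file group without consulting an index.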