Hi,

>> And also when you say bulk insert, do you mean Hudi's bulk insert
operation?
No, it does not refer to the bulk_insert operation in Hudi. I think the page
says "bulk load", and it refers to ingesting database tables in full, as
opposed to using Hudi upserts to do it incrementally. Simply put, it's the
difference between fully rewriting your table, as you would in the pre-Hudi
world, and incrementally rewriting only the affected files using Hudi.
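To make the contrast concrete, here is a rough sketch (purely illustrative;
fullDumpDF, changedRowsDF, the paths and field names are all made up, and
the option names should be checked against your Hudi version):

import org.apache.spark.sql.SaveMode;
import com.uber.hoodie.DataSourceWriteOptions;
import com.uber.hoodie.config.HoodieWriteConfig;

// Pre-Hudi "bulk load": re-ingest the full table dump and overwrite everything.
fullDumpDF.write()
    .mode(SaveMode.Overwrite)
    .parquet("/warehouse/my_table");

// Hudi: apply just the changed rows; only the affected file groups are rewritten.
changedRowsDF.write()
    .format("com.uber.hoodie")
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, "my_table")
    .mode(SaveMode.Append)
    .save("/warehouse/my_table_hudi");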

>> Why is it taking so much time for 500 GB of data, and does the data
include changes or is it first-time insert data?
Hudi write performance depends on two things: indexing (which has gotten a
lot faster since that benchmark) and writing parquet files (which depends on
your schema and the CPU cores on the box). And since a Hudi write is a Spark
job, speed also depends on the parallelism you provide. In an ideal case, you
have as much parallelism as there are parquet files (file groups), indexing
takes 1-2 minutes or so, and writing takes another 1-2 minutes. For this
specific dataset, the schema has 1000 columns, so parquet writing is much
slower.
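As a rough illustration of the parallelism knobs (not a tuned recommendation;
inputDF, the table name, the path and the parallelism values below are made
up, and the config keys should be double-checked against your Hudi version):

import org.apache.spark.sql.SaveMode;
import com.uber.hoodie.DataSourceWriteOptions;
import com.uber.hoodie.config.HoodieWriteConfig;

// Aim for write parallelism roughly in the same ballpark as the number of
// file groups (parquet files) you expect the table to have.
inputDF.write()
    .format("com.uber.hoodie")
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, "my_table")
    // shuffle parallelism used by the different write paths (example values)
    .option("hoodie.upsert.shuffle.parallelism", "1500")
    .option("hoodie.insert.shuffle.parallelism", "1500")
    .option("hoodie.bulkinsert.shuffle.parallelism", "1500")
    .mode(SaveMode.Append)
    .save("/warehouse/my_table");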

The Hudi bulk_insert and insert operations are documented in the delta
streamer CLI help. If you know your dataset has no updates, you can issue
insert/bulk_insert instead of upsert to skip the indexing step entirely,
which will gain speed. The difference between insert and bulk_insert is an
implementation detail: insert() caches the input data in memory to do all
the storage file sizing etc., while bulk_insert() uses a sort-based writing
mechanism that can scale to multi-terabyte initial loads.
In short, you do a bulk_insert() to bootstrap the dataset, then insert or
upsert depending on your needs.
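For reference, here is roughly how you would pick the operation from the
dataframe writer (a sketch only; inputDF, the table name and the path are
made up, and the option/value names should be verified against your Hudi
version). With the delta streamer, the equivalent is the --op argument
(UPSERT/INSERT/BULK_INSERT), if I remember the flag correctly.

import org.apache.spark.sql.SaveMode;
import com.uber.hoodie.DataSourceWriteOptions;
import com.uber.hoodie.config.HoodieWriteConfig;

// Initial bootstrap: no updates in the input, so skip the indexing step entirely.
inputDF.write()
    .format("com.uber.hoodie")
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY(),
            DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, "my_table")
    .mode(SaveMode.Append)
    .save("/warehouse/my_table");

// Later runs: use INSERT_OPERATION_OPT_VAL() for append-only data, or
// UPSERT_OPERATION_OPT_VAL() (the default) once the input contains updates.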

For your specific use case, if you can share the Spark UI, I or someone else
here can take a look and see if there is scope to make it go faster.

/thanks/vinoth

On Wed, Jul 10, 2019 at 1:26 PM Netsanet Gebretsadkan <net22...@gmail.com>
wrote:

> Dear Vinoth,
>
> I want to try to check out the performance comparison of Hudi upsert and
> bulk insert. In the Hudi documentation, specifically the performance
> comparison section https://hudi.apache.org/performance.html#upserts ,
> which compares bulk insert and upsert, it's showing that it takes
> about 17 min for upserting 20 TB of data and 22 min for ingesting 500 GB
> of data. Why is it taking so much time for 500 GB of data, and does the data
> include changes or is it first-time insert data? I assumed it is data to be
> inserted for the first time, since you made the comparison with bulk insert.
>
>  And also when you say bulk insert, do you mean Hudi's bulk insert
> operation? If so, what is the difference with Hudi's upsert operation? In
> addition to this, the latency of ingesting 6 GB of data is 25 minutes with
> the cluster I provided. How can I improve this?
>
> Thanks for your consideration.
>
> Kind regards,
>
> On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <net22...@gmail.com>
> wrote:
>
> > Thanks Vbalaji.
> > I will check it out.
> >
> > Kind regards,
> >
> > On Sat, Jun 22, 2019 at 3:29 PM vbal...@apache.org <vbal...@apache.org>
> > wrote:
> >
> >>
> >> Here is the correct gist link :
> >> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
> >>
> >>
> >>     On Saturday, June 22, 2019, 6:08:48 AM PDT, vbal...@apache.org <
> >> vbal...@apache.org> wrote:
> >>
> >>   Hi,
> >> I have given a sample command to set up and run deltastreamer in
> >> continuous mode and ingest fake data in the following gist
> >> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
> >>
> >> We will eventually get this to project wiki.
> >> Balaji.V
> >>
> >>     On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <
> >> net22...@gmail.com> wrote:
> >>
> >>  @Vinoth, Thanks , that would be great if Balaji could share it.
> >>
> >> Kind regards,
> >>
> >>
> >> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <vin...@apache.org>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > We usually test with our production workloads.. However, balaji
> recently
> >> > merged a DistributedTestDataSource,
> >> >
> >> >
> >>
> https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
> >> >
> >> >
> >> > that can generate some random data for testing..  Balaji, do you mind
> >> > sharing a command that can be used to kick something off like that?
> >> >
> >> >
> >> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <
> >> net22...@gmail.com>
> >> > wrote:
> >> >
> >> > > Dear Vinoth,
> >> > >
> >> > > I want to try to check out the performance comparison of upsert and
> >> bulk
> >> > > insert.  But i couldn't find a clean data set more than 10 GB.
> >> > > Would it be possible to get a data set from Hudi team? For example i
> >> was
> >> > > using the stocks data that you provided on your demo. Hence, can i
> get
> >> > > more GB's of that dataset for my experiment?
> >> > >
> >> > > Thanks for your consideration.
> >> > >
> >> > > Kind regards,
> >> > >
> >> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <vin...@apache.org>
> >> wrote:
> >> > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> >> > > >
> >> > > > Just circling back with the resolution on the mailing list as
> well.
> >> > > >
> >> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
> >> > net22...@gmail.com
> >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > Dear Vinoth,
> >> > > > >
> >> > > > > Thanks for your fast response.
> >> > > > > I have created a new issue, "Performance Comparison of
> >> > > > > HoodieDeltaStreamer and DataSourceAPI" (#714), with the screenshots
> >> > > > > of the Spark UI, which can be found at the following link:
> >> > > > > https://github.com/apache/incubator-hudi/issues/714.
> >> > > > > In the UI, it seems that the ingestion with the DataSource API is
> >> > > > > spending much time in the count by key of HoodieBloomIndex and the
> >> > > > > workload profile. Looking forward to receiving insights from you.
> >> > > > >
> >> > > > > Kind regards,
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <
> vin...@apache.org>
> >> > > wrote:
> >> > > > >
> >> > > > > > Hi,
> >> > > > > >
> >> > > > > > Both datasource and deltastreamer use the same APIs
> underneath.
> >> So
> >> > > not
> >> > > > > > sure. If you can grab screenshots of spark UI for both and
> open
> >> a
> >> > > > ticket,
> >> > > > > > glad to take a look.
> >> > > > > >
> >> > > > > > On 2, well one of goals of Hudi is to break this dichotomy and
> >> > enable
> >> > > > > > streaming style (I call it incremental processing) of
> processing
> >> > even
> >> > > > in
> >> > > > > a
> >> > > > > > batch job. MOR is in production at uber. Atm MOR is lacking
> just
> >> > one
> >> > > > > > feature (incr pull using log files) that Nishith is planning
> to
> >> > merge
> >> > > > > soon.
> >> > > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously
> while
> >> > > > managing
> >> > > > > > compaction etc in the same job. I already knocked off some
> index
> >> > > > > > performance problems and working on indexing the log files,
> >> which
> >> > > > should
> >> > > > > > unlock near real time ingest.
> >> > > > > >
> >> > > > > > Putting all these together, within a month or so near real
> time
> >> MOR
> >> > > > > vision
> >> > > > > > should be very real. Ofc we need community help with dev and
> >> > testing
> >> > > to
> >> > > > > > speed things up. :)
> >> > > > > >
> >> > > > > > Hope that gives you a clearer picture.
> >> > > > > >
> >> > > > > > Thanks
> >> > > > > > Vinoth
> >> > > > > >
> >> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> >> > > > net22...@gmail.com
> >> > > > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Thanks, Vinoth
> >> > > > > > >
> >> > > > > > > It's working now. But I have 2 questions:
> >> > > > > > > 1. The ingestion latency of using the DataSource API with the
> >> > > > > > > HoodieSparkSQLWriter is high compared to using the delta
> >> > > > > > > streamer. Why is it slow? Are there specific options we could
> >> > > > > > > set to minimize the ingestion latency?
> >> > > > > > >    For example: when I run the delta streamer it's taking
> >> > > > > > > about 1 minute to insert some data. If I use the DataSource API
> >> > > > > > > with HoodieSparkSQLWriter, it's taking 5 minutes. How can we
> >> > > > > > > optimize this?
> >> > > > > > > 2. Where do we categorize Hudi in general (is it batch
> >> > > > > > > processing or streaming)? I am asking this because currently
> >> > > > > > > copy-on-write is the one that is fully working, and since the
> >> > > > > > > merge-on-read functionality that enables near-real-time
> >> > > > > > > analytics is not fully done, can we consider Hudi a batch job?
> >> > > > > > >
> >> > > > > > > Kind regards,
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
> >> > vin...@apache.org>
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi,
> >> > > > > > > >
> >> > > > > > > > Short answer: by default, any parameter you pass in using
> >> > > > > > > > option(k,v) or options() beginning with "_" would be saved to
> >> > > > > > > > the commit metadata. You can change the "_" prefix to
> >> > > > > > > > something else by using
> >> > > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> >> > > > > > > > The reason you are not seeing the checkpointstr inside the
> >> > > > > > > > commit metadata is that it's just supposed to be a prefix for
> >> > > > > > > > all such commit metadata.
> >> > > > > > > >
> >> > > > > > > > val metaMap = parameters.filter(kv =>
> >> > > > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> >> > > > > > > >
> >> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> >> > > > > > > net22...@gmail.com>
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
> >> > > > > > > > > from any dataframe into a Hudi-modeled table. It's creating
> >> > > > > > > > > everything correctly, but I also want to save the
> >> > > > > > > > > checkpoint, and I couldn't, even though I am passing it as
> >> > > > > > > > > an argument.
> >> > > > > > > > >
> >> > > > > > > > > inputDF.write()
> >> > > > > > > > >   .format("com.uber.hoodie")
> >> > > > > > > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> >> > > > > > > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> >> > > > > > > > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> >> > > > > > > > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> >> > > > > > > > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(),
> >> > > > > > > > >           checkpointstr)
> >> > > > > > > > >   .mode(SaveMode.Append)
> >> > > > > > > > >   .save(basePath);
> >> > > > > > > > >
> >> > > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() for
> >> > > > > > > > > inserting the checkpoint while using the dataframe writer,
> >> > > > > > > > > but I couldn't add the checkpoint metadata into the .hoodie
> >> > > > > > > > > metadata. Is there a way I can add the checkpoint metadata
> >> > > > > > > > > while using the dataframe writer API?
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
