Re: Apache Hudi on AWS EMR

2020-02-27 Thread Vinoth Chandar
On the second part, it seems like a question for the EMR folks?

Hudi's RDD-level APIs do hand the failed records back to the caller. Maybe we
should consider writing out the error records somewhere for the datasource
as well?
Others, any thoughts?
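
For reference, a rough sketch of what a caller can already do with the
RDD-level API today: the write client hands back a JavaRDD of WriteStatus,
and each WriteStatus carries the per-record errors. (Class and package names
vary across Hudi versions; treat this as illustrative, not exact.)

    import scala.collection.JavaConverters._
    import org.apache.hudi.client.WriteStatus  // org.apache.hudi.WriteStatus on older releases
    import org.apache.spark.api.java.JavaRDD

    // Summarize the failures handed back by e.g. HoodieWriteClient.upsert(...)
    def summarizeFailures(statuses: JavaRDD[WriteStatus]): Unit = {
      val failed = statuses.rdd.filter(_.hasErrors)
      // getErrors maps each failed HoodieKey to the Throwable that caused it
      failed.flatMap(_.getErrors.asScala).take(20).foreach { case (key, t) =>
        println(s"Failed to write $key: ${t.getMessage}")
      }
    }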

On Mon, Feb 24, 2020 at 10:59 PM Raghvendra Dhar Dubey
 wrote:

> Thanks Gary and Udit,
>
> I tried HoodieDeltaStreamer for reading parquet files from S3, but there is
> an issue where AvroSchemaConverter is not able to convert Parquet INT96. So I
> thought to use Spark Structured Streaming to read data from S3 and write
> into Hudi. But since Databricks provides "cloudfiles" for failure handling,
> is there something similar in EMR? Or do we need to handle this failure
> manually by introducing SQS and SNS?
>
>
>
> On 2020/02/18 20:03:16, "Mehrotra, Udit" 
> wrote:
> > Workaround provided by Gary can help querying Hudi tables through Athena
> for Copy On Write tables by basically querying only the latest commit files
> as standard parquet. It would definitely be worth documenting, as several
> people have asked for it and I remember providing the same suggestion on
> slack earlier. I can add if I have the perms.
> >
> > >> if I connect to the Hive catalog on EMR, which is able to provide the
> > Hudi views correctly, I should be able to get correct results on
> Athena
> >
> > As Vinoth mentioned, just connecting to metastore is not enough. Athena
> would still use its own Presto which does not support Hudi.
> >
> > As for Hudi support for Athena:
> > Athena does use Presto, but it's their own custom version and I don't
> think they yet have the code that Hudi guys contributed to presto i.e. the
> split annotations etc. Also they don’t have Hudi jars in presto classpath.
> We are not sure of any timelines for this support, but I have heard that
> work should start soon.
> >
> > Thanks,
> > Udit
> >
> > On 2/18/20, 11:27 AM, "Vinoth Chandar"  wrote:
> >
> > Thanks everyone for chiming in. Esp Gary for the detailed
> workaround..
> > (should we FAQ this workaround.. food for thought)
> >
> > >> if I connect to the Hive catalog on EMR, which is able to provide
> the
> > Hudi views correctly, I should be able to get correct results on
> Athena
> >
> > Knowing how the Presto/Hudi integration works, simply being able to
> read
> > from Hive metastore is not enough. Presto has code to specially
> recognize
> > Hudi tables and does an additional filtering step, which lets it
> query the
> > data in there correctly. (Gary's workaround above keeps just 1
> version
> > around for a given file (group))..
> >
> > On Mon, Feb 17, 2020 at 11:28 PM Gary Li 
> wrote:
> >
> > > Hello, I don't have any experience working with Athena but I can
> share my
> > > experience working with Impala. There is a workaround.
> > > By setting Hudi config:
> > >
> > >- hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > >- hoodie.cleaner.fileversions.retained=1
> > >
> > > You will have your Hudi dataset the same as plain parquet files. You can
> > > create a table just like regular parquet. Hudi will write a new commit
> > > first, then delete the older files that have two versions. You need to
> > > refresh the table metadata store as soon as the Hudi upsert job finishes.
> > > For Impala, it's simply REFRESH TABLE xxx. After Hudi has vacuumed the older
> > > files and before the table metastore is refreshed, the table will be
> > > unavailable for query (1-5 mins in my case).
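
(A minimal sketch of this workaround for a Spark datasource write; the two
cleaner settings are the ones listed above, the remaining option keys are the
standard Hudi write configs, and inputDf, the table name, and the field names
are placeholders:

    inputDf.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", "my_table")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.partitionpath.field", "ds")
      .option("hoodie.datasource.write.precombine.field", "updated_at")
      // keep exactly one file version around, per the workaround above
      .option("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS")
      .option("hoodie.cleaner.fileversions.retained", "1")
      .mode("append")
      .save("s3://bucket/path/my_table")
    // then refresh the query engine's metadata, e.g. Impala's REFRESH,
    // as soon as the upsert job finishes
)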
> > >
> > > How can we process S3 parquet files (hourly partitioned) through Apache
> > > Hudi? Is there any streaming layer we need to introduce?
> > > ---
> > > Hudi DeltaStreamer supports parquet files. You can do a bulkInsert for the
> > > first job, then use DeltaStreamer for the upsert jobs.
> > >
> > > 3 - What should be the parquet file size and row group size for better
> > > performance when querying a Hudi dataset?
> > > --
> > > That depends on the query engine you are using, and it should be documented
> > > somewhere. For Impala, the optimal size for query performance is 256MB, but
> > > a larger file size will make upserts more expensive. The size I personally
> > > choose is 100MB to 128MB.
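
(For completeness, a sketch of the write-side knobs for that sizing trade-off;
the keys come from Hudi's HoodieStorageConfig and are worth double-checking for
your version:

    // target roughly 120 MB base files and row groups, in line with the
    // 100-128 MB suggestion above
    val sizingOpts = Map(
      "hoodie.parquet.max.file.size" -> (120 * 1024 * 1024).toString,
      "hoodie.parquet.block.size"    -> (120 * 1024 * 1024).toString
    )
    // pass via .options(sizingOpts) on the same DataFrameWriter as in the sketch above
)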
> > >
> > > Thanks,
> > > Gary
> > >
> > >
> > >
> > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu
> 
> > > wrote:
> > >
> > > > Athena is indeed Presto inside, but there is lot of custom code
> which has
> > > > gone on top of Presto there.
> > > > Couple months back I tried running a glue crawler to catalog a
> Hudi data
> > > > set and then query it from Athena. The results were not same as
> what I
> > > > would get with running the same query using spark SQL on EMR.
> Did not try
> > > > Presto on EMR, but assuming it will work fine on EMR.
> > > >
> > > > Athena integration with Hudi

Re: [DISCUSS] Support for complex record keys with TimestampBasedKeyGenerator

2020-02-27 Thread Vinoth Chandar
+1 for adding a new composite KeyGenerator, which can combine both...

Workaround: for DeltaStreamer, you can also use the Transformer API to do more
flexible key generation as you wish.
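
As a rough sketch of that Transformer-based workaround (interface and package
names follow hudi-utilities 0.5.x and may differ by version; the field names
are placeholders): derive one composite column up front and point the
record-key config at it.

    import org.apache.hudi.common.util.TypedProperties
    import org.apache.hudi.utilities.transform.Transformer
    import org.apache.spark.api.java.JavaSparkContext
    import org.apache.spark.sql.functions.{col, concat_ws}
    import org.apache.spark.sql.{Dataset, Row, SparkSession}

    class CompositeKeyTransformer extends Transformer {
      override def apply(jsc: JavaSparkContext, spark: SparkSession,
                         rows: Dataset[Row], props: TypedProperties): Dataset[Row] = {
        // concatenate two business fields into a single derived key column
        rows.withColumn("composite_key", concat_ws(":", col("order_id"), col("customer_id")))
      }
    }
    // then set hoodie.datasource.write.recordkey.field=composite_key and keep
    // the timestamp-based partition path configuration as before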

On Tue, Feb 25, 2020 at 9:37 AM Balaji Varadarajan
 wrote:

>
> See if you can have a generic implementation where individual fields in
> the partition-path can be configured with their own key-generator class.
> Currently, TimestampBasedKeyGenerator is the only type-specific custom
> generator. If we are anticipating more such classes for specialized types,
> you can use a generic way to support overriding the key-generator for
> individual partition-fields once and for all.
> Balaji.V
>
> On Monday, February 24, 2020, 03:09:02 AM PST, Pratyaksh
> Sharma  wrote:
>
>  Hi,
>
> We have TimestampBasedKeyGenerator for defining custom partition paths, and
> we have ComplexKeyGenerator for supporting a combination of fields as the
> record key or partition key.
>
> However, we do not have support for the case where one wants a combination
> of fields as the record key along with being able to define custom
> partition paths. This use case recently came up at my organisation.
>
> How about having a CustomTimestampBasedKeyGenerator which supports the above
> use case? This class can simply extend TimestampBasedKeyGenerator and allow
> users to have a combination of fields as the record key.
>
> Open to hearing others' opinions.
>


Re: Need clarity on these test cases in TestHoodieDeltaStreamer

2020-02-27 Thread Balaji Varadarajan

Awesome Pratyaksh, would you mind opening a PR to document it?
Balaji.V

Sent from Yahoo Mail for iPhone


On Wednesday, February 26, 2020, 11:14 PM, Pratyaksh Sharma 
 wrote:

Hi,

I figured out the issue yesterday. Thank you for helping me out.

On Thu, Feb 27, 2020 at 12:01 AM vbal...@apache.org 
wrote:

>
> This change was done as part of adding delete API support :
> https://github.com/apache/incubator-hudi/commit/7031445eb3cae5a4557786c7eb080944320609aa
>
> I don't remember the reason behind this.
> Sivabalan, Can you explain the reason when you get a chance.
> Thanks,
> Balaji.V
>    On Wednesday, February 26, 2020, 06:03:53 AM PST, Pratyaksh Sharma <
> pratyaks...@gmail.com> wrote:
>
>  Anybody got a chance to look at this?
>
> On Mon, Feb 24, 2020 at 1:04 AM Pratyaksh Sharma 
> wrote:
>
> > Hi,
> >
> > While working on one of my PRs, I am stuck with the following test cases
> > in TestHoodieDeltaStreamer -
> > 1. testUpsertsCOWContinuousMode
> > 2. testUpsertsMORContinuousMode
> >
> > For both of them, at lines [1] and [2], we are adding 200 to totalRecords
> > while asserting the record count and distance count respectively. I am unable
> > to understand what these 200 records correspond to. Any leads are
> > appreciated.
> >
> > I feel I am probably missing some piece of code where I need to make
> > changes for the above tests to pass.
> >
> > [1]
> >
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L425
> > .
> > [2]
> >
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L426
> > .
> >
> >
>





Re: Apache Hudi on AWS EMR

2020-02-27 Thread Shiyan Xu
+1 on the idea. Giving a config like `--error-path` where all failed
conversions are saved provides flexibility for later processing. SQS/SNS
can pick that up later.
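
To make the idea concrete, a purely hypothetical sketch (neither `--error-path`
nor the helper below exist in Hudi today) of persisting the failures that come
back as WriteStatus, so a downstream SQS/SNS consumer can pick them up:

    import scala.collection.JavaConverters._
    import org.apache.hudi.client.WriteStatus
    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.sql.SparkSession

    def dumpErrors(spark: SparkSession, statuses: JavaRDD[WriteStatus], errorPath: String): Unit = {
      import spark.implicits._
      statuses.rdd
        .flatMap(_.getErrors.asScala)  // (HoodieKey, Throwable) pairs
        .map { case (key, t) => (key.getRecordKey, key.getPartitionPath, t.toString) }
        .toDF("record_key", "partition_path", "error")
        .write.mode("append").json(errorPath)  // e.g. the value passed as --error-path
    }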

On Thu, Feb 27, 2020 at 8:10 AM Vinoth Chandar  wrote:

> On the second part, it seems like a question for EMR folks ?
>
> Hudi's RDD level APIs, do hand the failure records back and .. May be we
> should consider writing out the error records somewhere for the datasource
> as well.?
> others any thoughts?
>
> On Mon, Feb 24, 2020 at 10:59 PM Raghvendra Dhar Dubey
>  wrote:
>
> > Thanks Gary and Udit,
> >
> > I tried HudiDeltaStreamer for reading parquet files from s3  but there is
> > an issue while AvroSchemaConverter not able to convert Parquet INT96. so
> I
> > thought to use Spark Structured Streaming to read data from s3 and write
> > into Hudi, but as Databricks providing "cloudfiles" for failure handling,
> > Is there something in EMR? or do we need to manually handle this failure
> by
> > introducing SQS and SNS?
> >
> >
> >
> > On 2020/02/18 20:03:16, "Mehrotra, Udit" 
> > wrote:
> > > Workaround provided by Gary can help querying Hudi tables through
> Athena
> > for Copy On Write tables by basically querying only the latest commit
> files
> > as standard parquet. It would definitely be worth documenting, as several
> > people have asked for it and I remember providing the same suggestion on
> > slack earlier. I can add if I have the perms.
> > >
> > > >> if I connect to the Hive catalog on EMR, which is able to provide
> the
> > > Hudi views correctly, I should be able to get correct results on
> > Athena
> > >
> > > As Vinoth mentioned, just connecting to metastore is not enough. Athena
> > would still use its own Presto which does not support Hudi.
> > >
> > > As for Hudi support for Athena:
> > > Athena does use Presto, but it's their own custom version and I don't
> > think they yet have the code that Hudi guys contributed to presto i.e.
> the
> > split annotations etc. Also they don’t have Hudi jars in presto
> classpath.
> > We are not sure of any timelines for this support, but I have heard that
> > work should start soon.
> > >
> > > Thanks,
> > > Udit
> > >
> > > On 2/18/20, 11:27 AM, "Vinoth Chandar"  wrote:
> > >
> > > Thanks everyone for chiming in. Esp Gary for the detailed
> > workaround..
> > > (should we FAQ this workaround.. food for thought)
> > >
> > > >> if I connect to the Hive catalog on EMR, which is able to
> provide
> > the
> > > Hudi views correctly, I should be able to get correct results on
> > Athena
> > >
> > > Knowing how the Presto/Hudi integration works, simply being able to
> > read
> > > from Hive metastore is not enough. Presto has code to specially
> > recognize
> > > Hudi tables and does an additional filtering step, which lets it
> > query the
> > > data in there correctly. (Gary's workaround above keeps just 1
> > version
> > > around for a given file (group))..
> > >
> > > On Mon, Feb 17, 2020 at 11:28 PM Gary Li  >
> > wrote:
> > >
> > > > Hello, I don't have any experience working with Athena but I can
> > share my
> > > > experience working with Impala. There is a workaround.
> > > > By setting Hudi config:
> > > >
> > > >- hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > > >- hoodie.cleaner.fileversions.retained=1
> > > >
> > > > You will have your Hudi dataset as same as plain parquet files.
> > You can
> > > > create a table just like regular parquet. Hudi will write a new
> > commit
> > > > first then delete the older files that have two versions. You
> need
> > to
> > > > refresh the table metadata store as soon as the Hudi Upsert job
> > finishes.
> > > > For impala, it's simply REFRESH TABLE xxx. After Hudi vacuumed
> the
> > older
> > > > files and before refresh the table metastore, the table will be
> > unavailable
> > > > for query(1-5 mins in my case).
> > > >
> > > > How can we process S3 parquet files(hourly partitioned) through
> > Apache
> > > > Hudi? Is there any streaming layer we need to introduce?
> > > > ---
> > > > Hudi Delta streamer support parquet file. You can do a bulkInsert
> > for the
> > > > first job then use delta streamer for the Upsert job.
> > > >
> > > > 3 - What should be the parquet file size and row group size for
> > better
> > > > performance on querying Hudi Dataset?
> > > > --
> > > > That depends on the query engine you are using and it should be
> > documented
> > > > somewhere. For impala, the optimal size for query performance is
> > 256MB, but
> > > > the larger file size will make upsert more expensive. The size I
> > personally
> > > > choose is 100MB to 128MB.
> > > >
> > > > Thanks,
> > > > Gary
> > > >
> > > >
> > > >
> > > > On Mon, Feb 17, 2020 at 9:46 PM Dubey, Raghu
> > 
> > > > wrote:
> > > >
> > > > > Athena i

Re: Apache Hudi on AWS EMR

2020-02-27 Thread Mehrotra, Udit
Raghvendra,

Can you enable TRACE-level logging for Hudi on EMR and provide the error logs?
To do this, go to /etc/spark/conf/log4j.properties and change the logging level of
log4j.logger.org.apache.hudi to TRACE. This would help surface the failed
records/keys based on
https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L287
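
Concretely, that is this one line in /etc/spark/conf/log4j.properties:

    log4j.logger.org.apache.hudi=TRACE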
 

Another thing that would help is the Avro schema that gets printed
on the driver when you run your job. We need to understand which field is
treated as INT96 and why, because the current parquet-avro does not handle its
conversion. Also, we can discuss any other questions about EMR in the
meeting you have set up with Rahul from the EMR team.

Thanks,
Udit

On 2/27/20, 11:00 AM, "Shiyan Xu"  wrote:

+1 on the idea. Giving an config like `--error-path` where all failed
conversions are saved provides flexibility for later processing. SQS/SNS
can pick that up later.

On Thu, Feb 27, 2020 at 8:10 AM Vinoth Chandar  wrote:

> On the second part, it seems like a question for EMR folks ?
>
> Hudi's RDD level APIs, do hand the failure records back and .. May be we
> should consider writing out the error records somewhere for the datasource
> as well.?
> others any thoughts?
>
> On Mon, Feb 24, 2020 at 10:59 PM Raghvendra Dhar Dubey
>  wrote:
>
> > Thanks Gary and Udit,
> >
> > I tried HudiDeltaStreamer for reading parquet files from s3  but there 
is
> > an issue while AvroSchemaConverter not able to convert Parquet INT96. so
> I
> > thought to use Spark Structured Streaming to read data from s3 and write
> > into Hudi, but as Databricks providing "cloudfiles" for failure 
handling,
> > Is there something in EMR? or do we need to manually handle this failure
> by
> > introducing SQS and SNS?
> >
> >
> >
> > On 2020/02/18 20:03:16, "Mehrotra, Udit" 
> > wrote:
> > > Workaround provided by Gary can help querying Hudi tables through
> Athena
> > for Copy On Write tables by basically querying only the latest commit
> files
> > as standard parquet. It would definitely be worth documenting, as 
several
> > people have asked for it and I remember providing the same suggestion on
> > slack earlier. I can add if I have the perms.
> > >
> > > >> if I connect to the Hive catalog on EMR, which is able to provide
> the
> > > Hudi views correctly, I should be able to get correct results on
> > Athena
> > >
> > > As Vinoth mentioned, just connecting to metastore is not enough. 
Athena
> > would still use its own Presto which does not support Hudi.
> > >
> > > As for Hudi support for Athena:
> > > Athena does use Presto, but it's their own custom version and I don't
> > think they yet have the code that Hudi guys contributed to presto i.e.
> the
> > split annotations etc. Also they don’t have Hudi jars in presto
> classpath.
> > We are not sure of any timelines for this support, but I have heard that
> > work should start soon.
> > >
> > > Thanks,
> > > Udit
> > >
> > > On 2/18/20, 11:27 AM, "Vinoth Chandar"  wrote:
> > >
> > > Thanks everyone for chiming in. Esp Gary for the detailed
> > workaround..
> > > (should we FAQ this workaround.. food for thought)
> > >
> > > >> if I connect to the Hive catalog on EMR, which is able to
> provide
> > the
> > > Hudi views correctly, I should be able to get correct results on
> > Athena
> > >
> > > Knowing how the Presto/Hudi integration works, simply being able 
to
> > read
> > > from Hive metastore is not enough. Presto has code to specially
> > recognize
> > > Hudi tables and does an additional filtering step, which lets it
> > query the
> > > data in there correctly. (Gary's workaround above keeps just 1
> > version
> > > around for a given file (group))..
> > >
> > > On Mon, Feb 17, 2020 at 11:28 PM Gary Li  >
> > wrote:
> > >
> > > > Hello, I don't have any experience working with Athena but I can
> > share my
> > > > experience working with Impala. There is a workaround.
> > > > By setting Hudi config:
> > > >
> > > >- hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
> > > >- hoodie.cleaner.fileversions.retained=1
> > > >
> > > > You will have your Hudi dataset as same as plain parquet files.
> > You can
> > > > create a table just like regular parquet. Hudi will write a new
> > commit
> > > > first then delete the older files that have two versions. You
> need
> > to
> > > > refresh the table metadata store as soon as the Hudi Upsert job
> > finishes.
> > > > For impala, it's s

Subscribing to commits@

2020-02-27 Thread Vinoth Chandar
Folks,

Realized some folks may not have noticed this, but
https://lists.apache.org/list.html?comm...@hudi.apache.org has all the
GitHub/JIRA activity in a single place.

If you are interested in helping others out in the community, please join
that list for email notifications. That's how I keep track.

(If we could mirror #general from Slack this way, it would be awesome.)


Re: Need clarity on these test cases in TestHoodieDeltaStreamer

2020-02-27 Thread Pratyaksh Sharma
Hi Balaji,

Right now I am facing a different issue in the same test case. The
number of records is not matching and the assertion is failing. Once I am
able to fix that as well, I will open the PR for sure. :)

On Thu, Feb 27, 2020 at 11:17 PM Balaji Varadarajan
 wrote:

>
> Awesome Pratyaksh, would you mind opening a PR to document it?
> Balaji.V
>
> Sent from Yahoo Mail for iPhone
>
>
> On Wednesday, February 26, 2020, 11:14 PM, Pratyaksh Sharma <
> pratyaks...@gmail.com> wrote:
>
> Hi,
>
> I figured out the issue yesterday. Thank you for helping me out.
>
> On Thu, Feb 27, 2020 at 12:01 AM vbal...@apache.org 
> wrote:
>
> >
> > This change was done as part of adding delete API support :
> >
> https://github.com/apache/incubator-hudi/commit/7031445eb3cae5a4557786c7eb080944320609aa
> >
> > I don't remember the reason behind this.
> > Sivabalan, Can you explain the reason when you get a chance.
> > Thanks,
> > Balaji.V
> >On Wednesday, February 26, 2020, 06:03:53 AM PST, Pratyaksh Sharma <
> > pratyaks...@gmail.com> wrote:
> >
> >  Anybody got a chance to look at this?
> >
> > On Mon, Feb 24, 2020 at 1:04 AM Pratyaksh Sharma 
> > wrote:
> >
> > > Hi,
> > >
> > > While working on one of my PRs, I am stuck with the following test
> cases
> > > in TestHoodieDeltaStreamer -
> > > 1. testUpsertsCOWContinuousMode
> > > 2. testUpsertsMORContinuousMode
> > >
> > > For both of them, at line [1] and [2], we are adding 200 to
> totalRecords
> > > while asserting record count and distance count respectively. I am
> unable
> > > to understand what do these 200 records correspond to. Any leads are
> > > appreciated.
> > >
> > > I feel probably I am missing some piece of code where I need to do
> > changes
> > > for the above tests to pass.
> > >
> > > [1]
> > >
> >
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L425
> > > .
> > > [2]
> > >
> >
> https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieDeltaStreamer.java#L426
> > > .
> > >
> > >
> >
>
>
>
>