Re: Multilevel Partitioning in Hudi with Pyspark

2020-09-01 Thread Raghvendra Dhar Dubey
I got it working by adding the option:
hoodie.datasource.write.keygenerator.class =
org.apache.hudi.keygen.ComplexKeyGenerator
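
For anyone hitting the same need, here is a minimal PySpark sketch of the full
write (bucket, table and column names are placeholders, and I am assuming the
year/month/day parts already exist as separate columns in the DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-multilevel-partitioning").getOrCreate()

df = spark.read.parquet("s3://bucket/input/")  # placeholder source

hudi_options = {
    "hoodie.table.name": "hudi_table",
    "hoodie.datasource.write.recordkey.field": "wbn",
    # comma-separated columns become <year>/<month>/<day> partition folders
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    # ComplexKeyGenerator is what makes multi-column partitioning work
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.precombine.field": "action_date",
}

(df.write.format("org.apache.hudi")  # newer releases also accept "hudi"
   .options(**hudi_options)
   .mode("append")
   .save("s3://bucket/hudi_table/"))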

On Tue, Sep 1, 2020 at 12:43 PM Raghvendra Dhar Dubey <
raghvendra.d.du...@delhivery.com> wrote:

> Hi Team,
>
> I want to achieve multilevel partitioning in a Hudi Dataset like
> YYYY/MM/DD.
> How can I achieve this with pyspark?
>
> I tried the option below with multiple comma-separated columns, but it didn't
> work:
> .option("hoodie.datasource.write.partitionpath.field","YY,mm,dd")
>
> Please suggest.
>
>
> Thanks
> Raghvendra
>


Multilevel Partitioning in Hudi with Pyspark

2020-09-01 Thread Raghvendra Dhar Dubey
Hi Team,

I want to achieve multilevel partitioning in a Hudi Dataset like YYYY/MM/DD.
How can I achieve this with pyspark?

I tried the option below with multiple comma-separated columns, but it didn't
work:
.option("hoodie.datasource.write.partitionpath.field","YY,mm,dd")

Please suggest.


Thanks
Raghvendra


Re: Schema Reference in HudiDeltaStreamer

2020-03-16 Thread Raghvendra Dhar Dubey
Thanks Pratyaksh for the help.

On Mon, Mar 16, 2020 at 4:12 PM Pratyaksh Sharma 
wrote:

> Hi Raghvendra,
>
> As per the code flow of the Parquet reader, I do not see any reason why this
> exception should be thrown if your target schema actually has the
> concerned field. I would suggest printing the target schema just before
> the ParquetReader flow starts in the HoodieCopyOnWriteTable class, i.e. you need
> to print writerSchema in HoodieMergeHandle and cross-check whether the concerned
> field is actually getting passed to ParquetReader.
>
> On Mon, Mar 16, 2020 at 2:25 PM Raghvendra Dhar Dubey
>  wrote:
>
> > It is nullable
> > like {"name":"_id","type":["null","string"],"default":null}
> >
> > On Mon, Mar 16, 2020 at 2:22 PM Pratyaksh Sharma 
> > wrote:
> >
> > > How have you mentioned the field in your schema file? Is it a nullable
> > > field, or does it have a default value?
> > >
> > > On Mon, Mar 16, 2020 at 1:36 PM Raghvendra Dhar Dubey
> > >  wrote:
> > >
> > > > Thanks Pratyaksh,
> > > >
> > > > I got your point, but as in the example I used s3 avro schema file to
> > > refer
> > > > all emerged schema, It is not working.
> > > > I didn't try HiveSync Tool for this. Is there any option to refer
> glue?
> > > >
> > > >
> > > > On Mon, Mar 16, 2020 at 12:56 PM Pratyaksh Sharma <
> > pratyaks...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Raghvendra,
> > > > >
> > > > > As mentioned in the FAQ, this error occurs when your schema has
> > evolved
> > > > in
> > > > > terms of deleting some field, in your case 'cop_amt'. Even if your
> > > > current
> > > > > target schema has this field, the problem is occurring because some
> > > > > incoming record is not having this field. To fix this, you have the
> > > > > following options -
> > > > >
> > > > > 1. Make sure none of the fields get deleted.
> > > > > 2. Else have some default value for this field and send all your
> > > records
> > > > > with that default value
> > > > > 3. Try creating uber schema.
> > > > >
> > > > > By uber schema I mean to say, create a schema which has all the
> > fields
> > > > > which were ever a part of your incoming records. If you are using
> > > > > HiveSyncTool along with DeltaStreamer, then hive metastore can be a
> > > good
> > > > > source of truth for getting all the fields ever ingested. Please
> let
> > me
> > > > > know if this makes sense.
> > > > >
> > > > > On Sun, Mar 15, 2020 at 7:11 PM Raghvendra Dhar Dubey
> > > > >  wrote:
> > > > >
> > > > > > Thanks Pratyaksh,
> > > > > > But I am assigning target schema here as
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> > > > > >
> > > > > > But it doesn’t help, as per troubleshooting guide it is asking to
> > > build
> > > > > > Uber schema and refer It as target schema, but I am not sure
> about
> > > Uber
> > > > > > schema could you please help me into this?
> > > > > >
> > > > > > Thanks
> > > > > > Raghvendra
> > > > > >
> > > > > > On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma <
> > > > pratyaks...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > This might help - Caused by: org.apache.parquet.io
> > > > > > .InvalidRecordException:
> > > > > > > Parquet/Avro schema mismatch: Avro field 'col1' not found
> > > > > > > <
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound
> > > > > > > >
> > > > > > > .
> > > > > > >
> > > > > 

Re: Schema Reference in HudiDeltaStreamer

2020-03-16 Thread Raghvendra Dhar Dubey
It is nullable
like {"name":"_id","type":["null","string"],"default":null}

On Mon, Mar 16, 2020 at 2:22 PM Pratyaksh Sharma 
wrote:

> How have you mentioned the field in your schema file? Is it a nullable
> field, or does it have a default value?
>
> On Mon, Mar 16, 2020 at 1:36 PM Raghvendra Dhar Dubey
>  wrote:
>
> > Thanks Pratyaksh,
> >
> > I got your point, but as in the example I used s3 avro schema file to
> refer
> > all emerged schema, It is not working.
> > I didn't try HiveSync Tool for this. Is there any option to refer glue?
> >
> >
> > On Mon, Mar 16, 2020 at 12:56 PM Pratyaksh Sharma  >
> > wrote:
> >
> > > Hi Raghvendra,
> > >
> > > As mentioned in the FAQ, this error occurs when your schema has evolved
> > in
> > > terms of deleting some field, in your case 'cop_amt'. Even if your
> > current
> > > target schema has this field, the problem is occurring because some
> > > incoming record is not having this field. To fix this, you have the
> > > following options -
> > >
> > > 1. Make sure none of the fields get deleted.
> > > 2. Else have some default value for this field and send all your
> records
> > > with that default value
> > > 3. Try creating uber schema.
> > >
> > > By uber schema I mean to say, create a schema which has all the fields
> > > which were ever a part of your incoming records. If you are using
> > > HiveSyncTool along with DeltaStreamer, then hive metastore can be a
> good
> > > source of truth for getting all the fields ever ingested. Please let me
> > > know if this makes sense.
> > >
> > > On Sun, Mar 15, 2020 at 7:11 PM Raghvendra Dhar Dubey
> > >  wrote:
> > >
> > > > Thanks Pratyaksh,
> > > > But I am assigning target schema here as
> > > >
> > > >
> > > >
> > >
> >
> hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> > > >
> > > > But it doesn’t help, as per troubleshooting guide it is asking to
> build
> > > > Uber schema and refer It as target schema, but I am not sure about
> Uber
> > > > schema could you please help me into this?
> > > >
> > > > Thanks
> > > > Raghvendra
> > > >
> > > > On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma <
> > pratyaks...@gmail.com>
> > > > wrote:
> > > >
> > > > > This might help - Caused by: org.apache.parquet.io
> > > > .InvalidRecordException:
> > > > > Parquet/Avro schema mismatch: Avro field 'col1' not found
> > > > > <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound
> > > > > >
> > > > > .
> > > > >
> > > > > Please let us know in case of any more queries.
> > > > >
> > > > > On Sun, Mar 15, 2020 at 5:08 PM Raghvendra Dubey
> > > > >  wrote:
> > > > >
> > > > > > Hi Team,
> > > > > >
> > > > > >  I am reading parquet data from HudiDeltaStreamer and writing
> data
> > > into
> > > > > > Hudi Dataset.
> > > > > > s3 > EMR(Hudi DeltaStreamer) > S3(Hudi Dataset)
> > > > > >
> > > > > > I referred  avro schema as target schema through parameter
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> > > > > >
> > > > > > Deltastreamer command like
> > > > > > spark-submit --class
> > > > > > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
> > > --packages
> > > > > > org.apache.spark:spark-avro_2.11:2.4.4 --master yarn
> --deploy-mode
> > > > client
> > > > > >
> > > > >
> > > >
> > >
> >
> ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
> > > > > > --table-type COPY_ON_WRITE --source-ordering-field action_date
> > > > > > --sourc

Re: Schema Reference in HudiDeltaStreamer

2020-03-16 Thread Raghvendra Dhar Dubey
Thanks Pratyaksh,

I got your point, but as in the example I used an S3 Avro schema file meant to
cover the full merged schema, and it is not working.
I didn't try the HiveSyncTool for this. Is there any option to refer to the Glue
catalog instead?


On Mon, Mar 16, 2020 at 12:56 PM Pratyaksh Sharma 
wrote:

> Hi Raghvendra,
>
> As mentioned in the FAQ, this error occurs when your schema has evolved in
> terms of deleting some field, in your case 'cop_amt'. Even if your current
> target schema has this field, the problem is occurring because some
> incoming record is not having this field. To fix this, you have the
> following options -
>
> 1. Make sure none of the fields get deleted.
> 2. Else have some default value for this field and send all your records
> with that default value
> 3. Try creating uber schema.
>
> By uber schema I mean to say, create a schema which has all the fields
> which were ever a part of your incoming records. If you are using
> HiveSyncTool along with DeltaStreamer, then hive metastore can be a good
> source of truth for getting all the fields ever ingested. Please let me
> know if this makes sense.
>
> On Sun, Mar 15, 2020 at 7:11 PM Raghvendra Dhar Dubey
>  wrote:
>
> > Thanks Pratyaksh,
> > But I am assigning target schema here as
> >
> >
> >
> hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> >
> > But it doesn’t help, as per troubleshooting guide it is asking to build
> > Uber schema and refer It as target schema, but I am not sure about Uber
> > schema could you please help me into this?
> >
> > Thanks
> > Raghvendra
> >
> > On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma 
> > wrote:
> >
> > > This might help - Caused by: org.apache.parquet.io
> > .InvalidRecordException:
> > > Parquet/Avro schema mismatch: Avro field 'col1' not found
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound
> > > >
> > > .
> > >
> > > Please let us know in case of any more queries.
> > >
> > > On Sun, Mar 15, 2020 at 5:08 PM Raghvendra Dubey
> > >  wrote:
> > >
> > > > Hi Team,
> > > >
> > > >  I am reading parquet data from HudiDeltaStreamer and writing data
> into
> > > > Hudi Dataset.
> > > > s3 > EMR(Hudi DeltaStreamer) > S3(Hudi Dataset)
> > > >
> > > > I referred  avro schema as target schema through parameter
> > > >
> > > >
> > >
> >
> hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> > > >
> > > > Deltastreamer command like
> > > > spark-submit --class
> > > > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
> --packages
> > > > org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode
> > client
> > > >
> > >
> >
> ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
> > > > --table-type COPY_ON_WRITE --source-ordering-field action_date
> > > > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
> > > > --target-base-path s3://emr-spark-scripts/hudi_spark_test
> > --target-table
> > > > hudi_spark_test --transformer-class
> > > > org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class
> > > > org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
> > > >
> > >
> >
> hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/
> > > > --continuous
> > > >
> > > > but I  am getting issue of schema i.e
> > > > org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
> > > > mismatch: Avro field 'cop_amt' not found
> > > > at
> > > >
> > >
> >
> org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
> > > > at
> > > >
> > >
> >
> org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
> > > > at
> > > >
> > >
> 

Re: Schema Reference in HudiDeltaStreamer

2020-03-15 Thread Raghvendra Dhar Dubey
Thanks Pratyaksh,
But I am assigning the target schema here as

hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc

But it doesn't help. As per the troubleshooting guide, I am asked to build an
uber schema and refer to it as the target schema, but I am not sure how to
build an uber schema. Could you please help me with this?
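
(For reference, here is a rough sketch — plain Python, hypothetical file names —
of what I understand building an uber schema to mean: union the fields of every
schema version ever used, make any field that can be absent nullable with a null
default, and use the result as the target schema.)

import json

def to_nullable(field):
    """Make a field optional: union its type with null and default it to null."""
    t = field["type"]
    if isinstance(t, list):
        if "null" in t:
            return field
        new_type = ["null"] + t
    else:
        new_type = ["null", t]
    # you may want to keep truly required fields (e.g. the record key) as-is
    return dict(field, type=new_type, default=None)

def build_uber_schema(schema_files):
    """Union the fields of every historical schema version, order preserved."""
    base, fields, seen = None, [], set()
    for path in schema_files:
        with open(path) as f:
            schema = json.load(f)
        base = base or schema
        for field in schema["fields"]:
            if field["name"] not in seen:
                seen.add(field["name"])
                fields.append(to_nullable(field))
    return dict(base, fields=fields)

# hypothetical historical schema versions
uber = build_uber_schema(["schema_v1.avsc", "schema_v2.avsc"])
with open("schema.avsc", "w") as f:
    json.dump(uber, f, indent=2)
# upload schema.avsc to s3://bucket/schema.avsc and point
# hoodie.deltastreamer.schemaprovider.target.schema.file at it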

Thanks
Raghvendra

On Sun, 15 Mar 2020 at 6:08 PM, Pratyaksh Sharma 
wrote:

> This might help - Caused by: org.apache.parquet.io.InvalidRecordException:
> Parquet/Avro schema mismatch: Avro field 'col1' not found
> <
> https://cwiki.apache.org/confluence/display/HUDI/Troubleshooting+Guide#TroubleshootingGuide-Causedby:org.apache.parquet.io.InvalidRecordException:Parquet/Avroschemamismatch:Avrofield'col1'notfound
> >
> .
>
> Please let us know in case of any more queries.
>
> On Sun, Mar 15, 2020 at 5:08 PM Raghvendra Dubey
>  wrote:
>
> > Hi Team,
> >
> >  I am reading parquet data from HudiDeltaStreamer and writing data into
> > Hudi Dataset.
> > s3 > EMR(Hudi DeltaStreamer) > S3(Hudi Dataset)
> >
> > I referred  avro schema as target schema through parameter
> >
> >
> hoodie.deltastreamer.schemaprovider.target.schema.file=s3://bucket/schema.avsc
> >
> > Deltastreamer command like
> > spark-submit --class
> > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages
> > org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode client
> >
> ~/incubator-hudi/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
> > --table-type COPY_ON_WRITE --source-ordering-field action_date
> > --source-class org.apache.hudi.utilities.sources.ParquetDFSSource
> > --target-base-path s3://emr-spark-scripts/hudi_spark_test --target-table
> > hudi_spark_test --transformer-class
> > org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class
> > org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
> >
> hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS,hoodie.cleaner.fileversions.retained=1,hoodie.deltastreamer.schemaprovider.target.schema.file=s3://emr-spark-scripts/mongo_load_script/schema.avsc,hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://emr-spark-scripts/mongo_load_script/parquet-data/
> > --continuous
> >
> > but I  am getting issue of schema i.e
> > org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema
> > mismatch: Avro field 'cop_amt' not found
> > at
> >
> org.apache.parquet.avro.AvroRecordConverter.getAvroField(AvroRecordConverter.java:225)
> > at
> >
> org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:130)
> > at
> >
> org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
> > at
> >
> org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> > at
> >
> org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
> > at
> >
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
> > at
> >
> org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
> > at
> > org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> >
> > I have referred errored field into schema but still getting this issue.
> > Could you guys please help how can I refer schema file?
> >
> > Thanks
> > Raghvendra
> >
> >
> >
>


Re: Apache Hudi on AWS EMR

2020-02-28 Thread Raghvendra Dhar Dubey
Hi Udit,

I tried Hudi version 0.5.1 and it worked fine; this issue appeared with
Hudi 0.5.0. Other EMR-related issues have been discussed with Rahul.
Thanks to all of you for the cooperation.

Thanks
Raghvendra

On Fri, Feb 28, 2020 at 5:34 AM Mehrotra, Udit  wrote:

> Raghvendra,
>
> Can you enable TRACE level logging for Hudi on EMR, and provide the error
> logs. For this go to /etc/spark/conf/log4j.properties and change logging
> level of log4j.logger.org.apache.hudi to TRACE. This would help provide the
> failed record/keys based off
> https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L287
>
> Another thing that would help is to provide the Avro schema that gets
> printed on the driver when you run your job. We need to understand which
> field and why it is treated as INT96, because current parquet-avro does not
> handle its conversion. Also, for any other questions about EMR we can
> discuss it in the meeting you have setup with Rahul from EMR team.
>
> Thanks,
> Udit
>
> On 2/27/20, 11:00 AM, "Shiyan Xu"  wrote:
>
> +1 on the idea. Giving an config like `--error-path` where all failed
> conversions are saved provides flexibility for later processing.
> SQS/SNS
> can pick that up later.
>
> On Thu, Feb 27, 2020 at 8:10 AM Vinoth Chandar 
> wrote:
>
> > On the second part, it seems like a question for EMR folks ?
> >
> > Hudi's RDD level APIs, do hand the failure records back and .. May
> be we
> > should consider writing out the error records somewhere for the
> datasource
> > as well.?
> > others any thoughts?
> >
> > On Mon, Feb 24, 2020 at 10:59 PM Raghvendra Dhar Dubey
> >  wrote:
> >
> > > Thanks Gary and Udit,
> > >
> > > I tried HudiDeltaStreamer for reading parquet files from s3  but
> there is
> > > an issue while AvroSchemaConverter not able to convert Parquet
> INT96. so
> > I
> > > thought to use Spark Structured Streaming to read data from s3 and
> write
> > > into Hudi, but as Databricks providing "cloudfiles" for failure
> handling,
> > > Is there something in EMR? or do we need to manually handle this
> failure
> > by
> > > introducing SQS and SNS?
> > >
> > >
> > >
> > > On 2020/02/18 20:03:16, "Mehrotra, Udit"  >
> > > wrote:
> > > > Workaround provided by Gary can help querying Hudi tables through
> > Athena
> > > for Copy On Write tables by basically querying only the latest
> commit
> > files
> > > as standard parquet. It would definitely be worth documenting, as
> several
> > > people have asked for it and I remember providing the same
> suggestion on
> > > slack earlier. I can add if I have the perms.
> > > >
> > > > >> if I connect to the Hive catalog on EMR, which is able to
> provide
> > the
> > > > Hudi views correctly, I should be able to get correct
> results on
> > > Athena
> > > >
> > > > As Vinoth mentioned, just connecting to metastore is not enough.
> Athena
> > > would still use its own Presto which does not support Hudi.
> > > >
> > > > As for Hudi support for Athena:
> > > > Athena does use Presto, but it's their own custom version and I
> don't
> > > think they yet have the code that Hudi guys contributed to presto
> i.e.
> > the
> > > split annotations etc. Also they don’t have Hudi jars in presto
> > classpath.
> > > We are not sure of any timelines for this support, but I have
> heard that
> > > work should start soon.
> > > >
> > > > Thanks,
> > > > Udit
> > > >
> > > > On 2/18/20, 11:27 AM, "Vinoth Chandar" 
> wrote:
> > > >
> > > > Thanks everyone for chiming in. Esp Gary for the detailed
> > > workaround..
> > > > (should we FAQ this workaround.. food for thought)
> > > >
> > > > >> if I connect to the Hive catalog on EMR, which is able to
> > provide
> > > the
> > > > Hudi views correctly, I should be able to get correct
> results on
> > > Athena
> > > >
> > > >   

Re: HudiDeltaStreamer on EMR

2020-02-25 Thread Raghvendra Dhar Dubey


Got it Shiyan, Thanks.
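
For the record, a rough PySpark sketch of one possible workaround for the
Timestamp/INT96 issue described in the reply quoted below — paths are
placeholders, and I am assuming the INT96 columns are Spark timestamp columns —
is to rewrite the source parquet with the timestamps cast to something
Avro-friendly before pointing DeltaStreamer at it:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

spark = SparkSession.builder.appName("int96-workaround").getOrCreate()

df = spark.read.parquet("s3://xxx/Hoodi/")  # placeholder source path

# Spark 2.x writes timestamps as INT96 by default, which parquet-avro
# cannot convert; cast them (string here, epoch millis would also work)
for field in df.schema.fields:
    if isinstance(field.dataType, TimestampType):
        df = df.withColumn(field.name, F.col(field.name).cast("string"))

df.write.mode("overwrite").parquet("s3://xxx/Hoodi-no-int96/")

Alternatively, on Spark 2.3+, setting spark.sql.parquet.outputTimestampType to
TIMESTAMP_MICROS when re-writing the source avoids INT96 without changing the
column types.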
On 2020/02/24 19:15:52, Shiyan Xu  wrote: 
> It's likely that the source parquet data has a column of Spark Timestamp
> type, which is not convertible to avro.
> By the way, ParquetDFSSource is not available in 0.5.0. Only added in
> 0.5.1. You'll probably need to add a custom class which follows its
> existing implementation, and get rid of it once EMR upgrade Hudi version.
> 
> On Mon, Feb 24, 2020 at 10:41 AM Raghvendra Dhar Dubey
>  wrote:
> 
> > Hi Team,
> >
> > I was trying to use HudiDeltaStreamer on EMR, which reads parquet data from
> > S3 and write data into Hudi Dataset, but I am getting into an issue like
> > AvroSchemaConverter not able to convert INT96, INT96 not yet implemented.
> > spark-submit command that I am using
> >
> > spark-submit --class
> > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages
> > org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode client
> > /usr/lib/hudi/hudi-utilities-bundle-0.5.0-incubating.jar --storage-type
> > COPY_ON_WRITE --source-ordering-field action_date --source-class
> > org.apache.hudi.utilities.sources.ParquetDFSSource --target-base-path
> > s3://xxx/hudi_table --target-table hudi_table --payload-class
> > org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
> >
> > hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://xxx/Hoodi/
> >
> > Error I am getting is
> >
> > exception in thread "main" org.apache.spark.SparkException: Job aborted due
> > to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
> > Lost task 0.3 in stage 0.0 (TID 3, ip-172-30-37-9.ec2.internal, executor
> > 1): java.lang.IllegalArgumentException: INT96 not yet implemented. at
> >
> > org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
> > at
> >
> > org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
> > at
> >
> > org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:297)
> > at
> >
> > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:263)
> > at
> >
> > org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:241)
> > at
> >
> > org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231)
> > at
> >
> > org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
> > at
> >
> > org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
> > at
> >
> > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
> > at
> >
> > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> > at
> >
> > org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:199)
> > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:196)
> > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:151) at
> > org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) at
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at
> > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
> > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at
> > org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
> > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at
> > org.apache.spark.scheduler.Task.run(Task.scala:123) at
> >
> > org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at
> >
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > at
> >
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
> >
> > Please help me into this.
> >
> > Thanks
> > Raghvendra
> >
> 


Re: Apache Hudi on AWS EMR

2020-02-24 Thread Raghvendra Dhar Dubey
> Athena integration with Hudi data set is planned shortly, but not
> sure of
> > > the date yet.
> > >
> > > However, recently Athena started supporting integration to a Hive 
> catalog
> > > apart from Glue. What that means is in Athena, if I connect to the 
> Hive
> > > catalog on EMR, which is able to provide the Hudi views correctly, I
> > should
> > > be able to get correct results on Athena. Have not tested it though. 
> The
> > > feature is in Preview already.
> > >
> > > Thanks
> > > Raghu
> > > -Original Message-
> > > From: Shiyan Xu 
> > > Sent: Tuesday, February 18, 2020 6:20 AM
> > > To: dev@hudi.apache.org
> > > Cc: Mehrotra, Udit ; Raghvendra Dhar Dubey
> > > 
> > > Subject: Re: Apache Hudi on AWS EMR
> > >
> > > For 2) I think running presto on EMR is able to let you run
> > read-optimized
> > > queries.
> > > I don't quite understand how exactly Athena not support Hudi as it is
> > > Presto underlying.
> > > Perhaps @Udit could give some insights from AWS?
> > >
> > > As @Raghvendra you mentioned, another option is to export Hudi 
> dataset to
> > > plain parquet files for Athena to query on
> > > RFC-9 is for this usecase
> > >
> > >
> > 
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+09+%3A+Hudi+Dataset+Snapshot+Exporter
> > > The task is inactive now. Feel free to pick up if this is something 
> you'd
> > > like to work on. I'd be happy to help with that.
> > >
> > >
> > > On Thu, Feb 13, 2020 at 5:39 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi Raghvendra,
> > > >
> > > > Quick sidebar.. Please subscribe to the mailing list, so your 
> message
> > > > get published automatically. :)
> > > >
> > > > On Thu, Feb 13, 2020 at 5:32 PM Raghvendra Dhar Dubey
> > > >  wrote:
> > > >
> > > > > Hi Udit,
> > > > >
> > > > > Thanks for information.
> > > > > Actually I am struggling on following points
> > > > > 1 - How can we process S3 parquet files(hourly partitioned) 
> through
> > > > Apache
> > > > > Hudi? Is there any streaming layer we need to introduce? 2 - Is
> > > > > there any workaround to query Hudi Dataset from Athena? we are
> > > > > thinking to dump resulting Hudi dataset to S3, and then querying
> > > > > from Athena. 3 - What should be the parquet file size and row 
> group
> > > > > size for better performance on querying Hudi Dataset?
> > > > >
> > > > > Thanks
> > > > > Raghvendra
> > > > >
> > > > >
> > > > > On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit 
> > > > wrote:
> > > > >
> > > > > > Hi Raghvendra,
> > > > > >
> > > > > > You would have to re-write you Parquet Dataset in Hudi format.
> > > > > > Here are the links you can follow to get started:
> > > > > >
> > > > > >
> > > > >
> > > > 
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with
> > > > -dataset.html
> > > > > > https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
> > > > > >
> > > > > > Thanks,
> > > > > > Udit
> > > > > >
> > > > > > On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
> > > > > >  wrote:
> > > > > >
> > > > > > Hi Team,
> > > > > >
> > > > > > I want to setup incremental view of my AWS S3 parquet data
> > > > > > through Apache
> > > > > > Hudi, and want to query this data through Athena, but
> > > > > > currently
> > > > > Athena
> > > > > > not
> > > > > > supporting Hudi Dataset.
> > > > > >
> > > > > > so there are few questions which I want to understand here
> > > > > >
> > > > > > 1 - How to stream s3 parquet file to Hudi dataset running on
> > EMR.
> > > > > >
> > > > > > 2 - How to query Hudi Dataset running on EMR
> > > > > >
> > > > > > Please help me to understand this.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > Raghvendra
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 
> 
> 


HudiDeltaStreamer on EMR

2020-02-24 Thread Raghvendra Dhar Dubey
Hi Team,

I was trying to use HudiDeltaStreamer on EMR, which reads parquet data from
S3 and writes it into a Hudi dataset, but I am running into an issue where
AvroSchemaConverter is not able to convert INT96 ("INT96 not yet implemented").
The spark-submit command that I am using is:

spark-submit --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --packages
org.apache.spark:spark-avro_2.11:2.4.4 --master yarn --deploy-mode client
/usr/lib/hudi/hudi-utilities-bundle-0.5.0-incubating.jar --storage-type
COPY_ON_WRITE --source-ordering-field action_date --source-class
org.apache.hudi.utilities.sources.ParquetDFSSource --target-base-path
s3://xxx/hudi_table --target-table hudi_table --payload-class
org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf
hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://xxx/Hoodi/

The error I am getting is:

exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 3, ip-172-30-37-9.ec2.internal, executor
1): java.lang.IllegalArgumentException: INT96 not yet implemented. at
org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
at
org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
at
org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:297)
at
org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:263)
at
org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:241)
at
org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231)
at
org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
at
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
at
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
at
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:199)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:196)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:151) at
org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at
org.apache.spark.scheduler.Task.run(Task.scala:123) at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Please help me into this.

Thanks
Raghvendra


Re: Apache Hudi on AWS EMR

2020-02-13 Thread Raghvendra Dhar Dubey
Hi Udit,

Thanks for the information.
Actually, I am struggling with the following points:
1 - How can we process S3 parquet files (hourly partitioned) through Apache
Hudi? Is there any streaming layer we need to introduce?
2 - Is there any workaround to query a Hudi dataset from Athena? We are
thinking of dumping the resulting Hudi dataset to S3 and then querying it
from Athena.
3 - What should the parquet file size and row group size be for better
performance when querying a Hudi dataset?
(A rough sketch of what we are attempting for points 1 and 3 is below.)
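
For context, a minimal PySpark sketch of that attempt — table and column names
are placeholders, and the sizing values are illustrative only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-hudi").getOrCreate()

# hourly-partitioned source, e.g. s3://bucket/events/2020/02/13/09/*.parquet
df = spark.read.parquet("s3://bucket/events/")

hudi_options = {
    "hoodie.table.name": "events_hudi",
    # called hoodie.datasource.write.table.type on newer releases
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "wbn",
    "hoodie.datasource.write.partitionpath.field": "ad",
    "hoodie.datasource.write.precombine.field": "action_date",
    # sizing knobs: target file size, parquet row-group size, small-file limit
    "hoodie.parquet.max.file.size": str(128 * 1024 * 1024),
    "hoodie.parquet.block.size": str(128 * 1024 * 1024),
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
}

(df.write.format("org.apache.hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://bucket/events_hudi/"))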

Thanks
Raghvendra


On Thu, Feb 13, 2020 at 5:05 AM Mehrotra, Udit  wrote:

> Hi Raghvendra,
>
> You would have to re-write you Parquet Dataset in Hudi format. Here are
> the links you can follow to get started:
>
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html
> https://hudi.apache.org/docs/querying_data.html#spark-incr-pull
>
> Thanks,
> Udit
>
> On 2/12/20, 10:27 AM, "Raghvendra Dhar Dubey"
>  wrote:
>
> Hi Team,
>
> I want to setup incremental view of my AWS S3 parquet data through
> Apache
> Hudi, and want to query this data through Athena, but currently Athena
> not
> supporting Hudi Dataset.
>
> so there are few questions which I want to understand here
>
> 1 - How to stream s3 parquet file to Hudi dataset running on EMR.
>
> 2 - How to query Hudi Dataset running on EMR
>
> Please help me to understand this.
>
> Thanks
>
> Raghvendra
>
>
>


Apache Hudi on AWS EMR

2020-02-12 Thread Raghvendra Dhar Dubey
Hi Team,

I want to set up an incremental view of my AWS S3 parquet data through Apache
Hudi, and I want to query this data through Athena, but currently Athena does
not support Hudi datasets.

So there are a few questions which I want to understand here:

1 - How to stream S3 parquet files into a Hudi dataset running on EMR.

2 - How to query a Hudi dataset running on EMR.

Please help me to understand this.

Thanks

Raghvendra
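
(For reference, until Athena supports Hudi, question 2 is usually handled by
querying from Spark, Hive or Presto on the EMR cluster itself. A minimal
PySpark read sketch follows — the base path is a placeholder, and the glob
depth should match the number of partition levels.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-hudi").getOrCreate()

# snapshot / read-optimized view of a COPY_ON_WRITE table; Hudi 0.5.x uses the
# "org.apache.hudi" datasource name and needs partition-level globbing
df = (spark.read.format("org.apache.hudi")
      .load("s3://bucket/hudi_table/*/*"))

df.createOrReplaceTempView("hudi_table")
spark.sql("SELECT COUNT(*) FROM hudi_table").show()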