Re: HudiDeltaStreamer on EMR
Got it Shiyan, Thanks.

On 2020/02/24 19:15:52, Shiyan Xu wrote:
> It's likely that the source parquet data has a column of Spark Timestamp
> type, which is not convertible to avro.
> By the way, ParquetDFSSource is not available in 0.5.0. Only added in
> 0.5.1. You'll probably need to add a custom class which follows its
> existing implementation, and get rid of it once EMR upgrade Hudi version.
Re: HudiDeltaStreamer on EMR
It's likely that the source parquet data has a column of Spark Timestamp type, which is not convertible to Avro.

By the way, ParquetDFSSource is not available in 0.5.0; it was only added in 0.5.1. You'll probably need to add a custom class that follows its existing implementation, and get rid of it once EMR upgrades its Hudi version.

On Mon, Feb 24, 2020 at 10:41 AM Raghvendra Dhar Dubey wrote:
> Hi Team,
>
> I was trying to use HudiDeltaStreamer on EMR, which reads parquet data from
> S3 and write data into Hudi Dataset, but I am getting into an issue like
> AvroSchemaConverter not able to convert INT96, INT96 not yet implemented.
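For context on why the timestamp column is a problem: Parquet's legacy INT96 timestamp is a 12-byte value (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian), and Avro has no matching primitive, whereas Avro's timestamp-micros logical type is a plain 64-bit long. The following is only an illustrative sketch of that mapping in plain Python; the function names are hypothetical and this is not how Hudi or parquet-avro handle it.

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01, the Unix epoch.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
MICROS_PER_DAY = 86_400_000_000


def int96_to_epoch_micros(raw: bytes) -> int:
    """Convert a 12-byte INT96 timestamp to microseconds since the Unix epoch.

    Hypothetical helper for illustration. Sub-microsecond precision is dropped.
    """
    if len(raw) != 12:
        raise ValueError("an INT96 value must be exactly 12 bytes")
    # "<qi": little-endian, 8-byte nanos-of-day then 4-byte Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * MICROS_PER_DAY + nanos_of_day // 1000


def int96_to_datetime(raw: bytes) -> datetime:
    """Same conversion, returned as a timezone-aware UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=int96_to_epoch_micros(raw))
```

A value converted this way fits in a single long, which is the shape an Avro timestamp column expects.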
HudiDeltaStreamer on EMR
Hi Team,

I was trying to use HudiDeltaStreamer on EMR to read parquet data from S3 and write it into a Hudi dataset, but I am running into an issue: AvroSchemaConverter is not able to convert INT96 ("INT96 not yet implemented").

The spark-submit command that I am using:

    spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      --packages org.apache.spark:spark-avro_2.11:2.4.4 \
      --master yarn --deploy-mode client \
      /usr/lib/hudi/hudi-utilities-bundle-0.5.0-incubating.jar \
      --storage-type COPY_ON_WRITE \
      --source-ordering-field action_date \
      --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
      --target-base-path s3://xxx/hudi_table \
      --target-table hudi_table \
      --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
      --hoodie-conf hoodie.datasource.write.recordkey.field=wbn,hoodie.datasource.write.partitionpath.field=ad,hoodie.deltastreamer.source.dfs.root=s3://xxx/Hoodi/

The error I am getting is:

    exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ip-172-30-37-9.ec2.internal, executor 1): java.lang.IllegalArgumentException: INT96 not yet implemented.
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:264)
        at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:297)
        at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:263)
        at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:241)
        at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:231)
        at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:199)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:196)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:151)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Please help me with this.

Thanks,
Raghvendra
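When chasing an error like this, it can help to confirm which columns are actually stored as INT96. The sketch below scans a Parquet schema dumped in its textual "message" form and lists the int96 fields. The helper name and the sample schema are hypothetical: the column names are taken from the command above, but their types are assumed for illustration and were not read from the real files.

```python
import re
from typing import List

# Matches top-level or nested primitive fields declared as int96, e.g.
#   optional int96 action_date;
INT96_FIELD = re.compile(
    r"^\s*(?:required|optional|repeated)\s+int96\s+(\w+)\s*;",
    re.IGNORECASE | re.MULTILINE,
)


def find_int96_columns(schema_text: str) -> List[str]:
    """Return the names of all fields declared as int96 in a schema dump."""
    return INT96_FIELD.findall(schema_text)


# Hypothetical schema dump; types are assumed, not taken from the real data.
SAMPLE_SCHEMA = """
message spark_schema {
  optional binary wbn (UTF8);
  optional binary ad (UTF8);
  optional int96 action_date;
}
"""

if __name__ == "__main__":
    print(find_int96_columns(SAMPLE_SCHEMA))
```

Any column reported here would need to be rewritten or cast upstream before a reader that goes through AvroSchemaConverter can process the file.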