ROOBALJINDAL opened a new issue, #7064:
URL: https://github.com/apache/hudi/issues/7064

   
   
   **Describe the problem you faced**
   
   Data ingestion from a CSV file works with FilebasedSchemaProvider, but it fails when the schema is provided through SchemaRegistryProvider for the same CSV data file.
   
   I created the following simple employee.csv file for testing:
   
   ```
   1001,a1,b1,c1
   1111,d1,e1,f1
   
   ```
   
   **Schema: source-flattened.avsc**
   
   ```
   {
     "type" : "record",
     "name" : "triprec",
     "fields": [
       {
         "name": "guidelinesid",
         "type": "long"
       },
       {
         "name": "str_one",
         "type": "string"
       },
       {
         "name": "str_two",
         "type": "string"
       },
       {
         "name": "str_three",
         "type": "string"
       }
     ]
   }
   ```
   
   **Spark command used, which works fine:**
   
   ```
   spark-submit \
     --jars /usr/lib/spark/external/lib/spark-avro_2.12-3.3.0-amzn-0.jar \
     --master local --deploy-mode client \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
     --table-type COPY_ON_WRITE --op BULK_INSERT \
     --target-base-path s3://hudi-multistreamer-roobal/csv-test/synced-table/employee \
     --target-table employee \
     --min-sync-interval-seconds 60 \
     --source-class org.apache.hudi.utilities.sources.CsvDFSSource \
     --source-ordering-field employeesid \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://my-bucket/csv-test/source-csv/ \
     --hoodie-conf hoodie.datasource.write.recordkey.field=employeesid \
     --enable-hive-sync \
     --hoodie-conf hoodie.datasource.hive_sync.database=default \
     --hoodie-conf hoodie.datasource.hive_sync.table=employee \
     --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
     --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
     --hoodie-conf hoodie.datasource.hive_sync.partition_fields= \
     --hoodie-conf hoodie.datasource.write.partitionpath.field= \
     --hoodie-conf hoodie.deltastreamer.csv.sep=, \
     --hoodie-conf hoodie.deltastreamer.csv.header=false \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3a://my-bucket/csv-test/schema/source-flattened.avsc \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=s3a://my-bucket/csv-test/schema/source-flattened.avsc \
     --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
   ```
   
   **I then added the same Avro schema to an Apicurio registry and modified the above command as follows, but it does not work:**
   
   ```
   spark-submit \
     --jars /usr/lib/spark/external/lib/spark-avro_2.12-3.3.0-amzn-0.jar \
     --master local --deploy-mode client \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar \
     --table-type COPY_ON_WRITE --op BULK_INSERT \
     --target-base-path s3://hudi-multistreamer-roobal/csv-test/synced-table/employee \
     --target-table employee \
     --min-sync-interval-seconds 60 \
     --source-class org.apache.hudi.utilities.sources.CsvDFSSource \
     --source-ordering-field employeesid \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://my-bucket/csv-test/source-csv/ \
     --hoodie-conf hoodie.datasource.write.recordkey.field=employeesid \
     --enable-hive-sync \
     --hoodie-conf hoodie.datasource.hive_sync.database=default \
     --hoodie-conf hoodie.datasource.hive_sync.table=employee \
     --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
     --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
     --hoodie-conf hoodie.datasource.hive_sync.partition_fields= \
     --hoodie-conf hoodie.datasource.write.partitionpath.field= \
     --hoodie-conf hoodie.deltastreamer.csv.sep=, \
     --hoodie-conf hoodie.deltastreamer.csv.header=false \
     --hoodie-conf schema.registry.url=http://xx.xxx.xx.xxx:8080/apis/ccompat/v6 \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://xx.xxx.xx.xxx:8080/apis/ccompat/v6/subjects/cd2c9220-63bb-468c-9051-57f6abf795da/versions/latest \
     --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider
   ```
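
   For reference, what the schema provider fetches from the configured registry URL can be checked directly. A minimal check, assuming the Apicurio ccompat v6 endpoint follows the Confluent Schema Registry REST API (latest subject version returned as JSON with a "schema" field) and that jq is installed:

   ```
   # Hypothetical check: fetch the latest registered version of the subject through
   # the Confluent-compatible API and print the Avro schema it serves.
   curl -s http://xx.xxx.xx.xxx:8080/apis/ccompat/v6/subjects/cd2c9220-63bb-468c-9051-57f6abf795da/versions/latest \
     | jq -r '.schema'
   ```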
   
   **Expected behavior**
   
   The CSV data file is the same in both runs: FilebasedSchemaProvider works, but SchemaRegistryProvider does not. Since the data and the schema are identical and only the provider differs, both runs should work. Am I missing any configuration?
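
   Since the schema is supposed to be identical in both runs, the file-based schema and the schema served by the registry can also be compared directly. A minimal sketch, assuming the AWS CLI and jq are available (the bucket name and registry host are the redacted placeholders from the commands above):

   ```
   # Hypothetical comparison: normalize both schemas with jq and diff them;
   # a difference would point to a schema mismatch between the two runs.
   diff \
     <(aws s3 cp s3://my-bucket/csv-test/schema/source-flattened.avsc - | jq -S .) \
     <(curl -s http://xx.xxx.xx.xxx:8080/apis/ccompat/v6/subjects/cd2c9220-63bb-468c-9051-57f6abf795da/versions/latest | jq -r '.schema' | jq -S .)
   ```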
   
   
   **Environment Description**

   Using an AWS EMR-6.8.0 cluster.

   * Hudi version : 0.11.1

   * Spark version : 3.3.0

   * Hive version : 3.1.3

   * Hadoop version : 3.2.1

   * Storage (HDFS/S3/GCS..) : S3

   * Running on Docker? (yes/no) : No, using an AWS EMR cluster
   
   
   **Stacktrace**
   
   ```
   22/10/26 05:19:35 DEBUG ChannelEndPoint: changeInterests p=false 0->1 for SocketChannelEndPoint@5402de4a{l=/xx.xxx.xx.xxx:36061,r=/xx.xxx.xx.xxx:56288,OPEN,fill=FI,flush=-,to=0/30000}{io=0/1,kio=0,kro=1}->HttpConnection@32de932e[p=HttpParser{s=START,0 of -1},g=HttpGenerator@1ad2f315{s=START}]=>HttpChannelOverHttp@2e0e0711{s=HttpChannelState@60192b1{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=2,c=false/false,a=IDLE,uri=null,age=0}
   22/10/26 05:19:35 DEBUG BatchedMarkerCreationRunnable: Finish batch processing of create marker requests in 37 ms
   22/10/26 05:19:35 DEBUG ChannelEndPoint: Key interests updated 0 -> 1 on SocketChannelEndPoint@5402de4a{l=/xx.xxx.xx.xxx:36061,r=/xx.xxx.xx.xxx:56288,OPEN,fill=FI,flush=-,to=0/30000}{io=1/1,kio=1,kro=1}->HttpConnection@32de932e[p=HttpParser{s=START,0 of -1},g=HttpGenerator@1ad2f315{s=START}]=>HttpChannelOverHttp@2e0e0711{s=HttpChannelState@60192b1{s=IDLE rs=BLOCKING os=OPEN is=IDLE awp=false se=false i=true al=0},r=2,c=false/false,a=IDLE,uri=null,age=0}
   22/10/26 05:19:35 INFO HoodieCreateHandle: New CreateHandle for partition : with fileId afd625e0-d213-4f0f-b3e5-bb3410bb97f8-0
   22/10/26 05:19:35 ERROR HoodieWriteHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=1111 partitionPath=}, currentLocation='null', newLocation='null'}
   java.io.EOFException: null
           at org.apache.avro.io.BinaryDecoder$ByteArrayByteSource.readRaw(BinaryDecoder.java:999) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:405) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:313) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:208) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:470) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:460) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:192) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154) ~[avro-1.11.0.jar:1.11.0]
           at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:156) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:146) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at org.apache.hudi.common.model.OverwriteWithLatestAvroPayload.getInsertValue(OverwriteWithLatestAvroPayload.java:75) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at org.apache.hudi.common.model.HoodieRecordPayload.getInsertValue(HoodieRecordPayload.java:105) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at org.apache.hudi.execution.HoodieLazyInsertIterable$HoodieInsertValueGenResult.<init>(HoodieLazyInsertIterable.java:90) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at org.apache.hudi.execution.HoodieLazyInsertIterable.lambda$getTransformFunction$0(HoodieLazyInsertIterable.java:103) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.insertRecord(BoundedInMemoryQueue.java:190) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:46) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:106) ~[hudi-utilities-bundle_2.12-0.11.1-amzn-0.jar:0.11.1-amzn-0]
           at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_342]
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_342]
           at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_342]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_342]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_342]
           at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_342]
   22/10/26 05:19:35 INFO HoodieCreateHandle: Closing the file afd625e0-d213-4f0f-b3e5-bb3410bb97f8-0 as we are done with all the records 0
   22/10/26 05:19:35 INFO MultipartUploadOutputStream: close closed:false s3://hudi-multistreamer-roobal/csv-test/synced-table/employee/afd625e0-d213-4f0f-b3e5-bb3410bb97f8-0_1-17-16_20221026051925630.parquet
   22/10/26 05:19:36 INFO HoodieCreateHandle: CreateHandle for partitionPath  fileID afd625e0-d213-4f0f-b3e5-bb3410bb97f8-0, took 284 ms.
   22/10/26 05:19:36 INFO BoundedInMemoryExecutor: Queue Consumption is done; notifying producer threads
   22/10/26 05:19:36 INFO MemoryStore: Block rdd_33_1 stored as values in memory (estimated size 1976.0 B, free 911.2 MiB)
   22/10/26 05:19:36 INFO BlockManagerInfo: Added rdd_33_1 in memory on ip-10-151-46-141.us-west-2.compute.internal:43387 (size: 1976.0 B, free: 912.1 MiB)
   22/10/26 05:19:36 INFO Executor: Finished task 1.0 in stage 17.0 (TID 16). 1240 bytes result sent to driver
   22/10/26 05:19:36 INFO TaskSetManager: Finished task 1.0 in stage 17.0 (TID 16) in 336 ms on ip-10-151-46-141.us-west-2.compute.internal (executor driver) (2/2)
   ```
   
   

