Dear team,


With the 0.5.1 version released, users need to add
`org.apache.spark:spark-avro_2.11:2.4.4` when starting the Hudi command, like below:
/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
/-------------------------------------------------------------------------------------------------------------------------------------------------------------/


From the spark-avro guide[1], we know that the spark-avro module is external; it is 
not included in spark-2.4.4-bin-hadoop2.7.tgz[2].
So it may be better to relocate the spark-avro dependency using the 
maven-shade-plugin. If so, users could start Hudi the same way as with version 0.5.0:
/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
/-------------------------------------------------------------------------------------------------------------------------------------------------------------/
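For illustration, a relocation via maven-shade-plugin could look roughly like the sketch below in the bundle's pom.xml. This is only an illustrative fragment: the shaded package prefix and the exact set of relocated packages are assumptions, and the actual change is in the PR referenced below.

```xml
<!-- Illustrative sketch only: relocate spark-avro classes into a Hudi-owned
     package so the bundle ships them without clashing with user classpaths.
     The shadedPattern prefix here is an assumption, not the PR's actual value. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.spark.sql.avro</pattern>
            <shadedPattern>org.apache.hudi.org.apache.spark.sql.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The spark-avro artifact would also need to be listed as a dependency of the bundle module so the shade plugin has the classes to include and relocate.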


I created a PR to fix this[3]. We may need to have more discussion about it; any 
suggestions are welcome, thanks very much :)
Current state:
@bhasudha : +1
@vinoth   : -1


[1] http://spark.apache.org/docs/latest/sql-data-sources-avro.html
[2] 
http://mirror.bit.edu.cn/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz 
[3] https://github.com/apache/incubator-hudi/pull/1290
