[GitHub] [hudi] bvaradar commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?

2021-03-02 Thread GitBox


bvaradar commented on issue #2592:
URL: https://github.com/apache/hudi/issues/2592#issuecomment-789219208


   @codejoyan : Just realized this is because spark-2.3.0 uses the com.databricks version of spark-avro. Hudi has since moved to org.apache.spark:spark-avro.
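   For reference, on Spark 2.4.x the matching spark-avro module is usually supplied at launch time, as in the Hudi quickstart; the exact artifact versions below are illustrative and should match your Spark/Scala build:
   
       spark-shell \
         --packages org.apache.hudi:hudi-spark-bundle_2.11:0.7.0,org.apache.spark:spark-avro_2.11:2.4.4 \
         --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'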
   
   Amazon EMR got this working by compiling hudi-spark-bundle with the additional flag "-Dspark-shade-unbundle-avro", so that the bundle does not shade in its own spark-avro and the version provided by the environment is used instead. Additional context in https://issues.apache.org/jira/browse/HUDI-254.
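   A sketch of such a build invocation, assuming a checkout of the Hudi source tree (the module path reflects the standard repo layout, but verify it against your version):
   
       mvn clean package -DskipTests -Dspark-shade-unbundle-avro \
         -pl packaging/hudi-spark-bundle -am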
   
   @umehrot2 : Can you confirm if the above statement is correct?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?

2021-03-01 Thread GitBox


bvaradar commented on issue #2592:
URL: https://github.com/apache/hudi/issues/2592#issuecomment-788200091


   Sorry for the delay. I will try setting up Spark 2.3.0 tonight to reproduce this issue.







[GitHub] [hudi] bvaradar commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?

2021-02-25 Thread GitBox


bvaradar commented on issue #2592:
URL: https://github.com/apache/hudi/issues/2592#issuecomment-785896778


   I was unable to set up spark-2.3.0 in my environment. But with spark-2.4.4, this works fine, as shown below. Can you use a spark-2.4.x version? spark-2.3 seems too old.
   
   21/02/25 05:14:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   Spark context Web UI available at http://ip-192-168-1-81.ec2.internal:4040
   Spark context available as 'sc' (master = local[*], app id = local-1614258873363).
   Spark session available as 'spark'.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
         /_/
   
   Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.DataSourceWriteOptions
   
   scala> import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs
   import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs
   
   scala> import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig, HoodieWriteConfig}
   import org.apache.hudi.config.{HoodieCompactionConfig, HoodieStorageConfig, HoodieWriteConfig}
   
   scala> import org.apache.spark.sql.{SaveMode, SparkSession}
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   scala> import org.apache.spark.sql.functions.{col, concat, lit}
   import org.apache.spark.sql.functions.{col, concat, lit}
   
   scala>
   
   scala> val inputDF = spark.read.format("csv").option("header", "true").load("file:///tmp/input.csv")
   inputDF: org.apache.spark.sql.DataFrame = [col_1: string, col_2: string ... 12 more fields]
   
   scala> val formattedDF = inputDF.selectExpr("col_1", "cast(col_2 as integer) col_2",
        |   "cast(col_3 as short) col_3", "col_4", "col_5", "cast(col_6 as byte) col_6", "cast(col_7 as decimal(9,2)) col_7",
        |   "cast(col_8 as decimal(9,2)) col_8", "cast(col_9 as timestamp) col_9", "col_10", "cast(col_11 as timestamp) col_11",
        |   "col_12", "cntry_cd", "cast(bus_dt as date) bus_dt")
   formattedDF: org.apache.spark.sql.DataFrame = [col_1: string, col_2: int ... 12 more fields]
   
   scala> formattedDF.printSchema()
   root
    |-- col_1: string (nullable = true)
    |-- col_2: integer (nullable = true)
    |-- col_3: short (nullable = true)
    |-- col_4: string (nullable = true)
    |-- col_5: string (nullable = true)
    |-- col_6: byte (nullable = true)
    |-- col_7: decimal(9,2) (nullable = true)
    |-- col_8: decimal(9,2) (nullable = true)
    |-- col_9: timestamp (nullable = true)
    |-- col_10: string (nullable = true)
    |-- col_11: timestamp (nullable = true)
    |-- col_12: string (nullable = true)
    |-- cntry_cd: string (nullable = true)
    |-- bus_dt: date (nullable = true)
   
   
   scala> formattedDF.show
   
   +--------------------+-----+-----+-----+-----+-----+-----+-----+--------------------+-----------+--------------------+-----------+--------+----------+
   |               col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|               col_9|     col_10|              col_11|     col_12|cntry_cd|    bus_dt|
   +--------------------+-----+-----+-----+-----+-----+-----+-----+--------------------+-----------+--------------------+-----------+--------+----------+
   |7IN00716079317820...|  716|    3|   AB|  INR| null| 0.00| 1.00|2021-02-14 20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91|      IN|2021-02-01|
   |7IN00716079317820...|  716|    2|   AB|  INR| null| 0.00| 1.00|2021-02-14 20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91|      IN|2021-02-01|
   |7IN00716079317820...|  716|    1|   AB|  INR| null| 0.00| 1.00|2021-02-14 20:23:...|useridjsb91|2021-02-14 20:23:...|useridjsb91|      IN|2021-02-01|
   |AU700716079381819...| 5700|    5|   AB|  INR| null| 0.00| 1.00|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|      IN|2021-02-02|
   |AU700716079381819...| 5700|    6|   AB|  INR| null| 4.00| 1.97|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|      IN|2021-02-02|
   |AU700716079381819...| 5700|    4|   AB|  INR| null| 0.00| 1.00|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|      AU|2021-02-01|
   |AU700716079381819...| 5700|    3|   AB|  INR| null| 0.00| 1.00|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|      AU|2021-02-01|
   |AU700716079381819...| 5700|    1|   AB|  INR| null| 0.00| 1.00|2021-02-14 20:06:...|useridjsb91|2021-02-14 20:06:...|useridjsb91|

[GitHub] [hudi] bvaradar commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?

2021-02-23 Thread GitBox


bvaradar commented on issue #2592:
URL: https://github.com/apache/hudi/issues/2592#issuecomment-784650062


   @codejoyan : Would it be possible for you to give us a reproducible set of steps (just like the Quickstart: https://hudi.apache.org/docs/quick-start-guide.html)?
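   For illustration, a minimal Quickstart-style reproduction could look like the Scala sketch below; the input path, record key, precombine field, partition field, and table name are placeholders, not taken from this issue:
   
       import org.apache.spark.sql.SaveMode
   
       // Read the problematic input (placeholder path).
       val df = spark.read.format("csv").option("header", "true").load("file:///tmp/input.csv")
   
       // Write it as a Hudi table using the standard datasource options.
       df.write.format("hudi").
         option("hoodie.table.name", "repro_table").
         option("hoodie.datasource.write.recordkey.field", "col_1").
         option("hoodie.datasource.write.precombine.field", "col_9").
         option("hoodie.datasource.write.partitionpath.field", "bus_dt").
         mode(SaveMode.Overwrite).
         save("file:///tmp/repro_table")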


