[ https://issues.apache.org/jira/browse/HUDI-254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930985#comment-16930985 ]

Vinoth Chandar edited comment on HUDI-254 at 9/17/19 3:49 AM:
--------------------------------------------------------------

h3. Spark 2.3.1 on master 

 

Once I copy the hudi-spark-bundle (had to shade com.databricks:spark-avro* for 
now) into the jars folder, I can do *a, b, c, d* 
{code:java}
root@adhoc-2: cp $HUDI_SPARK_BUNDLE /opt/spark/jars/

root@adhoc-2:/opt# $SPARK_INSTALL/bin/spark-shell --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --executor-memory 3G --num-executors 1
19/09/17 03:46:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1568692009107).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val jsonDF = spark.read.json("file:////var/hoodie/ws/docker/demo/data/batch_1.json")

import org.apache.hudi.DataSourceReadOptions;
import org.apache.hudi.DataSourceWriteOptions;
import org.apache.spark.sql.SaveMode;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.HoodieDataSourceHelpers;
import org.apache.hadoop.fs.FileSystem;

jsonDF.write.format("org.apache.hudi").
    option("hoodie.insert.shuffle.parallelism", "2").
    option("hoodie.upsert.shuffle.parallelism","2").
    option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL).
    option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL).
    option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key").
    option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date").
    option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts").
    option(HoodieWriteConfig.TABLE_NAME, "stock_ticks_derived_mor").
    option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "stock_ticks_derived_mor").
    option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default").
    option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:10000").
    option(DataSourceWriteOptions.HIVE_USER_OPT_KEY, "hive").
    option(DataSourceWriteOptions.HIVE_PASS_OPT_KEY, "hive").
    option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true").
    option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "datestr").
    mode(SaveMode.Append).
    save("file:///tmp/stock_ticks_derived_mor");

spark.read.format("org.apache.hudi").load("file:///tmp/stock_ticks_derived_mor/*/*/*/*.parquet").show

spark.sql("select symbol, max(ts) from stock_ticks_derived_mor group by symbol HAVING symbol = 'GOOG'").show(100, false)

// Exiting paste mode, now interpreting.

19/09/17 03:47:13 WARN compact.HoodieRealtimeTableCompactor: After filtering, Nothing to compact for file:///tmp/stock_ticks_derived_mor
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| close|      date|day|   high|               key|   low|month|   open|symbol|                 ts|volume|year|
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
|     20190917034711|  20190917034711_0_1|NIHD_2018-08-31 10|            2018/08/31|5d4f52d9-c16f-4a2...|  5.67|2018/08/31| 31|   5.67|NIHD_2018-08-31 10|  5.67|   08|   5.67|  NIHD|2018-08-31 10:29:00|  2125|2018|
|     20190917034711|  20190917034711_0_2|CDMO_2018-08-31 09|            ....
.... 2018/08/31|5d4f52d9-c16f-4a2...|  14.5|2018/08/31| 31|14.5999|EGAN_2018-08-31 09|  14.5|   08|14.5999|  EGAN|2018-08-31 09:57:00|  2489|2018|
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
only showing top 20 rows

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
19/09/17 03:47:18 WARN hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+

jsonDF: org.apache.spark.sql.DataFrame = [close: double, date: string ... 10 more fields]
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.HoodieDataSourceHelpers
import org.apache.hadoop.fs.FileSystem

scala> {code}
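The HoodieDataSourceHelpers import above goes unused in the paste, so for completeness, here is a minimal sketch of how one could additionally verify that the upsert landed on the timeline. This assumes the same spark-shell session and the local base path used above; the method names are per the HoodieDataSourceHelpers API on master.
{code:java}
// Sketch: verify the upsert above produced a commit on the table's timeline.
// Assumes the spark-shell session from the transcript; basePath matches the
// save() path used above.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hudi.HoodieDataSourceHelpers

val basePath = "file:///tmp/stock_ticks_derived_mor"
val fs = FileSystem.get(new Path(basePath).toUri, spark.sparkContext.hadoopConfiguration)

// Latest completed instant on the timeline, e.g. "20190917034711"
val latest = HoodieDataSourceHelpers.latestCommit(fs, basePath)

// True if any commit is newer than the given instant ("000" = from the beginning)
val hasCommits = HoodieDataSourceHelpers.hasNewCommits(fs, basePath, "000")
println(s"latest commit = $latest, hasCommits = $hasCommits")
{code}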


was (Author: vc):
h3. Spark 2.3.3 on master 

 

Once I copy the hudi-spark-bundle (had to shade com.databricks:spark-avro* for 
now) into the jars folder, I can do *a + b* 
{code:java}
root@adhoc-2:/var/hoodie/ws/docker# $SPARK_INSTALL/bin/spark-shell --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --executor-memory 3G --num-executors 1
19/09/17 00:48:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-2:4040
Spark context available as 'sc' (master = local[2], app id = local-1568681334864).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val jsonDF = spark.read.json("file:////var/hoodie/ws/docker/demo/data/batch_1.json")

import org.apache.hudi.DataSourceReadOptions;
import org.apache.hudi.DataSourceWriteOptions;
import org.apache.spark.sql.SaveMode;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.HoodieDataSourceHelpers;
import org.apache.hadoop.fs.FileSystem;

jsonDF.write.format("org.apache.hudi").
    option("hoodie.insert.shuffle.parallelism", "2").
    option("hoodie.upsert.shuffle.parallelism","2").
    option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL).
    option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL).
    option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key").
    option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date").
    option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts").
    option(HoodieWriteConfig.TABLE_NAME, "stock_ticks_derived_mor").
    mode(SaveMode.Append).
    save("file:///tmp/stock_ticks_derived_mor");

spark.read.format("org.apache.hudi").load("file:///tmp/stock_ticks_derived_mor/*/*/*/*.parquet").show

// Exiting paste mode, now interpreting.

+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| close|      date|day|   high|               key|   low|month|   open|symbol|                 ts|volume|year|
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
|     20190917004922|  20190917004922_0_1|NIHD_2018-08-31 10|            2018/08/31|0488121d-4ff5-4fb...|  5.67|2018/08/31| 31|   5.67|NIHD_2018-08-31 10|  5.67|   08|   5.67|  NIHD|2018-08-31 10:29:00|  2125|2018|
  ...
|     20190917004922| 20190917004922_0_19|STAA_2018-08-31 10|            2018/08/31|0488121d-4ff5-4fb...|  47.5|2018/08/31| 31|   47.5|STAA_2018-08-31 10|  47.5|   08|   47.5|  STAA|2018-08-31 10:28:00|   800|2018|
|     20190917004922| 20190917004922_0_20|EGAN_2018-08-31 09|            2018/08/31|0488121d-4ff5-4fb...|  14.5|2018/08/31| 31|14.5999|EGAN_2018-08-31 09|  14.5|   08|14.5999|  EGAN|2018-08-31 09:57:00|  2489|2018|
+-------------------+--------------------+------------------+----------------------+--------------------+------+----------+---+-------+------------------+------+-----+-------+------+-------------------+------+----+
only showing top 20 rows
{code}

> Provide mechanism for installing hudi-spark-bundle onto an existing spark 
> installation
> --------------------------------------------------------------------------------------
>
>                 Key: HUDI-254
>                 URL: https://issues.apache.org/jira/browse/HUDI-254
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Spark datasource, SparkSQL Support
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> A lot of discussion around this kicked off from 
> [https://github.com/apache/incubator-hudi/issues/869] 
> Breaking this down into phases, when we drop the hudi-spark-bundle*.jar into 
> the `jars` folder:
>  
> a) Writing data via the Hudi datasource should work 
> b) Spark datasource reads should work (a minimal sketch of a + b follows below)
>  
> c) a + Hive sync should work
> d) SparkSQL on the Hive-synced table should work 
>  
> Start with Spark 2.3 (the current demo setup), then proceed to 2.4 and iron 
> out issues.
>  
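For reference, phases a and b reduce to the minimal check below once the bundle jar is in place. This is a sketch only: the table name and target path are illustrative, and the option keys are the ones used in the transcript above.
{code:java}
// Sketch of phases a + b on a vanilla Spark install, with hudi-spark-bundle*.jar
// already copied into $SPARK_HOME/jars. Table name and path are illustrative.
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = spark.read.json("file:///var/hoodie/ws/docker/demo/data/batch_1.json")

// a) Writing data via the Hudi datasource (defaults: COPY_ON_WRITE, upsert)
df.write.format("org.apache.hudi").
  option("hoodie.insert.shuffle.parallelism", "2").
  option("hoodie.upsert.shuffle.parallelism", "2").
  option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "key").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "date").
  option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(HoodieWriteConfig.TABLE_NAME, "hudi_smoke_test").
  mode(SaveMode.Append).
  save("file:///tmp/hudi_smoke_test")

// b) Reading it back through the Spark datasource; the glob matches the
// three-level partition paths (e.g. 2018/08/31) the demo data produces
spark.read.format("org.apache.hudi").
  load("file:///tmp/hudi_smoke_test/*/*/*/*.parquet").
  show()
{code}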



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
