Yes, this is feasible. You can use the Databricks spark-csv package to load CSV files from a staging directory. This is pretty standard:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("hdfs://xxxxxx:9000/data/stg/")

You can then create an ORC table from Spark:

// Specify the Hive DB name
sql("use accounts")
//
// Drop and create table ll_18740868. Prefix the table name with the Hive database name
//
sql("DROP TABLE IF EXISTS accounts.ll_18740868")
var sqltext: String = ""
sqltext = """
CREATE TABLE accounts.ll_18740868 (
   TransactionDate    DATE
  ,TransactionType    String
  ,SortCode           String
)
COMMENT 'from csv file from excel sheet'
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
"""
sql(sqltext)
//
// Put data in the Hive table from a Spark temporary table, called say "tmp" here.
// Register the DataFrame loaded above under that name first.
//
df.registerTempTable("tmp")
sqltext = """
INSERT INTO TABLE accounts.ll_18740868
SELECT
   TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(TransactionDate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS TransactionDate
 , TransactionType
 , SortCode
FROM tmp
"""
sql(sqltext)

That is it. The above will create and populate an ORC table for you in the Hive database.

With regard to Parquet being the optimum file format for Spark, I am not sure; it depends on the use case. I personally prefer ORC files. Having said that, Parquet seems to be common in Spark. Personally, I would rather store the data as a Hive table for better manageability.

HTH

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 31 March 2016 at 22:00, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:

> Hello,
>
> How feasible is it to use Spark to extract csv files and create and write
> the content to an ORC table in a Hive database?
>
> Is Parquet the best (optimum) file format to write to HDFS from a Spark app?
>
> Thanks
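P.S. On the Parquet part of the question: if a plain file on HDFS is preferred over a Hive table, the same DataFrame can be written straight out as Parquet via the standard DataFrameWriter API. A minimal sketch, assuming the Spark 1.x sqlContext setup above; the output path is illustrative only:

```scala
// Write the DataFrame loaded from the staging directory straight to
// Parquet files on HDFS (path is hypothetical).
df.write
  .format("parquet")
  .mode("overwrite")   // replace any output left by a previous run
  .save("hdfs://xxxxxx:9000/data/parquet/ll_18740868")

// Alternatively, persist it as a Hive-managed table in one step,
// picking ORC or Parquet via the format:
// df.write.format("orc").saveAsTable("accounts.ll_18740868")
```

Either way you keep columnar storage and predicate pushdown; the Hive-table route just adds metastore manageability on top.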