SparkR Error in sparkR.init(master="local") in RStudio

2015-07-10 Thread kachau
I have installed the SparkR package from the Spark distribution into my R library. I can call the following command and it seems to work properly: library(SparkR). However, when I try to get the Spark context with the following code, sc <- sparkR.init(master="local"), it fails after some time with the

dataFrame.coalesce(1) or dataFrame.repartition(1) does not seem to work for me

2015-07-10 Thread kachau
Hi, I have a Hive INSERT INTO query which creates new Hive partitions. I have two Hive partition columns named server and date. Now I execute the insert query with the following code and try to save the result: DataFrame dframe = hiveContext.sql(insert into summary1 partition(server='a1',date='2015-05-22')

How do we control output part files created by Spark job?

2015-07-06 Thread kachau
Hi, I have a couple of Spark jobs which process thousands of files every day. File sizes may vary from MBs to GBs. After a job finishes I usually save with the following code: finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file

How to call hiveContext.sql() on all the Hive partitions in parallel?

2015-07-06 Thread kachau
Hi, I have to fire a few INSERT INTO queries which use Hive partitions. I have two Hive partition columns named server and date. Now I execute the insert queries through hiveContext as shown below, and the query works fine: hiveContext.sql(insert into summary1 partition(server='a1',date='2015-05-22') select from