Hi Swetha,

One option is to use a Hive release in which the above issue is fixed: either Hive 2.0, or Cloudera CDH Hive 1.2, which has the fix. Keep in mind that what matters is not the Hive you have installed but the Hive that Spark is using, which in Spark 1.6 is Hive 1.2 as of now.

The workaround I used for this issue was to write the DataFrame directly with the DataFrame write method and then create the Hive table on top of the output. With that change my processing time went down from 4+ hours to just under 1 hour.

    data_frame.write.partitionBy('idPartitioner', 'dtPartitioner').orc("path/to/final/location")

Note that the ORC format is supported only with HiveContext.

Thanks,
Bijay

On Mon, Jun 13, 2016 at 11:41 AM, swetha kasireddy <swethakasire...@gmail.com> wrote:

> Hi Mich,
>
> Following is a sample code snippet:
>
> val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist()
> System.out.println(" userRecsDF.partitions.size" + userRecsDF.partitions.size)
>
> userDF.registerTempTable("userRecordsTemp")
>
> sqlContext.sql("SET hive.default.fileformat=Orc ")
> sqlContext.sql("set hive.enforce.bucketing = true; ")
> sqlContext.sql("set hive.enforce.sorting = true; ")
> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) stored as ORC LOCATION '/user/userId/userRecords' ")
> sqlContext.sql(
>   """ from userRecordsTemp ps insert overwrite table users
>   partition(idPartitioner, dtPartitioner) select ps.userId, ps.userRecord,
>   ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner
>   """.stripMargin)
>
> On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy <swethakasire...@gmail.com> wrote:
>
>> Hi Bijay,
>>
>> If I am hitting this issue, https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done? Is upgrading to a higher version of Hive the only solution?
>>
>> Thanks!
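
For reference, the "create the Hive table on top" half of the workaround described at the top of this message could look roughly like the sketch below. It only builds the Hive statements as strings, so it runs without a cluster. The helper names (create_table_ddl, add_partition_ddl) are made up for illustration, the table, column, and location names are borrowed from the snippet quoted above, and the sample partition values are invented. It assumes the files were written with partitionBy, which lays them out as idPartitioner=<value>/dtPartitioner=<value> subdirectories under the target path.

```python
# Sketch only: builds the Hive statements needed to expose files that were
# written directly with DataFrame.write.partitionBy(...).orc(...).
# Helper names are hypothetical; table/column/path names come from the
# snippet quoted in this thread.


def create_table_ddl(table, columns, partition_columns, location):
    """Return a CREATE EXTERNAL TABLE statement for an ORC directory.

    columns and partition_columns are ordered lists of (name, type) pairs.
    """
    cols = ", ".join("{} {}".format(name, typ) for name, typ in columns)
    parts = ", ".join(
        "{} {}".format(name, typ) for name, typ in partition_columns
    )
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {} ({}) "
        "PARTITIONED BY ({}) STORED AS ORC LOCATION '{}'"
    ).format(table, cols, parts, location)


def add_partition_ddl(table, partition_values):
    """Return an ALTER TABLE statement that registers one partition.

    partition_values is an ordered list of (column, value) pairs. No
    LOCATION clause is emitted, so Hive uses the default
    col=value/col=value subdirectory under the table location.
    Values are not escaped (assumption: simple values without quotes).
    """
    spec = ", ".join(
        "{}='{}'".format(col, val) for col, val in partition_values
    )
    return "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({})".format(
        table, spec
    )


table_ddl = create_table_ddl(
    "users",
    [("userId", "STRING"), ("userRecord", "STRING")],
    [("idPartitioner", "STRING"), ("dtPartitioner", "STRING")],
    "/user/userId/userRecords",
)

# Invented sample partition values, for illustration only.
partition_ddl = add_partition_ddl(
    "users", [("idPartitioner", "1"), ("dtPartitioner", "2016-06-13")]
)

print(table_ddl)
print(partition_ddl)
```

Each string would then be passed to sqlContext.sql(...) on a HiveContext once the write has finished, with one ALTER TABLE statement per partition directory. Whether this is faster than the insert overwrite in a given setup would need to be measured.
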
>>
>> On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy <swethakasire...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Following is a sample code snippet:
>>>
>>> val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist()
>>> System.out.println(" userRecsDF.partitions.size" + userRecsDF.partitions.size)
>>>
>>> userDF.registerTempTable("userRecordsTemp")
>>>
>>> sqlContext.sql("SET hive.default.fileformat=Orc ")
>>> sqlContext.sql("set hive.enforce.bucketing = true; ")
>>> sqlContext.sql("set hive.enforce.sorting = true; ")
>>> sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING) stored as ORC LOCATION '/user/userId/userRecords' ")
>>> sqlContext.sql(
>>>   """ from userRecordsTemp ps insert overwrite table users
>>>   partition(idPartitioner, dtPartitioner) select ps.userId, ps.userRecord,
>>>   ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner
>>>   """.stripMargin)
>>>
>>> On Fri, Jun 10, 2016 at 12:10 AM, Bijay Pathak <bijay.pat...@cloudwick.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Looks like you are hitting this:
>>>> https://issues.apache.org/jira/browse/HIVE-11940.
>>>>
>>>> Thanks,
>>>> Bijay
>>>>
>>>> On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Can you provide a code snippet of how you are populating the target table from the temp table?
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> On 9 June 2016 at 23:43, swetha kasireddy <swethakasire...@gmail.com> wrote:
>>>>>
>>>>>> No, I am reading the data from HDFS, transforming it, registering the data in a temp table using registerTempTable, and then doing an insert overwrite using Spark SQL's HiveContext.
>>>>>>
>>>>>> On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> How are you doing the insert? From an existing table?
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> On 9 June 2016 at 21:16, Stephen Boesch <java...@gmail.com> wrote:
>>>>>>>
>>>>>>>> How many workers (/cpu cores) are assigned to this job?
>>>>>>>>
>>>>>>>> 2016-06-09 13:01 GMT-07:00 SRK <swethakasire...@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> How can I insert data into 2000 partitions (directories) of ORC/Parquet at a time using Spark SQL? It does not seem to be performant when I try to insert 2000 directories of Parquet/ORC using Spark SQL. Has anyone faced this issue?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-into-2000-partitions-directories-of-ORC-parquet-at-a-time-using-Spark-SQL-tp27132.html
>>>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.