Hi Swetha,

One option is to use a Hive release with the above issue fixed, i.e. Hive 2.0 or
Cloudera CDH's Hive 1.2, which carries the fix. One thing to remember is that what
matters is not the Hive you have installed but the Hive version Spark uses
internally, which in Spark 1.6 is Hive 1.2 as of now.
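
For what it's worth, the Hive metastore version Spark talks to is controlled by
Spark configuration rather than by your Hive installation. A minimal sketch of
pinning it in Spark 1.6 (the property values here are assumptions, adjust them to
your cluster):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: tell Spark 1.6 which Hive metastore client version to use and
// where to get the corresponding jars ("maven" downloads them; a classpath of
// Hive jars also works).
val conf = new SparkConf()
  .setAppName("hive-metastore-version")
  .set("spark.sql.hive.metastore.version", "1.2.1")
  .set("spark.sql.hive.metastore.jars", "maven")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)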

The workaround I used for this issue was to write the DataFrame directly with the
DataFrame writer and then create the Hive table on top of that location; with that
change my processing time went down from 4+ hours to just under 1 hour.


data_frame.write.partitionBy('idPartitioner','dtPartitoner').orc("path/to/final/location")

Also note that the ORC format is supported only with HiveContext, not the plain SQLContext.
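
For completeness, here is a minimal sketch of that workaround (the table name
users, the schema, and the path are placeholders taken from the snippets in this
thread; MSCK REPAIR TABLE is just one way to register the partition directories
that partitionBy produces):

import org.apache.spark.sql.hive.HiveContext

// Sketch only: write the DataFrame straight to ORC, partitioned on the two keys,
// then define an external Hive table over that location and register the partitions.
// Assumes an existing SparkContext sc and a DataFrame data_frame with the columns below.
val hiveContext = new HiveContext(sc)   // ORC requires HiveContext, not plain SQLContext

data_frame.write
  .partitionBy("idPartitioner", "dtPartitioner")
  .orc("path/to/final/location")

hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING)
    |PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING)
    |STORED AS ORC
    |LOCATION 'path/to/final/location'""".stripMargin)

// partitionBy writes Hive-style directories (idPartitioner=.../dtPartitioner=...),
// so MSCK REPAIR TABLE can pick them up as partitions of the external table.
hiveContext.sql("MSCK REPAIR TABLE users")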

Thanks,
Bijay


On Mon, Jun 13, 2016 at 11:41 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:

> Hi Mich,
>
> Following is  a sample code snippet:
>
>
> val userDF =
>   userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId",
>     "userRecord").persist()
> System.out.println(" userRecsDF.partitions.size" +
>   userRecsDF.partitions.size)
>
> userDF.registerTempTable("userRecordsTemp")
>
> sqlContext.sql("SET hive.default.fileformat=Orc  ")
> sqlContext.sql("set hive.enforce.bucketing = true; ")
> sqlContext.sql("set hive.enforce.sorting = true; ")
> sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS users (userId
> STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING,
> dtPartitioner STRING)   stored as ORC LOCATION '/user/userId/userRecords' ")
> sqlContext.sql(
>   """ from userRecordsTemp ps   insert overwrite table users
> partition(idPartitioner, dtPartitioner)  select ps.userId, ps.userRecord,
> ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner
> """.stripMargin)
>
>
> On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Bijay,
>>
>> If I am hitting this issue,
>> https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done?
>> Is upgrading to a higher version of Hive the only solution?
>>
>> Thanks!
>>
>> On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Following is  a sample code snippet:
>>>
>>>
>>> val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner",
>>>   "userId", "userRecord").persist()
>>> System.out.println(" userRecsDF.partitions.size" +
>>>   userRecsDF.partitions.size)
>>>
>>> userDF.registerTempTable("userRecordsTemp")
>>>
>>> sqlContext.sql("SET hive.default.fileformat=Orc  ")
>>> sqlContext.sql("set hive.enforce.bucketing = true; ")
>>> sqlContext.sql("set hive.enforce.sorting = true; ")
>>> sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS users (userId
>>> STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING,
>>> dtPartitioner STRING)   stored as ORC LOCATION '/user/userId/userRecords' "
>>> )
>>> sqlContext.sql(
>>>   """ from userRecordsTemp ps   insert overwrite table users
>>> partition(idPartitioner, dtPartitioner)  select ps.userId, ps.userRecord,
>>> ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner
>>> """.stripMargin)
>>>
>>>
>>>
>>>
>>> On Fri, Jun 10, 2016 at 12:10 AM, Bijay Pathak <
>>> bijay.pat...@cloudwick.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Looks like you are hitting this:
>>>> https://issues.apache.org/jira/browse/HIVE-11940.
>>>>
>>>> Thanks,
>>>> Bijay
>>>>
>>>>
>>>>
>>>> On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Can you provide a code snippet of how you are populating the target
>>>>> table from the temp table?
>>>>>
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 9 June 2016 at 23:43, swetha kasireddy <swethakasire...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> No, I am reading the data from HDFS, transforming it, registering
>>>>>> the data in a temp table using registerTempTable, and then doing insert
>>>>>> overwrite using Spark SQL's hiveContext.
>>>>>>
>>>>>> On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> how are you doing the insert? from an existing table?
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn
>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9 June 2016 at 21:16, Stephen Boesch <java...@gmail.com> wrote:
>>>>>>>
>>>>>>>> How many workers (/cpu cores) are assigned to this job?
>>>>>>>>
>>>>>>>> 2016-06-09 13:01 GMT-07:00 SRK <swethakasire...@gmail.com>:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> How can I insert data into 2000 partitions (directories) of
>>>>>>>>> ORC/Parquet at a
>>>>>>>>> time using Spark SQL? It does not seem to be performant when I try
>>>>>>>>> to insert into
>>>>>>>>> 2000 directories of Parquet/ORC using Spark SQL. Did anyone face
>>>>>>>>> this issue?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
