Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-15 Thread swetha kasireddy
Hi Mich,

No, I have not tried that. My requirement is to do the insert from an hourly
Spark batch job. How would it be different to try the insert with the Hive CLI
or beeline?

Thanks,
Swetha





Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread Mich Talebzadeh
Hi Swetha,

Have you actually tried doing this in Hive using Hive CLI or beeline?

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com





Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread Mich Talebzadeh
In all probability no user database has been created in Hive.

Create a database yourself:

sql("CREATE DATABASE IF NOT EXISTS test")

It would be helpful to grasp some of the basic concepts of Hive databases.
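
A minimal sketch (assuming a HiveContext named sqlContext, as elsewhere in this thread):

// Create the database if it is missing, switch to it, and confirm it is visible
sqlContext.sql("CREATE DATABASE IF NOT EXISTS test")
sqlContext.sql("USE test")
sqlContext.sql("SHOW DATABASES").show()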

HTH


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com





Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread swetha kasireddy
Hi Bijay,

This approach might not work for me, as I have to do partial
inserts/overwrites in a given table, and data_frame.write.partitionBy will
overwrite the entire table.
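
One possible workaround, sketched here under assumptions (the partition values are
illustrative, it reuses the users table layout from this thread, and it was not
tried by anyone on this list): overwrite only the affected partition directories
and register them with the metastore.

// Overwrite a single partition directory instead of the whole table
val subset = userDF.filter(
  userDF("idPartitioner") === "id1" && userDF("dtPartitioner") === "2016-06-14")
subset.drop("idPartitioner").drop("dtPartitioner")
  .write.mode("overwrite")
  .orc("/user/userId/userRecords/idPartitioner=id1/dtPartitioner=2016-06-14")
// Make the new files visible to Hive readers
sqlContext.sql("ALTER TABLE users ADD IF NOT EXISTS PARTITION " +
  "(idPartitioner='id1', dtPartitioner='2016-06-14')")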

Thanks,
Swetha


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread Sree Eedupuganti
Hi Spark users, I am new to Spark. I am trying to connect to Hive from Java
using a SparkContext, but I am unable to connect to the database: by executing
the code below I can see only the "default" database. Can anyone help me out?
What I need is a sample program for querying Hive from a Java SparkContext,
and I need to pass values like this:

userDF.registerTempTable("userRecordsTemp")

sqlContext.sql("SET hive.default.fileformat=Orc  ")
sqlContext.sql("set hive.enforce.bucketing = true; ")
sqlContext.sql("set hive.enforce.sorting = true; ")

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public static void main(String[] args) throws Exception {
    // Local SparkContext; a HiveContext is needed to reach the Hive metastore
    SparkConf sparkConf = new SparkConf().setAppName("SparkSQL").setMaster("local");
    SparkContext ctx = new SparkContext(sparkConf);
    HiveContext hiveql = new HiveContext(ctx);
    DataFrame df = hiveql.sql("show databases");
    df.show();
}

Any suggestions please? Thanks.
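
(A likely cause, though not confirmed in this thread: if hive-site.xml is not on
Spark's classpath, HiveContext creates a local Derby metastore that contains only
the "default" database. Copying the cluster's hive-site.xml into $SPARK_HOME/conf
usually lets the HiveContext see the real metastore.)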


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread Bijay Pathak
Hi Swetha,

One option is to use a Hive version with the above issue fixed, which is Hive 2.0,
or Cloudera CDH's Hive 1.2, which has the issue resolved. One thing to remember is
that what matters is not the Hive you have installed but the Hive that Spark uses,
which in Spark 1.6 is Hive 1.2 as of now.

The workaround I used for this issue was to write the dataframe directly using the
dataframe write method and to create the Hive table on top of that; with this
change my processing time went down from 4+ hrs to just under 1 hr.


data_frame.write.partitionBy('idPartitioner','dtPartitioner').orc("path/to/final/location")

And ORC format is supported with HiveContext only.
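
A fuller Scala sketch of this workaround, assuming the table, column, and path
names from Swetha's snippet (MSCK REPAIR is one way to register the partitions;
it is not necessarily what was used here):

// Write partitioned ORC files directly, bypassing the Hive insert path
userDF.write
  .mode("overwrite")
  .partitionBy("idPartitioner", "dtPartitioner")
  .orc("/user/userId/userRecords")

// Define an external table over those files, then pick up the partitions
sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING)
    |PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING)
    |STORED AS ORC LOCATION '/user/userId/userRecords'
  """.stripMargin)
sqlContext.sql("MSCK REPAIR TABLE users")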

Thanks,
Bijay


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
Hi Mich,

Following is  a sample code snippet:


val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist()
// DataFrame has no .partitions member; go through the underlying RDD
System.out.println("userRecsDF partitions: " + userRecsDF.rdd.partitions.size)

userDF.registerTempTable("userRecordsTemp")

// No trailing semicolons inside sqlContext.sql(...)
sqlContext.sql("SET hive.default.fileformat=Orc")
sqlContext.sql("SET hive.enforce.bucketing=true")
sqlContext.sql("SET hive.enforce.sorting=true")
sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING)
    |PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING)
    |STORED AS ORC LOCATION '/user/userId/userRecords'
  """.stripMargin)
sqlContext.sql(
  """FROM userRecordsTemp ps
    |INSERT OVERWRITE TABLE users PARTITION(idPartitioner, dtPartitioner)
    |SELECT ps.userId, ps.userRecord, ps.idPartitioner, ps.dtPartitioner
    |CLUSTER BY idPartitioner, dtPartitioner
  """.stripMargin)




Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
Hi Bijay,

If I am hitting this issue,
https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done?
Is upgrading to a higher version of Hive the only solution?

Thanks!



Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
Hi,

Following is  a sample code snippet:


val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId", "userRecord").persist()
// DataFrame has no .partitions member; go through the underlying RDD
System.out.println("userRecsDF partitions: " + userRecsDF.rdd.partitions.size)

userDF.registerTempTable("userRecordsTemp")

// No trailing semicolons inside sqlContext.sql(...)
sqlContext.sql("SET hive.default.fileformat=Orc")
sqlContext.sql("SET hive.enforce.bucketing=true")
sqlContext.sql("SET hive.enforce.sorting=true")
sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING)
    |PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING)
    |STORED AS ORC LOCATION '/user/userId/userRecords'
  """.stripMargin)
sqlContext.sql(
  """FROM userRecordsTemp ps
    |INSERT OVERWRITE TABLE users PARTITION(idPartitioner, dtPartitioner)
    |SELECT ps.userId, ps.userRecord, ps.idPartitioner, ps.dtPartitioner
    |CLUSTER BY idPartitioner, dtPartitioner
  """.stripMargin)






Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-10 Thread Bijay Pathak
Hello,

Looks like you are hitting this:
https://issues.apache.org/jira/browse/HIVE-11940.

Thanks,
Bijay





Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Mich Talebzadeh
Can you provide a code snippet of how you are populating the target table
from the temp table?


HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com





Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread swetha kasireddy
No, I am reading the data from HDFS, transforming it, registering the data
in a temp table using registerTempTable, and then doing the insert overwrite
using Spark SQL's HiveContext.



Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Mich Talebzadeh
How are you doing the insert? From an existing table?

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com





Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread swetha kasireddy
400 cores are assigned to this job.



Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Stephen Boesch
How many workers (/cpu cores) are assigned to this job?



How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread SRK
Hi,

How do I insert data into 2000 partitions (directories) of ORC/Parquet at a
time using Spark SQL? It does not seem to be performant when I try to insert
into 2000 directories of Parquet/ORC using Spark SQL. Did anyone face this issue?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-into-2000-partitions-directories-of-ORC-parquet-at-a-time-using-Spark-SQL-tp27132.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
