Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Whatever you do, the lion's share of the time is going to be taken by the insert into the Hive table. OK, check this: it is CSV files inserted into a Hive ORC table. This version uses Hive on the Spark engine and is written in Hive, executed via beeline. 1) Move the .CSV data into HDFS. 2) Create an external table.
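[A hedged sketch of that flow, rendered here as Spark SQL for consistency with the rest of the thread (the original was HiveQL run through beeline). The paths, table names, column list and compression codec are illustrative assumptions, not from the original post.]

// Step 1 (shell, outside Spark): move the CSV files into HDFS, e.g.
//   hdfs dfs -put /tmp/export/*.csv /data/staging/csv/
// Step 2: external table over the staging directory, then insert/select into ORC.
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS staging_csv (id STRING, record STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/staging/csv'""")
sqlContext.sql("""
  CREATE TABLE IF NOT EXISTS records_orc (id STRING, record STRING)
  STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "SNAPPY")""")
// One insert/select pulls every file sitting under the staging directory into ORC.
sqlContext.sql("INSERT OVERWRITE TABLE records_orc SELECT id, record FROM staging_csv")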

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
I am currently doing option 1, using the following, and it takes a lot of time. What's the advantage of doing option 2 and how do I do it? sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING) stored as ORC LOCATION

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Two alternatives for this ETL or ELT: 1. there is only one external ORC table and you do an insert overwrite into that external table through Spark SQL, or 2. the 14k files are loaded into a staging area/read directory and then an insert overwrite is done into an ORC table

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
Around 14000 partitions need to be loaded every hour. Yes, I tested this and it's taking a lot of time to load. A partition would look something like the following, which is further partitioned by userId with all the userRecords for that date inside it: 5 2016-05-20 16:03
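[The listing above is truncated in the archive. Based on the table definition quoted elsewhere in this thread, the on-disk layout would presumably look something like the following; the paths and userIds are illustrative assumptions.]

/user/users/datePartition=2016-05-20/idPartition=user123/part-00000
/user/users/datePartition=2016-05-20/idPartition=user456/part-00000
...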

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
By partition do you mean 14000 files loaded in each batch session (say daily)? Have you actually tested this?

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
The data is not very big, say 1 MB-10 MB at most per partition. What is the best way to insert these 14k partitions with decent performance? On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh wrote: > the acid question is how many rows are you going to insert in a

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
The acid-test question is how many rows you are going to insert in a batch session. BTW, if this is purely a SQL operation then you can do all of that in Hive running on the Spark engine. It will be very fast as well.

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Jörn Franke
14000 partitions seem to be way too many to be performant (except for large data sets). How much data does one partition contain? > On 22 May 2016, at 09:34, SRK wrote: > > Hi, > > In my Spark SQL query to insert data, I have around 14,000 partitions of > data which

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Sabarish Sasidharan
Can't you just reduce the amount of data you insert by applying a filter so that only a small set of idPartitions is selected? You could have multiple such inserts to cover all idPartitions. Does that help? Regards Sab On 22 May 2016 1:11 pm, "swetha kasireddy" wrote:
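[A minimal sketch of that idea, assuming the hourly data is available as a DataFrame `df` with the columns used elsewhere in the thread; the table name, temp-table name and batch size of 100 are assumptions, not from the original reply.]

import org.apache.spark.sql.functions.col

// Dynamic partitioning must be enabled for a multi-partition insert.
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// Distinct idPartition values present in this hour's data.
val idParts = df.select("idPartition").distinct().collect().map(_.getString(0))

// Insert 100 idPartitions per statement. Each partition lives wholly in one
// batch, so a dynamic-partition OVERWRITE only replaces that batch's partitions.
idParts.grouped(100).foreach { batch =>
  df.filter(col("idPartition").isin(batch: _*)).registerTempTable("batch_records")
  sqlContext.sql("""
    INSERT OVERWRITE TABLE records PARTITION (datePartition, idPartition)
    SELECT id, record, datePartition, idPartition FROM batch_records
  """)
}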

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
So, if I put in 1000 records at a time and the next 1000 records have some records with the same partition as the previous ones, then the data will be overwritten. How can I prevent overwriting valid data in this case? Could you post the example that you are talking about? What I am doing is
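[One way to avoid clobbering earlier batches, sketched here under the same assumed names as the examples above and not taken from the original reply: use INSERT INTO, which appends, rather than INSERT OVERWRITE, which replaces every partition that receives rows in the statement. The alternative is to make sure a partition never spans two batches, as in the filtering sketch above.]

sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// INSERT INTO appends to existing partitions instead of replacing them, so
// rows written by a previous batch into the same partition survive.
sqlContext.sql("""
  INSERT INTO TABLE records PARTITION (datePartition, idPartition)
  SELECT id, record, datePartition, idPartition FROM batch_records
""")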

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
OK, is the staging table used for staging only? You can create a staging *directory* where you put your data (you can put 100s of files there) and do an insert/select that takes data from the 100 files into your main ORC table. I have an example of 100s of CSV files insert/select from a

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
But how do I take 100 partitions at a time from the staging table? On Sun, May 22, 2016 at 11:26 AM, Mich Talebzadeh wrote: > ok so you still keep data as ORC in Hive for further analysis > > what I have in mind is to have an external table as staging table and do >

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
I am looking at ORC. I insert the data using the following queries: sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING) stored as ORC LOCATION '/user/users' ") sqlContext.sql(" orc.compress=
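[The second statement is cut off in the archive; it presumably sets ORC compression. A hedged sketch of how the table definition might look with that property folded in; the SNAPPY codec is an assumption.]

sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING)
  PARTITIONED BY (datePartition STRING, idPartition STRING)
  STORED AS ORC
  LOCATION '/user/users'
  TBLPROPERTIES ("orc.compress" = "SNAPPY")
""")
// The load itself would then be a dynamic-partition INSERT INTO/OVERWRITE,
// as in the sketches earlier in this thread.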

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Where is your base table and what format is it (Parquet, ORC, etc.)?

How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread SRK
Hi, In my Spark SQL query to insert data, I have around 14,000 partitions of data which seems to be causing memory issues. How can I insert the data for 100 partitions at a time to avoid any memory issues?