I am currently doing 1 using the following, and it takes a lot of time. What's the advantage of doing 2, and how do I do it?
    sqlContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING)
        |PARTITIONED BY (datePartition STRING, idPartition STRING)
        |STORED AS ORC
        |LOCATION '/user/users'""".stripMargin)

    sqlContext.sql("SET orc.compress=SNAPPY")

    sqlContext.sql(
      """FROM recordsTemp ps
        |INSERT OVERWRITE TABLE users PARTITION(datePartition, idPartition)
        |SELECT ps.id, ps.record, ps.datePartition, ps.idPartition""".stripMargin)

On Sun, May 22, 2016 at 12:47 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Two alternatives for this ETL or ELT:
>
>    1. There is only one external ORC table and you do insert overwrite
>    into that external table through Spark SQL; or
>    2. 14k files loaded into a staging area/read directory and then insert
>    overwrite into an ORC table and th
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 22 May 2016 at 20:38, swetha kasireddy <swethakasire...@gmail.com> wrote:
>
>> Around 14,000 partitions need to be loaded every hour. Yes, I tested this,
>> and it is taking a lot of time to load. A partition would look something
>> like the following, which is further partitioned by userId with all the
>> userRecords for that date inside it:
>>
>> 5 2016-05-20 16:03 /user/user/userRecords/dtPartitioner=2012-09-12
>>
>> On Sun, May 22, 2016 at 12:30 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> By partition, do you mean 14,000 files loaded in each batch session
>>> (say daily)?
>>>
>>> Have you actually tested this?
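[Editor's note: an alternative to setting ORC compression per session is to declare it on the table itself via TBLPROPERTIES at CREATE time. The sketch below only assembles the DDL string so the syntax can be checked; the `orc_table_ddl` helper is illustrative and not part of the thread. In Spark the resulting string would be passed to `sqlContext.sql(ddl)`.]

```python
def orc_table_ddl(table, location, columns, partitions, compress="SNAPPY"):
    """Build a Hive DDL string for an external, partitioned ORC table.

    Folding orc.compress into TBLPROPERTIES ties the compression codec to
    the table rather than to a session-level setting. Helper name and
    signature are illustrative only.
    """
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    parts = ", ".join(f"{name} {typ}" for name, typ in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"PARTITIONED BY ({parts}) "
        f"STORED AS ORC "
        f"LOCATION '{location}' "
        f'TBLPROPERTIES ("orc.compress"="{compress}")'
    )

# Mirrors the table definition from the snippet above.
ddl = orc_table_ddl(
    "records", "/user/users",
    [("id", "STRING"), ("record", "STRING")],
    [("datePartition", "STRING"), ("idPartition", "STRING")],
)
print(ddl)
```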
>>>
>>> On 22 May 2016 at 20:24, swetha kasireddy <swethakasire...@gmail.com>
>>> wrote:
>>>
>>>> The data is not very big, say 1 MB-10 MB at most per partition. What
>>>> is the best way to insert these 14k partitions with decent performance?
>>>>
>>>> On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> The acid question is how many rows you are going to insert in a batch
>>>>> session. BTW, if this is purely a SQL operation then you can do all
>>>>> that in Hive running on the Spark engine. It will be very fast as well.
>>>>>
>>>>> On 22 May 2016 at 20:14, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>>> 14,000 partitions seem to be way too many to be performant (except
>>>>>> for large data sets). How much data does one partition contain?
>>>>>>
>>>>>> > On 22 May 2016, at 09:34, SRK <swethakasire...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > In my Spark SQL query to insert data, I have around 14,000
>>>>>> partitions of data, which seems to be causing memory issues. How can
>>>>>> I insert the data for 100 partitions at a time to avoid any memory
>>>>>> issues?
>>>>>> >
>>>>>> > --
>>>>>> > View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>>>>>> > Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>> >
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
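[Editor's note: the original question, inserting 100 partitions at a time, amounts to splitting the distinct partition values into fixed-size batches and issuing one filtered insert per batch. A minimal sketch of the batching logic follows; the partition values are synthetic stand-ins, the helper name is an assumption, and the Spark calls are shown only as comments since they need a live cluster.]

```python
def partition_batches(partition_values, batch_size=100):
    """Split the full list of partition values into fixed-size batches.

    Each batch would drive one INSERT OVERWRITE ... PARTITION statement
    filtered to just those values, keeping per-job memory bounded.
    """
    for i in range(0, len(partition_values), batch_size):
        yield partition_values[i:i + batch_size]

# Illustrative stand-in for ~14,000 distinct partition values
# (14 dates x 1,000 user ids), not real data from the thread.
all_partitions = [f"2016-05-{d:02d}/{u}" for d in range(1, 15)
                  for u in range(1000)]

batches = list(partition_batches(all_partitions, 100))

# For each batch, a filtered insert could then look like (not executed here):
#   values = ", ".join(f"'{p}'" for p in batch)
#   sqlContext.sql(f"FROM recordsTemp ps "
#                  f"INSERT OVERWRITE TABLE users "
#                  f"PARTITION(datePartition, idPartition) "
#                  f"SELECT ps.id, ps.record, ps.datePartition, "
#                  f"ps.idPartition WHERE ps.idPartition IN ({values})")

print(len(batches))  # 14,000 values in batches of 100 -> 140 jobs
```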