Let us take this for a ride. Simple code: it reads an existing table of 22 million rows stored as ORC and saves it as Parquet.
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("use oraclehadoop")
val s = HiveContext.table("sales2")
val sorted = s.sort("prod_id","cust_id","time_id","channel_id","promo_id")
sorted.count
sorted.save("oraclehadoop.sales3")

It will store the output on HDFS, in this case under the directory
/user/hduser/oraclehadoop.sales3/. Note that the subdirectory name
corresponds to the argument passed to save(). The data is written out by
Spark, and the number of partitions is, I gather, determined by Spark's own
code (I have not looked at it). By default the subdirectory on HDFS is
owned by the submitting user, under /user/<LINUX_USER_NAME>, and if you
list the partitions it will show something like below:

-rw-r--r--   2 hduser supergroup          0 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_SUCCESS
-rw-r--r--   2 hduser supergroup        743 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_common_metadata
-rw-r--r--   2 hduser supergroup     182639 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_metadata
-rw-r--r--   2 hduser supergroup      22962 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00000-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup      25698 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00001-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup      17210 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00002-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup      22398 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00003-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup      18105 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00004-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet

Note that the Parquet metadata (_metadata and _common_metadata) is stored
in the same directory together with the compressed data partitions. This is
in contrast to Hive, where metadata is kept in the Hive metastore. In this
case Spark decided to create 200 data partitions (the listing above shows
only the first few); see the sketch further down for one way of controlling
that number.

If you want more control over the layout, you can get the data as a
DataFrame, register it as a temporary table, create the target table the
way you like it, and do an INSERT/SELECT from the temporary table. I
personally prefer to create the table myself in a format that I like (like
ORC below) and store it in Hive. Example:

var sqltext: String = ""
sqltext = """
CREATE TABLE IF NOT EXISTS oraclehadoop.sales3
(
  PROD_ID        bigint
, CUST_ID        bigint
, TIME_ID        timestamp
, CHANNEL_ID     bigint
, PROMO_ID       bigint
, QUANTITY_SOLD  decimal(10)
, AMOUNT_SOLD    decimal(10)
)
CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="SNAPPY",
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID",
  "orc.bloom.filter.fpp"="0.05",
  "orc.stripe.size"="268435456",
  "orc.row.index.stride"="10000")
"""
HiveContext.sql(sqltext)
sorted.registerTempTable("tmp")
sqltext = """
INSERT INTO oraclehadoop.sales3
SELECT * FROM tmp
"""
HiveContext.sql(sqltext)
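Once the INSERT/SELECT has completed, a quick count confirms the load went
through. A minimal check, using nothing beyond the table created above:

HiveContext.sql("SELECT COUNT(1) FROM oraclehadoop.sales3").show
// should report 22 million rows, matching sorted.count earlier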
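Going back to the 200 partitions: as far as I know that number comes from
spark.sql.shuffle.partitions, which defaults to 200 and governs how many
partitions a shuffle (the sort here) produces. A minimal sketch of two ways
to end up with fewer output files; the target name
oraclehadoop.sales3_coalesced is made up for illustration:

// lower the shuffle parallelism before the sort is executed
HiveContext.setConf("spark.sql.shuffle.partitions", "10")
val sorted10 = HiveContext.table("sales2").sort("prod_id","cust_id","time_id","channel_id","promo_id")
// or merge partitions just before writing, without a full reshuffle
sorted10.coalesce(4).save("oraclehadoop.sales3_coalesced")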
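And to sanity-check what save() wrote, the directory can be read straight
back as Parquet. Again just a quick sketch, using the path from the listing
above:

val check = HiveContext.read.parquet("/user/hduser/oraclehadoop.sales3")
check.printSchema   // columns should match sales2
check.count         // should match the 22 million rows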
HTH

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com


On 1 July 2016 at 08:43, shiv4nsh <shiva...@knoldus.com> wrote:

> Hey guys, I am using Apache Spark 1.5.2, and I am running a SQL query
> using the SQLContext. When I run the insert query it saves the data in
> partitions (as expected).
>
> I am just curious and want to know how these partitions are made and how
> the permissions on these partitions are assigned. Can we change them?
> Does it behave differently on HDFS?
>
> If someone can point me to the exact code in Spark, that would be
> beneficial.
>
> I have also posted it on StackOverflow:
> http://stackoverflow.com/questions/38138113/how-spark-makes-partition-when-we-insert-data-using-the-sql-query-and-how-the-p