Let us take this for a ride. Simple code: it reads an existing table of 22 million rows stored as ORC and saves it as Parquet.
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("use oraclehadoop")
val s = HiveContext.table("sales2")
val sorted = s.sort("prod_id","cust_id","time_id","channel_id","promo_id")
sorted.count
sorted.save("oraclehadoop.sales3")

It will store the output on HDFS, in this case under the directory
/user/hduser/oraclehadoop.sales3/. Note that the subdirectory name
corresponds to the argument passed to save(). The data is written out by
Spark, and the number of partitions is, I gather, determined by Spark's own
code (I have not looked at it). By default the subdirectory on HDFS is
owned by the submitting user, under /user/<LINUX_USER_NAME>, and if you
list the partitions it will show something like below:

-rw-r--r--   2 hduser supergroup          0 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_SUCCESS
-rw-r--r--   2 hduser supergroup        743 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_common_metadata
-rw-r--r--   2 hduser supergroup     182639 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_metadata
-rw-r--r--   2 hduser supergroup      22962 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00000-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup      25698 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00001-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup      17210 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00002-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup      22398 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00003-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup      18105 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-00004-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet

Note that the Parquet metadata (_metadata and _common_metadata) is stored
in the same directory together with the compressed data partitions. This is
in contrast to Hive, where metadata is kept in the Hive metastore. In this
case Spark decided to create 200 data partitions (the listing above shows
only the first few); see the sketch further down for one way of controlling
that number.

If you want more control over the layout, you can get the data as a
DataFrame, register it as a temporary table, create the target table the
way you like it, and do an INSERT/SELECT from the temporary table. I
personally prefer to create the table myself in a format that I like (like
ORC below) and store it in Hive. Example:

var sqltext: String = ""
sqltext = """
CREATE TABLE IF NOT EXISTS oraclehadoop.sales3
(
  PROD_ID        bigint
, CUST_ID        bigint
, TIME_ID        timestamp
, CHANNEL_ID     bigint
, PROMO_ID       bigint
, QUANTITY_SOLD  decimal(10)
, AMOUNT_SOLD    decimal(10)
)
CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="SNAPPY",
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID",
  "orc.bloom.filter.fpp"="0.05",
  "orc.stripe.size"="268435456",
  "orc.row.index.stride"="10000")
"""
HiveContext.sql(sqltext)
sorted.registerTempTable("tmp")
sqltext = """
INSERT INTO oraclehadoop.sales3
SELECT * FROM tmp
"""
HiveContext.sql(sqltext)
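Once the INSERT/SELECT has completed, a quick count confirms the load went
through. A minimal check, using nothing beyond the table created above:

HiveContext.sql("SELECT COUNT(1) FROM oraclehadoop.sales3").show
// should report 22 million rows, matching sorted.count earlier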
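Going back to the 200 partitions: as far as I know that number comes from
spark.sql.shuffle.partitions, which defaults to 200 and governs how many
partitions a shuffle (the sort here) produces. A minimal sketch of two ways
to end up with fewer output files; the target name
oraclehadoop.sales3_coalesced is made up for illustration:

// lower the shuffle parallelism before the sort is executed
HiveContext.setConf("spark.sql.shuffle.partitions", "10")
val sorted10 = HiveContext.table("sales2").sort("prod_id","cust_id","time_id","channel_id","promo_id")
// or merge partitions just before writing, without a full reshuffle
sorted10.coalesce(4).save("oraclehadoop.sales3_coalesced")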
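And to sanity-check what save() wrote, the directory can be read straight
back as Parquet. Again just a quick sketch, using the path from the listing
above:

val check = HiveContext.read.parquet("/user/hduser/oraclehadoop.sales3")
check.printSchema   // columns should match sales2
check.count         // should match the 22 million rows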
HTH

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com


On 1 July 2016 at 08:43, shiv4nsh <shiva...@knoldus.com> wrote:

> Hey guys, I am using Apache Spark 1.5.2, and I am running a SQL query
> using the SQLContext. When I run the insert query it saves the data in
> partitions (as expected).
>
> I am just curious and want to know how these partitions are made and how
> the permissions on these partitions are assigned. Can we change them?
> Does it behave differently on HDFS?
>
> If someone can point me to the exact code in Spark, that would be
> beneficial.
>
> I have also posted it on StackOverflow:
> http://stackoverflow.com/questions/38138113/how-spark-makes-partition-when-we-insert-data-using-the-sql-query-and-how-the-p