Re: How spark makes partition when we insert data using the Sql query, and how the permissions to the partitions is assigned.?

2016-07-01 Thread Mich Talebzadeh
Let us take this for a ride.

Simple code. It reads an existing table of 22 million rows stored as ORC and
saves it as Parquet.

val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
HiveContext.sql("use oraclehadoop")
val s = HiveContext.table("sales2")
// sort on the composite key; this triggers a shuffle
val sorted = s.sort("prod_id","cust_id","time_id","channel_id","promo_id")
sorted.count
// save() defaults to Parquet; the argument is treated as a path, not a table name
sorted.save("oraclehadoop.sales3")

It will store the output on HDFS, in this case under the directory
/user/hduser/oraclehadoop.sales3/

Note that the sub-directory name corresponds to the string passed to save().

The data is written by Spark, and the number of partitions is determined by
Spark itself (I have not looked at the relevant code).
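For the curious: the number of partitions after a shuffle in Spark SQL comes
from spark.sql.shuffle.partitions (200 by default), and for hash-based
exchanges each row's partition is a non-negative modulo of the key's hash
(a sort() uses range partitioning instead). A minimal plain-Scala sketch of
the hash arithmetic only - this is an illustration, not Spark's actual class:

```scala
object PartitionSketch {
  // Spark's HashPartitioner effectively computes:
  //   partition = nonNegativeMod(key.hashCode, numPartitions)
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    // Java/Scala % can return a negative value; shift it into [0, mod)
    if (raw < 0) raw + mod else raw
  }

  def partitionFor(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)
}
```

So with the default 200 shuffle partitions, a hash-partitioned key such as an
Int prod_id of 17 would land in partition 17.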

The sub-directory on HDFS will be owned by the submitting user by default
(here hduser:supergroup), and if you list the partitions it will show
something like below:

-rw-r--r--   2 hduser supergroup       0 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_SUCCESS
-rw-r--r--   2 hduser supergroup     743 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_common_metadata
-rw-r--r--   2 hduser supergroup  182639 2016-07-01 09:26 /user/hduser/oraclehadoop.sales3/_metadata
-rw-r--r--   2 hduser supergroup   22962 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-0-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup   25698 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-1-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup   17210 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-2-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup   22398 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-3-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
-rw-r--r--   2 hduser supergroup   18105 2016-07-01 09:23 /user/hduser/oraclehadoop.sales3/part-r-4-0ed867c3-0f33-4d97-9751-e6661d5dc5bc.gz.parquet
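On the permissions part of the question: Spark does not set these itself. The
files are created as the submitting user, and the mode shown above
(rw-r--r--) comes from the HDFS default umask (fs.permissions.umask-mode). If
you need different permissions, adjust them after the write. A hedged sketch,
assuming the path and owner from the listing above:

```shell
# Assumption: the dataset was written by user hduser to the path shown above.
# Spark inherits the HDFS umask; tighten access after the write if needed:
hdfs dfs -chmod -R 750 /user/hduser/oraclehadoop.sales3
hdfs dfs -chown -R hduser:supergroup /user/hduser/oraclehadoop.sales3
```

Alternatively, set fs.permissions.umask-mode cluster-wide (or per job via the
Hadoop configuration) before writing, so the files are created with the mode
you want in the first place.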

Note that the metadata is also stored in the output directory together with
the data partitions (the _metadata and _common_metadata files alongside the
compressed part files). This is in contrast to Hive, where metadata is kept
in the Hive metastore.

In this case Spark decides to create 200 data partitions, which is the
default value of spark.sql.shuffle.partitions (the sort triggers a shuffle).
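If 200 output files is more than you want, you can lower the shuffle
parallelism before the sort, or reduce the partition count just before
writing. A sketch against the Spark 1.5 API, reusing the names from the
example above (untested here; the output path is made up for illustration):

```scala
// Fewer partitions for subsequent Spark SQL shuffles (sorts, joins, etc.):
HiveContext.setConf("spark.sql.shuffle.partitions", "8")

// Or collapse the existing partitions, without a full shuffle, just before saving:
sorted.coalesce(8).write.parquet("oraclehadoop.sales3_few_files")
```

coalesce() only merges existing partitions; use repartition() instead if you
want a full shuffle that rebalances the data evenly across the new partitions.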

If you want more control over this, you can get the data as a DataFrame,
register it as a temporary table, create the target table the way you like
it, and then do an INSERT/SELECT from the temporary table.

I personally prefer to create the table myself in a format that I like
(like the ORC table below) and store it in Hive.

Example

var sqltext: String = ""
sqltext =
"""
 CREATE TABLE IF NOT EXISTS oraclehadoop.sales3
 (
  PROD_ID        bigint,
  CUST_ID        bigint,
  TIME_ID        timestamp,
  CHANNEL_ID     bigint,
  PROMO_ID       bigint,
  QUANTITY_SOLD  decimal(10),
  AMOUNT_SOLD    decimal(10)
)
CLUSTERED BY (PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="SNAPPY",
"orc.create.index"="true",
"orc.bloom.filter.columns"="PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID",
"orc.bloom.filter.fpp"="0.05",
"orc.stripe.size"="268435456",
"orc.row.index.stride"="1")
"""
HiveContext.sql(sqltext)
sorted.registerTempTable("tmp")
sqltext =
"""
INSERT INTO
oraclehadoop.sales3
SELECT * FROM tmp
"""
HiveContext.sql(sqltext)
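After the INSERT/SELECT it is worth a quick sanity check that everything
arrived. A hedged sketch in the same HiveContext session (untested here, and
assuming the table and temp-table names from the example above):

```scala
// Compare source and target row counts after the insert
val src = HiveContext.sql("SELECT COUNT(1) FROM tmp").collect()(0).getLong(0)
val tgt = HiveContext.sql("SELECT COUNT(1) FROM oraclehadoop.sales3").collect()(0).getLong(0)
assert(src == tgt, s"row count mismatch: $src vs $tgt")
```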


HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 July 2016 at 08:43, shiv4nsh wrote:

> Hey guys, I am using Apache Spark 1.5.2, and I am running SQL queries
> using the SQLContext. When I run an insert query it saves the data in
> partitions (as expected).
>
> I am just curious and want to know how these partitions are made and how
> the permissions to these partitions are assigned. Can we change them? Does
> it behave differently on HDFS?
>
> If someone can point me to the exact code in Spark, that would be
> beneficial.
>
> I have also posted it on StackOverflow:
> <http://stackoverflow.com/questions/38138113/how-spark-makes-partition-when-we-insert-data-using-the-sql-query-and-how-the-p>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-spark-makes-partition-when-we-insert-data-using-the-Sql-query-and-how-the-permissions-to-the-par-tp27256.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


How spark makes partition when we insert data using the Sql query, and how the permissions to the partitions is assigned.?

2016-07-01 Thread shiv4nsh
Hey guys, I am using Apache Spark 1.5.2, and I am running SQL queries using
the SQLContext. When I run an insert query it saves the data in partitions
(as expected).

I am just curious and want to know how these partitions are made and how the
permissions to these partitions are assigned. Can we change them? Does it
behave differently on HDFS?

If someone can point me to the exact code in Spark, that would be
beneficial.

I have also posted it on StackOverflow.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-spark-makes-partition-when-we-insert-data-using-the-Sql-query-and-how-the-permissions-to-the-par-tp27256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org