[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2017-03-08 Thread tejasapatil

Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/15229
  
@carlos-verdes: Thanks for the information. This work has been moved under an 
umbrella JIRA (SPARK-19256), which has a proposal: 
https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit

I believe all your requirements are captured in the proposal. If not, let 
me know. Meanwhile, I will close this PR and re-open it when the right pieces 
are in place. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2017-03-06 Thread carlos-verdes

Github user carlos-verdes commented on the issue:

https://github.com/apache/spark/pull/15229
  
Hi @rxin,

Hive organizes table data at two levels: partitions and buckets.
Partitions map to directories on HDFS, for example:
```bash
/apps/hive/warehouse/model_table/date=6
```
Here `model_table` is the name of the table and `date` is the partition column.

Inside each partition directory there are n files (buckets). Hive lets you 
decide how many buckets to create and which column determines the rows stored 
in each one.

If you create a table like this on Hive:
```sql 
create table events (
  `timestamp` bigint,  -- backticks: timestamp is a reserved word in Hive
  userId string,
  event string
)
partitioned by (event_date int)
clustered by (userId) sorted by (userId, `timestamp`) into 10 buckets;
```

Then there will be only 10 files per partition, and within a partition all the 
events for one user will be in a single file (bucket), sorted by time. 
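
To make the bucket assignment concrete, here is a minimal sketch in plain Scala (no Spark or Hive dependency). It assumes a simplified model in which the bucket is the non-negative hash of `userId` modulo the bucket count, with Java's `String.hashCode` standing in for Hive's own hash function. The object name, helper name, and sample rows are hypothetical.

```scala
// Simplified model of `clustered by (userId) ... into 10 buckets`.
// Assumption: String.hashCode stands in for Hive's hash function.
object BucketSketch {
  val numBuckets = 10

  // Mask the sign bit so the result always falls in [0, buckets).
  def bucketFor(userId: String, buckets: Int = numBuckets): Int =
    (userId.hashCode & Int.MaxValue) % buckets

  def main(args: Array[String]): Unit = {
    // Hypothetical (userId, timestamp) rows for one partition.
    val events = Seq(("alice", 3L), ("bob", 1L), ("alice", 1L), ("carol", 2L), ("bob", 2L))

    // Every row of a given user maps to the same bucket.
    val byBucket = events.groupBy { case (user, _) => bucketFor(user) }

    // Within each bucket, rows are ordered by (userId, timestamp).
    byBucket.toSeq.sortBy(_._1).foreach { case (bucket, rows) =>
      println(s"bucket $bucket: ${rows.sorted.mkString(", ")}")
    }
  }
}
```

The point of the sketch is only that the bucket depends on `userId` alone, so a user's events never spread across several files of the same partition.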

If you insert data into this table with the following Hive query, you will see 
that the clustering policy is respected:
```sql
set hive.enforce.bucketing = true;  -- (Note: Not needed in Hive 2.x onward)
from  event_feed_source e
insert overwrite table events
partition (event_date = 20170307)
select e.*, 20170307   
where event_day = 20170307;
```

However, if you run the following insert with Spark:
```scala
sqlContext.sql("insert overwrite table events partition (event_date = 20170307) select e.*,1 from event_feed_source e")
```

You will see that the data is stored with the same partitioning as the 
source dataset, not with the bucketing declared on the table.

What is the benefit of respecting the Hive clustering policy?
The main benefits are avoiding shuffles and controlling the number of 
partitions.

For example, we have a pipeline that reads thousands of events per user and 
saves the result into another table (model), so the events table holds x times 
more data than the model table (imagine a factor of 10x).

First, if the source data is clustered properly, we can read all the events 
per user without a shuffle (that is, something like 
`events.groupBy(user).mapValues(_.sortBy(timestamp))` can be done without a 
shuffle).
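
As an illustration of why sorted buckets help here, the following plain-Scala sketch groups events per user in a single pass over one bucket, relying only on the rows already being sorted by `(userId, timestamp)`. It is a local model of the idea, not Spark's execution plan. The `Event` case class and `groupRuns` helper are hypothetical names.

```scala
// Sketch: per-user grouping over one bucket whose rows are already
// sorted by (userId, timestamp), as the table definition requires.
object SortedBucketScan {
  final case class Event(userId: String, timestamp: Long, event: String)

  // Single pass: consecutive rows with the same userId form one group.
  // No re-sort and no exchange of rows between buckets is needed,
  // because a user's rows are contiguous and already time-ordered.
  def groupRuns(sorted: Seq[Event]): Vector[(String, Vector[Event])] =
    sorted.foldLeft(Vector.empty[(String, Vector[Event])]) { (acc, e) =>
      acc.lastOption match {
        case Some((user, run)) if user == e.userId =>
          acc.init :+ (user -> (run :+ e)) // extend the current user's run
        case _ =>
          acc :+ (e.userId -> Vector(e))   // start a new user's run
      }
    }
}
```

If the input were not sorted by `userId`, the same user could show up in several runs, which is the situation that forces a shuffle or sort in the unbucketed case.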

The second point concerns generating the model RDD/Dataset from the events 
RDD/Dataset. Spark keeps the source partitioning (unless you indicate 
otherwise), so it writes 10 times as many files for the model table as needed, 
without respecting the Hive clustering policy.
This means we have 10x more partitions than needed, and queries over the model 
table are not "clustered", so every query requires a full scan (over 10 times 
the optimal number of partitions).

I hope this clarifies the point about Hive clustering ;)



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-10-13 Thread rxin

Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15229
  
@tejasapatil how does Hive store partitioning files?




[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15229
  
Merged build finished. Test PASSed.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15229
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66125/
Test PASSed.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15229
  
**[Test build #66125 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66125/consoleFull)**
 for PR 15229 at commit 
[`9b61e39`](https://github.com/apache/spark/commit/9b61e39b5a3a762414c0f31de2fbe4a33ab07d58).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15229
  
Merged build finished. Test FAILed.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-29 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15229
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66128/
Test FAILed.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-29 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15229
  
**[Test build #66128 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66128/consoleFull)**
 for PR 15229 at commit 
[`23986a8`](https://github.com/apache/spark/commit/23986a89787a7f9ba24ef5ccce842a13165d0d9d).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15229
  
Merged build finished. Test FAILed.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-23 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15229
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65862/
Test FAILed.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15229
  
**[Test build #65862 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65862/consoleFull)**
 for PR 15229 at commit 
[`8726cc6`](https://github.com/apache/spark/commit/8726cc6430cbeaf8c2eebd7cef40199a7c563218).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #15229: [SPARK-17654] [SQL] Propagate bucketing information for ...

2016-09-23 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15229
  
**[Test build #65862 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65862/consoleFull)**
 for PR 15229 at commit 
[`8726cc6`](https://github.com/apache/spark/commit/8726cc6430cbeaf8c2eebd7cef40199a7c563218).


