Re: Question regarding structured data and partitions

2016-07-07 Thread tan shai
Thank you for your answer. Since Spark 1.6.0, it is possible to partition a DataFrame using hash partitioning with repartition: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame. I have also sorted a DataFrame, and it uses range partitioning in the
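
A minimal sketch of the two operations referenced above, assuming an existing DataFrame df with a column named "key" (both names are hypothetical):

    import org.apache.spark.sql.functions.col

    // Hash partitioning: repartition(numPartitions, exprs) hashes the given
    // expression to decide which of the requested partitions each row lands in.
    val hashed = df.repartition(8, col("key"))

    // A global sort inserts a range-partitioning exchange under the hood, so
    // each output partition ends up holding a contiguous range of keys.
    val sorted = df.sort(col("key"))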

Re: Question regarding structured data and partitions

2016-07-07 Thread Koert Kuipers
Since DataFrames represent more or less a plan of execution, they do not have partition information as such, I think? You could, however, do dataFrame.rdd to force it to create a physical plan that results in an actual RDD, and then query the RDD for partition info. On Thu, Jul 7, 2016 at 4:24 AM,
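
A rough sketch of querying the RDD side this way, again with a hypothetical DataFrame df:

    // Going through .rdd forces a physical plan backed by a real RDD,
    // which does carry partition metadata.
    val rdd = df.rdd

    // Number of physical partitions.
    println(rdd.getNumPartitions)

    // The partitioner, if any. For an RDD obtained from a DataFrame this is
    // typically None, because the hashing happened inside the SQL exchange
    // rather than through an RDD Partitioner object.
    println(rdd.partitioner)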

Re: Question regarding structured data and partitions

2016-07-07 Thread tan shai
Using partitioning with DataFrames, how can we retrieve information about partitions? Partition bounds, for example. Thanks, Shaira 2016-07-07 6:30 GMT+02:00 Koert Kuipers : > Spark does keep some information on the partitions of an RDD, namely the > partitioning/partitioner.
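
One way to approximate those bounds today is to drop to the RDD and inspect each partition directly; a sketch, assuming a DataFrame df sorted on a Long column "key" (hypothetical names):

    import org.apache.spark.sql.functions.col

    // mapPartitionsWithIndex computes the min and max key seen in each physical
    // partition, which approximates the range-partition bounds of the sort.
    val bounds = df.sort(col("key")).rdd
      .mapPartitionsWithIndex { (idx, rows) =>
        val keys = rows.map(_.getAs[Long]("key")).toVector
        if (keys.isEmpty) Iterator.empty
        else Iterator((idx, keys.min, keys.max))
      }
      .collect()

    bounds.foreach { case (i, lo, hi) => println(s"partition $i: [$lo, $hi]") }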

Re: Question regarding structured data and partitions

2016-07-06 Thread Koert Kuipers
Spark does keep some information on the partitions of an RDD, namely the partitioning/partitioner. GroupSorted is an extension for key-value RDDs that also keeps track of the ordering, allowing for faster joins, non-reduce-type operations on very large groups of values per key, etc. See here:
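
For the plain-RDD side, the retained partitioner can be seen directly; a small sketch with made-up data, assuming a SparkContext sc (this does not use GroupSorted itself):

    import org.apache.spark.HashPartitioner

    // Hypothetical key-value RDD.
    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))

    // partitionBy attaches a Partitioner that Spark remembers, so a later join
    // or reduceByKey using the same partitioner can avoid another shuffle.
    val partitioned = pairs.partitionBy(new HashPartitioner(4))
    println(partitioned.partitioner)   // Some(org.apache.spark.HashPartitioner@...)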

Question regarding structured data and partitions

2016-07-06 Thread Omid Alipourfard
Hi, why doesn't Spark keep information about the structure of RDDs or the partitions within RDDs? Say that I use repartitionAndSortWithinPartitions, which results in sorted partitions. With sorted partitions, lookups should be super fast (binary search?), yet I still need to go through the
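
For reference, the pattern being described, as a sketch assuming a SparkContext sc and made-up key-value data:

    import org.apache.spark.HashPartitioner

    // Hypothetical key-value RDD.
    val pairs = sc.parallelize((1 to 1000).map(i => (i, s"value-$i")))

    // Each output partition is internally sorted by key, and the partitioner
    // is retained on the resulting RDD.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(8))

    // lookup can use the partitioner to visit only one partition, but within
    // that partition it still iterates over rows rather than binary-searching.
    println(sorted.lookup(42))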