Thank you for your answer.
Since Spark 1.6.0, it is possible to partition a DataFrame using hash partitioning with repartition:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
I have also sorted a DataFrame, and it uses range partitioning in that case.
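For concreteness, a minimal sketch of both cases (not from the original mails), assuming a local SparkSession named spark and a toy DataFrame with a single "key" column; with the 1.6 API you would go through SQLContext instead:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("partition-info").getOrCreate()
import spark.implicits._

val df = (1 to 1000).toDF("key")

// Hash partitioning: rows are shuffled into 8 partitions by the hash of "key".
val hashed = df.repartition(8, col("key"))

// Global sort: Spark plans a range-partitioning exchange before sorting each partition.
val sorted = df.sort(col("key"))
```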
Since DataFrames represent more or less a plan of execution, they do not have partition information as such, I think.
You could, however, call dataFrame.rdd to force it to create a physical plan that results in an actual RDD, and then query that RDD for partition info.
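A hedged sketch of what that could look like, reusing the hypothetical df from the example above:

```scala
// Force a physical plan and get the underlying RDD[Row].
val rdd = df.sort(col("key")).rdd

println(rdd.getNumPartitions)   // how many partitions the plan produced
println(rdd.partitioner)        // typically None for an RDD obtained from a DataFrame

// Per-partition details (here: row counts) have to be computed explicitly.
val counts = rdd.mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size))).collect()
counts.foreach { case (i, n) => println(s"partition $i: $n rows") }
```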
On Thu, Jul 7, 2016 at 4:24 AM,
Using partitioning with DataFrames, how can we retrieve information about partitions? Partition bounds, for example.
Thanks,
Shaira
2016-07-07 6:30 GMT+02:00 Koert Kuipers :
> spark does keep some information on the partitions of an RDD, namely the
> partitioning/partitioner.
spark does keep some information on the partitions of an RDD, namely the
partitioning/partitioner.
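For example (a small sketch, not part of the original mails, assuming a SparkContext named sc): once a pair RDD has been partitioned, the partitioner stays attached and later key-based operations can reuse it.

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))
val byKey = pairs.partitionBy(new HashPartitioner(4))

println(byKey.partitioner)         // Some(...) -- the partitioner is remembered
println(byKey.partitions.length)   // 4

// reduceByKey picks up the existing partitioner, so no extra shuffle is needed here.
val sums = byKey.reduceByKey(_ + _)
println(sums.partitioner)
```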
GroupSorted is an extension for key-value RDDs that also keeps track of the ordering, allowing for faster joins, non-reduce-type operations on very large groups of values per key, etc.
see here:
Hi,
Why doesn't Spark keep information about the structure of the RDDs or the
partitions within RDDs? Say that I use repartitionAndSortWithinPartitions,
which results in sorted partitions. With sorted partitions, lookups should
be super fast (binary search?), yet I still need to go through the