Although it is not a bad idea to write data out partitioned and then use a
merge join when reading it back in, this currently isn't easily doable even
with RDDs, because when you read an RDD from disk the partitioning info is
lost. Re-introducing a partitioner at that point causes a shuffle.
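To illustrate, here is a minimal sketch of that behaviour (assuming a live
SparkContext sc and a throwaway path; the names are made up):

import org.apache.spark.HashPartitioner

// Write a pair RDD out partitioned by key; the partitioner itself is not persisted.
val users = sc.parallelize(Seq((1L, "a"), (2L, "b"))).partitionBy(new HashPartitioner(8))
users.saveAsObjectFile("/tmp/users")

// Reading it back gives an RDD with no partitioner, so the co-location is forgotten...
val reloaded = sc.objectFile[(Long, String)]("/tmp/users")
println(reloaded.partitioner)  // None

// ...and re-introducing one triggers a full shuffle, even though the data
// on disk is already laid out by key.
val reshuffled = reloaded.partitionBy(new HashPartitioner(8))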
Michael,
Is there any specific reason why DataFrames do not have partitioners like
RDDs? This would be very useful if one is writing a custom data source that
keeps data in partitions. While storing data, one could pre-partition the
data at the Spark level rather than at the data source.
Regards,
So suppose I have a bunch of userIds and I need to save them as Parquet in
the database. I also need to load them back and be able to do a join on
userId. My idea is to partition by the userId hashcode first and then by
userId, so that I don't have to deal with any performance issues.
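For concreteness, a rough sketch of that idea (assuming a DataFrame users
with a string userId column and an existing sqlContext; the bucket count,
column name, and path are made up):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical bucketing UDF: map each userId into one of 256 hash buckets.
val bucket = udf((userId: String) => math.abs(userId.hashCode) % 256)
val bucketed = users.withColumn("bucket", bucket(col("userId")))

// partitionBy("bucket") creates one directory per bucket, so all rows for a
// given userId land in the same directory and can be located cheaply at read time.
bucketed.write.partitionBy("bucket").parquet("/tmp/users_by_bucket")

// Loading it back for the join; the bucket column comes back as a partition column.
val reloaded = sqlContext.read.parquet("/tmp/users_by_bucket")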
Unfortunately there is not any, at least up to 1.5. I have not gone through
the new Dataset API in 1.6. There is some basic support for Parquet, like
partitionBy on a column.
If you want to partition your dataset in a certain way, you have to use an
RDD to do the partitioning and then convert that into a DataFrame before
saving it.
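A minimal sketch of that workaround, assuming a DataFrame df with a long
userId column and an existing sqlContext (the partition count and path are
illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.Row

// Drop to the RDD level, key by userId, and apply the partitioner there.
val pairs = df.rdd.map(row => (row.getAs[Long]("userId"), row))
val partitioned = pairs.partitionBy(new HashPartitioner(64)).values

// Rebuild a DataFrame with the same schema; the physical layout of the RDD's
// partitions is preserved, even though Spark SQL no longer tracks the partitioner.
val partitionedDF = sqlContext.createDataFrame(partitioned, df.schema)
partitionedDF.write.parquet("/tmp/users_partitioned")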
Hi,
How do I use a custom partitioner when I do saveAsTable on a DataFrame?
Thanks,
Swetha