Re: How to use a custom partitioner in a dataframe in Spark

2016-02-18 Thread Koert Kuipers
Although it is not a bad idea to write data out partitioned and then use a merge join when reading it back in, this currently isn't even easily doable with RDDs, because when you read an RDD from disk the partitioning info is lost. Re-introducing a partitioner at that point causes a shuffle

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-18 Thread Rishi Mishra
Michael, is there any specific reason why DataFrames do not have partitioners like RDDs? This would be very useful if one is writing custom data sources that keep data in partitions. While storing data, one could pre-partition it at the Spark level rather than at the data source. Regards,

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
So suppose I have a bunch of userIds and I need to save them as Parquet in a database. I also need to load them back and be able to do a join on userId. My idea is to partition by userId hashcode first and then on userId, so that I don't have to deal with any performance issues because of a

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
So suppose I have a bunch of userIds and I need to save them as Parquet in a database. I also need to load them back and be able to do a join on userId. My idea is to partition by userId hashcode first and then on userId. On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust
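
One possible reading of the idea above, as a rough Python sketch: derive a small, fixed number of buckets from a stable hash of userId and use the bucket as the on-disk partition column. The names `bucket_of`, `write_bucketed`, `NUM_BUCKETS` and the `bucket` column are hypothetical and not from the thread. The Spark part assumes PySpark 1.5 or later and a DataFrame with a `userId` column; PySpark is only imported when `write_bucketed` is called.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical bucket count; choose it to suit the data size


def bucket_of(user_id, num_buckets=NUM_BUCKETS):
    """Stable bucket for a userId: CRC32 of its UTF-8 bytes, modulo num_buckets."""
    return zlib.crc32(str(user_id).encode("utf-8")) % num_buckets


def write_bucketed(df, path, num_buckets=NUM_BUCKETS):
    """Write df (assumed to have a userId column) as Parquet, one directory per bucket."""
    from pyspark.sql import functions as F  # lazy import; requires PySpark

    # Add the bucket column using Spark's built-in crc32 function.
    bucketed = df.withColumn(
        "bucket", F.crc32(F.col("userId").cast("string")) % num_buckets
    )
    # DataFrameWriter.partitionBy lays the files out as path/bucket=<n>/...
    bucketed.write.partitionBy("bucket").parquet(path)
```

This sketch stops at the bucket level; adding userId as a second partition column would create one directory per user, which may or may not be acceptable. As noted elsewhere in the thread, reading the files back does not restore a partitioner, so a join on userId may still shuffle.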

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread Rishi Mishra
Unfortunately there is not any, at least till 1.5. I have not gone through the new Dataset API of 1.6. There is some basic support for Parquet, like partitionByColumn. If you want to partition your dataset in a certain way, you have to use an RDD to partition and convert that into a DataFrame before
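
The RDD route described above might look roughly like the following PySpark-style sketch. It is not code from the thread: `user_partition`, `repartition_by_user`, `NUM_PARTITIONS` and the `userId` column are hypothetical names, and `session` is assumed to be a SQLContext (Spark 1.x) or SparkSession, both of which provide `createDataFrame`.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count


def user_partition(user_id):
    """Custom partition function: maps a userId to a partition index."""
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_PARTITIONS


def repartition_by_user(session, df):
    """Partition df's rows by userId at the RDD level, then rebuild a DataFrame."""
    # Key each Row by userId so that RDD.partitionBy can route it.
    pairs = df.rdd.map(lambda row: (row["userId"], row))
    # This step shuffles the data according to the custom partition function.
    partitioned = pairs.partitionBy(NUM_PARTITIONS, user_partition)
    # Drop the keys and convert back, reusing the original schema.
    return session.createDataFrame(partitioned.values(), df.schema)
```

The resulting DataFrame keeps the physical layout produced by the shuffle, but whether later DataFrame operations can take advantage of it is a separate question that this sketch does not address.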

How to use a custom partitioner in a dataframe in Spark

2016-02-16 Thread SRK
Hi, How do I use a custom partitioner when I do a saveAsTable on a DataFrame? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html Sent from the Apache Spark User List