although it is not a bad idea to write data out partitioned, and then use a merge join when reading it back in, this currently isn't even easily doable with rdds because when you read an rdd from disk the partitioning info is lost. re-introducing a partitioner at that point causes a shuffle defeating the purpose.
On Thu, Feb 18, 2016 at 1:49 PM, Rishi Mishra <rmis...@snappydata.io> wrote: > Michael, > Is there any specific reason why DataFrames does not have partitioners > like RDDs ? This will be very useful if one is writing custom datasources , > which keeps data in partitions. While storing data one can pre-partition > the data at Spark level rather than at the datasource. > > Regards, > Rishitesh Mishra, > SnappyData . (http://www.snappydata.io/) > > https://in.linkedin.com/in/rishiteshmishra > > On Thu, Feb 18, 2016 at 3:50 AM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> So suppose I have a bunch of userIds and I need to save them as parquet >> in database. I also need to load them back and need to be able to do a join >> on userId. My idea is to partition by userId hashcode first and then on >> userId. So that I don't have to deal with any performance issues because of >> a number of small files and also to be able to scan faster. >> >> >> Something like ...df.write.format("parquet").partitionBy( "userIdHash" >> , "userId").mode(SaveMode.Append).save("userRecords"); >> >> On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy < >> swethakasire...@gmail.com> wrote: >> >>> So suppose I have a bunch of userIds and I need to save them as parquet >>> in database. I also need to load them back and need to be able to do a join >>> on userId. My idea is to partition by userId hashcode first and then on >>> userId. >>> >>> >>> >>> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust < >>> mich...@databricks.com> wrote: >>> >>>> Can you describe what you are trying to accomplish? What would the >>>> custom partitioner be? >>>> >>>> On Tue, Feb 16, 2016 at 1:21 PM, SRK <swethakasire...@gmail.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> How do I use a custom partitioner when I do a saveAsTable in a >>>>> dataframe. >>>>> >>>>> >>>>> Thanks, >>>>> Swetha >>>>> >>>>> >>>>> >>>>> -- >>>>> View this message in context: >>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html >>>>> Sent from the Apache Spark User List mailing list archive at >>>>> Nabble.com. >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>>> For additional commands, e-mail: user-h...@spark.apache.org >>>>> >>>>> >>>> >>> >> >