Re: How to use a custom partitioner in a dataframe in Spark

Koert Kuipers Thu, 18 Feb 2016 12:33:33 -0800

although it is not a bad idea to write data out partitioned, and then use a
merge join when reading it back in, this currently isn't even easily doable
with rdds because when you read an rdd from disk the partitioning info is
lost. re-introducing a partitioner at that point causes a shuffle defeating
the purpose.


On Thu, Feb 18, 2016 at 1:49 PM, Rishi Mishra <rmis...@snappydata.io> wrote:

> Michael,
> Is there any specific reason why DataFrames does not have partitioners
> like RDDs ? This will be very useful if one is writing custom datasources ,
> which keeps data in partitions. While storing data one can pre-partition
> the data at Spark level rather than at the datasource.
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Thu, Feb 18, 2016 at 3:50 AM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> So suppose I have a bunch of userIds and I need to save them as parquet
>> in database. I also need to load them back and need to be able to do a join
>> on userId. My idea is to partition by userId hashcode first and then on
>> userId. So that I don't have to deal with any performance issues because of
>> a number of small files and also to be able to scan faster.
>>
>>
>> Something like ...df.write.format("parquet").partitionBy( "userIdHash"
>> , "userId").mode(SaveMode.Append).save("userRecords");
>>
>> On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <
>> swethakasire...@gmail.com> wrote:
>>
>>> So suppose I have a bunch of userIds and I need to save them as parquet
>>> in database. I also need to load them back and need to be able to do a join
>>> on userId. My idea is to partition by userId hashcode first and then on
>>> userId.
>>>
>>>
>>>
>>> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Can you describe what you are trying to accomplish?  What would the
>>>> custom partitioner be?
>>>>
>>>> On Tue, Feb 16, 2016 at 1:21 PM, SRK <swethakasire...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How do I use a custom partitioner when I do a saveAsTable in a
>>>>> dataframe.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Swetha
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to use a custom partitioner in a dataframe in Spark

Reply via email to