Hi Michael,
I have opened the following JIRA for this:
https://issues.apache.org/jira/browse/SPARK-4849
I am having a look at the code to see what can be done and then we can have
a discussion over the approach.
Let me know if you have any comments/suggestions.
Thanks
-Nitin
On Sun, Dec
I'm happy to discuss what it would take to make sure we can propagate this
information correctly. Please open a JIRA (and mention me in it).
Regarding including it in 1.2.1, it depends on how invasive the change ends
up being, but it is certainly possible.
On Thu, Dec 11, 2014 at 3:55 AM, nitin
Can we take this as a performance improvement task in Spark-1.2.1? I can help
contribute for this.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20623.html
It does not appear that the in-memory caching currently preserves the
information about the partitioning of the data so this optimization will
probably not work.
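
To make the point above concrete, here is a minimal sketch in plain Python. It is not Spark code and does not reflect how Spark SQL is implemented; the helpers distribute_by and count_per_partition are hypothetical names made up for illustration. It shows why hash-partitioning on a column lets a later aggregation on that column run inside each partition, and why that only helps if the engine still knows how the data was partitioned.

```python
from collections import defaultdict

def distribute_by(rows, key, num_partitions):
    """Hash-partition rows on `key` (a rough stand-in for DISTRIBUTE BY)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # All rows sharing the same key value land in the same partition.
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

def count_per_partition(partitions, key):
    """Count rows per key inside each partition; no rows cross partitions."""
    results = []
    for part in partitions:
        counts = defaultdict(int)
        for row in part:
            counts[row[key]] += 1
        results.append(dict(counts))
    return results

rows = [{"id": i % 5, "value": i} for i in range(20)]
partitions = distribute_by(rows, "id", num_partitions=3)
local_counts = count_per_partition(partitions, "id")

# Because each id lives in exactly one partition, the per-partition counts
# are already the final counts. If the partitioning information were lost
# (as with the cached data discussed above), an engine could not assume
# this and would have to redistribute the rows before aggregating.
for index, counts in enumerate(local_counts):
    print(index, counts)
```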
On Thu, Dec 4, 2014 at 8:42 PM, nitin nitin2go...@gmail.com wrote:
With some quick googling, I learnt that we can use distribute by
column_name in HiveQL to distribute data based on a column's values. My
question now is: if I use distribute by id, will there be any performance
improvements? Will I be able to avoid data movement in shuffle (Exchange
before