Hi,
FYI: There is a bug in Java mapPartitions, SPARK-3369
(https://issues.apache.org/jira/browse/SPARK-3369). In Java, results from
mapPartitions and similar functions must fit in memory. See the example
below: it returns a List.
Lukas
On 1.9.2014 00:50, Matei Zaharia wrote:
mapPartitions just gives you an Iterator of the values in each partition, and
lets you return an Iterator of outputs.
Assume I define a partitioner like
import org.apache.spark.Partitioner;

/**
 * Partition on the first letter of the key.
 */
public class PartitionByStart extends Partitioner {
    @Override public int numPartitions() {
        return 26;
    }
    @Override public int getPartition(final Object key) {
        // Cut off in the original message; one plausible body, assuming String keys:
        char first = Character.toLowerCase(((String) key).charAt(0));
        return Math.max(0, Math.min(25, first - 'a'));
    }
}
Partitioners also work in local mode, the only question is how to see which
data fell into each partition, since most RDD operations hide the fact that
it's partitioned. You can do rdd.glom().collect() -- the glom() operation turns
an RDD of elements of type T into an RDD of List<T>, with one list per partition.
BTW you can also use rdd.partitions() to get a list of Partition objects and
see how many there are.
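To make the glom() picture concrete without a Spark cluster at hand, here is a plain-Java sketch (Spark is not assumed on the classpath; GlomSketch and its getPartition helper are illustrative names, not Spark API). It applies the first-letter rule by hand and groups keys the way glom().collect() would, one list per partition:

```java
import java.util.ArrayList;
import java.util.List;

// Spark-free sketch: apply the first-letter partitioning rule by hand and
// group the input the way glom().collect() would, one List per partition.
public class GlomSketch {
    // Same rule as the PartitionByStart example above; assumes String keys.
    static int getPartition(String key) {
        char first = Character.toLowerCase(key.charAt(0));
        return Math.max(0, Math.min(25, first - 'a'));
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "avocado", "banana", "cherry"};
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < 26; i++) partitions.add(new ArrayList<>());
        for (String k : keys) partitions.get(getPartition(k)).add(k);
        // Partition 0 holds the 'a' keys, partition 1 the 'b' keys, etc.
        System.out.println(partitions.get(0)); // [apple, avocado]
        System.out.println(partitions.get(1)); // [banana]
        System.out.println(partitions.get(2)); // [cherry]
    }
}
```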
On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com)
wrote:
Partitioners also work in local mode, the only question is how to see which
data fell into each partition.
Is there a sample of how to do this? I see 1.1 is out but cannot find samples
of mapPartitions. A Java sample would be very useful.
On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
In 1.1, you'll be able to get all of these properties using sortByKey, and
then mapPartitions on top to iterate through the key-value pairs.
Matei,
it is good to hear that the restriction that keys need to fit in memory no
longer applies to combineByKey. However, join requiring keys to fit in
memory is still a big deal to me. Does it apply to both sides of the join,
or only one (while the other side is streaming)?
On Sat, Aug 30,
Just to be clear, no operation requires all the keys to fit in memory, only
the values for each specific key: all the values for one individual key need
to fit, but the system can spill to disk across keys. Right now this applies
to both sides of the join, unless you do a broadcast join by hand with
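The "broadcast join by hand" idea can be sketched without Spark: hold the small side in a HashMap (the only part that must fit in memory) and stream the large side past it one record at a time. This is a plain-Java illustration under that assumption; BroadcastJoinSketch and its join method are hypothetical names, not Spark API:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hand-rolled broadcast join: the small side lives in a HashMap (this is the
// part that must fit in memory); the large side is streamed through it.
public class BroadcastJoinSketch {
    static List<Map.Entry<String, String>> join(
            Map<String, String> smallSide,
            Iterable<Map.Entry<String, String>> largeSide) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> row : largeSide) {
            String match = smallSide.get(row.getKey());
            if (match != null) {
                // Emit the joined pair; values are concatenated for illustration.
                out.add(new AbstractMap.SimpleEntry<>(row.getKey(),
                        row.getValue() + "|" + match));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("a", "x");
        List<Map.Entry<String, String>> large = List.of(
                new AbstractMap.SimpleEntry<>("a", "1"),
                new AbstractMap.SimpleEntry<>("b", "2"));
        System.out.println(join(small, large)); // [a=1|x]
    }
}
```

In Spark you would broadcast the map and do this pass inside mapPartitions, so only the small side is held in memory on each executor.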
mapPartitions just gives you an Iterator of the values in each partition, and
lets you return an Iterator of outputs. For instance, take a look at
https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java#L694.
Matei
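For readers without the Spark source handy, here is a Spark-free sketch of the Iterator-in / Iterator-out contract described above; like the linked JavaAPISuite test, it collapses each partition to the sum of its elements. MapPartitionsSketch is an illustrative name, and note that materializing the result in a List is exactly the memory issue SPARK-3369 (mentioned at the top of this thread) is about:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Spark-free sketch of the mapPartitions contract: the function receives all
// elements of one partition as an Iterator and returns an Iterator of outputs.
public class MapPartitionsSketch {
    static Iterator<Integer> sumPartition(Iterator<Integer> partition) {
        int sum = 0;
        while (partition.hasNext()) sum += partition.next();
        // Returning a materialized List's iterator is fine for one number,
        // but for large outputs a lazy Iterator avoids the SPARK-3369 problem.
        return List.of(sum).iterator();
    }

    public static void main(String[] args) {
        // Pretend the RDD [1, 2, 3, 4] was split into two partitions.
        Iterator<Integer> p0 = Arrays.asList(1, 2).iterator();
        Iterator<Integer> p1 = Arrays.asList(3, 4).iterator();
        System.out.println(sumPartition(p0).next()); // 3
        System.out.println(sumPartition(p1).next()); // 7
    }
}
```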
On August 31, 2014 at 12:26:51 PM,
When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the same
machine (thread)
2) All keys received by a specific machine (thread) will be received in
sorted order
3) These conditions will hold even if the values associated with a given key
do not fit in memory
In 1.1, you'll be able to get all of these properties using sortByKey, and then
mapPartitions on top to iterate through the key-value pairs. Unfortunately
sortByKey does not let you control the Partitioner, but it's fairly easy to
write your own version that does if this is important.
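The sortByKey-then-mapPartitions pattern can be sketched in plain Java: once a partition's pairs arrive in key order, a single sequential pass can detect key boundaries and fold each key's run of values, which is what the Hadoop-style guarantees above are used for. SortedPassSketch and collapseRuns are illustrative names, not Spark API:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the pass you would run inside mapPartitions after sortByKey:
// pairs arrive in key order, so one scan can sum each key's run of values.
public class SortedPassSketch {
    static List<String> collapseRuns(List<Map.Entry<String, Integer>> sortedPairs) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        int sum = 0;
        for (Map.Entry<String, Integer> p : sortedPairs) {
            if (!p.getKey().equals(currentKey)) {
                // Key boundary: emit the finished run before starting the next.
                if (currentKey != null) out.add(currentKey + ":" + sum);
                currentKey = p.getKey();
                sum = 0;
            }
            sum += p.getValue();
        }
        if (currentKey != null) out.add(currentKey + ":" + sum);
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                new AbstractMap.SimpleEntry<>("a", 1),
                new AbstractMap.SimpleEntry<>("a", 2),
                new AbstractMap.SimpleEntry<>("b", 5));
        System.out.println(collapseRuns(pairs)); // [a:3, b:5]
    }
}
```

Because the pass only keeps one key's running state at a time, it never needs all of a key's values in memory at once, unlike collecting values into a list per key.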