
Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas

Oh, glad to know it's that simple!

Patrick, in your last comment did you mean filter in? As in I start with
one year of data and filter it so I have one day left? I'm assuming in that
case the empty partitions would be for all the days that got filtered out.

Nick

On Monday, March 24, 2014, Patrick Wendell pwend...@gmail.com wrote:

As Mark said you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work but the performance will
improve if you coalesce the partitions using rdd.coalesce().

This can happen for example if you do a highly selective filter on an
RDD. For instance, you filter out one day of data from a dataset of a
year.

- Patrick

On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra
m...@clearstorydata.com wrote:
It's much simpler: rdd.partitions.size

On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:

Hey there fellow Dukes of Data,

How can I tell how many partitions my RDD is split into?

I'm interested in knowing because, from what I gather, having a good
number of partitions is good for performance. If I'm looking to understand
how my pipeline is performing, say for a parallelized write out to HDFS,
knowing how many partitions an RDD has would be a good thing to check.

Is that correct?

I could not find an obvious method or property to see how my RDD is
partitioned. Instead, I devised the following thingy:

def f(idx, itr): yield idx

rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the
answer I'm looking for. Derp. :)

So in summary, should I care about how finely my RDDs are partitioned?
And how would I check on that?

Nick

View this message in context: How many partitions is my RDD split into?
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas

Mark,

This appears to be a Scala-only feature. :(

Patrick,

Are we planning to add this to PySpark?

Nick



Re: How many partitions is my RDD split into?

2014-03-24 Thread Shivaram Venkataraman

There is no direct way to get this in PySpark, but you can get it from the
underlying Java RDD. For example:

a = sc.parallelize([1,2,3,4], 2)
a._jrdd.splits().size()



Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell

Ah, we should just add this directly in PySpark - it's as simple as the
code Shivaram just wrote.

- Patrick
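
A minimal sketch of what such a helper could look like from the Python side.
`num_partitions` is a hypothetical name. The sketch assumes that PySpark
releases newer than this thread expose `RDD.getNumPartitions()`, and otherwise
falls back to the `_jrdd.splits().size()` call from Shivaram's message. It
takes the RDD as an argument, so nothing here starts a SparkContext.

```python
def num_partitions(rdd):
    """Return the number of partitions of a PySpark RDD (hypothetical helper)."""
    # Prefer the public method when the installed PySpark provides it.
    get_num = getattr(rdd, "getNumPartitions", None)
    if callable(get_num):
        return get_num()
    # Fallback: the workaround from this thread, via the wrapped Java RDD.
    # _jrdd is a private attribute, so this may break between versions.
    return rdd._jrdd.splits().size()
```

In a PySpark shell, `num_partitions(sc.parallelize([1, 2, 3, 4], 2))` would be
expected to return 2. Neither branch needs to run a job over the data.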


How many partitions is my RDD split into?

2014-03-23 Thread Nicholas Chammas

Hey there fellow Dukes of Data,

How can I tell how many partitions my RDD is split into?

I'm interested in knowing because, from what I gather, having a good number
of partitions is good for performance. If I'm looking to understand how my
pipeline is performing, say for a parallelized write out to HDFS, knowing
how many partitions an RDD has would be a good thing to check.

Is that correct?

I could not find an obvious method or property to see how my RDD is
partitioned. Instead, I devised the following thingy:

def f(idx, itr): yield idx

rdd = sc.parallelize([1, 2, 3, 4], 4)
rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the
answer I'm looking for. Derp. :)
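
For what it's worth, here is a rough plain-Python model (no Spark involved) of
why the snippet above returns the partition count. The names
`parallelize_model` and `map_partitions_with_index_model` are made up for
illustration, and the even slicing is an assumption, not necessarily how
PySpark splits the list. The point is only that `f` yields exactly one value
per partition, so counting the results counts the partitions.

```python
def f(idx, itr):
    # Same function as in the message: ignore the partition's data
    # and emit only the partition index, once per partition.
    yield idx


def parallelize_model(data, num_slices):
    """Hypothetical stand-in for sc.parallelize: split data into num_slices lists."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def map_partitions_with_index_model(partitions, func):
    """Apply func(index, iterator) to each partition and flatten the results."""
    out = []
    for idx, part in enumerate(partitions):
        out.extend(func(idx, iter(part)))
    return out


partitions = parallelize_model([1, 2, 3, 4], 4)
indices = map_partitions_with_index_model(partitions, f)
print(indices)       # [0, 1, 2, 3]
print(len(indices))  # 4, one element per partition
```

Empty partitions still contribute one index each in this model, so the count
reflects the number of partitions rather than the amount of data.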

So in summary, should I care about how finely my RDDs are partitioned? And
how would I check on that?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How many partitions is my RDD split into?

2014-03-23 Thread Mark Hamstra

It's much simpler: rdd.partitions.size



Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell

As Mark said you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work but the performance will
improve if you coalesce the partitions using rdd.coalesce().

This can happen for example if you do a highly selective filter on an
RDD. For instance, you filter out one day of data from a dataset of a
year.

- Patrick
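
To make the small-partitions point concrete, here is a simplified plain-Python
model (not Spark itself). It assumes a year of records stored one day per
partition, keeps a single day, and then merges neighbouring partitions roughly
the way a shuffle-free coalesce would. `filter_model` and `coalesce_model` are
hypothetical names, and the range-based grouping is an assumption about the
behaviour, not the library's code.

```python
def filter_model(partitions, predicate):
    """Filter each partition independently; the partition count does not change."""
    return [[x for x in part if predicate(x)] for part in partitions]


def coalesce_model(partitions, num_partitions):
    """Merge consecutive partitions into at most num_partitions groups (no shuffle)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(partitions)
    target = min(num_partitions, n)
    merged = []
    for i in range(target):
        start, end = i * n // target, (i + 1) * n // target
        merged.append([x for part in partitions[start:end] for x in part])
    return merged


# One partition per day of the year, 3 records per day: (day, record_id).
year = [[(day, r) for r in range(3)] for day in range(365)]

one_day = filter_model(year, lambda rec: rec[0] == 100)
print(len(one_day))                       # 365 partitions remain
print(sum(1 for p in one_day if not p))   # 364 of them are empty

compact = coalesce_model(one_day, 1)
print(len(compact), len(compact[0]))      # 1 partition holding 3 records
```

On a real RDD the corresponding calls would be `rdd.filter(...)` followed by
`rdd.coalesce(n)`. `rdd.glom().map(len).collect()` is one way to inspect
per-partition sizes, though it runs a job to do so.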

