Re: How many partitions is my RDD split into?
Oh, glad to know it's that simple!

Patrick, in your last comment did you mean filter in? As in, I start with one year of data and filter it so I have one day left? I'm assuming in that case the empty partitions would be for all the days that got filtered out.

Nick

On Monday, March 24, 2014, Patrick Wendell pwend...@gmail.com wrote:

> This can happen for example if you do a highly selective filter on an RDD. For instance, you filter out one day of data from a dataset of a year.
Re: How many partitions is my RDD split into?
Mark,

This appears to be a Scala-only feature. :(

Patrick,

Are we planning to add this to PySpark?

Nick
Re: How many partitions is my RDD split into?
There is no direct way to get this in PySpark, but you can get it from the underlying Java RDD. For example:

    a = sc.parallelize([1, 2, 3, 4], 2)
    a._jrdd.splits().size()
Re: How many partitions is my RDD split into?
Ah, we should just add this directly in PySpark - it's as simple as the code Shivaram just wrote.

- Patrick
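A hedged sketch of what such a convenience could look like from user code. `num_partitions` is a hypothetical helper name, not part of PySpark. It first tries `getNumPartitions()`, which later PySpark releases provide on RDDs, and otherwise falls back to the `_jrdd.splits().size()` call from Shivaram's reply. `_jrdd` is a private attribute, so that fallback is an assumption that may not hold on every version.

```python
def num_partitions(rdd):
    """Return how many partitions an RDD-like object has.

    Hypothetical helper: prefers the public getNumPartitions() method and
    falls back to the underlying Java RDD, as suggested in this thread.
    """
    getter = getattr(rdd, "getNumPartitions", None)
    if callable(getter):
        return getter()
    # Fallback relies on a private attribute and may break between versions.
    return rdd._jrdd.splits().size()


# Usage, assuming an existing SparkContext named sc:
#   a = sc.parallelize([1, 2, 3, 4], 2)
#   num_partitions(a)
```

Checking for the public method first keeps the private-attribute access as a last resort.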
How many partitions is my RDD split into?
Hey there fellow Dukes of Data,

How can I tell how many partitions my RDD is split into?

I'm interested in knowing because, from what I gather, having a good number of partitions is good for performance. If I'm looking to understand how my pipeline is performing, say for a parallelized write out to HDFS, knowing how many partitions an RDD has would be a good thing to check. Is that correct?

I could not find an obvious method or property to see how my RDD is partitioned. Instead, I devised the following thingy:

    def f(idx, itr): yield idx

    rdd = sc.parallelize([1, 2, 3, 4], 4)
    rdd.mapPartitionsWithIndex(f).count()

Frankly, I'm not sure what I'm doing here, but this seems to give me the answer I'm looking for. Derp. :)

So in summary, should I care about how finely my RDDs are partitioned? And how would I check on that?

Nick

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
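A rough illustration of why the snippet in this message returns the partition count: `f` yields exactly one value per partition, so counting the results counts the partitions. The sketch below imitates that behaviour with plain Python lists instead of Spark. `split_into_partitions` and `map_partitions_with_index` are hypothetical stand-ins written for this example, not PySpark functions, and the even slicing is an assumption that only loosely mirrors how `sc.parallelize` divides a list.

```python
def split_into_partitions(data, num_partitions):
    """Cut a list into num_partitions contiguous slices (an assumed layout)."""
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]


def map_partitions_with_index(partitions, func):
    """Call func(index, iterator) once per partition and collect what it yields."""
    results = []
    for idx, part in enumerate(partitions):
        results.extend(func(idx, iter(part)))
    return results


def f(idx, itr):
    # Ignore the partition's contents and emit only its index, so each
    # partition contributes exactly one element to the output.
    yield idx


partitions = split_into_partitions([1, 2, 3, 4], 4)
indices = map_partitions_with_index(partitions, f)
partition_count = len(indices)  # plays the role of .count()
```

Because `f` never looks at the records, even an empty partition still contributes one element, so the count reflects partitions rather than data.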
Re: How many partitions is my RDD split into?
It's much simpler: rdd.partitions.size
Re: How many partitions is my RDD split into?
As Mark said, you can actually access this easily. The main issue I've seen from a performance perspective is people having a bunch of really small partitions. This will still work, but the performance will improve if you coalesce the partitions using rdd.coalesce(). This can happen, for example, if you do a highly selective filter on an RDD. For instance, you filter out one day of data from a dataset of a year.

- Patrick
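To make the small-partition point concrete, here is a plain-Python simulation rather than real Spark code. It assumes, purely for illustration, a year of data laid out as one partition per day, and a filter that keeps only a single day. The helper names (`filter_partitions`, `coalesce`) and the layout are hypothetical; the `coalesce` here just merges neighbouring partitions to mimic the effect of `rdd.coalesce()`, not its actual implementation.

```python
def filter_partitions(partitions, predicate):
    """Apply a filter inside each partition; the partition count is unchanged."""
    return [[rec for rec in part if predicate(rec)] for part in partitions]


def coalesce(partitions, num_partitions):
    """Merge adjacent partitions into num_partitions larger ones (simplified)."""
    n = len(partitions)
    merged = []
    for i in range(num_partitions):
        group = partitions[i * n // num_partitions:(i + 1) * n // num_partitions]
        merged.append([rec for part in group for rec in part])
    return merged


# Assumed layout: 365 partitions, one per day, 100 records per day.
year = [[(day, i) for i in range(100)] for day in range(365)]

# Highly selective filter: keep only the records for day 42.
one_day = filter_partitions(year, lambda rec: rec[0] == 42)
empty = sum(1 for part in one_day if not part)

# The filtered data still has 365 partitions, 364 of them empty.
# Coalescing packs the surviving records into a single partition.
compact = coalesce(one_day, 1)
```

In this toy layout the filter leaves the partition count untouched, which is why coalescing afterwards helps.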