Re: How to estimate the size of dataframe using pyspark?

2016-04-12 Thread Buntu Dev
Thanks Davies, I've shared the code snippet and the dataset. Please let me
know if you need any other information.

On Mon, Apr 11, 2016 at 10:44 AM, Davies Liu  wrote:

> That's weird, DataFrame.count() should not require lots of memory on the
> driver. Could you provide a way to reproduce it (you could generate a fake
> dataset)?
>
> On Sat, Apr 9, 2016 at 4:33 PM, Buntu Dev  wrote:
> > I've allocated about 4g for the driver. For the count stage, I notice the
> > Shuffle Write to be 13.9 GB.
> >
> > On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR 
> wrote:
> >>
> >> What's the size of your driver?
> >> On Sat, 9 Apr 2016 at 20:33, Buntu Dev  wrote:
> >>>
> >>> Actually, df.show() works, displaying 20 rows, but df.count() is the one
> >>> that causes the driver to run out of memory. There are just 3 INT
> >>> columns.
> >>>
> >>> Any idea what could be the reason?
> >>>
> >>> On Sat, Apr 9, 2016 at 10:47 AM,  wrote:
> 
>  You seem to have a lot of columns :-) !
>  df.count() returns the number of rows in your data frame.
>  len(df.columns) gives the number of columns.
> 
>  Finally, I suggest you check the memory allocated to your driver and
>  increase it accordingly.
> 
>  Cheers,
> 
>  Ardo
> 
>  Sent from my iPhone
> 
>  > On 09 Apr 2016, at 19:37, bdev  wrote:
>  >
>  > I keep running out of memory on the driver when I attempt to do
>  > df.show().
>  > Can anyone let me know how to estimate the size of the dataframe?
>  >
>  > Thanks!
>  >
>  >
>  >
>  >
> >>>
> >>>
> >
>


Re: How to estimate the size of dataframe using pyspark?

2016-04-11 Thread Davies Liu
That's weird, DataFrame.count() should not require lots of memory on the
driver. Could you provide a way to reproduce it (you could generate a fake
dataset)?
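
A fake dataset along those lines takes only a few lines to build (a minimal
sketch, assuming a Spark 1.6-style sqlContext; the row count and column
expressions are arbitrary placeholders):

    # Build a DataFrame with three INT columns and count it.
    fake = sqlContext.range(0, 100 * 1000 * 1000).selectExpr(
        "cast(id as int) as a",
        "cast(id % 1000 as int) as b",
        "cast(id % 7 as int) as c")

    # count() runs as a distributed job; only the final number comes back to
    # the driver, so driver memory usage should stay small here.
    print(fake.count())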

On Sat, Apr 9, 2016 at 4:33 PM, Buntu Dev  wrote:
> I've allocated about 4g for the driver. For the count stage, I notice the
> Shuffle Write to be 13.9 GB.
>
> On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR  wrote:
>>
>> What's the size of your driver?
>> On Sat, 9 Apr 2016 at 20:33, Buntu Dev  wrote:
>>>
>>> Actually, df.show() works, displaying 20 rows, but df.count() is the one
>>> that causes the driver to run out of memory. There are just 3 INT
>>> columns.
>>>
>>> Any idea what could be the reason?
>>>
>>> On Sat, Apr 9, 2016 at 10:47 AM,  wrote:

 You seem to have a lot of columns :-) !
 df.count() returns the number of rows in your data frame.
 len(df.columns) gives the number of columns.

 Finally, I suggest you check the memory allocated to your driver and
 increase it accordingly.

 Cheers,

 Ardo

 Sent from my iPhone

 > On 09 Apr 2016, at 19:37, bdev  wrote:
 >
 > I keep running out of memory on the driver when I attempt to do
 > df.show().
 > Can anyone let me know how to estimate the size of the dataframe?
 >
 > Thanks!
 >
 >
 >
 >
>>>
>>>
>




Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Buntu Dev
I've allocated about 4g for the driver. For the count stage, I notice the
Shuffle Write to be 13.9 GB.
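
If driver memory turns out to be the limit, it has to be raised before the
application starts; a rough sketch (the 8g/2g values are only illustrative):

    # spark.driver.memory is only honored at launch time (e.g. via
    # spark-submit --driver-memory 8g); setting it on an already-running
    # context has no effect.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .setAppName("df-size-check")
            .set("spark.driver.memory", "8g")
            .set("spark.driver.maxResultSize", "2g"))  # cap on results collected to the driver
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)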

On Sat, Apr 9, 2016 at 11:43 AM, Ndjido Ardo BAR  wrote:

> What's the size of your driver?
> On Sat, 9 Apr 2016 at 20:33, Buntu Dev  wrote:
>
>> Actually, df.show() works, displaying 20 rows, but df.count() is the one
>> that causes the driver to run out of memory. There are just 3 INT
>> columns.
>>
>> Any idea what could be the reason?
>>
>> On Sat, Apr 9, 2016 at 10:47 AM,  wrote:
>>
>>> You seem to have a lot of columns :-) !
>>> df.count() returns the number of rows in your data frame.
>>> len(df.columns) gives the number of columns.
>>>
>>> Finally, I suggest you check the memory allocated to your driver and
>>> increase it accordingly.
>>>
>>> Cheers,
>>>
>>> Ardo
>>>
>>> Sent from my iPhone
>>>
>>> > On 09 Apr 2016, at 19:37, bdev  wrote:
>>> >
>>> > I keep running out of memory on the driver when I attempt to do
>>> df.show().
>>> > Can anyone let me know how to estimate the size of the dataframe?
>>> >
>>> > Thanks!
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>


Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread bdev
Thanks Mandar, I couldn't see anything under the 'Storage' section, but under
Executors I noticed the total memory to be 3.1 GB:

Executors (1)
Memory: 0.0 B Used (3.1 GB Total)







Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Ndjido Ardo BAR
What's the size of your driver?
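
For reference, the driver's configured memory can be read back from a running
application (a sketch, assuming an existing SparkContext sc; 1g is Spark's
default when nothing was set explicitly):

    print(sc.getConf().get("spark.driver.memory", "1g (default)"))
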
On Sat, 9 Apr 2016 at 20:33, Buntu Dev  wrote:

> Actually, df.show() works, displaying 20 rows, but df.count() is the one
> that causes the driver to run out of memory. There are just 3 INT
> columns.
>
> Any idea what could be the reason?
>
> On Sat, Apr 9, 2016 at 10:47 AM,  wrote:
>
>> You seem to have a lot of columns :-) !
>> df.count() returns the number of rows in your data frame.
>> len(df.columns) gives the number of columns.
>>
>> Finally, I suggest you check the memory allocated to your driver and
>> increase it accordingly.
>>
>> Cheers,
>>
>> Ardo
>>
>> Sent from my iPhone
>>
>> > On 09 Apr 2016, at 19:37, bdev  wrote:
>> >
>> > I keep running out of memory on the driver when I attempt to do
>> df.show().
>> > Can anyone let me know how to estimate the size of the dataframe?
>> >
>> > Thanks!
>> >
>> >
>> >
>> >
>>
>
>


Re: How to estimate the size of dataframe using pyspark?

2016-04-09 Thread Buntu Dev
Actually, df.show() works, displaying 20 rows, but df.count() is the one
that causes the driver to run out of memory. There are just 3 INT
columns.

Any idea what could be the reason?
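
For what it's worth, the two calls behave very differently (a sketch, assuming
the DataFrame is bound to df):

    # df.show() only brings a handful of rows back to the driver,
    # roughly equivalent to:
    preview = df.limit(20).collect()   # ~20 rows on the driver, cheap

    # df.count() runs a full distributed job and returns a single number, so a
    # driver OOM during count() usually points at something other than the
    # result itself (e.g. too little driver heap overall).
    n = df.count()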

On Sat, Apr 9, 2016 at 10:47 AM,  wrote:

> You seem to have a lot of columns :-) !
> df.count() returns the number of rows in your data frame.
> len(df.columns) gives the number of columns.
>
> Finally, I suggest you check the memory allocated to your driver and
> increase it accordingly.
>
> Cheers,
>
> Ardo
>
> Sent from my iPhone
>
> > On 09 Apr 2016, at 19:37, bdev  wrote:
> >
> > I keep running out of memory on the driver when I attempt to do
> df.show().
> > Can anyone let me know how to estimate the size of the dataframe?
> >
> > Thanks!
> >
> >
> >
> >
>