Re: pyspark EOFError after calling map

2016-04-22 Thread Pete Werner
Oh great, thank you for clearing that up.

On Fri, Apr 22, 2016 at 5:15 PM, Davies Liu  wrote:

> This exception is already handled; it is just noisy and should be muted.


Re: pyspark EOFError after calling map

2016-04-22 Thread Davies Liu
This exception is already handled; it is just noisy and should be muted.
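
For context, here is a rough sketch (a paraphrase, not the exact Spark source) of what read_int in spark-1.6.1/python/pyspark/serializers.py is doing at the point the traceback shows: when the JVM closes the socket to an idle Python worker, the 4-byte length read comes back empty and EOFError is raised, which daemon.py then catches.

import struct

def read_int(stream):
    # An empty read means the JVM side closed the worker's socket;
    # the resulting EOFError is caught upstream in daemon.py, so the
    # traceback is noisy but harmless.
    length = stream.read(4)
    if not length:
        raise EOFError
    return struct.unpack("!i", length)[0]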



pyspark EOFError after calling map

2016-04-13 Thread Pete Werner
Hi,

I am new to Spark and PySpark.

I am reading a small CSV file (~40k rows) into a DataFrame:

from pyspark.sql import functions as F
from pyspark.sql import Row                # assumed import, needed for Row(...)
from pyspark.mllib.linalg import Vectors   # assumed import, needed for Vectors.dense(...)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('/tmp/sm.csv')
df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0))
df2 = df.map(lambda x: Row(label=float(x[0]),
                           features=Vectors.dense(x[1:]))).toDF()

I get an error that does not occur every single time, but does happen pretty
regularly:

>>> df2.show(1)
+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,0.0,0.0,0.0,...|  0.0|
+--------------------+-----+
only showing top 1 row

>>> df2.count()
41999

>>> df2.show(1)
+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.0,0.0,0.0,0.0,...|  0.0|
+--------------------+-----+
only showing top 1 row

>>> df2.count()
41999

>>> df2.show(1)
Traceback (most recent call last):
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int
    raise EOFError
EOFError
+--------------------+---------+
|            features|    label|
+--------------------+---------+
|[0.0,0.0,0.0,0.0,...|4700734.0|
+--------------------+---------+
only showing top 1 row

Once that EOFError has been raised, I will not see it again until I do
something that requires interacting with the Spark server.

When I call df2.count(), it shows the [Stage xxx] progress prompt, which is
what I mean by interacting with the Spark server.

Anything that triggers that seems to eventually raise the EOFError again the
next time I do something with df2.

It does not seem to happen with df (as opposed to df2), so it seems like it
must be something happening in the df.map() line.
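
A quick way to see the difference, as a sketch assuming Spark 1.6 behaviour
(where DataFrame.map() just delegates to the underlying RDD): df stays a
JVM-backed DataFrame, while df.map(...) returns a PipelinedRDD whose lambda
runs in Python worker processes, so only actions on df2 go through those
workers.

>>> type(df)
<class 'pyspark.sql.dataframe.DataFrame'>
>>> type(df.map(lambda x: x))
<class 'pyspark.rdd.PipelinedRDD'>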

-- 

Pete Werner
Data Scientist
Freelancer.com

Level 20
680 George Street
Sydney NSW 2000

e: pwer...@freelancer.com
p:  +61 2 8599 2700
w: http://www.freelancer.com