Re: pyspark EOFError after calling map
Oh great, thank you for clearing that up. On Fri, Apr 22, 2016 at 5:15 PM, Davies Liuwrote: > This exception is already handled well, just noisy, should be muted. > > On Wed, Apr 13, 2016 at 4:52 PM, Pete Werner > wrote: > >> Hi >> >> I am new to spark & pyspark. >> >> I am reading a small csv file (~40k rows) into a dataframe. >> >> from pyspark.sql import functions as F >> df = >> sqlContext.read.format('com.databricks.spark.csv').options(header='true', >> inferschema='true').load('/tmp/sm.csv') >> df = df.withColumn('verified', F.when(df['verified'] == 'Y', >> 1).otherwise(0)) >> df2 = df.map(lambda x: Row(label=float(x[0]), >> features=Vectors.dense(x[1:]))).toDF() >> >> I get some weird error that does not occur every single time, but does >> happen pretty regularly >> >> >>> df2.show(1) >> ++-+ >> |features|label| >> ++-+ >> |[0.0,0.0,0.0,0.0,...|0.0| >> ++-+ >> only showing top 1 row >> >> >>> df2.count() >> 41999 >> >> >>> df2.show(1) >> ++-+ >> |features|label| >> ++-+ >> |[0.0,0.0,0.0,0.0,...|0.0| >> ++-+ >> only showing top 1 row >> >> >>> df2.count() >> 41999 >> >> >>> df2.show(1) >> Traceback (most recent call last): >> File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, >> in manager >> File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, >> in worker >> File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, >> in main >> if read_int(infile) == SpecialLengths.END_OF_STREAM: >> File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line >> 545, in read_int >> raise EOFError >> EOFError >> ++-+ >> |features|label| >> ++-+ >> |[0.0,0.0,0.0,0.0,...|4700734.0| >> ++-+ >> only showing top 1 row >> >> Once that EOFError has been raised, I will not see it again until I do >> something that requires interacting with the spark server >> >> When I call df2.count() it shows that [Stage xxx] prompt which is what I >> mean by it going to the spark server. >> >> Anything that triggers that seems to eventually end up giving the >> EOFError again when I do something with df2. >> >> It does not seem to happen with df (vs. df2) so seems like it must be >> something happening with the df.map() line. >> >> -- >> >> Pete Werner >> Data Scientist >> Freelancer.com >> >> Level 20 >> 680 George Street >> Sydney NSW 2000 >> >> e: pwer...@freelancer.com >> p: +61 2 8599 2700 >> w: http://www.freelancer.com >> >> > -- Pete Werner Data Scientist Freelancer.com Level 20 680 George Street Sydney NSW 2000 e: pwer...@freelancer.com p: +61 2 8599 2700 w: http://www.freelancer.com
Re: pyspark EOFError after calling map
This exception is already handled well, just noisy, should be muted. On Wed, Apr 13, 2016 at 4:52 PM, Pete Wernerwrote: > Hi > > I am new to spark & pyspark. > > I am reading a small csv file (~40k rows) into a dataframe. > > from pyspark.sql import functions as F > df = > sqlContext.read.format('com.databricks.spark.csv').options(header='true', > inferschema='true').load('/tmp/sm.csv') > df = df.withColumn('verified', F.when(df['verified'] == 'Y', > 1).otherwise(0)) > df2 = df.map(lambda x: Row(label=float(x[0]), > features=Vectors.dense(x[1:]))).toDF() > > I get some weird error that does not occur every single time, but does > happen pretty regularly > > >>> df2.show(1) > ++-+ > |features|label| > ++-+ > |[0.0,0.0,0.0,0.0,...|0.0| > ++-+ > only showing top 1 row > > >>> df2.count() > 41999 > > >>> df2.show(1) > ++-+ > |features|label| > ++-+ > |[0.0,0.0,0.0,0.0,...|0.0| > ++-+ > only showing top 1 row > > >>> df2.count() > 41999 > > >>> df2.show(1) > Traceback (most recent call last): > File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, > in manager > File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in > worker > File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, > in main > if read_int(infile) == SpecialLengths.END_OF_STREAM: > File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line > 545, in read_int > raise EOFError > EOFError > ++-+ > |features|label| > ++-+ > |[0.0,0.0,0.0,0.0,...|4700734.0| > ++-+ > only showing top 1 row > > Once that EOFError has been raised, I will not see it again until I do > something that requires interacting with the spark server > > When I call df2.count() it shows that [Stage xxx] prompt which is what I > mean by it going to the spark server. > > Anything that triggers that seems to eventually end up giving the EOFError > again when I do something with df2. > > It does not seem to happen with df (vs. df2) so seems like it must be > something happening with the df.map() line. > > -- > > Pete Werner > Data Scientist > Freelancer.com > > Level 20 > 680 George Street > Sydney NSW 2000 > > e: pwer...@freelancer.com > p: +61 2 8599 2700 > w: http://www.freelancer.com > >
pyspark EOFError after calling map
Hi I am new to spark & pyspark. I am reading a small csv file (~40k rows) into a dataframe. from pyspark.sql import functions as F df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('/tmp/sm.csv') df = df.withColumn('verified', F.when(df['verified'] == 'Y', 1).otherwise(0)) df2 = df.map(lambda x: Row(label=float(x[0]), features=Vectors.dense(x[1:]))).toDF() I get some weird error that does not occur every single time, but does happen pretty regularly >>> df2.show(1) ++-+ |features|label| ++-+ |[0.0,0.0,0.0,0.0,...|0.0| ++-+ only showing top 1 row >>> df2.count() 41999 >>> df2.show(1) ++-+ |features|label| ++-+ |[0.0,0.0,0.0,0.0,...|0.0| ++-+ only showing top 1 row >>> df2.count() 41999 >>> df2.show(1) Traceback (most recent call last): File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager File "spark-1.6.1/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker File "spark-1.6.1/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main if read_int(infile) == SpecialLengths.END_OF_STREAM: File "spark-1.6.1/python/lib/pyspark.zip/pyspark/serializers.py", line 545, in read_int raise EOFError EOFError ++-+ |features|label| ++-+ |[0.0,0.0,0.0,0.0,...|4700734.0| ++-+ only showing top 1 row Once that EOFError has been raised, I will not see it again until I do something that requires interacting with the spark server When I call df2.count() it shows that [Stage xxx] prompt which is what I mean by it going to the spark server. Anything that triggers that seems to eventually end up giving the EOFError again when I do something with df2. It does not seem to happen with df (vs. df2) so seems like it must be something happening with the df.map() line. -- Pete Werner Data Scientist Freelancer.com Level 20 680 George Street Sydney NSW 2000 e: pwer...@freelancer.com p: +61 2 8599 2700 w: http://www.freelancer.com