It is a simple text file.

I'm not using Spark SQL, just doing an rdd.count() on it. Does the bug affect it?
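For reference, the whole job is just this (a minimal sketch of the PySpark code; the HDFS path and app name below are placeholders, not the actual ones):

```python
from pyspark import SparkContext

sc = SparkContext(appName="count-example")  # placeholder app name
# Placeholder path; the real input is a ~10GB plain text file on HDFS.
rdd = sc.textFile("hdfs:///path/to/file.txt")
print(rdd.count())  # counts the records (lines) of the text file
sc.stop()
```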

On Friday, February 27, 2015, Davies Liu <dav...@databricks.com> wrote:

> What is this dataset? text file or parquet file?
>
> There is an issue with serialization in Spark SQL, which will make it
> very slow, see https://issues.apache.org/jira/browse/SPARK-6055, will
> be fixed very soon.
>
> Davies
>
> On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
> <guillaume.c....@gmail.com> wrote:
> > Hi Sean:
> >
> > Thanks for your feedback. Scala is much faster: the count is performed
> > in ~1 minute (vs. 17 min). I would expect Scala to be 2-5x faster, but
> > this gap seems to be more than that. Is that also your conclusion?
> >
> > Thanks.
> >
> >
> > Best,
> >
> > Guillaume Guy
> >  +1 919 - 972 - 8750
> >
> > On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> That's very slow, and there are a lot of possible explanations. The
> >> first one that comes to mind: I assume your YARN and HDFS are on the
> >> same machines, but are you running executors on all HDFS nodes when
> >> you run this? If not, a lot of these reads could be remote.
> >>
> >> You have 6 executor slots, but your data exists in 96 blocks on HDFS,
> >> so you could read with up to 96-way parallelism. You say you're
> >> CPU-bound, though; normally I'd wonder if this was simply a case of
> >> underusing parallelism.
> >>
> >> I also wonder if the bottleneck is something to do with pyspark in
> >> this case; might be good to just try it in the spark-shell to check.
> >>
> >> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
> >> <guillaume.c....@gmail.com> wrote:
> >> > Dear Spark users:
> >> >
> >> > I want to see if anyone has an idea of the performance for a small
> >> > cluster.
> >> >
> >> > Reading from HDFS, what should the performance be for a count()
> >> > operation on a 10GB RDD with 100M rows using pyspark? I looked at
> >> > the CPU usage: all 6 cores are at 100%.
> >> >
> >> > Details:
> >> >
> >> > master yarn-client
> >> > num-executors 3
> >> > executor-cores 2
> >> > driver-memory 5g
> >> > executor-memory 2g
> >> > Distribution: Cloudera
> >> >
> >> > I also attached the screenshot.
> >> >
> >> > Right now, I'm at 17 minutes, which seems quite slow. Any idea what
> >> > decent performance would look like with a similar configuration?
> >> >
> >> > If it's way off, I would appreciate any pointers as to ways to improve
> >> > performance.
> >> >
> >> > Thanks.
> >> >
> >> > Best,
> >> >
> >> > Guillaume
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
>


-- 

Best,

Guillaume Guy

+1 919 - 972 - 8750
