It is a simple text file. I'm not using SQL, just doing an rdd.count() on it. Does the bug affect it?
On Friday, February 27, 2015, Davies Liu <dav...@databricks.com> wrote:

> What is this dataset? Text file or parquet file?
>
> There is an issue with serialization in Spark SQL, which will make it
> very slow, see https://issues.apache.org/jira/browse/SPARK-6055, will
> be fixed very soon.
>
> Davies
>
> On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
> <guillaume.c....@gmail.com> wrote:
> > Hi Sean:
> >
> > Thanks for your feedback. Scala is much faster. The count is performed
> > in ~1 minute (vs. 17 min). I would expect Scala to be 2-5x faster, but
> > this gap seems to be more than that. Is that also your conclusion?
> >
> > Thanks.
> >
> > Best,
> >
> > Guillaume Guy
> > +1 919 - 972 - 8750
> >
> > On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> That's very slow, and there are a lot of possible explanations. The
> >> first one that comes to mind is: I assume your YARN and HDFS are on
> >> the same machines, but are you running executors on all HDFS nodes
> >> when you run this? If not, a lot of these reads could be remote.
> >>
> >> You have 6 executor slots, but your data exists in 96 blocks on HDFS.
> >> You could read with up to 96-way parallelism. You say you're
> >> CPU-bound, though, but normally I'd wonder if this was simply a case
> >> of under-using parallelism.
> >>
> >> I also wonder if the bottleneck is something to do with pyspark in
> >> this case; might be good to just try it in the spark-shell to check.
> >>
> >> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
> >> <guillaume.c....@gmail.com> wrote:
> >> > Dear Spark users:
> >> >
> >> > I want to see if anyone has an idea of the performance for a small
> >> > cluster.
> >> >
> >> > Reading from HDFS, what should be the performance of a count()
> >> > operation on a 10GB RDD with 100M rows using pyspark? I looked into
> >> > the CPU usage; all 6 cores are at 100%.
> >> >
> >> > Details:
> >> >
> >> > master yarn-client
> >> > num-executors 3
> >> > executor-cores 2
> >> > driver-memory 5g
> >> > executor-memory 2g
> >> > Distribution: Cloudera
> >> >
> >> > I also attached the screenshot.
> >> >
> >> > Right now, I'm at 17 minutes, which seems quite slow. Any idea what
> >> > decent performance with a similar configuration would look like?
> >> >
> >> > If it's way off, I would appreciate any pointers as to ways to
> >> > improve performance.
> >> >
> >> > Thanks.
> >> >
> >> > Best,
> >> >
> >> > Guillaume
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: user-h...@spark.apache.org

--
Best,
Guillaume Guy
+1 919 - 972 - 8750
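Sean's parallelism and throughput points can be made concrete with a quick back-of-the-envelope check. This is only a sketch built from the numbers quoted in the thread (10GB, 96 HDFS blocks, 3 executors x 2 cores, 17 minutes); the figures are illustrative, not measured:

```python
import math

# Numbers taken from the thread (approximate).
data_gb = 10            # RDD size on HDFS
hdfs_blocks = 96        # HDFS blocks == default input partitions for textFile
executor_slots = 3 * 2  # num-executors * executor-cores
runtime_s = 17 * 60     # observed pyspark count() time

# With 6 slots and 96 tasks, the stage runs in sequential "waves" of 6 tasks.
waves = math.ceil(hdfs_blocks / executor_slots)
print(f"task waves: {waves}")

# Aggregate and per-core scan rate implied by the 17-minute run.
total_mb_s = data_gb * 1024 / runtime_s
per_core_mb_s = total_mb_s / executor_slots
print(f"aggregate: {total_mb_s:.1f} MB/s, per core: {per_core_mb_s:.1f} MB/s")
```

At roughly 10 MB/s aggregate (under 2 MB/s per core), the job is nowhere near disk or network limits, which fits Sean's suspicion that pyspark overhead (Python deserialization per record) rather than I/O is the bottleneck; the fast spark-shell run supports that. Adding parallelism (more executors, or passing the `minPartitions` argument of `sc.textFile`) only helps once slots, not CPU per record, are the constraint.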