One useful thing to do when you run into unexpected slowness is to run 'jstack' a few times on the driver and executors and see if there is any particular hotspot in the Spark SQL code.
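Concretely, a sampling loop along these lines works (a sketch; the pid, interval, and sample count are placeholders you would adjust):

```shell
# Take a few thread dumps of the driver JVM a few seconds apart, then
# look for Spark SQL frames that recur across samples -- those are the
# likely hotspots. $DRIVER_PID is a placeholder for the driver's pid
# (find it with `jps`); repeat the same on each executor.
for i in 1 2 3; do
  jstack "$DRIVER_PID" > "jstack-driver-$i.txt"
  sleep 5
done

# Count the most frequently sampled Spark SQL stack frames.
grep -h 'org\.apache\.spark\.sql' jstack-driver-*.txt \
  | sed 's/^[[:space:]]*at //' | sort | uniq -c | sort -rn | head
```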
Also, it seems like a better option here might be to use the new applySchema API
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L126>
that has been added for the 1.1 release. I'd be curious how this helps your
performance.

On Thu, Aug 21, 2014 at 1:31 PM, Yin Huai <huaiyin....@gmail.com> wrote:

> I have not profiled this part. But I think one possible cause is
> allocating an array for every inner struct for every row (every struct
> value is represented by a Spark SQL row). I will play with it later and
> see what I find.
>
> On Tue, Aug 19, 2014 at 9:01 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>
>> Hey guys,
>>
>> I'm using Spark 1.0.2 in AWS with 8 x c3.xlarge machines. I am
>> working with a subset of the GDELT dataset (57 columns, > 250 million
>> rows, but my subset is only 4 million) and trying to query it with
>> Spark SQL.
>>
>> Since a CSV importer isn't available, my first thought was to use
>> nested case classes (since Scala has a limit of 22 fields, plus there
>> are lots of repeated fields in GDELT). The case classes look like
>> this:
>>
>> case class ActorInfo(Code: String,
>>                      Name: String,
>>                      CountryCode: String,
>>                      KnownGroupCode: String,
>>                      EthnicCode: String,
>>                      Religion1Code: String,
>>                      Religion2Code: String,
>>                      Type1Code: String,
>>                      Type2Code: String,
>>                      Type3Code: String)
>>
>> case class GeoInfo(`Type`: Int, FullName: String, CountryCode: String,
>>                    ADM1Code: String, Lat: Float,
>>                    `Long`: Float, FeatureID: Int)
>>
>> case class GDeltRow(EventId: Int, Day: Int, MonthYear: Int, Year: Int,
>>                     FractionDate: Float,
>>                     Actor1: ActorInfo, Actor2: ActorInfo,
>>                     IsRootEvent: Byte, EventCode: String,
>>                     EventBaseCode: String, EventRootCode: String,
>>                     QuadClass: Int, GoldsteinScale: Float,
>>                     NumMentions: Int, NumSources: Int, NumArticles: Int,
>>                     AvgTone: Float,
>>                     Actor1Geo: GeoInfo, Actor2Geo: GeoInfo,
>>                     ActionGeo: GeoInfo, DateAdded: String)
>>
>> Then I use sc.textFile(...)
>> to parse the CSV into an RDD[GDeltRow].
>>
>> I can query these records without caching. However, if I attempt to
>> cache using registerAsTable() and then sqlContext.cacheTable(...), it
>> is extremely slow (takes 1 hour!!).
>>
>> Any queries using them are also extremely slow.
>>
>> I had tested Spark SQL using a flat structure (no nesting) on a
>> different dataset, and both the caching and the queries were extremely
>> fast.
>>
>> Thinking that this was an issue with the case classes, I saved them to
>> Parquet files and used sqlContext.parquetFile(...), but the slowness
>> is the same. This makes sense, since the internal structure of
>> SchemaRDDs is basically the same. In both cases, for both Parquet and
>> case classes, the schema is the same.
>>
>> Has anybody else experienced this slowness with nested structures? Is
>> this a known problem that is being worked on?
>>
>> The only workarounds I can think of are to convert to JSON, which is
>> tedious, or to construct Parquet files manually (also tedious).
>>
>> thanks,
>> Evan
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
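For reference, the applySchema route suggested above looks roughly like this
against a 1.1 build (a sketch: the schema lists only a few illustrative
columns rather than the full 57-column GDELT layout, and the input path,
tab delimiter, and column indices are assumptions, not verified against
the actual files):

```scala
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Build the schema programmatically instead of via nested case classes.
// Only a few of GDELT's columns are shown; extend the list as needed.
val schema = StructType(Seq(
  StructField("EventId", IntegerType, nullable = true),
  StructField("Day", IntegerType, nullable = true),
  StructField("AvgTone", FloatType, nullable = true)))

// Parse each line straight into a flat Row. GDELT is tab-delimited;
// the path and column indices here are placeholders.
val rows = sc.textFile("s3n://bucket/gdelt-subset.csv").map { line =>
  val f = line.split('\t')
  Row(f(0).toInt, f(1).toInt, f(34).toFloat)
}

val gdelt = sqlContext.applySchema(rows, schema)
gdelt.registerTempTable("gdelt")
sqlContext.cacheTable("gdelt")
```

This sidesteps both the 22-field case class limit and the per-row
allocation of inner-struct rows that Yin describes, since every column
lands in one flat Row.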