One useful thing to do when you run into unexpected slowness is to run
'jstack' a few times on the driver and executors and see if there is any
particular hotspot in the Spark SQL code.

Also, it seems like a better option here might be to use the new
applySchema API
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L126>
added for the 1.1 release.  I'd be curious whether it helps your
performance.
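
For example, a rough sketch against the 1.1 API (the schema here covers
only a few illustrative columns rather than your full 57, and the input
path is made up):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Nested structs are expressed as StructType fields instead of nested
// case classes, so the 22-field limit no longer applies.
val actorType = StructType(Seq(
  StructField("Code", StringType, true),
  StructField("Name", StringType, true)))

val schema = StructType(Seq(
  StructField("EventId", IntegerType, false),
  StructField("Day", IntegerType, false),
  StructField("Actor1", actorType, true)))

// Each input line becomes a Row whose fields match the schema; struct
// columns are nested Rows.
val rowRDD = sc.textFile("gdelt.csv").map(_.split("\t")).map { f =>
  Row(f(0).toInt, f(1).toInt, Row(f(2), f(3)))
}

val gdelt = sqlContext.applySchema(rowRDD, schema)
gdelt.registerTempTable("gdelt")
sqlContext.cacheTable("gdelt")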


On Thu, Aug 21, 2014 at 1:31 PM, Yin Huai <huaiyin....@gmail.com> wrote:

> I have not profiled this part, but I think one possible cause is
> allocating an array for every inner struct for every row (every struct
> value is represented by a Spark SQL row). I will play with it later and see
> what I find.
>
>
> On Tue, Aug 19, 2014 at 9:01 PM, Evan Chan <velvia.git...@gmail.com>
> wrote:
>
>> Hey guys,
>>
>> I'm using Spark 1.0.2 on AWS with 8 x c3.xlarge machines.  I am
>> working with a subset of the GDELT dataset (57 columns, over 250
>> million rows in full, though my subset is only 4 million rows) and
>> trying to query it with Spark SQL.
>>
>> Since a CSV importer isn't available, my first thought was to use
>> nested case classes (Scala 2.10 case classes are limited to 22 fields,
>> and GDELT has lots of repeated field groups).  The case classes look
>> like this:
>>
>> case class ActorInfo(Code: String, Name: String, CountryCode: String,
>>                      KnownGroupCode: String, EthnicCode: String,
>>                      Religion1Code: String, Religion2Code: String,
>>                      Type1Code: String, Type2Code: String,
>>                      Type3Code: String)
>>
>> case class GeoInfo(`Type`: Int, FullName: String, CountryCode: String,
>>                    ADM1Code: String, Lat: Float, `Long`: Float,
>>                    FeatureID: Int)
>>
>> case class GDeltRow(EventId: Int, Day: Int, MonthYear: Int, Year: Int,
>>                     FractionDate: Float,
>>                     Actor1: ActorInfo, Actor2: ActorInfo,
>>                     IsRootEvent: Byte, EventCode: String,
>>                     EventBaseCode: String, EventRootCode: String,
>>                     QuadClass: Int, GoldsteinScale: Float,
>>                     NumMentions: Int, NumSources: Int, NumArticles: Int,
>>                     AvgTone: Float,
>>                     Actor1Geo: GeoInfo, Actor2Geo: GeoInfo,
>>                     ActionGeo: GeoInfo, DateAdded: String)
>>
>> Then I use sc.textFile(...) to parse the CSV into an RDD[GDeltRow].
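>>
>> Roughly, the parsing looks like this (the column offsets and path here
>> are illustrative, not the exact GDELT layout):
>>
>> def parseActor(f: Array[String], i: Int): ActorInfo =
>>   ActorInfo(f(i), f(i+1), f(i+2), f(i+3), f(i+4), f(i+5), f(i+6),
>>     f(i+7), f(i+8), f(i+9))
>>
>> def parseGeo(f: Array[String], i: Int): GeoInfo =
>>   GeoInfo(f(i).toInt, f(i+1), f(i+2), f(i+3), f(i+4).toFloat,
>>     f(i+5).toFloat, f(i+6).toInt)
>>
>> val records = sc.textFile("s3n://mybucket/gdelt-subset.csv").map { line =>
>>   val f = line.split("\t", -1)
>>   GDeltRow(f(0).toInt, f(1).toInt, f(2).toInt, f(3).toInt, f(4).toFloat,
>>     parseActor(f, 5), parseActor(f, 15),
>>     f(25).toByte, f(26), f(27), f(28), f(29).toInt, f(30).toFloat,
>>     f(31).toInt, f(32).toInt, f(33).toInt, f(34).toFloat,
>>     parseGeo(f, 35), parseGeo(f, 42), parseGeo(f, 49), f(56))
>> }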
>>
>> I can query these records without caching.  However, if I attempt to
>> cache by calling registerAsTable() and then sqlContext.cacheTable(...),
>> it is extremely slow (it takes an hour!).
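>>
>> That is, roughly:
>>
>> import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD
>> records.registerAsTable("gdelt")
>> sqlContext.cacheTable("gdelt")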
>>
>> Any queries using them are also extremely slow.
>>
>> I had tested Spark SQL using a flat structure (no nesting) on a
>> different dataset and the caching and queries were both extremely
>> fast.
>>
>> Thinking that this was an issue with the case classes, I saved the
>> records to Parquet files and used sqlContext.parquetFile(...), but the
>> slowness is the same.  This makes sense, since the internal structure
>> of the SchemaRDDs is basically the same: the schema is identical for
>> both Parquet and case classes.
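>>
>> The Parquet round trip was roughly (table and file names illustrative):
>>
>> records.saveAsParquetFile("gdelt.parquet")
>> val fromParquet = sqlContext.parquetFile("gdelt.parquet")
>> fromParquet.registerAsTable("gdelt_pq")
>> sqlContext.cacheTable("gdelt_pq")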
>>
>> Has anybody else experienced this slowness with nested structures?  Is
>> this a known problem and being worked on?
>>
>> The only workarounds I can think of are converting to JSON, which is
>> tedious, or constructing Parquet files manually (also tedious).
>>
>> thanks,
>> Evan
>>
