Re: sparkSQL thread safe?

Michael Armbrust Thu, 10 Jul 2014 16:52:24 -0700

Hey Ian,

Thanks for bringing these up!  Responses in-line:


> Just wondering if right now spark sql is expected to be thread safe on
> master?
> doing a simple hadoop file -> RDD -> schema RDD -> write parquet
> will fail in reflection code if i run these in a thread pool.
>

You are probably hitting SPARK-2178
<https://issues.apache.org/jira/browse/SPARK-2178> which is caused by
SI-6240 <https://issues.scala-lang.org/browse/SI-6240>.  We have a plan to
fix this by moving the schema introspection to compile time, using macros.


> The SparkSqlSerializer, seems to create a new Kryo instance each time it
> wants to serialize anything. I got a huge speedup when I had any
> non-primitive type in my SchemaRDD using the ResourcePool's from Chill for
> providing the KryoSerializer to it. (I can open an RB if there is some
> reason not to re-use them?)
>

Sounds like SPARK-2102 <https://issues.apache.org/jira/browse/SPARK-2102>.
There is no reason AFAIK not to reuse the instance. A PR would be greatly
appreciated!
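
For illustration, here is a minimal sketch of the reuse pattern under discussion, written in plain Python rather than Scala. It is not Chill's ResourcePool or the SparkSqlSerializer code: `ExpensiveSerializer`, `ResourcePool`, and `borrow` are hypothetical names, and the pool size of 4 is an arbitrary assumption. The idea is only that instances are handed back after use instead of being rebuilt for every value.

```python
import queue
import threading
from contextlib import contextmanager


class ExpensiveSerializer:
    """Hypothetical stand-in for a serializer that is costly to construct."""

    def serialize(self, value):
        return repr(value).encode("utf-8")


class ResourcePool:
    """Hypothetical bounded pool that hands out reusable instances."""

    def __init__(self, factory, size):
        self._factory = factory
        self._idle = queue.Queue(maxsize=size)
        self._lock = threading.Lock()
        self.created = 0  # how many instances the factory has built

    @contextmanager
    def borrow(self):
        try:
            resource = self._idle.get_nowait()  # reuse an idle instance
        except queue.Empty:
            resource = self._factory()  # none idle: build a new one
            with self._lock:
                self.created += 1
        try:
            yield resource
        finally:
            try:
                self._idle.put_nowait(resource)  # hand it back for reuse
            except queue.Full:
                pass  # pool already full: let this instance be dropped


pool = ResourcePool(ExpensiveSerializer, size=4)
for row in range(1000):
    with pool.borrow() as serializer:
        serializer.serialize(row)

print(pool.created)
```

In this sketch a single thread serializing 1000 values constructs only one serializer; with several threads, the number constructed depends on how many borrow at the same time.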


> With the Distinct Count operator there is no map-side operations, and a
> test to check for this. Is there any reason not to do a map side combine
> into a set and then merge the sets later? (similar to the approximate
> distinct count operator)
>

That's just not an optimization that we had implemented yet... but I've just
done it here <https://github.com/apache/spark/pull/1366> and it'll be in
master soon :)
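
As a rough illustration of the idea in the question (not the code in the pull request), a map-side combine for an exact distinct count can be sketched in plain Python. Each partition is first collapsed into a set of its own distinct values, and the partial sets are then merged before counting. The function names and the list-of-lists stand-in for partitions are assumptions made for this example.

```python
from functools import reduce


def partial_distinct(partition):
    """Map side: collapse one partition into the set of its distinct values."""
    return set(partition)


def merge_sets(left, right):
    """Reduce side: union two partial sets."""
    return left | right


def count_distinct(partitions):
    """Count distinct values across all partitions."""
    partials = [partial_distinct(p) for p in partitions]
    return len(reduce(merge_sets, partials, set()))


# Three hypothetical partitions with overlapping values.
partitions = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]
print(count_distinct(partitions))  # prints 5
```

The point of the map-side step is that duplicates within a partition are removed before anything is shuffled, so only the partial sets need to be moved and merged.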


> Another thing while i'm mailing.. the 1.0.1 docs have a section like:
> "
> // Note: Case classes in Scala 2.10 can support only up to 22 fields. To
> work around this limit, // you can use custom classes that implement the
> Product interface.
> "
>
> Which sounds great, we have lots of data in thrift.. so via scrooge (
> https://github.com/twitter/scrooge), we end up with ultimately instances
> of
> traits which implement product. Though the reflection code appears to look
> for the constructor of the class and base the types based on those
> parameters?


Yeah, it's true that we only look in the constructor at the moment, but I
don't think there is a really good reason for that (other than I guess we
will need to add code to make sure we skip builtin object methods).  If you
want to open a JIRA, we can try fixing this.

Michael
