Hey Ian, Thanks for bringing these up! Responses in-line:
Just wondering if right now spark sql is expected to be thread safe on > master? > doing a simple hadoop file -> RDD -> schema RDD -> write parquet > will fail in reflection code if i run these in a thread pool. > You are probably hitting SPARK-2178 <https://issues.apache.org/jira/browse/SPARK-2178> which is caused by SI-6240 <https://issues.scala-lang.org/browse/SI-6240>. We have a plan to fix this by moving the schema introspection to compile time, using macros. > The SparkSqlSerializer, seems to create a new Kryo instance each time it > wants to serialize anything. I got a huge speedup when I had any > non-primitive type in my SchemaRDD using the ResourcePool's from Chill for > providing the KryoSerializer to it. (I can open an RB if there is some > reason not to re-use them?) > Sounds like SPARK-2102 <https://issues.apache.org/jira/browse/SPARK-2102>. There is no reason AFAIK to not reuse the instance. A PR would be greatly appreciated! > With the Distinct Count operator there is no map-side operations, and a > test to check for this. Is there any reason not to do a map side combine > into a set and then merge the sets later? (similar to the approximate > distinct count operator) > Thats just not an optimization that we had implemented yet... but I've just done it here <https://github.com/apache/spark/pull/1366> and it'll be in master soon :) > Another thing while i'm mailing.. the 1.0.1 docs have a section like: > " > // Note: Case classes in Scala 2.10 can support only up to 22 fields. To > work around this limit, // you can use custom classes that implement the > Product interface. > " > > Which sounds great, we have lots of data in thrift.. so via scrooge ( > https://github.com/twitter/scrooge), we end up with ultimately instances > of > traits which implement product. Though the reflection code appears to look > for the constructor of the class and base the types based on those > parameters? Yeah, thats true that we only look in the constructor at the moment, but I don't think there is a really good reason for that (other than I guess we will need to add code to make sure we skip builtin object methods). If you want to open a JIRA, we can try fixing this. Michael