Performance problems on SQL JOIN

2014-06-20 Thread mathias
Hi there, We're trying out Spark and are experiencing some performance issues with Spark SQL. Can anyone tell us whether our results are normal? We are using the Amazon EC2 scripts to create a cluster with 3 workers/executors (m1.large). We tried both Spark 1.0.0 and the git master; the

Re: Performance problems on SQL JOIN

2014-06-20 Thread mathias
Thanks for your suggestions. file.count() takes 7s, so that doesn't seem to be the problem. Moreover, a union with the same code/CSV takes about 15s (SELECT * FROM rooms2 UNION SELECT * FROM rooms3). The web status page shows that both stages 'count at joins.scala:216' and 'reduce at