Your data source is S3 and the data is used twice. m1.large instances do not have very good network performance. Please try file.count() and see how fast it goes. -Xiangrui
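To make the "data is used twice" point concrete: because `rooms2` and `rooms3` are both derived from `data`, each action re-reads the file from S3 and re-splits it unless the RDD is cached. A minimal sketch of caching the parsed RDD once before deriving the two tables (the bucket path is a placeholder; the original post elides the real one):

```scala
// Cache the parsed RDD so the S3 read and the split('|') happen only once,
// instead of once per derived table.
val file = sc.textFile("s3n://your-bucket/bookings.csv") // hypothetical path
val data = file.map(_.split('|')).cache()                // materialized lazily on first action

// Pay the S3 + parse cost once up front and measure it, per the suggestion above:
data.count()

// These now read from the cached partitions, not from S3:
val rooms2 = data.filter(_(0) == "2")
val rooms3 = data.filter(_(0) == "3")
```

Comparing the time of the first `data.count()` against a second one is a quick way to separate S3/network cost from the cost of the join itself.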
> On Jun 20, 2014, at 8:16 AM, mathias <math...@socialsignificance.co.uk> wrote:
>
> Hi there,
>
> We're trying out Spark and are experiencing some performance issues using
> Spark SQL.
> Can anyone tell us whether our results are normal?
>
> We are using the Amazon EC2 scripts to create a cluster with 3
> workers/executors (m1.large).
> We tried both Spark 1.0.0 and the git master, with both the Scala and the
> Python shells.
>
> Running the following code takes about 5 minutes, which seems a long time
> for this query.
>
> val file = sc.textFile("s3n:// ... .csv");
> val data = file.map(x => x.split('|')); // 300k rows
>
> case class BookingInfo(num_rooms: String, hotelId: String, toDate: String, ...);
> val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1), ..., x(9))); // 50k rows
> val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1), ..., x(9))); // 30k rows
>
> rooms2.registerAsTable("rooms2");
> cacheTable("rooms2");
> rooms3.registerAsTable("rooms3");
> cacheTable("rooms3");
>
> sql("SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId = rooms3.hotelId AND rooms2.toDate = rooms3.toDate").count();
>
> Are we doing something wrong here?
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Performance-problems-on-SQL-JOIN-tp8001.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.