Your data source is S3, and the data is used twice. m1.large instances do not have
very good network performance. Please try file.count() and see how fast it goes. -Xiangrui
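Before tuning the join itself, it helps to isolate the S3 read cost. A minimal sketch (the `timed` helper is hypothetical, not a Spark API; `file` refers to the RDD from the snippet below):

```scala
// Plain-Scala timing helper; works in the Spark shell or any Scala REPL.
def timed[A](body: => A): (A, Double) = {
  val t0 = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - t0) / 1e9) // result plus elapsed seconds
}

// In the Spark shell, count() forces a full read of the S3 file:
//   val (n, secs) = timed { file.count() }
//   println(s"counted $n lines in $secs s")
```

If the count alone takes minutes, the bottleneck is network I/O from S3 rather than the join.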

> On Jun 20, 2014, at 8:16 AM, mathias <math...@socialsignificance.co.uk> wrote:
> 
> Hi there,
> 
> We're trying out Spark and are experiencing some performance issues using
> Spark SQL.
> Can anyone tell us whether these results are normal?
> 
> We are using the Amazon EC2 scripts to create a cluster with 3
> workers/executors (m1.large).
> We tried both Spark 1.0.0 and the git master, with both the Scala and the
> Python shells.
> 
> Running the following code takes about 5 minutes, which seems a long time
> for this query.
> 
> val file = sc.textFile("s3n:// ...  .csv");
> val data = file.map(x => x.split('|')); // 300k rows
> 
> case class BookingInfo(num_rooms: String, hotelId: String, toDate: String,
> ...);
> val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1),
> ... , x(9))); // 50k rows
> val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1),
> ... , x(9))); // 30k rows
> 
> rooms2.registerAsTable("rooms2");
> cacheTable("rooms2");
> rooms3.registerAsTable("rooms3");
> cacheTable("rooms3");
> 
> sql("SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId =
> rooms3.hotelId AND rooms2.toDate = rooms3.toDate").count();
> 
> 
> Are we doing something wrong here?
> Thanks!
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Performance-problems-on-SQL-JOIN-tp8001.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
