Re: Performance problems on SQL JOIN

2014-06-21 Thread Michael Armbrust
It's probably because our LEFT JOIN performance isn't great at the moment, since
we'll use a nested loop join. Sorry! We are aware of the problem, and there is
a JIRA to let us do this with a HashJoin instead. If you are feeling brave,
you might try pulling in the related PR.

https://issues.apache.org/jira/browse/SPARK-2212
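To see why the join strategy matters: a nested loop join compares every left row against every right row (O(n·m)), while a hash join builds a lookup table on the join key once and then probes it (O(n+m)). A rough sketch of the two strategies in plain Python (toy rows, purely illustrative; this is not Spark's internal code):

```python
# Toy rows: (hotelId, toDate, payload) -- hypothetical illustrative data.
rooms2 = [("h1", "d1", "a"), ("h2", "d1", "b"), ("h3", "d2", "c")]
rooms3 = [("h1", "d1", "x"), ("h9", "d9", "y")]

def nested_loop_left_join(left, right):
    # Compare every left row against every right row: O(len(left) * len(right)).
    out = []
    for l in left:
        matched = False
        for r in right:
            if l[0] == r[0] and l[1] == r[1]:
                out.append((l, r))
                matched = True
        if not matched:
            out.append((l, None))  # LEFT JOIN keeps unmatched left rows
    return out

def hash_left_join(left, right):
    # Build a hash table on the join key once, then probe: O(len(left) + len(right)).
    table = {}
    for r in right:
        table.setdefault((r[0], r[1]), []).append(r)
    out = []
    for l in left:
        matches = table.get((l[0], l[1]))
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))  # unmatched left rows survive here too
    return out

# Both strategies produce the same rows; only the cost differs.
assert nested_loop_left_join(rooms2, rooms3) == hash_left_join(rooms2, rooms3)
```

On 50k x 30k rows that is 1.5 billion comparisons for the nested loop versus roughly 80k hash operations, which is why the JIRA above matters.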


On Fri, Jun 20, 2014 at 8:16 AM, mathias math...@socialsignificance.co.uk
wrote:

 Hi there,

 We're trying out Spark and are experiencing some performance issues using
 Spark SQL.
 Can anyone tell us if our results are normal?

 We are using the Amazon EC2 scripts to create a cluster with 3
 workers/executors (m1.large).
 We tried both Spark 1.0.0 and the git master, with both the Scala and the
 Python shells.

 Running the following code takes about 5 minutes, which seems a long time
 for this query.

 val file = sc.textFile("s3n:// ... .csv");
 val data = file.map(x => x.split('|')); // 300k rows

 case class BookingInfo(num_rooms: String, hotelId: String, toDate: String, ...);
 val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1), ..., x(9))); // 50k rows
 val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1), ..., x(9))); // 30k rows

 rooms2.registerAsTable("rooms2");
 cacheTable("rooms2");
 rooms3.registerAsTable("rooms3");
 cacheTable("rooms3");

 sql("SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId = rooms3.hotelId AND rooms2.toDate = rooms3.toDate").count();


 Are we doing something wrong here?
 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Performance-problems-on-SQL-JOIN-tp8001.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Performance problems on SQL JOIN

2014-06-20 Thread Xiangrui Meng
Your data source is S3 and the data is used twice. m1.large does not have very good
network performance. Please try file.count() and see how fast that goes. -Xiangrui
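The idea behind timing file.count() separately is to isolate the S3 read cost from the join cost. A minimal timing helper for that kind of comparison (plain Python; the Spark calls in the comments are the hypothetical usage, not part of the runnable sketch):

```python
import time

def timed(label, thunk):
    # Run a single action and report how long it took; crude but
    # enough to tell whether the time goes to I/O or to the join.
    start = time.perf_counter()
    result = thunk()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result

# Hypothetical usage against the thread's Spark session:
#   timed("raw count", lambda: file.count())          # mostly S3 read time
#   timed("join count", lambda: sql(query).count())   # read + join time
n = timed("toy count", lambda: sum(1 for _ in range(1000)))
```

If the raw count is fast (as it turned out below: ~7s), the remaining minutes are spent in the join itself.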



Re: Performance problems on SQL JOIN

2014-06-20 Thread Evan R. Sparks
Also, you could consider caching your data after the first split (before
the first filter); this will prevent you from retrieving the data from S3
twice.
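Both rooms2 and rooms3 are derived from data, so without a cache each one can trigger its own read of the source. A sketch of the effect in plain Python (a read counter stands in for the S3 fetch; this models the lazy-recompute behavior, it is not Spark code):

```python
reads = {"count": 0}

def read_source():
    # Stand-in for the S3 read; counts how many times it happens.
    reads["count"] += 1
    return ["2|h1|d1", "3|h1|d1", "2|h2|d2"]

def lazy_rows():
    # Uncached: every derived dataset recomputes from the source,
    # like an uncached RDD lineage.
    return [line.split("|") for line in read_source()]

rooms2 = [r for r in lazy_rows() if r[0] == "2"]
rooms3 = [r for r in lazy_rows() if r[0] == "3"]
uncached_reads = reads["count"]  # source was read twice

reads["count"] = 0
cached = [line.split("|") for line in read_source()]  # like data.cache()
rooms2 = [r for r in cached if r[0] == "2"]
rooms3 = [r for r in cached if r[0] == "3"]
cached_reads = reads["count"]  # source was read once
```

In the thread's Scala code the equivalent would be something like val data = file.map(x => x.split('|')).cache() before the two filters.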


On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng men...@gmail.com wrote:

 Your data source is S3 and data is used twice. m1.large does not have very
 good network performance. Please try file.count() and see how fast it goes.
 -Xiangrui




Re: Performance problems on SQL JOIN

2014-06-20 Thread mathias
Thanks for your suggestions.

file.count() takes 7s, so that doesn't seem to be the problem.
Moreover, a UNION over the same code/CSV takes about 15s (SELECT * FROM
rooms2 UNION SELECT * FROM rooms3).

The web status page shows that both stages 'count at joins.scala:216' and
'reduce at joins.scala:219' take up the majority of the time.
Is this due to bad partitioning or caching? Or is there a problem with the
JOIN operator?


