Hi All,

I'm trying to do a simple record match between two files and wrote the
following code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

object SqlTest {
  // Note: the case class originally declared fld4 twice, which doesn't
  // compile; I've renamed the second occurrence to fld4b (neither field
  // is referenced in the query below).
  case class Test(fld1: String, fld2: String, fld3: String, fld4: String,
                  fld4b: String, fld5: Double, fld6: String)

  def main(args: Array[String]) {
    // A standalone app has to build its own SparkContext; sc only
    // exists implicitly inside the Spark shell.
    val conf = new SparkConf().setAppName("SqlTest")
    val sc = new SparkContext(conf)
    sc.addJar("test1-0.1.jar")

    val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv")
    val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv")
    val sq = new SQLContext(sc)

    // Parse each CSV line into a Test record (the sixth field is numeric).
    val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l =>
      Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)))
    val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s =>
      Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)))

    // Register both RDDs as tables so they can be joined with SQL.
    val file1_schema = sq.createSchemaRDD(file1_recs)
    val file2_schema = sq.createSchemaRDD(file2_recs)
    file1_schema.registerAsTable("file1_tab")
    file2_schema.registerAsTable("file2_tab")

    val matched = sq.sql(
      "select * from file1_tab l join file2_tab s on l.fld6 = s.fld6 " +
      "where l.fld3 = s.fld3 and l.fld4 = s.fld4 " +
      "and l.fld5 = s.fld5 and l.fld2 = s.fld2")

    // count() is an action, so this is what actually triggers the job.
    println("Found " + matched.count() + " matching records")
  }
}
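
For reference, each line in both files is expected to carry seven
comma-separated fields, with the sixth one parseable as a number; a
made-up example row (not real data):

A1,B1,C1,D1,E1,12.34,KEY1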

When I run this program on a standalone Spark cluster, it keeps running
for a long time with no output and no errors; after waiting a few minutes
I kill it forcibly. But the same code works fine when executed from the
Spark shell.
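
In case it matters, the job is launched with spark-submit along these
lines (the master URL and jar path here are placeholders, not my exact
command):

bin/spark-submit --class SqlTest --master spark://<master-host>:7077 test1-0.1.jar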

What is going wrong? What am I missing?

~Sarath
