Re: Simple record matching using Spark SQL

Sarath Chandra Wed, 16 Jul 2014 07:09:33 -0700

Hi Soumya,

Data is very small, 500+ lines in each file.


Removed last 2 lines and placed this at the end
"matched.collect().foreach(println);". Still no luck. It's been more than
5min, the execution is still running.

Checked logs, nothing in stdout. In stderr I don't see anything going
wrong, all are info messages.

What else do I need check?

~Sarath

On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta <soumya.sima...@gmail.com>
wrote:

> Check your executor logs for the output or if your data is not big collect
> it in the driver and print it.
>
>
>
> On Jul 16, 2014, at 9:21 AM, Sarath Chandra <
> sarathchandra.jos...@algofusiontech.com> wrote:
>
> Hi All,
>
> I'm trying to do a simple record matching between 2 files and wrote
> following code -
>
> *import org.apache.spark.sql.SQLContext;*
> *import org.apache.spark.rdd.RDD*
> *object SqlTest {*
> *  case class Test(fld1:String, fld2:String, fld3:String, fld4:String,
> fld4:String, fld5:Double, fld6:String);*
> *  sc.addJar("test1-0.1.jar");*
> *  val file1 =
> sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv");*
> *  val file2 =
> sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv");*
> *  val sq = new SQLContext(sc);*
> *  val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l => Test(l(0),
> l(1), l(2), l(3), l(4), l(5).toDouble, l(6)));*
> *  val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s => Test(s(0),
> s(1), s(2), s(3), s(4), s(5).toDouble, s(6)));*
> *  val file1_schema = sq.createSchemaRDD(file1_recs);*
> *  val file2_schema = sq.createSchemaRDD(file2_recs);*
> *  file1_schema.registerAsTable("file1_tab");*
> *  file2_schema.registerAsTable("file2_tab");*
> *  val matched = sq.sql("select * from file1_tab l join file2_tab s on
> l.fld6=s.fld6 where l.fld3=s.fld3 and l.fld4=s.fld4 and l.fld5=s.fld5 and
> l.fld2=s.fld2");*
> *  val count = matched.count();*
> *  System.out.println("Found " + matched.count() + " matching records");*
> *}*
>
> When I run this program on a standalone spark cluster, it keeps running
> for long with no output or error. After waiting for few mins I'm forcibly
> killing it.
> But the same program is working well when executed from a spark shell.
>
> What is going wrong? What am I missing?
>
> ~Sarath
>
>

Re: Simple record matching using Spark SQL

Reply via email to