When you submit your job, it should appear on the Spark UI, just as it does
from the REPL. Make sure your job is actually submitted to the cluster
properly.
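For reference, a submit command for a standalone cluster typically looks like
"spark-submit --class SqlTest --master spark://<master-host>:7077 test1-0.1.jar"
(class and jar names taken from your snippet; the master URL is a placeholder).
Also note that, unlike the REPL, a submitted application does not get a
SparkContext for free. A minimal skeleton, assuming a standard main entry
point, would be:

import org.apache.spark.{SparkConf, SparkContext}

object SqlTest {
  def main(args: Array[String]): Unit = {
    // The master URL comes from spark-submit's --master flag; the shell
    // creates `sc` for you, but a submitted app must build its own context.
    val conf = new SparkConf().setAppName("SqlTest")
    val sc = new SparkContext(conf)
    // ... job logic goes here ...
    sc.stop()
  }
}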


On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra <
sarathchandra.jos...@algofusiontech.com> wrote:

> Hi Soumya,
>
> Data is very small, 500+ lines in each file.
>
> I removed the last 2 lines and placed
> "matched.collect().foreach(println);" at the end. Still no luck; it's been
> more than 5 minutes and the execution is still running.
>
> I checked the logs; there's nothing in stdout, and in stderr I don't see
> anything going wrong, just INFO messages.
>
> What else do I need to check?
>
> ~Sarath
>
> On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta <soumya.sima...@gmail.com>
> wrote:
>
>> Check your executor logs for the output, or, if your data is not big,
>> collect it in the driver and print it.
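>> For example, something like this, assuming the `matched` RDD from your
>> snippet (only safe when the result fits in driver memory):
>>
>>   matched.collect().foreach(println)  // pull all results back to the driver
>>
>> or matched.take(10).foreach(println) to print just a sample.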
>>
>>
>>
>> On Jul 16, 2014, at 9:21 AM, Sarath Chandra <
>> sarathchandra.jos...@algofusiontech.com> wrote:
>>
>> Hi All,
>>
>> I'm trying to do a simple record matching between 2 files and wrote the
>> following code -
>>
>> import org.apache.spark.sql.SQLContext
>> import org.apache.spark.rdd.RDD
>>
>> object SqlTest {
>>   // One record per CSV line. fld4a is a placeholder name; two fields
>>   // cannot both be called fld4, as that does not compile.
>>   case class Test(fld1: String, fld2: String, fld3: String, fld4: String,
>>     fld4a: String, fld5: Double, fld6: String)
>>
>>   // `sc` is the SparkContext that the spark-shell provides
>>   sc.addJar("test1-0.1.jar")
>>   val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv")
>>   val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv")
>>   val sq = new SQLContext(sc)
>>
>>   // Parse each CSV line into a Test record
>>   val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l =>
>>     Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)))
>>   val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s =>
>>     Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)))
>>
>>   // Register both sides as tables and join on the matching fields
>>   val file1_schema = sq.createSchemaRDD(file1_recs)
>>   val file2_schema = sq.createSchemaRDD(file2_recs)
>>   file1_schema.registerAsTable("file1_tab")
>>   file2_schema.registerAsTable("file2_tab")
>>   val matched = sq.sql("select * from file1_tab l join file2_tab s on " +
>>     "l.fld6=s.fld6 where l.fld3=s.fld3 and l.fld4=s.fld4 and " +
>>     "l.fld5=s.fld5 and l.fld2=s.fld2")
>>
>>   // reuse count so the query is not executed twice
>>   val count = matched.count()
>>   System.out.println("Found " + count + " matching records")
>> }
>>
>> When I run this program on a standalone Spark cluster, it keeps running
>> for a long time with no output and no errors. After waiting a few minutes
>> I forcibly kill it. But the same program works fine when executed from the
>> Spark shell.
>>
>> What is going wrong? What am I missing?
>>
>> ~Sarath
>>
>>
>
