Can you try submitting a very simple job to the cluster?
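For example, something like this minimal sketch (assuming a Spark 1.0-style self-contained app; SimpleJob and the jar name below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleJob {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleJob"))
    // A job with no external input: count 1000 numbers on the cluster.
    val n = sc.parallelize(1 to 1000).count()
    println("Counted " + n + " elements")
    sc.stop()
  }
}

submitted with something like
bin/spark-submit --class SimpleJob --master spark://<master>:7077 simple-job.jar

If this completes and reaches FINISHED on the UI, the cluster and the submission path are fine and the problem is in the application itself.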
> On Jul 16, 2014, at 10:25 AM, Sarath Chandra
> <sarathchandra.jos...@algofusiontech.com> wrote:
>
> Yes, it is appearing on the Spark UI, and it remains there with state
> "RUNNING" until I press Ctrl+C in the terminal to kill the execution.
>
> Barring the statements that create the Spark context, if I copy-paste the
> lines of my code into the spark shell, it runs perfectly, giving the
> desired output.
>
> ~Sarath
>
>> On Wed, Jul 16, 2014 at 7:48 PM, Soumya Simanta
>> <soumya.sima...@gmail.com> wrote:
>>
>> When you submit your job, it should appear on the Spark UI. Same with the
>> REPL. Make sure your job is submitted to the cluster properly.
>>
>>> On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra
>>> <sarathchandra.jos...@algofusiontech.com> wrote:
>>>
>>> Hi Soumya,
>>>
>>> The data is very small, 500+ lines in each file.
>>>
>>> I removed the last 2 lines and placed
>>> "matched.collect().foreach(println);" at the end. Still no luck; it's
>>> been more than 5 minutes and the execution is still running.
>>>
>>> I checked the logs. There is nothing in stdout, and in stderr I don't
>>> see anything going wrong; all are info messages.
>>>
>>> What else do I need to check?
>>>
>>> ~Sarath
>>>
>>>> On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta
>>>> <soumya.sima...@gmail.com> wrote:
>>>>
>>>> Check your executor logs for the output, or if your data is not big,
>>>> collect it in the driver and print it.
>>>>
>>>>> On Jul 16, 2014, at 9:21 AM, Sarath Chandra
>>>>> <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm trying to do a simple record matching between 2 files and wrote the
>>>>> following code:
>>>>>
>>>>> import org.apache.spark.sql.SQLContext
>>>>> import org.apache.spark.rdd.RDD
>>>>>
>>>>> object SqlTest {
>>>>>   case class Test(fld1: String, fld2: String, fld3: String,
>>>>>     fld4: String, fld4a: String, fld5: Double, fld6: String)
>>>>>   sc.addJar("test1-0.1.jar")
>>>>>   val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv")
>>>>>   val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv")
>>>>>   val sq = new SQLContext(sc)
>>>>>   val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l =>
>>>>>     Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)))
>>>>>   val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s =>
>>>>>     Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)))
>>>>>   val file1_schema = sq.createSchemaRDD(file1_recs)
>>>>>   val file2_schema = sq.createSchemaRDD(file2_recs)
>>>>>   file1_schema.registerAsTable("file1_tab")
>>>>>   file2_schema.registerAsTable("file2_tab")
>>>>>   val matched = sq.sql("select * from file1_tab l join file2_tab s " +
>>>>>     "on l.fld6 = s.fld6 where l.fld3 = s.fld3 and l.fld4 = s.fld4 " +
>>>>>     "and l.fld5 = s.fld5 and l.fld2 = s.fld2")
>>>>>   val count = matched.count()
>>>>>   System.out.println("Found " + count + " matching records")
>>>>> }
>>>>>
>>>>> When I run this program on a standalone Spark cluster, it keeps running
>>>>> for a long time with no output or error. After waiting for a few minutes
>>>>> I forcibly kill it.
>>>>> But the same program works well when executed from a spark shell.
>>>>>
>>>>> What is going wrong? What am I missing?
>>>>>
>>>>> ~Sarath
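One difference between the two environments that may matter here: spark-shell predefines "sc", but a program run through spark-submit has to create its own SparkContext inside a main method, and the posted object has neither. This is only an assumption from the snippet as posted, not a confirmed diagnosis of the hang, but restructured as a self-contained application it would look roughly like this (a sketch reusing the names from the post):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

object SqlTest {
  case class Test(fld1: String, fld2: String, fld3: String, fld4: String,
                  fld4a: String, fld5: Double, fld6: String)

  def main(args: Array[String]) {
    // spark-shell predefines sc; a submitted application must create its own.
    val sc = new SparkContext(new SparkConf().setAppName("SqlTest"))
    sc.addJar("test1-0.1.jar")
    val sq = new SQLContext(sc)

    val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv")
    val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv")
    val file1_recs: RDD[Test] = file1.map(_.split(","))
      .map(l => Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)))
    val file2_recs: RDD[Test] = file2.map(_.split(","))
      .map(s => Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)))

    // Spark 1.0 SQL: convert the case-class RDDs and register them as tables.
    sq.createSchemaRDD(file1_recs).registerAsTable("file1_tab")
    sq.createSchemaRDD(file2_recs).registerAsTable("file2_tab")

    val matched = sq.sql("select * from file1_tab l join file2_tab s " +
      "on l.fld6 = s.fld6 where l.fld3 = s.fld3 and l.fld4 = s.fld4 " +
      "and l.fld5 = s.fld5 and l.fld2 = s.fld2")
    println("Found " + matched.count() + " matching records")

    sc.stop()  // release resources so the application leaves the RUNNING state
  }
}

If a trivial job runs fine but this one still hangs, the executor stderr on the workers is the next place to look.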