What version are you running? Could you provide a jstack of the driver and executor when it is hanging?
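For reference, such thread dumps can be captured with the standard JDK tools; a minimal sketch, where <pid> is a placeholder for the process id of the driver or executor JVM:

    # list the running JVMs with their main class to find the right pid
    jps -lm
    # dump all threads of that JVM, including lock information
    jstack -l <pid> > stack-dump.txt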
On Thu, Jul 17, 2014 at 10:55 AM, Sarath Chandra
<sarathchandra.jos...@algofusiontech.com> wrote:

> Added the below 2 lines just before the SQL query line -
>
>     ...
>     file1_schema.count;
>     file2_schema.count;
>     ...
>
> and it started working. But I couldn't figure out the reason.
>
> Can someone please explain? What was happening earlier, and what is
> happening with the addition of these 2 lines?
>
> ~Sarath
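One plausible explanation for the question above, hedged since the root cause is never confirmed in the thread: Spark evaluates RDDs lazily. Transformations such as textFile, map, or createSchemaRDD only record lineage, and no tasks run until an action such as count() or collect() is invoked, so the two added lines force the input files to be read and the schema RDDs to be materialized before the SQL query runs. A minimal sketch of the distinction (the file path is the one used in the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyEvalSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("LazyEvalSketch"))

        // Transformations: nothing is read or computed here. Spark only
        // records the lineage needed to produce this RDD later.
        val parsed = sc.textFile("hdfs://master:54310/user/hduser/file1.csv")
          .map(_.split(","))

        // An action: only now does Spark launch tasks, read the file,
        // and run the map over its lines.
        println(parsed.count())

        sc.stop()
      }
    }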
> On Thu, Jul 17, 2014 at 1:13 PM, Sarath Chandra
> <sarathchandra.jos...@algofusiontech.com> wrote:
>
>> No Sonal, I'm not making any explicit call to stop the context.
>>
>> If you see my previous post to Michael, the commented portion of the
>> code is my requirement. When I run this over the standalone spark
>> cluster, the execution keeps running with no output or error. After
>> waiting for several minutes I'm killing it by pressing Ctrl+C in the
>> terminal.
>>
>> But the same code runs perfectly when executed from the spark shell.
>>
>> ~Sarath
>>
>> On Thu, Jul 17, 2014 at 1:05 PM, Sonal Goyal <sonalgoy...@gmail.com>
>> wrote:
>>
>>> Hi Sarath,
>>>
>>> Are you explicitly stopping the context?
>>>
>>>     sc.stop()
>>>
>>> Best Regards,
>>> Sonal
>>> Nube Technologies <http://www.nubetech.co>
>>> <http://in.linkedin.com/in/sonalgoyal>
>>>
>>> On Thu, Jul 17, 2014 at 12:51 PM, Sarath Chandra
>>> <sarathchandra.jos...@algofusiontech.com> wrote:
>>>
>>>> Hi Michael, Soumya,
>>>>
>>>> Can you please check and let me know what the issue is? What am I
>>>> missing? Let me know if you need any logs to analyze.
>>>>
>>>> ~Sarath
>>>>
>>>> On Wed, Jul 16, 2014 at 8:24 PM, Sarath Chandra
>>>> <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Tried it. It's correctly printing the line counts of both the
>>>>> files. Here's what I tried -
>>>>>
>>>>> Code:
>>>>>
>>>>> package test
>>>>>
>>>>> import org.apache.spark.{SparkConf, SparkContext}
>>>>> import org.apache.spark.rdd.RDD
>>>>> import org.apache.spark.sql.SQLContext
>>>>>
>>>>> object Test4 {
>>>>>   case class Test(fld1: String,
>>>>>                   fld2: String,
>>>>>                   fld3: String,
>>>>>                   fld4: String,
>>>>>                   fld5: String,
>>>>>                   fld6: Double,
>>>>>                   fld7: String);
>>>>>   def main(args: Array[String]) {
>>>>>     val conf = new SparkConf()
>>>>>       .setMaster(args(0))
>>>>>       .setAppName("SQLTest")
>>>>>       .setSparkHome(args(1))
>>>>>       .set("spark.executor.memory", "2g");
>>>>>     val sc = new SparkContext(conf);
>>>>>     sc.addJar("test1-0.1.jar");
>>>>>     val file1 = sc.textFile(args(2));
>>>>>     println(file1.count());
>>>>>     val file2 = sc.textFile(args(3));
>>>>>     println(file2.count());
>>>>> //  val sq = new SQLContext(sc);
>>>>> //  import sq._
>>>>> //  val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l =>
>>>>> //    Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)));
>>>>> //  val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s =>
>>>>> //    Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)));
>>>>> //  val file1_schema = sq.createSchemaRDD(file1_recs);
>>>>> //  val file2_schema = sq.createSchemaRDD(file2_recs);
>>>>> //  file1_schema.registerAsTable("file1_tab");
>>>>> //  file2_schema.registerAsTable("file2_tab");
>>>>> //  val matched = sq.sql("select * from file1_tab l join file2_tab s on " +
>>>>> //    "l.fld7=s.fld7 where l.fld2=s.fld2 and " +
>>>>> //    "l.fld3=s.fld3 and l.fld4=s.fld4 and " +
>>>>> //    "l.fld6=s.fld6");
>>>>> //  matched.collect().foreach(println);
>>>>>   }
>>>>> }
>>>>>
>>>>> Execution:
>>>>>
>>>>> export CLASSPATH=$HADOOP_PREFIX/conf:$SPARK_HOME/lib/*:test1-0.1.jar
>>>>> export CONFIG_OPTS="-Dspark.jars=test1-0.1.jar"
>>>>> java -cp $CLASSPATH $CONFIG_OPTS test.Test4 spark://master:7077 \
>>>>>   "/usr/local/spark-1.0.1-bin-hadoop1" \
>>>>>   hdfs://master:54310/user/hduser/file1.csv \
>>>>>   hdfs://master:54310/user/hduser/file2.csv
>>>>>
>>>>> ~Sarath
>>>>>
>>>>> On Wed, Jul 16, 2014 at 8:14 PM, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>>> What if you just run something like:
>>>>>>
>>>>>>     sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count()
>>>>>>
>>>>>> On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra
>>>>>> <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>>>
>>>>>>> Yes Soumya, I did it.
>>>>>>>
>>>>>>> First I tried the example available in the documentation (the one
>>>>>>> using the people table and finding teenagers). After successfully
>>>>>>> running it, I moved on to this one, which is the starting point of
>>>>>>> a bigger requirement for which I'm evaluating Spark SQL.
>>>>>>>
>>>>>>> On Wed, Jul 16, 2014 at 7:59 PM, Soumya Simanta
>>>>>>> <soumya.sima...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Can you try submitting a very simple job to the cluster.
>>>>>>>>
>>>>>>>> On Jul 16, 2014, at 10:25 AM, Sarath Chandra
>>>>>>>> <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>>>>>
>>>>>>>> Yes, it is appearing on the Spark UI, and it remains there with
>>>>>>>> state "RUNNING" till I press Ctrl+C in the terminal to kill the
>>>>>>>> execution.
>>>>>>>>
>>>>>>>> Barring the statements that create the spark context, if I
>>>>>>>> copy-paste the lines of my code into the spark shell, it runs
>>>>>>>> perfectly, giving the desired output.
>>>>>>>>
>>>>>>>> ~Sarath
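On the submission mechanics discussed here: Test4 above is launched with a bare java -cp plus a spark.jars system property. With Spark 1.0.x the usual route is bin/spark-submit, which ships the application jar to the cluster itself; a sketch using the master URL and paths from the thread (the trailing arguments are Test4's own args(0)..args(3)):

    $SPARK_HOME/bin/spark-submit \
      --class test.Test4 \
      --master spark://master:7077 \
      test1-0.1.jar \
      spark://master:7077 /usr/local/spark-1.0.1-bin-hadoop1 \
      hdfs://master:54310/user/hduser/file1.csv \
      hdfs://master:54310/user/hduser/file2.csv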
>>>>>>>> On Wed, Jul 16, 2014 at 7:48 PM, Soumya Simanta
>>>>>>>> <soumya.sima...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> When you submit your job, it should appear on the Spark UI. Same
>>>>>>>>> with the REPL. Make sure your job is submitted to the cluster
>>>>>>>>> properly.
>>>>>>>>>
>>>>>>>>> On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra
>>>>>>>>> <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Soumya,
>>>>>>>>>>
>>>>>>>>>> The data is very small, 500+ lines in each file.
>>>>>>>>>>
>>>>>>>>>> Removed the last 2 lines and placed
>>>>>>>>>> "matched.collect().foreach(println);" at the end. Still no luck;
>>>>>>>>>> it's been more than 5 minutes and the execution is still
>>>>>>>>>> running.
>>>>>>>>>>
>>>>>>>>>> Checked the logs: nothing in stdout, and in stderr I don't see
>>>>>>>>>> anything going wrong, all are info messages.
>>>>>>>>>>
>>>>>>>>>> What else do I need to check?
>>>>>>>>>>
>>>>>>>>>> ~Sarath
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta
>>>>>>>>>> <soumya.sima...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Check your executor logs for the output or, if your data is
>>>>>>>>>>> not big, collect it in the driver and print it.
>>>>>>>>>>>
>>>>>>>>>>> On Jul 16, 2014, at 9:21 AM, Sarath Chandra
>>>>>>>>>>> <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to do a simple record matching between 2 files and
>>>>>>>>>>> wrote the following code -
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.spark.sql.SQLContext;
>>>>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>>>>
>>>>>>>>>>> object SqlTest {
>>>>>>>>>>>   case class Test(fld1: String, fld2: String, fld3: String,
>>>>>>>>>>>     fld4: String, fld5: String, fld6: Double, fld7: String);
>>>>>>>>>>>   // sc: an already-created SparkContext (e.g. the one the
>>>>>>>>>>>   // spark shell provides)
>>>>>>>>>>>   sc.addJar("test1-0.1.jar");
>>>>>>>>>>>   val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv");
>>>>>>>>>>>   val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv");
>>>>>>>>>>>   val sq = new SQLContext(sc);
>>>>>>>>>>>   val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l =>
>>>>>>>>>>>     Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)));
>>>>>>>>>>>   val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s =>
>>>>>>>>>>>     Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)));
>>>>>>>>>>>   val file1_schema = sq.createSchemaRDD(file1_recs);
>>>>>>>>>>>   val file2_schema = sq.createSchemaRDD(file2_recs);
>>>>>>>>>>>   file1_schema.registerAsTable("file1_tab");
>>>>>>>>>>>   file2_schema.registerAsTable("file2_tab");
>>>>>>>>>>>   val matched = sq.sql("select * from file1_tab l join file2_tab s " +
>>>>>>>>>>>     "on l.fld7=s.fld7 where l.fld3=s.fld3 and l.fld4=s.fld4 " +
>>>>>>>>>>>     "and l.fld6=s.fld6 and l.fld2=s.fld2");
>>>>>>>>>>>   val count = matched.count();
>>>>>>>>>>>   println("Found " + count + " matching records");
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> When I run this program on a standalone spark cluster, it
>>>>>>>>>>> keeps running for a long time with no output or error. After
>>>>>>>>>>> waiting for a few minutes I'm forcibly killing it.
>>>>>>>>>>>
>>>>>>>>>>> But the same program works well when executed from a spark
>>>>>>>>>>> shell.
>>>>>>>>>>>
>>>>>>>>>>> What is going wrong? What am I missing?
>>>>>>>>>>>
>>>>>>>>>>> ~Sarath
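For completeness, a self-contained sketch of the record-matching program, combining the driver setup from Test4 with the SQL portion of SqlTest above (Spark 1.0 API; the master URL, paths, and field names are taken from the thread, so this is a reconstruction rather than the poster's confirmed final code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    object SqlTestStandalone {
      case class Test(fld1: String, fld2: String, fld3: String, fld4: String,
        fld5: String, fld6: Double, fld7: String)

      def main(args: Array[String]) {
        val conf = new SparkConf()
          .setMaster("spark://master:7077")
          .setAppName("SQLTest")
          .set("spark.executor.memory", "2g")
        val sc = new SparkContext(conf)
        sc.addJar("test1-0.1.jar") // ship the application classes to the executors

        val sq = new SQLContext(sc)
        val file1 = sc.textFile("hdfs://master:54310/user/hduser/file1.csv")
        val file2 = sc.textFile("hdfs://master:54310/user/hduser/file2.csv")
        val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l =>
          Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)))
        val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s =>
          Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)))

        // Spark 1.0 SQL API: turn the case-class RDDs into SchemaRDDs and
        // register them as tables so they can be queried with SQL.
        sq.createSchemaRDD(file1_recs).registerAsTable("file1_tab")
        sq.createSchemaRDD(file2_recs).registerAsTable("file2_tab")

        val matched = sq.sql(
          "select * from file1_tab l join file2_tab s on l.fld7 = s.fld7 " +
          "where l.fld2 = s.fld2 and l.fld3 = s.fld3 and l.fld4 = s.fld4 " +
          "and l.fld6 = s.fld6")
        println("Found " + matched.count() + " matching records")

        sc.stop() // stop the context so the driver JVM can exit cleanly
      }
    }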