Spark SQL Join returns fewer rows than expected
Hi,

I have 2 files which come from a CSV export of 2 Oracle tables. F1 has 46730613 rows and F2 has 3386740 rows. I build 2 tables with Spark and join table F1 with table F2 on c1 = d1. Every key F2.d1 exists in F1.c1, so I expect to retrieve 46730613 rows, but the query returns only 3437 rows.

// --- begin code ---
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

val rddFile = sc.textFile("hdfs://referential/F1/part-*")
case class F1(c1: String, c2: String, c3: Double, c4: String, c5: String)
val stkrdd = rddFile.map(x => x.split("|")).map(f => F1(f(44), f(3), f(10).toDouble, "", f(2)))
stkrdd.registerAsTable("F1")
sqlContext.cacheTable("F1")

val prdfile = sc.textFile("hdfs://referential/F2/part-*")
case class F2(d1: String, d2: String, d3: String, d4: String)
val productrdd = prdfile.map(x => x.split("|")).map(f => F2(f(0), f(2), f(101), f(3)))
productrdd.registerAsTable("F2")
sqlContext.cacheTable("F2")

val resrdd = sqlContext.sql("Select count(*) from F1, F2 where F1.c1 = F2.d1").count()
// --- end of code ---

Does anybody know what I missed?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-returns-less-rows-that-expected-tp19731.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL Join returns fewer rows than expected
Which version are you using? Or, if you are using the most recent master or branch-1.2, which commit are you using?

On 11/25/14 4:08 PM, david wrote:
> [snip]
Re: Spark SQL Join returns fewer rows than expected
I guess you want to use split("\\|") instead of split("|"). String.split interprets its argument as a regular expression, and | is the regex alternation operator, so it has to be escaped to match a literal pipe character.

On Tue, Nov 25, 2014 at 4:51 AM, Cheng Lian <lian.cs@gmail.com> wrote:
> Which version are you using? Or if you are using the most recent master or branch-1.2, which commit are you using?
>
> [snip]
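A quick standalone sketch of the difference (plain Scala, no Spark needed; the same split call runs on every line of the RDD):

```scala
// String.split takes a regular expression. A bare "|" is the regex
// alternation operator ("empty OR empty"), which matches the empty
// string at every position, so the line is chopped into single
// characters instead of pipe-delimited fields. (Whether a leading
// empty token appears varies by JVM version.)
val line = "key1|key2|key3"

val wrong = line.split("|")   // one-character tokens, including the pipes themselves
val right = line.split("\\|") // escaped pipe: Array("key1", "key2", "key3")

println(wrong.mkString("[", ", ", "]"))
println(right.mkString("[", ", ", "]"))
```

With the unescaped pattern, f(0) and f(44) each hold a single character rather than a whole key, which would explain why so few rows match on F1.c1 = F2.d1.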