Spark SQL Join returns fewer rows than expected

2014-11-25 Thread david
Hi,

 I have two files that come from a CSV import of two Oracle tables:

 F1 has 46730613 rows
 F2 has  3386740 rows

I build two tables with Spark and join table F1 with table F2 on c1 = d1.


All keys in F2.d1 exist in F1.c1, so I expect the join to return 46730613 rows, but
it returns only 3437 rows.

// --- begin code ---

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// Load F1 from HDFS and parse the pipe-delimited lines
val rddFile = sc.textFile("hdfs://referential/F1/part-*")
case class F1(c1: String, c2: String, c3: Double, c4: String, c5: String)
val stkrdd = rddFile.map(x => x.split("|")).map(f =>
  F1(f(44), f(3), f(10).toDouble, "", f(2)))
stkrdd.registerAsTable("F1")
sqlContext.cacheTable("F1")

// Load F2 the same way
val prdfile = sc.textFile("hdfs://referential/F2/part-*")
case class F2(d1: String, d2: String, d3: String, d4: String)
val productrdd = prdfile.map(x => x.split("|")).map(f =>
  F2(f(0), f(2), f(101), f(3)))
productrdd.registerAsTable("F2")
sqlContext.cacheTable("F2")

// Join the two tables on c1 = d1 via SQL
val resrdd = sqlContext.sql("Select count(*) from F1, F2 where F1.c1 = F2.d1").count()

// --- end of code ---


Does anybody know what I missed?

Thanks

--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-returns-less-rows-that-expected-tp19731.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark SQL Join returns fewer rows than expected

2014-11-25 Thread Cheng Lian
Which version are you using? Or if you are using the most recent master 
or branch-1.2, which commit are you using?


Re: Spark SQL Join returns fewer rows than expected

2014-11-25 Thread Yin Huai
I guess you want to use split("\\|") instead of split("|").
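
String.split interprets its argument as a regular expression, and a bare | is the
regex alternation operator, which matches the empty string at every position, so
split("|") breaks each line into single characters instead of pipe-delimited
fields. A minimal sketch of the difference (output shown for Java 8; older JVMs
may prepend an empty string):

// Unescaped: | is alternation, matching the empty string everywhere,
// so every character becomes its own token
"a|b|c".split("|")      // Array(a, |, b, |, c)

// Escaped: the pipe is matched literally, yielding the fields
"a|b|c".split("\\|")    // Array(a, b, c)

// Equivalent alternative: quote the delimiter explicitly
"a|b|c".split(java.util.regex.Pattern.quote("|"))   // Array(a, b, c)

With the unescaped version, f(44) and f(0) pick up single characters rather than
the intended key columns, so almost no join keys match, which would explain the
unexpectedly small row count. Changing both parse steps to x.split("\\|") should
restore the expected fields.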
