I have two RDDs each saved in a parquet file. I need to join this two RDDs by 
the "id" column. Can I created index on the id column so they can join faster? 
Here is the code


case class Example(val id: String,  val category: String)
case class DocVector(val id: String,  val vector: Vector)

val examples : RDD[Example] = ...    // First RDD which contain a few thousands 
items
val docVectors : RDD[DocVector] = ...  // Second RDD   which contains 10 
million items

// These two RDDs are saved in parquet files as follow
examples.toDF().saveAsParquetFile("file:///c:/temp/examples.parquet")
docVectors.toDF().saveAsParquetFile("file:///c:/temp/docVectors.parquet")


// Now I need to join these two RDDs stored in the parquet files
val dfExamples = sqlContext.parquetFile("file:///c:/temp/docVectors.parquet")
val dfDocVectors = sqlContext.parquetFile(docVectorsParquet) // DataFrame of 
(id, vector)

dfExamples.join(dfDocVectors, dfExamples("id") === 
dfDocVectors("id")).select(dfDocVectors("id"), dfDocVectors("vector"), 
dfExamples("cat"))

I need to perform such kind of join many times. To speed up the join, can I 
create index on the "id" column in the parquet file like what I can do to a 
database table?


Ningjun

Reply via email to