I have two RDDs each saved in a parquet file. I need to join this two RDDs by
the "id" column. Can I created index on the id column so they can join faster?
Here is the code
case class Example(val id: String, val category: String)
case class DocVector(val id: String, val vector: Vector)
val examples : RDD[Example] = ... // First RDD which contain a few thousands
items
val docVectors : RDD[DocVector] = ... // Second RDD which contains 10
million items
// These two RDDs are saved in parquet files as follow
examples.toDF().saveAsParquetFile("file:///c:/temp/examples.parquet")
docVectors.toDF().saveAsParquetFile("file:///c:/temp/docVectors.parquet")
// Now I need to join these two RDDs stored in the parquet files
val dfExamples = sqlContext.parquetFile("file:///c:/temp/docVectors.parquet")
val dfDocVectors = sqlContext.parquetFile(docVectorsParquet) // DataFrame of
(id, vector)
dfExamples.join(dfDocVectors, dfExamples("id") ===
dfDocVectors("id")).select(dfDocVectors("id"), dfDocVectors("vector"),
dfExamples("cat"))
I need to perform such kind of join many times. To speed up the join, can I
create index on the "id" column in the parquet file like what I can do to a
database table?
Ningjun