I have two RDDs each saved in a parquet file. I need to join this two RDDs by the "id" column. Can I created index on the id column so they can join faster? Here is the code
case class Example(val id: String, val category: String) case class DocVector(val id: String, val vector: Vector) val examples : RDD[Example] = ... // First RDD which contain a few thousands items val docVectors : RDD[DocVector] = ... // Second RDD which contains 10 million items // These two RDDs are saved in parquet files as follow examples.toDF().saveAsParquetFile("file:///c:/temp/examples.parquet") docVectors.toDF().saveAsParquetFile("file:///c:/temp/docVectors.parquet") // Now I need to join these two RDDs stored in the parquet files val dfExamples = sqlContext.parquetFile("file:///c:/temp/docVectors.parquet") val dfDocVectors = sqlContext.parquetFile(docVectorsParquet) // DataFrame of (id, vector) dfExamples.join(dfDocVectors, dfExamples("id") === dfDocVectors("id")).select(dfDocVectors("id"), dfDocVectors("vector"), dfExamples("cat")) I need to perform such kind of join many times. To speed up the join, can I create index on the "id" column in the parquet file like what I can do to a database table? Ningjun