Hi, guys
I'm confused about joining columns in SparkSQL and need your advice.
I want to join 2 datasets of profiles. Each profile has a name and an array of 
attributes (age, gender, email, etc.).
There can be multiple instances of an attribute with the same name, e.g. a profile 
with 2 emails has 2 attributes with name = 'email' in its 
array. Now I want to join the 2 datasets on the 'email' attribute. I can't find a 
way to do it :(

The code is below. Right now the result of the join is empty, while I expect to 
see 1 row with all of Alice's emails.

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

case class Attribute(name: String, value: String, weight: Float)
case class Profile(name: String, attributes: Seq[Attribute])

object SparkJoinArrayColumn {
  def main(args: Array[String]): Unit = {
    val sc: SparkContext = new SparkContext(
      new SparkConf().setMaster("local").setAppName(getClass.getSimpleName))
    val sqlContext: SQLContext = new SQLContext(sc)

    import sqlContext.implicits._

    val a: DataFrame = sc.parallelize(Seq(
      Profile("Alice", Seq(
        Attribute("email", "al...@mail.com", 1.0f),
        Attribute("email", "a.jo...@mail.com", 1.0f)))
    )).toDF.as("a")

    val b: DataFrame = sc.parallelize(Seq(
      Profile("Alice", Seq(
        Attribute("email", "al...@mail.com", 1.0f),
        Attribute("age", "29", 0.2f)))
    )).toDF.as("b")


    a.where($"a.attributes.name" === "email")
      .join(
        b.where($"b.attributes.name" === "email"),
        $"a.attributes.value" === $"b.attributes.value"
      )
      .show()
  }
}
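
A possible direction, offered as a sketch rather than a confirmed fix: selecting a field through an array-of-structs column (e.g. `attributes.value`) yields an array of all the values, so the conditions above most likely compare whole arrays instead of individual emails. One way around that is to flatten the array with `explode` from `org.apache.spark.sql.functions` first and join on the flattened rows. The helper `emailRows`, the object name, and the aliases `ea`/`eb` are hypothetical names introduced for illustration. The email strings are copied as they appear above, including the elision.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.{SparkConf, SparkContext}

case class Attribute(name: String, value: String, weight: Float)
case class Profile(name: String, attributes: Seq[Attribute])

object SparkJoinExplodedAttributes {
  // Hypothetical helper: turns one row per profile into one row per
  // 'email' attribute, with columns (name, email).
  def emailRows(profiles: DataFrame): DataFrame =
    profiles
      .select(col("name"), explode(col("attributes")).as("attr"))
      .where(col("attr.name") === "email")
      .select(col("name"), col("attr.value").as("email"))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("SparkJoinExplodedAttributes"))
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._

    val a: DataFrame = sc.parallelize(Seq(
      Profile("Alice", Seq(
        Attribute("email", "al...@mail.com", 1.0f),
        Attribute("email", "a.jo...@mail.com", 1.0f)))
    )).toDF

    val b: DataFrame = sc.parallelize(Seq(
      Profile("Alice", Seq(
        Attribute("email", "al...@mail.com", 1.0f),
        Attribute("age", "29", 0.2f)))
    )).toDF

    // Join on individual email values rather than on whole arrays.
    val ea = emailRows(a).as("ea")
    val eb = emailRows(b).as("eb")

    ea.join(eb, $"ea.email" === $"eb.email").show()

    sc.stop()
  }
}
```

With the sample data, only the address shared by both profiles should match, giving one row per matching email pair. To get the full profile (all of Alice's emails) back, the matched names could then be joined back to `a`, assuming the name identifies a profile.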

Thanks in advance!