Hi community,

 

As noted in other answers online, Spark does not support nesting DataFrames (a
DataFrame cannot be created or used inside another DataFrame's transformations),
so what are the options?

 

I have the following scenario:

 

dataFrame1 = List of Cities

 

dataFrame2 = created by querying Elasticsearch for each city in dataFrame1

 

Here is what I've tried:

 

    val cities = sc.parallelize(Seq("New York")).toDF()

    cities.foreach(r => {
      val companyName = r.getString(0)
      println(companyName)
      // returns a DataFrame of all the cities matching the entry in `cities`
      val dfs = sqlContext.esDF("cities/docs", "?q=" + companyName)
    })

 

This throws the expected NullPointerException:

 

java.lang.NullPointerException
    at org.elasticsearch.spark.sql.EsSparkSQL$.esDF(EsSparkSQL.scala:53)
    at org.elasticsearch.spark.sql.EsSparkSQL$.esDF(EsSparkSQL.scala:51)
    at org.elasticsearch.spark.sql.package$SQLContextFunctions.esDF(package.scala:37)
    at Main$$anonfun$main$1.apply(Main.scala:43)
    at Main$$anonfun$main$1.apply(Main.scala:39)
    at scala.collection.Iterator$class.foreach(Iterator.scala:742)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:921)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

2018-12-28 02:01:00 ERROR TaskSetManager:70 - Task 7 in stage 0.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): java.lang.NullPointerException
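
As far as I can tell, the NPE comes from the fact that esDF needs the
SQLContext, which only exists on the driver, not inside the foreach closure
running on the executors. The only rewrite I can come up with is to pull the
loop back onto the driver, roughly like the untested sketch below (same index
and query string as above):

    // everything below runs on the driver (sc and sqlContext as in spark-shell)
    import sqlContext.implicits._          // for .toDF()
    import org.elasticsearch.spark.sql._   // for sqlContext.esDF(...)

    val cities = sc.parallelize(Seq("New York")).toDF("city")

    // collect the (small) list of city names to the driver and query ES once per name
    val perCity = cities.collect().map(_.getString(0)).map { name =>
      sqlContext.esDF("cities/docs", "?q=" + name)
    }

    // stitch the per-city results back into one DataFrame
    // (assumes every query returns the same schema)
    val allMatches = perCity.reduceLeft(_ unionAll _)
    allMatches.show()

But this fires the queries one at a time from the driver, so I doubt it scales
to a long list of cities, which is why I'm asking.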

 

What other options do I have?

 

Thank you.
