[Spark] Unable to index JSON from HDFS using SchemaRDD.saveToES()

m shirley Thu, 19 Feb 2015 13:21:10 -0800

This is my first real attempt at Spark/Scala, so be gentle.

I have a file called test.json on HDFS that I'm trying to read and index 
using Spark.  I can read the file via SQLContext.jsonFile(), but when I 
call SchemaRDD.saveToEs() I get an "Invalid JSON fragment received" error.  
I suspect that saveToEs() isn't serializing each row as a JSON object and 
is instead sending only the row's field values.


What am I doing wrong?

Spark 1.2.0
Elasticsearch-hadoop 2.1.0.BUILD-20150217

test.json:
{"key":"value"}

spark-shell:
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

val input = sqlContext.jsonFile("hdfs://nameservice1/user/mshirley/test.json")

input.saveToEs("mshirley_spark_test/test")

error:
<snip>
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable 
error [Bad Request(400) - Invalid JSON fragment 
received[["value"]][MapperParsingException[failed to parse]; nested: 
ElasticsearchParseException[Failed to derive xcontent from 
(offset=13, length=9): [123, 34, 105, 110, 100, 101, 120, 34, 58, 123, 125, 
125, 10, 91, 34, 118, 97, 108, 117, 101, 34, 93, 10]]; ]]; Bailing out..
<snip>
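
As a quick check on the theory above, the byte array in the error can be 
decoded back into text to see what was actually sent to Elasticsearch.  This 
is only a sketch: the names payload, bulkBody and fragment are made up for 
illustration, and the byte values, offset and length are copied by hand from 
the error message.

```scala
// Byte values copied from the ElasticsearchParseException message above.
val payload: Array[Byte] = Array(
  123, 34, 105, 110, 100, 101, 120, 34, 58, 123, 125, 125, 10,
  91, 34, 118, 97, 108, 117, 101, 34, 93, 10
).map(_.toByte)

// The whole request body, decoded as UTF-8 text.
val bulkBody = new String(payload, "UTF-8")

// The slice Elasticsearch complained about (offset=13, length=9).
val fragment = new String(payload, 13, 9, "UTF-8")

println(bulkBody)
println(fragment)
```

The decoded body is the bulk action line {"index":{}} followed by a document 
line of ["value"], and the rejected fragment is ["value"].  That is a JSON 
array holding only the value, not the {"key":"value"} object from test.json, 
which seems consistent with the row being written without its field names.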

input:
res2: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[6] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
PhysicalRDD [key#0], MappedRDD[5] at map at JsonRDD.scala:47

input.printSchema():
root
 |-- key: string (nullable = true)
