Re: [SparkR] creating dataframe from json file

jianshu Weng Wed, 15 Jul 2015 06:38:31 -0700

Thanks.

t <- getField(df$hashtags, "text") does return a Column. But when I tried
to call t <- select(df, getField(df$hashtags, "text")), it gave an error:


Error: All select() inputs must resolve to integer column positions.
The following do not:
*  getField(df$hashtags, "text")

In fact, the "text" field in df is now returned as something like
List(<hashtag1>, <hashtag2>). I want to flatten the list and make the field
a string like "<hashtag1>, <hashtag2>".

You mentioned in the email that "then you can perform operations on the
column." Bear with me if the question seems naive; I am still new
to SparkR. But what operations are allowed on the column? In the SparkR
documentation (https://spark.apache.org/docs/latest/api/R/index.html), I
didn't find any specific functions for column operations. I didn't even find
the "getField" function in the documentation.

Thanks,

-JS

On Wed, Jul 15, 2015 at 8:42 PM, Sun, Rui <rui....@intel.com> wrote:

> suppose df <- jsonFile(sqlContext, "<json file>")
>
> You can extract hashtags.text as a Column object using the following
> command:
>     t <- getField(df$hashtags, "text")
> and then you can perform operations on the column.
>
> You can extract hashtags.text as a DataFrame using the following command:
>    t <- select(df, getField(df$hashtags, "text"))
>    showDF(t)
>
> Or you can use a SQL query to extract the field:
>   hiveContext <- sparkRHive.init()
>   df <- jsonFile(hiveContext, "<json file>")
>   registerTempTable(df, "table")
>   t <- sql(hiveContext, "select hashtags.text from table")
>   showDF(t)
> ________________________________________
> From: jianshu [jian...@gmail.com]
> Sent: Wednesday, July 15, 2015 4:42 PM
> To: user@spark.apache.org
> Subject: [SparkR] creating dataframe from json file
>
> hi all,
>
> Not sure whether this is the right venue to ask. If not, please point me to
> the right group, if there is one.
>
> I'm trying to create a Spark DataFrame from a JSON file using jsonFile(). The
> call was successful, and I can see the DataFrame created. The JSON file I
> have contains a number of tweets obtained from the Twitter API. I am
> particularly interested in pulling the hashtags contained in the tweets. If I
> use printSchema(), the schema is something like:
>
> root
>  |-- id_str: string (nullable = true)
>  |-- hashtags: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- indices: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |    |-- text: string (nullable = true)
>
> showDF() would show something like this:
>
> +--------------------+
> |            hashtags|
> +--------------------+
> |              List()|
> |List([List(125, 1...|
> |              List()|
> |List([List(0, 3),...|
> |List([List(76, 86...|
> |              List()|
> |List([List(74, 84...|
> |              List()|
> |              List()|
> |              List()|
> |List([List(85, 96...|
> |List([List(125, 1...|
> |              List()|
> |              List()|
> |              List()|
> |              List()|
> |List([List(14, 17...|
> |              List()|
> |              List()|
> |List([List(14, 17...|
> +--------------------+
>
> The question now is how to extract the text of the hashtags for each tweet.
> I am still new to SparkR. I am thinking maybe I need to loop through the
> DataFrame to extract them for each tweet, but it seems that lapply does not
> really apply to a Spark DataFrame any more. Any thoughts on how to extract
> the text, given that it is inside a JSON array?
>
>
> Thanks,
>
>
> -JS