RE: [SparkR] creating dataframe from json file

2015-07-15 Thread Sun, Rui
You can try selectExpr() of DataFrame. For example:

  # the [] operator is used to extract an item from an array
  y <- selectExpr(df, "concat(hashtags.text[0], hashtags.text[1])")

or

  sql(hiveContext, "select concat(hashtags.text[0], hashtags.text[1]) from table")

Yes, the SparkR documentation is not very complete. You can use the Scala
documentation as a reference and check whether a given method is supported in SparkR.
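
For the flattening asked about in the quoted message below (turning something like List(hashtag1, hashtag2) into the string "hashtag1, hashtag2"), here is a minimal sketch in plain R. It assumes the hashtag texts have already been brought into local R as a list with one character vector per tweet. How an array column arrives in local R depends on the SparkR version, so hashtag_text and flatten_hashtags are hypothetical names with made-up data, not part of the SparkR API.

```r
# Hypothetical local stand-in for the collected hashtags.text values:
# one character vector per tweet, empty when the tweet has no hashtags.
hashtag_text <- list(
  character(0),
  c("spark", "rstats"),
  c("bigdata")
)

# Collapse each tweet's hashtags into one comma-separated string.
# vapply() guarantees exactly one string per tweet.
flatten_hashtags <- function(tags) {
  vapply(tags, function(x) paste(x, collapse = ", "), character(1))
}

flat <- flatten_hashtags(hashtag_text)
print(flat)
```

Tweets without hashtags end up as empty strings, so the result keeps one entry per tweet and can be attached back to the other columns by position.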

From: jianshu Weng [jian...@gmail.com]
Sent: Wednesday, July 15, 2015 9:37 PM
To: Sun, Rui
Cc: user@spark.apache.org
Subject: Re: [SparkR] creating dataframe from json file

Thanks.

t <- getField(df$hashtags, "text") does return a Column. But when I tried to
call t <- select(df, getField(df$hashtags, "text")), it gave an error:

Error: All select() inputs must resolve to integer column positions.
The following do not:
*  getField(df$hashtags, text)

In fact, the text field in df is now returned as something like
List(hashtag1, hashtag2). I want to flatten the list and make the field a
string like "hashtag1, hashtag2".

You mentioned in the email that "then you can perform operations on the
column". Bear with me if the question seems naive; I am still new to
SparkR. But what operations are allowed on a column? In the SparkR
documentation (https://spark.apache.org/docs/latest/api/R/index.html) I
didn't find any specific functions for column operations. I couldn't even
find the getField function there.

Thanks,

-JS

On Wed, Jul 15, 2015 at 8:42 PM, Sun, Rui <rui@intel.com> wrote:
Suppose df <- jsonFile(sqlContext, "json file")

You can extract hashtags.text as a Column object using the following command:
  t <- getField(df$hashtags, "text")
and then you can perform operations on the column.

You can extract hashtags.text as a DataFrame using the following command:
  t <- select(df, getField(df$hashtags, "text"))
  showDF(t)

Or you can use a SQL query to extract the field:
  hiveContext <- sparkRHive.init()
  df <- jsonFile(hiveContext, "json file")
  registerTempTable(df, "table")
  t <- sql(hiveContext, "select hashtags.text from table")
  showDF(t)

From: jianshu [jian...@gmail.com]
Sent: Wednesday, July 15, 2015 4:42 PM
To: user@spark.apache.org
Subject: [SparkR] creating dataframe from json file

Hi all,

I'm not sure whether this is the right venue to ask. If not, please point me
to the right group, if there is one.

I'm trying to create a Spark DataFrame from a JSON file using jsonFile(). The
call was successful, and I can see the DataFrame created. The JSON file
contains a number of tweets obtained from the Twitter API. I am particularly
interested in pulling the hashtags contained in the tweets. If I use
printSchema(), the schema is something like:

root
 |-- id_str: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- indices: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- text: string (nullable = true)

showDF() shows something like this:

+--------------------+
|            hashtags|
+--------------------+
|              List()|
|List([List(125, 1...|
|              List()|
|List([List(0, 3),...|
|List([List(76, 86...|
|              List()|
|List([List(74, 84...|
|              List()|
|              List()|
|              List()|
|List([List(85, 96...|
|List([List(125, 1...|
|              List()|
|              List()|
|              List()|
|              List()|
|List([List(14, 17...|
|              List()|
|              List()|
|List([List(14, 17...|
+--------------------+

The question now is how to extract the text of the hashtags for each tweet.
I'm still new to SparkR. I'm thinking I may need to loop through the DataFrame
to extract it for each tweet, but it seems that lapply no longer applies to a
Spark DataFrame. Any thoughts on how to extract the text, given that it is
inside a JSON array?


Thanks,


-JS




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-creating-dataframe-from-json-file-tp23849.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org








