Holden,

If I were to use DataSets, then I would essentially do this:

val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl)
val messages = sqs.receiveMessage(receiveMessageRequest).getMessages()
for (message <- messages.asScala) {
    val files = sqlContext.read.json(message.getBody())
}

Can I simply do files.toDS() or do I have to create a schema using a case class 
File and apply it as[File]? If I have to apply a schema, then how would I 
create it based on the JSON structure below, especially the nested elements.

Thanks,
Ben


> On Apr 14, 2016, at 3:46 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> 
> You could certainly use RDDs for that, you might also find using Dataset 
> selecting the fields you need to construct the URL to fetch and then using 
> the map function to be easier.
> 
> On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I was wonder what would be the best way to use JSON in Spark/Scala. I need to 
> lookup values of fields in a collection of records to form a URL and download 
> that file at that location. I was thinking an RDD would be perfect for this. 
> I just want to hear from others who might have more experience in this. Below 
> is the actual JSON structure that I am trying to use for the S3 bucket and 
> key values of each “record" within “Records".
> 
> {
>    "Records":[
>       {
>          "eventVersion":"2.0",
>          "eventSource":"aws:s3",
>          "awsRegion":"us-east-1",
>          "eventTime":The time, in ISO-8601 format, for example, 
> 1970-01-01T00:00:00.000Z, when S3 finished processing the request,
>          "eventName":"event-type",
>          "userIdentity":{
>             
> "principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
>          },
>          "requestParameters":{
>             "sourceIPAddress":"ip-address-where-request-came-from"
>          },
>          "responseElements":{
>             "x-amz-request-id":"Amazon S3 generated request ID",
>             "x-amz-id-2":"Amazon S3 host that processed the request"
>          },
>          "s3":{
>             "s3SchemaVersion":"1.0",
>             "configurationId":"ID found in the bucket notification 
> configuration",
>             "bucket":{
>                "name":"bucket-name",
>                "ownerIdentity":{
>                   "principalId":"Amazon-customer-ID-of-the-bucket-owner"
>                },
>                "arn":"bucket-ARN"
>             },
>             "object":{
>                "key":"object-key",
>                "size":object-size,
>                "eTag":"object eTag",
>                "versionId":"object version if bucket is versioning-enabled, 
> otherwise null",
>                "sequencer": "a string representation of a hexadecimal value 
> used to determine event sequence,
>                    only used with PUTs and DELETEs"
>             }
>          }
>       },
>       {
>           // Additional events
>       }
>    ]
> }
> 
> Thanks
> Ben
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 
> 
> 
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau <https://twitter.com/holdenkarau>

Reply via email to