Interestingly, after more digging, df.printSchema() in raw Spark shows the
column as a long, not a bigint.
root
|-- localEventDtTm: timestamp (nullable = true)
|-- asset: string (nullable = true)
|-- assetCategory: string (nullable = true)
|-- assetType: string (nullable = true)
|-- event:
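For what it's worth, long and bigint appear to be two names for the same Spark
SQL type; a quick, hedged way to check in the shell (assuming Spark 1.3+):

import org.apache.spark.sql.types.LongType
// "long" is the name printSchema uses (typeName); "bigint" is the SQL/DDL
// name (simpleString). Both refer to the same underlying LongType.
LongType.typeName      // res: String = long
LongType.simpleString  // res: String = bigint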
Hi Folks,
Using Spark to read in JSON files and detect the schema, it gives me a
dataframe with a bigint field. R then fails to import the dataframe as it
can't convert the type.
head(mydf)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class jobj to a data.frame
Not sure if this helps, but the options I set are slightly different:
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", key)
hadoopConf.set("fs.s3n.awsSecretAccessKey", secret)
Try setting them to s3n as opposed to just s3
Good luck!
Just to add to this, here's some more info:
val myDF = hiveContext.read.parquet("s3n://myBucket/myPath/")
Produces these...
2015-07-01 03:25:50,450 INFO [pool-14-thread-4]
(org.apache.hadoop.fs.s3native.NativeS3FileSystem) - Opening
's3n://myBucket/myPath/part-r-00339.parquet' for reading
That
FWIW, I had some trouble getting Spark running on a Pi.
My core problem was using snappy for compression, as it comes as a pre-built
binary for i386 and I couldn't find one for ARM.
To work around it there was an option to use LZO instead, and then everything
worked.
Off the top of my head, it was
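The exact key isn't confirmed above, but a hedged sketch of the two likely
candidates (the setting names and values here are assumptions, not something
stated in the original post):

import org.apache.spark.{SparkConf, SparkContext}

// Guess 1: switch the internal (shuffle/broadcast) codec away from snappy,
// which needs a native library. In Spark 1.x the pure-JVM alternatives are
// "lzf" and "lz4"; this has to be set before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("pi-test")
  .set("spark.io.compression.codec", "lzf")
val sc = new SparkContext(conf)

// Guess 2: if the snappy dependency came from Parquet output instead, Parquet
// has its own knob, which does accept lzo:
//   spark.sql.parquet.compression.codec = uncompressed | snappy | gzip | lzo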
I was delighted that Spark 1.3.1 using Parquet 1.6.0 would partition data
into folders. So I set up some parquet data partitioned by date. This enabled
us to reference a single day/month/year, minimizing how much data was scanned.
e.g.:
val myDataFrame =
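To illustrate the layout described above, a rough sketch with made-up bucket,
path, and column names (the write side uses the 1.4 writer API that comes up
later in this thread; on 1.3.1 the year=/month=/day= folders can also be laid
out by hand and picked up by partition discovery on read):

// Illustrative only: assume eventsDF is a DataFrame with year/month/day columns.
eventsDF.write
  .partitionBy("year", "month", "day")
  .parquet("s3n://myBucket/eventsByDate/")

// On read, a filter on the partition columns prunes to the matching folders,
// so referencing a single day/month/year only scans that slice of the data.
val oneDay = hiveContext.read.parquet("s3n://myBucket/eventsByDate/")
  .filter("year = 2015 AND month = 7 AND day = 1")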
Hi Folks,
I just stepped up from 1.3.1 to 1.4.0; the most notable difference for me so
far is the data frame reader/writer. Previously:
val myData = hiveContext.load("s3n://someBucket/somePath/", "parquet")
Now:
val myData = hiveContext.read.parquet("s3n://someBucket/somePath")
Using the original
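For what it's worth, the writer side moved the same way (paths here are just
placeholders):

// 1.3.x style (deprecated in 1.4):
myData.save("s3n://someBucket/somePath/", "parquet")
// 1.4.x style:
myData.write.parquet("s3n://someBucket/somePath/")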
Hello Bright Sparks,
I was using Spark 1.3.0 to push data out to Parquet files. They have been
working great: super fast and an easy way to persist data frames, etc.
However, I just swapped out Spark 1.3.0 and picked up the tarball for 1.3.1.
I unzipped it, copied my config over and then went to read