Re: SparkR Supported Types - Please add bigint

2015-07-23 Thread Exie
Interestingly, after more digging, df.printSchema() in raw Spark shows the
columns as long, not bigint.

root
 |-- localEventDtTm: timestamp (nullable = true)
 |-- asset: string (nullable = true)
 |-- assetCategory: string (nullable = true)
 |-- assetType: string (nullable = true)
 |-- event: string (nullable = true)
 |-- extras: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- ipAddress: string (nullable = true)
 |-- memberId: string (nullable = true)
 |-- system: string (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- title: string (nullable = true)
 |-- trackingId: string (nullable = true)
 |-- version: long (nullable = true)

I'm going to have to keep digging I guess. :(
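
In the meantime, a possible workaround sketch on the Scala side (untested; casting to double is my own assumption and trades exactness on very large values, and I'm only selecting a handful of columns for brevity): cast the long columns to something SparkR already understands before handing the DataFrame over.

// spark-shell sketch: replace the long columns with double casts before
// collecting into R; selectExpr has been on DataFrame since 1.3
val dfForR = df.selectExpr(
  "localEventDtTm",
  "asset",
  "event",
  "cast(timestamp as double) as timestamp",
  "cast(version as double) as version")
dfForR.printSchema()  // timestamp and version should now show as double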







SparkR Supported Types - Please add bigint

2015-07-23 Thread Exie
Hi Folks,

Using Spark to read in JSON files and detect the schema, it gives me a
DataFrame with a bigint field. R then fails to import the DataFrame as it
can't convert the type.

> head(mydf)
Error in as.data.frame.default(x[[i]], optional = TRUE) : 
  cannot coerce class "jobj" to a data.frame

> show(mydf)
DataFrame[localEventDtTm:timestamp, asset:string, assetCategory:string,
assetType:string, event:string,
extras:array<struct<name:string,value:string>>, ipAddress:string,
memberId:string, system:string, timestamp:bigint, title:string,
trackingId:string, version:bigint]


I believe this is related to:
https://issues.apache.org/jira/browse/SPARK-8840

A sample record in raw JSON looks like this:
{"version": 1, "event": "view", "timestamp": 1427846422377, "system": "DCDS",
"asset": "6404476", "assetType": "myType", "assetCategory": "myCategory",
"extras": [{"name": "videoSource", "value": "mySource"}, {"name": "playerType",
"value": "Article"}, {"name": "duration", "value": "202088"}],
"trackingId": "155629a0-d802-11e4-13ee-6884e43d6000", "ipAddress": "165.69.2.4",
"title": "myTitle"}

Can someone turn this into a feature request or something for 1.5.0 ?
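
In the meantime, one idea I may try (a Scala sketch only, untested; the path is a placeholder and the partial schema below maps the two offending fields to double, which is my own choice): supply an explicit schema so read.json never infers bigint in the first place.

import org.apache.spark.sql.types._

// partial explicit schema: declare timestamp/version as double so nothing
// comes back as bigint; the remaining fields are omitted for brevity
val mySchema = StructType(Seq(
  StructField("version", DoubleType, nullable = true),
  StructField("event", StringType, nullable = true),
  StructField("timestamp", DoubleType, nullable = true),
  StructField("system", StringType, nullable = true),
  StructField("title", StringType, nullable = true)))

val mydf = sqlContext.read.schema(mySchema).json("s3n://someBucket/somePath/")
mydf.printSchema()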






Re: s3 bucket access/read file

2015-06-30 Thread Exie
Not sure if this helps, but the options I set are slightly different:

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", key)
hadoopConf.set("fs.s3n.awsSecretAccessKey", secret)

Try setting them for s3n as opposed to just s3.

Good luck!
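
For context, an end-to-end sketch of how I use it from spark-shell (the bucket, path and key/secret values are placeholders):

// set the s3n credentials on the Hadoop configuration, then read as usual
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val lines = sc.textFile("s3n://someBucket/somePath/somefile.txt")
lines.take(5).foreach(println)  // forces a read, so credential problems show up here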






Re: Spark 1.4.0: read.df() causes excessive IO

2015-06-30 Thread Exie

Just to add to this, here's some more info:

val myDF = hiveContext.read.parquet("s3n://myBucket/myPath/")

Produces these...
2015-07-01 03:25:50,450  INFO [pool-14-thread-4] (org.apache.hadoop.fs.s3native.NativeS3FileSystem) - Opening 's3n://myBucket/myPath/part-r-00339.parquet' for reading

That is to say, it actually opens and reads every frick'n file.
Previously, it would have deferred the read until an action was called.
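
If this is the Parquet schema-merging pass touching every footer, I believe later releases (1.5 and up, as far as I can tell; I don't think the knob exists in 1.4.0) expose a switch for it. A sketch of what that looks like there:

// Spark 1.5+ sketch (my understanding, not verified on 1.4.0): skip merging
// schemas across all the part files when reading Parquet
val myDF = sqlContext.read
  .option("mergeSchema", "false")
  .parquet("s3n://myBucket/myPath/")

// or globally for the session
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")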






Re: Spark run errors on Raspberry Pi

2015-06-30 Thread Exie
FWIW, I had some trouble getting Spark running on a Pi. 

My core problem was using Snappy for compression, as it ships as a pre-built
binary for i386 and I couldn't find one for ARM.

To work around it, there was an option to use LZO instead, and then everything
worked.

Off the top of my head, it was something like:
spark.sql.parquet.compression.codec=lzo

This might be worth trying.
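
For reference, a sketch of how I set it from spark-shell (the spark.io.compression.codec line is an extra guess on my part, since Spark's own shuffle/block compression also defaults to Snappy; gzip or uncompressed would be other ways to avoid the native Snappy library):

// Parquet output compression: LZO instead of Snappy
sqlContext.setConf("spark.sql.parquet.compression.codec", "lzo")

// possibly also relevant (my assumption): switch Spark's internal block
// compression away from Snappy, e.g. in conf/spark-defaults.conf:
//   spark.io.compression.codec   lzf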






1.4.0

2015-06-30 Thread Exie
I was delighted with Spark 1.3.1 using Parquet 1.6.0, which would partition
data into folders, so I set up some parquet data partitioned by date. This
enabled us to reference a single day/month/year, minimizing how much data was
scanned.

eg:
val myDataFrame = hiveContext.read.parquet("s3n://myBucket/myPath/2014/07/01")
or
val myDataFrame = hiveContext.read.parquet("s3n://myBucket/myPath/2014/07")

However, since upgrading to Spark 1.4.0 it doesn't seem to work the same way.
The first line still works; in the 01 folder are all the normal files:
2015-06-02 20:01         0   s3://myBucket/myPath/2014/07/01/_SUCCESS
2015-06-02 20:01      2066   s3://myBucket/myPath/2014/07/01/_common_metadata
2015-06-02 20:01   1077190   s3://myBucket/myPath/2014/07/01/_metadata
2015-06-02 19:57    119933   s3://myBucket/myPath/2014/07/01/part-r-1.parquet
2015-06-02 19:57     48478   s3://myBucket/myPath/2014/07/01/part-r-2.parquet
2015-06-02 19:57    576878   s3://myBucket/myPath/2014/07/01/part-r-3.parquet

... but if I now use the second line above, to read in all the days, it comes
back empty.

Is there an option I can set somewhere to fix this?
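
While waiting for an answer, a workaround sketch I'm considering (paths are placeholders, and it assumes every day folder exists; missing days would need filtering first): read.parquet accepts multiple paths, so the month can be expanded into explicit day paths.

// expand a month into explicit day paths and pass them all to read.parquet
// (DataFrameReader.parquet takes varargs paths)
val days = (1 to 31).map(d => f"s3n://myBucket/myPath/2014/07/$d%02d")
val myDataFrame = hiveContext.read.parquet(days: _*)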






Spark 1.4.0: read.df() causes excessive IO

2015-06-29 Thread Exie
Hi Folks,

I just stepped up from 1.3.1 to 1.4.0; the most notable difference for me so
far is the DataFrame reader/writer. Previously:
   val myData = hiveContext.load("s3n://someBucket/somePath/", "parquet")
Now:
   val myData = hiveContext.read.parquet("s3n://someBucket/somePath")

Using the original code, it didn't actually fire an action, but using the
DataFrame reader, it does.

So if I have, say, 1 GB of data in a Parquet file partitioned into 1000
chunks, I now have to sit there waiting for it to scan those 1000 chunks.

It seems to be doing some sort of validation. Is there any way to switch that
off, or to stop it hitting the source so much?

I should also note the Parquet files were written with 1.6RC3, whereas I
think Spark 1.4.0 is using Parquet 1.7.x.
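
One thing I plan to try (just a sketch, and whether supplying a schema up front actually skips the per-file scan is an assumption I haven't verified; the part file name is a placeholder): hand the reader a known schema so it has nothing to discover.

// sketch: grab the schema from a single part file, then reuse it for the
// full read so the reader does not have to infer anything
val knownSchema = sqlContext.read
  .parquet("s3n://someBucket/somePath/part-r-00001.parquet")
  .schema
val myData = sqlContext.read.schema(knownSchema).parquet("s3n://someBucket/somePath")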






Spark 1.3.0 - 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-14 Thread Exie
Hello Bright Sparks,

I was using Spark 1.3.0 to push data out to Parquet files. They have been
working great: super fast, an easy way to persist data frames, etc.

However, I just swapped out Spark 1.3.0 and picked up the tarball for 1.3.1.
I unzipped it, copied my config over, and then went to read one of my parquet
files from the last release when I got this:
java.lang.NoSuchFieldError: NO_FILTER
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:299)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
    at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
    at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
    at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)

I did some googling; it appears there were some changes to the Parquet file
format.

I found a reference to an option:
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

I tried that, but got the same error (with a slightly different cause, though):
java.lang.NoSuchFieldError: NO_FILTER
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:494)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$3.apply(ParquetTypes.scala:494)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:494)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:515)
    at org.apache.spark.sql.parquet.ParquetRelation.<init>(ParquetRelation.scala:67)
    at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:542)

I presume it's not just me; has anyone else come across this?

Any suggestions on how to work around it? Can I set an option like
old.parquet.format or something?
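
Not a fix, but a diagnostic sketch I'd run first (the class name is my guess at where NO_FILTER lives in the parquet-mr build bundled with Spark): a NoSuchFieldError usually means an older parquet jar earlier on the classpath is shadowing the one 1.3.1 was compiled against, so it's worth printing which jar the class actually loads from.

// print which jar supplies the Parquet metadata converter (the class that
// should define NO_FILTER); an old parquet/parquet-hive jar here is the
// usual culprit
val cls = Class.forName("parquet.format.converter.ParquetMetadataConverter")
println(cls.getProtectionDomain.getCodeSource.getLocation)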


