Parse tab-separated file incl. JSON efficiently

2015-09-14 Thread matthes
I'm trying to parse a tab-separated file with a JSON section in Spark 1.5 as efficiently as possible. The file looks as follows: value1 \t value2 \t {json}. How can I parse all fields, including the JSON fields, into an RDD directly? If I use this piece of code: val jsonCol = sc.textFile("/data/input").map(l => l.spl
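
A minimal sketch of one way to handle this in Spark 1.5 (the tab layout and the position of the JSON column are assumptions based on the example line above):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("tsv-json"))
    val sqlContext = new SQLContext(sc)

    // Split each line on tabs; the -1 limit keeps trailing empty fields.
    val fields = sc.textFile("/data/input").map(_.split("\t", -1))

    // Hand the JSON column (assumed here to be the third field) to the JSON
    // reader as an RDD[String]; the plain columns stay available in `fields`.
    val jsonDF = sqlContext.read.json(fields.map(_(2)))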

LATERAL VIEW explode requests the full schema

2015-03-03 Thread matthes
I use "LATERAL VIEW explode(...)" to read data from a parquet file, but the full schema is requested by Parquet instead of just the columns that are used. When I don't use LATERAL VIEW, the requested schema has just the two columns I use. Is this correct, is there room for an optimization, or do I unders
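
For reference, a hedged reconstruction of the two query shapes being compared (table and column names are hypothetical; a HiveContext is assumed, since its dialect accepts LATERAL VIEW in this Spark version):

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    hiveCtx.parquetFile("/data/events").registerTempTable("events")

    // Requests only the two referenced columns from Parquet:
    hiveCtx.sql("SELECT id, items FROM events")

    // Reportedly requests the full schema instead of just id and items:
    hiveCtx.sql("SELECT id, item FROM events LATERAL VIEW explode(items) t AS item")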

Can't access nested types with sql

2015-01-23 Thread matthes
I'm trying to work with nested parquet data. Reading and writing the parquet file actually works now, but when I try to query a nested field with SqlContext I get an exception: RuntimeException: "Can't access nested field in type ArrayType(StructType(List(StructField(..." I generate the parquet file
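
A sketch of the failing shape and the usual workaround at the time (names are hypothetical): dot-access into an array of structs raises the exception, but exploding the array first exposes the struct's fields.

    // Raises RuntimeException: Can't access nested field in type ArrayType(StructType(...)):
    hiveCtx.sql("SELECT items.value FROM records")

    // Workaround: explode the array, then access the struct field.
    hiveCtx.sql("SELECT item.value FROM records LATERAL VIEW explode(items) t AS item")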

Re: Spark runs slow after unexpected repartition

2014-09-30 Thread matthes
I have the same problem! I run the same job 3 or 4 times over; how often depends on how big the data and the cluster are. The runtime goes down in the following jobs, and at the end I get the Fetch failure error; at this point I must restart the spark shell, and everything works well again. And I don't

Re: Is it possible to use Parquet with Dremel encoding

2014-09-29 Thread matthes
entries, 5.048B raw, 1.262B comp} By the way, why is the schema wrong? I include repeated values there; I'm very confused! Thanks, Matthes

Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread matthes
int64 secoundRepeatedid; repeated group level2 { int64 value1; int32 value2; } } } """ Best, Matthes
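
A hedged reconstruction of the kind of schema the fragment above is cut from, as it might be declared with parquet-mr (everything outside the visible fragment, including the message name and the repetition labels, is an assumption):

    import parquet.schema.MessageTypeParser

    val schema = MessageTypeParser.parseMessageType("""
      message nestedRecord {
        required int64 firstId;
        repeated group level1 {
          required int64 secoundRepeatedid;
          repeated group level2 {
            required int64 value1;
            required int32 value2;
          }
        }
      }
    """)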

Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread matthes
Thank you, Jey. That is a nice introduction, but it may be too old (AUG 21ST, 2013): "Note: If you keep the schema flat (without nesting), the Parquet files you create can be read by systems like Shark and Impala. These systems allow you to query Parquet files as tables using SQL-like syntax. Th

Is it possible to use Parquet with Dremel encoding

2014-09-25 Thread matthes
I would be glad if somebody could give me a good hint on how I can do that, or maybe a better way. Best, Matthes

Re: Setup a huge Unserializable Object in a mapper

2014-09-23 Thread matthes
I solved it :) I moved the lookupObject into the function where I create the broadcast, and now everything works very well! object lookupObject { private var treeFile : org.apache.spark.broadcast.Broadcast[String] = _ def main(args: Array[String]): Unit = { … val treeFile = sc.broadcast(args(0)) o
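
A sketch of why the move fixes it (the surrounding code is assumed): a broadcast stored in a singleton's field is only assigned in the driver JVM, so executors see null; a broadcast captured by the closure, as happens once the object lives inside the function that creates it, is shipped and resolved locally.

    val treeFile = sc.broadcast(args(0))   // created on the driver

    // The closure captures treeFile, so each executor can call .value on it:
    val hits = sc.textFile("/data/input").map { line =>
      val tree = treeFile.value            // resolved locally on the executor
      line.contains(tree)
    }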

Re: Setup a huge Unserializable Object in a mapper

2014-09-23 Thread matthes
Thank you for the answer, and sorry for the duplicate question, but now it works! I have one additional question: is it possible to use a broadcast variable in this object? At the moment I try it in the way below, but the broadcast object is still null. object lookupObject { private var treeFile : org

Setup a huge Unserializable Object in a mapper

2014-09-22 Thread matthes
Hello everybody! I'm a newbie in Spark, and I hope my problem is solvable! I need to set up an instance which I want to use in a mapper function. The problem is that it is not Serializable, and the broadcast function is no option for me. The instance can become very huge (e.g. 1GB-10GB). Is there a way to se
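
One common pattern for this situation, sketched here rather than taken from the thread: keep the non-serializable instance in a lazily initialized singleton, so each executor JVM builds it once from local resources and nothing needs to be serialized or broadcast. All names below are hypothetical.

    object HugeHolder {
      // Built on first use inside each executor JVM; never shipped from the driver.
      lazy val instance = new HugeUnserializableLookup("/local/path/on/executor")
    }

    class HugeUnserializableLookup(path: String) { // deliberately not Serializable
      def query(key: String): String = key         // stand-in logic
    }

    val results = sc.textFile("/data/input").mapPartitions { iter =>
      val lookup = HugeHolder.instance             // one init per JVM, reused across records
      iter.map(lookup.query)
    }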