I am trying to parse a tab-separated file with a JSON section in Spark 1.5 as
efficiently as possible.
The file looks as follows:
value1<tab>value2<tab>{json}
How can I parse all fields, including the JSON fields, directly into an RDD?
If I use this piece of code:
val jsonCol = sc.textFile("/data/input").map(l => l.spl
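A minimal sketch of one way to do this in Spark 1.5 follows; it assumes the
JSON document is the third tab-separated column, and the path and column
positions are only illustrative:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Split every line on tabs; the plain columns stay available in the array,
// the JSON column is parsed separately with schema inference.
val rows    = sc.textFile("/data/input").map(_.split("\t"))
val jsonRdd = rows.map(cols => cols(2))          // assumed: third column is the JSON
val jsonDF  = sqlContext.read.json(jsonRdd)      // Spark infers the JSON schema

jsonDF.printSchema()

If the plain columns have to stay attached to the parsed JSON fields, one
option is to wrap them into the JSON string before calling read.json.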
I use "LATERAL VIEW explode(...)" to read data from a parquet-file but the
full schema is requeseted by parquet instead just the used columns. When I
didn't use LATERAL VIEW the requested schema has just the two columns which
I use. Is it correct or is there place for an optimization or do I
unders
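For reference, a hedged sketch of the kind of query described above; the table
and column names are invented here, and in Spark 1.x LATERAL VIEW is HiveQL
syntax, so it needs a HiveContext:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val events = hiveContext.read.parquet("/data/events")   // hypothetical Parquet path
events.registerTempTable("events")

// Only two columns are referenced, but the nested values can only be reached
// by exploding the array column.
hiveContext.sql(
  """SELECT userId, item.value1
    |FROM events
    |LATERAL VIEW explode(items) t AS item""".stripMargin).show()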
I am trying to work with nested Parquet data. Reading and writing the Parquet
file actually works now, but when I try to query a nested field with SQLContext
I get an exception:
RuntimeException: "Can't access nested field in type
ArrayType(StructType(List(StructField(..."
I generate the parquet file
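As a hedged illustration of what usually triggers that message: a nested field
behind an ArrayType often cannot be addressed directly and has to be exploded
first. The column names below are invented:

import org.apache.spark.sql.functions.explode

val df = sqlContext.read.parquet("/data/nested.parquet")   // hypothetical path

// Selecting level1.value1 directly fails when level1 is an array of structs;
// exploding the array yields one struct per row whose fields can be selected.
df.select(explode(df("level1")).as("l1"))
  .select("l1.value1", "l1.value2")
  .show()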
I have the same problem! I start the same job over 3 or 4 times; it depends on
how big the data and the cluster are. The runtime goes down in the subsequent
jobs, and at the end I get the Fetch failure error; at this point I must
restart the Spark shell and then everything works well again. And I don't
entries, 5.048B raw,
1.262B comp}
By the way, why is the schema wrong? I included repeated values there; I'm
very confused!
Thanks
Matthes
    int64 secoundRepeatedid;
    repeated group level2
    {
      int64 value1;
      int32 value2;
    }
  }
}
"""
Best,
Matthes
Thank you Jey,
That is a nice introduction, but it may be too old (Aug 21st, 2013):
"Note: If you keep the schema flat (without nesting), the Parquet files you
create can be read by systems like Shark and Impala. These systems allow you
to query Parquet files as tables using SQL-like syntax. Th
somebody could give me a good hint on how I can do that, or maybe a better
way.
Best,
Matthes
I solved it :) I moved the lookupObject into the function where I create the
broadcast, and now everything works very well!
object lookupObject
{
  private var treeFile : org.apache.spark.broadcast.Broadcast[String] = _
  def main(args: Array[String]): Unit = {
    …
    val treeFile = sc.broadcast(args(0))
    o
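A minimal, self-contained version of the pattern described above might look
like this; the object and value names are illustrative. Keeping the Broadcast
handle in a local val inside main means the closure captures the handle
itself, instead of reading a field of the singleton object, which would still
be null on the executors:

import org.apache.spark.{SparkConf, SparkContext}

object lookupExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lookup-example"))
    val treeFile = sc.broadcast(args(0))      // local val, captured by the closure below

    sc.parallelize(1 to 10)
      .map(i => s"${treeFile.value}-$i")      // broadcast value is available on every executor
      .collect()
      .foreach(println)

    sc.stop()
  }
}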
Thank you for the answer, and sorry for the double question, but now it works!
I have one additional question: is it possible to use a broadcast variable in
this object? At the moment I try it in the way below, but the broadcast object
is still null.
object lookupObject
{
  private var treeFile : org
Hello everybody!
I’m a newbie in Spark and I hope my problem is solvable!
I need to set up an instance which I want to use in a mapper function. The
problem is that it is not Serializable, and the broadcast function is not an
option for me. The instance can become huge (e.g. 1 GB-10 GB). Is there a way
to se
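Not from the original thread, but one common workaround for a huge,
non-serializable helper is sketched below: build it lazily once per executor
JVM inside a singleton, so it is never shipped from the driver. All names here
are made up:

// Stand-in for the real, huge, non-serializable structure.
object HeavyResource {
  // A lazy val on an object is initialised on first access in each JVM
  // (driver or executor) and is never serialized with the closures.
  lazy val instance: java.util.NavigableMap[Long, String] = {
    val m = new java.util.TreeMap[Long, String]()
    m.put(5L, "example")        // imagine loading the 1-10 GB structure here instead
    m
  }
}

val enriched = sc.textFile("/data/input").mapPartitions { lines =>
  val lookup = HeavyResource.instance         // one instance per executor, reused across records
  lines.map(line => line + "\t" + Option(lookup.get(line.length.toLong)).getOrElse("n/a"))
}
enriched.take(5).foreach(println)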