Parse tab-separated file including JSON efficiently

2015-09-14 Thread matthes
I'm trying to parse a tab-separated file in Spark 1.5, which contains a JSON section, as efficiently as possible.
The file looks as follows (tab-separated, three fields per line):

value1<TAB>value2<TAB>{json}

How can I parse all fields, including the JSON field, directly into an RDD?

If I use this piece of code:

val jsonCol = sc.textFile("/data/input")
  .map(l => l.split("\t", 3))
  .map(x => x(2).trim())
  .cache()
val json = sqlContext.read.json(jsonCol).rdd

I lose value1 and value2!
I'm open to any ideas!
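
One direction I'm considering (a minimal sketch, untested; it assumes the third column is always a JSON object, that value1/value2 contain nothing that needs JSON escaping, and the field names value1/value2/payload are just placeholders) is to fold the leading columns into the JSON string so a single read.json call keeps everything together:

val wrapped = sc.textFile("/data/input").map { line =>
  val Array(v1, v2, json) = line.split("\t", 3)
  // splice the two leading columns into the JSON object itself
  s"""{"value1":"$v1","value2":"$v2","payload":${json.trim}}"""
}
val all = sqlContext.read.json(wrapped)   // schema is inferred, including the nested payload
all.printSchema()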



I'm using Spark 1.5.
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parse-tab-seperated-file-inc-json-efficent-tp24691.html



LATERAL VIEW explode requests the full schema

2015-03-03 Thread matthes
I use "LATERAL VIEW explode(...)" to read data from a parquet-file but the
full schema is requeseted by parquet instead just the used columns. When I
didn't use LATERAL VIEW the requested schema has just the two columns which
I use. Is it correct or is there place for an optimization or do I
understand there somthing wrong?

Here are my examples:

1) hiveContext.sql("SELECT userid FROM pef WHERE observeddays==20140509") 

The requested schema is:

  optional group observedDays (LIST) {
    repeated int32 array;
  }
  required int64 userid;
}

This is what I expect; the result itself does not work, but that is not the problem here!

2) hiveContext.sql("SELECT userid FROM pef LATERAL VIEW explode(observeddays) od AS observed WHERE observed==20140509")

The requested schema is:

  required int64 userid;
  optional int32 source;
  optional group observedDays (LIST) {
    repeated int32 array;
  }
  optional group placetobe (LIST) {
    repeated group bag {
      optional group array {
        optional binary palces (UTF8);
        optional group dates (LIST) {
          repeated int32 array;
        }
      }
    }
  }
}

Why does Parquet request the full schema? I only use two fields of the table.

Can somebody please explain to me why this happens?
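
In case it helps the discussion, here is a workaround I'm thinking about trying (an untested sketch; I don't know whether column pruning is pushed through a subquery combined with LATERAL VIEW in this version), which pre-projects only the needed columns:

hiveContext.sql("""
  SELECT userid
  FROM (SELECT userid, observeddays FROM pef) p
  LATERAL VIEW explode(p.observeddays) od AS observed
  WHERE observed==20140509
""")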

Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LATERAL-VIEW-explode-requests-the-full-schema-tp21893.html



Can't access nested types with sql

2015-01-23 Thread matthes
I'm trying to work with nested Parquet data. Reading and writing the Parquet file works now, but when I try to query a nested field with SQLContext I get an exception:

RuntimeException: "Can't access nested field in type
ArrayType(StructType(List(StructField(..."

I generate the Parquet file by parsing the data into the following case class structure:

case class areas(area: String, dates: Seq[Int])
case class dataset(userid: Long, source: Int, days: Seq[Int], areas: Seq[areas])

Automatically generated schema:
root
 |-- userid: long (nullable = false)
 |-- source: integer (nullable = false)
 |-- days: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- areas: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- area: string (nullable = true)
 |    |    |-- dates: array (nullable = true)
 |    |    |    |-- element: integer (containsNull = false)
 
After writing the Parquet file I load the data again, create a SQLContext, and try to execute a SQL command as follows:

parquetFile.registerTempTable("testtable")
val result = sqlContext.sql("SELECT areas.area FROM testtable where userid > 50")
result.map(t => t(0)).collect().foreach(println) // throws the exception

If I execute this command instead:

val result = sqlContext.sql("SELECT areas[0].area FROM testtable where userid > 50")

I only get the values at the first position of the array, but I need every value, so that doesn't work.
I saw the function t.getAs[...], but everything I tried didn't work.
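
Roughly what I have in mind is the following (an untested sketch; I'm assuming the elements of the areas array come back as Seq[Row], and that field index 0 inside each element is the area field, as in the schema above):

import org.apache.spark.sql.Row

val rows = sqlContext.sql("SELECT areas FROM testtable where userid > 50")
val allAreas = rows.flatMap { r =>
  // every element of the areas array should be a Row(area, dates)
  r.getAs[Seq[Row]](0).map(_.getAs[String](0))
}
allAreas.collect().foreach(println)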

I hope somebody can help me access a nested field so that I can read all values of the nested array, or tell me whether this simply isn't supported.

I use spark-sql_2.10(v1.2.0), spark-core_2.10(v1.2.0) and parquet 1.6.0rc4.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-access-nested-types-with-sql-tp21336.html



Re: Spark run slow after unexpected repartition

2014-09-30 Thread matthes
I have the same problem! I have to start the same job 3 or 4 times; how often depends on how big the data and the cluster are. The runtime degrades over the following jobs, and in the end I get the fetch failure error. At that point I must restart the Spark shell, and then everything works well again. And I don't use the caching option!

By the way, I see the same behavior with different jobs!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-run-slow-after-unexpected-repartition-tp14542p15416.html



Re: Is it possible to use Parquet with Dremel encoding

2014-09-29 Thread matthes
Thank you so much, guys, for helping me, but I have some more questions about it!

Do we have to presort the columns to get the benefit of the run-length encoding, or do I have to group the data first and wrap it into a case class?

I tried sorting the data first before writing it out, and I get different sizes as a result:

65.191.222 bytes (unsorted)
62.576.598 bytes (sorted)
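
For reference, this is roughly how I sort before writing (a sketch only; the case class, the column names and the input RDD "data" are stand-ins for my real ones):

case class Record(col1: Long, col2: Long, col3: Int, col4: Int)

val sorted = data.sortBy(r => (r.col3, r.col4))   // low-cardinality columns first

import sqlContext.createSchemaRDD                 // implicit RDD -> SchemaRDD conversion
sorted.saveAsParquetFile("/out/sorted.parquet")

The idea is that sorting on the low-cardinality columns first should produce long runs of identical values within each column chunk, which is what run-length/dictionary encoding should benefit from.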

I see no run-length encoding in the debug output:

14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 4.572.354B for
[col1] INT64: 683.189 values, 5.465.512B raw, 4.572.211B comp, 6 pages,
encodings: [PLAIN, BIT_PACKED]
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 4.687.432B for
[col2] INT64: 683.189 values, 5.465.512B raw, 4.687.289B comp, 6 pages,
encodings: [PLAIN, BIT_PACKED]
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 847.267B for
[col3] INT32: 683.189 values, 852.104B raw, 847.198B comp, 3 pages,
encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 713 entries, 2.852B raw,
713B comp}
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 796.082B for
[col4] INT32: 683.189 values, 907.744B raw, 796.013B comp, 3 pages,
encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1.262 entries, 5.048B raw,
1.262B comp}


By the way, why is the schema wrong? I do include repeated values there; I'm very confused!

Thanks 
Matthes



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-Parquet-with-Dremel-encoding-tp15186p15344.html



Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread matthes
Hi Frank,

thanks a lot for your response, this is very helpful!

Actually I'm trying to figure out whether the current Spark version supports repetition levels
(https://blog.twitter.com/2013/dremel-made-simple-with-parquet), but now it looks good to me.
It is very hard to find good information about that. Now I found this as well:
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala;h=1dc58633a2a68cd910c1bab01c3d5ee1eb4f8709;hb=f479cf37

I wasn't sure about it because nested data can mean many different things!
If it works with SQL, being able to find the firstRepeatedid or secoundRepeatedid would be awesome. But if it only works with some kind of map/reduce job, that is also fine. The most important thing is to filter on the first or second repeated value as fast as possible, also in combination. I'm now starting to play with these things to get the best search results!

My schema looks like this:

val nestedSchema =
  """message nestedRowSchema
     {
       int32 firstRepeatedid;
       repeated group level1
       {
         int64 secoundRepeatedid;
         repeated group level2
         {
           int64 value1;
           int32 value2;
         }
       }
     }
  """

Best,
Matthes



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-Parquet-with-Dremel-encoding-tp15186p15239.html



Re: Is it possible to use Parquet with Dremel encoding

2014-09-26 Thread matthes
Thank you Jey,

That is a nice introduction, but it may be a bit too old (Aug 21st, 2013):

"Note: If you keep the schema flat (without nesting), the Parquet files you
create can be read by systems like Shark and Impala. These systems allow you
to query Parquet files as tables using SQL-like syntax. The Parquet files
created by this sample application could easily be queried using Shark for
example."

But in this post
(http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Nested-CaseClass-Parquet-failure-td8377.html)
I found this: Nested parquet is not supported in 1.0, but is part of the
upcoming 1.0.1 release.

So the question now is: can I get the benefit of nested Parquet files for fast lookups with SQL, or do I have to write a special map/reduce job to transform and search my data?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-Parquet-with-Dremel-encoding-tp15186p15234.html



Is it possible to use Parquet with Dremel encoding

2014-09-25 Thread matthes
Hi again!

At the moment I'm trying to use Parquet, and I want to keep the data in memory in an efficient way so that requests against the data are as fast as possible.
I read that Parquet is able to encode nested columns. Parquet uses the Dremel encoding with definition and repetition levels.
Is it currently possible to use this in Spark as well, or is it not implemented yet? If yes, I'm not sure how to do it. I saw some examples that put arrays or case classes inside other case classes, but I don't think that is the right way. The other thing that I saw in this context was SchemaRDDs.

Input:

Col1 | Col2 | Col3 | Col4
int  | long | long | int
-------------------------
14   | 1234 | 1422 | 3
14   | 3212 | 1542 | 2
14   | 8910 | 1422 | 8
15   | 1234 | 1542 | 9
15   | 8897 | 1422 | 13

Want this Parquet format:

Col3 | Col1 | Col4 | Col2
long | int  | int  | long
-------------------------
1422 | 14   | 3    | 1234
"    | "    | 8    | 8910
"    | 15   | 13   | 8897
1542 | 14   | 2    | 3212
"    | 15   | 9    | 1234

It would be awesome if somebody could give me a good hint on how I can do that, or maybe a better way.
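
To make the question more concrete, this is roughly the nesting I have in mind expressed with case classes (only a sketch with invented names; I don't know whether writing nested case classes to Parquet is supported in my version):

case class Entry(col1: Int, col4: Int, col2: Long)
case class Grouped(col3: Long, entries: Seq[Entry])

val flat = sc.parallelize(Seq(
  (14, 1234L, 1422L, 3), (14, 3212L, 1542L, 2), (14, 8910L, 1422L, 8),
  (15, 1234L, 1542L, 9), (15, 8897L, 1422L, 13)))

val nested = flat
  .map { case (c1, c2, c3, c4) => (c3, Entry(c1, c4, c2)) }
  .groupByKey()
  .map { case (c3, entries) => Grouped(c3, entries.toSeq) }

import sqlContext.createSchemaRDD        // implicit RDD -> SchemaRDD conversion
nested.saveAsParquetFile("/out/nested.parquet")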

Best,
Matthes




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-use-Parquet-with-Dremel-encoding-tp15186.html



Re: Set up a huge unserializable object in a mapper

2014-09-23 Thread matthes
I solved it :) I moved the lookup object into the function where I create the broadcast, and now everything works very well!

object lookupObject
{
  private var treeFile : org.apache.spark.broadcast.Broadcast[String] = _

  def main(args: Array[String]): Unit = {
    …
    val treeFile = sc.broadcast(args(0))

    object treeContainer
    {
      val tree : S2Lookup = loadTree

      def dolookup(id : Long) : Boolean =
      {
        return tree.lookupSimple(new S2CellId(id))
      }

      def loadTree() : S2Lookup =
      {
        val path = new Path(treeFile.value) // treeFile is no longer null here
        val fileSystem = FileSystem.get(new Configuration())
        new S2Lookup(ConversionUtils.deserializeCovering(new InputStreamReader(fileSystem.open(path))))
      }
    }

    …
  }
}



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setup-an-huge-Unserializable-Object-in-a-mapper-tp14817p14916.html



Re: Set up a huge unserializable object in a mapper

2014-09-23 Thread matthes
Thank you for the answer, and sorry for the duplicate question, but now it works!
I have one additional question: is it possible to use a broadcast variable in this object? At the moment I try it in the way shown below, but the broadcast object is still null.

object lookupObject
{
  private var treeFile : org.apache.spark.broadcast.Broadcast[String] = _

  def main(args: Array[String]): Unit = {
    …
    val treeFile = sc.broadcast(args(0))
    …
  }

  object treeContainer
  {
    val tree : S2Lookup = loadTree

    def dolookup(id : Long) : Boolean =
    {
      return tree.lookupSimple(new S2CellId(id))
    }

    def loadTree() : S2Lookup =
    {
      val path = new Path(treeFile.value) // treeFile is always null here
      val fileSystem = FileSystem.get(new Configuration())
      new S2Lookup(ConversionUtils.deserializeCovering(new InputStreamReader(fileSystem.open(path))))
    }
  }
}




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setup-an-huge-Unserializable-Object-in-a-mapper-tp14817p14899.html



Set up a huge unserializable object in a mapper

2014-09-22 Thread matthes
Hello everybody!

I'm a newbie in Spark and I hope my problem is solvable!
I need to set up an instance that I want to use in a mapper function. The problem is that it is not serializable, and the broadcast function is not an option for me. The instance can become very large (e.g. 1 GB-10 GB). Is there a way to run the getTree function only one time per process, like the setup step in Hadoop? At the moment it is called for every partition, and then I run out of memory. The second question is: is there also a reliable way to limit the number of mapper tasks, so that I never get more than the defined limit?
If this approach is totally wrong, please let me know. I'm open to any ideas.

My first try is:

val countresult = file.mapPartitions { valueIterator =>

  val s2tree = getTree(bcTreefilename.value)

  valueIterator.map { x =>
    val split = x.split("\t")
    val result: String = ""
    val key = split(1)
    var value = CountContainer(split(3).toInt)

    if (s2tree.lookupContainingCellsSimple(new S2CellId(split(2).toLong))) {
      value.exposureCnt = value.totalCnt
    }

    (key, value)
  }
}.reduceByKey { (x, y) => x.add(y); x }.cache()
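
For comparison, this is the kind of "load once per executor JVM" pattern I was wondering about (only a sketch; the S2Lookup stand-in and the counting are simplified here, and I don't know whether this is the recommended way):

object TreeHolder {
  // Stand-in so the sketch is self-contained; replace with the real lookup class.
  class S2Lookup { def lookupContainingCellsSimple(id: Long): Boolean = true }

  // A lazy val in a singleton object is initialized at most once per JVM,
  // so every task running on the same executor reuses the same instance.
  lazy val tree: S2Lookup = new S2Lookup   // load the real tree here instead
}

val simpleCount = file.mapPartitions { valueIterator =>
  val s2tree = TreeHolder.tree             // first access per executor pays the load cost
  valueIterator.map { x =>
    val split = x.split("\t")
    (split(1), if (s2tree.lookupContainingCellsSimple(split(2).toLong)) 1 else 0)
  }
}.reduceByKey(_ + _)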

Best,

Matthias




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Setup-an-huge-Unserializable-Object-in-a-mapper-tp14817.html