Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
I have a large number of files within HDFS that I would like to do a group by
statement on, a la:

val table = sc.textFile("hdfs://")
val tabs = table.map(_.split("\t"))

I'm trying to do something similar to
tabs.map(c => (c._(167), c._(110), c._(200)))

where I create a new RDD that only has those three columns, but that isn't
quite right because I'm not really manipulating sequences.

BTW, I cannot use SparkSQL / case classes right now because my table has 200
columns (and I'm on Scala 2.10.3).

Thanks!
Denny


Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Gerard Maas
Hi,

I don't get what the problem is. That map to selected columns looks like
the way to go given the context. What's not working?

Kr, Gerard
On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote:

 I have a large number of files within HDFS that I would like to do a group by
 statement on, a la:

 val table = sc.textFile("hdfs://")
 val tabs = table.map(_.split("\t"))

 I'm trying to do something similar to
 tabs.map(c => (c._(167), c._(110), c._(200)))

 where I create a new RDD that only has those three columns, but that isn't
 quite right because I'm not really manipulating sequences.

 BTW, I cannot use SparkSQL / case classes right now because my table has 200
 columns (and I'm on Scala 2.10.3).

 Thanks!
 Denny




Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Getting a bunch of syntax errors. Let me get back with the full statement
and error later today. Thanks for verifying my thinking wasn't out in left
field.
On Sun, Dec 14, 2014 at 08:56 Gerard Maas gerard.m...@gmail.com wrote:

 Hi,

 I don't get what the problem is. That map to selected columns looks like
 the way to go given the context. What's not working?

 Kr, Gerard
 On Dec 14, 2014 5:17 PM, Denny Lee denny.g@gmail.com wrote:

 I have a large number of files within HDFS that I would like to do a group by
 statement on, a la:

 val table = sc.textFile("hdfs://")
 val tabs = table.map(_.split("\t"))

 I'm trying to do something similar to
 tabs.map(c => (c._(167), c._(110), c._(200)))

 where I create a new RDD that only has those three columns, but that isn't
 quite right because I'm not really manipulating sequences.

 BTW, I cannot use SparkSQL / case classes right now because my table has 200
 columns (and I'm on Scala 2.10.3).

 Thanks!
 Denny




Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Michael Armbrust

 BTW, I cannot use SparkSQL / case classes right now because my table has 200
 columns (and I'm on Scala 2.10.3).


You can still apply the schema programmatically:
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
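
As a rough sketch of the programmatic-schema approach the linked guide describes (assuming the Spark 1.x API with SQLContext.applySchema, a spark-shell session where sc is defined, and made-up column names such as "col167"):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Build Row objects for just the columns of interest; the column names
// and the HDFS path placeholder are illustrative, not from the thread.
val rowRDD = sc.textFile("hdfs://...")
  .map(_.split("\t"))
  .map(p => Row(p(167), p(110), p(200)))

// Build the StructType from a list of names instead of a case class,
// which avoids the 22-field case class limit on Scala 2.10.
val schema = StructType(
  Seq("col167", "col110", "col200")
    .map(name => StructField(name, StringType, nullable = true)))

val projected = sqlContext.applySchema(rowRDD, schema)
projected.registerTempTable("projected")

// e.g. the group-by from the original question, expressed in SQL
sqlContext.sql("SELECT col167, COUNT(*) FROM projected GROUP BY col167").collect()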


Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Yes - that works great! Sorry for implying I couldn't. Was just more
flummoxed that I couldn't make the Scala call work on its own. Will
continue to debug ;-)
On Sun, Dec 14, 2014 at 11:39 Michael Armbrust mich...@databricks.com
wrote:

 BTW, I cannot use SparkSQL / case classes right now because my table has 200
 columns (and I'm on Scala 2.10.3).


 You can still apply the schema programmatically:
 http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema



Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Yana Kadiyska
Denny, I am not sure what exception you're observing but I've had luck with
2 things:

val table = sc.textFile("hdfs://")

You can try calling table.first here and you'll see the first line of the
file.
You can also do val debug = table.first.split("\t"), which would give you an
array, and you can then verify that the array contains what you want in
positions 167, 110, and 200. In the case of large files with a random bad
line, I find wrapping the call within the map in try/catch very valuable --
you can dump out the whole line in the catch block.

Lastly, I would guess that you're getting a compile error and not a runtime
error -- I believe c is an array of values, so I think you want
tabs.map(c => (c(167), c(110), c(200))) instead of tabs.map(c => (c._(167),
c._(110), c._(200))).
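
A rough sketch of the debugging pattern described above, assuming a spark-shell session with sc defined; the path placeholder and the choice to skip bad lines by returning None are illustrative, not from the thread:

val table = sc.textFile("hdfs://...")

// Peek at the raw data before mapping the whole file.
println(table.first)
val debug = table.first.split("\t")
println(debug.length)

// Plain parentheses index into the Array[String] produced by split.
// Wrapping the body in try/catch lets a single malformed line be logged
// and skipped instead of failing the whole job.
val tabs = table.map(_.split("\t"))
val selected = tabs.flatMap { c =>
  try {
    Some((c(167), c(110), c(200)))
  } catch {
    case e: ArrayIndexOutOfBoundsException =>
      System.err.println("Bad line with " + c.length + " fields: " + c.mkString("\t"))
      None
  }
}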



On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee denny.g@gmail.com wrote:

 Yes - that works great! Sorry for implying I couldn't. Was just more
 flummoxed that I couldn't make the Scala call work on its own. Will
 continue to debug ;-)

 On Sun, Dec 14, 2014 at 11:39 Michael Armbrust mich...@databricks.com
 wrote:

 BTW, I cannot use SparkSQL / case classes right now because my table has 200
 columns (and I'm on Scala 2.10.3).


 You can still apply the schema programmatically:
 http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema




Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Oh, just figured it out:

tabs.map(c => Array(c(167), c(110), c(200)))

Thanks for all of the advice, eh?!
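
Putting the thread together, a minimal end-to-end sketch of the select-then-group-by the original post was after; the path, the choice of column 167 as the grouping key, and the count aggregation are assumptions for illustration:

import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

val table = sc.textFile("hdfs://...")
val tabs  = table.map(_.split("\t"))

// A tuple keeps the three columns as separate fields; an Array (as above)
// works just as well if positional access is preferred.
val selected = tabs.map(c => (c(167), c(110), c(200)))

// Group by the first selected column and count rows per key.
val counts = selected
  .map { case (k, _, _) => (k, 1L) }
  .reduceByKey(_ + _)

counts.take(10).foreach(println)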





On Sun Dec 14 2014 at 1:14:00 PM Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Denny, I am not sure what exception you're observing but I've had luck
 with 2 things:

 val table = sc.textFile("hdfs://")

 You can try calling table.first here and you'll see the first line of the
 file.
 You can also do val debug = table.first.split("\t"), which would give you an
 array, and you can then verify that the array contains what you want in
 positions 167, 110, and 200. In the case of large files with a random bad
 line, I find wrapping the call within the map in try/catch very valuable --
 you can dump out the whole line in the catch block.

 Lastly, I would guess that you're getting a compile error and not a runtime
 error -- I believe c is an array of values, so I think you want
 tabs.map(c => (c(167), c(110), c(200))) instead of tabs.map(c => (c._(167),
 c._(110), c._(200))).



 On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee denny.g@gmail.com wrote:

 Yes - that works great! Sorry for implying I couldn't. Was just more
 flummoxed that I couldn't make the Scala call work on its own. Will
 continue to debug ;-)

 On Sun, Dec 14, 2014 at 11:39 Michael Armbrust mich...@databricks.com
 wrote:

 BTW, I cannot use SparkSQL / case classes right now because my table has 200
 columns (and I'm on Scala 2.10.3).


 You can still apply the schema programmatically:
 http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema