Re: Remapping columns from a schemaRDD

2014-11-26 Thread Daniel Haviv
Is there some place I can read more about it? I can't find any reference.
I actually want to flatten these structures, not return them from the UDF.
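
Would dotted field access in SQL get me there? Something like this, maybe
(untested; address, city and zip stand in for my real struct fields):

// Pull hypothetical struct fields up to top-level columns.
val flat = sqlContext.sql(
  "SELECT c1, c2, address.city AS city, address.zip AS zip FROM jRequests")
flat.saveAsParquetFile("/requests.flat")  // made-up output path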

Thanks,
Daniel

On Tue, Nov 25, 2014 at 8:44 PM, Michael Armbrust mich...@databricks.com
wrote:

 Maps should just be Scala maps; structs are rows inside of rows. If you
 want to return a struct from a UDF, you can do that with a case class.


Re: Remapping columns from a schemaRDD

2014-11-25 Thread Michael Armbrust
Probably the easiest/closest way to do this would be with a UDF, something
like:

registerFunction("makeString", (s: Seq[String]) => s.mkString(","))
sql("SELECT *, makeString(c8) AS newC8 FROM jRequests")

Note that this does not modify the column in place; it appends a new column instead.
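
Put together, the whole flow would look roughly like this (an untested
sketch; it assumes the Spark 1.1-era SchemaRDD API, a SparkContext named sc,
and a made-up output path):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._  // brings registerFunction and sql into scope

val jsonFiles = sqlContext.jsonFile("/requests.loading")
jsonFiles.registerTempTable("jRequests")

// Flatten the array column c8 into a comma-separated string.
registerFunction("makeString", (s: Seq[String]) => s.mkString(","))

// Keep every column and append the flattened copy of c8.
val flattened = sql("SELECT *, makeString(c8) AS newC8 FROM jRequests")

flattened.saveAsParquetFile("/requests.parquet")  // placeholder destination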

Another, more complicated way to do something like this would be by using the
applySchema function:
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
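
Roughly like this (an illustrative sketch only; the three-column schema
stands in for the real 55 columns, and only c8 changes type):

import org.apache.spark.sql._

// Invented stand-in schema; after the transform c8 is a plain string.
val newSchema = StructType(Seq(
  StructField("c1", StringType, true),
  StructField("c2", StringType, true),
  StructField("c8", StringType, true)))

// Rebuild each row, turning the array in c8's position into a string.
val transformed = clean_jRequests.map { row =>
  Row(row(0), row(1), row(2).asInstanceOf[Iterable[String]].mkString(","))
}

val withNewSchema = sqlContext.applySchema(transformed, newSchema)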

I'll note that, as part of the ML pipeline work, we have been considering
adding something like:

def modifyColumn(columnName, function)
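
Until then, something close to it can be approximated in user code on top of
the UDF mechanism above (the helper and its names are invented, and the
function type is pinned to Seq[String] => String to keep the sketch concrete):

import org.apache.spark.sql.SchemaRDD

// Invented helper: append a transformed copy of one column, keeping the rest.
def modifyColumn(table: String, column: String,
                 f: Seq[String] => String): SchemaRDD = {
  val udfName = "modify_" + column
  sqlContext.registerFunction(udfName, f)
  sqlContext.sql(s"SELECT *, $udfName($column) AS ${column}_new FROM $table")
}

// Usage, mirroring the makeString example:
val result = modifyColumn("jRequests", "c8", _.mkString(","))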

Any comments anyone has on this interface would be appreciated!

Michael

On Tue, Nov 25, 2014 at 7:02 AM, Daniel Haviv danielru...@gmail.com wrote:

 Hi,
 I'm selecting columns from a JSON file, transforming some of them, and I'd
 like to store the result as a Parquet file, but I'm failing.

 This is what I'm doing:

 val jsonFiles = sqlContext.jsonFile("/requests.loading")
 jsonFiles.registerTempTable("jRequests")

 val clean_jRequests = sqlContext.sql("select c1, c2, c3 ... c55 from jRequests")

 and then I run a map:
 val jRequests_flat = clean_jRequests.map(line => (line(1), line(2), line(3),
 line(4), line(5), line(6), line(7),
 line(8).asInstanceOf[Iterable[String]].mkString(","), line(9), line(10),
 line(11), line(12), line(13), line(14), line(15), line(16), line(17),
 line(18), line(19), line(20), line(21), line(22), line(23), line(24),
 line(25), line(26), line(27), line(28), line(29), line(30), line(31),
 line(32), line(33), line(34), line(35), line(36), line(37), line(38),
 line(39), line(40), line(41), line(42), line(43), line(44), line(45),
 line(46), line(47), line(48), line(49), line(50)))



 1. Is there a smarter way to achieve this (modify a single column without
 touching the others, but keeping all of them)?
 2. The last statement fails because the tuple has too many members (Scala
 only defines Tuple1 through Tuple22):
 <console>:19: error: object Tuple50 is not a member of package scala


 Thanks for your help,
 Daniel




Re: Remapping columns from a schemaRDD

2014-11-25 Thread Daniel Haviv
Thank you.

How can I address more complex columns like maps and structs?

Thanks again!
Daniel


Re: Remapping columns from a schemaRDD

2014-11-25 Thread Michael Armbrust
Maps should just be Scala maps; structs are rows inside of rows. If you
want to return a struct from a UDF, you can do that with a case class.
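
For instance (a sketch; Pair and its fields are made-up names):

// Invented struct type for the example.
case class Pair(first: String, second: String)

// A UDF that returns a case class produces a struct column.
sqlContext.registerFunction("makePair", (s: Seq[String]) => Pair(s(0), s(1)))
sqlContext.sql("SELECT makePair(c8) AS c8Pair FROM jRequests")

// A map column arrives inside a UDF as an ordinary Scala Map.
sqlContext.registerFunction("mapSize", (m: Map[String, String]) => m.size)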
