Re: Remapping columns from a schemaRDD
Is there some place I can read more about it? I can't find any reference. I actually want to flatten these structures and not return them from the UDF.

Thanks,
Daniel
Re: Remapping columns from a schemaRDD
Probably the easiest/closest way to do this would be with a UDF, something like:

registerFunction("makeString", (s: Seq[String]) => s.mkString(","))
sql("SELECT *, makeString(c8) AS newC8 FROM jRequests")

Although this does not modify a column, but instead appends a new one. Another, more complicated way to do something like this would be with the applySchema function: http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema

I'll note that, as part of the ML pipeline work, we have been considering adding something like:

def modifyColumn(columnName, function)

Any comments anyone has on this interface would be appreciated!

Michael

On Tue, Nov 25, 2014 at 7:02 AM, Daniel Haviv danielru...@gmail.com wrote:

Hi,
I'm selecting columns from a JSON file, transforming some of them, and would like to store the result as a Parquet file, but I'm failing. This is what I'm doing:

val jsonFiles = sqlContext.jsonFile("/requests.loading")
jsonFiles.registerTempTable("jRequests")
val clean_jRequests = sqlContext.sql("select c1, c2, c3 ... c55 from jRequests")

and then I run a map:

val jRequests_flat = clean_jRequests.map(line => (
  line(1), line(2), line(3), line(4), line(5), line(6), line(7),
  line(8).asInstanceOf[Iterable[String]].mkString(","),
  line(9), line(10), line(11), line(12), line(13), line(14), line(15),
  line(16), line(17), line(18), line(19), line(20), line(21), line(22),
  line(23), line(24), line(25), line(26), line(27), line(28), line(29),
  line(30), line(31), line(32), line(33), line(34), line(35), line(36),
  line(37), line(38), line(39), line(40), line(41), line(42), line(43),
  line(44), line(45), line(46), line(47), line(48), line(49), line(50)))

1. Is there a smarter way to achieve that (only modify a certain column without touching the others, but keeping all of them)?
2. The last statement fails because the tuple has too many members:
console:19: error: object Tuple50 is not a member of package scala

Thanks for your help,
Daniel
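For readers following along, the core of the suggestion above can be sketched in plain Scala, without a Spark session. The object and method names (FlattenSketch, makeString, flattenRow) are invented here for illustration and are not Spark API; in Spark 1.x the function would be registered via sqlContext.registerFunction as shown in the reply above. This also shows one way around the Tuple50 error: Scala tuples only go up to Tuple22, but a Seq has no such limit (and in Spark a Row can be rebuilt from one with Row.fromSeq).

```scala
// A minimal sketch of the makeString UDF body and a tuple-free row rewrite.
// Names here are hypothetical, chosen for this example only.
object FlattenSketch {
  // Joins a repeated string column into one comma-separated value,
  // exactly what the proposed UDF does.
  def makeString(s: Seq[String]): String = s.mkString(",")

  // Instead of building a 50-element tuple (tuples stop at Tuple22),
  // keep the row as a Seq[Any] and rewrite only the one column.
  def flattenRow(row: Seq[Any], listCol: Int): Seq[Any] =
    row.updated(listCol, makeString(row(listCol).asInstanceOf[Seq[String]]))

  def main(args: Array[String]): Unit = {
    val row: Seq[Any] = Seq("a", Seq("x", "y", "z"), 42)
    println(flattenRow(row, 1)) // List(a, x,y,z, 42)
  }
}
```

The same updated-Seq approach applies inside a .map over a SchemaRDD: every column except the flattened one passes through untouched.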
Re: Remapping columns from a schemaRDD
Thank you.
How can I address more complex columns like maps and structs?

Thanks again!
Daniel
Re: Remapping columns from a schemaRDD
Maps should just be Scala maps; structs are rows inside of rows. If you want to return a struct from a UDF, you can do that with a case class.
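The case-class point above can be illustrated without Spark at all. The names below (StructSketch, Name, splitName, lookup) are hypothetical, made up for this sketch; the idea is that a case class returned from a UDF becomes a struct column, and a map column arrives in a UDF as an ordinary Scala Map.

```scala
// Sketch: struct columns as case classes, map columns as Scala maps.
// All identifiers here are illustrative, not part of any Spark API.
object StructSketch {
  // A case class maps to a struct: its fields become nested fields.
  case class Name(first: String, last: String)

  // A UDF body that returns a struct value.
  def splitName(full: String): Name = {
    val parts = full.split(" ", 2)
    Name(parts(0), if (parts.length > 1) parts(1) else "")
  }

  // A UDF body that consumes a map column as a plain Scala Map.
  def lookup(m: Map[String, String], key: String): String =
    m.getOrElse(key, "")

  def main(args: Array[String]): Unit = {
    println(splitName("Daniel Haviv")) // Name(Daniel,Haviv)
    println(lookup(Map("a" -> "1"), "a")) // 1
  }
}
```

In Spark 1.x these bodies would presumably be registered the same way as makeString earlier in the thread, e.g. sqlContext.registerFunction("splitName", splitName _).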