Re: Remapping columns from a schemaRDD

2014-11-26 Thread Daniel Haviv
Is there somewhere I can read more about it? I can't find any reference.
I actually want to flatten these structures, not return them from the UDF.

Thanks,
Daniel


Re: Remapping columns from a schemaRDD

2014-11-25 Thread Michael Armbrust
Maps should just be Scala maps; structs are rows inside of rows. If you
want to return a struct from a UDF, you can do that with a case class.
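
To make that mapping concrete, here is a minimal plain-Scala sketch with no Spark dependency. The `Geo` and `Request` case classes, their field names, and the `flatten` helper are all hypothetical; they only illustrate that a struct behaves like a nested record, that a map is an ordinary Scala `Map`, and one possible way to pull nested fields up into top-level values.

```scala
// Hypothetical struct: a case class like this can stand for a struct value.
case class Geo(country: String, city: String)

// Hypothetical record with one struct column and one map column.
case class Request(id: Long, geo: Geo, headers: Map[String, String])

object FlattenSketch {
  // Lift the struct's fields to top-level values and render the map
  // as sorted "key=value" pairs joined by commas.
  def flatten(r: Request): Seq[Any] = {
    val headerText =
      r.headers.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString(",")
    Seq(r.id, r.geo.country, r.geo.city, headerText)
  }
}
```

How the map column should be rendered (or split into several columns) depends on the data; joining it into one string is just one choice.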


Re: Remapping columns from a schemaRDD

2014-11-25 Thread Daniel Haviv
Thank you.

How can I address more complex columns like maps and structs?

Thanks again!
Daniel


Re: Remapping columns from a schemaRDD

2014-11-25 Thread Michael Armbrust
Probably the easiest/closest way to do this would be with a UDF, something
like:

registerFunction("makeString", (s: Seq[String]) => s.mkString(","))
sql("SELECT *, makeString(c8) AS newC8 FROM jRequests")

Note that this does not modify a column; it appends a new one.

Another, more complicated way to do something like this would be to use the
applySchema function.
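
For context on the Tuple50 error in the original question: Scala tuples stop at Tuple22, so a 50-column row cannot be built as a tuple. Below is a minimal plain-Scala sketch of the per-row step, modelling a row as `Seq[Any]` (an assumption standing in for Spark's row type); the `joinColumn` helper is hypothetical. With applySchema, the transformed values would still need to be turned back into rows and paired with a schema that matches the new column types.

```scala
object RowSketch {
  // Replace the collection-valued column at `index` with a comma-joined
  // string, leaving every other column untouched. Because the row is a
  // sequence rather than a tuple, the number of columns is not limited.
  def joinColumn(row: Seq[Any], index: Int): Seq[Any] =
    row.updated(index, row(index) match {
      case items: Iterable[_] => items.mkString(",")
      case other              => other // not a collection: keep as is
    })
}
```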

I'll note that, as part of the ML pipeline work, we have been considering
adding something like:

def modifyColumn(columnName, function)

Any comments anyone has on this interface would be appreciated!
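
One possible reading of that interface, sketched on plain Scala collections rather than on a SchemaRDD. Everything here is hypothetical: the `Table` class, its fields, and the exact signature are for illustration only and are not a Spark API.

```scala
// Hypothetical in-memory table: column names plus rows of untyped values.
case class Table(columns: Seq[String], rows: Seq[Seq[Any]]) {
  // Apply `function` to one named column in every row;
  // all other columns and the column order stay unchanged.
  def modifyColumn(columnName: String, function: Any => Any): Table = {
    val index = columns.indexOf(columnName)
    require(index >= 0, s"unknown column: $columnName")
    copy(rows = rows.map(row => row.updated(index, function(row(index)))))
  }
}
```

A sketch like this leaves open how the column's declared type should change when the function returns a different type, which a real implementation would have to track in the schema.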

Michael

On Tue, Nov 25, 2014 at 7:02 AM, Daniel Haviv  wrote:

> Hi,
> I'm selecting columns from a JSON file, transforming some of them, and would
> like to store the result as a Parquet file, but I'm failing.
>
> This is what I'm doing:
>
> val jsonFiles=sqlContext.jsonFile("/requests.loading")
> jsonFiles.registerTempTable("jRequests")
>
> val clean_jRequests=sqlContext.sql("select c1, c2, c3 ... c55 from
> jRequests")
>
> and then I run a map:
> val jRequests_flat=clean_jRequests.map(line=>{((line(1),line(2),line(3),line(4),line(5),line(6),line(7),
> line(8).asInstanceOf[Iterable[String]].mkString(","),line(9) ,line(10)
> ,line(11) ,line(12) ,line(13) ,line(14) ,line(15) ,line(16) ,line(17)
> ,line(18) ,line(19) ,line(20) ,line(21) ,line(22) ,line(23) ,line(24)
> ,line(25) ,line(26) ,line(27) ,line(28) ,line(29) ,line(30) ,line(31)
> ,line(32) ,line(33) ,line(34) ,line(35) ,line(36) ,line(37) ,line(38)
> ,line(39) ,line(40) ,line(41) ,line(42) ,line(43) ,line(44) ,line(45)
> ,line(46) ,line(47) ,line(48) ,line(49) ,line(50)))})
>
>
>
> 1. Is there a smarter way to achieve that (modify only a certain column
> without touching the others, while keeping all of them)?
> 2. The last statement fails because the tuple has too many members:
> :19: error: object Tuple50 is not a member of package scala
>
>
> Thanks for your help,
> Daniel
>
>