Spark SQL: SchemaRDD, DataFrame. Multi-value, Nested attributes

Eugene Morozov Wed, 22 Apr 2015 12:16:44 -0700

Hi!

I’m trying to query a dataset that reads data from CSV and exposes a SQL 
interface on top of it. The problem is that I have a hierarchy of objects that 
I need to represent as a table, so that users can query it with SQL and run 
aggregations. I have multi-value attributes (which in the CSV file look like 
column_1, column_2, …, column_n), and I have entities that are split across 
several columns, like an Address (city, street, etc.). Each row (let’s say it 
represents a Person) might have several Addresses.
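
To make the layout concrete, here is a minimal sketch in plain Python (no Spark 
involved) of the shape I have in mind: the numbered CSV columns are folded into 
one list per attribute, and the Address columns into a list of records. The 
column names (name, phone_N, address_N_city, address_N_street), the sample 
rows, and the helper nest_row are all made up for illustration; my real file 
differs.

```python
import csv
import io
import re

# Hypothetical CSV layout: column names and values are illustrative only.
SAMPLE = """name,phone_1,phone_2,address_1_city,address_1_street,address_2_city,address_2_street
Alice,111,222,Moscow,Tverskaya,Berlin,Unter den Linden
Bob,333,,Paris,Rue de Rivoli,,
"""

PHONE = re.compile(r"^phone_(\d+)$")            # multi-value attribute
ADDRESS = re.compile(r"^address_(\d+)_(\w+)$")  # entity split across columns


def nest_row(row):
    """Fold the numbered flat columns of one CSV row into nested values."""
    phones = {}
    addresses = {}
    for key, value in row.items():
        if not value:
            continue  # an empty cell means "no such value" in this sketch
        m = PHONE.match(key)
        if m:
            phones[int(m.group(1))] = value
            continue
        m = ADDRESS.match(key)
        if m:
            # Group the fields of the N-th address into one record.
            addresses.setdefault(int(m.group(1)), {})[m.group(2)] = value
    return {
        "name": row["name"],
        "phones": [phones[i] for i in sorted(phones)],
        "addresses": [addresses[i] for i in sorted(addresses)],
    }


people = [nest_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Each Person then carries a list of phones and a list of Address records, which 
is the structure I would like to be able to query with SQL.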


It’s pretty clear that simply flattening everything into one long list of 
columns is not a good option, as queries over such a layout could produce 
some weird results. So my questions are the following: 
1. Does SchemaRDD support something like multi-value attributes? It might look 
like an array of values that lives in just one column, although it’s not clear 
how I’d aggregate over it. Maybe there is some custom type API I can utilise?
2. Does the newly supported DataFrame provide anything in this regard? My 
understanding is that columns in a DataFrame do need to be actual columns (as 
in a relation), but they may be of different types (like arrays or objects). 
Maybe the implementation of DataFrame itself provides some sort of custom types 
or something pluggable that I might consider.

Any clue would be really appreciated.
Thanks

--
Eugene Morozov
fathers...@list.ru
