Hi,
The current implementation of JavaSchemaRDD needs a dedicated JavaBean class to
define the schema information for a table, but when developing applications with
the Spark SQL API, tables are a more dynamic component. The awkward part is that
whenever a new table is defined, we must create a new JavaBean and redeploy the
whole application.
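For reference, this is roughly what the current JavaBean-based flow looks like
(essentially the pattern from the Spark SQL programming guide; the Person bean,
the parsing logic and the file path are only illustrative, ctx is a
JavaSparkContext and sqlCtx a JavaSQLContext):

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

// A dedicated JavaBean has to be written and compiled into the application
// for every table schema.
public class Person implements Serializable {
  private String name;
  private int age;
  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
  public int getAge() { return age; }
  public void setAge(int age) { this.age = age; }
}

JavaRDD<Person> people = ctx.textFile("people.txt").map(
  new Function<String, Person>() {
    public Person call(String line) {
      String[] parts = line.split(",");
      Person person = new Person();
      person.setName(parts[0]);
      person.setAge(Integer.parseInt(parts[1].trim()));
      return person;
    }
  });

JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");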
So here is an idea for a more general schema registration method:
Step 1: Define a new Java class named RowSchema in the API to describe the
column information; the column names and data types are the most important parts.
Step 2: The actual data is stored simply as a JavaRDD<Row>.
Step 3: When loading data into the JavaRDD<Row>, the API provides a general map
function, which takes a RowSchema object as a parameter, to map each input line
to a Row object (a possible shape of this mapper is sketched after the API
example below).
Step 4: Add a new applySchema method, which takes a RowSchema object as a
parameter, to the JavaSQLContext class.
Step 5: registerAsTable and all the other SQL-related methods of the
JavaSchemaRDD and JavaSQLContext classes should take care of the difference
between schemas defined through JavaBeans and through RowSchemas. (That is the
work of the API layer.)
The API is something like this:
public class RowSchema {
  public RowSchema(List<String> colNames, List<String> colDataTypes) { ... }
  public String getColName(int i) { ... }    // return the name of column i
  public int getColDataType(int i) { ... }   // return the data type code of column i
  public int getColNumber() { ... }          // return the number of columns
}
RowSchema rs = new RowSchema(...);
JavaRDD<Row> table = ctx.textFile("file path").map(rs);
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(table, rs);
schemaPeople.registerAsTable("people");
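To make Step 3 concrete, here is a minimal sketch of what the general
line-to-Row map function could look like. Everything beyond the RowSchema
methods above is an assumption: the RowSchema.INT and RowSchema.STRING
type-code constants, the comma-separated input format, and the Row.create(...)
factory are only illustrative and would depend on how the API finally exposes Row.

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.Row; // assumed location of the Row type

// Hypothetical mapper that turns one text line into a Row according to a RowSchema.
public class RowSchemaMapper implements Function<String, Row> {
  private final RowSchema schema;

  public RowSchemaMapper(RowSchema schema) {
    this.schema = schema;
  }

  public Row call(String line) {
    String[] fields = line.split(",");
    Object[] values = new Object[schema.getColNumber()];
    for (int i = 0; i < schema.getColNumber(); i++) {
      // Convert each field to the Java type declared for that column.
      switch (schema.getColDataType(i)) {
        case RowSchema.INT:
          values[i] = Integer.parseInt(fields[i].trim());
          break;
        case RowSchema.STRING:
          values[i] = fields[i];
          break;
        default:
          throw new IllegalArgumentException("Unsupported data type for column " + i);
      }
    }
    return Row.create(values); // assumed factory method
  }
}

If RowSchema itself implemented Function<String, Row> in this way, the call in
the example above, ctx.textFile("file path").map(rs), would work directly
without a separate mapper class.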
Regards,
Xiaobo Gu