Hi,
 
The current implementation of JavaSchemaRDD needs a special JavaBean class to define the schema information of a table. But when developing applications with the Spark SQL API, tables are a more dynamic component: the awkward part is that whenever a new table is defined, we must write a new JavaBean class and redeploy the whole application. So here is an idea for a more general schema registration method.
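
For contrast, the JavaBean-based flow looks roughly like the sketch below (written from memory of the current Java API; the Person class, its fields, the comma-delimited parsing, and the file path are purely illustrative):

public class Person implements java.io.Serializable {
    private String name;
    private int age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

// Each new table shape needs another class like Person, compiled and redeployed with the application.
JavaRDD<Person> people = ctx.textFile("file path").map(
    new Function<String, Person>() {
        public Person call(String line) {
            String[] parts = line.split(",");
            Person p = new Person();
            p.setName(parts[0]);
            p.setAge(Integer.parseInt(parts[1].trim()));
            return p;
        }
    });
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
schemaPeople.registerAsTable("people");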
 
 
 
Step 1: Define a new Java class named RowSchema in the API to hold column information; the column names and data types are the most important pieces.
 
 
 
Step 2: The actual data is stored simply as a JavaRDD<Row>.
 
 
 
Step 3: When loading data into the JavaRDD<Row>, the API provides a general map function, which takes a RowSchema object as a parameter, to map each input line to a Row object (a sketch follows below).
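
A minimal sketch of what that general map function could look like, assuming comma-delimited input, the org.apache.spark.api.java.function.Function interface, and a Row factory along the lines of Row.create(...); the class name RowMapper and the "int"/"string" type names are made up for illustration:

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.api.java.Row;

public class RowMapper implements Function<String, Row> {
    private final RowSchema schema;

    public RowMapper(RowSchema schema) {
        this.schema = schema;
    }

    // Split one text line and convert each field according to the column's declared data type.
    public Row call(String line) {
        String[] fields = line.split(",");
        Object[] values = new Object[schema.getColNumber()];
        for (int i = 0; i < schema.getColNumber(); i++) {
            if ("int".equals(schema.getColDataType(i))) {
                values[i] = Integer.parseInt(fields[i].trim());
            } else {
                values[i] = fields[i].trim();
            }
        }
        return Row.create(values);
    }
}

With a helper like this, the .map(rs) call in the sketch further down would become .map(new RowMapper(rs)); alternatively, RowSchema itself could implement Function<String, Row> so that .map(rs) works directly.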
 
 
 
Step 4: Add a new applySchema method, which takes a RowSchema object as a parameter, to the JavaSQLContext class (a possible signature is sketched below).
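
On JavaSQLContext this would sit next to the existing JavaBean-based overload, roughly like this (the existing signature is paraphrased from the current API and may differ in its generics; only the RowSchema overload is new):

// Existing: schema derived by reflection over a JavaBean class.
public JavaSchemaRDD applySchema(JavaRDD<?> rdd, Class<?> beanClass);

// Proposed: schema supplied explicitly at runtime as a RowSchema.
public JavaSchemaRDD applySchema(JavaRDD<Row> rowRDD, RowSchema schema);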
 
 
 
Step 5: The registerAsTable and all other SQL-related methods of the JavaSQLContext class should take care of the difference between schemas defined through a JavaBean and through a RowSchema. (That is the work of the API layer.)
 
 
 
The API is something like this:
 
 
 
import java.util.List;

public class RowSchema implements java.io.Serializable {

    private final List<String> colNames;
    private final List<String> colDataTypes;

    public RowSchema(List<String> colNames, List<String> colDataTypes) {
        this.colNames = colNames;
        this.colDataTypes = colDataTypes;
    }

    public String getColName(int i) { return colNames.get(i); }          // name of column i
    public String getColDataType(int i) { return colDataTypes.get(i); }  // data type name of column i
    public int getColNumber() { return colNames.size(); }                // number of columns
}
 
RowSchema rs = new RowSchema(...);
 
 
 
JavaRDD<Row> table = ctx.textFile("file path").map(rs);
 
 
 
JavaSchemaRDD schemaPeople = sqlCtx.applySchema(table, rs);
 
schemaPeople.registerAsTable("people");
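
Once the table is registered this way, the rest of the SQL API would be used unchanged, e.g. (assuming rs defines name and age columns; the query itself is only an example):

JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");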
 
Regards,
 

 Xiaobo Gu
