Hi , The current implementation of JavaSchemaRDD need a special JavaBean class to define schema information for tables, but when developing applications using the Spark SQL API, table is a more dynamic component, the awkward thing is when new tables are defined, we must create a new JavaBean, and redeploy the whole application. So here comes an idea regarding a more general schema registration method, Step1: Defile a new Java class named RowSchema in API to define column information, column name and data type are most important ones. Step2: the actual data is store just as JavaRDD<Row>; Step3:When loading data into JavaRDD<Row>, the API provides a general map function, which takes a RowSchema object as parameter, to map each line to a Row object. Step4: Add a new applySchema method , which takes a RowSchema object as parameter, to the JavaSQLContext class , Step 5: The registerAsTable and all other SQL releated methods of JavaSQLContext class should take care of the difference of defining schema throw JavaBean and RowSchemas.(That’s the work of the API layer) The API is something like this: Public Class RowSchema{ Public RowSchema(List<String> colNames, List<String> colDataTypes); Public String getColName(integer i);//return column string of column I; Public integer getColDataType(integer i);//return data type of column I; Public integer getColNumber();// return number of columns }; RowSchema rs = new RowSchema(……); JavaRDD<Row> table = ctx.textFile(“file path”).map(rs); JavaSchemaRDD schemaPeople = sqlCtx.applySchema(table, rs); schemaPeople.registerAsTable("people"); Regards,
Xiaobo Gu