Hi Costin,

I am answering your questions below.
1. You can find the Spark SQL data type reference here:
http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#spark-sql-datatype-reference
It explains the underlying type of every Spark SQL data type in the Scala,
Java, and Python APIs. For example, in the Scala API the underlying type of
MapType is scala.collection.Map, while in the Java API it is java.util.Map.
For StructType, yes, the value should be cast to Row. (There is a rough
sketch at the end of this mail.)

2. Interfaces like getFloat and getInt are for primitive data types. For
other types, you can access values by ordinal, for example row(1), and
right now you have to cast the values you get back. Once
https://github.com/apache/spark/pull/1759 is in, accessing values in a row
will be much easier. (Second sketch below.)

3. We are working on supporting CSV files
(https://github.com/apache/spark/pull/1351). Right now, you can use our
programmatic APIs
(http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema)
to create SchemaRDDs. Basically, you first define the schema of the
SchemaRDD (represented by a StructType). Then, convert your RDD (for
example, an RDD[String]) to an RDD[Row]. Finally, use the applySchema
method provided by SQLContext/HiveContext to apply the defined schema to
the RDD[Row]. The return value of applySchema is the SchemaRDD you want.
(Third sketch below.)

Thanks,

Yin

On Tue, Sep 30, 2014 at 5:05 AM, Costin Leau <costin.l...@gmail.com> wrote:

> Hi,
>
> I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1], but I'm
> having some issues with the SQL API, in particular with what the DataTypes
> translate to.
>
> 1. A SchemaRDD is composed of Rows and a StructType - I'm using the latter
> to decompose a Row into primitives. I'm not clear, however, on how to deal
> with the _rich_ types, namely array, map and struct.
> MapType gives me type information about the key and its value, but what
> is the actual Map object - j.u.Map, scala.Map?
> For example, assuming row(0) has a MapType associated with it, to what do
> I cast row(0)?
> Same goes for StructType; if row(1) has a StructType associated with it,
> do I cast the value to Row?
>
> 2. Similar to the above, I've noticed the Row interface has cast methods,
> so ideally one should use row(index).getFloat|Integer|Boolean etc., but I
> didn't see any methods for Binary or Decimal. The _rich_ types are also
> missing; I presume this is for pluggability reasons, but what's the
> generic way to access/unwrap the generic Any/Object in this case to the
> desired DataType?
>
> 3. On a separate note, for RDDs containing just values (think CSV/TSV
> files), is there an option to associate a header with them without having
> to wrap each row in a case class? As each entry has exactly the same
> structure, the wrapping is just overhead that doesn't provide any extra
> information (if you know the structure of one row, you know it for all of
> them).
>
> Thanks,
>
> [1] github.com/elasticsearch/elasticsearch-hadoop
> --
> Costin
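
P.S. Regarding 1, here is a minimal sketch of unwrapping the rich types,
assuming a row whose fields at ordinals 0, 1, and 2 happen to be a MapType,
a StructType, and an ArrayType (the ordinals and element types are made up
for illustration; the underlying Scala types come from the data type
reference linked above):

import org.apache.spark.sql._

def unwrap(row: Row): Unit = {
  // MapType: in the Scala API the value is a scala.collection.Map.
  val m = row(0).asInstanceOf[scala.collection.Map[String, Int]]

  // StructType: the nested value is itself a Row.
  val nested = row(1).asInstanceOf[Row]

  // ArrayType: in the Scala API the value is a scala.collection.Seq.
  val xs = row(2).asInstanceOf[Seq[String]]

  println(s"map=$m struct=$nested array=$xs")
}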
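
Regarding 2, a small sketch of the accessors, again with made-up ordinals
and types (per the data type reference, BinaryType maps to Array[Byte] and
DecimalType to scala.math.BigDecimal in the Scala API):

// Primitive types have dedicated accessors on Row.
val count = row.getInt(0)
val ratio = row.getFloat(1)
val flag = row.getBoolean(2)

// Binary and Decimal currently have no dedicated accessors;
// access them by ordinal and cast.
val bytes = row(3).asInstanceOf[Array[Byte]]
val price = row(4).asInstanceOf[scala.math.BigDecimal]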
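
Regarding 3, a sketch of the applySchema flow for a header-less CSV-like
file, adapted from the programmatically-specifying-the-schema example in
the 1.1.0 guide (the file path, column names, and types are made up for
illustration):

import org.apache.spark.sql._

// sc is an existing SparkContext.
val sqlContext = new SQLContext(sc)

// A CSV-like file with lines such as "Michael,29".
val lines = sc.textFile("people.csv")

// Define the schema once; every row shares it, so no case class is needed.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// Convert the RDD[String] into an RDD[Row] that matches the schema.
val rowRDD = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))

// Apply the schema; the result is the SchemaRDD you want.
val people = sqlContext.applySchema(rowRDD, schema)
people.registerTempTable("people")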