Hi Yin,

Thanks for the reply. I found the section myself a couple of days ago and managed to integrate es-hadoop with Spark SQL [1].

Cheers,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html

On 10/2/14 6:32 PM, Yin Huai wrote:
Hi Costin,

I am answering your questions below.

1. You can find the Spark SQL data type reference here
<http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#spark-sql-datatype-reference>.
It explains the underlying type for each Spark SQL data type in the Scala, Java, and Python APIs.
For example, in the Scala API the underlying type of MapType is scala.collection.Map, while in the
Java API it is java.util.Map. For StructType, yes, the value should be cast to Row.
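
To make that concrete, here is a minimal sketch (Spark 1.1-era Scala API; it assumes a Row whose
field 0 is a MapType with string keys/values and whose field 1 is a nested StructType, which are
made-up examples):

  import org.apache.spark.sql.Row

  def decode(row: Row): Unit = {
    // MapType values come back as scala.collection.Map in the Scala API
    val tags = row(0).asInstanceOf[scala.collection.Map[String, String]]
    // nested StructType values come back as another Row
    val nested = row(1).asInstanceOf[Row]
    println(s"tags=$tags, first nested field=${nested(0)}")
  }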

2. Interfaces like getFloat and getInteger are for primitive data types. For 
other types, you can access values by
ordinal. For example, row(1). Right now, you have to cast values accessed by 
ordinal. Once
https://github.com/apache/spark/pull/1759 is in, accessing values in a row will 
be much easier.
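
A quick sketch of the two access styles (assuming a Row with the schema name STRING, age INT,
score FLOAT; the field names are illustrative only):

  import org.apache.spark.sql.Row

  def show(row: Row): Unit = {
    val name  = row.getString(0)                 // typed accessor for primitive types
    val age   = row.getInt(1)
    val score = row.getFloat(2)
    val nameAgain = row(0).asInstanceOf[String]  // ordinal access + cast for everything else
    println(s"$name / $nameAgain, $age, $score")
  }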

3. We are working on supporting CSV files 
(https://github.com/apache/spark/pull/1351). Right now, you can use our
programmatic APIs
<http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema>
 to create
SchemaRDDs. Basically, you first define the schema (represented by a 
StructType) of the SchemaRDD. Then, convert your
RDD (for example, RDD[String]) directly to RDD[Row]. Finally, use applySchema 
provided in SQLContext/HiveContext to
apply the defined schema to the RDD[Row]. The return value of applySchema is 
the SchemaRDD you want.
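
Roughly, the flow looks like this (a sketch against the Spark 1.1 API; the file name and column
names are made-up examples):

  import org.apache.spark.SparkContext
  import org.apache.spark.sql._

  def csvToSchemaRDD(sc: SparkContext): SchemaRDD = {
    val sqlContext = new SQLContext(sc)

    // 1. define the schema as a StructType
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    // 2. turn the raw RDD[String] into an RDD[Row]
    val rowRDD = sc.textFile("people.csv")
      .map(_.split(","))
      .map(parts => Row(parts(0), parts(1).trim.toInt))

    // 3. apply the schema; the result is the SchemaRDD
    sqlContext.applySchema(rowRDD, schema)
  }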

Thanks,

Yin

On Tue, Sep 30, 2014 at 5:05 AM, Costin Leau <costin.l...@gmail.com 
<mailto:costin.l...@gmail.com>> wrote:

    Hi,

    I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1] but I'm having some issues with the SQL API, in
    particular with what the DataTypes translate to.

    1. A SchemaRDD is composed of Rows and a StructType - I'm using the latter to decompose a Row into primitives.
    However, I'm not clear on how to deal with _rich_ types, namely array, map and struct.
    MapType gives me type information about the key and its value, but what is the actual Map object? j.u.Map, scala.Map?
    For example, assuming row(0) has a MapType associated with it, to what do I cast row(0)?
    Same goes for StructType; if row(1) has a StructType associated with it, do I cast the value to Row?

    2. Similar to the above, I've noticed the Row interface has cast methods, so ideally one should use
    row(index).getFloat|Integer|Boolean etc., but I didn't see any methods for Binary or Decimal. Also the _rich_
    types are missing; I presume this is for pluggability reasons, but what's the generic way to access/unwrap the
    generic Any/Object in this case to the desired DataType?

    3. On a separate note, for RDDs containing just values (think CSV, TSV
files) is there an option to have a header
    associated with it without having to wrap each row with a case class? As 
each entry has exactly the same structure,
    the wrapping is just overhead that doesn't provide any extra information 
(you know the structure of one row, you
    know it for all of them).

    Thanks,

    [1] github.com/elasticsearch/elasticsearch-hadoop <http://github.com/elasticsearch/elasticsearch-hadoop>
    --
    Costin




--
Costin

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
