Hi Yin,
Thanks for the reply. I found that section as well a couple of days ago and
managed to integrate es-hadoop with Spark SQL [1].
Cheers,
[1] http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
On 10/2/14 6:32 PM, Yin Huai wrote:
Hi Costin,
I am answering your questions below.
1. You can find the Spark SQL data type reference here
<http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#spark-sql-datatype-reference>.
It lists the underlying data type of each Spark SQL data type for the Scala,
Java, and Python APIs. For example, in the Scala API the underlying type of
MapType is scala.collection.Map, whereas in the Java API it is java.util.Map.
For StructType, yes, the value should be cast to Row.
2. Methods like getFloat and getInt are for primitive data types. For other
types, you can access values by ordinal, for example row(1). Right now, you
have to cast values accessed by ordinal. Once
https://github.com/apache/spark/pull/1759 is in, accessing values in a row will
be much easier.
3. We are working on supporting CSV files
(https://github.com/apache/spark/pull/1351). Right now, you can use our
programmatic APIs
<http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema>
to create SchemaRDDs. Basically, you first define the schema (represented by a
StructType) of the SchemaRDD. Then, convert your RDD (for example,
RDD[String]) to RDD[Row]. Finally, use applySchema, provided by
SQLContext/HiveContext, to apply the defined schema to the RDD[Row]. The
return value of applySchema is the SchemaRDD you want.
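The three answers above can be sketched together roughly as follows. This is
only an illustration against the Spark 1.1 Scala API, assuming Spark 1.1 is on
the classpath and the code runs as a standalone script; the field names, the
sample lines, and the local master setting are all made up for the example.

```scala
// Illustrative sketch only (Spark 1.1 Scala API); names and data are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._

val sc = new SparkContext(
  new SparkConf().setMaster("local[1]").setAppName("apply-schema-sketch"))
val sqlContext = new SQLContext(sc)

// Step 1: define the schema once, instead of wrapping each row in a case class.
val schema = StructType(Seq(
  StructField("name", StringType, false),
  StructField("tags", MapType(StringType, StringType, false), false),
  StructField("address", StructType(Seq(
    StructField("city", StringType, false))), false)))

// Step 2: convert an RDD[String] of "name,key=value,city" lines to RDD[Row].
val lines = sc.parallelize(Seq("alice,team=search,Berlin", "bob,team=infra,Paris"))
val rows = lines.map { line =>
  val Array(who, tag, city) = line.split(",")
  val Array(k, v) = tag.split("=")
  Row(who, Map(k -> v), Row(city))
}

// Step 3: apply the schema; the return value is a SchemaRDD.
val people = sqlContext.applySchema(rows, schema)

// Reading values back: typed getters for primitives and strings,
// access by ordinal plus a cast for map and struct values.
val first = people.collect().head
val name = first.getString(0)
val tags = first(1).asInstanceOf[scala.collection.Map[String, String]]
val address = first(2).asInstanceOf[Row]
val city = address.getString(0)

sc.stop()
```

The casts mirror points 1 and 2: a MapType value is read as a
scala.collection.Map and a StructType value as a Row.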
Thanks,
Yin
On Tue, Sep 30, 2014 at 5:05 AM, Costin Leau <costin.l...@gmail.com
<mailto:costin.l...@gmail.com>> wrote:
Hi,
I'm working on supporting SchemaRDD in Elasticsearch Hadoop [1] but I'm
having some issues with the SQL API, in particular with what the DataTypes
translate to.
1. A SchemaRDD is composed of a Row and StructType - I'm using the latter
to decompose a Row into primitives. I'm not clear, however, on how to deal
with _rich_ types, namely array, map and struct.
MapType gives me type information about the key and its value; however,
what's the actual Map object: j.u.Map or scala.Map?
For example, assuming row(0) has a MapType associated with it, to what do I
cast row(0)?
The same goes for StructType; if row(1) has a StructType associated with it,
do I cast the value to Row?
2. Similar to the above, I've noticed the Row interface has cast methods, so
ideally one should use row(index).getFloat|Integer|Boolean etc... but I
didn't see any methods for Binary or Decimal. The _rich_ types are missing
as well; I presume this is for pluggability reasons. However, what's the
generic way to access/unwrap the generic Any/Object in this case to the
desired DataType?
3. On a separate note, for RDDs containing just values (think CSV or TSV
files), is there an option to have a header associated with the RDD without
having to wrap each row in a case class? As each entry has exactly the same
structure, the wrapping is just overhead that doesn't provide any extra
information (you know the structure of one row, you know it for all of them).
Thanks,
[1] http://github.com/elasticsearch/elasticsearch-hadoop
--
Costin
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
--
Costin