BTW we merged this today: https://github.com/apache/spark/pull/4640

This should allow us, in the future, to address columns by name in a Row.
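For context, here is a sketch of what by-name lookup could look like, using a minimal Row-like stand-in rather than Spark's actual classes (the merged API's exact shape may differ):

```scala
// Minimal Row-like stand-in that carries its field names, just to sketch
// by-name lookup; this is NOT Spark's actual implementation.
class NamedRow(val fieldNames: Seq[String], val values: Seq[Any]) {
  // Translate a field name into its ordinal position.
  def fieldIndex(name: String): Int = fieldNames.indexOf(name)
  // Look up a value by name and cast it to the requested type.
  def getAs[T](name: String): T = values(fieldIndex(name)).asInstanceOf[T]
}

val row = new NamedRow(Seq("ran_id", "age"), Seq("r-123", 42))
val ranId = row.getAs[String]("ran_id")  // by name instead of by ordinal
```

The key idea is that the row (or its schema) keeps a name-to-ordinal mapping, so callers pay the name lookup cost only when they opt into it.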


On Mon, Feb 16, 2015 at 11:39 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> I can unpack the code snippet a bit:
>
> caper.select('ran_id) is the same as saying "SELECT ran_id FROM table" in
> SQL.  It's always a good idea to explicitly request the columns you need
> right before using them.  That way you are tolerant of any changes to the
> schema that might happen upstream.
>
> The next part .map { case Row(ranId: String) => ... } is doing an
> extraction to pull out the values of the row into typed variables.  This is
> the same as doing .map(row => row(0).asInstanceOf[String]) or .map(row =>
> row.getString(0)), but I find this syntax easier to read since it lines
> up nicely with the select clause that comes right before it.  It's also
> less verbose especially when pulling out a bunch of columns.
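The equivalence described above can be illustrated with a minimal stand-in for Row (a simplified model, not Spark's real class), showing how the extractor pattern lines up with the ordinal-based calls:

```scala
// Minimal stand-in for org.apache.spark.sql.Row, defined here only to
// illustrate the extractor pattern; Spark's real Row is more elaborate.
class Row(val values: Seq[Any]) {
  def apply(i: Int): Any = values(i)
  def getString(i: Int): String = values(i).asInstanceOf[String]
}
object Row {
  def apply(values: Any*): Row = new Row(values)
  // unapplySeq is what makes `case Row(x, y, ...)` patterns work.
  def unapplySeq(row: Row): Option[Seq[Any]] = Some(row.values)
}

val rows = Seq(Row("a1"), Row("b2"))

// Pattern-matching extraction, as in the snippet being discussed:
val ids1 = rows.map { case Row(ranId: String) => ranId }
// The two ordinal-based equivalents:
val ids2 = rows.map(r => r(0).asInstanceOf[String])
val ids3 = rows.map(r => r.getString(0))

// With several columns, the pattern stays compact:
val pairs = Seq(Row("a1", 10), Row("b2", 20))
val extracted = pairs.map { case Row(id: String, n: Int) => (id, n) }
```

All three single-column variants produce the same result; the pattern-match form simply names and types each column in one place.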
>
> Regarding the differences between Python and Java/Scala, part of this is
> just due to the nature of these languages.  Since Java/Scala are statically
> typed, you will always have to explicitly say the type of the column you
> are extracting (the bonus here is they are much faster than Python due to
> optimizations this strictness allows).  However, since it's already a little
> more verbose, we decided not to have the more expensive ability to look up
> columns in a row by name, and instead went with a faster ordinal-based API.
> We could revisit this, but it's not currently something we are planning to
> change.
>
> Michael
>
> On Mon, Feb 16, 2015 at 11:04 AM, Eric Bell <e...@ericjbell.com> wrote:
>
>>  I am just learning scala so I don't actually understand what your code
>> snippet is doing but thank you, I will learn more so I can figure it out.
>>
>> I am new to all of this and still trying to make the mental shift from
>> normal programming to distributed programming, but it seems to me that a
>> row object would know the schema object it came from and be able to ask
>> its schema to translate a name into a column number. Am I missing
>> something, or is this just a matter of time constraints and this one
>> hasn't made it into the queue yet?
>>
>> Barring that, do the schema classes provide methods for doing this? I've
>> looked and didn't see anything.
>>
>> I've just discovered that the Python implementation for SchemaRDD does in
>> fact allow referencing columns by name as well as by number. Why is this
>> provided in the Python implementation but not the Scala or Java ones?
>>
>> Thanks,
>>
>> --eric
>>
>>
>>
>> On 02/16/2015 10:46 AM, Michael Armbrust wrote:
>>
>> For efficiency the row objects don't contain the schema so you can't get
>> the column by name directly.  I usually do a select followed by pattern
>> matching. Something like the following:
>>
>>  caper.select('ran_id).map { case Row(ranId: String) => }
>>
>> On Mon, Feb 16, 2015 at 8:54 AM, Eric Bell <e...@ericjbell.com> wrote:
>>
>>> Is it possible to reference a column from a SchemaRDD using the column's
>>> name instead of its number?
>>>
>>> For example, let's say I've created a SchemaRDD from an avro file:
>>>
>>> val sqlContext = new SQLContext(sc)
>>> import sqlContext._
>>> val caper=sqlContext.avroFile("hdfs://localhost:9000/sma/raw_avro/caper")
>>> caper.registerTempTable("caper")
>>>
>>> scala> caper
>>> res20: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at
>>> SchemaRDD.scala:108
>>> == Query Plan ==
>>> == Physical Plan ==
>>> PhysicalRDD
>>> [ADMDISP#0,age#1,AMBSURG#2,apptdt_skew#3,APPTSTAT#4,APPTTYPE#5,ASSGNDUR#6,CANCSTAT#7,CAPERSTAT#8,COMPLAINT#9,CPT_1#10,CPT_10#11,CPT_11#12,CPT_12#13,CPT_13#14,CPT_2#15,CPT_3#16,CPT_4#17,CPT_5#18,CPT_6#19,CPT_7#20,CPT_8#21,CPT_9#22,CPTDX_1#23,CPTDX_10#24,CPTDX_11#25,CPTDX_12#26,CPTDX_13#27,CPTDX_2#28,CPTDX_3#29,CPTDX_4#30,CPTDX_5#31,CPTDX_6#32,CPTDX_7#33,CPTDX_8#34,CPTDX_9#35,CPTMOD1_1#36,CPTMOD1_10#37,CPTMOD1_11#38,CPTMOD1_12#39,CPTMOD1_13#40,CPTMOD1_2#41,CPTMOD1_3#42,CPTMOD1_4#43,CPTMOD1_5#44,CPTMOD1_6#45,CPTMOD1_7#46,CPTMOD1_8#47,CPTMOD1_9#48,CPTMOD2_1#49,CPTMOD2_10#50,CPTMOD2_11#51,CPTMOD2_12#52,CPTMOD2_13#53,CPTMOD2_2#54,CPTMOD2_3#55,CPTMOD2_4#56,CPTMOD...
>>> scala>
>>>
>>> Now I want to access fields, and of course the normal thing to do is to
>>> use a field name, not a field number.
>>>
>>> scala> val kv = caper.map(r => (r.ran_id, r))
>>> <console>:23: error: value ran_id is not a member of
>>> org.apache.spark.sql.Row
>>>        val kv = caper.map(r => (r.ran_id, r))
>>>
>>> How do I do this?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>