If you intern the strings, it will be more efficient, but still significantly
more expensive than the class-based approach.
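
As a rough illustration of what interning the field names could look like, here is a minimal sketch in plain Scala. The header line, the `fieldNames` value, and the `toRow` helper are hypothetical names made up for this example, not something from the original code, and the sketch does not measure any cost.

```scala
// Hypothetical header; the thread only mentions fields named V1..V7.
val header = "V1\tV2\tV3"

// Intern each parsed field name once, so every row reuses the same key instances.
val fieldNames: Array[String] = header.split("\t").map(_.intern())

// Build one Map per input line, keyed by the shared field names.
def toRow(line: String): Map[String, String] = {
  // A limit of -1 keeps trailing empty fields instead of dropping them.
  val values = line.split("\t", -1).map(_.trim)
  fieldNames.zip(values).toMap
}

val row = toRow("a\tb\t")
```

Each row still carries its own Map structure, which is the overhead a class with fixed fields avoids.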

** VERY EXPERIMENTAL **
We are working with EPFL on a lightweight syntax for naming the results of
Spark transformations in Scala (and are going to make it interoperate with
SQL). Sparse details here: https://github.com/scala-records/scala-records

Stay tuned for more...

Michael


On Thu, Jul 17, 2014 at 4:47 AM, Luis Guerra wrote:

> Thank you for your fast reply.
>
> We are considering this Map[String, String] solution, but there are some
> details that we do not control yet. What would happen if we have different
> data types for different fields? Also, with this solution, we have to
> repeat the field names for every "row" that we have. Is this efficient?
>
> Regarding the solution with composition, the key would be repeated in the
> new class, whereas it is only necessary once after the join, isn't it?
>
>
> On Thu, Jul 17, 2014 at 10:25 AM, Sean Owen wrote:
>
>> If what you have is a large number of named strings, why not use a
>> Map[String,String] to represent them? If you're approaching a class
>> with >22 String fields anyway, it probably makes more sense. You lose
>> a bit of compile-time checking, but gain flexibility.
>>
>> Also, merging two Maps to make a new one is pretty simple, compared to
>> making many of these value classes.
>>
>> (Although, if you otherwise needed a class that represented "all of
>> the things in class A and class B", this could be done easily with
>> composition, a class with an A and a B inside.)
>>
>> On Thu, Jul 17, 2014 at 9:15 AM, Luis Guerra wrote:
>> > Hi all,
>> >
>> > I am a newbie Spark user with many doubts, so sorry if this is a "silly"
>> > question.
>> >
>> > I am dealing with tabular data formatted as text files, so when I first
>> > load the data, my code is like this:
>> >
>> > case class data_class(
>> >   V1: String,
>> >   V2: String,
>> >   V3: String,
>> >   V4: String,
>> >   V5: String,
>> >   V6: String,
>> >   V7: String)
>> >
>> > val data = sc.textFile(data_path)
>> >   .map(x => {
>> >     val fields = (x + " ").split("\t")
>> >     data_class(fields(0).trim(), fields(1).trim(), fields(2).trim(), fields(3).trim(),
>> >       fields(4).trim(), fields(5).trim(), fields(6).trim())
>> >   })
>> >
>> > I am doing this because I would like to access each position by its
>> > variable name (V1...V7). Is there any other way of doing this?
>> >
>> > Also related to this question: if I have data with more than 22
>> > variables, I am restricted to using a class instead of a case class.
>> > However, this kind of solution has many restrictions, mainly related
>> > to getter methods. Is there any other way of doing this?
>> >
>> > And finally, one of my main problems comes after operations on
>> > different data variables. For instance, if I have two different
>> > variables (data1 and data2), and I want to join them both as:
>> >
>> > val data3 = data1.keyBy(_.V1).leftOuterJoin(data2.keyBy(_.V1))
>> >
>> > Then I have to post-process data3 in order to obtain a new class that
>> > contains the variables from data1 and also those from data2. As data3
>> > is (key, (data1, data2)), do I have to create a new class with all the
>> > attributes from data1 and data2? This is kind of annoying when there
>> > are many attributes.
>> >
>> > Thanks in advance,
>> >
>> > Best
>>
>
>
