If you intern the field-name strings, the Map approach will be more efficient, but it will still be significantly more expensive than the class-based approach.
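To make the interning point concrete, here is a minimal sketch, not code from this thread. The names `InternedKeysSketch`, `Row` and `parseToMap` are hypothetical, and the two-column layout is only an assumption for illustration. It shows that interned keys are shared between rows, although every Map still holds a reference per key plus the map's own structure, which a case class avoids.

```scala
// Hypothetical sketch: all names here are illustrative, not from the thread.
object InternedKeysSketch {
  // Class-based row: the field names exist once, in the class definition.
  final case class Row(V1: String, V2: String)

  // Map-based row: every row carries its own key -> value entries.
  def parseToMap(line: String): Map[String, String] = {
    val fields = line.split("\t")
    fields.zipWithIndex.map { case (value, i) =>
      // "V" + (i + 1) builds a fresh String at runtime for each row;
      // intern() swaps it for the single canonical instance in the JVM pool.
      ("V" + (i + 1)).intern() -> value.trim
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val a = parseToMap("x\ty")
    val b = parseToMap("p\tq")
    val keyA = a.keys.find(_ == "V1").get
    val keyB = b.keys.find(_ == "V1").get
    // Reference equality: both rows point at the same key object.
    println(keyA eq keyB)
    // The class-based row gives compile-time checked access by name.
    println(Row("x", "y").V1)
  }
}
```

Without the `intern()` call, each row would hold its own copies of the key strings, so the saving is in key storage only, not in the per-row map overhead.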
** VERY EXPERIMENTAL **
We are working with EPFL on a lightweight syntax for naming the results of
Spark transformations in Scala (and are going to make it interoperate with
SQL). Sparse details here: https://github.com/scala-records/scala-records

Stay tuned for more...

Michael


On Thu, Jul 17, 2014 at 4:47 AM, Luis Guerra wrote:

> Thank you for your fast reply.
>
> We are considering this Map[String, String] solution, but there are some
> details that we do not control yet. What would happen if we have different
> data types for different fields? Also, with this solution, we have to
> repeat the field names for every "row" that we have, is this efficient?
>
> Regarding the solution with composition, the key would be repeated in the
> new class, whereas it is only necessary once after the join, isn't it?
>
>
> On Thu, Jul 17, 2014 at 10:25 AM, Sean Owen wrote:
>
>> If what you have is a large number of named strings, why not use a
>> Map[String,String] to represent them? If you're approaching a class
>> with >22 String fields anyway, it probably makes more sense. You lose
>> a bit of compile-time checking, but gain flexibility.
>>
>> Also, merging two Maps to make a new one is pretty simple, compared to
>> making many of these values classes.
>>
>> (Although, if you otherwise needed a class that represented "all of
>> the things in class A and class B", this could be done easily with
>> composition, a class with an A and a B inside.)
>>
>> On Thu, Jul 17, 2014 at 9:15 AM, Luis Guerra wrote:
>> > Hi all,
>> >
>> > I am a newbie Spark user with many doubts, so sorry if this is a "silly"
>> > question.
>> >
>> > I am dealing with tabular data formatted as text files, so when I first
>> > load the data, my code is like this:
>> >
>> > case class data_class(
>> >   V1: String,
>> >   V2: String,
>> >   V3: String,
>> >   V4: String,
>> >   V5: String,
>> >   V6: String,
>> >   V7: String)
>> >
>> > val data = sc.textFile(data_path)
>> >   .map(x => {
>> >     val fields = (x + " ").split("\t")
>> >     data_class(fields(0).trim(), fields(1).trim(), fields(2).trim(),
>> >       fields(3).trim(), fields(4).trim(), fields(5).trim(),
>> >       fields(6).trim())
>> >   })
>> >
>> > I am doing this because I would like to access each position using the
>> > variable name (V1...V7). Is there any other way of doing this?
>> >
>> > Also related to this question, if I have data with more than 22
>> > variables, I am restricted to using a class instead of a case class.
>> > However, this kind of solution has many restrictions, mainly related to
>> > getter methods. Is there any other way of doing this?
>> >
>> > And finally, one of my main problems comes after operations on different
>> > data variables. For instance, if I have two different variables (data1
>> > and data2), and I want to join them both as:
>> >
>> > val data3 = data1.keyBy(_.V1).leftOuterJoin(data2.keyBy(_.V1))
>> >
>> > Then I have to post-process data3 in order to obtain a new class that
>> > contains those variables from data1 and also those variables from data2.
>> > As data3 is (key, (data1, data2)), do I have to create a new, different
>> > class with all these attributes from data1 and data2? This is kind of
>> > annoying when there are many attributes.
>> >
>> > Thanks in advance,
>> >
>> > Best
>> >
>
