Hi all,

I am new to Spark, so apologies in advance if this is a silly
question.

I am dealing with tabular data stored as tab-separated text files, so
when I first load the data, my code looks like this:

case class DataClass(
    V1: String,
    V2: String,
    V3: String,
    V4: String,
    V5: String,
    V6: String,
    V7: String)

val data = sc.textFile(data_path)
  .map { x =>
    // split with limit -1 so trailing empty fields are not dropped
    val fields = x.split("\t", -1)
    DataClass(fields(0).trim, fields(1).trim, fields(2).trim,
      fields(3).trim, fields(4).trim, fields(5).trim, fields(6).trim)
  }

I am doing this because I would like to access each field by its name
(V1...V7). Is there a better way of doing this?
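
One alternative I have been looking at (assuming Spark 1.3+ with a
SQLContext available; this is just my own sketch, so corrections are
welcome) is converting the RDD of case classes into a DataFrame, so
that each field becomes a named column:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// The field names of DataClass become the column names
val df = data.toDF()
df.select("V1", "V3").show()

Is this the recommended approach, or is there something better when
staying on plain RDDs?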

Also related to this question: if my data has more than 22 variables,
I am forced to use a regular class instead of a case class (Scala 2.10
limits case classes to 22 fields). That kind of solution has many
drawbacks, mainly around having to write getter methods by hand. Is
there another way of doing this?
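
The only workaround I have found so far (again just a sketch; the 30
columns named C1..C30 below are placeholders for my real variables) is
building the schema programmatically with StructType and Row, which
does not have the 22-field limit:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical wide schema: 30 string columns named C1..C30
val schema = StructType((1 to 30).map(i => StructField(s"C$i", StringType)))

val rows = sc.textFile(data_path)
  .map(x => Row.fromSeq(x.split("\t", -1).map(_.trim)))

val wideDf = sqlContext.createDataFrame(rows, schema)
wideDf.select("C1", "C25").show()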

And finally, one of my main problems comes up after combining
different datasets. For instance, if I have two RDDs (data1 and
data2) and I want to join them as:

val data3 = data1.keyBy(_.V1).leftOuterJoin(data2.keyBy(_.V1))

Then I have to post-process data3 to obtain a new class containing the
variables from data1 as well as those from data2. Since data3 is
(key, (data1, Option[data2])) after the left outer join, do I have to
define a new class with all the attributes of both? This is kind of
annoying when there are many attributes.
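
The best I have come up with (assuming data1 and data2 are both
RDD[DataClass] as above) is nesting the two records in a small wrapper
instead of flattening every attribute into one big class:

// The wrapper keeps both sides of the join; the right side is an
// Option because leftOuterJoin may find no match for a key
case class Joined(key: String, left: DataClass, right: Option[DataClass])

val data3 = data1.keyBy(_.V1)
  .leftOuterJoin(data2.keyBy(_.V1))
  .map { case (k, (l, r)) => Joined(k, l, r) }

With this I can still reach individual fields as j.left.V2 or
j.right.map(_.V3) for a record j, but I do not know whether this is
the idiomatic pattern.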

Thanks in advance,

Best
