Re: groupby question

2022-05-05 Thread wilson

I don't know what you are trying to express.
It would be better if you could share a sample dataset and the result 
you want to achieve; then we can suggest the right solution.


Thanks

Irene Markelic wrote:
I have an RDD that I want to group by some key, but it just 
doesn't work. I am a Scala beginner. So I have the following:





groupby question

2022-05-05 Thread Irene Markelic

Hi everybody,

I have an RDD that I want to group by some key, but it just 
doesn't work. I am a Scala beginner. So I have the following:



langs: List[String]

rdd: RDD[WikipediaArticle]

val meinVal = rdd.flatMap(article => langs.map(lang => {
  if (article.mentionsLanguage(lang)) { Tuple2(lang, article) }
  else { None }
})).filter(_ != None)


meinVal.collect.foreach(println) gives:

(Scala,WikipediaArticle(2,Scala and Java run on the JVM))
(Java,WikipediaArticle(2,Scala and Java run on the JVM))
(Scala,WikipediaArticle(3,Scala is not purely functional))


I have two questions:

1) Why can I not apply the groupByKey function? It is an RDD that 
contains tuples, and the first tuple entry is the key.


2) I don't see how to apply groupBy either. I thought I could do 
meinVal.groupBy(x => x._1), but that throws an error. (What I expected 
is sketched just below.)
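
If meinVal really had the type RDD[(String,WikipediaArticle)], I would 
expect both of these to compile; this is only a sketch of what I was 
trying to write, not something I have verified:

// Sketch of what I expected to work on an RDD of pairs (unverified):
val byKey  = meinVal.groupByKey()   // groupByKey is available on RDD[(K, V)]
val byLang = meinVal.groupBy(_._1)  // groupBy works on any RDD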


I notice that when I use an IDE and hover over "meinVal", it shows 
RDD[Object], whereas it should be RDD[(String,WikipediaArticle)]. I 
do not know how to get this information without the IDE. So it seems 
that the RDD contains just one big object type; I don't fully see why 
that is.
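
My best guess so far: the if/else above returns Tuple2(lang, article) 
in one branch and None in the other, so the compiler can only infer 
their common supertype, and the element type widens to Object. Wrapping 
both branches in Option and flattening should keep the pair type; a 
sketch using the same rdd, langs, and mentionsLanguage as above 
(unverified):

// Both branches are Option[(String, WikipediaArticle)], so nothing
// widens to Any; flatMap drops the Nones, making the trailing filter
// unnecessary.
val meinVal2: RDD[(String, WikipediaArticle)] =
  rdd.flatMap(article =>
    langs.flatMap(lang =>
      if (article.mentionsLanguage(lang)) Some((lang, article)) else None
    )
  )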


Anyone? Please?

Irene






groupBy question

2014-06-10 Thread SK
After doing a groupBy operation, I have the following result:

val res =
(ID1,ArrayBuffer((145804601,ID1,japan)))
(ID3,ArrayBuffer((145865080,ID3,canada), (145899640,ID3,china)))
(ID2,ArrayBuffer((145752760,ID2,usa), (145934200,ID2,usa)))

Now I need to output, for each group, the size of the group and the max of
the first field, which is a timestamp.
So I tried the following:

1) res.map(group => (group._2.size, group._2._1.max))
But I got an error: value _1 is not a member of Iterable[(Long, String,
String)]

2) I also tried: res.map(group => (group._2.size, group._2[1].max)), but got
an error for that as well.

What is the right way to get the max of the timestamp field (the first field
in the ArrayBuffer) for each group?
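
My own guess, which I have not managed to verify, is that _1 belongs to
the tuples inside the Iterable, not to the Iterable itself, so the inner
field has to be selected with map before calling max. Assuming res is an
RDD[(String, Iterable[(Long, String, String)])], something like:

// Sketch (unverified): per group, emit (group size, max timestamp).
res.map { case (id, group) => (group.size, group.map(_._1).max) }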


thanks.


