With the left join, you are joining two tables.

In your case, df is the left table, dfAgg is the right table.
The second parameter should be the joining condition, right?
For instance

 dfRes = df.join(dfAgg, $”userName”===$”name", "left_outer”)

having a field in df called userName, and another in dfAgg called “name”

However, what’s the kind of query you want to make? dfAgg is already the df 
table that has been grouped by S_ID.

I guess that you are looking for something more like the following example
dfAgg = df.groupBy("S_ID”)
   .agg(org.apache.spark.sql.functions.count(“userName").as(“usersCount”),
 .agg(org.apache.spark.sql.functions.collect_set(“city") .as("ListofCities”)),
 .agg(org.apache.spark.sql.functions.max(“age").as(“oldest”))
)

> On 3 Jul 2017, at 11:55, Bernard Jesop <bernard.je...@gmail.com> wrote:
> 
> Hello, I don't understand my error message.
> 
> Basically, all I am doing is :
> - dfAgg = df.groupBy("S_ID")
> - dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")
> 
> However I get this AnalysisException: "
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) S_ID#1903L missing from 
> Dummy_ID#740,sex#37L,PERSONAL_STATUS#726L,W_DEP_CODE#736,W_SIZE#739L,
> POSTAL_CODE#735,COUNTRY_CODE#730,
> ID#724L,Dummy_ID_1#741,DEP_CODE#729,HOUSEHOLD_TYPE#733L,
> HOUSEHOLD_SIZE#734L,AGE#727L,W_ID#738L,H_ID#732L,AGE_TYPE#728,
> S_ID#57L,NATIONALITY#731
> in operator !Project [ID#724L, sex#37L, PERSON\
>  AL_STATUS#726L, AGE#727L, AGE_TYPE#728, DEP_CODE#729, COUNTRY_CODE#730, 
> NATIONALITY#731 AS Nationality#77, H_ID#732L, HOUSEHOLD_TYPE#733L, 
> HOUSEHOLD_SIZE#734L, POSTAL_CODE#735, W_DEP_CODE#736, S_ID#1903L,
> W_ID#738L, W_SIZE#739L, Dummy_ID#740, Dummy_ID_1#741];; "
> 
> What I don't understand is it says S_ID#1903L is missing
> but everything seems fine on the Logical Plan.
> +- Join LeftOuter, (S_ID#57L = S_ID#1903L)
> 
>    :- Project [W_ID#14L, H_ID#8L, ID#0L, sex#37L, category#97L, AGE#3L, 
> AGE_TYPE#4, DEP_CODE#5, COUNTRY_CODE#6, Nationality#77, HOUSEHOLD_TYPE#9L, 
> familySize#117L, POSTAL_CODE#11, W_DEP_CODE#12, S_ID#57\
>  L, workplaceSize#137L, Dummy_ID#16, Dummy_ID_1#17, Inc_period#157, 
> Time_inf#1064, Time_inc#200, Health#1014, Inf_period#1039, 
> infectedFamily#1355L, infectedWorker#1385L]
> 
>     +- Aggregate [S_ID#1903L], [S_ID#1903L, count(1) AS infectedStreet#1415L]
> 
> Does someone have a clue about it?
> Thanks,
> 
> 
> 

Didac Gil de la Iglesia
PhD in Computer Science
didacg...@gmail.com
Spain:     +34 696 285 544
Sweden: +46 (0)730229737
Skype: didac.gil.de.la.iglesia

Attachment: signature.asc
Description: Message signed with OpenPGP

Reply via email to