With the left join, you are joining two tables. In your case, df is the left table, dfAgg is the right table. The second parameter should be the joining condition, right? For instance
dfRes = df.join(dfAgg, $”userName”===$”name", "left_outer”) having a field in df called userName, and another in dfAgg called “name” However, what’s the kind of query you want to make? dfAgg is already the df table that has been grouped by S_ID. I guess that you are looking for something more like the following example dfAgg = df.groupBy("S_ID”) .agg(org.apache.spark.sql.functions.count(“userName").as(“usersCount”), .agg(org.apache.spark.sql.functions.collect_set(“city") .as("ListofCities”)), .agg(org.apache.spark.sql.functions.max(“age").as(“oldest”)) ) > On 3 Jul 2017, at 11:55, Bernard Jesop <bernard.je...@gmail.com> wrote: > > Hello, I don't understand my error message. > > Basically, all I am doing is : > - dfAgg = df.groupBy("S_ID") > - dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer") > > However I get this AnalysisException: " > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) S_ID#1903L missing from > Dummy_ID#740,sex#37L,PERSONAL_STATUS#726L,W_DEP_CODE#736,W_SIZE#739L, > POSTAL_CODE#735,COUNTRY_CODE#730, > ID#724L,Dummy_ID_1#741,DEP_CODE#729,HOUSEHOLD_TYPE#733L, > HOUSEHOLD_SIZE#734L,AGE#727L,W_ID#738L,H_ID#732L,AGE_TYPE#728, > S_ID#57L,NATIONALITY#731 > in operator !Project [ID#724L, sex#37L, PERSON\ > AL_STATUS#726L, AGE#727L, AGE_TYPE#728, DEP_CODE#729, COUNTRY_CODE#730, > NATIONALITY#731 AS Nationality#77, H_ID#732L, HOUSEHOLD_TYPE#733L, > HOUSEHOLD_SIZE#734L, POSTAL_CODE#735, W_DEP_CODE#736, S_ID#1903L, > W_ID#738L, W_SIZE#739L, Dummy_ID#740, Dummy_ID_1#741];; " > > What I don't understand is it says S_ID#1903L is missing > but everything seems fine on the Logical Plan. > +- Join LeftOuter, (S_ID#57L = S_ID#1903L) > > :- Project [W_ID#14L, H_ID#8L, ID#0L, sex#37L, category#97L, AGE#3L, > AGE_TYPE#4, DEP_CODE#5, COUNTRY_CODE#6, Nationality#77, HOUSEHOLD_TYPE#9L, > familySize#117L, POSTAL_CODE#11, W_DEP_CODE#12, S_ID#57\ > L, workplaceSize#137L, Dummy_ID#16, Dummy_ID_1#17, Inc_period#157, > Time_inf#1064, Time_inc#200, Health#1014, Inf_period#1039, > infectedFamily#1355L, infectedWorker#1385L] > > +- Aggregate [S_ID#1903L], [S_ID#1903L, count(1) AS infectedStreet#1415L] > > Does someone have a clue about it? > Thanks, > > > Didac Gil de la Iglesia PhD in Computer Science didacg...@gmail.com Spain: +34 696 285 544 Sweden: +46 (0)730229737 Skype: didac.gil.de.la.iglesia
signature.asc
Description: Message signed with OpenPGP