Hi folks, I've got a stream coming from Kafka. It has the following schema:
userdata : { id: INT, acctid: INT, uid: STRING, logintm: datetime } I'm trying to count the number of logins by acctid. I can do the count fine, but the table only has the acctid and the count. Now I wish to get all the other columns. I have tried to do this: origdata = uDF2.select("userdetail.*") # This gives me all the columns. # +---+-------+---------+--------------------+ # | id| acctid| uid| logintm| # +---+-------+---------+--------------------+ # |014|1075627|curtis.ga|2020-08-24 13:58:...| # +---+-------+---------+--------------------+ logins = origdata\ .withWatermark('logintm', '10 minutes')\ .groupBy("acctid", "logintm")\ .count()\ .select('acctid', 'count') # This gives me the count of the accountId. Now I thought I could do this alldata = logins.join(origdata, "acctid") But I get no rows returned. Am I just completely missing the mark here? The original dataframe has all the columns as expected (including the join column) and the count has the join column and the count. If I can't do this, then how otherwise does one get all the other data WITH the count? Thanks for your help in advance. Hamish