Stream to Stream joins

Hamish Whittal Mon, 24 Aug 2020 07:07:44 -0700

Hi folks,

I've got a stream coming from Kafka. It has the following schema:


 userdata : { id: INT, acctid: INT, uid: STRING, logintm: datetime }

I'm trying to count the number of logins by acctid.

I can do the count fine, but the table only has the acctid and the count.
Now I wish to get all the other columns.

I have tried to do this:
origdata = uDF2.select("userdetail.*")
# This gives me all the columns.
# +---+-------+---------+--------------------+
#  | id| acctid|      uid|             logintm|
# +---+-------+---------+--------------------+
#  |014|1075627|curtis.ga|2020-08-24 13:58:...|
#  +---+-------+---------+--------------------+

logins = origdata\
               .withWatermark('logintm', '10 minutes')\
               .groupBy("acctid", "logintm")\
               .count()\
               .select('acctid', 'count')
# This gives me the count of the accountId.

Now I thought I could do this
   alldata = logins.join(origdata, "acctid")

But I get no rows returned. Am I just completely missing the mark here? The
original dataframe has all the columns as expected (including the join
column) and the count has the join column and the count.

If I can't do this, then how otherwise does one get all the other data WITH
the count?

Thanks for your help in advance.

Hamish

Stream to Stream joins

Reply via email to