I see in the documentation that the distinct operation is not supported in
Structured Streaming:
<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations>

That said, I have noticed that you can successfully call distinct() on a
streaming DataFrame, and it appears to perform the desired operation rather
than failing with the expected AnalysisException. If I call it with a
column name specified, it does fail with an AnalysisException.
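
Roughly, this is what I am running (the broker address and topic name
below are placeholders, not my real config):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("distinct-test").getOrCreate()

    // Streaming read from Kafka, keeping the source-provided timestamp column.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // This does not throw the AnalysisException I expected:
    val deduped = stream.distinct()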

I am using Structured Streaming to read from a Kafka stream, and my
question (and concern) is that:

   - The distinct operation is applied only within the *current* batch as
   read from Kafka and would not apply across batches.

I have tried the following:

   - Started the streaming job to establish my baseline data and left the
   job running
   - Created events in Kafka that would increment my counts if distinct
   were not performing as expected
   - Results:
      - Distinct still appears to be working over the entire data set even
      as I add new data.
      - As I add new data, I see Spark process it (I'm running with output
      mode = update; see the sketch after this list), but there are no new
      results, indicating that distinct is in fact still working across
      batches as Spark pulls the new data from Kafka.
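
The query itself looks roughly like this (console sink and checkpoint
path are placeholders for my actual sink):

    // Console sink in update mode so I can watch each micro-batch.
    val query = deduped.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "/tmp/distinct-test-checkpoint")
      .start()

    query.awaitTermination()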

Does anyone know more about the intended behavior of distinct in Structured
Streaming?

If this is working as intended, does this mean I could have state that
grows without bound, held in memory or on disk or something to that effect
(since Spark must retain something about previously seen rows to apply the
distinct operation against them)?
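
For what it's worth, the deduplication the guide does document for streams
is dropDuplicates() with a watermark, which bounds the state by letting
Spark drop entries older than the watermark (the column names and the
10-minute threshold below are just illustrative):

    // Watermarked dedup: state older than the watermark can be dropped,
    // so it does not grow without bound.
    val boundedDedup = stream
      .withWatermark("timestamp", "10 minutes")
      .dropDuplicates("value", "timestamp")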
