[ https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169786#comment-17169786 ]
Abhishek Dixit commented on SPARK-32500:
----------------------------------------

[~JinxinTang] [~rohitmishr1484] _epoch_id_ may be available in Python, but I don't think that information is available back in Scala's API and StreamingQueryListener. I'm attaching a Spark UI screenshot that compares two streaming queries run in PySpark: the first uses PySpark _foreachBatch_ and does not show Batch Id and Query Id information, while the second uses a simple FileStreamSink and shows Batch Id and Query Id correctly.

!Screen Shot 2020-07-26 at 6.50.39 PM.png!

I've given code for the _foreachBatch_ example in the above comment. Code for the FileStreamSink example is below:

{code:java}
from pyspark.sql.functions import *

checkpoint_location = "/tmp/testdata/checkpoint"

query = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 100)
    .option("numPartitions", 4)
    .load()
    .writeStream
    .format("json")
    .option("path", "/tmp/testdata/output")
    .option("checkpointLocation", checkpoint_location)
    .outputMode("Append")
    .start()
)
{code}

> Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32500
>                 URL: https://issues.apache.org/jira/browse/SPARK-32500
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Structured Streaming
>    Affects Versions: 2.4.6
>            Reporter: Abhishek Dixit
>            Priority: Major
>         Attachments: Screen Shot 2020-07-26 at 6.50.39 PM.png, Screen Shot 2020-07-30 at 9.04.21 PM.png, image-2020-08-01-10-21-51-246.png
>
>
> Query Id and Batch Id information is not available for jobs started by a structured streaming query when the _foreachBatch_ API is used in PySpark.
>
> This happens only with foreachBatch in PySpark. ForeachBatch in Scala works fine, and other structured streaming sinks in PySpark also work fine. I am attaching a screenshot of the Jobs pages.
>
> I think the job group is not set properly when _foreachBatch_ is used via PySpark. I have a framework that depends on the _queryId_ and _batchId_ information available in the job properties, so my framework doesn't work for the PySpark foreachBatch use case.
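
For reference, a minimal PySpark _foreachBatch_ counterpart to the FileStreamSink example in the comment above might look like the sketch below. This is not the code from the earlier comment: the _process_batch_ function, the paths, and the property keys _sql.streaming.queryId_ and _streaming.sql.batchId_ are illustrative assumptions (the keys being the ones the stream execution thread sets on the Scala side, which may differ between versions). Given the behavior described in this issue, the getLocalProperty calls inside the Python callback would be expected to return None, and the jobs launched from the callback would not carry Query Id and Batch Id in the UI. The batch_id callback argument corresponds to the epoch_id mentioned above, which is why it is available in Python even though the job properties are not set.

{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachBatchRepro").getOrCreate()
sc = spark.sparkContext

# Illustrative paths, mirroring the FileStreamSink example above.
checkpoint_location = "/tmp/testdata/checkpoint_feb"
output_path = "/tmp/testdata/output_feb"

def process_batch(batch_df, batch_id):
    # Runs on the driver once per micro-batch. The batch_id argument is
    # always passed to the callback; the question raised in this issue is
    # whether the query/batch ids also reach the job properties of the
    # jobs that the actions below launch.
    # NOTE: the property keys below are assumptions based on what the
    # stream execution thread sets in Scala.
    print("batch_id argument:", batch_id)
    print("sql.streaming.queryId:", sc.getLocalProperty("sql.streaming.queryId"))
    print("streaming.sql.batchId:", sc.getLocalProperty("streaming.sql.batchId"))
    batch_df.write.mode("append").json(output_path)

query = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 100)
    .option("numPartitions", 4)
    .load()
    .writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", checkpoint_location)
    .outputMode("Append")
    .start()
)
{code}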