[jira] [Commented] (SPARK-33359) foreachBatch sink outputs wrong metrics
[ https://issues.apache.org/jira/browse/SPARK-33359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227064#comment-17227064 ]

John Wesley commented on SPARK-33359:
--------------------------------------

Thanks for the clarification. How about numInputRows?

> foreachBatch sink outputs wrong metrics
> ----------------------------------------
>
>                 Key: SPARK-33359
>                 URL: https://issues.apache.org/jira/browse/SPARK-33359
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.0.0
>         Environment: Spark on Kubernetes cluster with the spark-3.0.0 image. The CRD is ScheduledSparkApplication.
>            Reporter: John Wesley
>            Priority: Minor
>
> I created two similar jobs:
> 1) The first job reads from Kafka and writes to the console sink in append mode.
> 2) The second job reads from Kafka and writes to a foreachBatch sink (which then writes in parquet format to S3).
> The metrics logged for the console sink show correct values for numInputRows and numOutputRows, whereas they are wrong for foreachBatch.
> With foreachBatch:
> numInputRows is one more than the number of rows actually present.
> numOutputRows is always -1.
>
> ///Console sink
> //20/11/05 13:36:21 INFO MicroBatchExecution: Streaming query made progress: {
>   "id" : "775aa543-58bf-4cf7-b274-390da640b6ae",
>   "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5",
>   "name" : null,
>   "timestamp" : "2020-11-05T13:36:08.921Z",
>   "batchId" : 0,
>   "numInputRows" : 10,
>   "processedRowsPerSecond" : 0.7759757895553658,
>   "durationMs" : {
>     "addBatch" : 7735,
>     "getBatch" : 152,
>     "latestOffset" : 2037,
>     "queryPlanning" : 1010,
>     "triggerExecution" : 12886,
>     "walCommit" : 938
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
>     "description" : "KafkaV2[Subscribe[testedr7]]",
>     "startOffset" : null,
>     "endOffset" : {
>       "testedr7" : {
>         "0" : 10
>       }
>     },
>     "numInputRows" : 10,
>     "processedRowsPerSecond" : 0.7759757895553658
>   } ],
>   "sink" : {
>     "description" : "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814",
>     "numOutputRows" : 10
>   }
> }
>
> ///ForEachBatch Sink
> //20/11/05 13:43:38 INFO MicroBatchExecution: Streaming query made progress: {
>   "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a",
>   "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61",
>   "name" : null,
>   "timestamp" : "2020-11-05T13:43:15.421Z",
>   "batchId" : 0,
>   "numInputRows" : 11,
>   "processedRowsPerSecond" : 0.4833252779120348,
>   "durationMs" : {
>     "addBatch" : 17689,
>     "getBatch" : 135,
>     "latestOffset" : 2121,
>     "queryPlanning" : 880,
>     "triggerExecution" : 22758,
>     "walCommit" : 876
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
>     "description" : "KafkaV2[Subscribe[testedr7]]",
>     "startOffset" : null,
>     "endOffset" : {
>       "testedr7" : {
>         "0" : 10
>       }
>     },
>     "numInputRows" : 11,
>     "processedRowsPerSecond" : 0.4833252779120348
>   } ],
>   "sink" : {
>     "description" : "ForeachBatchSink",
>     "numOutputRows" : -1
>   }
> }
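To make these numbers easier to compare across runs, here is a minimal sketch, assuming a running SparkSession named spark, of a StreamingQueryListener that logs the query-level numInputRows alongside the sink's numOutputRows from each progress event:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark: SparkSession = SparkSession.builder().getOrCreate()

// Logs the same fields that appear in the progress JSON above.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batchId=${p.batchId} numInputRows=${p.numInputRows} " +
      s"sink.numOutputRows=${p.sink.numOutputRows}")
  }
})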
[jira] [Created] (SPARK-33361) Dataset.observe() functionality does not work with structured streaming
John Wesley created SPARK-33361:
-----------------------------------

             Summary: Dataset.observe() functionality does not work with structured streaming
                 Key: SPARK-33361
                 URL: https://issues.apache.org/jira/browse/SPARK-33361
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.0.0
         Environment: Spark on k8s, version 3.0.0
            Reporter: John Wesley


The Dataset.observe() functionality does not work as expected with Spark in cluster mode. Related discussion here:
[http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-emit-custom-metrics-to-Prometheus-in-spark-structured-streaming-td38826.html]

Using lit() as the aggregation column works as expected; however, aggregates such as sum and count always return 0.
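A minimal sketch of the usage pattern being reported, using the rate source as a stand-in for the real input and a made-up observation name "my_metrics"; per the report, the sum and count values surfaced in observedMetrics come back as 0 in cluster mode:

import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, sum}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Attach named observed metrics to the streaming Dataset.
val observed = spark.readStream.format("rate").load()
  .observe("my_metrics", count(lit(1)).as("cnt"), sum($"value").as("total"))

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    // observedMetrics maps the observation name to a Row of aggregates.
    e.progress.observedMetrics.asScala.get("my_metrics").foreach { row =>
      println(s"cnt=${row.getAs[Long]("cnt")} total=${row.getAs[Long]("total")}")
    }
  }
})

observed.writeStream.format("console").start().awaitTermination()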
[jira] [Created] (SPARK-33359) foreachBatch sink outputs wrong metrics
John Wesley created SPARK-33359:
-----------------------------------

             Summary: foreachBatch sink outputs wrong metrics
                 Key: SPARK-33359
                 URL: https://issues.apache.org/jira/browse/SPARK-33359
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.0.0
         Environment: Spark on Kubernetes cluster with the spark-3.0.0 image. The CRD is ScheduledSparkApplication.
            Reporter: John Wesley


I created two similar jobs:
1) The first job reads from Kafka and writes to the console sink in append mode.
2) The second job reads from Kafka and writes to a foreachBatch sink (which then writes in parquet format to S3).

The metrics logged for the console sink show correct values for numInputRows and numOutputRows, whereas they are wrong for foreachBatch.
With foreachBatch:
numInputRows is one more than the number of rows actually present.
numOutputRows is always -1.

///Console sink
//20/11/05 13:36:21 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "775aa543-58bf-4cf7-b274-390da640b6ae",
  "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5",
  "name" : null,
  "timestamp" : "2020-11-05T13:36:08.921Z",
  "batchId" : 0,
  "numInputRows" : 10,
  "processedRowsPerSecond" : 0.7759757895553658,
  "durationMs" : {
    "addBatch" : 7735,
    "getBatch" : 152,
    "latestOffset" : 2037,
    "queryPlanning" : 1010,
    "triggerExecution" : 12886,
    "walCommit" : 938
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[testedr7]]",
    "startOffset" : null,
    "endOffset" : {
      "testedr7" : {
        "0" : 10
      }
    },
    "numInputRows" : 10,
    "processedRowsPerSecond" : 0.7759757895553658
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814",
    "numOutputRows" : 10
  }
}

///ForEachBatch Sink
//20/11/05 13:43:38 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a",
  "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61",
  "name" : null,
  "timestamp" : "2020-11-05T13:43:15.421Z",
  "batchId" : 0,
  "numInputRows" : 11,
  "processedRowsPerSecond" : 0.4833252779120348,
  "durationMs" : {
    "addBatch" : 17689,
    "getBatch" : 135,
    "latestOffset" : 2121,
    "queryPlanning" : 880,
    "triggerExecution" : 22758,
    "walCommit" : 876
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[testedr7]]",
    "startOffset" : null,
    "endOffset" : {
      "testedr7" : {
        "0" : 10
      }
    },
    "numInputRows" : 11,
    "processedRowsPerSecond" : 0.4833252779120348
  } ],
  "sink" : {
    "description" : "ForeachBatchSink",
    "numOutputRows" : -1
  }
}
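For reproduction purposes, the second job looks roughly like the following sketch; the broker address, S3 output path, and checkpoint location are placeholders, while the topic name testedr7 comes from the logs above:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("foreachbatch-metrics").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder
  .option("subscribe", "testedr7")
  .load()

val query = kafkaDf.writeStream
  .option("checkpointLocation", "s3a://bucket/checkpoints") // placeholder
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Each micro-batch is written out as parquet; the ForeachBatchSink
    // itself then reports numOutputRows = -1, as shown in the JSON above.
    batch.write.mode("append").parquet("s3a://bucket/output") // placeholder
  }
  .start()

query.awaitTermination()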
[jira] [Created] (SPARK-16909) Streaming for postgreSQL JDBC driver
prince john wesley created SPARK-16909:
------------------------------------------

             Summary: Streaming for postgreSQL JDBC driver
                 Key: SPARK-16909
                 URL: https://issues.apache.org/jira/browse/SPARK-16909
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0
            Reporter: prince john wesley


The PostgreSQL JDBC driver sets 0 as the default record fetch size, which means it caches all rows on the client irrespective of the row count.
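A common workaround on the Spark side is to pass a non-zero fetchsize through the JDBC reader options, which makes the PostgreSQL driver fetch rows in batches instead of buffering the whole result set. A minimal sketch, with placeholder connection details:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// With fetchsize > 0, Spark's Postgres dialect also disables autocommit
// on the connection so that cursor-based (streamed) fetching takes effect.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db") // placeholder
  .option("dbtable", "public.events")              // placeholder
  .option("user", "user")                          // placeholder
  .option("password", "pass")                      // placeholder
  .option("fetchsize", "1000")
  .load()

jdbcDf.show(5)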