[jira] [Commented] (SPARK-33359) foreachBatch sink outputs wrong metrics

2020-11-05 Thread John Wesley (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedId=17227064#comment-17227064
 ] 

John Wesley commented on SPARK-33359:
-

Thanks for the clarification. How about numInputRows?

> foreachBatch sink outputs wrong metrics
> ---
>
> Key: SPARK-33359
> URL: https://issues.apache.org/jira/browse/SPARK-33359
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
> Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The 
> CRD is ScheduledSparkApplication
>Reporter: John Wesley
>Priority: Minor
>
> I created two similar jobs:
> 1) The first reads from Kafka and writes to the console sink in append mode.
> 2) The second reads from Kafka and writes to a foreachBatch sink (which in 
> turn writes in Parquet format to S3).
> The metrics logged for the console sink show correct values for numInputRows 
> and numOutputRows, whereas they are wrong for foreachBatch.
> With foreachBatch:
> numInputRows is one more than the number of rows actually present.
> numOutputRows is always -1.
> ///Console sink
> //20/11/05 13:36:21 INFO 
> MicroBatchExecution: Streaming query made progress: {
>   "id" : "775aa543-58bf-4cf7-b274-390da640b6ae",
>   "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5",
>   "name" : null,
>   "timestamp" : "2020-11-05T13:36:08.921Z",
>   "batchId" : 0,
>   "numInputRows" : 10,
>   "processedRowsPerSecond" : 0.7759757895553658,
>   "durationMs" : {
> "addBatch" : 7735,
> "getBatch" : 152,
> "latestOffset" : 2037,
> "queryPlanning" : 1010,
> "triggerExecution" : 12886,
> "walCommit" : 938
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "KafkaV2[Subscribe[testedr7]]",
> "startOffset" : null,
> "endOffset" : {
>   "testedr7" : {
> "0" : 10
>   }
> },
> "numInputRows" : 10,
> "processedRowsPerSecond" : 0.7759757895553658
>   } ],
>   "sink" : {
> "description" : 
> "org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814",
> "numOutputRows" : 10
>   }
> }
> ///ForEachBatch Sink
> //20/11/05 13:43:38 INFO 
> MicroBatchExecution: Streaming query made progress: {
>   "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a",
>   "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61",
>   "name" : null,
>   "timestamp" : "2020-11-05T13:43:15.421Z",
>   "batchId" : 0,
>   "numInputRows" : 11,
>   "processedRowsPerSecond" : 0.4833252779120348,
>   "durationMs" : {
> "addBatch" : 17689,
> "getBatch" : 135,
> "latestOffset" : 2121,
> "queryPlanning" : 880,
> "triggerExecution" : 22758,
> "walCommit" : 876
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "KafkaV2[Subscribe[testedr7]]",
> "startOffset" : null,
> "endOffset" : {
>   "testedr7" : {
> "0" : 10
>   }
> },
> "numInputRows" : 11,
> "processedRowsPerSecond" : 0.4833252779120348
>   } ],
>   "sink" : {
> "description" : "ForeachBatchSink",
> "numOutputRows" : -1
>   }
> }






[jira] [Created] (SPARK-33361) Dataset.observe() functionality does not work with structured streaming

2020-11-05 Thread John Wesley (Jira)
John Wesley created SPARK-33361:
---

 Summary: Dataset.observe() functionality does not work with 
structured streaming
 Key: SPARK-33361
 URL: https://issues.apache.org/jira/browse/SPARK-33361
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
 Environment: Spark on k8s, version 3.0.0
Reporter: John Wesley


The Dataset.observe() functionality does not work as expected with Spark in 
cluster mode.
Related discussion here:
[http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-emit-custom-metrics-to-Prometheus-in-spark-structured-streaming-td38826.html]

Using lit() as the aggregation column works as expected; however, sum, count, 
etc. always return 0.
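
For context, a minimal spark-shell style sketch of the pattern being described, 
assuming Spark 3.0 in Scala; the rate source, metric names, and console sink 
are illustrative placeholders, not the original job:

import org.apache.spark.sql.functions.{count, lit, sum}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

import spark.implicits._

// Hypothetical streaming source; the rate source emits (timestamp, value) rows.
val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// Attach named observed metrics: a lit()-based aggregate alongside a sum.
val observed = stream.observe(
  "my_metrics",
  count(lit(1)).as("row_count"), // lit()-based, reported correctly per the report
  sum($"value").as("value_sum")) // reported as 0 per the report

// Observed metrics arrive through the listener on each progress event.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val m = event.progress.observedMetrics.get("my_metrics")
    if (m != null) {
      println(s"row_count=${m.getAs[Long]("row_count")}, " +
        s"value_sum=${m.getAs[Long]("value_sum")}")
    }
  }
})

observed.writeStream.format("console").start().awaitTermination()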

 






[jira] [Created] (SPARK-33359) foreachBatch sink outputs wrong metrics

2020-11-05 Thread John Wesley (Jira)
John Wesley created SPARK-33359:
---

 Summary: foreachBatch sink outputs wrong metrics
 Key: SPARK-33359
 URL: https://issues.apache.org/jira/browse/SPARK-33359
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
 Environment: Spark on Kubernetes cluster with spark-3.0.0 image. The 
CRD is ScheduledSparkApplication
Reporter: John Wesley


I created two similar jobs:
1) The first reads from Kafka and writes to the console sink in append mode.
2) The second reads from Kafka and writes to a foreachBatch sink (which in 
turn writes in Parquet format to S3).

The metrics logged for the console sink show correct values for numInputRows 
and numOutputRows, whereas they are wrong for foreachBatch.

With foreachBatch:
numInputRows is one more than the number of rows actually present.
numOutputRows is always -1.
A minimal sketch of the foreachBatch job appears after the logs below.
///Console sink
//20/11/05 13:36:21 INFO 
MicroBatchExecution: Streaming query made progress: {
  "id" : "775aa543-58bf-4cf7-b274-390da640b6ae",
  "runId" : "e5eac4ca-0b29-4ed4-be35-b70bd20906d5",
  "name" : null,
  "timestamp" : "2020-11-05T13:36:08.921Z",
  "batchId" : 0,
  "numInputRows" : 10,
  "processedRowsPerSecond" : 0.7759757895553658,
  "durationMs" : {
"addBatch" : 7735,
"getBatch" : 152,
"latestOffset" : 2037,
"queryPlanning" : 1010,
"triggerExecution" : 12886,
"walCommit" : 938
  },
  "stateOperators" : [ ],
  "sources" : [ {
"description" : "KafkaV2[Subscribe[testedr7]]",
"startOffset" : null,
"endOffset" : {
  "testedr7" : {
"0" : 10
  }
},
"numInputRows" : 10,
"processedRowsPerSecond" : 0.7759757895553658
  } ],
  "sink" : {
"description" : 
"org.apache.spark.sql.execution.streaming.ConsoleTable$@38c3a814",
"numOutputRows" : 10
  }
}


///ForEachBatch Sink
//20/11/05 13:43:38 INFO 
MicroBatchExecution: Streaming query made progress: {
  "id" : "789f9a00-2f2a-4f75-b643-fea201088b4a",
  "runId" : "b5e695c5-3a2e-4ad2-9dbf-11b69f368f61",
  "name" : null,
  "timestamp" : "2020-11-05T13:43:15.421Z",
  "batchId" : 0,
  "numInputRows" : 11,
  "processedRowsPerSecond" : 0.4833252779120348,
  "durationMs" : {
"addBatch" : 17689,
"getBatch" : 135,
"latestOffset" : 2121,
"queryPlanning" : 880,
"triggerExecution" : 22758,
"walCommit" : 876
  },
  "stateOperators" : [ ],
  "sources" : [ {
"description" : "KafkaV2[Subscribe[testedr7]]",
"startOffset" : null,
"endOffset" : {
  "testedr7" : {
"0" : 10
  }
},
"numInputRows" : 11,
"processedRowsPerSecond" : 0.4833252779120348
  } ],
  "sink" : {
"description" : "ForeachBatchSink",
"numOutputRows" : -1
  }
}
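
For reference, a minimal spark-shell style sketch of the foreachBatch job 
described above, in Scala; the broker address, topic, and S3 path are 
hypothetical placeholders:

import org.apache.spark.sql.DataFrame

// spark is an existing SparkSession (e.g. in spark-shell).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical broker
  .option("subscribe", "testedr7")
  .load()

val query = df.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Each micro-batch is written out as Parquet. The progress report still
    // shows the sink as ForeachBatchSink with numOutputRows = -1, since the
    // sink itself does not report what the user function wrote.
    batch.write.mode("append").parquet("s3a://my-bucket/edr7/") // hypothetical path
  }
  .start()

query.awaitTermination()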






[jira] [Created] (SPARK-16909) Streaming for postgreSQL JDBC driver

2016-08-04 Thread prince john wesley (JIRA)
prince john wesley created SPARK-16909:
--

 Summary: Streaming for postgreSQL JDBC driver
 Key: SPARK-16909
 URL: https://issues.apache.org/jira/browse/SPARK-16909
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: prince john wesley


The PostgreSQL JDBC driver sets 0 as the default record fetch size, which 
means it caches the entire result set in memory irrespective of the row count.
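
For illustration, a minimal sketch of working around this from Spark's JDBC 
data source via its documented fetchsize option, in Scala; the connection URL, 
table, and credentials are hypothetical placeholders:

// spark is an existing SparkSession (e.g. in spark-shell).
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.big_table")
  .option("user", "spark")
  .option("password", "secret")
  // With the driver default of 0, the whole result set is cached in
  // memory; a positive fetchsize streams rows cursor-style instead.
  .option("fetchsize", "1000")
  .load()

df.show(5)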


