[Structured Streaming] OOM on ConsoleSink with large inputs

2017-08-11 Thread Gerard Maas
Devs,

While investigating another issue, I came across this OOM error when using
the Console Sink with any source that can be larger than the available
driver memory. In my case, I was using the File source and I had a 14G file
in the monitored dir.

I traced the issue back to a `df.collect` in the Console Sink code.
I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-21710
and a PR is available: https://github.com/apache/spark/pull/18923
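The failure mode generalizes beyond Spark: materializing an unbounded input on one node is the bug, and bounding what gets materialized is the fix. A minimal sketch in plain Python (not the actual ConsoleSink code; the generator and names are made up for illustration):

```python
# Sketch (not Spark's ConsoleSink): why collect() OOMs and why bounding fixes it.
# A "source" yields rows lazily; collecting materializes all of them at once,
# while taking only the first N keeps driver memory proportional to N.
from itertools import islice

def rows(n):
    """Lazily generate n rows, standing in for a large monitored file."""
    for i in range(n):
        yield ("key-%d" % i, i)

def collect_all(source):
    # The problematic pattern: pulls every row into driver memory.
    return list(source)

def show(source, num_rows=20):
    # The console-sink-style fix: bound what is materialized for display.
    return list(islice(source, num_rows))

# Only 20 rows are ever held in memory, no matter how large the input is.
preview = show(rows(10_000_000), num_rows=20)
print(len(preview))  # 20
```

The same idea applies regardless of the source: a console sink only needs the rows it will print, so there is no reason to pull the full dataset onto the driver.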

I hope a committer can check it out.

-kr, Gerard.


Re: [SS] watermark, eventTime and "StreamExecution: Streaming query made progress"

2017-08-11 Thread Michael Armbrust
The point here is to tell you what watermark value was used when executing
this batch.  You don't know the new watermark until the batch is over, and
we don't want to do two passes over the data.  In general, the semantics of
the watermark are designed to be conservative (i.e., data older than the
watermark is not guaranteed to be dropped, but data will never be dropped
until it falls below the watermark).
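The single-pass argument can be illustrated with a toy simulation (plain Python, not Spark's implementation; the 10-second delay and the event-time values are made up): the watermark applied while processing batch N is the one derived from batch N-1, because the new watermark is only known once a batch has been fully seen.

```python
# Toy single-pass watermark simulation (not Spark's implementation).
DELAY = 10  # assumed watermark delay, in seconds

def run_batches(batches, delay=DELAY):
    watermark = 0  # epoch start, like the 1970-01-01 value in the first batch's log
    applied = []
    for batch in batches:
        # Rows older than the *current* watermark are dropped; rows newer than
        # it are kept, even if the watermark computed after this batch would
        # have dropped them. That is the "conservative" semantics.
        kept = [t for t in batch if t >= watermark]
        applied.append((watermark, kept))
        # The new watermark is only known once the batch is fully processed,
        # so it takes effect in the next batch. It never moves backwards.
        watermark = max(max(batch) - delay, watermark)
    return applied

# Batch 1 runs with watermark 0 (nothing dropped); batch 2 runs with
# watermark 105 - 10 = 95, so the late row at t=95 is still kept.
result = run_batches([[100, 105], [95, 112]])
print(result)  # [(0, [100, 105]), (95, [95, 112])]
```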

On Fri, Aug 11, 2017 at 12:23 AM, Jacek Laskowski  wrote:

> Hi,
>
> I'm curious why the watermark is updated in the streaming batch after
> the one in which it was observed [1]. The report (from
> ProgressReporter/StreamExecution) does not look right to me, as
> avg/max/min are already calculated according to the watermark [2].
>
> My recommendation would be to do the update [2] in the same streaming
> batch in which it was observed. Why not? Please enlighten me.
>
> 17/08/11 09:04:20 INFO StreamExecution: Streaming query made progress: {
>   "id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
>   "runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
>   "name" : "rates-to-console",
>   "timestamp" : "2017-08-11T07:04:20.004Z",
>   "batchId" : 1,
>   "numInputRows" : 2,
>   "inputRowsPerSecond" : 0.7601672367920943,
>   "processedRowsPerSecond" : 25.31645569620253,
>   "durationMs" : {
> "addBatch" : 48,
> "getBatch" : 6,
> "getOffset" : 0,
> "queryPlanning" : 1,
> "triggerExecution" : 79,
> "walCommit" : 23
>   },
>   "eventTime" : {
> "avg" : "2017-08-11T07:04:17.782Z",
> "max" : "2017-08-11T07:04:18.282Z",
> "min" : "2017-08-11T07:04:17.282Z",
> "watermark" : "1970-01-01T00:00:00.000Z"
>   },
>
> ...
>
> 17/08/11 09:04:30 INFO StreamExecution: Streaming query made progress: {
>   "id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
>   "runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
>   "name" : "rates-to-console",
>   "timestamp" : "2017-08-11T07:04:30.003Z",
>   "batchId" : 2,
>   "numInputRows" : 10,
>   "inputRowsPerSecond" : 1.000100010001,
>   "processedRowsPerSecond" : 56.17977528089888,
>   "durationMs" : {
> "addBatch" : 147,
> "getBatch" : 6,
> "getOffset" : 0,
> "queryPlanning" : 1,
> "triggerExecution" : 178,
> "walCommit" : 22
>   },
>   "eventTime" : {
> "avg" : "2017-08-11T07:04:23.782Z",
> "max" : "2017-08-11T07:04:28.282Z",
> "min" : "2017-08-11T07:04:19.282Z",
> "watermark" : "2017-08-11T07:04:08.282Z"
>   },
>
> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala?utf8=%E2%9C%93#L538
> [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L257
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
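The logs above are consistent with Michael's explanation: batch 1 reports the initial epoch watermark, and batch 2's watermark equals batch 1's max event time minus a 10-second delay (the delay itself is not printed in the progress report, so the 10-second value is inferred from the timestamps). A quick check:

```python
# Verify that batch 2's watermark equals batch 1's max event time minus a
# 10-second delay (inferred; the delay is not shown in the progress report).
from datetime import datetime, timedelta

fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
batch1_max = datetime.strptime("2017-08-11T07:04:18.282Z", fmt)   # batch 1 "max"
batch2_watermark = datetime.strptime("2017-08-11T07:04:08.282Z", fmt)  # batch 2 "watermark"

print(batch1_max - batch2_watermark)  # 0:00:10
```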


[build system] jenkins back up and building

2017-08-11 Thread shane knapp
there was some network work being done last night (~9:45pm PDT) at our
colo, and it had the unintended consequence of kicking a lot of
services off the network.

jenkins was affected, and the connection to github was lost.  i just
kicked the jenkins master and things are happily building again.

sorry for the downtime...

shane




Any committer interested in speaking at Solutions.Hamburg

2017-08-11 Thread Christofer Dutz
Hi all,

I am looking for someone to speak as part of a 3-day Apache track at
Solutions.Hamburg (https://solutions.hamburg/) next month (06.-08.09.2017). We
were planning on having a dev-oriented "Spark Structured Streaming" talk.
Unfortunately, the original volunteer seems unable to provide us with a
concrete title and abstract (max 300 chars), and the deadline for finishing the
schedule is approaching fast. One of the goodies we are providing is that our
speakers not only know the software they are talking about, but are committers
who help build it, so being a committer is the one restriction we have.

Ideally, we could also use a second Spark (or generally Big Data-related) talk
that is more operations-focused.

If the chance to speak is not enough, maybe mentioning that you will be
invited to two really cool parties will help motivate you :-)

It would be great if you could help me finish our schedule :-)

Chris




Re: Use Apache ORC in Apache Spark 2.3

2017-08-11 Thread Sean Owen
-private@ list for future replies. This is not a PMC conversation.

On Fri, Aug 11, 2017 at 3:17 AM Andrew Ash  wrote:

> @Reynold no I don't use the HiveCatalog -- I'm using a custom
> implementation of ExternalCatalog instead.
>
> On Thu, Aug 10, 2017 at 3:34 PM, Dong Joon Hyun 
> wrote:
>
>> Thank you, Andrew and Reynold.
>>
>>
>>
>> Yes, it will eventually reduce the old Hive dependency, at least for the
>> ORC code.
>>
>>
>>
>> And Spark without `-Phive` will be able to use ORC just like Parquet.
>>
>>
>>
>> This is one milestone for `Feature parity for ORC with Parquet
>> (SPARK-20901)`.
>>
>>
>>
>> Bests,
>>
>> Dongjoon
>>
>>
>>
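For context on what "using ORC like Parquet" looks like in practice, a hedged sketch of the configuration this work introduced (the `spark.sql.orc.impl` setting is part of the SPARK-20901 effort; treat the exact values as subject to the final implementation):

```
# spark-defaults.conf sketch (assumes the spark.sql.orc.impl setting
# introduced by the SPARK-20901 work): "native" selects the new
# Hive-independent ORC reader, "hive" keeps the old Hive code path.
spark.sql.orc.impl  native
```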


[SS] watermark, eventTime and "StreamExecution: Streaming query made progress"

2017-08-11 Thread Jacek Laskowski
Hi,

I'm curious why the watermark is updated in the streaming batch after
the one in which it was observed [1]. The report (from
ProgressReporter/StreamExecution) does not look right to me, as
avg/max/min are already calculated according to the watermark [2].

My recommendation would be to do the update [2] in the same streaming
batch in which it was observed. Why not? Please enlighten me.

17/08/11 09:04:20 INFO StreamExecution: Streaming query made progress: {
  "id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
  "runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
  "name" : "rates-to-console",
  "timestamp" : "2017-08-11T07:04:20.004Z",
  "batchId" : 1,
  "numInputRows" : 2,
  "inputRowsPerSecond" : 0.7601672367920943,
  "processedRowsPerSecond" : 25.31645569620253,
  "durationMs" : {
"addBatch" : 48,
"getBatch" : 6,
"getOffset" : 0,
"queryPlanning" : 1,
"triggerExecution" : 79,
"walCommit" : 23
  },
  "eventTime" : {
"avg" : "2017-08-11T07:04:17.782Z",
"max" : "2017-08-11T07:04:18.282Z",
"min" : "2017-08-11T07:04:17.282Z",
"watermark" : "1970-01-01T00:00:00.000Z"
  },

...

17/08/11 09:04:30 INFO StreamExecution: Streaming query made progress: {
  "id" : "ec8f8228-90f6-4e1f-8ad2-80222affed63",
  "runId" : "f605c134-cfb0-4378-88c1-159d8a7c232e",
  "name" : "rates-to-console",
  "timestamp" : "2017-08-11T07:04:30.003Z",
  "batchId" : 2,
  "numInputRows" : 10,
  "inputRowsPerSecond" : 1.000100010001,
  "processedRowsPerSecond" : 56.17977528089888,
  "durationMs" : {
"addBatch" : 147,
"getBatch" : 6,
"getOffset" : 0,
"queryPlanning" : 1,
"triggerExecution" : 178,
"walCommit" : 22
  },
  "eventTime" : {
"avg" : "2017-08-11T07:04:23.782Z",
"max" : "2017-08-11T07:04:28.282Z",
"min" : "2017-08-11T07:04:19.282Z",
"watermark" : "2017-08-11T07:04:08.282Z"
  },

[1] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala?utf8=%E2%9C%93#L538
[2] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L257

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
