Re: Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-05-03 Thread Fuo Bol
Why did you remove the email account zahidr1...@gmail.com following this query?
The query was commercially hostile, but based on sound research.

The two responses were then sent to a removed email account.
The two accounts responded and did not agree with you.


> Note that there are ways to detect when someone is signing up a sock
> puppet account, and mods will ban both if so.

RUBBISH  




Sent: Sunday, May 03, 2020 at 7:39 PM
> From: "Sean Owen" 
> To: "Fuo Bol" 
> Cc: "User" 
> Subject: Re: Watch "Airbus makes more of the sky with Spark - Jesse Anderson 
> & Hassene Ben Salem" on YouTube
>
> It was not removed because of this e-mail, but because of many other spam and
> inappropriate messages from this and sock-puppet accounts. This one is
> IMHO off-topic, however.
> 
> Note that there are ways to detect when someone is signing up a sock
> puppet account, and mods will ban both if so.
> 
> On Sun, May 3, 2020 at 11:45 AM Fuo Bol  wrote:
> >
> > @Sean Owen
> >
> > Why did you remove the email account zahidr1...@gmail.com following this query?
> >
> > The two responses were then sent to a removed email account.
> >
> >
> >
> >
> > > -- Forwarded message -
> > > From: 
> > > Date: Sat, 25 Apr 2020, 19:40
> > > Subject: RE: Watch "Airbus makes more of the sky with Spark - Jesse
> > > Anderson & Hassene Ben Salem" on YouTube
> > > To: Zahid Rahman 
> > > Cc: user , 
> > >
> > >
> > > Zahid,
> > >
> > >
> > >
> > > Starting with Spark 2.3.0, the Spark team introduced an experimental
> > > feature called “Continuous Streaming”[1][2] to enter that space, but in
> > > general, Spark streaming operates using micro-batches while Flink operates
> > > using the Continuous Flow Operator model.
> > >
> > >
> > >
> > > There are many resources online comparing the two but I am leaving you
> > > one[3] (old, but still relevant)  so you can start looking into it.
> > >
> > >
> > >
> > > Note that while I am not a subject expert, that’s the basic explanation.
> > > Until recently we were not competing with Flink in that space, so it
> > > explains why Flink was preferred at the time and why it would still be
> > > preferred today. We will catch up eventually.
> > >
> > >
> > >
> > > [1]
> > > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
> > >
> > > [2]
> > > https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
> > >
> > > [3] https://www.youtube.com/watch?v=Dzx-iE6RN4w&feature=emb_title
> > >
> > >
> > >
> > > *From:* Zahid Rahman 
> > > *Sent:* Saturday, April 25, 2020 7:36 AM
> > > *To:* joerg.stre...@posteo.de
> > > *Cc:* user 
> > > *Subject:* Re: Watch "Airbus makes more of the sky with Spark - Jesse
> > > Anderson & Hassene Ben Salem" on YouTube
> > >
> > >
> > >
> > > My motive is simple. I want you (Spark product experts and users) to
> > > challenge the reason given by Jesse Anderson for choosing Flink over
> > > Spark.
> > >
> > >
> > >
> > > You know the saying keep your friends close, keep your enemies even 
> > > closer.
> > >
> > > The video only has 328 views.
> > >
> > > It is a great educational tool to see a recent use case. It should be
> > > of compelling interest to anyone in this field. Commercial companies do
> > > not often share or discuss their projects openly.
> > >
> > >
> > >
> > > Incidentally Heathrow is the busiest airport in the world.
> > >
> > > 1. Because the emailing facility completed my sentence.
> > >
> > >
> > >
> > > 2. I think at Heathrow the gap is less than two minutes.
> > >
> > >
> > >
> > >
> > >
> > > On Sat, 25 Apr 2020, 09:42 Jörg Strebel,  wrote:
> > >
> > > Hallo!
> > >
> > > Well, the title of the video is actually "Airbus makes more of the sky
> > > with Flink - Jesse Anderson & Hassene Ben Salem" and it talks about Apache
> > > Flink and specifically not about Apache Spark. They excluded Spark
> > > Streaming for high latency reasons.
> > >
> > > Why are you posting this video on a Spark mailing list?
> > >
> > > Regards
> > >
> > > J. Strebel
> > >
> > > Am 25.04.20 um 05:07 schrieb Zahid Rahman:
> > >
> > >
> > >
> > > https://youtu.be/sYlbD_OoHhs
> > >
> > >
> > > Backbutton.co.uk
> > >
> > > ¯\_(ツ)_/¯
> > > ♡۶Java♡۶RMI ♡۶
> > >
> > > Make Use Method {MUM}
> > >
> > > makeuse.org
> > >
> > > --
> > >
> > > Jörg Strebel
> > >
> > > Aachener Straße 2
> > >
> > > 80804 München
> > >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Structured Streaminig] multiple queries in one application

2020-05-03 Thread lec ssmi
For example, put the generated queries into a list and start each one, then
call awaitTermination() on the last one.
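
That pattern can be sketched as follows. This is a minimal sketch only: the
Kafka sources, parquet sinks, topic names, paths, and broker address are
placeholder assumptions, not from this thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

object MultiQueryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("multi-query-example") // hypothetical app name
      .getOrCreate()

    // Start one streaming query per topic and collect the handles in a list.
    val queries: Seq[StreamingQuery] = Seq("topicA", "topicB").map { topic =>
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
        .option("subscribe", topic)
        .load()
        .writeStream
        .format("parquet")
        .option("path", s"/tmp/out/$topic")              // assumed sink path
        .option("checkpointLocation", s"/tmp/ckpt/$topic") // one per query
        .start()                                          // start each query
    }

    // Block on the last query, as suggested above.
    queries.last.awaitTermination()
  }
}
```

One caveat worth noting: awaiting only the last query means a failure in one of
the other queries can go unnoticed; spark.streams.awaitAnyTermination() is an
alternative that returns as soon as any query terminates.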

Abhisheks  于2020年5月1日周五 上午10:32写道:

> I hope you are using the query object that is returned by Structured
> Streaming, right?
> The returned object contains a lot of information about each query, and
> tracking the state of the object should be helpful.
>
> Hope this helps; if not, can you please share more details with examples?
>
> Best,
> A
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-05-03 Thread Sean Owen
It was not removed because of this e-mail, but because of many other spam and
inappropriate messages from this and sock-puppet accounts. This one is
IMHO off-topic, however.

Note that there are ways to detect when someone is signing up a sock
puppet account, and mods will ban both if so.

On Sun, May 3, 2020 at 11:45 AM Fuo Bol  wrote:
>
> @Sean Owen
>
> Why did you remove the email account zahidr1...@gmail.com following this query?
>
> The two responses were then sent to a removed email account.
>
>
>
>
> > -- Forwarded message -
> > From: 
> > Date: Sat, 25 Apr 2020, 19:40
> > Subject: RE: Watch "Airbus makes more of the sky with Spark - Jesse
> > Anderson & Hassene Ben Salem" on YouTube
> > To: Zahid Rahman 
> > Cc: user , 
> >
> >
> > Zahid,
> >
> >
> >
> > Starting with Spark 2.3.0, the Spark team introduced an experimental
> > feature called “Continuous Streaming”[1][2] to enter that space, but in
> > general, Spark streaming operates using micro-batches while Flink operates
> > using the Continuous Flow Operator model.
> >
> >
> >
> > There are many resources online comparing the two but I am leaving you
> > one[3] (old, but still relevant)  so you can start looking into it.
> >
> >
> >
> > Note that while I am not a subject expert, that’s the basic explanation.
> > Until recently we were not competing with Flink in that space, so it
> > explains why Flink was preferred at the time and why it would still be
> > preferred today. We will catch up eventually.
> >
> >
> >
> > [1]
> > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
> >
> > [2]
> > https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
> >
> > [3] https://www.youtube.com/watch?v=Dzx-iE6RN4w&feature=emb_title
> >
> >
> >
> > *From:* Zahid Rahman 
> > *Sent:* Saturday, April 25, 2020 7:36 AM
> > *To:* joerg.stre...@posteo.de
> > *Cc:* user 
> > *Subject:* Re: Watch "Airbus makes more of the sky with Spark - Jesse
> > Anderson & Hassene Ben Salem" on YouTube
> >
> >
> >
> > My motive is simple. I want you (Spark product experts and users) to
> > challenge the reason given by Jesse Anderson for choosing Flink over Spark.
> >
> >
> >
> > You know the saying keep your friends close, keep your enemies even closer.
> >
> > The video only has 328 views.
> >
> > It is a great educational tool to see a recent use case. It should be of
> > compelling interest to anyone in this field. Commercial companies do not
> > often share or discuss their projects openly.
> >
> >
> >
> > Incidentally Heathrow is the busiest airport in the world.
> >
> > 1. Because the emailing facility completed my sentence.
> >
> >
> >
> > 2. I think at Heathrow the gap is less than two minutes.
> >
> >
> >
> >
> >
> > On Sat, 25 Apr 2020, 09:42 Jörg Strebel,  wrote:
> >
> > Hallo!
> >
> > Well, the title of the video is actually "Airbus makes more of the sky with
> > Flink - Jesse Anderson & Hassene Ben Salem" and it talks about Apache Flink
> > and specifically not about Apache Spark. They excluded Spark Streaming for
> > high latency reasons.
> >
> > Why are you posting this video on a Spark mailing list?
> >
> > Regards
> >
> > J. Strebel
> >
> > Am 25.04.20 um 05:07 schrieb Zahid Rahman:
> >
> >
> >
> > https://youtu.be/sYlbD_OoHhs
> >
> >
> > Backbutton.co.uk
> >
> > ¯\_(ツ)_/¯
> > ♡۶Java♡۶RMI ♡۶
> >
> > Make Use Method {MUM}
> >
> > makeuse.org
> >
> > --
> >
> > Jörg Strebel
> >
> > Aachener Straße 2
> >
> > 80804 München
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Good idea to do multi-threading in spark job?

2020-05-03 Thread Sean Owen
Spark will by default assume each task needs 1 CPU. On an executor
with 16 cores and 16 slots, you'd schedule 16 tasks. If each is using
4 cores, then 64 threads are trying to run. If you're CPU-bound, that
could slow things down. But to the extent some tasks spend time
blocking on I/O, it could increase overall utilization. You shouldn't
have to worry about Spark there, but you do have to consider that N
tasks, each with its own concurrency, may be executing your code in one
JVM, and whatever synchronization that implies.
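
If each task deliberately runs a fixed number of threads, one possible knob is
`spark.task.cpus`, which tells the scheduler to reserve that many CPUs per
task so fewer tasks run concurrently per executor. A hedged sketch; the values
are illustrative, not a recommendation from this thread:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: if each task internally runs ~4 threads, reserving 4 CPUs per
// task keeps a 16-core executor at 16 / 4 = 4 concurrent tasks instead of 16,
// so roughly 16 threads compete for 16 cores rather than 64.
val spark = SparkSession.builder
  .appName("cpu-aware-tasks")           // hypothetical app name
  .config("spark.executor.cores", "16") // cores (task slots) per executor
  .config("spark.task.cpus", "4")       // CPUs reserved per task
  .getOrCreate()
```

This trades slot count for per-task parallelism: it helps when tasks are
CPU-bound, but wastes slots when the extra threads mostly block on I/O.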

On Sun, May 3, 2020 at 11:32 AM Ruijing Li  wrote:
>
> Hi all,
>
> We have a spark job (spark 2.4.4, hadoop 2.7, scala 2.11.12) where we use 
> semaphores / parallel collections within our spark job. We definitely notice 
> a huge speedup in our job from doing this, but were wondering if this could 
> cause any unintended side effects? Particularly I’m worried about any 
> deadlocks and if it could mess with the fixes for issues such as this
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961
>
> We do run with multiple cores.
>
> Thanks!
> --
> Cheers,
> Ruijing Li

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-05-03 Thread Fuo Bol
@Sean Owen 

Why did you remove the email account zahidr1...@gmail.com following this query?

The two responses were then sent to a removed email account.




> -- Forwarded message -
> From: 
> Date: Sat, 25 Apr 2020, 19:40
> Subject: RE: Watch "Airbus makes more of the sky with Spark - Jesse
> Anderson & Hassene Ben Salem" on YouTube
> To: Zahid Rahman 
> Cc: user , 
> 
> 
> Zahid,
> 
> 
> 
> Starting with Spark 2.3.0, the Spark team introduced an experimental
> feature called “Continuous Streaming”[1][2] to enter that space, but in
> general, Spark streaming operates using micro-batches while Flink operates
> using the Continuous Flow Operator model.
> 
> 
> 
> There are many resources online comparing the two but I am leaving you
> one[3] (old, but still relevant)  so you can start looking into it.
> 
> 
> 
> Note that while I am not a subject expert, that’s the basic explanation.
> Until recently we were not competing with Flink in that space, so it
> explains why Flink was preferred at the time and why it would still be
> preferred today. We will catch up eventually.
> 
> 
> 
> [1]
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing
> 
> [2]
> https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
> 
> [3] https://www.youtube.com/watch?v=Dzx-iE6RN4w&feature=emb_title
> 
> 
> 
> *From:* Zahid Rahman 
> *Sent:* Saturday, April 25, 2020 7:36 AM
> *To:* joerg.stre...@posteo.de
> *Cc:* user 
> *Subject:* Re: Watch "Airbus makes more of the sky with Spark - Jesse
> Anderson & Hassene Ben Salem" on YouTube
> 
> 
> 
> My motive is simple. I want you (Spark product experts and users) to
> challenge the reason given by Jesse Anderson for choosing Flink over Spark.
> 
> 
> 
> You know the saying keep your friends close, keep your enemies even closer.
> 
> The video only has 328 views.
> 
> It is a great educational tool to see a recent use case. It should be of
> compelling interest to anyone in this field. Commercial companies do not
> often share or discuss their projects openly.
> 
> 
> 
> Incidentally Heathrow is the busiest airport in the world.
> 
> 1. Because the emailing facility completed my sentence.
> 
> 
> 
> 2. I think at Heathrow the gap is less than two minutes.
> 
> 
> 
> 
> 
> On Sat, 25 Apr 2020, 09:42 Jörg Strebel,  wrote:
> 
> Hallo!
> 
> Well, the title of the video is actually "Airbus makes more of the sky with
> Flink - Jesse Anderson & Hassene Ben Salem" and it talks about Apache Flink
> and specifically not about Apache Spark. They excluded Spark Streaming for
> high latency reasons.
> 
> Why are you posting this video on a Spark mailing list?
> 
> Regards
> 
> J. Strebel
> 
> Am 25.04.20 um 05:07 schrieb Zahid Rahman:
> 
> 
> 
> https://youtu.be/sYlbD_OoHhs
> 
> 
> Backbutton.co.uk
> 
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> 
> Make Use Method {MUM}
> 
> makeuse.org
> 
> -- 
> 
> Jörg Strebel
> 
> Aachener Straße 2
> 
> 80804 München
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Good idea to do multi-threading in spark job?

2020-05-03 Thread Ruijing Li
Hi all,

We have a spark job (spark 2.4.4, hadoop 2.7, scala 2.11.12) where we use
semaphores / parallel collections within our spark job. We definitely
notice a huge speedup in our job from doing this, but were wondering if
this could cause any unintended side effects? Particularly I’m worried
about any deadlocks and if it could mess with the fixes for issues such as
this
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-26961

We do run with multiple cores.

Thanks!
-- 
Cheers,
Ruijing Li


Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Jungtaek Lim
Replied inline:

On Sun, May 3, 2020 at 6:25 PM Magnus Nilsson  wrote:

> Thank you, so that would mean Spark gets the current latest offset(s) when
> the trigger fires and then processes all available messages in the topic up to
> and including that offset, as long as maxOffsetsPerTrigger is the default of
> None (or large enough to handle all available messages).
>

Yes, it starts from the offset of the latest batch. `maxOffsetsPerTrigger` will
be ignored starting from Spark 3.0.0, which means that on Spark 2.x it still
applies even when Trigger.Once is used, I guess.


>
> I think the word micro-batch confused me (more like mega-batch in some
> cases). It makes sense though; this makes Trigger.Once a fixed-interval
> trigger that's only fired once, not repeatedly.
>

"micro" is relative - though Spark by default processes all available
inputs per batch, in most cases you'll want to make the batch size
(interval) as small as possible, as it defines the latency of the output.
Trigger.Once is an unusual case in a streaming workload - it's more like
continuous execution of a "batch". I use "continuous" to mean picking up the
latest context, which is the characteristic of a streaming query, hence a
hybrid one.
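
For reference, this hybrid pattern can be sketched as a Trigger.Once query.
The Kafka source, parquet sink, topic, and paths below are placeholder
assumptions, not from this thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder
  .appName("trigger-once-sketch") // hypothetical app name
  .getOrCreate()

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
  .option("subscribe", "events")                       // assumed topic
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/data/events")              // assumed output path
  .option("checkpointLocation", "/data/_ckpt") // offsets survive restarts
  .trigger(Trigger.Once())   // process one micro-batch, then stop
  .start()

query.awaitTermination()     // returns once the single batch completes
```

Each run resumes from the checkpointed offsets, processes one micro-batch of
whatever input is available (subject to maxOffsetsPerTrigger on Spark 2.x, as
noted above), and terminates.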


>
>
> On Sun, May 3, 2020 at 3:20 AM Jungtaek Lim 
> wrote:
>
>> If I understand correctly, Trigger.once executes only one micro-batch and
>> terminates, that's all. Your understanding of structured streaming applies
>> there as well.
>>
>> It's like a hybrid approach: it brings incremental processing from
>> micro-batch but has a processing interval like batch. That said, while it
>> lets you get the benefits of both sides, it's basically structured streaming,
>> inheriting all the limitations of structured streaming compared to a
>> batch query.
>>
>> Spark 3.0.0 will bring some change on Trigger.once (SPARK-30669 [1]) -
>> Trigger.once will "ignore" the read limit per micro-batch on data source
>> (like maxOffsetsPerTrigger) and process all available input as possible.
>> (Data sources should migrate to the new API to take effect, but works for
>> built-in data sources like file and Kafka.)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-30669
>>
>> 2020년 5월 2일 (토) 오후 5:35, Magnus Nilsson 님이 작성:
>>
>>> I've always had a question about Trigger.Once that I never got around to
>>> ask or test for myself. If you have a 24/7 stream to a Kafka topic.
>>>
>>> Will Trigger.Once get the last offset(s) when it starts and then quit
>>> once it hits this offset(s) or will the job run until no new messages is
>>> added to the topic for a particular amount of time?
>>>
>>> br,
>>>
>>> Magnus
>>>
>>> On Sat, May 2, 2020 at 1:22 AM Burak Yavuz  wrote:
>>>
 Hi Rishi,

 That is exactly why Trigger.Once was created for Structured Streaming.
 The way we look at streaming is that it doesn't have to be always real
 time, or 24-7 always on. We see streaming as a workflow that you have to
 repeat indefinitely. See this blog post for more details!

 https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html

 Best,
 Burak

 On Fri, May 1, 2020 at 2:55 PM Rishi Shah 
 wrote:

> Hi All,
>
> I recently started playing with spark streaming, and checkpoint
> location feature looks very promising. I wonder if anyone has an opinion
> about using spark streaming with checkpoint location option as a slow 
> batch
> processing solution. What would be the pros and cons of utilizing 
> streaming
> with checkpoint location feature to achieve fault tolerance in batch
> processing application?
>
> --
> Regards,
>
> Rishi Shah
>



Re: [spark streaming] checkpoint location feature for batch processing

2020-05-03 Thread Magnus Nilsson
Thank you, so that would mean Spark gets the current latest offset(s) when
the trigger fires and then processes all available messages in the topic up to
and including that offset, as long as maxOffsetsPerTrigger is the default of
None (or large enough to handle all available messages).

I think the word micro-batch confused me (more like mega-batch in some
cases). It makes sense though; this makes Trigger.Once a fixed-interval
trigger that's only fired once, not repeatedly.


On Sun, May 3, 2020 at 3:20 AM Jungtaek Lim 
wrote:

> If I understand correctly, Trigger.once executes only one micro-batch and
> terminates, that's all. Your understanding of structured streaming applies
> there as well.
>
> It's like a hybrid approach: it brings incremental processing from
> micro-batch but has a processing interval like batch. That said, while it
> lets you get the benefits of both sides, it's basically structured streaming,
> inheriting all the limitations of structured streaming compared to a
> batch query.
>
> Spark 3.0.0 will bring some change on Trigger.once (SPARK-30669 [1]) -
> Trigger.once will "ignore" the read limit per micro-batch on data source
> (like maxOffsetsPerTrigger) and process all available input as possible.
> (Data sources should migrate to the new API to take effect, but works for
> built-in data sources like file and Kafka.)
>
> 1. https://issues.apache.org/jira/browse/SPARK-30669
>
> 2020년 5월 2일 (토) 오후 5:35, Magnus Nilsson 님이 작성:
>
>> I've always had a question about Trigger.Once that I never got around to
>> ask or test for myself. If you have a 24/7 stream to a Kafka topic.
>>
>> Will Trigger.Once get the last offset(s) when it starts and then quit
>> once it hits this offset(s) or will the job run until no new messages is
>> added to the topic for a particular amount of time?
>>
>> br,
>>
>> Magnus
>>
>> On Sat, May 2, 2020 at 1:22 AM Burak Yavuz  wrote:
>>
>>> Hi Rishi,
>>>
>>> That is exactly why Trigger.Once was created for Structured Streaming.
>>> The way we look at streaming is that it doesn't have to be always real
>>> time, or 24-7 always on. We see streaming as a workflow that you have to
>>> repeat indefinitely. See this blog post for more details!
>>>
>>> https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
>>>
>>> Best,
>>> Burak
>>>
>>> On Fri, May 1, 2020 at 2:55 PM Rishi Shah 
>>> wrote:
>>>
 Hi All,

 I recently started playing with spark streaming, and checkpoint
 location feature looks very promising. I wonder if anyone has an opinion
 about using spark streaming with checkpoint location option as a slow batch
 processing solution. What would be the pros and cons of utilizing streaming
 with checkpoint location feature to achieve fault tolerance in batch
 processing application?

 --
 Regards,

 Rishi Shah

>>>