Re: Spark execution on Hadoop cluster (many nodes)

2022-01-24 Thread Mich Talebzadeh
Hadoop core comprises HDFS (the storage layer), MapReduce (the parallel
execution framework) and YARN (the resource manager).

Spark can run on YARN in either cluster or client mode and can use HDFS for
temporary or permanent storage. As HDFS is available and accessible on
all nodes, Spark can take advantage of that. Spark does MapReduce-style
processing in memory rather than on disk, which speeds up queries by an
order of magnitude. In that sense Spark is just an application on Hadoop
and not much more.
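For reference, a minimal spark-submit invocation on YARN might look like the sketch below; the jar path, main class and resource settings are placeholder assumptions, not taken from this thread:

```shell
# Submit a Spark application to YARN in cluster mode (the driver runs
# inside the cluster). Switch --deploy-mode to "client" to keep the
# driver on the submitting machine. Class and jar path are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --num-executors 4 \
  --executor-memory 2g \
  /path/to/my-app.jar
```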

HTH







On Mon, 24 Jan 2022 at 17:22, sam smith  wrote:

> spark-submit a Spark application on Hadoop (cluster mode); that's what I
> mean by executing on Hadoop
>
> On Mon, Jan 24, 2022 at 18:00, Sean Owen  wrote:
>
>> I am still not understanding what you mean by "executing on Hadoop".
>> Spark does not use Hadoop for execution. Probably can't answer until this
>> is cleared up.
>>
>> On Mon, Jan 24, 2022 at 10:57 AM sam smith 
>> wrote:
>>
>>> I mean the DAG order is somehow altered when executing on Hadoop
>>>
>>> On Mon, Jan 24, 2022 at 17:17, Sean Owen  wrote:
>>>
Code is not executed by Hadoop, nor passed through Hadoop somehow. Do
you mean data? Data is read as-is. There is typically no guarantee about
the ordering of data in files, but you can order data. Still not sure what
specifically you are worried about here, but I don't think the kind of
thing you're contemplating can happen, no.

 On Mon, Jan 24, 2022 at 9:28 AM sam smith 
 wrote:

> I am aware of that, but whenever the chunks of code are returned to
> Spark from Hadoop (after processing), could they be returned out of
> order? Could this ever happen?
>
> On Mon, Jan 24, 2022 at 16:14, Sean Owen  wrote:
>
>> Hadoop does not run Spark programs, Spark does. How or why would
>> something, what, modify the byte code? No
>>
>> On Mon, Jan 24, 2022, 9:07 AM sam smith 
>> wrote:
>>
>>> My point is: could Hadoop go wrong during one Spark execution? Meaning
>>> that it gets confused (given the concurrent distributed tasks) and then
>>> adds a wrong instruction to the program, or executes an instruction out
>>> of order (shuffling the order of execution by running earlier ones when
>>> it shouldn't)? For example: before finishing and returning the results
>>> from one node, it returns the results of the other in the wrong order.
>>>
>>> On Mon, Jan 24, 2022 at 15:31, Sean Owen  wrote:
>>>
Not clear what you mean here. A Spark program is a program, so what
are the alternatives here? Program execution order is still program
execution order. You are not guaranteed anything about the order of
concurrent tasks. Failed tasks can be re-executed, so tasks should be
idempotent. I think the answer is 'no', but I'm not sure what you are
thinking of here.

 On Mon, Jan 24, 2022 at 7:10 AM sam smith <
 qustacksm2123...@gmail.com> wrote:

> Hello guys,
>
> I hope my question does not sound weird, but could a Spark execution
> on a Hadoop cluster give a different output than the program actually
> specifies? I mean by that: the execution order is messed up by Hadoop,
> or an instruction is executed twice?
>
> Thanks for your enlightenment
>
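
A side note on the ordering question discussed above: in any concurrent
system, task completion order need not match launch order, which is why
Spark only guarantees ordering where you explicitly sort. The toy shell
experiment below (plain background jobs, nothing Spark-specific)
illustrates the effect:

```shell
# Start three background "tasks" in the order 3, 1, 2; each one sleeps
# proportionally to its number, so completion order is driven by how
# long each task runs, not by the order in which it was launched.
for i in 3 1 2; do
  ( sleep "0.$((i * 2))"; echo "task $i done" ) &
done
wait    # block until every background task has finished
```

With these sleep times the tasks typically report back as 1, 2, 3 even
though they were launched as 3, 1, 2.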


