Heri:
Is it possible to partition your data set so that the number of rows
involved in the join is under control?
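
For instance, something along these lines (a minimal sketch in Scala against
the RDD API; leftRdd/rightRdd and the partition count are illustrative
assumptions):

import org.apache.spark.HashPartitioner

// Co-partition both pair RDDs with the same partitioner so the join runs
// partition-by-partition, bounding the rows each task has to handle.
val numPartitions = 200                        // assumption: tune to data size
val partitioner   = new HashPartitioner(numPartitions)

val left   = leftRdd.partitionBy(partitioner)  // RDD[(K, V)]
val right  = rightRdd.partitionBy(partitioner) // RDD[(K, W)]
val joined = left.join(right)                  // same partitioner on both sides, no extra shuffle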

Cheers

On Sat, May 28, 2016 at 5:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> You are welcome
>
> Also, you can use the OS command /usr/bin/free to see how much free memory
> you have on each node.
>
> You should also check the Spark GUI (first job on the master node at port
> 4040, the next on 4041, etc.) for the resource and storage (memory usage) of
> each SparkSubmit job.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 May 2016 at 01:16, heri wijayanto <heri0...@gmail.com> wrote:
>
>> Thank you, Dr Mich Talebzadeh. I will capture the error messages, but
>> currently my cluster is running another job. After it finishes, I will try
>> your suggestions.
>>
>> On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> You should see errors in the yarn-nodemanager and yarn-resourcemanager
>>> logs.
>>>
>>> Something like the below for a healthy container:
>>>
>>> 2016-05-29 00:50:50,496 INFO
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>>> Memory usage of ProcessTree 29769 for container-id
>>> container_1464210869844_0061_01_000001: 372.6 MB of 4 GB physical memory
>>> used; 2.7 GB of 8.4 GB virtual memory used
>>>
>>> It appears that you are running out of memory. Have you also checked the
>>> SparkSubmit (driver) process for the failing job with jps and jmonitor?
>>> That will show you the resource usage: memory, heap, CPU, etc.
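>>>
>>> If it is memory, raising the executor/driver memory is the usual first
>>> knob (a minimal sketch, Scala; the 4g/2g values are illustrative
>>> assumptions, and the same settings can be passed to spark-submit via
>>> --executor-memory/--driver-memory):
>>>
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> val conf = new SparkConf()
>>>   .setAppName("JoinJob")               // assumption: illustrative app name
>>>   .set("spark.executor.memory", "4g")  // heap per executor
>>>   .set("spark.driver.memory", "2g")    // only effective before the driver JVM starts
>>> val sc = new SparkContext(conf)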
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn:
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 29 May 2016 at 00:26, heri wijayanto <heri0...@gmail.com> wrote:
>>>
>>>> I am using Spark's join function to process around 250 million rows of
>>>> text.
>>>>
>>>> When I used just several hundred rows it could run, but when I use the
>>>> large data set it fails.
>>>>
>>>> My Spark version is 1.6.1, running in yarn-cluster mode, and we have 5
>>>> node computers.
>>>>
>>>> Thank you very much, Ted Yu
>>>>
>>>> On Sun, May 29, 2016 at 6:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Can you let us know your use case?
>>>>>
>>>>> When the join failed, what was the error (consider pastebin)?
>>>>>
>>>>> Which release of Spark are you using?
>>>>>
>>>>> Thanks
>>>>>
>>>>> > On May 28, 2016, at 3:27 PM, heri wijayanto <heri0...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi everyone,
>>>>> > I perform a join function in a loop, and it fails. I found a
>>>>> tutorial on the web that says I should use a broadcast variable, but
>>>>> that is not a good choice inside a loop.
>>>>> > I need your suggestions to address this problem. Thank you very much,
>>>>> > and I am sorry, I am a beginner in Spark programming.
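>>>>> >
>>>>> > For context, the broadcast approach such tutorials describe is a
>>>>> map-side join (a minimal sketch, Scala, assuming the small side fits in
>>>>> driver memory; smallRdd/bigRdd are illustrative names):
>>>>> >
>>>>> > // Collect the small side once and broadcast it to all executors.
>>>>> > val smallMap = sc.broadcast(smallRdd.collectAsMap())
>>>>> > // Join map-side: look each key of the big side up in the broadcast map.
>>>>> > val joined = bigRdd.flatMap { case (k, v) =>
>>>>> >   smallMap.value.get(k).map(w => (k, (v, w)))
>>>>> > }
>>>>> > // In a loop, release each broadcast when done, e.g. with
>>>>> > // smallMap.unpersist(), to avoid piling up executor memory.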
>>>>>
>>>>
>>>>
>>>
>>
>
