I am sorry, we cannot divide the data set and process it separately. Does that
mean I am overusing Spark for my data size, since shuffling the data takes such
a long time?

On Sun, May 29, 2016 at 8:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Heri:
> Is it possible to partition your data set so that the number of rows
> involved in the join is under control?
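>
> For example, something along these lines might work, an untested sketch in
> the DataFrame API; the column name "key", the ranges and the output path are
> all made up and would need to match your data:
>
>   import org.apache.spark.sql.DataFrame
>
>   // Untested sketch: join one slice of the key space at a time so that each
>   // shuffle only touches a bounded number of rows. "key", the ranges and the
>   // output path are placeholders.
>   def joinInSlices(big: DataFrame, small: DataFrame, out: String): Unit = {
>     val keyRanges = Seq(("a", "h"), ("h", "p"), ("p", "~"))
>     keyRanges.foreach { case (lo, hi) =>
>       val bigSlice   = big.filter(big("key") >= lo && big("key") < hi)
>       val smallSlice = small.filter(small("key") >= lo && small("key") < hi)
>       bigSlice.join(smallSlice, "key").write.mode("append").parquet(out)
>     }
>   }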
>
> Cheers
>
> On Sat, May 28, 2016 at 5:25 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> You are welcome
>>
>> Also, you can use the OS command /usr/bin/free to see how much free memory
>> you have on each node.
>>
>> You should also check the Spark GUI (the first job on the master node is on
>> port 4040, the next on 4041, etc.) for the resources and storage (memory
>> usage) of each SparkSubmit job.
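>>
>> As a quick programmatic check you can also print the per-executor memory
>> from inside the driver. A minimal sketch, assuming an existing SparkContext
>> named sc (e.g. in spark-shell):
>>
>>   // Prints max and remaining storage memory for each executor.
>>   sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
>>     println(s"$executor: max=${maxMem / 1024 / 1024} MB, free=${remaining / 1024 / 1024} MB")
>>   }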
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 29 May 2016 at 01:16, heri wijayanto <heri0...@gmail.com> wrote:
>>
>>> Thank you, Dr Mich Talebzadeh. I will capture the error messages, but
>>> currently my cluster is running another job. After it finishes, I will try
>>> your suggestions.
>>>
>>> On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> You should see errors in the yarn-nodemanager and yarn-resourcemanager
>>>> logs.
>>>>
>>>> Something like the below for a healthy container:
>>>>
>>>> 2016-05-29 00:50:50,496 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>>>> Memory usage of ProcessTree 29769 for container-id
>>>> container_1464210869844_0061_01_000001: 372.6 MB of 4 GB physical memory
>>>> used; 2.7 GB of 8.4 GB virtual memory used
>>>>
>>>> It appears that you are running out of memory. Have you also checked with
>>>> jps and jmonitor for SparkSubmit (the driver process) of the failing job?
>>>> That will show you the resource usage, e.g. memory/heap/CPU.
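>>>>
>>>> If jps/jmonitor is not convenient, a rough alternative (only a sketch) is
>>>> to log the driver JVM's own heap from inside the job, e.g. once per loop
>>>> iteration:
>>>>
>>>>   import java.lang.management.ManagementFactory
>>>>
>>>>   // Logs how much heap the driver is using versus its configured maximum.
>>>>   val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
>>>>   println(s"driver heap used=${heap.getUsed / 1024 / 1024} MB, max=${heap.getMax / 1024 / 1024} MB")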
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn:
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 29 May 2016 at 00:26, heri wijayanto <heri0...@gmail.com> wrote:
>>>>
>>>>> I use Spark's join function to process around 250 million rows of text.
>>>>>
>>>>> When I used just several hundred rows it ran, but when I use the large
>>>>> data set it fails.
>>>>>
>>>>> My Spark version is 1.6.1, running in yarn-cluster mode, and we have 5
>>>>> node computers.
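>>>>>
>>>>> Roughly, the shape of the job is like this (a simplified sketch; the real
>>>>> column names and tables differ):
>>>>>
>>>>>   import org.apache.spark.sql.DataFrame
>>>>>
>>>>>   // Simplified sketch only: bigDF stands for the ~250 million rows of
>>>>>   // text, keyed by a placeholder column "id"; others are the smaller
>>>>>   // tables joined in the loop.
>>>>>   def runJoins(bigDF: DataFrame, others: Seq[DataFrame]): DataFrame = {
>>>>>     var result = bigDF
>>>>>     for (other <- others) {
>>>>>       result = result.join(other, "id")  // this step fails on the large data
>>>>>     }
>>>>>     result
>>>>>   }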
>>>>>
>>>>> Thank you very much, Ted Yu
>>>>>
>>>>> On Sun, May 29, 2016 at 6:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> Can you let us know your use case?
>>>>>>
>>>>>> When the join failed, what was the error (consider pastebin)?
>>>>>>
>>>>>> Which release of Spark are you using?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> > On May 28, 2016, at 3:27 PM, heri wijayanto <heri0...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi everyone,
>>>>>> > I perform a join in a loop, and it fails. I found a tutorial on the
>>>>>> web which says that I should use a broadcast variable, but that does not
>>>>>> seem like a good choice when it is done inside a loop.
>>>>>> > I need your suggestions to address this problem. Thank you very much.
>>>>>> > And I am sorry, I am a beginner in Spark programming.
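>>>>>> > If I understood the tutorial correctly, it means something like the
>>>>>> > sketch below (simplified; the types and names are placeholders, and the
>>>>>> > broadcast is created again on every iteration, which is the part I am
>>>>>> > not sure about):
>>>>>> >
>>>>>> >   import org.apache.spark.SparkContext
>>>>>> >   import org.apache.spark.rdd.RDD
>>>>>> >
>>>>>> >   // Simplified sketch of a broadcast-variable (map-side) join inside a loop.
>>>>>> >   def joinWithBroadcasts(sc: SparkContext,
>>>>>> >                          big: RDD[(String, String)],
>>>>>> >                          smallTables: Seq[Map[String, String]]): RDD[(String, String)] = {
>>>>>> >     var result = big
>>>>>> >     for (small <- smallTables) {
>>>>>> >       val bc = sc.broadcast(small)            // a new broadcast every iteration
>>>>>> >       result = result.flatMap { case (k, v) =>
>>>>>> >         bc.value.get(k).map(w => (k, v + "\t" + w))   // keep only matching keys
>>>>>> >       }
>>>>>> >     }
>>>>>> >     result
>>>>>> >   }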
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
