I am sorry, we cannot divide the data set and process it separately. Does this mean that I am overusing Spark for my data size, since it spends so long shuffling the data?
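(One note for readers of this thread: "partition" here need not mean splitting the data set into separately processed pieces. Below is a minimal sketch of co-partitioning both sides of the join on the RDD API instead. The input paths, the tab-separated key extraction, and the partition count of 200 are placeholder assumptions, not details from this thread.)

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object CoPartitionedJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CoPartitionedJoin"))

        // Placeholder inputs: tab-separated text files keyed on the first field.
        val left  = sc.textFile("hdfs:///data/left").map(line => (line.split("\t")(0), line))
        val right = sc.textFile("hdfs:///data/right").map(line => (line.split("\t")(0), line))

        // Partition both sides with the same partitioner and cache the result.
        // 200 partitions is a placeholder; a common rule of thumb is roughly
        // 2-3x the total executor cores.
        val partitioner = new HashPartitioner(200)
        val leftPart  = left.partitionBy(partitioner).cache()
        val rightPart = right.partitionBy(partitioner).cache()

        // With a shared partitioner, join() co-locates matching keys, so a
        // join repeated in a loop reuses the cached layout instead of
        // reshuffling the full inputs on every iteration.
        val joined = leftPart.join(rightPart)
        println(s"joined rows: ${joined.count()}")

        sc.stop()
      }
    }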
On Sun, May 29, 2016 at 8:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Heri:
> Is it possible to partition your data set so that the number of rows
> involved in the join is under control?
>
> Cheers
>
> On Sat, May 28, 2016 at 5:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> You are welcome.
>>
>> You can also use the OS command /usr/bin/free to see how much free memory
>> you have on each node.
>>
>> You should also check the Spark GUI (first job on master node:4040, the
>> next on 4041, etc.) for the resource and storage (memory) usage of each
>> SparkSubmit job.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> On 29 May 2016 at 01:16, heri wijayanto <heri0...@gmail.com> wrote:
>>
>>> Thank you, Dr Mich Talebzadeh. I will capture the error messages, but
>>> my cluster is currently running another job. After it finishes, I will
>>> try your suggestions.
>>>
>>> On Sun, May 29, 2016 at 7:55 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> You should have errors in the yarn-nodemanager and yarn-resourcemanager
>>>> logs.
>>>>
>>>> Something like the below for a healthy container:
>>>>
>>>> 2016-05-29 00:50:50,496 INFO
>>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>>>> Memory usage of ProcessTree 29769 for container-id
>>>> container_1464210869844_0061_01_000001: 372.6 MB of 4 GB physical memory
>>>> used; 2.7 GB of 8.4 GB virtual memory used
>>>>
>>>> It appears that you are running out of memory. Have you also checked
>>>> with jps and jmonitor for SparkSubmit (the driver process) of the
>>>> failing job? It will show you the resource usage, like memory/heap/CPU,
>>>> etc.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 29 May 2016 at 00:26, heri wijayanto <heri0...@gmail.com> wrote:
>>>>
>>>>> I am using Spark's join function to process around 250 million rows
>>>>> of text.
>>>>>
>>>>> When I used just several hundred rows it ran, but with the large data
>>>>> set it fails.
>>>>>
>>>>> My Spark version is 1.6.1, running in yarn-cluster mode, and we have
>>>>> a 5-node cluster.
>>>>>
>>>>> Thank you very much, Ted Yu
>>>>>
>>>>> On Sun, May 29, 2016 at 6:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> Can you let us know your use case?
>>>>>>
>>>>>> When the join failed, what was the error (consider pastebin)?
>>>>>>
>>>>>> Which release of Spark are you using?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> > On May 28, 2016, at 3:27 PM, heri wijayanto <heri0...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi everyone,
>>>>>> > I perform a join inside a loop, and it fails. I found a tutorial on
>>>>>> > the web that says I should use a broadcast variable, but that is
>>>>>> > not a good choice when the join is done in a loop.
>>>>>> > I need your suggestions for addressing this problem, thank you very
>>>>>> > much. And I am sorry, I am a beginner in Spark programming.
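(The broadcast-variable approach mentioned at the bottom of the thread would look roughly like the sketch below: a map-side join that ships the small side to every executor, assuming that side fits in driver and executor memory. The input paths, the tab-separated key extraction, and all names are illustrative, not from this thread.)

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastJoin {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("BroadcastJoin"))

        // Placeholder inputs: a large keyed RDD and a small lookup side.
        val big = sc.textFile("hdfs:///data/big").map(line => (line.split("\t")(0), line))
        val smallMap = sc.textFile("hdfs:///data/small")
          .map(line => (line.split("\t")(0), line))
          .collectAsMap() // the small side must fit in driver and executor memory

        // Ship the small side to every executor once; the "join" then becomes
        // a local hash lookup and the big side is never shuffled.
        val smallBc = sc.broadcast(smallMap)
        val joined = big.flatMap { case (key, value) =>
          smallBc.value.get(key).map(other => (key, (value, other)))
        }
        println(s"joined rows: ${joined.count()}")

        // When broadcasting inside a loop, release each broadcast after it
        // has been consumed; otherwise old copies accumulate on the executors.
        smallBc.unpersist(blocking = true)

        sc.stop()
      }
    }

Calling unpersist() on each broadcast after the action that consumes it is what makes this pattern workable inside a loop; without it, stale broadcast blocks pile up in executor memory across iterations.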