Re: Streaming Hadoop using C
Starfish worked great for wordcount, but I didn't run it on my application because I have only map tasks.

Mark

On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl wrote:
> How was your experience of Starfish?
> C
Re: Streaming Hadoop using C
How was your experience of Starfish?
C

On Mar 1, 2012, at 12:35 AM, Mark question wrote:
> Thank you for your time and suggestions. I've already tried Starfish, but
> not jmap; I'll check it out.
> Thanks again,
> Mark
Re: Streaming Hadoop using C
Thank you for your time and suggestions. I've already tried Starfish, but not jmap; I'll check it out.

Thanks again,
Mark

On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl wrote:
> I assume you have also tried running locally and using the JDK
> performance tools (e.g. jmap) to gain insight, configuring Hadoop to run
> the absolute minimum number of tasks?
> Perhaps the discussion at
> http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
> might be relevant.
Re: Streaming Hadoop using C
I assume you have also tried running locally and using the JDK performance tools (e.g. jmap) to gain insight, configuring Hadoop to run the absolute minimum number of tasks? Perhaps the discussion at
http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
might be relevant.

On Feb 29, 2012, at 3:53 PM, Mark question wrote:
> I've used Hadoop profiling (.prof) to show the stack trace, but it was
> hard to follow. I used jConsole locally, since I couldn't find a way to
> set a port number for the child processes when running them remotely.
> Linux commands (top, /proc) showed me that virtual memory is almost twice
> my physical memory, which means swapping is happening, which is what I'm
> trying to avoid.
>
> So basically, is there a way to assign a port to child processes to
> monitor them remotely (asked before by Xun), or would you recommend
> another monitoring tool?
Re: Streaming Hadoop using C
The documentation on Starfish (http://www.cs.duke.edu/starfish/index.html) looks promising, though I have not used it. I wonder if others on the list have found it more useful than setting mapred.task.profile.
C

On Feb 29, 2012, at 3:53 PM, Mark question wrote:
> I've used Hadoop profiling (.prof) to show the stack trace, but it was
> hard to follow.
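For reference, the built-in task profiling that mapred.task.profile controls is enabled per job through a handful of properties. A minimal sketch, with illustrative values (the hprof parameter string shown is the stock default in Hadoop 1.x):

```xml
<!-- Profile a few map attempts with the JVM's built-in hprof agent. -->
<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <!-- Which task attempts to profile, by attempt index range. -->
  <name>mapred.task.profile.maps</name>
  <value>0-2</value>
</property>
<property>
  <name>mapred.task.profile.params</name>
  <value>-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s</value>
</property>
```

The resulting .prof files are copied back to the submitting client, which is where the hard-to-follow stack traces Mark mentions come from.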
Re: Streaming Hadoop using C
I've used Hadoop profiling (.prof) to show the stack trace, but it was hard to follow. I used jConsole locally, since I couldn't find a way to set a port number for the child processes when running them remotely. Linux commands (top, /proc) showed me that virtual memory is almost twice my physical memory, which means swapping is happening, which is what I'm trying to avoid.

So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun), or would you recommend another monitoring tool?

Thank you,
Mark

On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl wrote:
> Mark,
> So if I understand, it is more the memory management that you are
> interested in, rather than a need to run an existing C or C++ application
> on the MapReduce platform?
> Have you done profiling of the application?
> C
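The port question Mark raises (also the subject of the grokbase link in this thread) is usually worked around by passing JMX flags to the child JVMs via mapred.child.java.opts. A sketch: the property name is a real Hadoop 1.x setting, but the port and heap values here are illustrative, and a fixed port only works if the node runs a single child JVM at a time.

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=8010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false</value>
</property>
```

With that in place, jConsole or VisualVM can attach to tasktracker-host:8010; limiting each node to one map slot (as Charles suggests) avoids two children fighting over the same port.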
Re: Streaming Hadoop using C
Mark,
So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application on the MapReduce platform?
Have you done profiling of the application?
C

On Feb 29, 2012, at 2:19 PM, Mark question wrote:
> Thanks Charles. I'm running Hadoop for research, to evaluate duplicate
> detection methods. To go deeper, I need to understand what's slowing my
> program down, which usually starts with analyzing memory to predict the
> best input size for a map task. So you're saying piping can help me
> control memory, even though it's eventually running on a VM?
Re: Streaming Hadoop using C
Thanks Charles. I'm running Hadoop for research, to evaluate duplicate detection methods. To go deeper, I need to understand what's slowing my program down, which usually starts with analyzing memory to predict the best input size for a map task. So you're saying piping can help me control memory, even though it's eventually running on a VM?

Thanks,
Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl wrote:
> Mark,
> Both streaming and pipes allow this, perhaps more so pipes at the level
> of the MapReduce task. Can you provide more details on the application?
Re: Streaming Hadoop using C
Mark,
Both streaming and pipes allow this, perhaps more so pipes at the level of the MapReduce task. Can you provide more details on the application?

On Feb 29, 2012, at 1:56 PM, Mark question wrote:
> Hi guys, thought I should ask this before I use it: will using C over
> Hadoop give me the usual C memory management, for example malloc() and
> sizeof()? My guess is no, since this will all eventually be turned into
> bytecode, but I need more control over memory, which is obviously hard
> for me to do in Java.
>
> Let me know of any advantages you know about streaming in C over Hadoop.
> Thank you,
> Mark
Streaming Hadoop using C
Hi guys, thought I should ask this before I use it: will using C over Hadoop give me the usual C memory management, for example malloc() and sizeof()? My guess is no, since this will all eventually be turned into bytecode, but I need more control over memory, which is obviously hard for me to do in Java.

Let me know of any advantages you know about streaming in C over Hadoop.

Thank you,
Mark