Extending pipes to support binary data

2012-02-14 Thread Charles Earl
Hi,
I'm trying to extend the Pipes interface as defined in Pipes.hh to
support reading binary input data.
I believe that would mean extending the getInputValue() method of
the context to return char *, which would then be memcpy'd into the
appropriate type inside the C++ Pipes program.
I'm guessing the best way to do this would be to use a custom
InputFormat on the Java side that would have a BytesWritable value.
Is this the correct approach?
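For concreteness, here is a rough, untested sketch of the C++ side I have in
mind: a minimal Pipes program where the Record layout is purely illustrative
and would have to match whatever the custom InputFormat packs into the
BytesWritable (note that BytesWritable's own serialization carries a length
prefix, so an offset into the buffer may be needed):

    #include <cstring>
    #include <string>
    #include <stdint.h>

    #include "hadoop/Pipes.hh"
    #include "hadoop/TemplateFactory.hh"
    #include "hadoop/StringUtils.hh"

    // Illustrative fixed-size record; the real layout would match whatever
    // the custom InputFormat packs into the BytesWritable value.
    struct Record {
      int32_t id;
      double  score;
    };

    class BinaryMapper : public HadoopPipes::Mapper {
    public:
      BinaryMapper(HadoopPipes::TaskContext& context) {}

      void map(HadoopPipes::MapContext& context) {
        const std::string& value = context.getInputValue();
        // Depending on how the value is framed, a length prefix may need
        // to be skipped before the payload starts.
        if (value.size() < sizeof(Record)) return;   // guard against short records
        Record rec;
        std::memcpy(&rec, value.data(), sizeof(Record));  // reinterpret raw bytes
        context.emit(HadoopUtils::toString(rec.id),
                     HadoopUtils::toString(static_cast<int>(rec.score)));
      }
    };

    class IdentityReduce : public HadoopPipes::Reducer {
    public:
      IdentityReduce(HadoopPipes::TaskContext& context) {}
      void reduce(HadoopPipes::ReduceContext& context) {
        while (context.nextValue()) {
          context.emit(context.getInputKey(), context.getInputValue());
        }
      }
    };

    int main() {
      return HadoopPipes::runTask(
          HadoopPipes::TemplateFactory<BinaryMapper, IdentityReduce>());
    }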

-- 
- Charles


Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
Mark,
Both streaming and pipes allow this, perhaps more so with pipes at the level of the 
MapReduce task. Can you provide more details on the application?
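On the memory question below: with streaming, the mapper is just an ordinary
native executable that reads stdin and writes stdout, so the usual C memory
management (malloc(), free(), sizeof) applies unchanged; only the framework
around it runs in the JVM. A minimal, untested sketch of such a mapper:

    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Streaming mapper: reads "key<TAB>value" lines from stdin and
    // emits "value<TAB>1" lines on stdout. All memory here is managed
    // with plain malloc/free; the JVM only runs on the Hadoop side.
    int main() {
      const size_t cap = 1 << 16;
      char* line = static_cast<char*>(malloc(cap));
      if (line == NULL) return 1;
      while (fgets(line, static_cast<int>(cap), stdin) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';       // strip the newline
        const char* tab = strchr(line, '\t');     // split key from value
        const char* value = (tab != NULL) ? tab + 1 : line;
        printf("%s\t1\n", value);
      }
      free(line);
      return 0;
    }

It would be shipped to the cluster with the streaming jar's -mapper and -file
options (the exact jar path depends on your distribution).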
On Feb 29, 2012, at 1:56 PM, Mark question wrote:

> Hi guys, thought I should ask this before I use it ... will using C over
> Hadoop give me the usual C memory management? For example, malloc(),
> sizeof()? My guess is no, since this all will eventually be turned into
> bytecode, but I need more control over memory, which obviously is hard for me
> to do with Java.
> 
> Let me know of any advantages you know about streaming in C over hadoop.
> Thank you,
> Mark



Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
Mark,
So if I understand, it is more the memory management that you are interested 
in, rather than a need to run an existing C or C++ application on the MapReduce 
platform?
Have you done profiling of the application?
C
On Feb 29, 2012, at 2:19 PM, Mark question wrote:

> Thanks Charles .. I'm running Hadoop for research on duplicate
> detection methods. To go deeper, I need to understand what's slowing my
> program, which usually starts with analyzing memory to predict the best input
> size for a map task. So you're saying piping can help me control memory even
> though it's running on a VM eventually?
> 
> Thanks,
> Mark



Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
The documentation on Starfish http://www.cs.duke.edu/starfish/index.html
looks promising, though I have not used it. I wonder whether others on the list 
have found it more useful than setting mapred.task.profile.
C
On Feb 29, 2012, at 3:53 PM, Mark question wrote:

> I've used Hadoop profiling (.prof) to show the stack trace, but it was hard
> to follow. I tried jConsole locally, since I couldn't find a way to set a port
> number for the child processes when running them remotely. Linux commands
> (top, /proc) showed me that the virtual memory is almost twice my physical
> memory, which means swapping is happening, which is what I'm trying to
> avoid.
> 
> So basically, is there a way to assign a port to child processes to monitor
> them remotely (asked before by Xun) or would you recommend another
> monitoring tool?
> 
> Thank you,
> Mark



Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
I assume you have also just tried running locally and using the JDK performance 
tools (e.g. jmap) to gain insight, by configuring Hadoop to run the absolute 
minimum number of tasks?
Perhaps the discussion
http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
might be relevant?



Re: Streaming Hadoop using C

2012-03-01 Thread Charles Earl
How was your experience with Starfish?
C
On Mar 1, 2012, at 12:35 AM, Mark question wrote:

> Thank you for your time and suggestions, I've already tried starfish, but
> not jmap. I'll check it out.
> Thanks again,
> Mark



Re: Hadoop streaming or pipes ..

2012-04-05 Thread Charles Earl
Also bear in mind that there is a kind of detour involved, in the sense that a 
Pipes map must send its key/value data back to the Java process and then on to 
reduce (more or less).
I think that the Hadoop C Extension (HCE, there is a patch) is supposed to be 
faster.
I would be interested to know whether the community has any experience with HCE 
performance.
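To make the detour concrete, here is the shape of a standard (untested)
wordcount-style Pipes program; every context.emit() in map() is serialized back
over the local socket to the Java parent task, which then feeds the shuffle
and, eventually, reduce():

    #include <string>
    #include <vector>

    #include "hadoop/Pipes.hh"
    #include "hadoop/TemplateFactory.hh"
    #include "hadoop/StringUtils.hh"

    class WordCountMap : public HadoopPipes::Mapper {
    public:
      WordCountMap(HadoopPipes::TaskContext& context) {}
      void map(HadoopPipes::MapContext& context) {
        std::vector<std::string> words =
            HadoopUtils::splitString(context.getInputValue(), " ");
        for (size_t i = 0; i < words.size(); ++i) {
          // Each emit crosses back to the Java process before the
          // shuffle; this round trip is the "detour".
          context.emit(words[i], "1");
        }
      }
    };

    class WordCountReduce : public HadoopPipes::Reducer {
    public:
      WordCountReduce(HadoopPipes::TaskContext& context) {}
      void reduce(HadoopPipes::ReduceContext& context) {
        int sum = 0;
        while (context.nextValue()) {
          sum += HadoopUtils::toInt(context.getInputValue());
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
      }
    };

    int main() {
      return HadoopPipes::runTask(
          HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
    }

It would be launched with something like bin/hadoop pipes -input ... -output 
... -program <path to the binary on HDFS>; the point is simply that the data 
makes the extra hop regardless of what the map does.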
C

On Apr 5, 2012, at 3:49 PM, Robert Evans  wrote:

> Both streaming and pipes do very similar things.  They will fork/exec a 
> separate process that is running whatever you want it to run.  The JVM that 
> is running hadoop then communicates with this process to send the data over 
> and get the processing results back.  The difference between streaming and 
> pipes is that streaming uses stdin/stdout for this communication so 
> preexisting processing like grep, sed and awk can be used here.  Pipes uses a 
> custom protocol with a C++ library to communicate.  The C++ library is tagged 
> with SWIG compatible data so that it can be wrapped to have APIs in other 
> languages like python or perl.
> 
> I am not sure what the performance difference is between the two, but in my 
> own work I have seen a significant performance penalty from using either of 
> them, because there is a somewhat large overhead of sending all of the data 
> out to a separate process just to read it back in again.
> 
> --Bobby Evans
> 
> 
> On 4/5/12 1:54 PM, "Mark question"  wrote:
> 
> Hi guys,
>  quick question:
>   Are there any performance gains from hadoop streaming or pipes over
> Java? From what I've read, it's only to ease testing by using your favorite
> language. So I guess it is eventually translated to bytecode then executed.
> Is that true?
> 
> Thank you,
> Mark
> 


Re: Text Analysis

2012-04-25 Thread Charles Earl
If you've got existing R code, you might want to look at this Quora posting,
http://www.quora.com/How-can-R-and-Hadoop-be-used-together (also by Cloudera),
or the RHIPE R Hadoop package, https://github.com/saptarshiguha/RHIPE/wiki.
Mahout and Lucene/Solr offer some level of text analysis, although I would not 
call these complete text analysis packages.
What I've found are specific algorithms rather than a complete package: for 
example, LDA for topic discovery -- Mahout and Yahoo Research 
(https://github.com/shravanmn/Yahoo_LDA) have Hadoop-based implementations -- 
in the case of Yahoo_LDA the data is stored in HDFS, while the computation is 
essentially MPI-based. Whether an algorithm reads its data from HDFS but uses 
an approach other than MapReduce is another question.
C

On Apr 25, 2012, at 12:47 PM, Jagat wrote:

> There are APIs which you can use; of course, they are third party.
> 
> ---
> Sent from Mobile , short and crisp.
> On 25-Apr-2012 8:57 PM, "Robert Evans"  wrote:
> 
>> Hadoop itself is the core Map/Reduce and HDFS functionality.  The higher
>> level algorithms like sentiment analysis are often done by others.
>> Cloudera has a video from HadoopWorld 2010 about it
>> 
>> 
>> http://www.cloudera.com/resource/hw10_video_sentiment_analysis_powered_by_hadoop/
>> 
>> And there are likely to be other tools like R that can help you out with
>> it.  I am not really sure if mahout offers sentiment analysis or not, but
>> you might want to look there too http://mahout.apache.org/
>> 
>> --Bobby Evans
>> 
>> 
>> On 4/25/12 7:50 AM, "karanveer.si...@barclays.com" <
>> karanveer.si...@barclays.com> wrote:
>> 
>> Hi,
>> 
>> I wanted to know if there are any existing APIs within Hadoop for us to
>> do some text analysis like sentiment analysis, etc., or are we to rely on
>> tools like R, etc. for this.
>> 
>> 
>> Regards,
>> Karanveer



Re: Hadoop-on-demand and torque

2012-05-21 Thread Charles Earl
Ralph,
Do you have any YARN or Mesos performance comparisons against HOD? I suppose 
since it was a customer requirement you might not have explored it. MPI support 
seems to be an active issue for Mesos now.
Charles

On May 21, 2012, at 10:36 AM, Ralph Castain  wrote:

> Not quite yet, though we are working on it (some descriptive stuff is around, 
> but needs to be consolidated). Several of us started working together a 
> couple of months ago to support the MapReduce programming model on HPC 
> clusters using Open MPI as the platform. In working with our customers and 
> OMPI's wide community of users, we found that people were interested in this 
> capability, wanted to integrate MPI support into their MapReduce jobs, and 
> didn't want to migrate their clusters to YARN for various reasons.
> 
> We have released initial versions of two new tools in the OMPI developer's 
> trunk, scheduled for inclusion in the upcoming 1.7.0 release:
> 
> 1. "mr+" - executes the MapReduce programming paradigm. Currently, we only 
> support streaming data, though we will extend that support shortly. All HPC 
> environments (rsh, SLURM, Torque, Alps, LSF, Windows, etc.) are supported. 
> Both mappers and reducers can utilize MPI (independently or in combination) 
> if they so choose. Mappers and reducers can be written in any of the typical 
> HPC languages (C, C++, and Fortran) as well as Java (note: OMPI now comes 
> with Java MPI bindings).
> 
> 2. "hdfsalloc" - takes a list of files and obtains a resource allocation for 
> the nodes upon which those files reside. SLURM and Moab/Maui are currently 
> supported, with Gridengine coming soon.
> 
> There will be a public announcement of this in the near future, and we expect 
> to integrate the Hadoop 1.0 and Hadoop 2.0 MR classes over the next couple of 
> months. By the end of this summer, we should have a full-featured public 
> release.
> 
> 
> On May 20, 2012, at 2:10 PM, Brian Bockelman wrote:
> 
>> Hi Ralph,
>> 
>> I admit - I've only been half-following the OpenMPI progress.  Do you have a 
>> technical write-up of what has been done?
>> 
>> Thanks,
>> 
>> Brian
>> 
>> On May 20, 2012, at 9:31 AM, Ralph Castain wrote:
>> 
>>> FWIW: Open MPI now has an initial cut at "MR+" that runs map-reduce under 
>>> any HPC environment. We don't have the Java integration yet to support the 
>>> Hadoop MR class, but you can write a mapper/reducer and execute that 
>>> programming paradigm. We plan to integrate the Hadoop MR class soon.
>>> 
>>> If you already have that integration, we'd love to help port it over. We 
>>> already have the MPI support completed, so any mapper/reducer could use it.
>>> 
>>> 
>>> On May 20, 2012, at 7:12 AM, Pierre Antoine DuBoDeNa wrote:
>>> 
 We run similar infrastructure in a university project. We plan to install
 Hadoop and are looking for "alternatives" based on Hadoop in case pure
 Hadoop is not working as expected.
 
 Keep us updated on the code release.
 
 Best,
 PA
 
 2012/5/20 Stijn De Weirdt 
 
> hi all,
> 
> i'm part of an HPC group at a university, and we have some users who are
> interested in Hadoop to see if it can be useful in their research, and we
> also have researchers who are already using Hadoop on their own
> infrastructure, but that is not enough reason for us to start with
> dedicated Hadoop infrastructure (we are now only running Torque-based
> clusters with and without shared storage; setting up and properly
> maintaining Hadoop infrastructure requires quite some understanding of new
> software)
> 
> to be able to support these needs we wanted to do just this: use current
> HPC infrastructure to make private hadoop clusters so people can do some
> work. if we attract enough interest, we will probably setup dedicated
> infrastructure, but by that time we (the admins) will also have a better
> understanding of what is required.
> 
> so we used to look at HOD for testing/running hadoop on existing
> infrastructure (never really looked at myhadoop though).
> but (imho) the current HOD code base is not in such a good state. we did
> some work to get it working and added some features, to come to the
> conclusion that it was not sufficient (and not maintainable).
> 
> so we wrote something from scratch with the same functionality as HOD, and
> much more (eg HBase is now possible, with or without MR1; some default
> tuning; easy to add support for YARN instead of MR1).
> it has some support for Torque, but my laptop is also sufficient. (the
> Torque support is a wrapper to submit the job)
> we gave a workshop on hadoop using it (25 people, and each with their own
> 5 node hadoop cluster) and it went rather well.
> 
> it's not in a public repo yet, but we could do that. if interested, let me
> know, and i see what can be done. (releasing the code is