Re: Spark vs Tez

2014-10-19 Thread Mohan Radhakrishnan
Is Tez's architecture similar to Akka's distributed architecture? I seem to
remember Jonas Bonér mentioning, during a presentation on distributed
computing, Akka's support for protocols like Raft. What makes Tez
more scalable in this regard?

Thanks,
Mohan

On Sun, Oct 19, 2014 at 5:26 PM, Niels Basjes ni...@basjes.nl wrote:

 Very interesting!
 What makes Tez more scalable than Spark?
 What architectural thing makes the difference?

 Niels Basjes
 On Oct 19, 2014 3:07 AM, Jeff Zhang zjf...@gmail.com wrote:

 Tez has a feature called pre-warm, which launches JVMs before you use them,
 and you can reuse the containers afterwards. So it is also suitable for
 interactive queries and is more stable and scalable than Spark, IMO.
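
As a rough illustration of the pre-warm and container-reuse idea, the sketch below renders a few Hive-on-Tez session settings as SET statements. The property names (hive.prewarm.enabled, hive.prewarm.numcontainers, tez.am.container.reuse.enabled) are, to the best of my knowledge, real Hive/Tez properties; the values and the helper name render_session_settings are illustrative assumptions, so check them against the versions you run.

```python
# Hypothetical helper: turn a settings dict into Hive "SET key=value;" lines.
# Property names are believed to match Hive/Tez; the values are placeholders.
PREWARM_SETTINGS = {
    "hive.execution.engine": "tez",
    "hive.prewarm.enabled": "true",            # start containers before the first query
    "hive.prewarm.numcontainers": "10",        # how many containers to pre-warm
    "tez.am.container.reuse.enabled": "true",  # reuse containers across tasks
}


def render_session_settings(settings):
    """Return one SET statement per property, in insertion order."""
    return [f"SET {key}={value};" for key, value in settings.items()]


if __name__ == "__main__":
    print("\n".join(render_session_settings(PREWARM_SETTINGS)))
```

The statements would be pasted into a Hive session before running queries; how much pre-warming helps depends on the cluster and workload.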

 On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes ni...@basjes.nl wrote:

 It is my understanding that one of the big differences between Tez and
 Spark is that a Tez-based query still has the startup overhead of
 starting JVMs on the YARN cluster. Spark-based queries are immediately
 executed on already running JVMs.

 So for interactive dashboards Spark seems more suitable.

 Did I understand correctly?

 Niels Basjes
 On Oct 17, 2014 8:30 PM, Gavin Yue yue.yuany...@gmail.com wrote:

 Spark and Tez both make MR faster, no doubt about that.

 They also provide new features like DAGs, which are quite important for
 interactive query processing. From this perspective, you could view them
 as wrappers around MR that try to handle the intermediate buffers (files)
 more efficiently, which is a big pain in MR.

 Also, they both try to use memory as the buffer instead of only the
 filesystem. Spark has the concept of the RDD, which is quite interesting
 but also limited.
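
To make the point about intermediate buffers concrete, here is a toy sketch in plain Python. It is not Spark or Tez API, and all function names are made up. The "chained" version writes the output of the first stage to a temporary file before the second stage reads it, the way chained MR jobs do. The "pipelined" version hands the intermediate records to the next stage in memory.

```python
import json
import os
import tempfile
from collections import Counter


def tokenize(lines):
    """Stage 1: split lines into lowercase words."""
    for line in lines:
        for word in line.split():
            yield word.lower()


def count(words):
    """Stage 2: count occurrences of each word."""
    return dict(Counter(words))


def chained_jobs(lines):
    """MR-style: stage 1 output is written to a file before stage 2 reads it."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
        json.dump(list(tokenize(lines)), tmp)
        path = tmp.name
    try:
        with open(path) as handle:
            return count(json.load(handle))
    finally:
        os.remove(path)


def pipelined_dag(lines):
    """DAG-style: stage 2 consumes stage 1's generator directly, no file in between."""
    return count(tokenize(lines))


if __name__ == "__main__":
    data = ["Spark and Tez", "Tez and MR"]
    print(chained_jobs(data) == pipelined_dag(data))
```

Both versions compute the same result; the difference is only where the intermediate data lives, which is where the real engines save time.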



 On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   It was my understanding that Spark is for faster batch processing. Tez
 is the new execution engine that replaces MapReduce and is also supposed to
 speed up batch processing. Is that not correct?
 B.



  *From:* Shahab Yunus shahab.yu...@gmail.com
 *Sent:* Friday, October 17, 2014 1:12 PM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Spark vs Tez

  What aspects of Tez and Spark are you comparing? They have different
 purposes and thus are not directly comparable, as far as I understand.

 Regards,
 Shahab

 On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:

   Does anybody have any performance figures on how Spark stacks up
 against Tez? If you don’t have figures, does anybody have an opinion? Spark
 seems so popular but I’m not really seeing why.
 B.







 --
 Best Regards

 Jeff Zhang




Re: Spark vs Tez

2014-10-18 Thread Mohan Radhakrishnan
As far as I remember, Spark uses Akka clusters. Isn't that totally different
from other distributed technologies?

Thanks,
Mohan

On Sat, Oct 18, 2014 at 1:52 PM, Niels Basjes ni...@basjes.nl wrote:

 It is my understanding that one of the big differences between Tez and
 Spark is that a Tez-based query still has the startup overhead of
 starting JVMs on the YARN cluster. Spark-based queries are immediately
 executed on already running JVMs.

 So for interactive dashboards Spark seems more suitable.

 Did I understand correctly?

 Niels Basjes



Re: Hadoop and Open Data (CKAN.org).

2014-09-04 Thread Mohan Radhakrishnan
I understand that MR jobs have to be coded in a programming language, but if
we are just processing large amounts of data (machine learning, for example)
we could use Pig. I recently processed 0.25 TB on AWS clusters in a reasonably
short time. In this case the development effort is very small.


Thanks,
Mohan


On Thu, Sep 4, 2014 at 6:41 PM, Alec Ten Harmsel a...@alectenharmsel.com
wrote:

  I would recommend using Hadoop only if you are ingesting a lot of data
 and you need reasonable performance at scale. I would recommend starting
 with [insert language/tool of choice] to ingest and transform data
 until that process starts taking too long.

 For example, one of our researchers at the University of Michigan had to
 process ~150GB of data. Using Python, processing that data took about 45
 minutes - it was not worth it to spend extra development time to run it on
 Hadoop. This time will change depending on what you need to do and the
 hardware available, naturally.
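
For a sense of what the single-machine approach can look like, here is a minimal sketch that aggregates a large CSV file one row at a time, so memory use stays flat regardless of file size. The column names (category, amount) and the function names are hypothetical and not taken from the researcher's actual job.

```python
import csv
import io
from collections import defaultdict


def sum_by_key(rows, key_field, value_field):
    """Stream rows (dicts) and keep a running total per key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_field]] += float(row[value_field])
    return dict(totals)


def sum_csv(handle, key_field="category", value_field="amount"):
    """Read a CSV file object lazily; csv.DictReader yields one row at a time."""
    return sum_by_key(csv.DictReader(handle), key_field, value_field)


if __name__ == "__main__":
    # In real use, pass open("big_file.csv", newline="") instead of this sample.
    sample = io.StringIO("category,amount\na,1.5\nb,2\na,0.5\n")
    print(sum_csv(sample))
```

Only the per-key totals are held in memory, so the run time is bounded by disk and CPU speed rather than RAM.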

 So until you need to frequently process large amounts of data, I'd stick
 with something you're already familiar with.

 Alec Ten Harmsel

 On 09/04/2014 03:30 AM, Henrik Aagaard Jørgensen wrote:

  Dear all,

 I’m very new to Hadoop as I’m still trying to grasp its value and
 purpose. I do hope my question on this mailing list is OK.

 I manage our open data platform at our municipality, using CKAN.org. It
 works very well for its purpose of showing data and adding APIs to data.

 However, I’m very interested in knowing more about Hadoop and whether it
 would fit into an (open) data platform, as we are getting more and more
 data to show and to work with internally at our municipality.

 However, I cannot figure out whether this is the right purpose to use
 Hadoop for, whether it is “overkill” or…

 Could someone elaborate on this topic?

 I’ve Googled around a lot and looked at various videos online, and Hadoop
 seems to have its place, also in an open data platform environment.

 Best regards,

 Henrik





Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-15 Thread Mohan Radhakrishnan
Actually there was another thread about using MR for ML, but I didn't see
many responses. I use Octave or R for this, but it would be useful to know
how this is solved using Hadoop. The closest community that has an interest
in this could be H2O, but as I understand it they have implemented MR for
their own engine to solve these problems. So we may be able to look at
their code, but that could be tedious.

Mohan


On Thu, Aug 14, 2014 at 3:35 PM, Kai Wähner megachu...@gmail.com wrote:

 For a beginner, it depends on what you want to learn. Do you want to
 program MapReduce, just run some SQL queries against Hadoop, or install,
 deploy and monitor a Hadoop cluster?

 This article might help you make a good decision:
 Spoilt for Choice - How to Choose the Right Hadoop Distribution
 http://www.infoq.com/articles/BigDataPlatform

 Kai

 Sent from my iPhone

  On 14.08.2014, at 11:58, Chris MacKenzie 
 stu...@chrismackenziephotography.co.uk wrote:
 
  Hi,
 
  I have been using Hadoop loosely since Christmas, and since May for a
  Software Engineering MSc at Heriot-Watt University in Edinburgh, Scotland.
  I have written a genetic sequence alignment algorithm.

  I have installed Hadoop in various places including a 32 node cluster and
  am using Eclipse Kepler SR2 as an IDE.

  My current Hadoop version is 2.4.1, which I downloaded as a tarball from
  the Apache mirror servers.

  It's been a tough learning curve, but that has made the learning all the
  more valuable.

  I believe using the straight Hadoop version has given insights that
  proprietary builds wouldn't have. There are so many confusing issues that
  crop up, it's easy to attach importance to trying to fix an error
  which masks another. With the proprietary versions it would be easy to
  attach blame where it's not that build's or this build's fault.

  Go with your heart but be prepared to work to solve the problems you
  encounter.

  Buy Tom White's book; it isn't perfect and it is a couple of years out of
  date, but it gives you enough detail and structure to build an impression
  you can work from. The downloadable source code is a great help when
  trying to get started.
 
  Good luck.
 
 
  Regards,
 
  Chris MacKenzie
  telephone: 0131 332 6967
  email: stu...@chrismackenziephotography.co.uk
  corporate: www.chrismackenziephotography.co.uk
  http://www.chrismackenziephotography.co.uk/
  http://plus.google.com/+ChrismackenziephotographyCoUk/posts
  http://www.linkedin.com/in/chrismackenziephotography/
 
 
 
 
 
 
  From:  Adaryl "Bob" Wakefield, MBA adaryl.wakefi...@hotmail.com
  Reply-To:  user@hadoop.apache.org
  Date:  Thursday, 14 August 2014 01:13
  To:  user@hadoop.apache.org
  Subject:  Re: Started learning Hadoop. Which distribution is best for
  native install in pseudo distributed mode?
 
 
  He didn't ask for the best and nobody framed their answer like that. He
  asked what people were using. Out of the 10 responses, only four of them
  actually answered his question.

  I've been studying Hadoop for two months straight. Quite frankly, I wish
  more people would ask for community input on what does what and how.
 
  Adaryl "Bob" Wakefield, MBA
  Principal
  Mass Street Analytics
  913.938.6685
  www.linkedin.com/in/bobwakefieldmba
  Twitter: @BobLovesData
 
  From: Kilaru, Sambaiah mailto:sambaiah_kil...@intuit.com
  Sent: Wednesday, August 13, 2014 1:10 PM
  To: user@hadoop.apache.org
  Subject: Re: Started learning Hadoop. Which distribution is best for
  native install in pseudo distributed mode?
 
 
 
 
  Enough wars are going on over which is best. Choose one of them and try
  to learn it; there is no such thing as x being better or y being better.
  It is up to you.
 
  Thanks,
  Sam
 
  From: Sebastiano Di Paola sebastiano.dipa...@gmail.com
  Reply-To: user@hadoop.apache.org user@hadoop.apache.org
  Date: Wednesday, August 13, 2014 at 6:28 PM
  To: user@hadoop.apache.org user@hadoop.apache.org
  Subject: Re: Started learning Hadoop. Which distribution is best for
  native install in pseudo distributed mode?
 
 
  Hi,
  I'm a newbie too and I'm not using any particular distribution. I just
  download the components I need or want to try for my deployment and use
  them.

  It's a slow process, but it allows me to better understand what I'm
  doing under the hood.
 
  Regards,
  Seba
 
 
 
  On Tue, Aug 12, 2014 at 10:12 PM, mani kandan mankand...@gmail.com wrote:

   Which distribution are you people using? Cloudera vs Hortonworks vs
   BigInsights?
 
 
 
 
 
 



Re: Managed File Transfer

2014-07-09 Thread Mohan Radhakrishnan
I am a beginner. But this seems to be similar to what I intend. The data
source will be external FTP or S3 storage.

Spark Streaming can read data from HDFS
(http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html),
Flume (http://flume.apache.org/), Kafka (http://kafka.apache.org/), Twitter
(https://dev.twitter.com/) and ZeroMQ (http://zeromq.org/). You can also
define your own custom data sources.

Thanks,
Mohan


On Wed, Jul 9, 2014 at 2:09 PM, Stanley Shi s...@gopivotal.com wrote:

 There's the DistCp utility for this kind of purpose.
 There's also Spring XD, but I am not sure if you want to use it.
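
A minimal sketch of how a DistCp copy might be launched from a script. The builder only assembles the argument list; the host names, paths and helper names are placeholders. Whether an ftp:// URI works as a DistCp source depends on the Hadoop version and the FTP server, so treat that part as an assumption to verify. The -update and -m flags are standard DistCp options.

```python
import subprocess


def build_distcp_command(source_uri, target_uri, max_maps=None, update=False):
    """Build the argv list for `hadoop distcp`; nothing is executed here."""
    command = ["hadoop", "distcp"]
    if update:
        command.append("-update")         # copy only files missing or changed at the target
    if max_maps is not None:
        command += ["-m", str(max_maps)]  # cap the number of simultaneous copy tasks
    return command + [source_uri, target_uri]


def run_distcp(source_uri, target_uri, **options):
    """Run the copy and raise CalledProcessError on a non-zero exit code."""
    argv = build_distcp_command(source_uri, target_uri, **options)
    return subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder URIs; replace with real endpoints before calling run_distcp().
    print(" ".join(build_distcp_command(
        "ftp://ftp.example.com/incoming", "hdfs://namenode:8020/landing",
        max_maps=20, update=True)))
```

A scheduler (cron, Oozie or similar) could then call such a script at fixed intervals.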

 Regards,
 *Stanley Shi,*






Managed File Transfer

2014-07-07 Thread Mohan Radhakrishnan
Hi,
   We used a commercial file transfer (FT) and scheduler tool in clustered
mode. This was a traditional active-active cluster that supported multiple
protocols such as FTPS.

Now I am interested in evaluating a distributed way of crawling FTP
sites and downloading files using Hadoop. Since we have to process
thousands of files, I thought Hadoop jobs could do it.

Are Hadoop jobs used for this type of file transfer?

There is also a requirement for a scheduler. What does the forum
recommend?


Thanks,
Mohan


Practical examples

2014-04-28 Thread Mohan Radhakrishnan
Hi,
   I have been reading the Definitive Guide and taking online courses.
Now I would like to understand how Hadoop is used for more real-time
scenarios. Are machine learning, language processing and fraud detection
examples available? What are the other practical use cases?

I am familiar with machine learning and use a single node on an 8 GB machine.

Thanks,
Mohan


Re: Practical examples

2014-04-28 Thread Mohan Radhakrishnan
I am interested in ML, but I want a Hadoop base because I am learning
Hadoop. Mahout seems to be for ML at this time, not Hadoop.

Thanks,
Mohan


On Tue, Apr 29, 2014 at 7:38 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 For machine learning based applications of Hadoop you can check out the
 Mahout framework.

 Regards,
 Shahab




Re: calling mapreduce from webservice

2014-04-18 Thread Mohan Radhakrishnan
Play framework is reactive and uses push channels. It may be useful here if
the UI has to be asynchronous and reactive.

Mohan


On Sat, Apr 19, 2014 at 4:37 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 As far as I know there is no API to kick off M/R jobs. For M/R v2 there is
 a REST API to get the status of jobs:
 http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html#Mapreduce_Application_Master_Info_API

 I would say that if you have to invoke M/R jobs from your middle tier or
 back-end, you have to implement a custom solution, i.e. invoke the M/R
 jobs in the standard way, monitor the status of the job, and then update
 the UI asynchronously, depending on which UI framework or web service
 implementation (e.g. WS-Addressing) you are using.
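
The "invoke, monitor, update the UI" flow could be sketched roughly as follows. The loop polls a status source until the job reaches a terminal state and reports each change through a callback. It assumes the YARN ResourceManager REST endpoint /ws/v1/cluster/apps/{appid} and its app.state field; the host, the application id and all function names are hypothetical. The fetch function is injectable, so the loop can be tried without a cluster.

```python
import json
import time
import urllib.request

TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def fetch_app_state(rm_url, app_id):
    """Ask the ResourceManager for the application's current state (assumed endpoint)."""
    url = f"{rm_url}/ws/v1/cluster/apps/{app_id}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["app"]["state"]


def wait_for_job(app_id, fetch_state, on_change, poll_seconds=5.0, sleep=time.sleep):
    """Poll until a terminal state; call on_change(state) whenever the state changes."""
    last = None
    while True:
        state = fetch_state(app_id)
        if state != last:
            on_change(state)  # e.g. push the new status to the browser
            last = state
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


if __name__ == "__main__":
    # Dry run with canned states instead of a real cluster.
    canned = iter(["ACCEPTED", "RUNNING", "RUNNING", "FINISHED"])
    wait_for_job("application_0000000000000_0001", lambda _id: next(canned),
                 print, sleep=lambda _seconds: None)
```

In a web service this loop would run in a background thread or task per submitted job, so the request that submitted the job returns immediately.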

 Regards,
 Shahab


 On Fri, Apr 18, 2014 at 3:11 PM, girish hilage girish_hil...@yahoo.com wrote:

 Yes. I intend to run the jobs asynchronously and show the status of each
 user-submitted job as running, completed, etc. The user will be able to
 submit new jobs simultaneously. I have not checked PigLipstick though.

 Regards,
 Girish

   On Saturday, April 19, 2014 12:34 AM, Shahab Yunus 
 shahab.yu...@gmail.com wrote:
  Question: M/R jobs are supposed to run for a long time. They are
 essentially batch processes. Do you plan to keep the Web UI blocked for
 that long? Or are you looking for asynchronous invocation of the M/R job?
 Or are you thinking about building a sort of Admin UI (e.g. PigLipstick)?
 What exactly is your requirement?

 Regards,
 Shahab


 On Fri, Apr 18, 2014 at 3:01 PM, girish hilage 
 girish_hil...@yahoo.com wrote:

 Hi,

 This is just to check with you whether it is possible to call MR jobs from
 Java web services.
 If yes, could you please help me by pointing to some
 resources/docs?

 Actually, what I intend to do is create a Web UI with some
 functionality which would call MR jobs and present the result to the user
 in the browser.

 Regards,
 Girish








Hadoop distribution(2-node cluster)

2014-04-14 Thread Mohan Radhakrishnan
Hi,
As the subject implies, I have 2 nodes: one runs OS X and the other runs
Linux. How is a distributed cluster installed in this case? What other
networking equipment do I need?

Thanks,
Mohan


2-node cluster

2014-04-14 Thread Mohan Radhakrishnan
Hi,
I have 2 nodes: one runs OS X and the other runs Linux. How is a
distributed cluster installed in this case? What other networking
equipment do I need?

Can I ask for pointers to instructions? I am new.

Thanks,
Mohan