Re: Spark vs Tez

2014-10-19 Thread Mohan Radhakrishnan
Is Tez's architecture similar to Akka's distributed architecture? I think
I remember Jonas Bonér mentioning, during a presentation on distributed
computing, Akka's support for protocols like Raft. What makes Tez more
scalable in this regard?

Thanks,
Mohan

On Sun, Oct 19, 2014 at 5:26 PM, Niels Basjes  wrote:

> Very interesting!
> What makes Tez more scalable than Spark?
> What architectural "thing" makes the difference?
>
> Niels Basjes
> On Oct 19, 2014 3:07 AM, "Jeff Zhang"  wrote:
>
>> Tez has a feature called pre-warm, which launches JVMs before you use
>> them, and you can reuse the containers afterwards. So it is also suitable
>> for interactive queries, and it is more stable and scalable than Spark IMO.
>>
>> On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes  wrote:
>>
>>> It is my understanding that one of the big differences between Tez and
>>> Spark is that a Tez-based query still has the startup overhead of
>>> starting JVMs on the YARN cluster. Spark-based queries are immediately
>>> executed on "already running JVMs".
>>>
>>> So for interactive dashboards Spark seems more suitable.
>>>
>>> Did I understand correctly?
>>>
>>> Niels Basjes
>>> On Oct 17, 2014 8:30 PM, "Gavin Yue"  wrote:
>>>
>>>> Spark and Tez both make MR faster; there is no doubt about that.
>>>>
>>>> They also provide new features like DAGs, which are quite important for
>>>> interactive query processing. From this perspective, you could view them
>>>> as wrappers around MR that try to handle the intermediate buffers (files)
>>>> more efficiently, which is a big pain point in MR.
>>>>
>>>> They also both try to use memory as the buffer instead of only the
>>>> filesystem. Spark has the RDD concept, which is quite interesting but
>>>> also limited.
>>>>
>>>> On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
>>>> adaryl.wakefi...@hotmail.com> wrote:
>>>>
>>>>> It was my understanding that Spark is faster batch processing. Tez
>>>>> is the new execution engine that replaces MapReduce and is also
>>>>> supposed to speed up batch processing. Is that not correct?
>>>>> B.
>>>>>
>>>>> *From:* Shahab Yunus
>>>>> *Sent:* Friday, October 17, 2014 1:12 PM
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* Re: Spark vs Tez
>>>>>
>>>>> What aspects of Tez and Spark are you comparing? They have different
>>>>> purposes and thus are not directly comparable, as far as I understand.
>>>>>
>>>>> Regards,
>>>>> Shahab
>>>>>
>>>>> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
>>>>> adaryl.wakefi...@hotmail.com> wrote:
>>>>>
>>>>>> Does anybody have any performance figures on how Spark stacks up
>>>>>> against Tez? If you don’t have figures, does anybody have an opinion?
>>>>>> Spark seems so popular but I’m not really seeing why.
>>>>>> B.
>>>>>>
>>>>>


>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
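
The pre-warm and container reuse mentioned in this thread are normally switched on through configuration. The following is a minimal sketch, in Python, that renders a Hadoop-style <configuration> XML fragment. The property names are the ones Tez and Hive-on-Tez are understood to expose and should be verified against the documentation for the versions in use; the values and the helper name render_site_xml are illustrative assumptions, not settings taken from the thread.

```python
import xml.etree.ElementTree as ET

# Illustrative values only; verify the property names against the Tez and
# Hive documentation for the versions actually deployed.
SETTINGS = {
    "hive.execution.engine": "tez",            # run Hive queries on Tez
    "hive.prewarm.enabled": "true",            # start containers ahead of the first query
    "hive.prewarm.numcontainers": "10",        # assumed pool size
    "tez.am.container.reuse.enabled": "true",  # reuse containers (JVMs) between tasks
}


def render_site_xml(settings):
    """Render a dict as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(SETTINGS))
```

In a real deployment the hive.* properties would typically live in hive-site.xml and the tez.* property in tez-site.xml; they are combined here only to keep the sketch short.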


Re: Spark vs Tez

2014-10-18 Thread Mohan Radhakrishnan
I remember that Spark uses Akka clusters. Isn't that totally different from
other distributed technologies?

Thanks,
Mohan

On Sat, Oct 18, 2014 at 1:52 PM, Niels Basjes  wrote:

> It is my understanding that one of the big differences between Tez and
> Spark is that a Tez-based query still has the startup overhead of
> starting JVMs on the YARN cluster. Spark-based queries are immediately
> executed on "already running JVMs".
>
> So for interactive dashboards Spark seems more suitable.
>
> Did I understand correctly?
>
> Niels Basjes
> On Oct 17, 2014 8:30 PM, "Gavin Yue"  wrote:
>
>> Spark and Tez both make MR faster; there is no doubt about that.
>>
>> They also provide new features like DAGs, which are quite important for
>> interactive query processing. From this perspective, you could view them
>> as wrappers around MR that try to handle the intermediate buffers (files)
>> more efficiently, which is a big pain point in MR.
>>
>> They also both try to use memory as the buffer instead of only the
>> filesystem. Spark has the RDD concept, which is quite interesting but
>> also limited.
>>
>>
>>
>> On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefi...@hotmail.com> wrote:
>>
>>>   It was my understanding that Spark is faster batch processing. Tez is
>>> the new execution engine that replaces MapReduce and is also supposed to
>>> speed up batch processing. Is that not correct?
>>> B.
>>>
>>>
>>>
>>>  *From:* Shahab Yunus 
>>> *Sent:* Friday, October 17, 2014 1:12 PM
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: Spark vs Tez
>>>
>>> What aspects of Tez and Spark are you comparing? They have different
>>> purposes and thus are not directly comparable, as far as I understand.
>>>
>>> Regards,
>>> Shahab
>>>
>>> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefi...@hotmail.com> wrote:
>>>
>>>> Does anybody have any performance figures on how Spark stacks up
>>>> against Tez? If you don’t have figures, does anybody have an opinion?
>>>> Spark seems so popular but I’m not really seeing why.
>>>> B.

>>>
>>>
>>
>>


Re: Hadoop and Open Data (CKAN.org).

2014-09-04 Thread Mohan Radhakrishnan
I understand that MR jobs have to be coded in a programming language, but if
we are just processing large amounts of data (machine learning, for example)
we could use Pig. I recently processed 0.25 TB on AWS clusters in a
reasonably short time. In this case the development effort is very low.


Thanks,
Mohan


On Thu, Sep 4, 2014 at 6:41 PM, Alec Ten Harmsel 
wrote:

>  I would recommend using Hadoop only if you are ingesting a lot of data
> and you need reasonable performance at scale. I would recommend starting
> with using  to ingest and transform data
> until that process starts taking too long.
>
> For example, one of our researchers at the University of Michigan had to
> process ~150GB of data. Using Python, processing that data took about 45
> minutes; it was not worth spending extra development time to run it on
> Hadoop. This time will change depending on what you need to do and the
> hardware available, naturally.
>
> So until you need to frequently process large amounts of data, I'd stick
> with something you're already familiar with.
>
> Alec Ten Harmsel
>
> On 09/04/2014 03:30 AM, Henrik Aagaard Jørgensen wrote:
>
>  Dear all,
>
> I’m very new to Hadoop and am still trying to grasp its value and
> purpose. I do hope my question on this mailing list is OK.
>
> I manage the open data platform at our municipality, using CKAN.org. It
> works very well for its purpose of showing data and adding APIs to data.
>
> However, I’m very interested in knowing more about Hadoop and whether it
> would fit into an (open) data platform, as we are getting more and more
> data to show and to work with internally at our municipality.
>
> However, I cannot figure out whether this is the right purpose to use
> Hadoop for, or whether it is “overkill” or…
>
> Could someone elaborate on this topic?
>
> I’ve Googled around a lot and looked at various videos online, and Hadoop
> seems to have its place, also in an open data platform environment.
>
> Best regards,
>
> Henrik
>


Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?

2014-08-15 Thread Mohan Radhakrishnan
Actually, there was another thread about using MR for ML, but I didn't see
many responses. I use Octave or R for this, but it would be useful to know
how this is solved using Hadoop. The closest community with an interest in
this could be H2O, but as I understand it they have implemented their own
MR for their engine to solve these problems. So we may be able to look at
their code, but that could be tedious.

Mohan


On Thu, Aug 14, 2014 at 3:35 PM, Kai Wähner  wrote:

> For a beginner, it depends on what you want to learn. Do you want to
> program MapReduce, just run some SQL queries against Hadoop, or install,
> deploy and monitor a Hadoop cluster?
>
> This article might help you make a good decision:
> "spoilt for choice - how to choose the right Hadoop distribution"
> http://www.infoq.com/articles/BigDataPlatform
>
> Kai
>
> Sent from my iPhone
>
> > On 14.08.2014, at 11:58, Chris MacKenzie <
> > stu...@chrismackenziephotography.co.uk> wrote:
> >
> > Hi,
> >
> > I have been using Hadoop loosely since Christmas, and since May for a
> > Software Engineering MSc at Heriot-Watt University in Edinburgh,
> > Scotland. I have written a genetic sequence alignment algorithm.
> >
> > I have installed Hadoop in various places, including a 32-node cluster,
> > and am using Eclipse Kepler SR2 as an IDE.
> >
> > My current Hadoop version is 2.4.1, which I downloaded as a tar from the
> > Apache mirror servers.
> >
> > It's been a tough learning curve, but that has made the learning all the
> > more valuable.
> >
> > I believe using the straight Hadoop version has given insights that
> > proprietary builds wouldn't have. There are so many confusing issues that
> > crop up that it's easy to attach importance to trying to fix an error
> > which masks another. With the proprietary versions it would be easy to
> > attach blame where it's not this or that build's fault.
> >
> > Go with your heart, but be prepared to work to solve the problems you
> > encounter.
> >
> > Buy Tom White's book. It isn't perfect and is a couple of years out of
> > date, but it gives you enough detail and structure to build an impression
> > you can work from. The downloadable source code is a great help when
> > trying to get started.
> >
> > Good luck.
> >
> >
> > Regards,
> >
> > Chris MacKenzie
> > telephone: 0131 332 6967
> > email: stu...@chrismackenziephotography.co.uk
> > corporate: www.chrismackenziephotography.co.uk
> >
> > From:  "Adaryl \"Bob\" Wakefield, MBA" 
> > Reply-To:  
> > Date:  Thursday, 14 August 2014 01:13
> > To:  
> > Subject:  Re: Started learning Hadoop. Which distribution is best for
> > native install in pseudo distributed mode?
> >
> >
> > He didn't ask for the best and nobody framed their answer like that. He
> > asked what people were using. Out of the 10 responses, only four
> > actually answered his question.
> >
> > I've been studying Hadoop for two months straight. Quite frankly, I wish
> > more people would ask for community input on what does what and how.
> >
> > Adaryl "Bob" Wakefield, MBA
> > Principal
> > Mass Street Analytics
> > 913.938.6685
> > www.linkedin.com/in/bobwakefieldmba
> > Twitter: @BobLovesData
> >
> > From: Kilaru, Sambaiah 
> > Sent: Wednesday, August 13, 2014 1:10 PM
> > To: user@hadoop.apache.org
> > Subject: Re: Started learning Hadoop. Which distribution is best for
> > native install in pseudo distributed mode?
> >
> >
> >
> >
> > Enough wars are going on over which is best. Choose one and try to
> > learn it; there is no sense in which X is better or Y is better.
> > It is up to you.
> >
> > Thanks,
> > Sam
> >
> > From: Sebastiano Di Paola 
> > Reply-To: "user@hadoop.apache.org" 
> > Date: Wednesday, August 13, 2014 at 6:28 PM
> > To: "user@hadoop.apache.org"
> > Subject: Re: Started learning Hadoop. Which distribution is best for
> > native install in pseudo distributed mode?
> >
> > Hi,
> > I'm a newbie too and I'm not using any particular distribution. I just
> > download the components I need or want to try for my deployment and use
> > them.
> >
> > It's a slow process, but it allows me to better understand what I'm
> > doing under the hood.
> >
> > Regards,
> > Seba
> >
> >
> >
> > On Tue, Aug 12, 2014 at 10:12 PM, mani kandan wrote:
> >
> > > Which distribution are you people using? Cloudera vs Hortonworks vs
> > > BigInsights?
> >
> >
> >
> >
> >
> >
>


Re: Managed File Transfer

2014-07-09 Thread Mohan Radhakrishnan
I am a beginner. But this seems to be similar to what I intend. The data
source will be external FTP or S3 storage.

"Spark Streaming can read data from HDFS
<http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html>,
Flume <http://flume.apache.org/>, Kafka <http://kafka.apache.org/>, Twitter
<https://dev.twitter.com/> and ZeroMQ <http://zeromq.org/>. You can also
define your own custom data sources."

Thanks,
Mohan


On Wed, Jul 9, 2014 at 2:09 PM, Stanley Shi  wrote:

> There's the DistCp utility for this kind of purpose.
> There's also Spring XD, but I am not sure whether you want to use it.
>
> Regards,
> *Stanley Shi,*
>
>
>
> On Mon, Jul 7, 2014 at 10:02 PM, Mohan Radhakrishnan <
> radhakrishnan.mo...@gmail.com> wrote:
>
>> Hi,
>>    We used a commercial file transfer (FT) and scheduler tool in
>> clustered mode. This was a traditional active-active cluster that
>> supported multiple protocols such as FTPS.
>>
>> Now I am interested in evaluating a distributed way of crawling FTP
>> sites and downloading files using Hadoop. Since we have to process
>> thousands of files, I thought Hadoop jobs could do it.
>>
>> Are Hadoop jobs used for this type of file transfer?
>>
>> There is also a requirement for a scheduler. What does the forum
>> recommend?
>>
>>
>> Thanks,
>> Mohan
>>
>
>
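
For reference, DistCp is invoked as "hadoop distcp <source> <target>" and performs the copy as a MapReduce job. The following is a minimal Python sketch that builds and launches such a command. The URIs are hypothetical placeholders, the helper names are made up for illustration, and whether an ftp:// or s3n:// source works depends on the file system implementations and credentials configured on the cluster.

```python
import subprocess


def distcp_command(source_uri, target_uri, max_maps=None):
    """Build the argument list for a `hadoop distcp` invocation."""
    cmd = ["hadoop", "distcp"]
    if max_maps is not None:
        # -m caps the number of simultaneous map tasks used for copying
        cmd += ["-m", str(max_maps)]
    cmd += [source_uri, target_uri]
    return cmd


def run_distcp(source_uri, target_uri, max_maps=None):
    """Run DistCp and raise CalledProcessError if the copy job fails."""
    subprocess.run(distcp_command(source_uri, target_uri, max_maps), check=True)


if __name__ == "__main__":
    # Hypothetical source and target; replace with real locations.
    print(" ".join(distcp_command(
        "s3n://example-bucket/incoming/",
        "hdfs://namenode:8020/data/incoming/",
        max_maps=20,
    )))
```

DistCp only copies; recurring transfers would still need a separate scheduler to call something like run_distcp at the required times.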


Managed File Transfer

2014-07-07 Thread Mohan Radhakrishnan
Hi,
   We used a commercial file transfer (FT) and scheduler tool in clustered
mode. This was a traditional active-active cluster that supported multiple
protocols such as FTPS.

Now I am interested in evaluating a distributed way of crawling FTP
sites and downloading files using Hadoop. Since we have to process
thousands of files, I thought Hadoop jobs could do it.

Are Hadoop jobs used for this type of file transfer?

There is also a requirement for a scheduler. What does the forum
recommend?


Thanks,
Mohan


Re: Practical examples

2014-04-28 Thread Mohan Radhakrishnan
I am interested in ML, but I want a Hadoop base because I am learning
Hadoop. Mahout seems to be about ML at this time, not Hadoop.

Thanks,
Mohan


On Tue, Apr 29, 2014 at 7:38 AM, Shahab Yunus wrote:

> For machine-learning-based applications of Hadoop, you can check out the
> Mahout framework.
>
> Regards,
> Shahab
>
>
> On Mon, Apr 28, 2014 at 10:02 PM, Mohan Radhakrishnan <
> radhakrishnan.mo...@gmail.com> wrote:
>
>> Hi,
>>    I have been reading the Definitive Guide and taking online
>> courses. Now I would like to understand how Hadoop is used for more
>> real-time scenarios. Are machine learning, language processing and fraud
>> detection examples available? What are the other practical use cases?
>>
>> I am familiar with machine learning and use a single node on an 8 GB
>> machine.
>>
>> Thanks,
>> Mohan
>>
>
>


Practical examples

2014-04-28 Thread Mohan Radhakrishnan
Hi,
   I have been reading the Definitive Guide and taking online courses.
Now I would like to understand how Hadoop is used for more real-time
scenarios. Are machine learning, language processing and fraud detection
examples available? What are the other practical use cases?

I am familiar with machine learning and use a single node on an 8 GB machine.

Thanks,
Mohan


Re: calling mapreduce from webservice

2014-04-18 Thread Mohan Radhakrishnan
Play framework is reactive and uses push channels. It may be useful here if
the UI has to be asynchronous and reactive.

Mohan


On Sat, Apr 19, 2014 at 4:37 AM, Shahab Yunus wrote:

> As far as I know there is no API to kick off M/R jobs. For M/R v2 there
> is a REST API to get the status of jobs:
> http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html#Mapreduce_Application_Master_Info_API
>
> I would say that to invoke M/R jobs from your middle tier or back end,
> you have to implement a custom solution, i.e. invoke the M/R jobs in the
> standard way, monitor the status of the job, and then update the UI
> asynchronously, depending on which UI framework or web service
> implementation (e.g. WS-Addressing) you are using.
>
> Regards,
> Shahab
>
>
> On Fri, Apr 18, 2014 at 3:11 PM, girish hilage wrote:
>
>> Yes. I intend to run the jobs asynchronously and show the status of the
>> user-submitted job as "running/completed" etc., and the user will be able
>> to submit new jobs simultaneously. I have not checked PigLipstick though.
>>
>> Regards,
>> Girish
>>
>>   On Saturday, April 19, 2014 12:34 AM, Shahab Yunus <
>> shahab.yu...@gmail.com> wrote:
>> Question: M/R jobs are supposed to run for a long time. They are
>> essentially batch processes. Do you plan to keep the Web UI blocked for
>> that long? Or are you looking for asynchronous invocation of the M/R job?
>> Or are you thinking about building a sort of Admin UI (e.g. PigLipstick)?
>> What exactly is your requirement?
>>
>> Regards,
>> Shahab
>>
>>
>> On Fri, Apr 18, 2014 at 3:01 PM, girish hilage 
>> wrote:
>>
>> Hi,
>>
>>    This is just to check with you whether it is possible to call MR jobs
>> from Java web services.
>>    If yes, could you please help me by pointing to some resources/docs?
>>
>>    What I actually intend to do is create a Web UI with some
>> functionality that would call MR jobs and present the results to the
>> user in the browser.
>>
>> Regards,
>> Girish
>>
>>
>>
>>
>>
>
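
The status monitoring described in this thread can be sketched against the MapReduce Application Master REST API linked in Shahab's reply. This sketch assumes the jobs resource at /proxy/{appid}/ws/v1/mapreduce/jobs on the web proxy and a JSON body shaped like {"jobs": {"job": [...]}}, as that documentation describes. The host, port and ids are placeholders and the helper names are hypothetical.

```python
import json
import urllib.request


def jobs_url(proxy_address, app_id):
    """URL of the MR Application Master 'jobs' resource behind the web proxy."""
    return "http://{}/proxy/{}/ws/v1/mapreduce/jobs".format(proxy_address, app_id)


def summarize_jobs(payload):
    """Map job id -> state from a decoded 'jobs' response."""
    jobs = (payload.get("jobs") or {}).get("job", [])
    return {job["id"]: job["state"] for job in jobs}


def fetch_job_states(proxy_address, app_id, timeout=10):
    """Fetch and summarize job states; the AM only answers while the job runs."""
    request = urllib.request.Request(
        jobs_url(proxy_address, app_id), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarize_jobs(json.loads(response.read().decode("utf-8")))


if __name__ == "__main__":
    # Offline demonstration with a made-up payload; no cluster is contacted.
    sample = {"jobs": {"job": [{"id": "job_1326232085508_0004", "state": "RUNNING"}]}}
    print(summarize_jobs(sample))
```

A web tier could call fetch_job_states periodically from a background task and push the result to the browser, which keeps the UI from blocking while the job runs.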


2-node cluster

2014-04-14 Thread Mohan Radhakrishnan
Hi,
I have 2 nodes: one runs OS X and the other runs Linux. How is a
distributed cluster installed in this case? What other networking
equipment do I need?

Can I ask for pointers to instructions? I am new.

Thanks,
Mohan


Hadoop distribution(2-node cluster)

2014-04-13 Thread Mohan Radhakrishnan
Hi,
As the subject implies, I have 2 nodes: one runs OS X and the other runs
Linux. How is a distributed cluster installed in this case? What other
networking equipment do I need?

Thanks,
Mohan