Re: Spark vs Tez
Is Tez's architecture similar to Akka's distributed architecture? I think I remember Jonas Bonér mentioning, during a presentation on distributed computing, Akka's support for protocols like Raft. What makes Tez more scalable in this regard?

Thanks, Mohan

On Sun, Oct 19, 2014 at 5:26 PM, Niels Basjes ni...@basjes.nl wrote:
Very interesting! What makes Tez more scalable than Spark? What architectural feature makes the difference?
Niels Basjes

On Oct 19, 2014 3:07 AM, Jeff Zhang zjf...@gmail.com wrote:
Tez has a feature called pre-warm, which launches JVMs before you use them, and you can reuse the containers afterwards. So it is also suitable for interactive queries, and it is more stable and scalable than Spark IMO.

On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes ni...@basjes.nl wrote:
It is my understanding that one of the big differences between Tez and Spark is that a Tez-based query still has the startup overhead of starting JVMs on the YARN cluster. Spark-based queries are executed immediately on already running JVMs. So for interactive dashboards Spark seems more suitable. Did I understand correctly?
Niels Basjes

On Oct 17, 2014 8:30 PM, Gavin Yue yue.yuany...@gmail.com wrote:
Spark and Tez both make MR faster; there is no doubt about that. They also provide new features like DAG execution, which is quite important for interactive query processing. From this perspective, you could view them as wrappers around MR that try to handle the intermediate buffers (files) more efficiently, which is a big pain in MR. They also both try to use memory as the buffer instead of only the filesystem. Spark has the concept of the RDD, which is quite interesting but also limited.

On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote:
It was my understanding that Spark is for faster batch processing. Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct?
B.

*From:* Shahab Yunus shahab.yu...@gmail.com
*Sent:* Friday, October 17, 2014 1:12 PM
*To:* user@hadoop.apache.org
*Subject:* Re: Spark vs Tez
What aspects of Tez and Spark are you comparing? They have different purposes and are thus not directly comparable, as far as I understand.
Regards, Shahab

On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote:
Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why.
B.

-- Best Regards Jeff Zhang
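To make the pre-warm and container-reuse idea from Jeff's reply a bit more concrete, here is a minimal Python sketch. It is only an analogy, not Tez or YARN code: the `Container` and `ContainerPool` classes, the `timed_first_query` helper, and the 0.05-second startup delay are all hypothetical names and values chosen for illustration. The point it tries to show is that a pool which launches its workers ahead of time pays the startup cost before the first query, and that handing workers back to the pool avoids paying it again.

```python
import time


class Container:
    """Hypothetical stand-in for a YARN container running a JVM."""

    STARTUP_SECONDS = 0.05  # assumed launch cost, chosen only for illustration

    def __init__(self):
        time.sleep(self.STARTUP_SECONDS)  # simulate paying the startup cost once

    def run(self, task):
        return task()


class ContainerPool:
    """Keeps idle containers around so later tasks can reuse them."""

    def __init__(self):
        self._idle = []
        self.launched = 0

    def _launch(self):
        self.launched += 1
        return Container()

    def prewarm(self, count):
        """Launch containers before any query arrives."""
        for _ in range(count):
            self._idle.append(self._launch())

    def run(self, task):
        # Reuse an idle container if one exists; otherwise pay the launch cost now.
        container = self._idle.pop() if self._idle else self._launch()
        try:
            return container.run(task)
        finally:
            self._idle.append(container)  # hand the container back for reuse


def timed_first_query(pool):
    """Return how long the first task takes on the given pool."""
    start = time.perf_counter()
    pool.run(lambda: sum(range(1000)))
    return time.perf_counter() - start


cold = ContainerPool()
warm = ContainerPool()
warm.prewarm(2)  # startup cost is paid here, before the first query

print(f"cold first query: {timed_first_query(cold):.3f}s")
print(f"warm first query: {timed_first_query(warm):.3f}s")
```

In this toy model the cold pool pays the launch delay on its first query while the pre-warmed pool does not, and neither pool launches anything new for later queries. Whether that translates into better stability or scalability on a real cluster is a separate question that the sketch does not answer.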
Re: Spark vs Tez
I remember Spark uses Akka clusters. Isn't that totally different from other distributed technologies?

Thanks, Mohan

On Sat, Oct 18, 2014 at 1:52 PM, Niels Basjes ni...@basjes.nl wrote:
It is my understanding that one of the big differences between Tez and Spark is that a Tez-based query still has the startup overhead of starting JVMs on the YARN cluster. Spark-based queries are executed immediately on already running JVMs. So for interactive dashboards Spark seems more suitable. Did I understand correctly?
Niels Basjes

On Oct 17, 2014 8:30 PM, Gavin Yue yue.yuany...@gmail.com wrote:
Spark and Tez both make MR faster; there is no doubt about that. They also provide new features like DAG execution, which is quite important for interactive query processing. From this perspective, you could view them as wrappers around MR that try to handle the intermediate buffers (files) more efficiently, which is a big pain in MR. They also both try to use memory as the buffer instead of only the filesystem. Spark has the concept of the RDD, which is quite interesting but also limited.

On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote:
It was my understanding that Spark is for faster batch processing. Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct?
B.

*From:* Shahab Yunus shahab.yu...@gmail.com
*Sent:* Friday, October 17, 2014 1:12 PM
*To:* user@hadoop.apache.org
*Subject:* Re: Spark vs Tez
What aspects of Tez and Spark are you comparing? They have different purposes and are thus not directly comparable, as far as I understand.
Regards, Shahab

On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote:
Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why.
B.
Re: Hadoop and Open Data (CKAN.org).
I understand that coding MR jobs in a programming language is required, but if we are just processing large amounts of data (machine learning, for example), we could use Pig. I recently processed 0.25 TB on AWS clusters in a reasonably short time. In this case the development effort is very low.

Thanks, Mohan

On Thu, Sep 4, 2014 at 6:41 PM, Alec Ten Harmsel a...@alectenharmsel.com wrote:
I would recommend using Hadoop only if you are ingesting a lot of data and you need reasonable performance at scale. I would recommend starting with (insert language/tool of choice) to ingest and transform data until that process starts taking too long. For example, one of our researchers at the University of Michigan had to process ~150 GB of data. Using Python, processing that data took about 45 minutes, so it was not worth spending extra development time to run it on Hadoop. This time will change depending on what you need to do and the hardware available, naturally. So until you need to frequently process large amounts of data, I'd stick with something you're already familiar with.
Alec Ten Harmsel

On 09/04/2014 03:30 AM, Henrik Aagaard Jørgensen wrote:
Dear all, I’m very new to Hadoop as I’m still trying to grasp its value and purpose. I do hope my question on this mailing list is OK. I manage the open data platform at our municipality, using CKAN.org. It works very well for its purpose of showing data and adding APIs to data. However, I’m very interested in knowing more about Hadoop and whether it would fit into an (open) data platform, as we are getting more and more data to show and to work with internally at our municipality. However, I cannot figure out if it’s the right purpose to use Hadoop for, if it is “overkill” or… Could someone elaborate on this topic? I’ve Googled around a lot and looked at various videos online, and Hadoop seems to have its place, also in an open data platform environment.
Best regards, Henrik
Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?
Actually there was another thread about using MR for ML but I didn't see many responses. I use Octave or R for this but it would be useful to know how this is solved using Hadoop. The closest community that has an interest in this could be H2O, but as I understand it they have implemented MR in their own engine to solve these problems. So we may be able to look at their code, but that could be tedious.
Mohan

On Thu, Aug 14, 2014 at 3:35 PM, Kai Wähner megachu...@gmail.com wrote:
As a beginner, it depends on what you want to learn. Do you want to program MapReduce, just run some SQL queries against Hadoop, or install, deploy and monitor a Hadoop cluster? This article might help in making a good decision: "Spoilt for choice - how to choose the right Hadoop distribution" http://www.infoq.com/articles/BigDataPlatform
Kai
Sent from my iPhone

On 14.08.2014, at 11:58, Chris MacKenzie stu...@chrismackenziephotography.co.uk wrote:
Hi, I have been using Hadoop loosely since Christmas, and since May for a Software Engineering MSc at Heriot-Watt University in Edinburgh, Scotland. I have written a genetic sequence alignment algorithm. I have installed Hadoop in various places, including a 32-node cluster, and am using Eclipse Kepler SR2 as an IDE. My current Hadoop version is 2.4.1, which I downloaded as a tar from the Apache mirror servers. It's been a tough learning curve, but that has made the learning all the more valuable. I believe using the straight Hadoop version has given insights that proprietary builds wouldn't have. There are so many confusing issues that crop up that it's easy to attach importance to fixing an error which masks another. With the proprietary versions it would be easy to attach blame where it's not this build's or that build's fault. Go with your heart but be prepared to work to solve the problems you encounter.
Buy Tom White's book. It isn't perfect and it is a couple of years out of date, but it gives you enough detail and structure to build an impression you can work from. The downloadable source code is a great help when trying to get started. Good luck.
Regards, Chris MacKenzie
telephone: 0131 332 6967
email: stu...@chrismackenziephotography.co.uk
corporate: www.chrismackenziephotography.co.uk
http://www.chrismackenziephotography.co.uk/
http://plus.google.com/+ChrismackenziephotographyCoUk/posts
http://www.linkedin.com/in/chrismackenziephotography/

From: Adaryl "Bob" Wakefield, MBA adaryl.wakefi...@hotmail.com
Reply-To: user@hadoop.apache.org
Date: Thursday, 14 August 2014 01:13
To: user@hadoop.apache.org
Subject: Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?
He didn't ask for the best and nobody framed their answer like that. He asked what people were using. Out of the 10 responses only four actually answered his question. I've been studying Hadoop for two months straight. Quite frankly, I wish more people would ask for community input on what does what and how.
Adaryl Bob Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Kilaru, Sambaiah mailto:sambaiah_kil...@intuit.com
Sent: Wednesday, August 13, 2014 1:10 PM
To: user@hadoop.apache.org
Subject: Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?
There are enough wars going on over which is best. Choose one and try to learn it; there is no sense in which X is better or Y is better. It is up to you.
Thanks, Sam

From: Sebastiano Di Paola sebastiano.dipa...@gmail.com
Reply-To: user@hadoop.apache.org user@hadoop.apache.org
Date: Wednesday, August 13, 2014 at 6:28 PM
To: user@hadoop.apache.org user@hadoop.apache.org
Subject: Re: Started learning Hadoop. Which distribution is best for native install in pseudo distributed mode?
Hi, I'm a newbie too and I'm not using any particular distribution. I just download the components I need or want to try for my deployment and use them. It's a slow process, but it allows me to better understand what I'm doing under the hood.
Regards, Seba

On Tue, Aug 12, 2014 at 10:12 PM, mani kandan mankand...@gmail.com wrote:
Which distribution are you people using? Cloudera vs Hortonworks vs BigInsights?
Re: Managed File Transfer
I am a beginner, but this seems to be similar to what I intend. The data source will be external FTP or S3 storage. Spark Streaming can read data from HDFS (http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html), Flume (http://flume.apache.org/), Kafka (http://kafka.apache.org/), Twitter (https://dev.twitter.com/) and ZeroMQ (http://zeromq.org/). You can also define your own custom data sources.

Thanks, Mohan

On Wed, Jul 9, 2014 at 2:09 PM, Stanley Shi s...@gopivotal.com wrote:
There's the DistCp utility for this kind of purpose. There is also Spring XD, but I am not sure whether you want to use it.
Regards, *Stanley Shi,*

On Mon, Jul 7, 2014 at 10:02 PM, Mohan Radhakrishnan radhakrishnan.mo...@gmail.com wrote:
Hi, We used a commercial file transfer (FT) and scheduler tool in clustered mode. This was a traditional active-active cluster that supported multiple protocols like FTPS. Now I am interested in evaluating a distributed way of crawling FTP sites and downloading files using Hadoop. Since we have to process thousands of files, I thought Hadoop jobs could do it. Are Hadoop jobs used for this type of file transfer? Moreover, there is also a requirement for a scheduler. What is the recommendation of the forum?
Thanks, Mohan
Managed File Transfer
Hi, We used a commercial file transfer (FT) and scheduler tool in clustered mode. This was a traditional active-active cluster that supported multiple protocols like FTPS. Now I am interested in evaluating a distributed way of crawling FTP sites and downloading files using Hadoop. Since we have to process thousands of files, I thought Hadoop jobs could do it. Are Hadoop jobs used for this type of file transfer? Moreover, there is also a requirement for a scheduler. What is the recommendation of the forum?
Thanks, Mohan
Practical examples
Hi, I have been reading the Definitive Guide and taking online courses. Now I would like to understand how Hadoop is used for more real-time scenarios. Are machine learning, language processing and fraud detection examples available? What are the other practical use cases? I am familiar with machine learning and use a single node on an 8 GB machine.
Thanks, Mohan
Re: Practical examples
I am interested in ML but I want a Hadoop base because I am learning Hadoop. Mahout seems to be for ML at this time, not Hadoop.

Thanks, Mohan

On Tue, Apr 29, 2014 at 7:38 AM, Shahab Yunus shahab.yu...@gmail.com wrote:
For machine learning based applications of Hadoop you can check out the Mahout framework.
Regards, Shahab

On Mon, Apr 28, 2014 at 10:02 PM, Mohan Radhakrishnan radhakrishnan.mo...@gmail.com wrote:
Hi, I have been reading the Definitive Guide and taking online courses. Now I would like to understand how Hadoop is used for more real-time scenarios. Are machine learning, language processing and fraud detection examples available? What are the other practical use cases? I am familiar with machine learning and use a single node on an 8 GB machine.
Thanks, Mohan
Re: calling mapreduce from webservice
The Play framework is reactive and uses push channels. It may be useful here if the UI has to be asynchronous and reactive.
Mohan

On Sat, Apr 19, 2014 at 4:37 AM, Shahab Yunus shahab.yu...@gmail.com wrote:
As far as I know there is no API to kick off M/R jobs. For M/R v2 there is a REST API to get the status of jobs: http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html#Mapreduce_Application_Master_Info_API
I would say that if you have to invoke M/R jobs in your middle tier or back-end, you have to implement a custom solution, i.e. invoke the M/R jobs in the standard way, monitor the status of each job, and then update the UI asynchronously depending on which UI framework or web service implementation (e.g. WS-Addressing) you are using.
Regards, Shahab

On Fri, Apr 18, 2014 at 3:11 PM, girish hilage girish_hil...@yahoo.com wrote:
Yes. I intend to run the jobs asynchronously and show the status of each user-submitted job as running/completed etc., and the user will be able to submit new jobs simultaneously. I have not checked PigLipStick though.
Regards, Girish

On Saturday, April 19, 2014 12:34 AM, Shahab Yunus shahab.yu...@gmail.com wrote:
Question: M/R jobs are supposed to run for a long time. They are essentially batch processes. Do you plan to keep the Web UI blocked for that long? Or are you looking for asynchronous invocation of the M/R job? Or are you thinking about building a sort of Admin UI (e.g. PigLipstick)? What exactly is your requirement?
Regards, Shahab

On Fri, Apr 18, 2014 at 3:01 PM, girish hilage girish_hil...@yahoo.com wrote:
Hi, This is just to check with you whether it is possible to call MR jobs from Java web services. If yes, could you please help me by pointing to some resources/docs? What I intend to do is create a Web UI with some functionality that would call MR jobs and present the result to the user in the browser.
Regards, Girish
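The pattern Shahab describes (start the job from the back-end, return to the caller immediately, and let the UI poll for status) can be sketched in a few lines. The following is a minimal Python sketch under stated assumptions, not a Hadoop API: `JobRegistry` and `fake_mapreduce_job` are hypothetical names, and the fake job merely stands in for whatever code would really launch an M/R job and wait for it. A real web service would call `submit` from one request handler and `status` from another.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobRegistry:
    """Tracks background jobs so a web handler can return immediately."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def submit(self, launch_job, *args):
        """Start launch_job in the background and return a job id for polling."""
        job_id = uuid.uuid4().hex
        self._futures[job_id] = self._pool.submit(launch_job, *args)
        return job_id

    def status(self, job_id):
        """Report RUNNING, COMPLETED or FAILED without blocking the caller."""
        future = self._futures[job_id]
        if not future.done():
            return "RUNNING"
        return "FAILED" if future.exception() is not None else "COMPLETED"

    def result(self, job_id, timeout=None):
        """Block until the job finishes and return its result."""
        return self._futures[job_id].result(timeout=timeout)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def fake_mapreduce_job(words):
    """Hypothetical stand-in for launching a real M/R job and waiting on it."""
    time.sleep(0.1)  # pretend the batch job takes a while
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


registry = JobRegistry()
job_id = registry.submit(fake_mapreduce_job, ["a", "b", "a"])
print(registry.status(job_id))   # status right after submission
print(registry.result(job_id))   # waits for the job to finish
print(registry.status(job_id))   # status after completion
registry.shutdown()
```

The registry keeps everything in memory, so job state is lost if the web service restarts; persisting job ids and statuses somewhere durable would be the obvious next step, and it is left out here to keep the sketch short.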
Hadoop distribution(2-node cluster)
Hi, As the subject implies, I have 2 nodes: one runs OS X and the other Linux. How is a distributed cluster installed in this case? What other networking equipment do I need?
Thanks, Mohan
2-node cluster
Hi, I have 2 nodes: one runs OS X and the other Linux. How is a distributed cluster installed in this case? What other networking equipment do I need? Can I ask for pointers to instructions? I am new.
Thanks, Mohan