Re: AUTO: Yuan Jin is out of the office. (returning 08/13/2012)

2012-08-09 Thread Gavin Yue
Thank you for letting us know your schedule...


On Thu, Aug 9, 2012 at 4:07 PM, Yuan Jin  wrote:

> I am out of the office until 08/13/2012.
>
> I am out of office.
>
> For HAMSTER related things, you can contact Jason (Deng Peng Zhou/China/IBM)
> For CFM related things, you can contact Daniel (Liang SH Su/China/Contr/IBM)
> For TMB related things, you can contact Flora (Jun Ying Li/China/IBM)
> For TWB related things, you can contact Kim (Yuan SH Jin/China/IBM)
> For others, I will reply to you when I am back.
>
>
> Note: This is an automated response to your message  *"Re: Apache Hadoop
> 0.23.1 Source Build Failing"* sent on *10/08/2012 0:46:02*.
>
> This is the only notification you will receive while this person is away.
>


Re: AUTO: Yuan Jin is out of the office. (returning 09/11/2012)

2012-09-09 Thread Gavin Yue
My heart is deeply hurt by this message...
So sorry that you are out of your office again...


On Sun, Sep 9, 2012 at 10:13 PM, Yuan Jin  wrote:

> I am out of the office until 09/11/2012.
>
> I am out of office.
>
> For HAMSTER related things, you can contact Jason (Deng Peng Zhou/China/IBM)
> For CFM related things, you can contact Daniel (Liang SH Su/China/Contr/IBM)
> For TMB related things, you can contact Flora (Jun Ying Li/China/IBM)
> For TWB related things, you can contact Kim (Yuan SH Jin/China/IBM)
> For others, I will reply to you when I am back.
>
>
> Note: This is an automated response to your message  *"Re: Can't get out
> of safemode"* sent on *10/09/2012 6:39:12*.
>
> This is the only notification you will receive while this person is away.
>


Re: Big Data tech stack (was Spark vs. Storm)

2014-07-02 Thread Gavin Yue
Isn't this what YARN and Mesos are trying to do? Separate resource
management from the applications, and run whatever is suitable on top of
them. Spark can also run on YARN or Mesos. Spark was designed for
iteration-intensive computing such as machine learning algorithms.

Storm is quite different. It is not designed for big data stored on disk;
it was inspired by streaming data such as tweets. MapReduce/HDFS, on the
other hand, was originally designed to process stored web pages to build a
search index.

Hadoop is on its way to becoming a generic Big Data analysis framework.
Hortonworks and Cloudera are trying to make it much easier to manage and
deploy.
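
As a concrete illustration of that separation: submitting a Spark job onto
a YARN-managed cluster is a one-liner, with YARN owning the resources and
Spark running as just another application on top. A sketch only (the class
and jar names are hypothetical; assumes a Spark 1.x build with YARN support):

spark-submit --master yarn-cluster --num-executors 4 \
  --class com.example.MyJob myjob.jar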



On Wed, Jul 2, 2014 at 4:25 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefi...@hotmail.com> wrote:

>   You know what I’m really trying to do? I’m trying to come up with a
> best practice technology stack. There are so many freaking projects it is
> overwhelming. If I were to walk into an organization that had no Big Data
> capability, what mix of projects would be best to implement based on
> performance, scalability and ease of use/implementation? So far I’ve got:
> Ubuntu
> Hadoop
> Cassandra (Seems to be the highest performing NoSQL database out there.)
> Storm (maybe?)
> Python (Easier than Java. Maybe that shouldn’t be a concern.)
> Hive (For people to leverage their existing SQL skillset.)
>
> That would seem to cover transaction processing and warehouse storage, and
> the capability to do batch and real-time analysis. What am I leaving out, or
> what do I have incorrect in my assumptions?
>
> B.
>
>
>
>  *From:* Stephen Boesch 
> *Sent:* Wednesday, July 02, 2014 3:07 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs. Storm
>
>  Spark Streaming discretizes the stream into configurable intervals of no
> less than 500 milliseconds, so it is not appropriate for true real-time
> processing. If you need to capture events in the low hundreds of
> milliseconds or less, then stick with Storm (at least for now).
>
> If you can afford a second or more of latency, then Spark provides the
> advantage of interoperability with the other Spark components and
> capabilities.
>
>
> 2014-07-02 12:59 GMT-07:00 Shahab Yunus :
>
>> Not exactly. There are of course major implementation differences, and
>> some subtle, high-level ones too.
>>
>> My 2-cents:
>>
>> Spark is in-memory M/R, and it simulates streaming or real-time
>> distributed processing of large datasets by micro-batching. The gain in
>> speed and performance over the batch paradigm comes from in-memory
>> buffering or batching (and I am being a bit naive/crude in this
>> explanation.)
>>
>> Storm, on the other hand, supports stream processing even at the
>> single-record level (a record is known as a tuple in its lingo). You can
>> do micro-batching on top of it as well (using the Trident API, which is
>> also good for state maintenance, if your BL requires that). This is more
>> applicable where you want control at the single-record level rather than
>> over a set, collection, or batch of records.
>>
>> Having said that, Spark Streaming is trying to simulate Storm's extremely
>> granular approach, but as far as I recall, it is still built on top of core
>> Spark (basically another level of abstraction over core Spark constructs.)
>>
>> So given this, you can pick the framework which is more attuned to your
>> needs.
>>
>>
>> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefi...@hotmail.com> wrote:
>>
>>>   Do these two projects do essentially the same thing? Is one better
>>> than the other?
>>>
>>
>>
>
>
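
Stephen's point about the minimum batch interval is easy to see in the
Spark 1.x Java streaming API: the latency floor is literally a constructor
argument. A minimal sketch (the host, port, and toy job are hypothetical,
not from the thread):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchSketch {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("micro-batch-sketch");
    // The batch interval is the knob discussed above: every DStream
    // operation fires once per interval, so end-to-end latency can never
    // drop below it (500 ms here, the documented floor).
    JavaStreamingContext ssc =
        new JavaStreamingContext(conf, Durations.milliseconds(500));
    JavaReceiverInputDStream<String> lines =
        ssc.socketTextStream("localhost", 9999);
    lines.count().print();  // one count per 500 ms micro-batch
    ssc.start();
    ssc.awaitTermination();
  }
}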


Re: Hadoop virtual machine

2014-07-05 Thread Gavin Yue
http://hortonworks.com/products/hortonworks-sandbox/

or

CDH5
http://www.cloudera.com/content/cloudera-content/cloudera-docs/DemoVMs/Cloudera-QuickStart-VM/cloudera_quickstart_vm.html



On Sat, Jul 5, 2014 at 11:27 PM, Manar Elkady 
wrote:

> Hi,
>
> I am a newcomer to Hadoop, and I have read many online tutorials for
> setting up Hadoop on Windows using virtual machines, but all of them link
> to old versions of Hadoop virtual machines.
> Could anyone help me find a Hadoop virtual machine that includes a newer
> version of Hadoop? Or should I build one myself from scratch?
> Also, any well-explained Hadoop installation tutorials and any other
> helpful material are appreciated.
>
>
> Manar,
>
>


Re: Spark vs Tez

2014-10-17 Thread Gavin Yue
Spark and Tez both make MR faster; there is no doubt about that.

They also provide new features like DAG execution, which is quite important
for interactive query processing. From this perspective, you could view
them as wrappers around MR that try to handle the intermediate buffers
(files) more efficiently, which is a big pain point in MR.

They also both try to use memory as the buffer instead of only the
filesystem. Spark has a concept called the RDD, which is quite interesting
but also limited.
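
The in-memory point is easiest to see with an RDD that is cached once and
reused across iterations; a chain of MR jobs would rewrite that intermediate
data to disk on every pass. A minimal sketch against the Spark Java API
(assumes Java 8; the path and the toy filter are hypothetical):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCacheSketch {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("rdd-cache"));
    // cache() pins the dataset in memory, so each iteration below reads
    // it from RAM instead of re-materializing intermediate files the way
    // a chain of MR jobs would.
    JavaRDD<String> data = sc.textFile("hdfs:///tmp/input").cache();
    for (int i = 0; i < 10; i++) {
      long hits = data.filter(line -> line.contains("error")).count();
      System.out.println("iteration " + i + ": " + hits);
    }
    sc.stop();
  }
}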



On Fri, Oct 17, 2014 at 11:23 AM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefi...@hotmail.com> wrote:

>   It was my understanding that Spark is for faster batch processing. Tez is
> the new execution engine that replaces MapReduce and is also supposed to
> speed up batch processing. Is that not correct?
> B.
>
>
>
>  *From:* Shahab Yunus 
> *Sent:* Friday, October 17, 2014 1:12 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs Tez
>
>  What aspects of Tez and Spark are you comparing? They have different
> purposes and thus are not directly comparable, as far as I understand.
>
> Regards,
> Shahab
>
> On Fri, Oct 17, 2014 at 2:06 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefi...@hotmail.com> wrote:
>
>>   Does anybody have any performance figures on how Spark stacks up
>> against Tez? If you don’t have figures, does anybody have an opinion? Spark
>> seems so popular but I’m not really seeing why.
>> B.
>>
>
>


Re: Hadoop Learning Environment

2014-11-04 Thread Gavin Yue
Try docker!

http://ferry.opencore.io/en/latest/examples/hadoop.html
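
For a throwaway single-node playground, a community Docker image also
works. A sketch, assuming the sequenceiq image that was popular around this
time (the image name, tag, and bootstrap command are assumptions based on
that project's README, not from the ferry docs above):

docker run -it sequenceiq/hadoop-docker:2.6.0 /etc/bootstrap.sh -bash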



On Tue, Nov 4, 2014 at 6:36 PM, jay vyas 
wrote:

> Hi daemeon: Actually, for most folks who want to actually use a
> Hadoop cluster, I would think setting up Bigtop is super easy! If you
> have issues with it, ping me and I can help you get started.
> Also, we have Docker containers - so you don't even *need* a VM to run a
> 4- or 5-node Hadoop cluster.
>
> install vagrant
> install VirtualBox
> git clone https://github.com/apache/bigtop
> cd bigtop/bigtop-deploy/vm/vagrant-puppet
> vagrant up
> Then run vagrant destroy when you're done.
>
> This, to me, is easier than manually downloading an appliance, picking
> memory, starting the VirtualBox GUI, loading the appliance, etc. Also, it's
> easy to turn the simple single-node Bigtop VM into a multi-node one just by
> modifying the Vagrantfile.
>
>
> On Tue, Nov 4, 2014 at 5:32 PM, daemeon reiydelle 
> wrote:
>
>> What you want as a sandbox depends on what you are trying to learn.
>>
>> If you are trying to learn to code in e.g. Pig Latin, Sqoop, or similar,
>> all of the suggestions (perhaps excluding Bigtop due to its setup
>> complexities) are great. A laptop? Perhaps, but laptops are really kind of
>> infuriatingly slow (because of the hardware: you pay a price for a
>> 30-45 watt average power budget). A laptop is an OK place to start if it is
>> e.g. an i5 or i7 with lots of memory. What do you think of the thought that
>> you will pretty quickly graduate to wanting a smallish desktop for your
>> sandbox?
>>
>> A simple, single-node Hadoop instance will let you learn many things.
>> The next level of complexity comes when you are dealing with data whose
>> processing needs to be split up, so you can learn how to split data in the
>> map phase, reduce the splits via reduce jobs, etc. For that, you could get
>> a Windows desktop box, or e.g. RedHat/CentOS, and use virtualization.
>> Something like a 4-core i5 with 32 GB of memory, running 3 (or for some
>> things 4) VMs. You could load e.g. Hortonworks into each of the VMs and
>> practice setting up a 3/4-way cluster. Throw in 2-3 1 TB drives off of
>> eBay and you can get a lot of learning done.
>>
>>
>> *“The race is not to the swift, nor the battle to the strong, but to
>> those who can see it coming and jump aside.” - Hunter Thompson*
>> *Daemeon*
>> On Tue, Nov 4, 2014 at 1:24 PM, oscar sumano  wrote:
>>
>>> you can try the pivotal vm as well.
>>>
>>>
>>> http://pivotalhd.docs.pivotal.io/tutorial/getting-started/pivotalhd-vm.html
>>>
>>> On Tue, Nov 4, 2014 at 3:13 PM, Leonid Fedotov wrote:
>>>
 Tim,
 download Sandbox from http://hortonworks.com
 You will have everything needed in a small VM instance which will run
 on your home desktop.


 *Thank you!*


 *Sincerely,*

 *Leonid Fedotov*

 Systems Architect - Professional Services

 lfedo...@hortonworks.com

 office: +1 855 846 7866 ext 292

 mobile: +1 650 430 1673

 On Tue, Nov 4, 2014 at 11:28 AM, Tim Dunphy 
 wrote:

> Hey all,
>
>  I want to setup an environment where I can teach myself hadoop.
> Usually the way I'll handle this is to grab a machine off the Amazon free
> tier and setup whatever software I want.
>
> However I realize that Hadoop is a memory-intensive, big data
> solution. So what I'm wondering is, would a t2.micro instance be sufficient
> for setting up a cluster of Hadoop nodes with the intention of learning it?
> To keep things running longer in the free tier I would either set up however
> many nodes as I want and keep them stopped when I'm not actively using
> them, or just set up a few nodes with a few different accounts (with a
> different gmail address for each one.. easy enough to do).
>
> Failing that, what are some other free/cheap solutions for setting up
> a hadoop learning environment?
>
> Thanks,
> Tim
>
> --
> GPG me!!
>
> gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
>
>

>>>
>>>
>>>
>>
>
>
> --
> jay vyas
>


Using multiple filesystems together

2014-11-10 Thread Gavin Yue
Hey,

Recently I have been trying to upgrade our managed Hadoop clusters. I saw
the article about Netflix using S3 as the file system instead of HDFS.

But this would lower performance due to slower I/O. So could I use HDFS and
S3 together, or even other file systems together? If a file is in S3 and
not in HDFS, the system should load it automatically.

Are there any projects targeting this problem now?

Thank you.


Best,
Gavin
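
One way to see how this can work: Hadoop file systems are addressed by URI
scheme, so the same client (or a DistCp job) can read one scheme and write
another. A rough sketch, assuming the Hadoop 2.x s3n connector of that era;
the bucket and paths are hypothetical:

<!-- core-site.xml: credentials for the s3n:// scheme -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>

# then pull an S3 directory into HDFS on demand
hadoop distcp s3n://my-bucket/logs/2014-11-10 hdfs:///data/logs/2014-11-10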


Re: How to install patches on Hadoop Cluster

2015-01-07 Thread Gavin Yue
It is more like a general cluster management question.

Do you use Chef or Puppet or any similar product?



Sent from my iPhone

> On Jan 7, 2015, at 12:51, Sajid Syed  wrote:
> 
> Hi All,
> 
> Is there an easy way, or links which show how to install patches (OS /
> Hadoop) on a 100-node cluster?
> 
> Thanks
> SP
> 
> 
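
Even without Chef or Puppet, a parallel shell gives a crude rolling patch.
A sketch assuming pdsh is installed and the nodes are yum-based; the
hostnames and fanout are hypothetical:

# OS patches, 10 nodes at a time so the cluster stays mostly up
pdsh -w 'node[001-100]' -f 10 'sudo yum -y update'

Hadoop-side patches are usually a matter of swapping the package or tarball
the same way, then restarting daemons a few nodes at a time.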


how to quickly fs -cp dir with thousand files?

2016-01-08 Thread Gavin Yue
I want to cp a directory with over 8000 files to another directory in the
same HDFS, but the copy process is really slow since it copies files one by
one. Is there a faster way to do this using the Java FileSystem or FileUtil
API?

Thanks.
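
On the Java API side of the question: FileUtil.copy moves a single file at
a time through the client, so one option is to overlap copies with a thread
pool. A minimal sketch (paths and pool size are hypothetical); note the
bytes still stream through one JVM, which is why the DistCp answer in the
next messages scales better:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ParallelCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path src = new Path("/data/src");  // hypothetical source dir
    Path dst = new Path("/data/dst");  // hypothetical destination dir
    ExecutorService pool = Executors.newFixedThreadPool(16);
    for (FileStatus stat : fs.listStatus(src)) {
      final Path from = stat.getPath();
      pool.submit(() -> {
        // FileUtil.copy streams the bytes through this client, one file
        // per task; the pool overlaps the copies, but all data still
        // flows through a single JVM (unlike DistCp).
        FileUtil.copy(fs, from, fs, new Path(dst, from.getName()),
                      false /* deleteSource */, conf);
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}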


Re: how to quickly fs -cp dir with thousand files?

2016-01-10 Thread Gavin Yue
Yes, I need two distinct copies. I tried Chris's solution, and distcp
indeed works.
Thank you all.

On Sun, Jan 10, 2016 at 3:00 PM, Chris Nauroth 
wrote:

> Yes, certainly, if you only need it in one spot, then -mv is a fast
> metadata-only operation.  I was under the impression that Gavin really
> wanted to achieve 2 distinct copies.  Perhaps I was mistaken.
>
> --Chris Nauroth
>
> From: sandeep vura 
> Date: Sunday, January 10, 2016 at 6:23 AM
> To: Chris Nauroth 
> Cc: Gavin Yue , "user@hadoop.apache.org" <
> user@hadoop.apache.org>
> Subject: Re: how to quickly fs -cp dir with thousand files?
>
> Hi Chris,
>
> Instead of copying files, use the mv command:
>
>
>- hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
>
>
> Sandeep.v
>
>
> On Sat, Jan 9, 2016 at 9:55 AM, Chris Nauroth 
> wrote:
>
>> DistCp is capable of running large copies like this in distributed
>> fashion, implemented as a MapReduce job.
>>
>> http://hadoop.apache.org/docs/r2.7.1/hadoop-distcp/DistCp.html
>>
>> A lot of the literature on DistCp talks about use cases for copying
>> across different clusters, but it's also completely legitimate to run
>> DistCp within the same cluster.
>>
>> --Chris Nauroth
>>
>> From: Gavin Yue 
>> Date: Friday, January 8, 2016 at 4:45 PM
>> To: "user@hadoop.apache.org" 
>> Subject: how to quickly fs -cp dir with thousand files?
>>
>> I want to cp a directory with over 8000 files to another directory in the
>> same HDFS, but the copy process is really slow since it copies files one by
>> one. Is there a faster way to do this using the Java FileSystem or FileUtil
>> API?
>>
>> Thanks.
>>
>>
>
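
For reference, a same-cluster DistCp run in the spirit of Chris's
suggestion looks roughly like this (the paths and mapper count are
hypothetical):

hadoop distcp -m 20 /user/gavin/src /user/gavin/dst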


Re: HDFS2 vs MaprFS

2016-06-04 Thread Gavin Yue
Here is what I found on the Hortonworks website.


Namespace scalability

While HDFS cluster storage scales horizontally with the addition of datanodes, 
the namespace does not. Currently the namespace can only be vertically scaled 
on a single namenode.  The namenode stores the entire file system metadata in 
memory. This limits the number of blocks, files, and directories supported on 
the file system to what can be accommodated in the memory of a single namenode. 
A typical large deployment at Yahoo! includes an HDFS cluster with 2700-4200
datanodes with 180 million files and blocks, addressing ~25 PB of storage. At
Facebook, HDFS has around 2600 nodes, 300 million files and blocks, addressing 
up to 60PB of storage. While these are very large systems and good enough for 
majority of Hadoop users, a few deployments that might want to grow even larger 
could find the namespace scalability limiting.
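
To put those numbers in perspective: a commonly cited rule of thumb (not
from the quoted article) is roughly 150 bytes of namenode heap per
namespace object, so the Facebook figure above works out to about

  300 million files and blocks x ~150 bytes/object = ~45 GB of namenode heap

which is why the namespace scales vertically with namenode memory rather
than horizontally with datanodes.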





> On Jun 4, 2016, at 04:43, Ascot Moss  wrote:
> 
> Hi,
> 
> I read some (old?) articles from the Internet about MapR-FS vs HDFS.
> 
> https://www.mapr.com/products/m5-features/no-namenode-architecture
> 
> It states that HDFS Federation has 
> 
> a) "Multiple Single Points of Failure", is it really true?  
> Why does MapR use HDFS rather than HDFS2 in its comparison, as this leads
> to an unfair (or even misleading) comparison? (HDFS was from Hadoop 1.x,
> the old generation.) HDFS2 has been available since 2013-10-15, and there
> is no single point of failure in HDFS2.
> 
> b) "Limit to 50-200 million files", is it really true? 
> I have seen so many real world Hadoop Clusters with over 10PB data, some even 
> with 150PB data.  If "Limit to 50 -200 millions files" were true in HDFS2, 
> why are there so many production Hadoop clusters in real world? how can they 
> mange well the issue of  "Limit to 50-200 million files"? For instances,  the 
> Facebook's "Like" implementation runs on HBase at Web Scale, I can image 
> HBase generates huge number of files in Facbook's Hadoop cluster, the number 
> of files in Facebook's Hadoop cluster should be much much bigger than 50-200 
> million.
> 
> From my point of view, in contrast, MapR-FS has a true limit of up to 1T
> files, while HDFS2 can handle an effectively unlimited number of files;
> please do correct me if I am wrong.
> 
> c) "Performance Bottleneck", again, is it really true?
> MaprFS does not have namenode in order to gain file system performance. If 
> without Namenode, MaprFS would lose Data Locality which is one of the 
> beauties of Hadoop  If Data Locality is no longer available, any big data 
> application running on MaprFS might gain some file system performance but it 
> would totally lose the true gain of performance from Data Locality provided 
> by Hadoop's namenode (gain small lose big)
> 
> d) "Commercial NAS required"
> Is there any wiki/blog/discussion about Commercial NAS on Hadoop Federation?
> 
> regards
>  
> 
>