Re: Hadoop vs Ceph and GlusterFS

2013-12-31 Thread Chris Embree
Ceph and GlusterFS are NOT centralized file systems. GlusterFS can be used with Hadoop MapReduce, but it requires a special plug-in, and HDFS 2 can be HA, so it's probably not worth switching. YMMV. On Dec 31, 2013 4:01 PM, "Jiayu Ji" wrote: > I am not very familiar with Ceph and GlusterFS, b

Re: conf.set() and conf.get()

2013-12-29 Thread Chris Embree
Ignorant question: Did this just devolve into a java discussion? On 12/30/13, unmesha sreeveni wrote: > but i need to convert it back to object of the same class. > If i am converting it to string will it be possible? > > > On Mon, Dec 30, 2013 at 11:16 AM, Harsh J wrote: > >> If you can store

Re: block replication

2013-12-27 Thread Chris Embree
Maybe I'm just grouchy tonight... it seems all of these questions can be answered by RTFM. http://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html What's the balance between encouraging learning by new-to-Hadoop users and OMG!? On Fri, Dec 27, 2013 at 8:58 PM,

Re: from relational to bigger data

2013-12-19 Thread Chris Embree
In big data terms, 500G isn't big. But, moving that much data around every night is not trivial either. I'm going to guess at a lot here, but at a very high level. 1. Sqoop the data required to build the summary tables into Hadoop. 2. Crunch the summaries into new tables (really just files on Ha

FYI re: Sqoop jiras being reported

2013-12-06 Thread Chris Embree
Hello Hadoopers. I thought I'd share a couple of Sqoop bugs we found recently. 1) If, for some reason, Sqoop fails to move a file/directory to its -target-dir because the file is no longer available, it will issue a WARNing and not an error. This is very significant in batch operations. Effec

Re: Apache Ambari

2013-12-05 Thread Chris Embree
Unless something has recently changed, Ambari cannot work on an existing cluster. One of the several reasons we chose to eschew it. On 12/5/13, Jilal Oussama wrote: > Hello all, > > Pardon me to ask this question here instead of the Ambari mailing list (I > am not subscribed to it). > > I would

Re: Using combiners in python hadoop streaming

2013-09-18 Thread Chris Embree
LMGTFY: http://pydoop.sourceforge.net/docs/pydoop_script.html#pydoop-script-guide On Wed, Sep 18, 2013 at 6:01 PM, jamal sasha wrote: > Hi, > How do I implement (say ) in wordcount a combiner functionality if i am > using python hadoop streaming? > Thanks >

Re: Cloudera Vs Hortonworks Vs MapR

2013-09-16 Thread Chris Embree
Our evaluation was similar, except we did not consider the "management" tools any vendor provided, as that's just as much lock-in as any proprietary tool. What if I want to trade vendors? I have to re-tool to use their mgmt? Nope, wrote our own. Being in a large enterprise, we went with the "perceiv

Re: Cloudera Vs Hortonworks Vs MapR

2013-09-13 Thread Chris Embree
The only problem is around the degeneration of the discussion. See years long threads around vi vs. emacs, Windows vs. Linux, Java vs. C/Python/Perl/Ruby. On 9/13/13, Chris Mattmann wrote: > Errr, what's wrong with discussing these types of issues on list? > > Nothing public here, and as long a

Re: 'supergroup' in Hadoop

2013-09-11 Thread Chris Embree
Hadoop (HDFS and MapReduce) get group membership, etc. from the OS. The only "exception" is that you define the HDFS Superuser Group in the XML. It still must exist at the OS Level, but grants privs at the Hadoop Level. At least in HDP 1.x On Wed, Sep 11, 2013 at 9:38 PM, Raj Hadoop wrote: >

Re: Hadoop Metrics Issue in ganglia.

2013-09-11 Thread Chris Embree
Did you try ganglia forums/lists? On 9/11/13, orahad bigdata wrote: > Hi All, > > Can somebody help me please? > > Thanks > On 9/11/13, orahad bigdata wrote: >> Hi All, >> >> I'm facing an issue while showing Hadoop metrics in ganglia, Though I >> have installed ganglia on my master/slaves node

Re: hadoop cares about /etc/hosts ?

2013-09-09 Thread Chris Embree
This sounds entirely like an OS-level problem and is slightly outside the scope of this list; however, I'd suggest you look at your /etc/nsswitch.conf file and ensure that the hosts: line says "hosts: files dns". This will ensure that names are resolved first by /etc/hosts, then by DNS. Please al
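The lookup-order check described above can be sketched in Python. This is only an illustration, not something from the original message: the function names are made up here, and it assumes the usual nsswitch.conf layout of one database per line with "#" comments.

```python
def hosts_lookup_order(nsswitch_text):
    """Return the sources listed on the 'hosts:' line, or [] if there is none."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("hosts:"):
            return line[len("hosts:"):].split()
    return []


def files_before_dns(nsswitch_text):
    """True only if both 'files' and 'dns' are listed and 'files' comes first."""
    order = hosts_lookup_order(nsswitch_text)
    return ("files" in order and "dns" in order
            and order.index("files") < order.index("dns"))
```

To check a real host, pass in the contents of /etc/nsswitch.conf, for example `files_before_dns(open("/etc/nsswitch.conf").read())`.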

Re: How to speed up Hadoop?

2013-09-05 Thread Chris Embree
I think you just went backwards. More replicas (generally speaking) are better. I'd take 60 cheap, 1U servers over 20 "highly fault tolerant" ones for almost every problem. I'd get them for the same or less $ too. On Thu, Sep 5, 2013 at 8:41 PM, Sundeep Kambhampati < kambh...@cse.ohio-stat

Re: Hardware Selection for Hadoop

2013-08-12 Thread Chris Embree
As we always say in Technology... it depends! What country are you in? That makes a difference. How much buying power do you have? I work for a Fortune 100 Company and we -- absurdly -- pay about 60% off retail when we buy servers. Are you buying a bunch at once? Your best bet is to contact 3 or

Re: the options that used to tuning mapreduceV1 is still useful for YARN?

2013-08-11 Thread Chris Embree
Steps to Hadoop 2.x documentation. 1. Realize reality, 2. Smoke 2-3 long joints, depending on tolerance levels 3. Review the code... 4. Allow the THC to take effect and view the code in a new light 5. Understand what the developers have said 6. Code mind beautiful patches to base code 7. crash 8.

Re: Suspecting Namenode Filesystem Corrupt!!

2013-07-29 Thread Chris Embree
My foundation is more Linux than Hadoop, so I'll support Harsh (like he needs it) in asking, "What's the problem?" If you can't df -h this is probably a "lower than Hadoop" issue, and while most Hadoop folks are willing to help (see the fact that Harsh responded) this is 99.9% likely to be an EXT4

Re: Why Hadoop force using DNS?

2013-07-29 Thread Chris Embree
Just for clarity, DNS as a service is NOT Required. Name resolution is. I use /etc/hosts files to identify all nodes in my clusters. One of the reasons for using Names over IP's is ease of use. I would much rather use a hostname in my XML to identify NN, JT, etc. vs. some random string of numb

How bad is this? :)

2013-07-08 Thread Chris Embree
Hey Hadoop smart folks, I have a tendency to seek optimum performance given my understanding, so that led to my "brilliant" decision. We settled on EXT4 for our underlying FS for HDFS. Greedy for speed, I thought, let's turn the journal off and gain the speed benefits. After all, I have 3 co

Re: Reg:HDFS Decommission Design Document.

2013-06-13 Thread Chris Embree
Ha -- I just decommissioned some nodes today. Add the nodes you'd like to decom. to the excludes file (search for its name in hdfs-site.xml), usually dfs.exclude. Log in to your NN and issue hadoop dfsadmin -refreshNodes, then watch the NN Web interface until the Decommissioning Nodes are complete. T
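A minimal sketch of the "add the nodes to the excludes file" step, assuming the file is plain text with one hostname per line. The helper name is hypothetical, and my understanding is that the file's path comes from the dfs.hosts.exclude property in hdfs-site.xml, so check your own config rather than trusting this.

```python
def add_to_excludes(current_text, hosts):
    """Return new excludes-file content with the given hosts appended once each."""
    existing = [h.strip() for h in current_text.splitlines() if h.strip()]
    for host in hosts:
        if host not in existing:  # keep the file free of duplicates
            existing.append(host)
    return "\n".join(existing) + "\n"
```

After writing the result back to the excludes file, the `hadoop dfsadmin -refreshNodes` command from the message is what makes the NameNode re-read it.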

Re: recovery accidently deleted pig script

2013-06-12 Thread Chris Embree
This is not a Hadoop question (IMHO). 2 words: Version Control Did the advent of Hadoop somehow circumvent all IT convention? Sorry folks, it's been a rough day. On 6/12/13, Michael Segel wrote: > Where was the pig script? On HDFS? > > How often does your cluster clean up the trash? > > (Dele

Re: is time sync required among all nodes?

2013-06-04 Thread Chris Embree
Yes, NTPD is your best option. On 6/4/13, Ben Kim wrote: > Hi, > This is very basic & fundamental question. > > Is time among all nodes needs to be synced? > > I've never even thought of timing in hadoop cluster but recently > experienced my servers going out of sync with time. I know hbase requi

Re: Where to begin from??

2013-05-23 Thread Chris Embree
I'll be chastised and have mean things said about me for this. Get some experience in IT before you start looking at Hadoop. My reasoning is this: If you don't know how to develop real applications in a Non-Hadoop world, you'll struggle a lot to develop with Hadoop. Asking what "things you need

Re: Low latency data access Vs High throughput of data

2013-05-20 Thread Chris Embree
I'll take a swing at this one. Low latency data access: I hit the enter key (or submit button) and I expect results within seconds at most. My database query time should be sub-second. High throughput of data: I want to scan millions of rows of data and count or sum some subset. I expect this

Re: About configuring cluster setup

2013-05-14 Thread Chris Embree
It's not a good idea for anything more than Proof of Concept or Sandbox clusters. On Tue, May 14, 2013 at 3:10 AM, Leonid Fedotov wrote: > No, it is not called "pseudo distributed" mode. It called "as you wish" > mode... > It is absolutely normal configuration. > You can distribute your nodes as

Re: Rack Aware Hadoop cluster

2013-05-08 Thread Chris Embree
< > 3m.mustaq...@gmail.com> wrote: > > @chris, I have test it outside. It is working fine. > > > > > > On Wed, May 8, 2013 at 7:48 PM, Leonid Fedotov > wrote: > > Error in script. > > > > > > On Wed, May 8, 2013 at 7:11 AM, Chris Em

Re: Rack Aware Hadoop cluster

2013-05-08 Thread Chris Embree
Your script has an error in it. Please test your script using both IP Addresses and Names, outside of hadoop. On Wed, May 8, 2013 at 10:01 AM, Mohammad Mustaqeem <3m.mustaq...@gmail.com>wrote: > I have done this and found following error in log - > > 2013-05-08 18:53:45,221 WARN org.apache.hado
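For anyone wanting something to test against, here is a minimal topology script sketch in Python. It assumes the usual contract: Hadoop passes one or more IP addresses or hostnames as arguments and expects one rack path per argument on standard output. The mapping table, hostnames and rack names below are invented for illustration.

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping; list both IPs and names, since either may be passed in.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "dn1.example.com": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
    "dn2.example.com": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve(addresses):
    """Return one rack path per input address, in the same order."""
    return [RACKS.get(addr, DEFAULT_RACK) for addr in addresses]


if __name__ == "__main__":
    # One answer per argument, separated by spaces.
    print(" ".join(resolve(sys.argv[1:])))
```

Run it by hand first, outside of Hadoop, with an IP, a hostname and an unknown host, and confirm you get exactly one rack back for each argument.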

Re: Rack Aware Hadoop cluster

2013-05-08 Thread Chris Embree
Finally, one I can answer. :) That should be in core-site.xml (unless it's moved from ver 1.x). It needs to be in the configuration for NameNode(s) and JobTracker (Yarn). In 1.x you need to restart NN and JT services for the script to take effect. On Wed, May 8, 2013 at 9:43 AM, Mohammad Musta

Re: Run multiple HDFS instances

2013-04-18 Thread Chris Embree
Glad you got this working... can you explain your use case a little? I'm trying to understand why you might want to do that. On Thu, Apr 18, 2013 at 11:29 AM, Lixiang Ao wrote: > I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works! > Everything looks fine now. > > Seems d

Re: Linux io scheduler

2013-04-02 Thread Chris Embree
I assume you're talking about the I/O scheduler. Based on normal advice, only change this if you have a "smart" device between the OS and the Drives. A SATA controller usually qualifies. I have our DataNodes set to NOOP to reduce the number of layers. As always, your mileage may vary and you should
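To see which scheduler a drive is using before changing anything, a small Python sketch like the one below can help. It assumes the Linux sysfs format where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets, e.g. "noop deadline [cfq]". The function names are made up for this example, and newer kernels use different scheduler names.

```python
import re


def active_scheduler(scheduler_line):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", scheduler_line)
    return match.group(1) if match else None


def read_active_scheduler(device):
    """Read the active scheduler for a block device such as 'sda' (Linux only)."""
    with open("/sys/block/%s/queue/scheduler" % device) as handle:
        return active_scheduler(handle.read())
```

For example, `read_active_scheduler("sda")` on a DataNode shows what is currently in effect for that disk.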

Re: Rack Awareness

2013-03-26 Thread Chris Embree
Make sure you have the topology script available on the JobTracker server as well. This also requires a jobtracker stop/start to take effect. Also, make sure $HADOOP_CONF resolves properly as the mapred user. On Tue, Mar 26, 2013 at 1:19 AM, preethi ganeshan < preethiganesha...@gmail.com> wrote:

Re: Cluster lost IP addresses

2013-03-22 Thread Chris Embree
Hey John, Make sure your /etc/hosts ( or DNS) is up to date and any topology scripts are updated. Unfortunately, NN is pretty dumb about IP's vs. Hostnames. BTW, NN devs. Seriously? You rely on IP addr instead of hostname? Someone should probably be shot or at least be responsible for fixing

Re: Replication factor

2013-03-12 Thread Chris Embree
Aww.. You could've used lmgtfy.com :) On Tue, Mar 12, 2013 at 4:57 PM, varun kumar wrote: > http://hadoopblogfromvarun.wordpress.com/ > > > On Wed, Mar 13, 2013 at 2:16 AM, Mohit Anchlia wrote: > >> Is it possible to set replication factor to a different value than the >> default at the directo

Re: About Hadoop Deb file

2013-02-20 Thread Chris Embree
Jokingly I want to say the problem is that you selected Ubuntu (or any other Debian based Linux) as your platform. On a more serious note, if you are new to both Linux and Hadoop, you might be much better off to select CentOS for your Linux as that is the base development platform for most contrib

Topology script frequency

2013-02-20 Thread Chris Embree
I've checked all of the documentation, books and Google searches I can think of. I have a working topology script. I have dynamic IPs. I have an automated process to update the rack data when a datanode changes IP. What I don't have is any clue as to when the NN reads this script. If I exe

Re: Using NFS mounted volume for Hadoop installation/configuration

2013-02-18 Thread Chris Embree
> Paul > > > On 18 Feb 2013, at 18:09, Chris Embree wrote: > > I'm doing that currently. No problems to report so far. > > The only pitfall I've found is around NFS stability. If your NAS is 100% > solid no problems. I've seen mtab get messed up and re

Re: Using NFS mounted volume for Hadoop installation/configuration

2013-02-18 Thread Chris Embree
I'm doing that currently. No problems to report so far. The only pitfall I've found is around NFS stability. If your NAS is 100% solid, no problems. I've seen mtab get messed up and refuse to remount if NFS has any hiccups. If you want to get really crazy, consider NFS for your datanode root fs. S

Re: How to add users in supergroup?

2013-02-14 Thread Chris Embree
Check your HDFS config file for the groupname you used as HDFS supergroup. We used hdfs as the group name in our case. Then just groupadd hdfs (see man groupadd for additional options) Then, when you create users, add them to that group. useradd -G hdfs newuser This is more Linux admin than Ha

Re: configure mapreduce to work with pem files.

2013-02-13 Thread Chris Embree
You need to configure ssh to use your pem files, by default it uses dsa or rsa files. Look at man ssh_config. On Wed, Feb 13, 2013 at 6:46 AM, Pedro Sá da Costa wrote: > I'm trying to configure ssh for the Hadoop mapreduce, but my nodes only > communicate with each others using RSA keys in pem

Re: Mutiple dfs.data.dir vs RAID0

2013-02-10 Thread Chris Embree
Interesting question. You'd probably need to benchmark to prove it out. I'm not sure of the exact details of how HDFS stripes data, but it should compare pretty well to hardware RAID. Conceptually, HDFS should be able to outperform a RAID solution, since HDFS "knows" more about the data being written.

Re: Interested in learning hadoop

2013-02-02 Thread Chris Embree
Just to maintain some balance on the list, Hortonworks has similar training videos and a sandbox appliance. http://hortonworks.com/community/ Enjoy. On Sat, Feb 2, 2013 at 10:02 AM, YouPeng Yang wrote: > Hi akram khalil > >if you want to take some courses .Recommend you to take the Cloudera

Re: Maximum Storage size in a Single datanode

2013-01-30 Thread Chris Embree
You should probably think about this in a more cluster-oriented fashion. A single node with a PB of data is probably not a good CPU : Disk ratio. In addition, you need enough RAM on your NameNode to keep track of all of your blocks. A few nodes with a PB each would quickly drive up NN RAM

Re: hdfs du periodicity and hdfs not respond at that time

2013-01-24 Thread Chris Embree
What type of FS are you using under HDFS? XFS, ext3, ext4? The type and configuration of the underlying FS will impact performance. Most notably, ext3 has a lock-up effect when flushing disk cache. On Thu, Jan 24, 2013 at 2:54 AM, Xibin Liu wrote: > Thanks, http://search-hadoop.com/m/LLBgUiH0

Where do/should .jar files live?

2013-01-22 Thread Chris Embree
Hi List, This should be a simple question, I think. Disclosure, I am not a java developer. ;) We're getting ready to build our Dev and Prod clusters. I'm pretty comfortable with HDFS and how it sits atop several local file systems on multiple servers. I'm fairly comfortable with the concept of

Re: modifying existing wordcount example

2013-01-16 Thread Chris Embree
Can you instead copy input1 and input2 together? Or process both files on the second pass? Otherwise, you'll have to read in the output file, load the values and start your map/red job. Probably someone else will have a better answer. :) On Wed, Jan 16, 2013 at 9:07 PM, jamal sasha wrote: > Hi,

Re: does "fs -put " create subdirectories?

2013-01-16 Thread Chris Embree
Good point Harsh. As a Linux Admin, I prefer the behavior of 2.x. It allows me to see if I've made a mistake in my planned placement of files instead of blindly writing. On Wed, Jan 16, 2013 at 12:05 PM, Harsh J wrote: > On 1.x, -put does mkdir the parent directories if they are non existent >

Re: Hadoop NON DFS space

2013-01-16 Thread Chris Embree
Ha, you joke, but we're planning on running with no local OS. If it works as planned I'll post a nice summary of our approach. :) On Wed, Jan 16, 2013 at 2:53 AM, Harsh J wrote: > Wipe your OS out. > > Please read: http://search-hadoop.com/m/9Qwi9UgMOe > > > On Wed, Jan 16, 2013 at 1:16 PM, V

Re: datanode write forwarding

2012-12-18 Thread Chris Embree
Harsh, Is that a change from 1.0 code? Hortonworks explains it a little differently. Thanks for the details and pointer to the code. Chris (another one) On Dec 18, 2012 5:14 PM, "Harsh J" wrote: > Hi, > > The received write packet is directly socket-written to the next > node's receiver (asyn

Re: datanode write forwarding

2012-12-18 Thread Chris Embree
Hi Jay, We'll need "real developer" expertise on this, but my understanding of the documentation is: Client talks to Name node to get Node/Block assignments, client then talks to node 1: write, fwd, node 2: write, fwd, node 3: write, ack node 2, node 2: ack node1, node 1: ack Client and Name Nod
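The write/forward/ack ordering described above can be traced with a toy Python model. This is a deliberately simplified sketch: it treats the write as one step per node, whereas the real DataNode pipeline streams packets and forwards them while still writing. The function and node names are invented for illustration.

```python
def write_pipeline(nodes):
    """Return the ordered data and ack hops for a client writing through nodes."""
    chain = ["client"] + list(nodes)
    events = []
    # Data flows forward: client -> node 1 -> node 2 -> ...
    for sender, receiver in zip(chain, chain[1:]):
        events.append("%s -> %s: data" % (sender, receiver))
    # Acks flow back in reverse: last node -> ... -> node 1 -> client
    for i in range(len(chain) - 1, 0, -1):
        events.append("%s -> %s: ack" % (chain[i], chain[i - 1]))
    return events
```

Calling `write_pipeline(["dn1", "dn2", "dn3"])` lists three data hops followed by three ack hops, ending with dn1 acknowledging the client.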

Re: number of mapred slots

2012-12-17 Thread Chris Embree
I think the rule of thumb (Hortonworks at least) is 2x cores for map threads and 1x cores for reducers. Don't have my notes here so I'm not 100% sure. It's just a guideline in any event. :) TEST, TEST, TEST. :) On Tue, Dec 18, 2012 at 1:08 AM, wrote: > Hello, > > I was unable to find any informa
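As a starting point only, the 2x/1x guideline above works out like this in Python. The function name is made up here; in MRv1 the resulting numbers would typically go into mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum on each TaskTracker, and they should be tuned by testing, not taken as given.

```python
def suggested_slots(cores):
    """Rule-of-thumb starting point: 2x cores for map slots, 1x for reduce slots."""
    return {"map": 2 * cores, "reduce": cores}
```

For an 8-core node this suggests 16 map slots and 8 reduce slots, which also has to fit within the node's RAM.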

Re: Hadoop 101

2012-12-12 Thread Chris Embree
Just to be a picker of nits... this topic is more concisely Hadoop Development 101. I only mention this because I am a newbie hadoop admin and this was over my head. ;) Admins don't worry as much about Key Value Pairs and parsing as we do about where is the script that starts the NameNode. ;) O

Re: Sane max storage size for DN

2012-12-12 Thread Chris Embree
Hi Mohammed, The amount of RAM on the NN is related to the number of blocks... so let's do some math. :) 1G of RAM to 1M blocks seems to be the general rule. I'll probably mess this up so someone check my math: 9 PB ~ 9,216 TB ~ 9,437,184 GB of data. Let's put that in 128MB blocks: according
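The arithmetic above can be reproduced with a short Python sketch. It only applies the rule of thumb quoted in the message (1 GB of NN RAM per 1M blocks) and assumes every block is a full 128 MB, which understates the block count when there are many small files. The function name and parameter names are invented for illustration.

```python
def namenode_ram_estimate(data_gb, block_mb=128, blocks_per_gb_ram=1000000):
    """Return (block_count, estimated NameNode RAM in GB) for full-size blocks."""
    blocks = data_gb * 1024 // block_mb          # GB -> MB, then MB per block
    ram_gb = blocks / float(blocks_per_gb_ram)   # 1 GB RAM per 1M blocks
    return blocks, ram_gb


# 9 PB expressed in GB, as in the message: 9 * 1024 * 1024 = 9,437,184 GB
blocks, ram_gb = namenode_ram_estimate(9 * 1024 * 1024)
```

Under these assumptions the block count is 75,497,472, so the rule of thumb points at roughly 75 GB of NameNode RAM.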