Re: Copy Vs DistCP

2013-04-14 Thread Mathias Herberts
That was a hidden shameless plug, Ted ;-) The main disadvantage of fs -cp is that all data has to transit via the machine you issue the command on; depending on the size of the data you want to copy, that can be a killer. DistCp is distributed, as its name implies, so no bottleneck of this kind then. On

Re: Copy Vs DistCP

2013-04-14 Thread Mathias Herberts
This is absolutely true. Distcp dominates cp for large copies. On the other hand, cp dominates distcp for convenience. In my own experience, I love cp when copying relatively small amounts of data (tens of GB) where the available bandwidth of about a GB/s allows the copy to complete in less

Re: Encryption in HDFS

2013-02-25 Thread Mathias Herberts
Encryption without proper key management only addresses the 'stolen hard drive' problem. So far I have not found 100% satisfactory solutions to this hard problem. I've written OSS (Open Secret Server) partly to address this problem in Pig, i.e. accessing encrypted data without embedding key info

Re: How to Backup HDFS data ?

2013-01-24 Thread Mathias Herberts
Backup on tape or on disk? On disk, have another Hadoop cluster and do regular distcp. On tape, make sure you have a backup program which can back up streams so you don't have to materialize your TB files outside of your Hadoop cluster first... (I know Simpana can't do that :-(). On Fri, Jan

Re: One petabyte of data loading into HDFS with in 10 min.

2012-09-05 Thread Mathias Herberts
It greatly depends on the form this PB is stored under; if we're talking N files with N > 1 then you might get better performance by sharding the import job on multiple boxes. If it's a single 1PB file then Infiniband might be your best bet, but won't get you close to 10'

Re: Hadoop and MainFrame integration

2012-08-28 Thread Mathias Herberts
build a custom transfer mechanism in Java and use a zAAP so you won't consume MIPS On Aug 28, 2012 6:24 PM, Siddharth Tiwari siddharth.tiw...@live.com wrote: Hi Users. We have flat files on mainframes with around a billion records. We need to sort them and then use them with different jobs

Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

2012-05-31 Thread Mathias Herberts
Correct me if I'm wrong, but the sole cost of storing 300TB on AWS will account for roughly 300,000 GB * 0.10 USD/GB/month * 12 = 360k USD per annum. We operate a cluster with 112 nodes offering 800+ TB of raw HDFS capacity and the CAPEX was less than 700k USD; if you ask me there is no comparison possible if you
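The back-of-the-envelope calculation above can be sketched as follows; the 300 TB figure and the roughly 0.10 USD/GB/month rate come from the message, while the decimal TB convention and the function name are illustrative assumptions.

```python
# Rough cloud storage cost estimate, assuming a flat per-GB monthly rate.
GB_PER_TB = 1000  # decimal terabytes, as cloud providers typically bill

def annual_storage_cost_usd(terabytes, usd_per_gb_month=0.10):
    """Yearly cost of keeping `terabytes` of data at a flat per-GB monthly rate."""
    return terabytes * GB_PER_TB * usd_per_gb_month * 12

# 300 TB at ~0.10 USD/GB/month comes to roughly 360k USD per annum.
print(annual_storage_cost_usd(300))
```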

Re: is hadoop suitable for us?

2012-05-17 Thread Mathias Herberts
Hadoop does not perform well with shared storage and VMs. The question should be asked first regarding what you're trying to achieve, not about your infra. On May 17, 2012 10:39 PM, Pierre Antoine Du Bois De Naurois pad...@gmail.com wrote: Hello, We have about 50 VMs and we want to

Re: Pig question

2012-05-03 Thread Mathias Herberts
B = GROUP A BY x; C = FOREACH B GENERATE group, SIZE(A), A; D = FILTER C BY $1 == N; On Thu, May 3, 2012 at 8:58 PM, Aleksandr Elbakyan ramal...@yahoo.com wrote: Hello All, I was wondering if it is possible to filter all groups in pig which have size == N. This sounds like something common but
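The Pig pattern above (group, compute each group's size, keep only groups of a given size) can be mimicked in plain Python; the records and N=2 here are made-up illustrations, not data from the thread.

```python
from itertools import groupby

# Made-up (x, value) records; in the Pig script, A is the input relation.
A = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# B = GROUP A BY x;  (groupby needs its input sorted on the grouping key)
B = {k: list(g) for k, g in groupby(sorted(A, key=lambda t: t[0]),
                                    key=lambda t: t[0])}

# C = FOREACH B GENERATE group, SIZE(A), A;  D = FILTER C BY $1 == N;
N = 2
D = [(k, len(bag), bag) for k, bag in B.items() if len(bag) == N]
print(D)
```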

Re: Temporal query

2012-03-29 Thread Mathias Herberts
rather easy to do in Pig with a UDF, filter values > threshold, GROUP ALL, then nested foreach which does an order by on the timestamp and calls your UDF on the sorted bag in the generate On Mar 29, 2012 11:03 AM, banermatt banerm...@hotmail.fr wrote: Hello, I'm developing a log file anomaly
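The pattern described (filter values above a threshold, group everything, sort by timestamp, run a function over the sorted sequence) looks roughly like this in plain Python; the records, threshold, and the `detect_anomaly` stand-in for the UDF are all invented for illustration.

```python
# Made-up (timestamp, value) log records.
records = [(3, 9.0), (1, 2.5), (2, 7.1), (4, 1.0)]
THRESHOLD = 5.0

def detect_anomaly(sorted_values):
    # Stand-in for the UDF: here, the largest jump between consecutive values.
    return max((b - a for a, b in zip(sorted_values, sorted_values[1:])),
               default=0.0)

# FILTER values > threshold, GROUP ALL, ORDER BY timestamp, then apply the UDF.
kept = sorted((r for r in records if r[1] > THRESHOLD), key=lambda r: r[0])
result = detect_anomaly([v for _, v in kept])
print(result)
```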

Re: Problem setting super user

2012-03-19 Thread Mathias Herberts
does it work under user hdfs? On Mar 19, 2012 6:32 PM, Olivier Sallou olivier.sal...@irisa.fr wrote: Hi, I have installed Hadoop 1.0 using .deb package. I tried to configure superuser groups but it somehow fails. I do not know what's wrong: I expect root to be able to run hadoop dfsadmin

Re: how to compress log files with snappy compression

2011-12-31 Thread Mathias Herberts
write a simple java class that creates a snappy compressed seqfile. On Dec 31, 2011 11:17 PM, ravikumar visweswara talk2had...@gmail.com wrote: Hello All, Is there a way to compress my text log files in snappy format on Mac OSX and Linux before or while pushing to hdfs? I don't want to run

Re: Combiners

2011-10-31 Thread Mathias Herberts
Yes. We've talked about adding various checks, but I don't think anyone has added them. We obviously have the input key and one option would be to ignore the output key. ok. Since a Combiner is simply a Reducer with no other constraints, That isn't true. Combiners are required to be:
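The constraint under discussion (a combiner may be applied zero, one, or many times, so it must not change the final result) can be illustrated in Python with a summing combiner; the values are invented for illustration.

```python
def combine(values):
    # A sum combiner: associative and commutative, and its output has the same
    # type as its input, so the framework is free to apply it any number of
    # times (including zero) without changing the reducer's result.
    return [sum(values)]

values = [1, 2, 3, 4, 5]

no_combine = sum(values)                                 # applied zero times
once = sum(combine(values))                              # applied once
split = sum(combine(values[:2]) + combine(values[2:]))   # applied per partial spill
print(no_combine, once, split)
```

A combiner that did not satisfy these properties (say, one that averaged its inputs) would give different answers depending on how many times Hadoop chose to run it.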

Hadoop MapReduce Poster

2011-10-31 Thread Mathias Herberts
Hi, I'm in the process of putting together a 'Hadoop MapReduce Poster' so my students can better understand the various steps of a MapReduce job as run by Hadoop. I intend to release the Poster under a CC-BY-NC-ND license. I'd be grateful if people could review the current draft (3) of the

Re: Grouping in Combiners

2011-10-31 Thread Mathias Herberts
anyway? ICBW, the way I've been writing code makes it irrelevant. Alternatively, I've misunderstood the (simpler) question, and the answer is to use the setGroupingComparatorClass() API. S. On 29 October 2011 04:35, Mathias Herberts mathias.herbe...@gmail.com wrote: Another point

Grouping in Combiners

2011-10-29 Thread Mathias Herberts
Another point concerning the Combiners, the grouping is currently done using the RawComparator used for sorting the Mapper's output. Wouldn't it be useful to be able to set a custom CombinerGroupingComparatorClass? Mathias.
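The distinction raised here (the comparator used for sorting the mapper's output need not be the one used for grouping, as with Hadoop's setGroupingComparatorClass) can be sketched with itertools.groupby; the composite (natural key, timestamp) records are invented, in the style of a secondary-sort job.

```python
from itertools import groupby

# Composite keys (natural_key, timestamp), as emitted by a secondary-sort mapper.
pairs = [("a", 2), ("b", 1), ("a", 1), ("b", 3)]

# Sort by the full composite key (the "sorting comparator")...
pairs.sort()

# ...but group by the natural key only (the "grouping comparator"), so each
# reduce call sees all timestamps for one key, already in order.
grouped = {k: [t for _, t in g] for k, g in groupby(pairs, key=lambda p: p[0])}
print(grouped)
```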

Re: Maintaining map reduce job logs - The best practices

2011-09-23 Thread Mathias Herberts
You can find the job specific logs in two places. The first one is in the hdfs output directory. The second place is under $HADOOP_HOME/logs/history ($HADOOP_HOME/logs/history/done). Both these places have the config file and the job logs for each submitted job. Those logs in 'history/done'

Re: Hadoop integration with SAS

2011-08-23 Thread Mathias Herberts
Forget SAS, use Pig instead. On Aug 23, 2011 11:22 PM, jonathan.hw...@accenture.com wrote: Anyone had worked on Hadoop data integration with SAS? Does SAS have a connector to HDFS? Can it use data directly on HDFS? Any link or samples or tools? Thanks! Jonathan

Re: Namenode Scalability

2011-08-10 Thread Mathias Herberts
Just curious, what are the tech specs of your datanodes to accommodate 1PB/day on 20 nodes? On Aug 10, 2011 10:12 AM, jagaran das jagaran_...@yahoo.co.in wrote: In my current project we are planning to stream data to Namenode (20 Node Cluster). Data Volume would be around 1 PB per day. But

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Mathias Herberts
On Wed, Jun 29, 2011 at 01:02, Matei Zaharia ma...@eecs.berkeley.edu wrote: Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile your target Hadoop workload and see whether it's communication-bound. Hadoop jobs can definitely be communication-bound if you shuffle a

Re: can our problem be handled by hadoop

2011-05-26 Thread Mathias Herberts
Hi, seems like the perfect use case for Map Reduce yep. 2011/5/26 Mirko Kämpf mirko.kae...@googlemail.com: Hello, we are working on a scientific project to analyze information spread in networks. Our simulations are independent from each other but we need a large amount of runs and we have

Re: distcp performing much better for rebalancing than dedicated balancer

2011-05-05 Thread Mathias Herberts
Did you explicitly start a balancer or did you decommission the nodes using dfs.hosts.exclude and a dfsadmin -refreshNodes? On Thu, May 5, 2011 at 14:30, Ferdy Galema ferdy.gal...@kalooga.com wrote: Hi, On our 15node cluster (1GB ethernet and 4x1TB disk per node) I noticed that distcp does a
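For reference, the decommissioning route mentioned above is configured roughly like this in hdfs-site.xml (the exclude-file path is a placeholder); hostnames added to that file are decommissioned after a `hadoop dfsadmin -refreshNodes`.

```xml
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```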

Re: Does it mean that single disk failure causes the whole datanode to fail?

2011-04-25 Thread Mathias Herberts
You can configure how many failed volumes a datanode can tolerate. On Apr 25, 2011 5:04 PM, Xiaobo Gu guxiaobo1...@gmail.com wrote: Hi, I heard from so many people saying we should using JBOD instead of RAID, that is we should format each local disk(used for data storage) into an individual
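The knob referred to is, in Hadoop of this era, the hdfs-site.xml property below; tolerating one failed volume instead of the default of zero keeps a JBOD datanode running after a single disk dies.

```xml
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```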

Re: HDFS permission denied

2011-04-24 Thread Mathias Herberts
Check the NN's logs to see the path which led to this. On Apr 24, 2011 8:41 AM, Peng, Wei wei.p...@xerox.com wrote: Hi, I need a help very bad. I got an HDFS permission error by starting to run hadoop job org.apache.hadoop.security.AccessControlException: Permission denied: user=wp,

Re: Jcuda on Hadoop

2010-12-09 Thread Mathias Herberts
You need to have the native libs on all tasktrackers and have java.library.path correctly set. On Dec 9, 2010 11:01 PM, He Chen airb...@gmail.com wrote: Hello everyone, I 've got a problem when I write some Jcuda program based on Hadoop MapReduce. I use the jcudaUtill. The KernelLauncherSample

Re: RE: No Space Left On Device though space is available

2009-08-03 Thread Mathias Herberts
no quota on the fs? On Aug 3, 2009 7:13 AM, Palleti, Pallavi pallavi.pall...@corp.aol.com wrote: No. These are production jobs which were working pretty fine and suddenly, we started seeing these issues. And, if you see the error log, the jobs are failing at the time of submission itself while

Re: Remote access to cluster using user as hadoop

2009-07-23 Thread Mathias Herberts
On Thu, Jul 23, 2009 at 09:20, Ted Dunning ted.dunn...@gmail.com wrote: Last I heard, the API could be suborned in this scenario. Real credential based identity would be needed to provide more than this. The hack would involve a changed hadoop library that lies about identity. This would not