That was a hidden shameless plug Ted ;-)
The main disadvantage of fs -cp is that all the data has to transit via the
machine you issue the command on; depending on the size of the data you want
to copy, that can be a killer. DistCp is distributed, as its name implies, so
there is no bottleneck of this kind.
On
This is absolutely true. Distcp dominates cp for large copies. On the
other hand, cp dominates distcp for convenience.
In my own experience, I love cp when copying relatively small amounts of
data (tens of GB) where the available bandwidth of about a GB/s allows the
copy to complete in less
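For concreteness, a minimal sketch of the two approaches (paths and namenode
addresses are made up):

hadoop fs -cp /data/src /data/dst                                 # every byte streams through the client
hadoop distcp hdfs://nn-a:8020/data/src hdfs://nn-b:8020/data/dst # copy runs as a distributed MapReduce job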
Encryption without proper key management only addresses the 'stolen
hard drive' problem.
So far I have not found 100% satisfactory solutions to this hard
problem. I've written OSS (Open Secret Server) partly to address this
problem in Pig, i.e. accessing encrypted data without embedding key
info
Backup on tape or on disk?
On disk, have another Hadoop cluster and do regular distcps.
On tape, make sure you have a backup program which can back up streams
so you don't have to materialize your TB files outside of your Hadoop
cluster first... (I know Simpana can't do that :-().
On Fri, Jan
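A sketch of such a regular copy, assuming a second cluster whose namenode is
reachable as backup-nn (made up), with distcp's incremental mode so unchanged
files are skipped:

hadoop distcp -update hdfs://prod-nn:8020/data hdfs://backup-nn:8020/data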
It greatly depends on the form this PB is stored under. If we're
talking N files with N > 1 then you might get better performance by
sharding the import job on multiple boxes.
If it's a single 1PB file then Infiniband might be your best bet, but
won't get you close to 10'
build a custom transfer mechanism in Java and use a zAAP so you won't
consume MIPS
On Aug 28, 2012 6:24 PM, Siddharth Tiwari siddharth.tiw...@live.com
wrote:
Hi Users.
We have flat files on mainframes with around a billion records. We need to
sort them and then use them with different jobs
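Once the records land in HDFS, the sorting part falls out of MapReduce's
shuffle for free. A minimal sketch (new API; the class names and the
lines-of-text-as-records assumption are mine), sorting lines by their full
content:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineSort {
  // Emit each line as the key; the framework sorts keys during the shuffle.
  public static class LineMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(line, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "line-sort");
    job.setJarByClass(LineSort.class);
    job.setMapperClass(LineMapper.class);
    // No reducer class set: the default identity reducer writes the
    // keys back out in sorted order.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    job.setNumReduceTasks(1); // one reducer => one totally ordered output file
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A single reducer won't scale to a billion records, of course; in practice
you'd run many reducers with a TotalOrderPartitioner so the concatenated
outputs stay globally sorted.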
Correct me if I'm wrong, but the sole cost of storing 300TB on AWS
will account for roughly 300,000 GB * 0.10 USD/GB/month * 12 = 360,000 USD
per annum.
We operate a cluster with 112 nodes offering 800+ TB of raw HDFS
capacity, and the CAPEX was less than 700k USD; if you ask me there is
no comparison possible if you
Hadoop does not perform well with shared storage and VMs.
The first question to ask is what you're trying to achieve, not what
your infra looks like.
On May 17, 2012 10:39 PM, Pierre Antoine Du Bois De Naurois
pad...@gmail.com wrote:
Hello,
We have about 50 VMs and we want to
B = GROUP A BY x;
-- inside a FOREACH over a grouped relation, the bag of grouped tuples
-- keeps the name of the original relation, i.e. A, not B
C = FOREACH B GENERATE group, SIZE(A), A;
D = FILTER C BY $1 == N;
On Thu, May 3, 2012 at 8:58 PM, Aleksandr Elbakyan ramal...@yahoo.com wrote:
Hello All,
I was wondering if it is possible to filter all groups in Pig which have size
N. This sounds like something common but
rather easy to do in Pig with a UDF: filter values > threshold, GROUP ALL,
then a nested foreach which does an ORDER BY on the timestamp and calls your
UDF on the sorted bag in the generate
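A rough sketch of that pipeline (the schema, the 42.0 threshold and MyUDF are
all made up for illustration):

A = LOAD 'logs' AS (ts:long, value:double);
B = FILTER A BY value > 42.0;    -- keep values above your threshold
C = GROUP B ALL;                 -- one group holding every surviving record
D = FOREACH C {
    sorted = ORDER B BY ts;      -- nested order by on the timestamp
    GENERATE MyUDF(sorted);      -- the UDF sees the whole sorted bag
};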
On Mar 29, 2012 11:03 AM, banermatt banerm...@hotmail.fr wrote:
Hello,
I'm developing a log file anomaly
does it work under user hdfs?
On Mar 19, 2012 6:32 PM, Olivier Sallou olivier.sal...@irisa.fr wrote:
Hi,
I have installed Hadoop 1.0 using the .deb package.
I tried to configure superuser groups but it somehow fails. I do not know
what's wrong:
I expect root to be able to run hadoop dfsadmin
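In case it helps: in Hadoop 1.x the HDFS superuser is simply the user that
started the namenode (hence the question about user hdfs above), and
superuser rights can also be granted to a unix group via hdfs-site.xml. A
sketch, with a made-up group name that root would then need to belong to:

<property>
  <name>dfs.permissions.supergroup</name>
  <value>hadoopadmin</value>
</property>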
write a simple java class that creates a snappy compressed seqfile.
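A minimal sketch of such a class (Hadoop 1.x API; the key/value choice is
mine, and the Snappy native library must be present on the machine running
it):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SnappySeqFileWriter {
  public static void main(String[] args) throws Exception {
    // args[0] = local text file, args[1] = target path on HDFS
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]),
        LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new SnappyCodec());
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    long lineno = 0;
    while ((line = in.readLine()) != null) {
      // key = line number, value = the log line itself
      writer.append(new LongWritable(lineno++), new Text(line));
    }
    in.close();
    writer.close();
  }
}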
On Dec 31, 2011 11:17 PM, ravikumar visweswara talk2had...@gmail.com
wrote:
Hello All,
Is there a way to compress my text log files in Snappy format on Mac OS X
and Linux before or while pushing to HDFS?
I don't want to run
Yes. We've talked about adding various checks, but I don't think anyone has
added them. We obviously have the input key and one option would be to
ignore the output key.
ok.
Since a Combiner is simply a Reducer with no other constraints,
That isn't true. Combiners are required to be:
Hi,
I'm in the process of putting together a 'Hadoop MapReduce Poster' so
my students can better understand the various steps of a MapReduce job
as run by Hadoop.
I intend to release the Poster under a CC-BY-NC-ND license.
I'd be grateful if people could review the current draft (3) of the
anyway? ICBW, the way I've been writing code makes it irrelevant.
Alternatively, I've misunderstood the (simpler) question, and the answer
is
to use the setGroupingComparatorClass() API.
S.
On 29 October 2011 04:35, Mathias Herberts mathias.herbe...@gmail.com
wrote:
Another point concerning the Combiners,
the grouping is currently done using the RawComparator used for
sorting the Mapper's output. Wouldn't it be useful to be able to set a
custom CombinerGroupingComparatorClass?
Mathias.
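For reference, the reduce-side knob being discussed is just a RawComparator
wired in through the job; a combiner-side equivalent is exactly what's
missing. A sketch of such a comparator (the "prefix|suffix" key layout and
the class name are made up):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite Text keys of the form "prefix|suffix" by prefix only.
public class PrefixGroupingComparator extends WritableComparator {
  public PrefixGroupingComparator() {
    super(Text.class, true); // true => deserialize keys for compare()
  }
  @Override
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    String x = a.toString().split("\\|", 2)[0];
    String y = b.toString().split("\\|", 2)[0];
    return x.compareTo(y);
  }
}

// reduce side today: job.setGroupingComparatorClass(PrefixGroupingComparator.class);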
You can find the job-specific logs in two places. The first one is in the
HDFS output directory. The second place is under $HADOOP_HOME/logs/history
($HADOOP_HOME/logs/history/done).
Both these places have the config file and the job logs for each submitted job.
Those logs in 'history/done'
Forget SAS, use Pig instead.
On Aug 23, 2011 11:22 PM, jonathan.hw...@accenture.com wrote:
Anyone had worked on Hadoop data integration with SAS?
Does SAS have a connector to HDFS? Can it use data directly on HDFS? Any
link or samples or tools?
Thanks!
Jonathan
Just curious, what are the tech specs of your datanodes to accommodate 1PB/day
on 20 nodes? That is roughly 11.6 GB/s of aggregate ingest, i.e. about 580 MB/s
sustained per node, before replication.
On Aug 10, 2011 10:12 AM, jagaran das jagaran_...@yahoo.co.in wrote:
In my current project we are planning to push streams of data to the Namenode
(20 node cluster).
Data volume would be around 1 PB per day.
But
On Wed, Jun 29, 2011 at 01:02, Matei Zaharia ma...@eecs.berkeley.edu wrote:
Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile
your target Hadoop workload and see whether it's communication-bound. Hadoop
jobs can definitely be communication-bound if you shuffle a
Hi,
seems like the perfect use case for Map Reduce yep.
2011/5/26 Mirko Kämpf mirko.kae...@googlemail.com:
Hello,
we are working on a scientific project to analyze information spread in
networks. Our simulations are independent of each other, but we need a
large number of runs and we have
Did you explicitly start a balancer, or did you decommission the nodes
using dfs.hosts.exclude and a dfsadmin -refreshNodes?
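For reference, the usual decommission sequence (a sketch; the excludes file
location is whatever your dfs.hosts.exclude points at):
1. Add the datanode's hostname to the excludes file.
2. Run: hadoop dfsadmin -refreshNodes
3. Wait on the NN web UI for the node to go from "Decommission In Progress"
to "Decommissioned" before shutting it down.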
On Thu, May 5, 2011 at 14:30, Ferdy Galema ferdy.gal...@kalooga.com wrote:
Hi,
On our 15-node cluster (1Gb Ethernet and 4x1TB disks per node) I noticed that
distcp does a
You can configure how many failed volumes a datanode can tolerate.
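The knob in question, as an hdfs-site.xml sketch (the value is an example;
the default of 0 means a single failed volume takes the datanode down):

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>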
On Apr 25, 2011 5:04 PM, Xiaobo Gu guxiaobo1...@gmail.com wrote:
Hi,
I heard from so many people saying we should use JBOD instead of
RAID, that is, we should format each local disk (used for data storage)
into an individual
Check the NN's logs to see the path which led to this.
On Apr 24, 2011 8:41 AM, Peng, Wei wei.p...@xerox.com wrote:
Hi,
I need help very badly.
I got an HDFS permission error by starting to run hadoop job
org.apache.hadoop.security.AccessControlException: Permission denied:
user=wp,
You need to have the native libs on all tasktrackers and have
java.library.path correctly set.
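One way to set it cluster-wide, as a mapred-site.xml sketch (the CUDA path
and heap size are made up):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Djava.library.path=/usr/local/cuda/lib64</value>
</property>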
On Dec 9, 2010 11:01 PM, He Chen airb...@gmail.com wrote:
Hello everyone, I've got a problem when I write some JCuda programs based on
Hadoop MapReduce. I use the JCuda utils. The KernelLauncherSample
no quota on the fs?
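A quick way to check (the path is a placeholder; the quota columns come
first in the output):

hadoop fs -count -q /user/someuser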
On Aug 3, 2009 7:13 AM, Palleti, Pallavi pallavi.pall...@corp.aol.com
wrote:
No. These are production jobs which were working fine, and suddenly
we started seeing these issues. And if you look at the error log,
the jobs are failing at submission time itself while
On Thu, Jul 23, 2009 at 09:20, Ted Dunning ted.dunn...@gmail.com wrote:
Last I heard, the API could be suborned in this scenario. Real credential-based
identity would be needed to provide more than this.
The hack would involve a changed Hadoop library that lies about identity.
This would not