For this particular file you have set the replication factor to 10, but
only one replica is available. Did your datanodes go down? Did you retire
datanodes? Or is your number of datanodes less than the replication
factor?
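If the requested replication is simply too high for the cluster, one way to bring an
existing file back down (the path and value here are just placeholders) is:
  hadoop fs -setrep -w 3 /path/to/file
For job.split files specifically, the replication used at submit time is controlled by
mapred.submit.replication, which defaults to 10 in Hadoop 1.x, if I remember correctly.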
On Apr 19, 2013 12:09 PM, "Mohit Vadhera"
wrote:
> Can anybody let me
Can you not simply do a fs -put from the location where the 2 TB file
currently resides? HDFS should be able to consume it just fine, as the
client chunks it into fixed-size blocks.
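For example (the local and HDFS paths are hypothetical):
  hadoop fs -put /data/bigfile.dat /user/hadoop/bigfile.dat
The file does not need to fit on the namenode; the client streams the blocks to the
datanodes and the namenode only tracks the block metadata.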
On Fri, Apr 19, 2013 at 10:05 AM, 超级塞亚人 wrote:
> I have a problem. Our cluster has 32 nodes. Each disk is 1TB. I
Can anybody let me know the meaning of the log message below, please: "Target Replicas
is 10 but found 1 replica(s)."?
/var/lib/hadoop-hdfs/cache/mapred/mapred/staging/test_user/.staging/job_201302180313_0623/job.split:
Under replicated
BP-2091347308-172.20.3.119-1356632249303:blk_6297333561560198850_70720.
On Thu, Apr 18, 2013 at 9:23 PM, Mark Kerzner wrote:
> Hi,
>
> my clusters are on EC2, and they disappear after the cluster's instances are
> destroyed. What is the best practice to collect the logs for later storage?
>
> EC2 does exactly that with their EMR, how do they do it?
Apache Flume could
I have a problem. Our cluster has 32 nodes. Each disk is 1 TB. I want to
upload a 2 TB file to HDFS. How can I put the file on the namenode and upload
it to HDFS?
Can anybody help me start the jobtracker service? It is urgent. It looks
like a permission issue.
What permission should I give, and on which directory? I am pasting the log for the same.
The service starts and then stops.
2013-04-19 02:21:06,388 FATAL org.apache.hadoop.mapred.JobTracker:
org.apache.hadoop.security.AccessCont
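The exception above is cut off, but a common cause of an AccessControlException at
jobtracker startup is that the mapred.system.dir location in HDFS is not owned by the
user running the jobtracker. Something along these lines (the path, user, and group are
placeholders; adjust to your configuration) may clear it:
  hadoop fs -chown -R mapred:hadoop /mapred/system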
Actually, the problem is not simple. There are three companies working on
exactly these problems:
- Loggly:
http://loggly.com/
Loggly is a part of the Amazon Marketplace:
https://aws.amazon.com/solution-providers/isv/loggly
- Papertrail:
https://papertrailapp.com/
How to do it:
ht
I set bandwidthPerSec = 104857600, but when I add a new data node, and run
hadoop balancer, the bandwidth is only 1MB/s, and the datanode log shows
that:
org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s
My hadoop core version is 1.0.3
Thanks
2013/4/19 Than
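One thing worth checking, though it is only a guess from that log line: in 1.0.3,
dfs.balance.bandwidthPerSec is read by each datanode from its own hdfs-site.xml when the
datanode starts, so setting it only on the machine that runs the balancer, or setting it
without restarting the datanodes, leaves them at the 1048576 default. Putting the
property on every datanode and restarting them should make the new value show up in
that log line:
  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>104857600</value>
  </property>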
So you are saying the problem is very simple: just before you destroy the
cluster, simply collect the logs to S3. Anyway, I only need them after I
have completed a specific computation, so I don't have any special
requirements.
In regular permanent clusters, is there something that allows yo
When you destroy an EC2 instance, the correct behavior is to erase all
data.
Why don't you create a service that collects the logs directly to an S3 bucket,
either in real time or in a batch every 5 minutes?
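For example (the bucket name and log path are hypothetical), a cron entry on each node
could push the daemon logs using the S3 filesystem support that ships with Hadoop:
  hadoop fs -put /var/log/hadoop/*.log s3n://my-cluster-logs/$(hostname)/
assuming fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are configured; plain
s3cmd or the AWS CLI would work just as well.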
2013/4/18 Mark Kerzner
> Hi,
>
> my clusters are on EC2, and they disappear after the cluster's instances
Hi,
my clusters are on EC2, and they disappear after the cluster's instances
are destroyed. What is the best practice to collect the logs for later
storage?
EC2 does exactly that with their EMR, how do they do it?
Thank you,
Mark
Not really; federation provides separate namespaces, but I want it to look
like one namespace. My basic idea is to maintain a map from files to
namenodes; it receives RPC calls from clients and forwards them to the specific
namenode in charge of the file. It's challenging for me but
I'll figure out whe
Are you trying to implement something like namespace federation? That's
part of Hadoop 2.0 -
http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-hdfs/Federation.html
On Thu, Apr 18, 2013 at 10:02 PM, Lixiang Ao wrote:
> Actually I'm trying to do something like combining mult
Well, since the DistributedCache is used by the tasktracker, you need to
update the log4j configuration file used by the tasktracker daemon. And you
need to get the tasktracker log file from the machine where you see the
distributed cache problem.
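For example, in the tasktracker's log4j.properties (the package name is my assumption of
where the 1.x distributed cache classes live), something like:
  log4j.logger.org.apache.hadoop.filecache=DEBUG
then restart the tasktracker and look at the hadoop-*-tasktracker-*.log on that node.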
On Fri, Apr 19, 2013 at 6:27 AM, wrote:
> Hi
Thanks guys for updating!
Yeah, I read in the thread that the Checkpoint/BackupNode may get deprecated.
The SNN is the way to go then.
I just wonder whether, if we use multiple CheckpointNodes, we might run into the
situation where a checkpoint is ongoing but the first
CheckpointNode is slow, and then the secon
With files that small, it is much better to write a custom input format
which checks the entire file and only passes records from good files. If
you need Hadoop, you are probably processing a large number of these files,
and an input format could easily read the entire file and handle it if it
as as s
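A rough sketch of that idea, using the new mapreduce API and hypothetical class names;
the whole file is buffered in memory (fine here, since the files are small) and the
validation rule is only a placeholder:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class ValidWholeFileInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // one map task sees the whole file
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    return new ValidatingRecordReader();
  }

  static class ValidatingRecordReader extends RecordReader<LongWritable, Text> {
    private final LineRecordReader lines = new LineRecordReader();
    private final List<String> buffer = new ArrayList<String>();
    private int pos = -1;
    private boolean fileIsValid = true;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException {
      lines.initialize(split, context);
      // read and validate the entire file up front
      while (lines.nextKeyValue()) {
        String line = lines.getCurrentValue().toString();
        if (!isValidRow(line)) {
          fileIsValid = false; // one bad row invalidates the whole file
          buffer.clear();
          break;
        }
        buffer.add(line);
      }
    }

    @Override
    public boolean nextKeyValue() {
      return fileIsValid && ++pos < buffer.size(); // emit nothing for a bad file
    }

    @Override
    public LongWritable getCurrentKey() { return new LongWritable(pos); }

    @Override
    public Text getCurrentValue() { return new Text(buffer.get(pos)); }

    @Override
    public float getProgress() {
      return buffer.isEmpty() ? 1.0f : Math.min(1.0f, (pos + 1) / (float) buffer.size());
    }

    @Override
    public void close() throws IOException { lines.close(); }

    private boolean isValidRow(String line) {
      return !line.trim().isEmpty(); // placeholder; replace with your real check
    }
  }
}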
Hello Thanh,
Just to keep you updated, the checkpoint node might get deprecated, so
it's always better to use the secondary namenode. More on this can be found
here:
https://issues.apache.org/jira/browse/HDFS-2397
https://issues.apache.org/jira/browse/HDFS-4114
Warm Regards,
Tariq
https://mtar
For more information : https://issues.apache.org/jira/browse/HADOOP-7297
It has been corrected, but the stable documentation is still 1.0.4
(prior to the correction).
See
* http://hadoop.apache.org/docs/r1.0.4/hdfs_user_guide.html
* http://hadoop.apache.org/docs/r1.1.1/hdfs_user_guide.html
* ht
Never mind, got it fixed.
Thanks,
Som
On Tue, Apr 16, 2013 at 6:18 PM, Som Satpathy wrote:
> Hi All,
>
> I have just set up a CDH cluster on EC2 using cloudera manager 4.5. I have
> been trying to run a couple of mapreduce jobs as part of an oozie workflow
> but have been blocked by the followi
It would be important to point to the document (which I believe is
http://hadoop.apache.org/docs/stable/hdfs_user_guide.html) and the version
of Hadoop you are interested in. At one time, the documentation was
misleading. The 1.x version didn't have checkpoint/backup nodes, only the
secondary namenode.
Hi all,
In my mapreduce job, I would like to process only whole input files containing
only valid rows. If one map task processing an input split of a file detects an
invalid row, the whole file should be "marked" as invalid and not processed at
all. This input file will then be cleansed by ano
It is rarely practical to do exhaustive comparisons on datasets of this
size.
The method used is to heuristically prune the cartesian product set and
only examine pairs that have a high likelihood of being near.
This can be done in many ways. Your suggestion of doing a map-side join is
a reasona
so reliability (to prevent metadata loss) is the main motivation for
multiple checkpoint nodes?
Does anybody use multiple checkpoint nodes in real life?
Thanks
On Thu, Apr 18, 2013 at 12:07 PM, shashwat shriparv <
dwivedishash...@gmail.com> wrote:
> more checkpoint nodes means more backup of t
more checkpoint nodes means more backup of the metadata :)
*Thanks & Regards*
∞
Shashwat Shriparv
On Thu, Apr 18, 2013 at 9:35 PM, Thanh Do wrote:
> Hi all,
>
> The document says "Multiple checkpoint nodes may be specified in the
> cluster configuration file".
>
> Can some one clarify me
Actually I'm trying to do something like combining multiple namenodes so
that they present themselves to clients as a single namespace, implementing
basic namenode functionalities.
On Thursday, April 18, 2013, Chris Embree wrote:
> Glad you got this working... can you explain your use case a little? I'm
> tryi
What do you mean by "doesn't work"?
On Thu, Apr 18, 2013 at 10:01 AM, zhoushuaifeng wrote:
> **
> Hi,
> I set the hdfs balance bandwidth from 1048576 to 104857600, but it doesn't
> work, what's wrong?
> Does anyone encounter the same problem?
> Thanks a lot.
>
>
> dfs.balance.bandwidthPerSec
Hi all,
The document says "Multiple checkpoint nodes may be specified in the
cluster configuration file".
Can someone clarify why we really need to run multiple checkpoint
nodes anyway? Is it possible that while checkpoint node A is doing a
checkpoint, checkpoint node B kicks in and d
Glad you got this working... can you explain your use case a little? I'm
trying to understand why you might want to do that.
On Thu, Apr 18, 2013 at 11:29 AM, Lixiang Ao wrote:
> I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works!
> Everything looks fine now.
>
> Seems d
On Thu, Apr 18, 2013 at 4:49 AM, Hadoop Explorer
wrote:
> I have an application that evaluate a graph using this algorithm:
>
> - use a parallel for loop to evaluate all nodes in a graph (to evaluate a
> node, an image is read, and then result of this node is calculated)
>
> - use a second paralle
I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works!
Everything looks fine now.
Seems direct command "hdfs namenode" gives a better sense of control :)
Thanks a lot.
On Thursday, April 18, 2013, Harsh J wrote:
> Yes you can but if you want the scripts to work, you should have them
> use a
The approach which I proposed will have (m + n) I/O for reading the datasets, not
(m + n + m*n); but further I/O due to spills and to the reducer reading mapper output
will be higher, as the number of tuples coming out of the mapper is (m + m*n).
Regards,
Ajay Srivastava
On 18-Apr-2013, at 5:40 PM, zheyi
Hi,
I set the hdfs balance bandwidth from 1048576 to 104857600, but it doesn't
work, what's wrong?
Does anyone encounter the same problem?
Thanks a lot.
dfs.balance.bandwidthPerSec
104857600
Specifies the maximum amount of bandwidth that each datanode
can utilize for the b
Here's a rough guideline:
Moving a cluster isn't all that different from upgrading it. The initial steps
are the same:
- stop your mapreduce services
- switch your namenode to safe mode
- generate a final image with -saveNamespace
- stop your hdfs services
- back up your metadata - as long as you
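In command form (Hadoop 1.x scripts; adjust paths to your install), those first steps
are roughly:
  stop-mapred.sh
  hadoop dfsadmin -safemode enter
  hadoop dfsadmin -saveNamespace
  stop-dfs.sh
followed by copying the contents of dfs.name.dir somewhere safe.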
Thank you, now I get your point.
But I wonder whether this approach would be slower than
implementing a custom InputFormat which, each time, provides a pair of
lines to the mappers, then doing the product in the mappers? (in
Since your approach would need (m + n + m*n) I/O on the mapper side, and
(2*m*n) I/O in
I have an application that evaluate a graph using this algorithm:
- use a parallel for loop to evaluate all nodes in a graph (to evaluate a node,
an image is read, and then result of this node is calculated)
- use a second parallel for loop to evaluate all edges in the graph. The
function woul
Yes, that's a crucial part.
Write a class which extends WritableComparator and override the compare method.
You need to set this class in the job client as
job.setGroupingComparatorClass(<your grouping comparator class>).
This will make sure that records having the same Ki will be grouped together and
will go t
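A small sketch of such a comparator, assuming the composite key is a Text of the form
"Ki|datasetTag" (the key layout and class names are my assumptions, not from the thread):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class FirstPartGroupingComparator extends WritableComparator {

  public FirstPartGroupingComparator() {
    super(Text.class, true); // create key instances so deserialized keys can be compared
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // group only on the Ki part of the key; ignore the dataset tag after '|'
    String ka = ((Text) a).toString().split("\\|", 2)[0];
    String kb = ((Text) b).toString().split("\\|", 2)[0];
    return ka.compareTo(kb);
  }
}

// in the driver:
// job.setGroupingComparatorClass(FirstPartGroupingComparator.class);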
Hi Ajay Srivastava,
Thank you for your reply.
Could you please explain a little bit more on "Write a grouping comparator
which group records on first part of key i.e. Ki." ?
I guess it is a crucial part, which could filter some pairs before passing
them to the reducer.
Regards,
Zheyi Rong
O
Hi all,
I was wondering if there is a good reason why public
Configuration(Configuration other) constructor in Hadoop 1.0.4 doesn't
clone the classloader of "other" into the new Configuration?
Is this a bug?
I'm asking because I'm trying to run a Hadoop client in an OSGi environment
and I need to pa
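I don't know the reason for the behavior, but as a possible workaround (an untested
sketch, not a confirmed fix) the classloader can be carried over explicitly:

import org.apache.hadoop.conf.Configuration;

public class ConfCopyWithClassLoader {
  public static Configuration copyOf(Configuration original) {
    Configuration copy = new Configuration(original);
    // the copy constructor in 1.0.4 does not carry the classloader over,
    // so set it explicitly from the source Configuration
    copy.setClassLoader(original.getClassLoader());
    return copy;
  }
}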
Hi Rong,
You can use following simple method.
Lets say dataset1 has m records and when you emit these records from mapper,
keys are K1,K2 ….., Km for each respective record. Also add an identifier to
identify dataset from where records is being emitted.
So if R1 is a record in dataset1, the mapp
This is not suitable for his large dataset.
--Sent from my Sony mobile.
On Apr 18, 2013 5:58 PM, "Jagat Singh" wrote:
> Hi,
>
> Can you have a look at
>
> http://pig.apache.org/docs/r0.11.1/basic.html#cross
>
> Thanks
>
>
> On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong wrote:
>
>> Dear all,
>>
>>
Hi,
Can you have a look at
http://pig.apache.org/docs/r0.11.1/basic.html#cross
Thanks
On Thu, Apr 18, 2013 at 7:47 PM, zheyi rong wrote:
> Dear all,
>
> I am writing to kindly ask for ideas of doing cartesian product in hadoop.
> Specifically, now I have two datasets, each of which contains
Hi
Has someone already worked with the Oracle Big Data Appliance?
thanks
Dear all,
I am writing to kindly ask for ideas of doing cartesian product in hadoop.
Specifically, now I have two datasets, each of which contains 20million
lines.
I want to do a cartesian product on these two datasets, comparing lines
pairwise.
The output of each comparison can be mostly filtere
Yes you can, but if you want the scripts to work, you should have them
use a different PID directory (I think it's called HADOOP_PID_DIR)
every time you invoke them.
I instead prefer to start the daemons up via their direct command, such
as "hdfs namenode", and so on, and move them to the background, with
Hi all,
Can I run multiple HDFS instances, that is, n separate namenodes and n
datanodes, on a single machine?
I've modified core-site.xml and hdfs-site.xml to avoid port and file
conflicts between the HDFS instances, but when I started the second HDFS, I got the
errors:
Starting namenodes on [localhost]
l
Hi,
I have a Kerberos KDC running and also have apache Hadoop 1.0.4
running on a cluster. Is there some kind of documentation I can use to link
the two?
Basically, I'm trying to make my hadoop cluster secure.
Thanks,
Chris
On Wed, Apr 17, 2013 at 3:30 PM, Aaron T. Myers wrote:
> Hi Chr
Hi Hemanth,
I guess the only solution is to delete the .crc files after the export.
Does anyone know whether someone filed a JIRA to implement a parameter
for -getmerge that deletes the .crc files afterwards?
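For now, something like this right after the export does the job (the file name is just
an example; the checksum file written by the local filesystem is hidden and dot-prefixed):
  hadoop fs -getmerge /user/fabio/output merged.csv
  rm -f .merged.csv.crc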
*Fabio Pitzolu*
Consultant - BI & Infrastructure
Mob. +39 3356033776
Telefono 02 871