Ok but does that imply that only 1 of your compute nodes is promised
to have all of the data for any given row? The blocks will replicate,
but they don't necessarily all replicate to the same nodes, right?
So if I have say 2 column families (cf1, cf2) and there are 2 physical
files on the HDFS for
Hi,
The number of maps specified for any MapReduce program (including
those that are part of MRBench) is generally only a hint; the actual number
of maps will in typical cases be influenced by the amount of data
being processed. You can take a look at this wiki link to understand
more: http://wiki.apac
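To illustrate, a minimal sketch with the old mapred API (the class name and
numbers are just placeholders):

// setNumMapTasks() is only advisory; the actual map count comes from
// InputFormat.getSplits() over the input data (roughly one map per block).
import org.apache.hadoop.mapred.JobConf;

public class MapHintSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MapHintSketch.class);
        conf.setNumMapTasks(10);   // a hint, not a guarantee
        conf.setNumReduceTasks(2); // reduce count, by contrast, is honored exactly
    }
}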
Hi,
Locations are per block (a file is a set of blocks, a block is replicated
on multiple hdfs datanodes).
We have locality in HBase because hdfs datanodes are deployed on the same
box as the hbase regionserver, and hdfs writes one replica of the blocks on
the datanode on the same machine as the clien
Hi,
Thanks for your replies. I will try working on recommended suggestions and
provide feedback.
Abhi,
In the JobTracker Web UI -> Job Tracker History, go to the specific job. Go
to Reduce Task List. Enter into the first reduce task attempt. There you
can see the start time. It is the time when
I have been reading up on HBase and my understanding is that the
physical files on the HDFS are split first by region and then by
column families.
Thus each column family has its own physical file (on a per-region basis).
If I run a MapReduce task that uses HBase as input, wouldn't this
imply
HADOOP policy has changed.
Any user wanting to unsubscribe needs to donate USD 100/- for Obama's
campaign before the request is accepted.
From: Georgi Georgiev [mailto:g.d.georg...@gmail.com]
Sent: 29 August 2012 03:31
To: user@hadoop.apache.org
Cc: Hennig, Ryan
Subject: Re: unsubscr
So, looking at the source and single-stepping through a simple test case, I have
to make a directory using FileSystem.mkdir().
I see that I am getting a 404 (nothing new there). However, the code that is
throwing this looks like the below. Note the comment about being "brittle".
Seems it's looki
Can you read the data off backup tapes and dump it to flat files?
Artem Ervits
Data Analyst
New York Presbyterian Hospital
From: Marcos Ortiz [mailto:mlor...@uci.cu]
Sent: Tuesday, August 28, 2012 06:51 PM
To: user@hadoop.apache.org
Cc: Siddharth Tiwari
Subject: Re: Hadoop and MainFrame integr
Is it possible to copy an org.apache.hadoop.conf.Configuration into another
configuration object without creating a new instance? I am seeking
something like new Configuration(Configuration) but without creating a new
destination object (it is managed by Spring).
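For illustration, this is the kind of helper I can write by hand (a minimal
sketch; Configuration implements Iterable<Map.Entry<String, String>>), but I
am hoping something built-in exists:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public class ConfCopySketch {
    // Walk src's entries and set each one on the existing
    // (Spring-managed) dest, instead of constructing a new Configuration.
    public static void copyInto(Configuration src, Configuration dest) {
        for (Map.Entry<String, String> e : src) {
            dest.set(e.getKey(), e.getValue());
        }
    }
}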
unsubscribe
--
Regards,
唐李洋
The problem with it is that Hadoop relies on HDFS, which stores data in
blocks of 64/128 MB in size (or the size that you determine; 64 MB is
the de facto default), and then makes the calculations over those blocks.
So, you need to move all your data to an HDFS cluster to use the data in
MapReduce jobs if you want to mak
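For reference, a minimal sketch of setting the block size from the client
side (this assumes hadoop-1.x, where the property is dfs.block.size, in
bytes; the path is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 128 MB instead of the 64 MB default; applies to files this
        // client writes from now on.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        fs.create(new Path("/user/demo/bigfile.dat")).close();
    }
}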
I even got emails from people not in the office, by sending the email below -
that's crazy!
g
On Wed, Aug 29, 2012 at 12:56 AM, Georgi Georgiev wrote:
> guys - what's going wrong with these requests - can't you just teach people
> to act appropriately - send regular mails to unsubscribe - really a lot
guys - what's going wrong with these requests - can't you just teach people
to act appropriately - send regular mails to unsubscribe - really a lot of
spam in my inbox.
cheers,
g
On Wed, Aug 29, 2012 at 12:08 AM, Fabio Pitzolu wrote:
> Epic Ryan!!!
>
> Sent from my Windows Phone
> -
Epic Ryan!!!
Sent from my Windows Phone
--
From: Hennig, Ryan
Sent: 28/08/2012 21:14
To: user@hadoop.apache.org
Subject: RE: unsubscribe
Error: unsubscribe request failed. Please retry again during a full
moon.
From: Alberto Andreotti [mailto:albertoandreo...
Thanks, I should have been clearer. I am not attempting to perform a map
reduce job. I was literally trying to use the FileSystem abstraction (rather
than using the jets3t library directly) to access S3. I was assuming it handled the
mocking of directories in s3 (as it is not a native feature of
I don't mean that.
I mean that a k-means clustering with pretty large clusters is a useful
auxiliary data structure for finding nearest neighbors. The basic outline
is that you find the nearest clusters and search those for near neighbors.
The first riff is that you use a clever data structure f
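To make the outline concrete, a minimal sketch in plain Java (all names are
hypothetical, and this assumes a k-means run has already assigned the points
to centroids):

import java.util.*;

public class ClusterKnnSketch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // members.get(c) holds the points assigned to centroid c by k-means.
    static double[] nearestNeighbor(double[] query, List<double[]> centroids,
                                    Map<Integer, List<double[]>> members, int probes) {
        // Rank centroids by distance to the query and keep the closest few.
        Integer[] order = new Integer[centroids.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(query, centroids.get(i))));

        double best = Double.MAX_VALUE;
        double[] bestPoint = null;
        // Search only the probed clusters instead of the whole data set.
        for (int p = 0; p < probes && p < order.length; p++) {
            for (double[] candidate : members.get(order[p])) {
                double d = dist(query, candidate);
                if (d < best) { best = d; bestPoint = candidate; }
            }
        }
        return bestPoint;
    }
}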
hi Gaurav,
Can you tell me how you calculated the total shuffle time? Apart from
combiners and compression, you can also use some shuffle/sort
parameters that might increase the performance; I am not sure exactly
which parameters to tweak. Please share if you come across some other
techniques. I am
Rahul,
Surely there is some way that you could share the text output of a command,
without sending a giant screenshot of your entire desktop?
- Ryan
Error: unsubscribe request failed. Please retry again during a full moon.
From: Alberto Andreotti [mailto:albertoandreo...@gmail.com]
Sent: Thursday, August 23, 2012 9:00 AM
To: user@hadoop.apache.org
Subject: unsubscribe
unsubscribe
--
José Pablo Alberto Andreotti.
Tel: 54 351 4730292
Móvil: 5
Hi, there
I'm wondering if there is a sort after the mapper and before the combiner,
like the one in the reducer? I know I am probably wrong. My guess is that in
the mapper the pairs are in memory and a hash of lists should do the work.
Thank you very much!
On 28 August 2012 09:24, Siddharth Tiwari wrote:
> Hi Users.
>
> We have flat files on mainframes with around a billion records. We need to
> sort them and then use them with different jobs on mainframe for report
> generation. I was wondering whether there was any way I could integrate the
> mainframe
> 2. how/where does the namenode stores the datanodes's storageids ?
When the datanode connects to the namenode for the first time, it will
register with the namenode, and during that registration the storage ID
for the datanode is generated.
>
> 4. can I format/reset the na
build a custom transfer mechanism in Java and use a zAAP so you won't
consume MIPS
On Aug 28, 2012 6:24 PM, "Siddharth Tiwari"
wrote:
> Hi Users.
>
> We have flat files on mainframes with around a billion records. We need to
> sort them and then use them with different jobs on mainframe for repo
Hi all,
I am using Python on CDH3u3 for streaming. I do not know how to provide
command-line arguments. My Python mapper takes in 3 arguments - 2 input
files and one placeholder for an output file. I am doing something like
this, but it fails. Where am I going wrong? What other options do I have? A
Hi,
Here is an example, might help you.
http://muhammadkhojaye.blogspot.in/2012/04/how-to-run-amazon-elastic-mapreduce-job.html
Cheers!
Manoj.
On Tue, Aug 28, 2012 at 12:55 PM, Chris Collins
wrote:
>
>
>
> Hi I am trying to use the Hadoop filesystem abstraction with S3 but in my
> tinkering
Hi I am trying to use the Hadoop filesystem abstraction with S3 but in my
tinkering I am not having a great deal of success. I am particularly
interested in the ability to mimic a directory structure (since s3 native
doesn't do it).
Can anyone point me to some good example usage of Hadoop FileS
Can you please explain how to automate the process of sending files back
and forth from Mainframe? Is it done using NDM?
I have a requirement to migrate Mainframe to Hadoop. I am looking for more
information in this area about the economics, process and tools that
support migration etc. Any inf
Hi Abhay
The map outputs are deleted only after the reducer runs to completion.
>Is it possible to run the same attempt again? Does killing the child java
>process or tasktracker on the node help? (since hadoop may schedule a reduce
>attempt on another node).
Yes, it is possible to re-attempt
At some point in the work flow you're going to have to transfer the file
from the mainframe to the Hadoop cluster for processing, and then send it
back for storage on the mainframe.
You should be able to automate the process of sending the files back and
forth.
It's been my experience that it's o
Hi Users.
We have flat files on mainframes with around a billion records. We need to sort
them and then use them with different jobs on mainframe for report generation.
I was wondering whether there was any way I could integrate the mainframe with
Hadoop, do the sorting, and keep the file on the server i
Hi,
hope it's not a newbie question...
I installed several versions of hadoop for testing,
(0.20.203, 0.21.0, and 1.0.3)
on various machines.
Now I am using 1.0.3 on all the machines.
I face a problem that on some of the machines, the datanode gets no
storageID from the namenode.
where it works, t
Without knowing your exact workload, using a Combiner (if possible) as
Tsuyoshi recommended should decrease your total shuffle time. You can also
try compressing the map output so that there's less disk and network IO.
Here's an example configuration using Snappy:
conf.set("mapred.compress.map.output", "true");
conf.set("mapred.map.output.compression.codec",
    "org.apache.hadoop.io.compress.SnappyCodec");
Right, but if I understood your suggestion, you look at the end goal,
which is:
1[40.123,-50.432]\t[[41.431,-43.32],[...,...],...,[...]]
for example, and you say: here we basically see a cluster; that cluster is
represented by the point [40.123,-50.432].
Which points does this cluster contain?
Hi, Tao. Is this problem only with 2.0.1, or with both versions?
Have you tried using distcp from 1.0.3 to 1.0.3?
On 28/08/2012 11:36, Tao wrote:
>
> Hi, all
>
> I use distcp copying data from hadoop1.0.3 to hadoop 2.0.1.
>
> When the file path (or file name) contains Chinese characters, an
> e
Hi, all
I use distcp copying data from hadoop1.0.3 to hadoop 2.0.1.
When the file path (or file name) contains Chinese characters, an
exception will be thrown, like the one below. I need some help with this.
Thanks.
[hdfs@host ~]$ hadoop distcp -i -prbugp -m 14
Regards to all the list.
Well, you should ask the folks at Tumblr; they use a combination
of MySQL and HBase for their blogging platform. They talked about this
topic at the last HBaseCon. Here is the link:
http://www.hbasecon.com/sessions/growing-your-inbox-hbase-at-tumblr/
Blake Mathen
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan wrote:
>
> I understand your solution (I think); didn't think of that, in that
> particular way.
> I think that, let's say I have 1M data points, and running kNN, that the
> k=1M and n=10 (each point is a cluster that requires up to 10 points)
> is a
Hello,
I have a MR job which has 4 reducers running.
One of the reduce attempts has been pending for a long time in the reduce->copy phase.
The job is not able to complete because of this.
I have seen that the child java process on tasktracker is running.
Is it possible to run the same attempt again? Does
The host at the top of the stack trace matches the host and port I defined
for mapred.job.tracker in mapred-site.xml.
Other than that, I don't know how to verify what you asked me. Any tips?
On Tue, Aug 28, 2012 at 3:47 PM, Harsh J wrote:
> Are you sure you're reaching the right port for your JobTrc
Dear Ted,
I understand your solution (I think); didn't think of that, in that
particular way.
I think that, let's say I have 1M data points, and running kNN, that the
k=1M and n=10 (each point is a cluster that requires up to 10 points)
is an overkill.
How can I achieve the same result WITHOUT u
Are you sure you're reaching the right port for your JobTracker?
On Tue, Aug 28, 2012 at 7:15 PM, Caetano Sauer wrote:
> Hello,
>
> I am getting the following error when trying to execute a hadoop job on a
> 5-node cluster:
>
> Caused by: java.io.IOException: Call to *** failed on local exception:
Hi Harsh
Thanks for the reply - my understanding is that with MultipleOutputs I can
write differently named files into the same target directory. With
MultipleTextOutputFormat I was able to override the target directory name to
perform the segmentation, by overriding generateFileNameForKeyValue
Apologies for the spelling, Harsh.
Regards,
Mohammad Tariq
On Tue, Aug 28, 2012 at 6:22 PM, Mohammad Tariq wrote:
> Hello Hars,
>
>Sorry for the intrusion. I just wanted to ask the same for
> WholeFileInputFormat. I am not able to find it in 1.0.3.
>
> Regards,
> Mohammad Tariq
Hello Hars,
Sorry for the intrusion. I just wanted to ask the same for
WholeFileInputFormat. I am not able to find it in 1.0.3.
Regards,
Mohammad Tariq
On Tue, Aug 28, 2012 at 6:14 PM, Harsh J wrote:
> The Multiple*OutputFormat have been deprecated in favor of the generic
> Multip
I checked the "mapred.tmp.local" directory on the node which is running the
reducer attempt, and it seems there is around 1G of space available (though
that's not much).
On Tue, Aug 28, 2012 at 3:55 PM, Joshi, Rekha wrote:
> Hi Abhay,
>
> Ideally the error line - "Caused by:
> org.apache.hadoop.util.DiskC
The Multiple*OutputFormat have been deprecated in favor of the generic
MultipleOutputs API. Would using that instead work for you?
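For instance, a minimal sketch (class name and type parameters are
hypothetical) using the write(key, value, baseOutputPath) overload; a "/"
in baseOutputPath creates subdirectories under the job output directory:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SegmentingReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Written as <outputdir>/<key>/part-r-00000, for example.
            mos.write(key, value, key.toString() + "/part");
        }
    }

    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // flush all the underlying record writers
    }
}

In the driver you may also want
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) so the
default output doesn't leave empty part files behind.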
On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton wrote:
> Hi,
>
> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good
> for writing results into
Hi,
I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good
for writing results into (for example) different directories created on the
fly. However, now that I'm implementing a MapReduce job using Hadoop 1.0.3, I see
that the new API no longer supports MultipleTextOutputFormat.
I'm facing the exact same issue on 0.20.2-cdh3u0.
Does anybody have an idea?
Tnx.
Best,
Christoph
--
From: Matthias Zengler
Subject: Re: pending clean up step
Date: Tue, 14 Feb 2012 13:17:04 GMT
Hello,
I have a question regarding a problem with hadoop.
At fir
Hi Abhay,
Ideally the error line - "Caused by:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid
local directory for output/map_128.out" suggests you either do not have
permissions for output folder or disk is full.
Also, 5 is not a big number for thread spawning, (
hi Mohammad,
Thank you for the reminder.
I have checked the two directories and set them to /home/hadoopfs/data and
/home/hadoopfs/name, not under /tmp.
My problem has now been solved. Thank you.
On Mon, Aug 27, 2012 at 4:31 PM, Mohammad Tariq wrote:
> Hello Charles,
>
>Have you ad
It depends on the workload. Could you tell us more details about
your job? In the general case where reducers are the bottleneck, there are
some tuning techniques, as follows:
1. Allocate more memory to reducers. It decreases disk IO of reducers
when merging and running reduce functions.
2. Use combine fu
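For 2, a minimal word-count-style sketch (the class names are my own; the
key requirement is that the reduce function be associative and commutative,
so the reducer can double as the combiner):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerExample {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Associative and commutative, so it is safe to run as a combiner:
    // partial sums on the map side shrink the data that gets shuffled.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount with combiner");
        job.setJarByClass(CombinerExample.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(IntSumReducer.class); // the line that enables map-side combining
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}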
Hi,
I have run some large and small jobs and calculated the Total Shuffle Time
for the jobs. I can see that the Total Shuffle Time is almost half the
total time taken by the full job to complete.
My question here is: how can we decrease the Total Shuffle Time? And in
doing so, what w
Hi Users,
I have a 12 node CDH3 cluster where I am planning to run some benchmark
tests. My main intention is to run the benchmarks first with the default
Hadoop configuration and then analyze the outcomes and tune the Hadoop
parameters accordingly to increase the performance of my cluster.
Can some