It would help to know your data ingest and processing patterns (and any
applicable SLAs).
In most cases, you'd only need to move the raw ingested data; you can then
derive the rest in the other cluster. Assuming that you have some sort of
date-based partitioning on the ingest, then it's easy to de
Hey Aaron,
I'm still skeptical when it comes to flash drives, especially as it pertains
to Hadoop. The write-cycle limit makes them impractical for dfs.data.dir and
mapred.local.dir, and as you pointed out, you can't use them for logs either.
If you put HADOOP_LOG_DIR in /mnt/d0, you wil
Kinda clunky, but you could do this via shell:
for FILE in $LIST_OF_FILES ; do
  hadoop fs -copyFromLocal "$FILE" "$DEST_PATH" &
done
wait  # let the backgrounded copies finish
If doing this via the Java API, then, yes you will have to use multiple
threads.
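A rough, untested sketch of what that could look like with an ExecutorService
and FileSystem.copyFromLocalFile (the destination path and pool size below are
just placeholders):

  // Untested sketch: parallel uploads via the FileSystem API.
  // Destination path and thread count are made up for illustration.
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ParallelUpload {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      final FileSystem fs = FileSystem.get(conf);
      final Path dest = new Path("/user/me/incoming");   // hypothetical destination dir
      ExecutorService pool = Executors.newFixedThreadPool(8);
      for (final String local : args) {                  // local file names on the command line
        pool.submit(new Runnable() {
          public void run() {
            try {
              fs.copyFromLocalFile(new Path(local), dest);
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();                                   // lets queued copies drain
    }
  }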
On Wed, May 18, 2011 at 1:04 AM, Mapred Learn wrote:
> Thanks harsh !
> That means b
ta?
> 4. Any idea about the loss of efficiency due to the infrastructure?
> 5. The 256-node cluster seems to provide an efficiency of 1.3, while at 512
> nodes it decreases to around 1.2. Would this trend continue?
>
Yes, because of the network capacity.
>
> Raj
Hi, Raj.
Interesting analysis...
These numbers appear to be off. For example, 405s for mappers + 751s for
reducers = 1156s for all tasks. If you have 2000 map and reduce tasks, that
works out to roughly 580ms of actual work per task, which is implausibly low.
- P
This is currently not possible with Hadoop, as the communications protocols
between clients and servers have to be the same version.
On Wed, Nov 10, 2010 at 6:25 AM, Gokulakannan M wrote:
> Hi all,
>
> It would be helpful if someone could share some tips on performing
> rolling upgrad
There could be a number of reasons. It could be a directory-permission
problem with the partitions (the 'hadoop' user cannot rwx them), or a typo
in the dfs.data.dir config.
The directories are checked on datanode startup only.
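If you want a quick sanity check before restarting, a throwaway sketch like
this (run it as the 'hadoop' user; the default paths are only examples) will
show which directories fail the rwx test:

  // Throwaway sketch: verify each configured data dir is rwx for the current user.
  // The default list below is an example; pass your real dfs.data.dir entries as args.
  import java.io.File;

  public class CheckDataDirs {
    public static void main(String[] args) {
      String[] dirs = args.length > 0 ? args
          : new String[] { "/data/1/dfs/dn", "/data/2/dfs/dn" };  // example paths
      for (String d : dirs) {
        File f = new File(d);
        boolean ok = f.isDirectory() && f.canRead() && f.canWrite() && f.canExecute();
        System.out.println(d + " -> " + (ok ? "OK" : "NOT usable"));
      }
    }
  }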
Regards,
- Patrick
On Mon, Nov 8, 2010 at 9:42 AM, Sudhir Vallamkond
Nick,
The corruption may have been caused by running out of disk space. At that
point, even after rebalancing, you will still have corruption. Under normal
circumstances, balancing by itself should not result in corruption.
Regards,
- Patrick
On Wed, Oct 27, 2010 at 9:40 AM, Jones, Nick wrote:
HBase might fit the bill.
On Tue, Oct 26, 2010 at 12:28 PM, Ananth Sarathy wrote:
> I was wondering if there were any projects out there doing a small file
> management layer on top of Hadoop? I know that HDFS is primarily for
> map/reduce but I think companies are going to start using hdfs clus
This is not CDH3-specific... it's related to the Kerberos security patch, so
these upgrade issues will pop up in the Y! distribution, and eventually in
0.22 as well.
These aren't bugs in the code per se; it's just that the upgrade process
going from pre- to post-security is somewhat tricky, and c
Kim, Jamie,
This might be a particular issue with the Cloudera distro, specifically with
the AsyncDiskService related patches that were applied to 0.20.2+320 (aka
CDH3b2).
I've created an issue here:
https://issues.cloudera.org/browse/DISTRO-39
I encourage you (and anyone else reading this) to
I'd also recommend setting mapred.local.dir and dfs.data.dir to something
that is not under /tmp.
Aside from your HDFS data getting wiped, these settings should ideally be
comma-separated paths, one for each physical disk in your server, so you can
aggregate disk I/O.
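The shape of the setting is just a comma-separated list; normally it lives in
hdfs-site.xml and mapred-site.xml, but as a sketch (the paths below are
examples, one per physical disk):

  // Sketch only: these properties normally go in hdfs-site.xml / mapred-site.xml.
  // The paths are examples, one entry per physical disk.
  import org.apache.hadoop.conf.Configuration;

  public class MultiDiskDirs {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      conf.set("dfs.data.dir",
          "/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn");
      conf.set("mapred.local.dir",
          "/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local,/data/4/mapred/local");
      System.out.println(conf.get("dfs.data.dir"));  // sanity check
    }
  }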
2010/8/15 Kevin .
>
> Hi, H
In this case, don't bother with MultipleOutputs.
Specify 2 reducers, and a custom partitioner that sends 'even' records to
partition 0 and 'odd' records to partition 1.
You will have two output files, 'part-00000' and 'part-00001',
corresponding to even and odd records respectively.
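An untested sketch of such a partitioner (old mapred API; I'm assuming an
IntWritable key here, which is my own choice for illustration):

  // Untested sketch: route even keys to reducer 0 and odd keys to reducer 1.
  // Assumes IntWritable keys and Text values; adapt to your own types.
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class EvenOddPartitioner implements Partitioner<IntWritable, Text> {
    public void configure(JobConf job) { }
    public int getPartition(IntWritable key, Text value, int numPartitions) {
      return key.get() % 2 == 0 ? 0 : 1;  // relies on the job running with 2 reducers
    }
  }

Wire it up with conf.setPartitionerClass(EvenOddPartitioner.class) and
conf.setNumReduceTasks(2).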
On Mon, Aug 16, 2010 at 2:
Shuja,
Those settings (mapred.child.java.opts and mapred.child.ulimit) are only used
for child JVMs that get forked by the TaskTracker. You are using Hadoop
Streaming, which means the TaskTracker is forking a JVM for streaming, which
is then forking a shell process that runs your Groovy code (in an
Arun,
Did you specify dfs.hosts.exclude before the NameNode started? If not, you
will have to restart the NameNode. Otherwise, just kill the DataNode.
On Thu, Jul 8, 2010 at 10:01 PM, Arun Ramakrishnan <
aramakrish...@languageweaver.com> wrote:
> When I run fsck everything seems fine. Nothing is
If all you want is dumb storage for small-ish files, you can always just use
NAS or SAN.
For the MP3 example, you might want to consider HBase... you can store
associated meta-data in column families.
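As a very rough sketch of what that could look like (0.90-era HBase client
API; the table, families, and row key are all invented for illustration):

  // Rough sketch: one row per track, file bytes in one family, meta-data in another.
  // Table/column names and the row key are hypothetical.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class Mp3Store {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "songs");                  // hypothetical table
      Put put = new Put(Bytes.toBytes("track-00042"));           // row key = track id
      put.add(Bytes.toBytes("content"), Bytes.toBytes("mp3"), new byte[0]);  // the MP3 bytes
      put.add(Bytes.toBytes("meta"), Bytes.toBytes("artist"), Bytes.toBytes("Some Artist"));
      put.add(Bytes.toBytes("meta"), Bytes.toBytes("album"), Bytes.toBytes("Some Album"));
      table.put(put);
    }
  }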
On Tue, Jul 6, 2010 at 3:33 PM, Ananth Sarathy
wrote:
> So I am aware of the problem with small
Hey Stan,
There's really no way to programmatically spin up an HDFS cluster. What's
your actual goal?
Regards,
- Patrick
p.s., Thanks for all the great comix! ;-)
On Wed, Jun 9, 2010 at 4:48 AM, stan lee wrote:
> Hi Experts,
>
> Although HDFS file system has exposed some APIs which can be us
s) the only
> metric that matters, it seems to me like something very interesting to check
> out...
> I have a hierarchy above me, and they will be happy to understand my choices
> with real numbers to base their understanding on.
> Thanks.
>
>
> On Tue, May 18, 2010 at 5:00 PM, P
Should be evident in the total job running time... that's the only metric
that really matters :)
On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT wrote:
> Thank you,
> Any way I can measure the startup overhead in terms of time?
>
>
> On Tue, May 18, 2010 at 4:27 PM, Pat
Pierre,
Adding to what Brian has said (some things are not explicitly mentioned in
the HDFS design doc)...
- If you have small files that take up < 64MB you do not actually use the
entire 64MB block on disk.
- You *do* use up RAM on the NameNode, as each block represents meta-data
that needs to b
Matias,
Hive partitions map to subdirectories in HDFS. You can do a 'mv' if you're
lucky enough to have each partition in a distinct HDFS file that could be
moved to the right partition subdirectory. Otherwise, you can run a
MapReduce job to collate your data into separate files per partition. You
From what I understand about Sensage, they collect enterprise data to
facilitate compliance driven audits. Of course this can be done, and done
very well in Hadoop. But, at the moment there are no specific off-the-shelf
compliance products based on Hadoop that you can just drop into your
environme
Dan,
Shuffle and Sort is a combination of multiple 'algorithms'.
- Map output goes to a circular, in-memory buffer
- When this starts filling up, it gets 'spilled' to disk
- Spilling involves writing each K/V pair to a partition specific file
(where partition is the algorithm Jim describes below)
Packaging the job and config and sending it to the JobTracker and various
nodes also adds a few seconds overhead.
On Thu, Apr 8, 2010 at 10:37 AM, Jeff Zhang wrote:
> By default, for each task hadoop will create a new jvm process which will
> be
> the major cost in my opinion. You can customize
Hi David,
Strange indeed. I assume nothing in your configs changed. Anything funny in
the logs? You should also rule out the switch itself as being faulty.
It's possible that CDH2 has a patch that's not in 0.20.1 that's causing this
problem, but we haven't heard this exact problem from any of our
My understanding (please correct me, list) is that hadoop will always split
> your files based on the block size setting. The InputSplit and
> RecordReaders
> are used by jobs to retrieve chunks of files for processing - that is,
> there
> are two separate splits happening here: one "physical" split
Yuri,
Probably the easiest thing is to actually create distinct files and
configure the block size per file such that HDFS doesn't split it into
smaller blocks for you.
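For example (untested; the path, replication, and sizes are placeholders), the
FileSystem.create overload that takes a block size lets you do this per file:

  // Untested sketch: write a file with its own, larger block size so HDFS
  // keeps it in a single block. Path and sizes are placeholders.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BigBlockWriter {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      long blockSize = 256L * 1024 * 1024;              // 256MB block for this file only
      FSDataOutputStream out = fs.create(
          new Path("/user/me/whole-file.dat"),          // hypothetical path
          true,                                         // overwrite
          conf.getInt("io.file.buffer.size", 4096),     // buffer size
          (short) 3,                                    // replication
          blockSize);
      out.write(new byte[] { 42 });                     // ... your data goes here
      out.close();
    }
  }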
- P
On Wed, Mar 24, 2010 at 11:23 AM, Yuri K. wrote:
>
> Dear Hadoopers,
>
> i'm trying to find out how and where hadoop spli
Scott,
The code you have below should work, provided that the 'outputPath' points
to an HDFS file. The trick is to get FTP/SCP access to the remote files
using a Java client and receive the contents into a byte buffer. You can then
set that byte buffer into your BytesWritable and call writer.append
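Roughly like this (untested sketch; the actual FTP/SCP fetch is left as a
placeholder method, and the key type is my own choice):

  // Untested sketch: append fetched file contents to a SequenceFile as BytesWritable.
  // fetchRemote() is a hypothetical stand-in for your FTP/SCP client code.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class RemoteToSequenceFile {
    static byte[] fetchRemote(String name) { return new byte[0]; }  // placeholder

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path outputPath = new Path("/user/me/remote-files.seq");  // hypothetical HDFS path
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, outputPath, Text.class, BytesWritable.class);
      for (String name : args) {
        byte[] contents = fetchRemote(name);
        writer.append(new Text(name), new BytesWritable(contents));
      }
      writer.close();
    }
  }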
You can use a custom Partitioner to send keys to a specific reducer. Note
that your reducer will still process one key at a time.
On Mon, Mar 15, 2010 at 1:26 PM, Raymond Jennings III wrote:
> Is it possible to override a method in the reducer so that similar keys
> will be grouped together? Fo
Hello Kai,
To answer your questions:
- Most of the missing stuff from the new API is convenience classes --
InputFormats, OutputFormats, etc. One very handy class that is missing from
the new API is MultipleOutputs which allows you to write multiple files in a
single pass.
- You cannot mix class
Thomas,
Owning your machines and renting a 1/2 cabinet in a colo facility is the
cheapest way to go in the long run.
That said, you could also try www.softlayer.com. You'll get a good idea of
the pricing up front as they allow you to configure everything on the
website. You can also get the Cloud
You can also set that param per-job. Maybe you called some code that did
that behind the scenes?
On Tue, Dec 22, 2009 at 11:10 AM, Mark Vigeant wrote:
> Hey Everyone-
>
> I've been playing around with Hadoop and Hbase for a while and I noticed
> that when running a program to upload data into an
DS,
What you say is true, but there are finer points:
1. Data transfer can begin while the mapper is working through the data.
You would still bottleneck on the network if: (a) you have enough nodes and
spindles such that the aggregate disk transfer speed is greater than the
network c
The '_' character is not legal for hostnames.
On Mon, Nov 30, 2009 at 4:25 PM, pavel kolodin wrote:
>
> Namenode won't start with this messages:
>
> hadoop-0.20.1/logs/hadoop-hadoop-namenode-hadoop_master.log:
>
> http://pastebin.com/m359b9e24
>
> Thank you.
>
Interesting... you have more tokens per line than total lines?
LineRecordReader conveys the line number as the key in the mapper. If I
understand correctly, though, that line number is relative to the input
split, so you could probably use a combination of line number and task ID.
However, based
What does the data look like?
You mention 30k records, is that for 10MB or for 600MB, or do you have a
constant 30k records with vastly varying file sizes?
If the data is 10MB and you have 30k records, and it takes ~2 mins to
process each record, I'd suggest using map to distribute the data acros
You can always do
hadoop fs -text <file>
This will 'cat' the file for you, and decompress it if necessary.
On Thu, Nov 26, 2009 at 7:59 PM, Mark Kerzner wrote:
> It worked!
>
> But why is it "for testing?" I only have one job, so I need by related as
> text, can I use this fix all the time?
>
> Than
From what I understand, it's rather tricky to set up multiple secondary
namenodes. In any case, running multiple 2ndary NNs doesn't get you much.
See this thread:
http://www.mail-archive.com/core-u...@hadoop.apache.org/msg06280.html
On Wed, Oct 21, 2009 at 10:44 AM, Stas Oskin wrote:
> To cl
On Thu, Oct 15, 2009 at 12:32 PM, Edward Capriolo wrote:
>
> >>No need for dedicated SATA drives with
> >>RAID for your OS. Most of that is accessed during boot time so it won't
> >>contend that much with HDFS.
>
> You may want to RAID your OS. If you lose a datanode with a large
> volume of data
After the discount, an equivalently configured Dell comes about 10-20% over
the Silicon Mechanics price. It's close enough that unless you're spending
100k it won't make that much of a difference. Talk to a rep, call them out
on the ridiculous drive pricing, buy at the end of their fiscal quarter.
Hi Tim,
I assume those are single proc machines?
I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we
have dual quad Nehalems (2.26Ghz).
On Thu, Oct 15, 2009 at 11:34 AM, tim robertson
wrote:
> Hi Usmam,
>
> So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
I got the following error while running the example sort program (hadoop
0.20) on a brand new Hadoop cluster (using the Cloudera distro). The job
seems to have recovered. However, I'm wondering whether this is normal or
whether I should be checking for something.
attempt_200910051513_0009_r_05_0: 09/10/15
0, 2009 at 9:06 PM, Ted Dunning
> wrote:
>
> > 2TB drives are just now dropping to parity with 1TB on a $/GB basis.
> >
> > If you want space rather than speed, this is a good option. If you want
> > speed rather than space, more spindles and smaller disks are better.
>
We went with 2 x Nehalems, 4 x 1TB drives, and 24GB RAM. The RAM might be
overkill... but it's DDR3, so you get either 12 or 24GB. Each box has 16
virtual cores, so 12GB might not have been enough. These boxes are around $4k
each, but can easily outperform any $1K box dollar per dollar (and
performanc