Ning Li wrote:
With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.
I don't think HADOOP-4801 is required. It would help, certainly, but
it's so fraught with security and other issues that I doubt it will be
committed anytime soon.
Konstantin Shvachko wrote:
The port was not specified at all in the original configuration.
Since 0.18, the port is optional. If no port is specified, then 8020 is
used. 8020 is the default port for namenodes.
https://issues.apache.org/jira/browse/HADOOP-3317
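For example (hostname illustrative), these two URIs name the same filesystem:
hdfs://namenode.example.com:8020/user/foo
hdfs://namenode.example.com/user/foo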
Doug
Ian Swett wrote:
We've used Jackson (http://jackson.codehaus.org/), which we've found to be easy
to use and faster than any other option.
I also use Jackson and recommend it.
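A minimal sketch using the Jackson 1.x (org.codehaus.jackson) API:

import java.util.Map;
import org.codehaus.jackson.map.ObjectMapper;

public class JsonDemo {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();  // reusable across calls
    // Parse a JSON object into a Map and pull out one field.
    Map<?, ?> parsed = mapper.readValue("{\"count\": 42}", Map.class);
    System.out.println(parsed.get("count"));   // prints 42
  }
}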
Doug
I think they're complementary.
Hadoop's MapReduce lets you run computations on up to thousands of
computers potentially processing petabytes of data. It gets data from
the grid to your computation, reliably stores output back to the grid,
and supports grid-global computations (e.g.,
Hi, Ian.
One reason is that a MapFile is represented by a directory containing
two files named index and data. SequenceFileInputFormat handles
MapFiles too by, if an input file is a directory containing a data file,
using that file.
Another reason is that's what reduces generate.
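For example, a reduce output directory of MapFiles looks something like this
(names illustrative):

part-00000/data    <- the sorted key/value records
part-00000/index   <- a sparse index from keys to offsets in data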
Neither
Bryan Duxbury wrote:
Hm, very interesting. Didn't know about that. What's the purpose of the
reservation? Just to give root preference or leave wiggle room?
I think it's so that, when the disk is full, root processes don't fail,
only user processes. So you don't lose, e.g., syslog. With
Ext2 by default reserves 5% of the drive for use by root only. That'd
be about 45GB of your 907GB capacity, which would account for most of the
discrepancy. You can adjust this with tune2fs.
Doug
Bryan Duxbury wrote:
There are no non-dfs files on the partitions in question.
df -h indicates that
Philip (flip) Kromer wrote:
Heritrix (http://en.wikipedia.org/wiki/Heritrix),
Nutch (http://en.wikipedia.org/wiki/Nutch), and
others use the ARC file format:
http://www.archive.org/web/researcher/ArcFileFormat.php
http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
Nutch does not use ARC
Mark Kerzner wrote:
Okay, I am convinced. I only noticed that Doug, the originator, was not
happy about it - but in open source one has to give up control sometimes.
I think perhaps you misunderstood my remarks. My point was that, if you
looked to Nutch's Content class for an example, it is,
for permissions. See code and description here:
http://www.hadoop.iponweb.net/Home/hdfs-over-webdav
Hope it is useful,
Regards,
Boris, IPonWeb
On Thu, Jan 22, 2009 at 2:30 PM, Doug Cutting cutt...@apache.org wrote:
Aaron Kimball wrote:
Is anyone aware of an OSS web dav library that
could
Aaron Kimball wrote:
Doesn't the WebDAV protocol use http for file transfer, and support reads /
writes / listings / etc?
Yes. Getting a WebDAV-based FileSystem in Hadoop has long been a goal.
It could replace libhdfs, since there is already a WebDAV-based FUSE
filesystem for Linux (wdfs,
Derek Young wrote:
Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like
this should be supported, but the http URLs are not working for me. Are
http source URLs still supported?
No. They used to be supported, but when distcp was converted to accept
any Path this stopped
The notion of a client/task ID, independent of IP or username, seems
useful for log analysis. DFS's client ID is probably currently your
best bet, but we might improve its implementation, and make the notion
more generic.
It is currently implemented as:
String taskId =
Ubuntu does not include the ssh server in client installations, so you
need to install it yourself.
sudo apt-get install openssh-server
Doug
vinayak katkar wrote:
Hey
When I tried to install Hadoop on Ubuntu 8.04, I got an error: ssh connection
refused to localhost at port 22.
Please any one
Values can drop if tasks die and must be re-run.
Doug
Aaron Kimball wrote:
The actual number of input records is most likely steadily increasing. The
counters on the web site are inaccurate until the job is complete; their
values will fluctuate wildly. I'm not sure why this is.
- Aaron
On
Why are you using such a big block size? I suspect this problem will go
away if you decrease your blocksize to less than 2GB.
This sounds like a bug, probably related to integer overflow: some part
of Hadoop is using an 'int' where it should be using a 'long'. Please
file an issue in Jira.
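A minimal illustration of that kind of overflow (not Hadoop's actual code):

public class OverflowDemo {
  public static void main(String[] args) {
    long blockSize = 4L * 1024 * 1024 * 1024;  // a 4GB block size
    int offset = (int) blockSize;              // int holds at most 2^31-1, ~2.1GB
    System.out.println(offset);                // prints 0, not 4294967296
  }
}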
Owen O'Malley wrote:
It is interesting, but it would be more interesting to track the authors
of the patch rather than the committer. The two are rarely the same.
Indeed. There was a period of over a year where I wrote hardly anything
but committed almost everything. So I am vastly
Steve Loughran wrote:
Alternatively, why we should be exploring the configuration space more
widely
Are you volunteering?
Doug
Brian Bockelman wrote:
To some extent, this whole issue is caused because we only have enough
space for 2 replicas; I'd imagine that at 3 replicas, the issue would be
much harder to trigger.
The unfortunate reality is that if you run a configuration that's
different from most, you'll likely
Variables in configuration files may be Java system properties or other
configuration parameters. The list of pre-defined Java system
properties is at:
http://java.sun.com/javase/6/docs/api/java/lang/System.html#getProperties()
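For example, hadoop-default.xml expands the user.name system property:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>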
Unfortunately the host name is not in that list. You could
Billy Pearson wrote:
We are also looking for a way to support smaller clusters that might
overrun their heap size, causing the cluster to crash.
Support for namespaces larger than RAM would indeed be a good feature to
have. Implementing this without impacting large cluster in-memory
This looks like it could be a great feature for EC2-based Hadoop users:
http://aws.amazon.com/publicdatasets/
Has anyone tried it yet? Any datasets to share?
Doug
A task may read from more than one block. For example, in line-oriented
input, lines frequently cross block boundaries. And a block may be read
from more than one host. For example, if a datanode dies midway through
providing a block, the client will switch to using a different datanode.
Dennis Kubes wrote:
2) Besides possible slight degradation in performance, is there a reason
why the BlocksMap shouldn't or couldn't be stored on disk?
I think the assumption is that it would be considerably more than a slight
degradation. I've seen the namenode benchmarked at over 50,000
Brian Bockelman wrote:
Do you have any graphs you can share showing 50k opens / second (could
be publicly or privately)? The more external benchmarking data I have,
the more I can encourage adoption amongst my university...
The 50k opens/second is from some internal benchmarks run at Y!
tim robertson wrote:
Thanks Alex - this will allow me to share the shapefile, but I need to
read it, parse it, and store the objects in the index only once per job
per JVM.
Is the Mapper.configure() the best place to do this? E.g. will it
only be called once per job?
In 0.19, with
Otis Gospodnetic wrote:
Konstantin Co, please correct me if I'm wrong, but looking at
hadoop-default.xml makes me think that dfs.http.address is only the URL for the NN
*Web UI*. In other words, this is where people go to look at the NN.
The secondary NN must then be using only the Primary
This is hard to diagnose without knowing your InputFormat. Each split
returned by your #getSplits() implementation is passed to your
#getRecordReader() implementation. If your RecordReader is not stopping
when you expect it to, then that's a problem in your RecordReader, no?
Have you written
David C. Kerber wrote:
There would be quite a few files in the 100kB to 2MB range, which are received
and processed daily, with smaller numbers ranging up to ~600MB or so which are
summarizations of many of the daily data files, and maybe a handful in the 1GB
- 6GB range (disk images and
Bhupesh Bansal wrote:
Minor correction: the graph size is about 6G and not 8G.
Ah, that's better.
With the jvm reuse feature in 0.19 you should be able to load it once
per job into a static, since all tasks of that job can share a JVM.
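A sketch of the pattern (class, types, and the load step are hypothetical):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ShapeMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Shared by all tasks of the job that run in this (reused) JVM.
  private static Map<String, String> index;

  public void configure(JobConf job) {
    synchronized (ShapeMapper.class) {
      if (index == null) {
        index = new HashMap<String, String>();
        // expensive one-time work, e.g., read and parse the shapefile here
      }
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... look things up in the shared index ...
  }
}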
Things will get tight if you try to run two such jobs at
The safest thing is to restrict your Hadoop file names to a
common-denominator set of characters that are well supported by Unix,
Windows, and URIs. Colon is a special character on both Windows and in
URIs. Quoting is in theory possible, but it's hard to get it right
everywhere in practice.
Arun C Murthy wrote:
You need to add libhadoop.so to your java.library.path. libhadoop.so
is available in the corresponding release in the lib/native directory.
I think he needs to first build libhadoop.so, since he appears to be
running on OS X and we only provide Linux builds of this in
Chris K Wensel wrote:
doh, conveniently collides with the GridGain and GridDynamics
presentations:
http://web.meetup.com/66/calendar/8561664/
Bay Area Hadoop User Group meetings are held on the third Wednesday of
every month. This has been on the calendar for quite a while.
Doug
Ryan LeCompte wrote:
I'd really love to one day
see some scripts under src/contrib/ec2/bin that can setup/mount the EBS
volumes automatically. :-)
The fastest way might be to write and contribute such scripts!
Doug
Changing it will unfortunately cause confusion too. Sigh. This is why
we should take time to name things well the first time.
Doug
叶双明 wrote:
Because the name secondary namenode causes so much confusion, would the
Hadoop team consider changing it?
LocalJobRunner allows you to test your code with everything running in a
single JVM. Just set mapred.job.tracker=local.
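For example (sketch; MyJob stands in for your job class):

JobConf conf = new JobConf(MyJob.class);
conf.set("mapred.job.tracker", "local");
conf.set("fs.default.name", "file:///");  // optionally use the local FS too
JobClient.runJob(conf);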
Doug
Ryan LeCompte wrote:
I see... so there really isn't a way for me to test a map/reduce
program using a single node without incurring the overhead of
upping/downing
Jason Venner wrote:
We have modified the /main/ that launches the children of the task
tracker to explicitly exit, in its finally block. That helps substantially.
Have you submitted this as a patch?
Doug
Kevin wrote:
Yes, I have looked at the block files and it matches what you said. I
am just wondering if there is some property or flag that would turn
this feature on, if it exists.
No. If you required this then you'd need to pad your data, but I'm not
sure why you'd ever require it.
Konstantin Shvachko wrote:
Imho we either need to correct it or remove it.
+1
Doug
Elia Mazzawi wrote:
Is it possible to run a map, then reduce, then a map, then a reduce?
It's really 2 jobs, but I don't want to store the intermediate results.
So can a Hadoop job do more than one map/reduce?
This has been discussed several times before. The problem is that
temporary data is
Tom White wrote:
You can use S3 as the default FS, it's just that then you can't run
HDFS at all in this case. You would only do this if you don't want to
use HDFS at all, for example, if you were running a MapReduce job
which read from S3 and wrote to S3.
Can't one work around this by using
Nathan Marz wrote:
Is there a way to get stats of the currently running job
programmatically?
This should probably be an FAQ. In your Mapper or Reducer's configure
implementation, you can get a handle on the running job with:
RunningJob running =
  new JobClient(job).getJob(job.get("mapred.job.id"));
(That completion assumes the 0.18-era API, where JobClient#getJob takes
the job id string that tasks can read from the mapred.job.id property.)
Tarandeep Singh wrote:
When is the Hadoop 0.18 release scheduled? This link has a date of 6 June :-/
http://issues.apache.org/jira/browse/HADOOP/fixforversion/12312972
The release date is initially set to the feature freeze date. It's
updated when all of the blockers are fixed and an actual
Ted Dunning wrote:
The map task is not multi-threaded [ ... ]
Unless you specify a multi-threaded MapRunnable...
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultithreadedMapRunner.html
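For example (sketch; the thread-count property name is an assumption):

conf.setMapRunnerClass(MultithreadedMapRunner.class);
conf.setInt("mapred.map.multithreadedrunner.threads", 10);  // 10 is the default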
Doug
Chris Collins wrote:
For instance, all it requires for me to give, say, a Mac user with a
login of bob access to things under /bob is to go in as the superuser
and do something like:
hadoop dfs -mkdir /bob
hadoop dfs -chown bob /bob
where bob literally doesn't
Chris Collins wrote:
You are referring to creating a directory in HDFS? Because if I am user
chris and the HDFS only has user foo, then I can't create a directory
because I don't have perms; in fact I can't even connect.
Today, users and groups are declared by the client. The namenode only
Andreas Kostyrka wrote:
java.lang.StackOverflowError
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:494)
at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
at
Ted Dunning wrote:
- in order to submit the job, I think you only need to see the job-tracker.
Somebody should correct me if I am wrong.
No, you also need to be able to write the job.xml, job.jar, and
job.split into HDFS. Someday perhaps we'll pass these via RPC to the
jobtracker and have
Otis Gospodnetic wrote:
10 GB in 3 h - doesn't that seem slow?
Have you played with dfs.balance.bandwidthPerSec? It defaults to
1MB/sec per datanode; at that rate, 3 hours is 1 MB/s × 10,800 s, or
about 10.5 GB, which matches what you're seeing.
Doug
Ted Dunning wrote:
Take the fully qualified HDFS path that looks like this:
hdfs://namenode-host-name:port/file-path
And transform it into this:
http://namenode-host-name:web-interface-port/data/file-path
The web-interface-port is 50070 by default. This will allow you to read HDFS
Maneesha Jain wrote:
I'm looking for any documentation or javadoc for MiniDFSCluster and have not
been able to find it anywhere.
Can someone please point me to it.
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/test/org/apache/hadoop/dfs/MiniDFSCluster.java
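Typical use in a test looks something like this (sketch, 0.18-era constructor):

Configuration conf = new Configuration();
MiniDFSCluster cluster = new MiniDFSCluster(conf, 2, true, null);  // 2 datanodes, formatted
try {
  FileSystem fs = cluster.getFileSystem();
  // ... exercise your code against fs ...
} finally {
  cluster.shutdown();
}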
This is part of the test
Cagdas Gerede wrote:
In the system I am working, we have 6 million blocks total and the namenode
heap size is about 600 MB, and it takes about 5 minutes for the namenode
to leave safemode.
How big are your files? Are they several blocks on average? Hadoop
is not designed for small files,
Cagdas Gerede wrote:
For a system with 60 million blocks, we can have 3 datanodes with 20 million
blocks each, or we can have 60 datanodes with 1 million blocks each. In
either case, would there be performance implications or would they behave
the same way?
If you're using mapreduce, then you
Cagdas Gerede wrote:
We will have 5 million files each having 20 blocks of 2MB. With the minimum
replication of 3, we would have 300 million blocks.
300 million blocks would store 600TB. At ~10TB/node, this means a 60 node
system.
Do you think these numbers are suitable for Hadoop DFS.
Why
Joydeep Sen Sarma wrote:
There seem to be two problems with small files:
1. namenode overhead. (3307 seems like _a_ solution)
2. map-reduce processing overhead and locality
It's not clear from the 3307 description how the archives interface with
map-reduce. How are the splits done? Will they
Karl Wettin wrote:
When are deprecated methods removed from the API? At every new minor release?
http://wiki.apache.org/hadoop/Roadmap
Note the remark: Prior to 1.0, minor releases follow the rules for
major releases, except they are still made every few months.
So, since we're still pre-1.0, we
CloudyEye wrote:
What else do I have to override in ArrayWritable to get the IntWritable
values written to the output files by the reducers?
public String toString();
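A sketch of such a subclass (class name hypothetical):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class IntArrayWritable extends ArrayWritable {
  public IntArrayWritable() {
    super(IntWritable.class);
  }

  public String toString() {
    // Render the values space-separated so TextOutputFormat emits them readably.
    StringBuilder buf = new StringBuilder();
    for (Writable w : get()) {
      if (buf.length() > 0) buf.append(' ');
      buf.append(w.toString());
    }
    return buf.toString();
  }
}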
Doug
Mikhail Bautin wrote:
Specifically, I just need a way to alter the child JVM's classpath via
JobConf, without having the framework copy anything in and out of HDFS,
because all my files are already accessible from all nodes. I see how to do
that by adding a couple of lines to TaskRunner's run()
Doug Cutting wrote:
Seems like we should force things onto the same availability zone by
default, now that this is available. Patch, anyone?
It's already there! I just hadn't noticed.
https://issues.apache.org/jira/browse/HADOOP-2410
Sorry for missing this, Chris!
Doug
Chang Hu wrote:
Code below, also attached. I put this together from the word count
example.
The problem is with your combiner. When a combiner is specified, it
generates the final map output, since combination is a map-side
operation. Your combiner takes Text,IntWritable generated by
Rong-en Fan wrote:
I have two questions regarding the mapfile in hadoop/hdfs. First, when using
MapFileOutputFormat as the reducer's output, is there any way to change
the index interval (i.e., to call setIndexInterval() on the
output MapFile)?
Not at present. It would probably be good to
Stu Hood wrote:
But I'm trying to _output_ multiple different value classes from a Mapper, and
not having any luck.
You can wrap things in ObjectWritable. When writing, this records the
class name with each instance, then, when reading, constructs an
appropriate instance and reads it. It
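For example (sketch):

conf.setMapOutputValueClass(ObjectWritable.class);
...
output.collect(key, new ObjectWritable(value));  // value may be any Writable

In the reducer, ObjectWritable#get() then returns the wrapped instance.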
Andrey Pankov wrote:
It's a little bit expensive to have a big cluster running for a long
period, especially if you use EC2. So, as possible solution, we can
start additional nodes and include them into cluster before running job,
and then, after finishing, kill unused nodes.
As Ted has
Use MapFileOutputFormat to write your data, then call:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapFileOutputFormat.html#getEntry(org.apache.hadoop.io.MapFile.Reader[],%20org.apache.hadoop.mapred.Partitioner,%20K,%20V)
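A sketch of a lookup (path and key/value types illustrative):

FileSystem fs = FileSystem.get(conf);
MapFile.Reader[] readers =
  MapFileOutputFormat.getReaders(fs, new Path("outdir"), conf);
Partitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
Text value = new Text();
MapFileOutputFormat.getEntry(readers, partitioner, new Text("key"), value);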
The documentation is pretty sparse, but the
Tarandeep Singh wrote:
but isn't the output of the reduce step sorted?
No, the input of reduce is sorted by key. The output of reduce is
generally produced as the input arrives, so is generally also sorted by
key, but reducers can output whatever they like.
Doug
Joydeep Sen Sarma wrote:
I find the confusion over what backwards compatibility means scary - and I am
really hoping that the outcome of this thread is a clear definition from the
committers/hadoop-board of what to reasonably expect (or not!) going forward.
The goal is clear: code that
Jason Venner wrote:
Is disk arm contention (seek) a problem in a 6 disk configuration, as
most likely all of the disks would be serving /local/ and /dfs/?
It should not be. MapReduce i/o is sequential, in chunks large
enough that seeks should not dominate.
Doug
Jason Venner wrote:
We have 3 types of machines we can get: 2 disk, 6 disk, and 16 disk
machines. They all have 4 dual-core CPUs.
The 2 disk machines have about 1 TB, the 6 disk about 3TB, and the 16
disk about 8TB. The 16 disk machines have about 25% slower CPUs than
the 2/6 disk
Marc Harris wrote:
The Hadoop upgrade wiki page contains a small typo:
http://wiki.apache.org/hadoop/Hadoop_Upgrade
[ ... ] I don't have access
to modify it, but someone else might like to.
Anyone can create themselves an account on the wiki and modify any page.
Doug
Lukas Vlcek wrote:
I think you have already heard rumours that Microsoft could buy Yahoo. Does
anybody have any idea how this could specifically impact Hadoop's future?
First, Hadoop is an Apache project. Y! contributes to it, along with
others. Apache projects are designed to be able to