- Welcome
- 6:30pm - Introductions; start creating agenda
- Breakout sessions begin as soon as we're ready
- 8pm - Conclusion
Food and refreshments will be provided, courtesy of Splunk.
Please RSVP at http://www.meetup.com/hadoopsf/events/41427512/
Regards,
- Aaron Kimball
/events/35650052/
Regards,
- Aaron Kimball
ready
- 8pm - Conclusion
Food and refreshments will be provided, courtesy of RichRelevance.
If you're going to attend, please RSVP at http://bit.ly/kxaJqa.
Hope to see you all there!
- Aaron Kimball
will be provided, courtesy of Cloudera. Please RSVP
at http://bit.ly/hwMCI2
Looking forward to seeing you there!
Regards,
- Aaron Kimball
* 8pm - Conclusion
Regards,
- Aaron Kimball
I don't know if putting native-code .so files inside a jar works. A
native-code .so is not classloaded in the same way .class files are.
So the correct .so files probably need to exist in some physical directory
on the worker machines. You may want to double-check that the correct
directory on the
errors are coming from. If
they're from the OS, could it be because it needs to fork() and momentarily
exceed the ulimit before loading the native libs?
- Aaron
On Fri, Mar 4, 2011 at 1:26 PM, Aaron Kimball akimbal...@gmail.com wrote:
I don't know if putting native-code .so files inside a jar
to
facilitate a discussion. All members of the Hadoop community are welcome to
attend. While all Hadoop-related subjects are on topic, this month's
discussion theme is integration.
Regards,
- Aaron Kimball
has asked that all attendees RSVP in advance, to comply with their
security policy. Please join the meetup group and RSVP at
http://www.meetup.com/hadoopsf/events/16678757/
Refreshments will be provided.
Regards,
- Aaron Kimball
announcement! Sign up at
http://www.meetup.com/hadoopsf/
Regards,
- Aaron Kimball
Start with the student's CS department's web server?
I believe the wikimedia foundation also makes the access logs to wikipedia
et al. available publicly. That is quite a lot of data though.
- Aaron
On Sun, Jan 30, 2011 at 10:54 AM, Bruce Williams
williams.br...@gmail.com wrote:
Does anyone
W. P.,
How are you running your Reducer? Is everything running in standalone mode
(all mappers/reducers in the same process as the launching application)? Or
are you running this in pseudo-distributed mode or on a remote cluster?
Depending on the application's configuration, log4j configuration
gave excerpts from is a central one for the cluster.
On Wed, Dec 15, 2010 at 1:38 PM, Aaron Kimball akimbal...@gmail.com
wrote:
W. P.,
How are you running your Reducer? Is everything running in standalone
mode
(all mappers/reducers in the same process as the launching application
:
* I've created a short survey to help understand days / times that would
work for the most people: http://bit.ly/ajK26U
* Please also join the meetup group at http://meetup.com/hadoopsf -- We'll
use this to plan the event, RSVP information, etc.
I'm looking forward to meeting more of you!
- Aaron
David,
I think you've more-or-less outlined the pros and cons of each format
(though do see Alex's important point regarding SequenceFiles and
compression). If everyone who worked with Hadoop clearly favored one or the
other, we probably wouldn't include support for both formats by default. :)
Is there a reason you're using that particular interface? That's very
low-level.
See http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample for the proper
API to use.
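For what it's worth, that higher-level API boils down to something like this
(untested sketch; the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Any HDFS path you can write to will do; this one is made up.
    FSDataOutputStream out = fs.create(new Path("/user/example/hello.txt"));
    out.writeUTF("hello, HDFS");
    out.close();
  }
}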
- Aaron
On Sat, Jul 3, 2010 at 1:36 AM, Vidur Goyal vi...@students.iiit.ac.in wrote:
Hi,
I am trying to create a file in
One possibility: write out all the partition numbers (one per line) to a
single file, then use the NLineInputFormat to make each line its own map
task. Then in your mapper itself, you will get in a key of 0 or 1 or 2
etc. Then explicitly open /dataset1/part-(n) and /dataset2/part-(n) in your
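In code, roughly (untested sketch -- the part-file layout and class names are
placeholders; in the driver, set
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class) and
point the input path at the partition-list file):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class PartitionPairMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private JobConf conf;

  public void configure(JobConf conf) { this.conf = conf; }

  public void map(LongWritable offset, Text line, OutputCollector<Text, Text> out,
      Reporter reporter) throws IOException {
    // With NLineInputFormat, each map task receives one line of the
    // partition-list file; the line text is the partition number.
    String n = line.toString().trim();
    FileSystem fs = FileSystem.get(conf);
    Path a = new Path("/dataset1/part-" + n);
    Path b = new Path("/dataset2/part-" + n);
    // ... fs.open(a) and fs.open(b), then process the pair of files here ...
  }
}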
Zhenyu,
It's a bit complicated and involves some layers of
indirection. CombineFileRecordReader is a sort of shell RecordReader that
passes the actual work of reading records to another child record reader.
That's the class name provided in the third parameter. Instructing it to use
process, please ask me.
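To make that concrete, here's roughly what it looks like with the new
(org.apache.hadoop.mapreduce) API -- MyChildRecordReader is a placeholder for
your own per-file reader:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class MyCombineInputFormat extends CombineFileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // The third argument is the child reader class. CombineFileRecordReader
    // instantiates one of these per file in the combined split and delegates
    // the actual record reading to it. The child class must provide a
    // (CombineFileSplit, TaskAttemptContext, Integer) constructor, where the
    // Integer is the index of the file it should read.
    return new CombineFileRecordReader<LongWritable, Text>(
        (CombineFileSplit) split, context, MyChildRecordReader.class);
  }
}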
Regards,
- Aaron Kimball
Cloudera, Inc.
Hi Utku,
Apache Hadoop 0.20 cannot support Sqoop as-is. Sqoop makes use of the
DataDrivenDBInputFormat (among other APIs) which are not shipped with
Apache's 0.20 release. In order to get Sqoop working on 20, you'd need to
apply a lengthy list of patches from the project source repository to your
The most obvious workaround is to use the old API (continue to use Mapper,
Reducer, etc. from org.apache.hadoop.mapred, not .mapreduce).
If you really want to use the new API, though, I unfortunately don't see a
super-easy path. You could try to apply the patch from MAPREDUCE-364 to your
version
We've already got a lot of mailing lists :) If you send questions to
mapreduce-user, are you not getting enough feedback?
- Aaron
On Wed, Mar 3, 2010 at 12:09 PM, Michael Kintzer
rockrep.had...@gmail.com wrote:
Hi,
Was curious if anyone else thought it would be useful to have a separate
mail
If it's terminating before you even run a job, then you're in luck -- it's
all still running on the local machine. Try running it in Eclipse and use
the debugger to trace its execution.
- Aaron
On Wed, Mar 3, 2010 at 4:13 AM, Rakhi Khatwani rkhatw...@gmail.com wrote:
Hi,
I am running a
Thomas,
What version of Hadoop are you building Debian packages for? If you're
taking Cloudera's existing debs and modifying them, these include a backport
of Sqoop (from Apache's trunk) which uses the rt tools.jar to compile
auto-generated code at runtime. Later versions of Sqoop (including the
Sonal,
Can I ask why you're sleeping between starting hdfs and mapreduce? I've
never needed this in my own code. In general, Hadoop is pretty tolerant
about starting daemons out of order.
If you need to wait for HDFS to be ready and come out of safe mode before
launching a job, that's another
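(For the safe-mode case, 'hadoop dfsadmin -safemode wait' blocks until the
NameNode leaves safe mode, or you can poll it from Java. Untested sketch,
using the 0.20-era class names:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.FSConstants.SafeModeAction;

public class SafeModeWait {
  public static void waitForHdfs(Configuration conf) throws Exception {
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    // SAFEMODE_GET only queries the current state; it doesn't change anything.
    while (dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)) {
      Thread.sleep(5000);
    }
  }
}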
To expand on Eric's comment: dfs.data.dir is the local filesystem directory
(or directories) that a particular datanode uses to store its slice of the
HDFS data blocks.
so dfs.data.dir might be /home/hadoop/data/ on some machine; a bunch of
files with inscrutable names like
There's an older mechanism called MultipleOutputFormat which may do what you
need.
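For example, the text flavor of it lets you route each record to a file named
after its key (sketch; the driver would then call
conf.setOutputFormat(OutputByKey.class)):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class OutputByKey extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // "name" is the default leaf name (e.g. part-00000); prefix it with the key.
    return key.toString() + "/" + name;
  }
}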
- Aaron
On Fri, Feb 5, 2010 at 10:13 AM, Udaya Lakshmi udaya...@gmail.com wrote:
Hi,
MultipleOutput class is not available in hadoop 0.18.3. Is there any
alternative for this class? Please point me useful
Brian, it looks like you missed a step in the instructions. You'll need to
format the hdfs filesystem instance before starting the NameNode server:
You need to run:
$ bin/hadoop namenode -format
.. then you can do bin/start-dfs.sh
Hope this helps,
- Aaron
On Sat, Jan 30, 2010 at 12:27 AM,
Nick,
I'm afraid that right now the only available OutputFormat for JDBC is that
one. You'll note that DBOutputFormat doesn't really include much support for
special-casing to MySQL or other targets.
Your best bet is to probably copy the code from DBOutputFormat and
DBConfiguration into some
In a map-only job, map tasks will be connected directly to the OutputFormat.
So calling output.collect() / context.write() in the mapper will emit data
straight to files in HDFS without sorting the data. There is no sort buffer
involved. If you want exactly one output file, follow Nick's advice.
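A minimal old-API driver for the map-only case looks something like this
(class names and paths are placeholders):

JobConf conf = new JobConf(MyDriver.class);
conf.setMapperClass(MyMapper.class);
conf.setNumReduceTasks(0);   // map-only: no sort/shuffle, output goes straight to HDFS
FileInputFormat.setInputPaths(conf, new Path("/input"));
FileOutputFormat.setOutputPath(conf, new Path("/output"));
JobClient.runJob(conf);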
Note that org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is
scheduled for the next CDH 0.20 release -- ready soon.
- Aaron
2010/1/6 Amareshwari Sri Ramadasu amar...@yahoo-inc.com
No. It is part of branch 0.21 onwards. For 0.20*, people can use old api
only, though JobConf is
When you set up the Job object, do you call job.setJarByClass(Map.class)?
That will tell Hadoop which jar file to ship with the job and to use for
classloading in your code.
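That is, something like this (using Map.class, as in your message):

import org.apache.hadoop.mapreduce.Job;

Job job = new Job(conf, "my job");
job.setJarByClass(Map.class);   // find the jar that contains Map and ship it with the job
job.setMapperClass(Map.class);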
- Aaron
On Thu, Nov 26, 2009 at 11:56 PM, aa...@buffalo.edu wrote:
Hi,
I am running the job from command line. The
You are always free to run with compression disabled. But in many production
situations, space or performance concerns dictate that all data sets are
stored compressed, so I think Tim was assuming that you might be operating
in such an environment -- in which case, you'd only need things to appear
You don't need to specify a path. If you don't specify a path argument for
ls, then it uses your home directory in HDFS (/user/yourusernamehere).
When you first started HDFS, /user/hadoop didn't exist, so 'hadoop fs -ls'
(which is equivalent to 'hadoop fs -ls /user/hadoop') reported that the
directory was not found. When you mkdir'd
On Thu, Nov 5, 2009 at 2:34 AM, Andrei Dragomir adrag...@adobe.com wrote:
Hello everyone.
We ran into a bunch of issues with building and deploying hadoop 0.21.
It would be great to get some answers about how things should work, so
we can try to fix them.
1. When checking out the
Also hadoop.tmp.dir and mapred.local.dir in your xml configuration, and the
environment variables HADOOP_LOG_DIR and HADOOP_PID_DIR in hadoop-env.sh.
- Aaron
On Thu, Oct 29, 2009 at 10:44 PM, Jeff Zhang zjf...@gmail.com wrote:
Hi all,
I have installed hadoop 0.18.3 on my own cluster with 5
and a corresponding
FixedLengthRecordReader.
Would the Hadoop commons project have interest in these? Basically these
are for reading inputs of textual record data, where each record is a fixed
length, (no carriage returns or separators etc)
thanks
On Oct 20, 2009, at 11:00 PM, Aaron Kimball
If you need another shuffle after your first reduce pass, then you need a
second MapReduce job to run after the first one. Just use an IdentityMapper.
This is a reasonably common situation.
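Something like this for the second pass (untested; SecondReducer and the paths
are placeholders):

JobConf second = new JobConf(MyDriver.class);
FileInputFormat.setInputPaths(second, new Path("/job1-output"));
FileOutputFormat.setOutputPath(second, new Path("/job2-output"));
second.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
second.setReducerClass(SecondReducer.class);
JobClient.runJob(second);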
- Aaron
On Thu, Oct 22, 2009 at 4:17 PM, Forhadoop rutu...@gmail.com wrote:
Hello,
In my application
You'll need to write your own, I'm afraid. You should subclass
FileInputFormat and go from there. You may want to look at TextInputFormat /
LineRecordReader for an example of how an IF/RR gets put together, but there
isn't an existing fixed-len record reader.
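A rough, untested sketch of the shape of it (old API; the record-length
property name is made up, and this doesn't handle compressed input):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.*;

public class FixedLenInputFormat extends FileInputFormat<LongWritable, BytesWritable> {
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    int recordLen = job.getInt("fixedlen.record.length", 100);
    return new FixedLenRecordReader((FileSplit) split, job, recordLen);
  }
}

class FixedLenRecordReader implements RecordReader<LongWritable, BytesWritable> {
  private final FSDataInputStream in;
  private final long start, end;
  private final int recordLen;
  private long pos;

  FixedLenRecordReader(FileSplit split, JobConf job, int recordLen) throws IOException {
    this.recordLen = recordLen;
    FileSystem fs = split.getPath().getFileSystem(job);
    in = fs.open(split.getPath());
    // Round the split start up to a record boundary so records aren't read twice.
    start = ((split.getStart() + recordLen - 1) / recordLen) * recordLen;
    end = split.getStart() + split.getLength();
    in.seek(start);
    pos = start;
  }

  public boolean next(LongWritable key, BytesWritable value) throws IOException {
    if (pos >= end) return false;   // records starting past the split belong to the next task
    byte[] buf = new byte[recordLen];
    in.readFully(buf);              // throws EOFException on a short trailing record
    key.set(pos / recordLen);       // record number within the file
    value.set(buf, 0, recordLen);
    pos += recordLen;
    return true;
  }

  public LongWritable createKey() { return new LongWritable(); }
  public BytesWritable createValue() { return new BytesWritable(); }
  public long getPos() { return pos; }
  public float getProgress() {
    return end == start ? 1.0f : (pos - start) / (float) (end - start);
  }
  public void close() throws IOException { in.close(); }
}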
- Aaron
On Tue, Oct 20, 2009 at
If you're working with the Cloudera distribution, you can install CDH1
(0.18.3) and CDH2 (0.20.1) side-by-side on your development machine.
They'll install to /usr/lib/hadoop-0.18 and /usr/lib/hadoop-0.20; use
/usr/bin/hadoop-0.18 and /usr/bin/hadoop-0.20 to execute, etc.
See
Bhupesh: If you use FileSystem.newInstance(), does that return the correct
object type? This sidesteps CACHE.
- A
On Thu, Oct 15, 2009 at 3:07 PM, Bhupesh Bansal bban...@linkedin.com wrote:
This code is not map/reduce code and runs only on a single machine, and
Also each node prints the right value
Edward,
Interesting concept. I imagine that implementing CachedInputFormat over
something like memcached would make for the most straightforward
implementation. You could store 64MB chunks in memcached and try to retrieve
them from there, falling back to the filesystem on failure. One obvious
Map tasks are generated based on InputSplits. An InputSplit is a logical
description of the work that a task should use. The array of InputSplit
objects is created on the client by the InputFormat.
org.apache.hadoop.mapreduce.InputSplit has an abstract method:
/**
 * Get the list of nodes by name where the data for the split would be local.
 */
public abstract String[] getLocations() throws IOException, InterruptedException;
Quite possible. :\
- A
On Thu, Oct 1, 2009 at 5:17 PM, Mayuran Yogarajah
mayuran.yogara...@casalemedia.com wrote:
Aaron Kimball wrote:
If you want to run the 2NN on a different node than the NN, then you need
to
set dfs.http.address on the 2NN to point to the namenode's http server
If you want to run the 2NN on a different node than the NN, then you need to
set dfs.http.address on the 2NN to point to the namenode's http server
address. See
http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/
- Aaron
On Mon, Sep 28, 2009 at 2:17 PM, Todd
Or maybe more pessimistically, the second stable append implementation.
It's not like HADOOP-1700 wasn't intended to work. It was just found not to
after the fact. Hopefully this reimplementation will succeed. If you're
running a cluster that contains mission-critical data that cannot tolerate
In the 0.20 branch, the common best-practice is to use the old API and
ignore deprecation warnings. When you get to 0.22, you'll need to convert
all your code to use the new API.
There may be a new-API equivalent in org.apache.hadoop.mapreduce.lib.output
that you could use, if you convert your
Use an external database (e.g., mysql) or some other transactional
bookkeeping system to record the state of all your datasets (STAGING,
UPLOADED, PROCESSED)
- Aaron
On Thu, Sep 17, 2009 at 7:17 PM, Huy Phan dac...@gmail.com wrote:
Hi all,
I have a question about strategy to prepare data
That's 99% correct. If you want/need to run different versions of HDFS on
the two different clusters, then you can't use hdfs:// protocol to access
both of them in the same command. In this case, use hdfs://bla/ for the
source fs and *hftp*://bla2/ for the dest fs.
- Aaron
On Tue, Sep 8, 2009 at
Hi Nikhil,
MRUnit now supports the 0.20 API as of
https://issues.apache.org/jira/browse/MAPREDUCE-800. There are no plans to
involve partitioners in MRUnit; it is for mappers and reducers only, and not
for full jobs involving input/output formats, partitioners, etc. Use the
LocalJobRunner for
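(For reference, a mapper test in MRUnit looks roughly like the following --
I'm writing this from memory, so check the class and package names against the
MRUnit version you have; MyMapper is a placeholder:)

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;

MapDriver<LongWritable, Text, Text, IntWritable> driver =
    new MapDriver<LongWritable, Text, Text, IntWritable>(new MyMapper());
driver.withInput(new LongWritable(0), new Text("hello"))
      .withOutput(new Text("hello"), new IntWritable(1))
      .runTest();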
Are you trying to serve blocks from a shared directory e.g. NFS?
The storageID for a node is recorded in a file named VERSION in
${dfs.data.dir}/current. If one node claims that the storage directory is
already locked, and another node is reporting the first node's storageID, it
makes me think
As a more general note -- any jars needed by your mappers and reducers
either need to be in your job jar in the lib/ directory of the .jar file, or
in $HADOOP_HOME/lib/ on all tasktracker nodes where mappers and reducers get
run.
- Aaron
On Fri, Aug 21, 2009 at 10:47 AM, ishwar ramani
Yes. It works just like Java-based MapReduce in that regard.
- Aaron
On Sun, Aug 23, 2009 at 5:09 AM, Nipun Saggar nipun.sag...@gmail.com wrote:
Hi all,
I have recently started using Hadoop streaming. From the documentation, I
understand that by default, each line output from a mapper up to
If you've got 20 nodes, then you want to have 20-ish reduce tasks. Maybe 40
if you want it to run in two waves. (Assuming 1 core/node. Multiply by N for
N cores...) As it is, each node has 500-ish map tasks that it has to read
from and for each of these, it needs to generate 500 separate reduce
Jeff,
Hadoop (HDFS in particular) is overly strict about machine names. The
filesystem's id is based on the DNS name used to access it. This needs to be
consistent across all nodes and all configurations in your cluster. You
should always use the fully-qualified domain name of the namenode in
Look into typed bytes:
http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/
On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake jnekanay...@gmail.com wrote:
Hi Stefan,
I am sorry, for the late reply. Somehow the response email has slipped my
eyes.
Could you explain a bit on how to
Compressed OOPs are available now in 1.6.0u14:
https://jdk6.dev.java.net/6uNea.html
- Aaron
On Thu, Aug 20, 2009 at 10:51 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
Suresh had made a spreadsheet for memory consumption.. will check.
A large portion of NN memory is taken by references. I
Hi Mithila,
In the Mapreduce svn tree, it's under src/contrib/fairscheduler/
- Aaron
On Wed, Aug 19, 2009 at 2:48 PM, Mithila Nagendra mnage...@asu.edu wrote:
Hello
I was wondering how I could locate the source code files for the fair
scheduler.
Thanks
Mithila
Hi Inifok,
This is a confusing aspect of Hadoop, I'm afraid.
Settings are divided into two categories: per-job and per-node.
Unfortunately, which are which isn't documented.
Some settings are applied to the node that is being used. So for example, if
you set fs.default.name on a node to be
Also, if you haven't yet configured rack awareness, now's a good time to
start :)
- Aaron
On Tue, Aug 11, 2009 at 11:27 PM, Ted Dunning ted.dunn...@gmail.com wrote:
If you add these nodes, data will be put on them as you add data to the
cluster.
Soon after adding the nodes you should
Naga,
That's right. In the old API, Mapper and Reducer were just interfaces and
didn't provide default implementations of their code. Thus MapReduceBase.
Now Mapper and Reducer are classes to extend, so no MapReduceBase is needed.
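Side by side, the difference is just this (class names are for illustration;
imports come from org.apache.hadoop.mapred for the old API and
org.apache.hadoop.mapreduce for the new one):

// Old API: Mapper is an interface, so MapReduceBase supplies the empty
// configure()/close() bodies and you implement map() yourself.
public class OldWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
    out.collect(value, new IntWritable(1));
  }
}

// New API: Mapper is already a class with no-op setup()/cleanup(), so there's
// nothing for a MapReduceBase to provide; just extend it and override map().
public class NewWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, new IntWritable(1));
  }
}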
- Aaron
On Fri, Aug 7, 2009 at 8:26 AM, Naga Vijayapuram
You can set it on a per-file basis if you'd like the control. The data
structures associated with files allow these to be individually controlled.
But there's also a create() call that only accepts the Path to open as an
argument. This uses the configuration file defaults. This use case is
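(For example -- untested, and the path and values are arbitrary:)

FileSystem fs = FileSystem.get(conf);

// Per-file control: overwrite flag, io buffer size, replication factor, block size.
FSDataOutputStream a = fs.create(new Path("/data/wide-file"),
    true, 4096, (short) 2, 128 * 1024 * 1024L);

// Or take the cluster defaults (dfs.replication, dfs.block.size, ...):
FSDataOutputStream b = fs.create(new Path("/data/normal-file"));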
May I ask why you're trying to run the NameNode in Eclipse? This is likely
going to cause you lots of classpath headaches. I think your current problem
is that it can't find its config files, so it's not able to read in the
strings for what addresses it should listen on.
If you want to see what's
Is that setting in the hadoop-site.xml file on every node? Each tasktracker
reads in that file once and sets its max map tasks from that. There's no way
to control this setting on a per-job basis or from the client (submitting)
system. If you've changed hadoop-site.xml after starting the
I don't know that your load-in speed is going to dramatically
increase. There's a number of parameters that adjust aspects of
MapReduce, but HDFS more or less works out of the box. You should run
some monitoring on your nodes (ganglia, nagios) or check out what
they're doing with top, iotop and
For future reference,
$ bin/hadoop dfsadmin -safemode leave
will also just cause HDFS to exit safemode forcibly.
- Aaron
On Wed, Aug 5, 2009 at 1:04 AM, Amandeep Khurana ama...@gmail.com wrote:
Two alternatives:
1. Do bin/hadoop namenode -format. That'll format the metadata and you can
mysqldump to local files on all 50 nodes, scp them to datanodes, and then
bin/hadoop fs -put?
- Aaron
On Mon, Aug 3, 2009 at 8:15 PM, Min Zhou coderp...@gmail.com wrote:
hi all,
We need to dump data from a mysql cluster with about 50 nodes to a hdfs
file. Considered about the issues on
Are you sure you stopped all the daemons? Use 'sudo jps' to make sure :)
- Aaron
On Mon, Aug 3, 2009 at 7:26 PM, bharath vissapragada
bharathvissapragada1...@gmail.com wrote:
Todd thanks for replying ..
I stopped the cluster and issued the command
bin/hadoop namenode -upgrade and I am
...@gmail.com wrote:
yes .. I have stopped all the daemons ... when i use jps ...i get only ...
pid Jps
Actually .. i upgraded the version from 18.2 to 19.x on the same path of
hdfs .. is it a problem?
On Wed, Aug 5, 2009 at 11:02 PM, Aaron Kimball aa...@cloudera.com wrote:
Are you sure you
The current best practice is to firewall off your cluster, configure a
SOCKS proxy/gateway, and only allow traffic to the cluster from the
gateway. Being able to SSH into the gateway provides authentication.
See
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
I hacked around this in MRUnit last night. MRUnit now has support for
the new API -- See MAPREDUCE-800.
You can, in fact, subclass Mapper.Context and Reducer.Context, since
they don't actually share any state with the outer class
Mapper/Reducer implementation, just the type signatures. But doing
And regarding your desire to set things on the command line: If your
program implements Tool and is launched via ToolRunner, you can
specify -D myparam=myvalue on the command line and it'll
automatically put that binding in the JobConf created for the tool,
retrieved via getConf().
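The skeleton looks like this (MyTool is a placeholder name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // Anything passed as -D key=value is already in the configuration here.
    String myparam = getConf().get("myparam");
    // ... set up and submit your job using getConf() ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
  }
}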
- Aaron
On
Hm. What version of Hadoop are you running? Have you modified the
log4j.properties file in other ways? The logfiles generated by Hadoop
should, by default, switch to a new file every day, appending the
previous day's date to the closed log file (e.g.,
Storage Directory    Type             State
/hadoop/name         IMAGE_AND_EDITS  Active
2009/7/21 Aaron Kimball aa...@cloudera.com
A VersionMismatch occurs because you're using different builds of Hadoop on
your different nodes. All DataNodes and the NameNode must be running the
exact same compilation of Hadoop (It's very
Hi Asif,
Just install the Hadoop package onto the external node to use it as a
client. On that node, you set your fs.default.name parameter to point to the
cluster, but you don't start any daemons locally, nor do you add that node
to the slaves file.
Then just do hadoop fs -put localfile
Hmm. DEBUG entries are just debug-level detail. The ERROR and WARN level
entries are the problematic ones.
What's your hadoop-site.xml file look like? If you're storing data
underneath ${hadoop.tmp.dir} and that's set to /tmp/${user.name} (as is the
default), then it's possible that a tmpwatch or
I'm not convinced anything is wrong with the TaskTracker. Can you run jobs?
Does the pi example work? If not, what error does it give?
If you're trying to configure your SecondaryNameNode on a different host
than your NameNode, you'll need to do some configuration tweaking. I wrote a
blog post
void close().
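That is, in the old API your mapper can override both; something like this
(SomeResource is a stand-in for whatever you open in configure()):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ResourceMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private SomeResource resource;                  // placeholder for your handle

  public void configure(JobConf job) {
    resource = SomeResource.open(job);            // acquired once, before any map() calls
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    out.collect(value, new Text(resource.lookup(value.toString())));
  }

  public void close() throws IOException {        // called once, after the last map() call
    resource.close();
  }
}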
- Aaron
On Mon, Jul 13, 2009 at 11:31 AM, akhil1988 akhilan...@gmail.com wrote:
Hi All,
Just like method configure in Mapper interface, I am looking for its
counterpart that will perform the closing operation for a Map task. For
example, in method configure I start an
Reduce tasks which require more than twenty minutes are not a problem. But
you must emit some data periodically to inform the rest of the system that
each reducer is still alive. Emitting a (k, v) output pair to the collector
will reset the timer. Similarly, calling Reporter.incrCounter() will
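(In an old-API reducer that looks roughly like this -- doExpensiveWork stands
in for your long-running step:)

public void reduce(Text key, Iterator<Text> values,
    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
  while (values.hasNext()) {
    doExpensiveWork(values.next());
    // Either of these resets the task timeout so the framework doesn't kill us:
    reporter.progress();
    reporter.incrCounter("myapp", "records processed", 1);
  }
  out.collect(key, new Text("done"));
}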
Hi Matthew,
You can set the heap size for child jobs by calling
conf.set("mapred.child.java.opts", "-Xmx1024m") to get a gig of heap space.
That should fix the OOM issue in IsolationRunner. You can also change the
heap size used in Eclipse; if you go to Debug Configurations, create a new
If you look into FileInputFormat, you'll see that there's a call to
FileSystem.getFileBlockLocations() (line 222) which finds the addresses of
the nodes holding the blocks to be mapped. Each FileSplit generated in that
same getSplits() method contains the list of locations where this split
should
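(Outside of an InputFormat you can ask the same question directly -- sketch;
assumes 'path' and 'conf' are in scope and the enclosing method throws
IOException:)

FileSystem fs = path.getFileSystem(conf);
FileStatus stat = fs.getFileStatus(path);
// One BlockLocation per block in the requested byte range:
BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
for (BlockLocation block : blocks) {
  System.out.println("offset " + block.getOffset()
      + " is held by " + java.util.Arrays.toString(block.getHosts()));
}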