How long does it take to start the code locally in a single thread?
Can you reuse the JVM so it only starts once per node per job?
conf.setNumTasksToExecutePerJvm(-1)
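Something like this, as a minimal sketch assuming the 0.19+ JobConf API (MyJob is a hypothetical driver class):

// org.apache.hadoop.mapred.JobConf, Hadoop 0.19+
JobConf conf = new JobConf(MyJob.class);  // MyJob is hypothetical
conf.setNumTasksToExecutePerJvm(-1);      // -1 = one JVM serves all of the job's tasks on a node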
Cheers,
Tim
On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou marcus.he...@tailsweep.com wrote:
Hi.
Wonder how one should
/hadoop-core-user/200905.mbox/%3cdfd95197f3ae8c45b0a96c2f4ba3a2556bf123e...@sc-mbxc1.thefacebook.com%3e
- Harish
On Sat, Jun 20, 2009 at 6:12 PM, tim robertson
timrobertson...@gmail.com wrote:
Hi all,
I am using Hadoop to build a read only store for voldemort on EC2 and
for some reason can't
Hi
I am not sure I understand the question correctly. If you mean you
want to use the output of Job1 as the input of Job2, then you can set
the input path to the second job as the output path (e.g. output
directory) from the first job.
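Something like this, as a sketch against the old org.apache.hadoop.mapred API (the intermediate path is hypothetical):

Path intermediate = new Path("/tmp/job1-out");        // hypothetical HDFS path
FileOutputFormat.setOutputPath(job1, intermediate);   // job1, job2 are JobConfs
JobClient.runJob(job1);                               // block until job1 completes
FileInputFormat.setInputPaths(job2, intermediate);    // job1's output is job2's input
JobClient.runJob(job2);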
Cheers
Tim
On Mon, Jun 15, 2009 at 3:30 PM,
Answers inline
- Once I place my data in HDFS, it gets replicated and chunked
automatically over the datanodes. Right? Hadoop takes care of all those
things.
Yes it does
- Now, what if there is some third party who is not participating in the Hadoop
program? I mean, he is not one of the nodes of
Yes you can do this.
It is complaining because you are not declaring the output types in
the method signature, but you will not use them anyway.
So please try
private static class Reducer extends MapReduceBase implements
Reducer<Text, Writable, Text, Text> {
...
The output format will be a
So you are using a java program to execute a load data infile
command on mysql through JDBC?
If so I *think* you would have to copy it onto the mysql machine from
HDFS first, or the machine running the command and then try a 'load
data local infile'.
Or perhaps use the
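If it helps, the JDBC side would look something like this once the file is local (a sketch; the connection URL, table and file names are all hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadLocalInfile {
  public static void main(String[] args) throws Exception {
    Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost/mydb?allowLoadLocalInfile=true", "user", "pass");
    Statement st = con.createStatement();
    // LOCAL means the file is read from the client side of the connection
    st.execute("LOAD DATA LOCAL INFILE '/tmp/summary.csv' INTO TABLE summary"
        + " FIELDS TERMINATED BY ','");
    st.close();
    con.close();
  }
}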
Perhaps some kind of in memory index would be better than iterating an
array? Binary tree or so.
I did similar with polygon indexes and point data. It requires
careful memory planning on the nodes if the indexes are large (mine
were several GB).
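By an in-memory index I mean something along these lines, as a toy sketch with java.util.TreeMap (the integer cell-bound keys are hypothetical):

TreeMap<Integer, String> index = new TreeMap<Integer, String>();
index.put(100, "polygonA");   // lower bound of a cell range -> polygon id
index.put(200, "polygonB");
// nearest key <= 142 in O(log n), instead of iterating an array
System.out.println(index.floorEntry(142).getValue());   // polygonA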
Just a thought,
Tim
On Sat, May 16, 2009 at
Try and google binary tree java and you will get loads of hits...
This is a simple implementation but I am sure there are better ones
that handle balancing properly.
Cheers
Tim
public class BinaryTree {
    public static void main(String[] args) {
        BinaryTree bt = new BinaryTree();
        // ... insert values into bt and search the tree here
    }
}
Can you post the entire error trace please?
On Fri, May 8, 2009 at 9:40 AM, George Pang p09...@gmail.com wrote:
Dear users,
I get a ClassNotFoundException when running the WordCount example on Hadoop
using Eclipse. Does anyone know where the problem is?
Thank you!
George
Hi,
What input format are you using for the GZipped file?
I don't believe there is a GZip input format although some people have
discussed whether it is feasible...
Cheers
Tim
On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka
mmata...@millennialmedia.com wrote:
Problem:
I am comparing two
I don't think that you can use those classes. If you look at
TextInputFormat and LineRecordReader, they should not be hard to use
as a basis for copying into your own version which uniques the IDs but
I presume you would need to make them Text and not LongWritable keys.
Just a thought...
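To make the Text-key part concrete, a rough sketch wrapping LineRecordReader from the old org.apache.hadoop.mapred API (the tab-separated layout is an assumption, and the de-duplication of IDs would sit on top of this):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

// wraps LineRecordReader and re-emits the first tab-separated field as a Text key
public class IdKeyRecordReader implements RecordReader<Text, Text> {
  private final LineRecordReader lines;
  private final LongWritable offset = new LongWritable();
  private final Text line = new Text();

  public IdKeyRecordReader(Configuration job, FileSplit split) throws IOException {
    lines = new LineRecordReader(job, split);
  }

  public boolean next(Text key, Text value) throws IOException {
    if (!lines.next(offset, line)) return false;   // end of split
    String[] parts = line.toString().split("\t", 2);
    key.set(parts[0]);                             // the record's ID
    value.set(parts.length > 1 ? parts[1] : "");
    return true;
  }

  public Text createKey() { return new Text(); }
  public Text createValue() { return new Text(); }
  public long getPos() throws IOException { return lines.getPos(); }
  public float getProgress() throws IOException { return lines.getProgress(); }
  public void close() throws IOException { lines.close(); }
}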
Hi,
[Ankur]: How can I make sure this happens?
-- show processlist is how we spot it... literally it takes hours on
our setup, so it is easy to find.
So we ended up with 2 DBs
- DB1 we insert to, prepare and do batch processing
- DB2 serving the read only web app
Periodically we dump the DB1, point the
If anyone is interested I did finally get round to processing it all,
and due to the sparsity of data we have, for all 23 zoom levels and
all species we have information on, the result was 807 million PNGs,
which is $8,000 to PUT to S3 - too much for me to pay.
So like most things I will probably
on the same
section of the sequence file. Maybe you can elaborate further and I'll see
if I can offer any thoughts.
On Tue, Apr 14, 2009 at 7:10 AM, tim robertson timrobertson...@gmail.com
wrote:
Sorry Brian, can I just ask please...
I have the PNGs in the Sequence file for my sample set
However, do the math on the costs for S3. We were doing something similar,
and found that we were spending a fortune on our put requests at $0.01 per
1000, and next to nothing on storage. I've since moved to a more complicated
model where I pack many small items in each object and store an
Hi all,
I am currently processing a lot of raw CSV data and producing a
summary text file which I load into mysql. On top of this I have a
PHP application to generate tiles for google mapping (sample tile:
http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
Here is a (dev
and
places it onto S3. (If my numbers are correct, you're looking at around 3TB
of data; is this right? With that much, you might want another separate Map
task to unpack all the files in parallel ... really depends on the
throughput you get to Amazon)
Brian
On Apr 14, 2009, at 4:35 AM, tim
missing
something obvious (e.g. can I disable this behavior)?
Cheers
Tim
On Tue, Apr 14, 2009 at 2:44 PM, tim robertson
timrobertson...@gmail.com wrote:
Thanks Brian,
This is pretty much what I was looking for.
Your calculations are correct but based on the assumption that at all
zoom
Thanks for sharing this - I find these comparisons really interesting.
I have a small comment after skimming this very quickly.
[Please accept my apologies for commenting on such a trivial thing,
but personal experience has shown this really influences performance]
One thing not touched on in
Hi all,
I am not a hardware guy but about to set up a 10 node cluster for some
processing of (mostly) tab files, generating various indexes and
researching HBase, Mahout, pig, hive etc.
Could someone please sanity check that these specs look sensible?
[I know 4 drives would be better but price
with 50 -- 33 GB RAM
and 8 x 1 TB disks on each one; one box however just has 16 GB of RAM
and it routinely falls over when we run jobs on it)
Miles
2009/4/2 tim robertson timrobertson...@gmail.com:
Hi all,
I am not a hardware guy but about to set up a 10 node cluster for some
processing
If Akira were to write his/her own Mappers, using a key type like
IntWritable would result in the output being numerically sorted, right?
Cheers,
Tim
On Mon, Mar 23, 2009 at 5:04 PM, Aaron Kimball aa...@cloudera.com wrote:
Simplest possible solution: zero-pad your keys to ten places?
- Aaron
On Sat,
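A quick illustration of both suggestions (toy value):

public class PadDemo {
  public static void main(String[] args) {
    // zero-padding makes lexicographic (Text) order match numeric order
    System.out.println(String.format("%010d", 42));   // 0000000042
    // with IntWritable keys the shuffle already sorts numerically, no padding needed
  }
}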
Yeps,
A good starting read: http://wiki.apache.org/hadoop/AmazonEC2
These are the AMIs:
$ ec2-describe-images -a | grep hadoop
IMAGE   ami-245db94d   cloudbase-1.1-hadoop-fc64/image.manifest.xml   247610401714   available   public   x86_64   machine
IMAGE ami-791ffb10
Hi Praveen,
I think it is more equivalent to Hive than HBase - both offer joins
and structured querying, whereas HBase is more a column-oriented data
store with many-to-one relations embedded in a single row and (currently)
only indexes on the primary key, though secondary indexes are coming. I
anticipate using
warehouse layer on top of Hadoop, and by means of its SQL
interface it makes it easier to mine logs. So instead of writing Map-Reduce
jobs for analyzing data, one can use SQL to do the same, and the SQL to
Map-Reduce job translation is handled by CloudBase.
-Taran
2009/3/3 tim robertson timrobertson
first.
*
* @see http://en.wikipedia.org/wiki/Relational_algebra#Semijoin
* @see http://en.wikipedia.org/wiki/Bloom_filter
*/
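For anyone landing here from the archives, a toy sketch of the Bloom-filter half of the idea (nothing Hadoop-specific; just two cheap hash functions over a BitSet):

import java.util.BitSet;

// toy Bloom filter: two hash functions over an m-bit set
public class TinyBloom {
  private final BitSet bits;
  private final int m;

  public TinyBloom(int m) { this.m = m; this.bits = new BitSet(m); }

  private int h1(String s) { return (s.hashCode() & 0x7fffffff) % m; }
  private int h2(String s) { return ((s.hashCode() * 31 + 7) & 0x7fffffff) % m; }

  public void add(String key) { bits.set(h1(key)); bits.set(h2(key)); }

  // may return a false positive, never a false negative
  public boolean mightContain(String key) { return bits.get(h1(key)) && bits.get(h2(key)); }

  public static void main(String[] args) {
    TinyBloom bloom = new TinyBloom(1 << 20);
    bloom.add("id-123");
    System.out.println(bloom.mightContain("id-123")); // true
    System.out.println(bloom.mightContain("id-999")); // almost certainly false
  }
}

In the semijoin you build the filter from the small side's keys and ship it to the mappers scanning the large side, so only candidate matches travel to the reduce.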
Thanks,
Taran
2009/3/3 tim robertson timrobertson...@gmail.com
Hi Taran,
Have you a blog or something
Hi,
Sounds like you might want to look at the Nutch project architecture
and then see the Nutch on Hadoop tutorial -
http://wiki.apache.org/nutch/NutchHadoopTutorial It does web
crawling, and indexing using Lucene. It would be a good place to
start anyway for ideas, even if it doesn't end up
Hi Dmitry,
What version of hadoop are you using?
Assuming your 3G DB is a read only lookup... can you load it into
memory in the Map.configure and then use (0.19+ only...):
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
So that the Maps are reused for all time
Nope
Here is a super simple little pom to install it locally, and change
version easily (put it in project root along with hadoop jar and then
run as per comment at top). If you do put it in a repository
yourself, are you able to share the URL? Ours is unfortunately on
an intranet so I
I would also consider a DB for this... 10M and 2 columns is not a lot
of data so I would look to have it in memory with some DB index or
memory hash for querying.
(We are keeping the indexes of tables with 150M records, 30M and 10M
and joining between them with around 25 indexes on the 150M table
I don't agree that this would be considered unconventional, as I have
scenarios where this makes sense too - one file with a summary view,
and others that are very detailed and a pass over the first one
determines which ones to analyse properly in a second job.
I am a novice, but it looks like
there
is probably no benefit over just waiting for the first pass to finish.
On Sat, Dec 6, 2008 at 6:41 PM, Devaraj Das [EMAIL PROTECTED] wrote:
On 12/6/08 10:43 PM, tim robertson [EMAIL PROTECTED] wrote:
I don't agree that this would be considered unconventional, as I have
scenarios where
Can't answer your question exactly, but can let you know what I do.
I build all dependencies into 1 jar, and by using Maven for my build
environment, when I assemble my jar, I am 100% sure all my
dependencies are collected together. This is working very nicely for
me and I have used the same
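The relevant bit of my pom is roughly this (a sketch using the standard maven-assembly-plugin jar-with-dependencies descriptor; goes in the build/plugins section):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
</plugin>

Running mvn assembly:assembly then produces a single jar with all dependencies unpacked inside it, which you can pass straight to hadoop jar.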
are using Maven
to take the dependencies, and package them in one large jar? Basically
unjar the contents of the jar and use those with your code I'm assuming?
On Dec 3, 2008, at 9:25 AM, tim robertson wrote:
Can't answer your question exactly, but can let you know what I do.
I build all
Hi,
I am a newbie so please excuse if I am doing something wrong:
in hadoop-site.xml I have the following since I have a very memory
intensive map:
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
number of multi-hundred-MB files? You can
probably make your setup work eventually, but it'll be a bit like fighting
the tide. Alternately, if you must have random-record access, try putting
your results into HBase.
Hope this helps!
Brian
On Nov 28, 2008, at 2:14 AM, tim robertson wrote:
I
Ok - apologies, it seems changes to the hadoop-site.xml are not
automatically picked up after the cluster is running.
Cheers
Tim
On Sun, Nov 30, 2008 at 12:48 PM, tim robertson
[EMAIL PROTECTED] wrote:
Hi,
I am a newbie so please excuse if I am doing something wrong:
in hadoop-site.xml I
explain some of the differences between using:
- setNumTasksToExecutePerJvm() and then having statically declared
data initialised in Mapper.configure(); and
- a MultithreadedMapRunner?
Regards,
Shane
On Wed, Nov 26, 2008 at 6:41 AM, Doug Cutting [EMAIL PROTECTED] wrote:
tim robertson wrote
)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
On Thu, Nov 27, 2008 at 10:55 AM, tim robertson
Hi all,
I am really struggling with splitting a single file into many files
using hadoop and would appreciate any help offered. The input file is
150,000,000 rows long today, but will grow to 1Billion+.
My mapper simply emits a key that it determines from the data (key
will be used for the
Hi Ricky,
As a newcomer to MR and Hadoop I think what you are doing is a great
addition to the docs. One thing I would like to see in this overview
is how JVMs are spawned in the process - e.g. is it 1 JVM per node
per job, or per node per task etc. The reason being it has
implications about
Hi,
Could you please sanity check this:
In Hadoop-site.xml I add:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1G</value>
  <description>Increasing the size of the heap to allow for large in
  memory index of polygons</description>
</property>
Is this all required to increase the -Xmx for
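(For a single job I believe you can also set it programmatically on the JobConf, rather than cluster-wide:)

conf.set("mapred.child.java.opts", "-Xmx1G");   // conf is the job's JobConf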
, -Xmx1024M version rather than
-Xmx1G. Other than that I think it looks good.
Dennis
tim robertson wrote:
Hi,
Could you please sanity check this:
In Hadoop-site.xml I add:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1G</value>
  <description>Increasing the size of the heap to allow for large
I would still store the result in file, and then write a user
interface that renders the output file as required...
How would you know the user is still on the other end waiting to view
the result? If you are sure, then perhaps the thing that launches the
job could block until it is finished,
Hi all,
I am doing a very simple Map that determines an integer value to
assign to an input (1-64000).
The reduction does nothing, but I then use this output formatter to
put the data in a file per Key.
public class CellBasedOutputFormat extends
MultipleTextOutputFormat<WritableComparable,
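For the archives, the full shape of such a class is roughly this (a sketch; the hook in the old mapred API is generateFileNameForKeyValue, and the "cell_" naming is hypothetical):

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class CellBasedOutputFormat
    extends MultipleTextOutputFormat<WritableComparable, Writable> {

  // called once per record; the returned string names the output file
  @Override
  protected String generateFileNameForKeyValue(WritableComparable key,
      Writable value, String name) {
    return "cell_" + key.toString();   // hypothetical naming: one file per cell id
  }
}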
Hi all,
If I want to have an in memory lookup Hashmap that is available in
my Map class, where is the best place to initialise this please?
I have a shapefile with polygons, and I wish to create the polygon
objects in memory on each node's JVM and have the map able to pull
back the objects by id
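What I have in mind looks roughly like this (a sketch; the shapefile parsing is elided, and the map holds WKT strings for simplicity):

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class PolygonLookup extends MapReduceBase {
  // shared per JVM: polygon id -> polygon (WKT here for simplicity)
  private static Map<String, String> polygonsById;

  @Override
  public void configure(JobConf job) {
    synchronized (PolygonLookup.class) {
      if (polygonsById == null) {   // built once, reused if the JVM serves many tasks
        polygonsById = new HashMap<String, String>();
        // ... parse the shapefile here and populate the map ...
      }
    }
  }
}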
!
Alex
On Tue, Nov 25, 2008 at 11:09 AM, tim robertson
[EMAIL PROTECTED]wrote:
Hi all,
If I want to have an in memory lookup Hashmap that is available in
my Map class, where is the best place to initialise this please?
I have a shapefile with polygons, and I wish to create the polygon
, or is a Mapper.configure() the best place for this?
Can it be called multiple times per Job meaning I need to keep some
static synchronised indicator flag?
Thanks again,
Tim
On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote:
tim robertson wrote:
Thanks Alex
On Nov 25, 2008, at 11:46 AM, tim robertson wrote:
Hi Doug,
Thanks - it is not so much I want to run in a single JVM - I do want a
bunch of machines doing the work, it is just I want them all to have
this in-memory lookup index, that is configured once per job. Is
there some hook somewhere
, tim robertson [EMAIL PROTECTED]wrote:
Hi,
Can someone please point me at the best way to create multiple output
files based on the Key outputted from the Map? So I end up with no
reduction, but a file per Key outputted in the Mapping phase, ideally
with the Key as the file name.
Many thanks
Hi all,
I am running MR which is scanning 130M records and then trying to
group them into around 64,000 files.
The Map does the grouping of the record by determining the key, and
then I use a MultipleTextOutputFormat to write the file based on the
key:
@Override
protected String
Thank you Jeremy
I am on Mac (10.5.5) and it is 256 by default. I will change this and
rerun before running on the cluster.
Thanks again
Tim
On Mon, Nov 24, 2008 at 8:38 AM, Jeremy Chow [EMAIL PROTECTED] wrote:
There is a limit on the number of files each process can open in unix/linux. The
PROTECTED] wrote:
There's a case study with some numbers in it from a presentation I
gave on Hadoop and AWS in London last month, which you may find
interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.
tim robertson [EMAIL PROTECTED] wrote:
For these small
datasets, you might find
I have been processing only 100s of GBs on EC2, not 1000's, using 20
nodes, and really only in an exploration and testing phase right now.
On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
Hi Ryan,
Just a heads up, if you require more than the 20 node limit, Amazon
, but maybe I'm
incorrect in my assumptions. I am also noticing that it takes about 15
minutes to parse through the 15GB of data with a 15 node cluster.
Thanks,
Ryan
On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
I have been processing only 100s GBs on EC2, not 1000's
I suggest reading up on MapReduce first:
http://labs.google.com/papers/mapreduce-osdi04.pdf
Cheers
On Mon, Sep 1, 2008 at 11:27 AM, HHB [EMAIL PROTECTED] wrote:
Hey,
I've been reading about Hadoop lately but I'm unable to understand it.
Would you please explain it to me in easy words?
How
Hi Shirley,
If you mean the distinct words along with counts of their usage for example...
In the Map, output the word as the key and 1 as the value
In the Reduce, count up the values for the key
This is then 1 job.
Cheers
Tim
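In code, that's the classic pair (a sketch against the old org.apache.hadoop.mapred API):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountSketch {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text line,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      for (String token : line.toString().split("\\s+")) {
        if (token.length() == 0) continue;  // leading whitespace yields an empty token
        word.set(token);
        out.collect(word, ONE);             // word -> 1
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> counts,
        OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
      int sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(word, new IntWritable(sum));   // word -> total count
    }
  }
}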
On Tue, Aug 26, 2008 at 3:02 PM, Shirley Cohen [EMAIL PROTECTED]
I am a newbie also, so my answer is not an expert user's by any means.
That said:
This is not what MR is designed for...
If you have a reporting tool for example, which takes a database a
very long time to answer - such a long time that you can't expect a
user to hang around waiting for the
No there isn't unfortunately...
I use this, so I can quickly change versions:
<!--
  mvn -f hadoop-installer.xml install -Dhadoop.version=0.16.4 -Dmaven.test.skip=true
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation=
Hi Ashish
I am very excited to try this, having been evaluating Hadoop, HBase,
Cascading etc recently to process 100 millions of Biodiversity records
(expecting billions soon), with a view for data mining purposes (species
that are critically endangered and observed outside of protected areas
MapReduce on Hadoop is for processing very large amounts of data, or else
the overhead of the framework (job scheduling, failover etc.) does not justify it.
If you are processing 10-100M / min = 14-140G a day, this probably
justifies its use, I would say.
You can't get a performance estimate on a pseudo
Perhaps something like a RandomTextWriter to generate a file for input?
http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/examples/RandomTextWriter.html
Cheers
Tim
On Sat, Jun 28, 2008 at 4:42 AM, Richard Zhang [EMAIL PROTECTED]
wrote:
Hello Folks:
I wrote a map reduce
Hi all,
I have data in a file (150 million lines at 100 GB or so) and have several
MapReduce classes for my processing (custom index generation).
Can someone please confirm the following is the best way to run on EC2 and
S3 (both of which I am new to..)
1) load my 100Gb file into S3
2) create a
Hi all,
I have been battling EC2 all day and getting nowhere (see other message)
Does anyone use the hadoop-ec2-images/hadoop-0.17.0 AMI for small instances
successfully?
Following http://wiki.apache.org/hadoop/AmazonEC2 unfortunately doesn't work
as the slaves don't come up (details in my
Hi all,
I am a day one newbie investigating distributed work for the first time...
I have run through the tutorials with ease (thanks for the nice
documentation) and now have written my first map reduce.
Is it accurate to say that the reduce is repetitively called by the Hadoop
framework until