Re: Sharing object between mappers on same node (reuse.jvm ?)

2009-06-04 Thread Kevin Peterson
On Wed, Jun 3, 2009 at 10:59 AM, Tarandeep Singh tarand...@gmail.com wrote:

 I want to share an object (a Lucene IndexWriter instance) between mappers
 running on the same node within a single job (not across multiple jobs).
 Please correct me if I am wrong --

 If I set the property mapred.job.reuse.jvm.num.tasks to -1, then all mappers
 of one job will be executed in the same JVM, and in that case, if I create a
 static Lucene IndexWriter instance in my mapper class, all mappers running
 on the same node will be able to use it.


Not quite. JVM reuse only controls whether the JVM is torn down after a
single map task finishes and a new one started for the next. It doesn't
change how many JVMs run at once -- you still get one JVM per concurrently
running map or reduce task.

IIRC, there is (or was, or perhaps a patch enables) something closer to what
you are asking for.
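
That said, the lazily initialized static in the original question does work
for the tasks that end up sharing a reused JVM. A rough sketch only, assuming
the old mapred API and Lucene 2.4-era classes; the index path is a
placeholder, and closing/committing the shared writer is left out:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class IndexingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    // Shared only by the tasks the framework runs sequentially in this JVM;
    // concurrent task slots still get their own JVM (and their own writer).
    private static IndexWriter writer;

    private static synchronized IndexWriter getWriter() throws IOException {
      if (writer == null) {
        writer = new IndexWriter(FSDirectory.getDirectory("/tmp/task-index"),
            new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
      }
      return writer;
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      Document doc = new Document();
      doc.add(new Field("body", value.toString(), Field.Store.YES,
          Field.Index.ANALYZED));
      getWriter().addDocument(doc);
    }
  }

  // In the driver, opt in to JVM reuse so sequential tasks land in one JVM:
  // conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);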


Re: Persistent storage on EC2

2009-05-28 Thread Kevin Peterson
On Tue, May 26, 2009 at 7:50 PM, Malcolm Matalka mmata...@millennialmedia.com wrote:

 I'm using EBS volumes to have a persistent HDFS on EC2. Do I need to keep
 the master updated with a mapping from the internal IPs (which, as I
 understand it, change) to a known set of hostnames, so that it knows where
 the blocks are located each time I bring a cluster up? If so, is keeping
 that mapping up to date in /etc/hosts sufficient?


I can't answer your first question of whether it's necessary. The namenode
might be able to figure it out on its own when the datanodes report their
blocks.

Our staging cluster uses the setup you describe, with /etc/hosts pushed out
to all the machines, and the EBS volumes always mounted on the same
hostname. This works great.
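
The pushed file is nothing fancy; it just pins each datanode's current
internal IP to a stable name. A purely hypothetical illustration (the IPs and
hostnames are made up):

  # /etc/hosts pushed to every machine after the cluster comes up
  10.251.42.17   hdfs-data-01
  10.251.43.202  hdfs-data-02
  10.251.44.90   hdfs-data-03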


MultipleOutputs or MultipleTextOutputFormat?

2009-05-28 Thread Kevin Peterson
I am trying to figure out the best way to split output into different
directories. My goal is to have a directory structure allowing me to add the
content from each batch into the right bucket, like this:

...
/content/200904/batch_20090429
/content/200904/batch_20090430
/content/200904/batch_20090501
/content/200904/batch_20090502
/content/200905/batch_20090430
/content/200905/batch_20090501
/content/200905/batch_20090502
...

I would then run my nightly jobs to build the index on /content/200904/* for
the April index and /content/200905/* for the May index.

I'm not sure whether I would be better off using MultipleOutputs or
MultipleTextOutputFormat. I'm having trouble understanding how to set the
output path for these two classes. It seems like MultipleTextOutputFormat is
about partitioning data by key into different files within the same
directory, rather than into different directories as I need. Could I get the
behavior I want by specifying date/batch as my filename, setting the output
path to some temporary work directory, and then moving /work/* to /content?

MultipleOutputs seems to be more about writing all of the data in different
formats, but it's supposed to be simpler to use. Reading the docs, it seems
to be better documented and its API makes more sense (choosing the output
explicitly in the map or reduce, rather than hiding that decision in the
output format), but I don't see any way to set a file name. If I am using
TextOutputFormat, I see no way to put these into different directories.
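
For reference, the MultipleTextOutputFormat route can write into
subdirectories, because the string returned by generateFileNameForKeyValue is
resolved relative to the job's output directory. A rough sketch with the old
mapred API, assuming the map output key carries the month/batch string (e.g.
"200904/batch_20090429"):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

  public class BatchDirectoryOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    // key is assumed to look like "200904/batch_20090429"; the returned string
    // is resolved against the job's output directory, so records end up in
    // <output>/200904/batch_20090429/part-NNNNN.
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
      return key.toString() + "/" + name;
    }

    // Drop the routing key from the records that are actually written out.
    protected Text generateActualKey(Text key, Text value) {
      return null;
    }
  }

The output path would still be set normally with FileOutputFormat.setOutputPath;
whether to point it at a work directory and move results into /content
afterwards is a separate decision.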


Mixing s3, s3n and hdfs

2009-05-08 Thread Kevin Peterson
Currently, we are running our cluster in EC2 with HDFS stored on the local
(i.e. transient) disk. We don't want to deal with EBS, because it
complicates spinning up additional slaves as needed. We're looking at moving
the data we care about to a combination of s3 (block) or s3n, and leaving the
lower-value data that we can recreate on HDFS.

My thinking is that s3n has significant advantages in terms of how easy it
is to import data from non-Hadoop processes, and also the ease of sampling
data, but I'm not sure how well it actually works. I'm guessing that it
wouldn't be able to split files, or maybe it would need to download the
entire file from S3 multiple times to split it? Is the issue with writes
buffering the entire file on the local machine significant? Our jobs tend to
be more CPU-intensive than the usual log-processing jobs, so we usually end
up with smaller files.

Is it feasible to run s3 (block) and hdfs in parallel? Would I need two
namenodes to do this? Is this a good idea?

Has anyone tried either of these configurations in EC2?
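
For what it's worth, mixing filesystems in a single job is mostly a matter of
using fully qualified URIs. A hypothetical sketch with the old mapred API --
the bucket names, hostname, and paths are made up, and the S3 credentials
would still need to be set in the configuration:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class MixedFsJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MixedFsJob.class);
      // Low-value, recreatable data read from the transient HDFS cluster.
      FileInputFormat.addInputPath(conf, new Path("hdfs://namenode:9000/scratch/logs"));
      // Data imported by non-Hadoop processes, read via the native S3 filesystem.
      FileInputFormat.addInputPath(conf, new Path("s3n://my-import-bucket/logs"));
      // Durable output written to the S3 block filesystem.
      FileOutputFormat.setOutputPath(conf, new Path("s3://my-block-bucket/output"));
      // ... set mapper/reducer classes and submit with JobClient.runJob(conf).
    }
  }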


Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-15 Thread Kevin Peterson
On Tue, Apr 14, 2009 at 2:35 AM, tim robertson timrobertson...@gmail.com wrote:


 I am considering (for better throughput, as maps generate huge request
 volumes) pregenerating all my tiles (PNG) and storing them in S3 with
 CloudFront. There will be billions of PNGs, each 1-3 KB.


Storing billions of 1-3 KB PNGs in S3 will be perfectly fine, and there is no
need to generate them all and then push them at once, provided you store each
one in its own S3 object (which you must, if you intend to fetch them through
CloudFront). Each S3 object is independent and can be written fully in
parallel. If you are writing to the same S3 object twice, ... well, you're
doing it wrong.

However, do the math on the S3 costs. We were doing something similar, and
found that we were spending a fortune on PUT requests at $0.01 per 1,000, and
next to nothing on storage. I've since moved to a more complicated model
where I pack many small items into each object and store an index in
SimpleDB. You'll need to partition your SimpleDB domains if you do this.
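
To put a rough number on it from the figures above: at $0.01 per 1,000 PUTs,
one billion individually stored tiles works out to about 1,000,000,000 /
1,000 x $0.01 = $10,000 in request charges alone, before any storage or
transfer costs.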


Re: Amazon Elastic MapReduce

2009-04-02 Thread Kevin Peterson
So if I understand correctly, this is an automated system to bring up a
hadoop cluster on EC2, import some data from S3, run a job flow, write the
data back to S3, and bring down the cluster?

This seems like a pretty good deal. At the pricing they are offering, unless
I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be
cheaper to use this new service.

Does this use an existing Hadoop job control API, or do I need to write my
flows to conform to Amazon's API?


Re: Iterative feedback in map reduce....

2009-03-28 Thread Kevin Peterson
On Fri, Mar 27, 2009 at 4:39 PM, Sid123 itis...@gmail.com wrote:

 But I was thinking of grouping the values and generating a key using a
 random number generator in the collector of the mapper. The values will then
 be uniformly distributed over a few keys. Say the number of keys will be
 0.1% of the number of values, or at least 1, whichever is higher. So if
 there are 2 values, 2000-odd values should be under a single key, and 10
 reducers should spawn to do the sums in parallel... Now I can at least run
 10 sums in parallel rather than just 1 reducer doing the whole work. How
 does that theory seem?


What you want to do is write a combiner, which is essentially a reducer that
runs locally on the map output before it is sent over to the real reducers.
The real reducers then see only the partially summed values rather than every
raw value.
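
A minimal sketch of that setup with the old mapred API (the class and the
key/value types here are hypothetical); the same summing reducer is
registered as the combiner, so partial sums are computed on the map side
before anything crosses the network:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new LongWritable(sum));
    }
  }

  // In the driver, register the same class for both roles:
  // conf.setCombinerClass(SumReducer.class);
  // conf.setReducerClass(SumReducer.class);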


Re: How many nodes does one man want?

2009-03-27 Thread Kevin Peterson
On Thu, Mar 26, 2009 at 4:38 PM, Sid123 itis...@gmail.com wrote:


 I am working on implementing some machine learning algorithms using
 MapReduce. I want to know: if I have data that takes 5-6 hours to train on a
 normal machine, will putting in 2-3 more nodes have an effect? I read in the
 Yahoo Hadoop tutorial:

 Executing Hadoop on a limited amount of data on a small number of nodes may
 not demonstrate particularly stellar performance as the overhead involved in
 starting Hadoop programs is relatively high. Other parallel/distributed
 programming paradigms such as MPI (Message Passing Interface) may perform
 much better on two, four, or perhaps a dozen machines.

 I have at my disposal 3 laptops, each with 4 GB of RAM and 150 GB of disk
 space... I have 600 MB of training data.


I'd say don't bother. Not because adding two machines won't roughly double
your performance (it may come close), but because of the hassle of setting up
Hadoop, having to copy data in and out of HDFS, restructuring your code
around the map-reduce paradigm, and so on.

I have a machine learning task that takes about an hour on my machine. I find
running it locally significantly more convenient than running it on Hadoop,
even though I'm already working within Hadoop. Of course, some of that
inconvenience is due to EC2, not Hadoop itself. If I could run it from inside
Eclipse, it might be a different story.


Re: Building Release 0.19.1

2009-03-13 Thread Kevin Peterson
There may be a separate issue with Windows, but the error related to:

[javac] import
org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut;

is the Eclipse 3.4 issue that is addressed by the patch in
https://issues.apache.org/jira/browse/HADOOP-3744


Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Kevin Peterson
We're using JSON serialization for all our data, but we can't seem to find a
good library. We just discovered that the root cause of our out-of-memory
errors is a leak in the net.sf.json library. Can anyone out there recommend a
Java JSON library that they have actually used successfully within Hadoop?


Re: HADOOP-2536 supports Oracle too?

2009-02-20 Thread Kevin Peterson
On Wed, Feb 18, 2009 at 1:06 AM, sandhiya sandhiy...@gmail.com wrote:

 Thanks a million!!! It worked, but it's a little weird: I have to put the
 library with the JDBC jars in BOTH the executable jar file AND the lib
 folder in $HADOOP_HOME. Do all of you do the same thing, or is it just my
 computer acting strange?


It seems that classes directly referenced by the jar you are running can be
included in the jar's own lib directory, but classes loaded via reflection,
like JDBC drivers, have to be in the Hadoop lib directory. I don't think you
need both.


Re: How to use DBInputFormat?

2009-02-03 Thread Kevin Peterson
On Tue, Feb 3, 2009 at 5:49 PM, Amandeep Khurana ama...@gmail.com wrote:

 In the setInput(...) function in DBInputFormat, there are two sets of
 arguments that one can use.

 1. public static void *setInput*(JobConf

 a) In this, do we necessarily have to give all the fieldNames (which are the
 column names, right?) that the table has, or do we need to specify only the
 ones that we want to extract?


You may specify only those columns that you are interested in.

b) Do we have to have an orderBy, or not necessarily? Does this relate to the
 primary key of the table in any way?


Conditions and order by are not necessary.

a) Is there any restriction on the kind of queries that this function
 can take in the inputQuery string?


I don't think so, but I don't use this method -- I just use the fieldNames
and tableName method.


 I am facing issues getting this to work with an Oracle database and have no
 idea how to debug it (see the email I sent earlier). Can anyone give me some
 input on this, please?


Create a new table that has one column, put about five entries into that
table, then try to get a map job working that outputs the values to a text
file. If that doesn't work, post your code and errors.
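
A sketch of that smoke test with the old mapred API; the JDBC driver class,
connection URL, credentials, table name, and column name below are all
placeholders, not a known-working Oracle configuration:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.lib.db.DBConfiguration;
  import org.apache.hadoop.mapred.lib.db.DBInputFormat;
  import org.apache.hadoop.mapred.lib.db.DBWritable;

  public class DbSmokeTest {

    // One record per row of the one-column test table.
    public static class NameRecord implements Writable, DBWritable {
      String name;
      public void readFields(ResultSet rs) throws SQLException { name = rs.getString("name"); }
      public void write(PreparedStatement ps) throws SQLException { ps.setString(1, name); }
      public void readFields(DataInput in) throws IOException { name = in.readUTF(); }
      public void write(DataOutput out) throws IOException { out.writeUTF(name); }
    }

    // Dump each row's value as a line of text.
    public static class DumpMapper extends MapReduceBase
        implements Mapper<LongWritable, NameRecord, Text, NullWritable> {
      public void map(LongWritable key, NameRecord row,
                      OutputCollector<Text, NullWritable> out, Reporter reporter)
          throws IOException {
        out.collect(new Text(row.name), NullWritable.get());
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(DbSmokeTest.class);
      conf.setInputFormat(DBInputFormat.class);
      DBConfiguration.configureDB(conf, "oracle.jdbc.driver.OracleDriver",
          "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password");
      // Only the columns you care about need to be listed.
      DBInputFormat.setInput(conf, NameRecord.class, "test_table",
          null /* conditions */, null /* orderBy */, "name");
      conf.setMapperClass(DumpMapper.class);
      conf.setNumReduceTasks(0);  // map-only: results go straight to text files
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(NullWritable.class);
      FileOutputFormat.setOutputPath(conf, new Path("/tmp/db-smoke-test"));
      JobClient.runJob(conf);
    }
  }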


Re: DBOutputFormat and auto-generated keys

2009-01-27 Thread Kevin Peterson
On Mon, Jan 26, 2009 at 5:40 PM, Vadim Zaliva kroko...@gmail.com wrote:

 Is it possible to obtain auto-generated IDs when writing data using
 DBOutputFormat?

 For example, is it possible to write a Mapper which stores records in the
 DB and returns the auto-generated IDs of these records?

...

 which I would like to store in normalized form in two tables. The first
 table will store keys (strings). Each key will have a unique int id
 auto-generated by MySQL.

 The second table will have (key_id, value) pairs, with key_id being a
 foreign key pointing to the first table.


A job has only one output format, and the output format can't pass any data
back into the map, so that approach won't work. DBOutputFormat doesn't
provide any way to do it either.

If you wanted to add this kind of functionality, you would need to write your
own output format, one that is aware of your foreign keys; it probably
wouldn't look much like DBOutputFormat, and it would quickly get very
complicated.

One possibility that comes to mind is writing a HibernateOutputFormat or
similar, which would give you a way to express the relationships between
tables, leaving you only the task of hooking your persistence logic up to a
Hadoop output format.

I had a similar problem with writing out reports to be used by a Rails app,
and solved it by restructuring things so that I don't need to write to two
tables from the same map task.


Cannot access svn.apache.org -- mirror?

2008-11-14 Thread Kevin Peterson
I'm trying to import Hadoop Core into our local repository using piston
( http://piston.rubyforge.org/index.html ).

I can't seem to access svn.apache.org though. I've also tried the EU
mirror. No errors, nothing but eventual timeout. Traceroute fails at
corv-car1-gw.nero.net. I got the same errors a couple weeks ago, but
assumed they were just temporary downtime. I have found some messages
from earlier this year about a similar problem where some people can
access it fine, and others just can't connect. I'm able to access it
from a remote shell account, but not from my machine.

Has anyone been able to work around this? Is there any mirror of the
Hadoop repository?