Re: Sharing object between mappers on same node (reuse.jvm ?)

2009-06-04 Thread Kevin Peterson
On Wed, Jun 3, 2009 at 10:59 AM, Tarandeep Singh tarand...@gmail.com wrote:

 I want to share an object (a Lucene IndexWriter instance) between mappers
 running on the same node within a single job (not across multiple jobs).
 Please correct me if I am wrong --

 If I set the property mapred.job.reuse.jvm.num.tasks to -1, then all mappers
 of one job will be executed in the same JVM, and in that case, if I create a
 static Lucene IndexWriter instance in my mapper class, all mappers running
 on the same node will be able to use it.


Not quite. JVM reuse only controls whether the JVM is torn down after a
single map task finishes and a new one started for the next. It doesn't
change how many JVMs run at once -- you still get one JVM per concurrently
running map or reduce task.

IIRC, there is (or was, or perhaps a patch enables) something closer to what
you are asking for.
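
That said, the lazily initialized static in the original question does work
for the tasks that end up sharing a reused JVM. A rough sketch only, assuming
the old mapred API and Lucene 2.4-era classes; the index path is a
placeholder, and closing/committing the shared writer is left out:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class IndexingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    // Shared only by the tasks the framework runs sequentially in this JVM;
    // concurrent task slots still get their own JVM (and their own writer).
    private static IndexWriter writer;

    private static synchronized IndexWriter getWriter() throws IOException {
      if (writer == null) {
        writer = new IndexWriter(FSDirectory.getDirectory("/tmp/task-index"),
            new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
      }
      return writer;
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      Document doc = new Document();
      doc.add(new Field("body", value.toString(), Field.Store.YES,
          Field.Index.ANALYZED));
      getWriter().addDocument(doc);
    }
  }

  // In the driver, opt in to JVM reuse so sequential tasks land in one JVM:
  // conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);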


Re: Persistent storage on EC2

2009-05-28 Thread Kevin Peterson
On Tue, May 26, 2009 at 7:50 PM, Malcolm Matalka mmata...@millennialmedia.com wrote:

 I'm using EBS volumes to have a persistent HDFS on EC2. Do I need to keep
 the master updated with a mapping from the internal IPs (which, as I
 understand it, change) to a known set of hostnames, so that it knows where
 the blocks are located each time I bring a cluster up? If so, is keeping
 that mapping up to date in /etc/hosts sufficient?


I can't answer your first question of whether it's necessary. The namenode
might be able to figure it out on its own when the datanodes report their
blocks.

Our staging cluster uses the setup you describe, with /etc/hosts pushed out
to all the machines, and the EBS volumes always mounted on the same
hostname. This works great.
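
The pushed file is nothing fancy; it just pins each datanode's current
internal IP to a stable name. A purely hypothetical illustration (the IPs and
hostnames are made up):

  # /etc/hosts pushed to every machine after the cluster comes up
  10.251.42.17   hdfs-data-01
  10.251.43.202  hdfs-data-02
  10.251.44.90   hdfs-data-03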


MultipleOutputs or MultipleTextOutputFormat?

2009-05-28 Thread Kevin Peterson
I am trying to figure out the best way to split output into different
directories. My goal is to have a directory structure allowing me to add the
content from each batch into the right bucket, like this:

...
/content/200904/batch_20090429
/content/200904/batch_20090430
/content/200904/batch_20090501
/content/200904/batch_20090502
/content/200905/batch_20090430
/content/200905/batch_20090501
/content/200905/batch_20090502
...

I would then run my nightly jobs to build the index on /content/200904/* for
the April index and /content/200905/* for the May index.

I'm not sure whether I would be better off using MultipleOutputs or
MultipleTextOutputFormat. I'm having trouble understanding how to set the
output path for these two classes. It seems like MultipleTextOutputFormat is
about partitioning data by key into different files within the same
directory, rather than into different directories as I need. Could I get the
behavior I want by specifying date/batch as my filename, setting the output
path to some temporary work directory, and then moving /work/* to /content?

MultipleOutputs seems to be more about writing all of the data in different
formats, but it's supposed to be simpler to use. Reading the docs, it seems
to be better documented and its API makes more sense (choosing the output
explicitly in the map or reduce, rather than hiding that decision in the
output format), but I don't see any way to set a file name. If I am using
TextOutputFormat, I see no way to put these into different directories.
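
For reference, the MultipleTextOutputFormat route can write into
subdirectories, because the string returned by generateFileNameForKeyValue is
resolved relative to the job's output directory. A rough sketch with the old
mapred API, assuming the map output key carries the month/batch string (e.g.
"200904/batch_20090429"):

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

  public class BatchDirectoryOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    // key is assumed to look like "200904/batch_20090429"; the returned string
    // is resolved against the job's output directory, so records end up in
    // <output>/200904/batch_20090429/part-NNNNN.
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
      return key.toString() + "/" + name;
    }

    // Drop the routing key from the records that are actually written out.
    protected Text generateActualKey(Text key, Text value) {
      return null;
    }
  }

The output path would still be set normally with FileOutputFormat.setOutputPath;
whether to point it at a work directory and move results into /content
afterwards is a separate decision.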


Mixing s3, s3n and hdfs

2009-05-08 Thread Kevin Peterson
Currently, we are running our cluster in EC2 with HDFS stored on the local
(i.e. transient) disk. We don't want to deal with EBS, because it
complicates spinning up additional slaves as needed. We're looking at moving
the data we care about to a combination of s3 (block) or s3n, and leaving the
lower-value data that we can recreate on HDFS.

My thinking is that s3n has significant advantages in terms of how easy it
is to import data from non-Hadoop processes, and also the ease of sampling
data, but I'm not sure how well it actually works. I'm guessing that it
wouldn't be able to split files, or maybe it would need to download the
entire file from S3 multiple times to split it? Is the issue with writes
buffering the entire file on the local machine significant? Our jobs tend to
be more CPU-intensive than the usual log-processing jobs, so we usually end
up with smaller files.

Is it feasible to run s3 (block) and hdfs in parallel? Would I need two
namenodes to do this? Is this a good idea?

Has anyone tried either of these configurations in EC2?
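
For what it's worth, mixing filesystems in a single job is mostly a matter of
using fully qualified URIs. A hypothetical sketch with the old mapred API --
the bucket names, hostname, and paths are made up, and the S3 credentials
would still need to be set in the configuration:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class MixedFsJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MixedFsJob.class);
      // Low-value, recreatable data read from the transient HDFS cluster.
      FileInputFormat.addInputPath(conf, new Path("hdfs://namenode:9000/scratch/logs"));
      // Data imported by non-Hadoop processes, read via the native S3 filesystem.
      FileInputFormat.addInputPath(conf, new Path("s3n://my-import-bucket/logs"));
      // Durable output written to the S3 block filesystem.
      FileOutputFormat.setOutputPath(conf, new Path("s3://my-block-bucket/output"));
      // ... set mapper/reducer classes and submit with JobClient.runJob(conf).
    }
  }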


Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-15 Thread Kevin Peterson
On Tue, Apr 14, 2009 at 2:35 AM, tim robertson timrobertson...@gmail.com wrote:


 I am considering (for better throughput, as maps generate huge request
 volumes) pregenerating all my tiles (PNG) and storing them in S3 with
 CloudFront. There will be billions of PNGs, each 1-3 KB.


Storing billions of 1-3 KB PNGs in S3 will be perfectly fine, and there is no
need to generate them all and then push them at once, provided you store each
one in its own S3 object (which you must, if you intend to fetch them through
CloudFront). Each S3 object is independent and can be written fully in
parallel. If you are writing to the same S3 object twice, ... well, you're
doing it wrong.

However, do the math on the S3 costs. We were doing something similar, and
found that we were spending a fortune on PUT requests at $0.01 per 1,000, and
next to nothing on storage. I've since moved to a more complicated model
where I pack many small items into each object and store an index in
SimpleDB. You'll need to partition your SimpleDB domains if you do this.
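
To put a rough number on it from the figures above: at $0.01 per 1,000 PUTs,
one billion individually stored tiles works out to about 1,000,000,000 /
1,000 x $0.01 = $10,000 in request charges alone, before any storage or
transfer costs.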


Re: Amazon Elastic MapReduce

2009-04-02 Thread Kevin Peterson
So if I understand correctly, this is an automated system to bring up a
hadoop cluster on EC2, import some data from S3, run a job flow, write the
data back to S3, and bring down the cluster?

This seems like a pretty good deal. At the pricing they are offering, unless
I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be
cheaper to use this new service.

Does this use an existing Hadoop job control API, or do I need to write my
flows to conform to Amazon's API?


Re: Iterative feedback in map reduce....

2009-03-28 Thread Kevin Peterson
On Fri, Mar 27, 2009 at 4:39 PM, Sid123 itis...@gmail.com wrote:

 But I was thinking of grouping the values and generating a key using a
 random number generator in the collector of the mapper. The values will then
 be uniformly distributed over a few keys. Say the number of keys will be
 0.1% of the number of values, or at least 1, whichever is higher. So if
 there are 2 values, 2000-odd values should be under a single key, and 10
 reducers should spawn to do the sums in parallel... Now I can at least run
 10 sums in parallel rather than just 1 reducer doing the whole work. How
 does that theory seem?


What you want to do is write a combiner, which is essentially a reducer that
runs locally on the map output before it is sent over to the real reducers.
The real reducers then see only the partially summed values rather than every
raw value.
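
A minimal sketch of that setup with the old mapred API (the class and the
key/value types here are hypothetical); the same summing reducer is
registered as the combiner, so partial sums are computed on the map side
before anything crosses the network:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new LongWritable(sum));
    }
  }

  // In the driver, register the same class for both roles:
  // conf.setCombinerClass(SumReducer.class);
  // conf.setReducerClass(SumReducer.class);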


Re: How many nodes does one man want?

2009-03-27 Thread Kevin Peterson
On Thu, Mar 26, 2009 at 4:38 PM, Sid123 itis...@gmail.com wrote:


 I am working on implementing some machine learning algorithms using
 MapReduce. I want to know: if I have data that takes 5-6 hours to train on a
 normal machine, will putting in 2-3 more nodes have an effect? I read in the
 Yahoo Hadoop tutorial:

 Executing Hadoop on a limited amount of data on a small number of nodes may
 not demonstrate particularly stellar performance as the overhead involved in
 starting Hadoop programs is relatively high. Other parallel/distributed
 programming paradigms such as MPI (Message Passing Interface) may perform
 much better on two, four, or perhaps a dozen machines.

 I have at my disposal 3 laptops, each with 4 GB of RAM and 150 GB of disk
 space... I have 600 MB of training data.


I'd say don't bother. Not because adding two machines won't roughly double
your performance (it may come close), but because of the hassle of setting up
Hadoop, having to copy data in and out of HDFS, restructuring your code
around the map-reduce paradigm, and so on.

I have a machine learning task that takes about an hour on my machine. I find
running it locally significantly more convenient than running it on Hadoop,
even though I'm already working within Hadoop. Of course, some of that
inconvenience is due to EC2, not Hadoop itself. If I could run it from inside
Eclipse, it might be a different story.


Re: Building Release 0.19.1

2009-03-13 Thread Kevin Peterson
There may be a separate issue with Windows, but the error related to:

[javac] import
org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut;

is the Eclipse 3.4 issue that is addressed by the patch in
https://issues.apache.org/jira/browse/HADOOP-3744


Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Kevin Peterson
We're using JSON serialization for all our data, but we can't seem to find a
good library. We just discovered that the root cause of our out-of-memory
errors is a leak in the net.sf.json library. Can anyone out there recommend a
Java JSON library that they have actually used successfully within Hadoop?


Re: HADOOP-2536 supports Oracle too?

2009-02-20 Thread Kevin Peterson
On Wed, Feb 18, 2009 at 1:06 AM, sandhiya sandhiy...@gmail.com wrote:

 Thanks a million!!! It worked, but it's a little weird: I have to put the
 library with the JDBC jars in BOTH the executable jar file AND the lib
 folder in $HADOOP_HOME. Do all of you do the same thing, or is it just my
 computer acting strange?


It seems that classes directly referenced by the jar you are running can be
included in the jar's own lib directory, but classes loaded via reflection,
like JDBC drivers, have to be in the Hadoop lib directory. I don't think you
need both.


Re: How to use DBInputFormat?

2009-02-03 Thread Kevin Peterson
On Tue, Feb 3, 2009 at 5:49 PM, Amandeep Khurana ama...@gmail.com wrote:

 In the setInput(...) function in DBInputFormat, there are two sets of
 arguments that one can use.

 1. public static void *setInput*(JobConf

 a) In this, do we necessarily have to give all the fieldNames (which are the
 column names, right?) that the table has, or do we need to specify only the
 ones that we want to extract?


You may specify only those columns that you are interested in.

b) Do we have to have an orderBy, or not necessarily? Does this relate to the
 primary key of the table in any way?


Conditions and order by are not necessary.

a) Is there any restriction on the kind of queries that this function
 can take in the inputQuery string?


I don't think so, but I don't use this method -- I just use the fieldNames
and tableName method.


 I am facing issues getting this to work with an Oracle database and have no
 idea how to debug it (see the email I sent earlier). Can anyone give me some
 input on this, please?


Create a new table that has one column, put about five entries into that
table, then try to get a map job working that outputs the values to a text
file. If that doesn't work, post your code and errors.
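
A sketch of that smoke test with the old mapred API; the JDBC driver class,
connection URL, credentials, table name, and column name below are all
placeholders, not a known-working Oracle configuration:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.lib.db.DBConfiguration;
  import org.apache.hadoop.mapred.lib.db.DBInputFormat;
  import org.apache.hadoop.mapred.lib.db.DBWritable;

  public class DbSmokeTest {

    // One record per row of the one-column test table.
    public static class NameRecord implements Writable, DBWritable {
      String name;
      public void readFields(ResultSet rs) throws SQLException { name = rs.getString("name"); }
      public void write(PreparedStatement ps) throws SQLException { ps.setString(1, name); }
      public void readFields(DataInput in) throws IOException { name = in.readUTF(); }
      public void write(DataOutput out) throws IOException { out.writeUTF(name); }
    }

    // Dump each row's value as a line of text.
    public static class DumpMapper extends MapReduceBase
        implements Mapper<LongWritable, NameRecord, Text, NullWritable> {
      public void map(LongWritable key, NameRecord row,
                      OutputCollector<Text, NullWritable> out, Reporter reporter)
          throws IOException {
        out.collect(new Text(row.name), NullWritable.get());
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(DbSmokeTest.class);
      conf.setInputFormat(DBInputFormat.class);
      DBConfiguration.configureDB(conf, "oracle.jdbc.driver.OracleDriver",
          "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password");
      // Only the columns you care about need to be listed.
      DBInputFormat.setInput(conf, NameRecord.class, "test_table",
          null /* conditions */, null /* orderBy */, "name");
      conf.setMapperClass(DumpMapper.class);
      conf.setNumReduceTasks(0);  // map-only: results go straight to text files
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(NullWritable.class);
      FileOutputFormat.setOutputPath(conf, new Path("/tmp/db-smoke-test"));
      JobClient.runJob(conf);
    }
  }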


Re: DBOutputFormat and auto-generated keys

2009-01-27 Thread Kevin Peterson
On Mon, Jan 26, 2009 at 5:40 PM, Vadim Zaliva kroko...@gmail.com wrote:

 Is it possible to obtain auto-generated IDs when writing data using
 DBOutputFormat?

 For example, is it possible to write a Mapper which stores records in the
 DB and returns the auto-generated IDs of these records?

...

 which I would like to store in normalized form in two tables. The first
 table will store keys (strings). Each key will have a unique int id
 auto-generated by MySQL.

 The second table will have (key_id, value) pairs, with key_id being a
 foreign key pointing to the first table.


A job has only one output format, and the output format can't pass any data
back into the map, so that approach won't work. DBOutputFormat doesn't
provide any way to do it either.

If you wanted to add this kind of functionality, you would need to write your
own output format, one that is aware of your foreign keys; it probably
wouldn't look much like DBOutputFormat, and it would quickly get very
complicated.

One possibility that comes to mind is writing a HibernateOutputFormat or
similar, which would give you a way to express the relationships between
tables, leaving you only the task of hooking your persistence logic up to a
Hadoop output format.

I had a similar problem with writing out reports to be used by a Rails app,
and solved it by restructuring things so that I don't need to write to two
tables from the same map task.


Cannot access svn.apache.org -- mirror?

2008-11-14 Thread Kevin Peterson
I'm trying to import Hadoop Core into our local repository using piston
( http://piston.rubyforge.org/index.html ).

I can't seem to access svn.apache.org though. I've also tried the EU
mirror. No errors, nothing but eventual timeout. Traceroute fails at
corv-car1-gw.nero.net. I got the same errors a couple weeks ago, but
assumed they were just temporary downtime. I have found some messages
from earlier this year about a similar problem where some people can
access it fine, and others just can't connect. I'm able to access it
from a remote shell account, but not from my machine.

Has anyone been able to work around this? Is there any mirror of the
Hadoop repository?