Re: graphical tool for hadoop mapreduce

2009-06-25 Thread Kevin Weil
Some people at Sun have done some recent work on this -- see a blog post at http://blogs.sun.com/jgebis/entry/hadoop_resource_utilization_and_performance, and a subsequent post with more detail at http://blogs.sun.com/jgebis/entry/hadoop_resource_utilization_monitoring_scripts . Kevin On Thu

Re: Sharing object between mappers on same node (reuse.jvm ?)

2009-06-04 Thread Kevin Peterson
On Wed, Jun 3, 2009 at 10:59 AM, Tarandeep Singh wrote: > I want to share a object (Lucene Index Writer Instance) between mappers > running on same node of 1 job (not across multiple jobs). Please correct me > if I am wrong - > > If I set the -1 for the property: mapred.job.reuse.jvm.num.tasks the

MultipleOutputs or MultipleTextOutputFormat?

2009-05-28 Thread Kevin Peterson
I am trying to figure out the best way to split output into different directories. My goal is to have a directory structure allowing me to add the content from each batch into the right bucket, like this: ... /content/200904/batch_20090429 /content/200904/batch_20090430 /content/200904/batch_20090
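The bucket layout described above can be sketched as a small path helper. This is a minimal illustration in plain Java with no Hadoop dependency; the class and method names are hypothetical, and only the /content/YYYYMM/batch_YYYYMMDD pattern comes from the message.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds the per-batch output directory described above.
final class BatchPaths {
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    private BatchPaths() {}

    // Joins root, the month bucket, and the batch directory name.
    // For root "/content" and 2009-04-29 this gives "/content/200904/batch_20090429".
    static String batchDir(String root, LocalDate batchDate) {
        return root + "/" + batchDate.format(MONTH) + "/batch_" + batchDate.format(DAY);
    }
}
```

Deriving both path components from one date keeps the month bucket and the batch name from drifting apart.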

Re: Persistent storage on EC2

2009-05-28 Thread Kevin Peterson
On Tue, May 26, 2009 at 7:50 PM, Malcolm Matalka < mmata...@millennialmedia.com> wrote: > I'm using EBS volumes to have a persistent HDFS on EC2. Do I need to keep > the master updated on how to map the internal IPs, which change as I > understand, to a known set of host names so it knows where t

Re: Suspend or scale back hadoop instance

2009-05-19 Thread Kevin Weil
you could scale back on datanode work entirely by setting the maximum number of mappers or reducers to 1 per node during the day (also in conf/hadoop-site.xml). Kevin On Tue, May 19, 2009 at 7:23 AM, Steve Loughran wrote: > John Clarke wrote: > >> Hi, >> >> I am workin

Mixing s3, s3n and hdfs

2009-05-08 Thread Kevin Peterson
Currently, we are running our cluster in EC2 with HDFS stored on the local (i.e. transient) disk. We don't want to deal with EBS, because it complicates being able to spin up additional slaves as needed. We're looking at moving to a combination of s3 (block) or s3n for data that we care about, and

Re: Using the Stanford NLP with hadoop

2009-04-21 Thread Kevin Peterson
On Sat, Apr 18, 2009 at 5:18 AM, hari939 wrote: > > My project of parsing through material for a semantic search engine > requires > me to use the http://nlp.stanford.edu/software/lex-parser.shtml Stanford > NLP parser on hadoop cluster. > > To use the Stanford NLP parser, one must create a lex

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-15 Thread Kevin Peterson
On Tue, Apr 14, 2009 at 2:35 AM, tim robertson wrote: > > I am considering (for better throughput as maps generate huge request > volumes) pregenerating all my tiles (PNG) and storing them in S3 with > cloudfront. There will be billions of PNGs produced each at 1-3KB > each. > Storing billions o

RE: Hadoop data nodes failing to start

2009-04-08 Thread Kevin Eppinger
Unfortunately not. I don't have much leeway to experiment with this cluster. -kevin -Original Message- From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel Cryans Sent: Wednesday, April 08, 2009 8:30 AM To: core-user@hadoop.apache.org Subject: Re: Hadoop

RE: Hadoop data nodes failing to start

2009-04-08 Thread Kevin Eppinger
k' from IRC for the help. -kevin -Original Message- From: Kevin Eppinger [mailto:keppin...@adknowledge.com] Sent: Tuesday, April 07, 2009 1:05 PM To: core-user@hadoop.apache.org Subject: Hadoop data nodes failing to start Hello everyone- So I have a 5 node cluster that I've bee

Hadoop data nodes failing to start

2009-04-07 Thread Kevin Eppinger
java:997) at java.lang.Thread.run(Thread.java:619) After this the data node shuts down. This same message is appearing on all the failed nodes. Help! -kevin

Re: Amazon Elastic MapReduce

2009-04-02 Thread Kevin Peterson
So if I understand correctly, this is an automated system to bring up a hadoop cluster on EC2, import some data from S3, run a job flow, write the data back to S3, and bring down the cluster? This seems like a pretty good deal. At the pricing they are offering, unless I'm able to keep a cluster at

Re: Iterative feedback in map reduce....

2009-03-28 Thread Kevin Peterson
On Fri, Mar 27, 2009 at 4:39 PM, Sid123 wrote: > But I was thinking of grouping the values and generating a key using a > random number generator in the collector of the mapper. The values will now > be uniformly distributed over a few keys. Say the number of keys will be > 0.1% of the # of value

Re: How many nodes does one man want?

2009-03-27 Thread Kevin Peterson
On Thu, Mar 26, 2009 at 4:38 PM, Sid123 wrote: > > I am working of implementing some machine learning algorithms using Map > Red. > I want to know that If I have data that takes 5-6 hours to train on a > normal > machine. Will putting in 2-3 more nodes have an effect? I read in the yahoo > hadoop

Re: Building Release 0.19.1

2009-03-13 Thread Kevin Peterson
There may be a separate issue with windows, but the error related to: [javac] import org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut; is the eclipse 3.4 issue that is addressed by the patch in https://issues.apache.org/jira/browse/HADOOP-3744

Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Kevin Peterson
We're using JSON serialization for all our data, but we can't seem to find a good library. We just discovered that the root cause of our out-of-memory errors is a leak in the net.sf.json library. Can anyone out there recommend a Java JSON library that they have actually used successfully within Hadoop?

Re: HADOOP-2536 supports Oracle too?

2009-02-20 Thread Kevin Peterson
On Wed, Feb 18, 2009 at 1:06 AM, sandhiya wrote: > Thanks a million!!! It worked. but its a little weird though. I have to put > the Library with the jdbc jars in BOTH the executable jar file AND the lib > folder in $HADOOP_HOME. Do all of you do the same thing or is it just my > computer acting

Re: How to use DBInputFormat?

2009-02-03 Thread Kevin Peterson
On Tue, Feb 3, 2009 at 5:49 PM, Amandeep Khurana wrote: > In the setInput(...) function in DBInputFormat, there are two sets of > arguments that one can use. > > 1. public static void *setInput*(JobConf > > a) In this, do we necessarily have to give all the fieldNames (which are > the > column na

Re: DBOutputFormat and auto-generated keys

2009-01-27 Thread Kevin Peterson
On Mon, Jan 26, 2009 at 5:40 PM, Vadim Zaliva wrote: > Is it possible to obtain auto-generated IDs when writing data using > DBOutputFormat? > > For example, is it possible to write Mapper which stores records in DB > and returns auto-generated > IDs of these records? ... > which I would like t

Cannot access svn.apache.org -- mirror?

2008-11-14 Thread Kevin Peterson
I'm trying to import Hadoop Core into our local repository using piston ( http://piston.rubyforge.org/index.html ). I can't seem to access svn.apache.org though. I've also tried the EU mirror. No errors, nothing but eventual timeout. Traceroute fails at corv-car1-gw.nero.net. I got the same errors

Configuring Hadoop to use S3 for Nutch

2008-09-30 Thread Kevin MacDonald
uration so that Hadoop has everything it needs to function. For example, I somehow have to copy my seed urls file to the S3 bucket in a way that Hadoop can find it. Can anyone point me in the right direction on how to do this? 2008-09-30 13:31:49,926 WARN httpclient.RestS3Service - Response '/%

Re: How do specify certain IP to be used by datanode/namenode

2008-09-06 Thread Kevin
by its own, instead of sticking to the default one? -Kevin On Fri, Sep 5, 2008 at 8:45 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote: > Kevin, > > Did you try changing the > dfs.datanode.dns.interface/dfs.datanode.dns.nameserver/mapred.tasktra

How do specify certain IP to be used by datanode/namenode

2008-09-05 Thread Kevin
when hadoop runs. Does anyone have an idea how I could possibly make it work? Thank you! -Kevin

Specify per file replication factor in "dfs -put" command line

2008-08-29 Thread Kevin
Hi, Does anyone happen to know how to specify the replication factor of a file when I upload it with the "hadoop dfs -put" command? Thank you! Best, -Kevin

Re: Errors when hadoop.tmp.dir is sent to multiple directories

2008-08-26 Thread Kevin
It turns out that I should not set hadoop.tmp.dir to multiple directories. Instead, I should override dfs.data.dir and dfs.name.dir. -Kevin On Mon, Aug 18, 2008 at 3:03 PM, Kevin <[EMAIL PROTECTED]> wrote: > Hi, > > I guess it is not a rare use case to have hadoop d

Re: hadoop + Berkeley Database

2008-08-25 Thread Kevin
I have not tried this, but maybe the "-libjars" option of the hadoop command could help. -Kevin On Mon, Aug 25, 2008 at 4:06 PM, Elia Mazzawi <[EMAIL PROTECTED]> wrote: > ended up putting the bdb library with the hadoop library, works fine now. > > cp /usr/local/BerkeleyDB.4.5/lib/libdb_

Re: Is -D option working with 0.18.0?

2008-08-25 Thread Kevin
Correction to my previous reply: the -D should come after the class name. -Kevin On Mon, Aug 25, 2008 at 3:26 PM, Kevin <[EMAIL PROTECTED]> wrote: > Thank you. I see where I was wrong. The -D should come after "jar" AND > before application-specific parameters. > > Best, > -Kevin

Re: Is -D option working with 0.18.0?

2008-08-25 Thread Kevin
Thank you. I see where I was wrong. The -D should come after "jar" AND before application-specific parameters. Best, -Kevin On Mon, Aug 25, 2008 at 2:18 PM, Chris Douglas <[EMAIL PROTECTED]> wrote: > bin/hadoop fs -D key=value -ls > > works for me. Options to the Ge

Is -D option working with 0.18.0?

2008-08-25 Thread Kevin
Could anyone help verify this? It does not seem to be working here. -Kevin

Re: Question about distributed sort

2008-08-22 Thread Kevin
For a given key, the reducer is called only once. -Kevin On Fri, Aug 22, 2008 at 4:06 PM, Alex Holmes <[EMAIL PROTECTED]> wrote: > If this is the case, can the same reducer be invoked multiple times > with the same key? And if so, would this imply that the key could > appear on mu

Re: Question about distributed sort

2008-08-22 Thread Kevin
IIRC, the same key will always be sent to the same reducer. -Kevin On Fri, Aug 22, 2008 at 4:00 PM, Alex Holmes <[EMAIL PROTECTED]> wrote: > Hi, > > For a given input key, K, in a reduce task, does Hadoop guarantee that > all mapper-emitted values for key K are available in
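A minimal sketch of why hash partitioning gives this guarantee, written in plain Java with no Hadoop dependency. The class and method names are hypothetical; the formula is the usual hash-modulo scheme, and a job that installs a custom partitioner may behave differently.

```java
// Sketch of hash partitioning: a key's reducer index depends only on the key
// and the reducer count, so equal keys always land on the same reducer.
final class HashPartitionSketch {
    private HashPartitionSketch() {}

    static int partitionFor(Object key, int numReducers) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the result is a pure function of the key and the reducer count, every mapper independently routes a given key to the same reducer without any coordination.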

Re: [ANNOUNCE] Hadoop release 0.18.0 available

2008-08-22 Thread Kevin
Is 0.18.0 supposed to be the current stable? -Kevin On Fri, Aug 22, 2008 at 1:44 PM, Nigel Daley <[EMAIL PROTECTED]> wrote: > Release 0.18.0 contains many improvements, new features, bug fixes and > optimizations. > > For release details and downloads, visit: > > http:

Re: Customize job name in command line

2008-08-22 Thread Kevin
Why is -jobconf not recognized, and why is -D overridden by the program code? Best, -Kevin On Fri, Aug 22, 2008 at 2:05 PM, Kevin <[EMAIL PROTECTED]> wrote: > Thank you! > -Kevin > > > > On Fri, Aug 22, 2008 at 1:53 PM, Miles Osborne <[EMAIL PROTECTED]>

Re: Customize job name in command line

2008-08-22 Thread Kevin
Thank you! -Kevin On Fri, Aug 22, 2008 at 1:53 PM, Miles Osborne <[EMAIL PROTECTED]> wrote: > yes: > > -jobconf mapred.job.name > > is your friend > > Miles > > 2008/8/22 Kevin <[EMAIL PROTECTED]> > >> Hi group, >> >> Is it p

Customize job name in command line

2008-08-22 Thread Kevin
Hi group, Is it possible to customize the job name when using "bin/hadoop jar ..."? Best, -Kevin

Re: Get information of input split from MapRunner?

2008-08-21 Thread Kevin
Override "configure(JobConf job)" in your mapper class. Get the "map.input.start" and "map.input.length" from the JobConf. -Kevin On Thu, Aug 21, 2008 at 2:14 PM, Qin Gao <[EMAIL PROTECTED]> wrote: > Hi mailing, > > I want to get information of cu

Errors when hadoop.tmp.dir is sent to multiple directories

2008-08-18 Thread Kevin
upload (put) a file. But everything is right when I use only one directory at each node. Does anyone know about this issue? Thank you! Best, -Kevin

Re: DFS. How to read from a specific datanode

2008-08-07 Thread Kevin
Yes, I agree with you that it should be negotiated. That is "namenode provides an ordered list and the client can choose some based on its own measurements." But I am afraid 0.17.1 does not provide an easy interface for this. -Kevin On Thu, Aug 7, 2008 at 3:40 AM, Steve Loughr

Re: Are lines broken in dfs and/or in InputSplit

2008-08-06 Thread Kevin
Yes, I have looked at the block files and it matches what you said. I am just wondering if there is some property or flag that would turn this feature on, if it exists. -Kevin On Wed, Aug 6, 2008 at 8:01 PM, Taeho Kang <[EMAIL PROTECTED]> wrote: > I guess a quick way to find an answer

Re: How to order all the output file if I use more than one reduce node?

2008-08-06 Thread Kevin
I suppose you meant to sort the result globally across files. AFAIK, this is not currently supported unless you have only one reducer. It is said that version 0.19 will introduce such a capability. -Kevin On Wed, Aug 6, 2008 at 6:01 PM, Xing <[EMAIL PROTECTED]> wrote: > If I use one
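One general way to get a global order with several reducers is range partitioning: give each reducer a contiguous key range, so that the reducers' sorted outputs, read in partition order, form one sorted sequence. The sketch below shows only that idea in plain Java. It is not the capability mentioned for version 0.19, and the class name, method name, and split points are hypothetical.

```java
import java.util.Arrays;

// Sketch of range partitioning over sorted split points.
final class RangePartitionSketch {
    private RangePartitionSketch() {}

    // splitPoints must be sorted ascending; n split points define n + 1 partitions.
    // Partition i receives keys k with splitPoints[i - 1] <= k < splitPoints[i].
    static int partitionFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

The split points determine how evenly work is spread, so in practice they would need to be chosen from a sample of the keys rather than hard-coded.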

Re: Are lines broken in dfs and/or in InputSplit

2008-08-06 Thread Kevin
Hi, I guess this thread is old. But I eventually need to raise the question again as I am more into dfs now. Would a line be broken between adjacent blocks in dfs? Can lines be preserved at the block level? -Kevin On Wed, Jul 16, 2008 at 4:57 PM, Chris Douglas <[EMAIL PROTECTED]>

Re: DFS. How to read from a specific datanode

2008-08-06 Thread Kevin
Thank you for the idea of submitting a request. However, I guess I could not wait until it is served. The worst case is that I would probably hack my copy of hadoop and rebuild it. -Kevin On Wed, Aug 6, 2008 at 11:31 AM, lohit <[EMAIL PROTECTED]> wrote: >>I need this because I do

Re: DFS. How to read from a specific datanode

2008-08-06 Thread Kevin
out which datanode is nearest. -Kevin On Wed, Aug 6, 2008 at 2:31 AM, Samuel Guo <[EMAIL PROTECTED]> wrote: > Kevin wrote: >> >> Hi, >> >> This is about dfs only, not to consider mapreduce. It may sound like a >> strange need, but sometimes I want to read a b

Re: DFS. How to read from a specific datanode

2008-08-06 Thread Kevin
overriding it seems infeasible. Neither are the callers of chooseDataNode public or protected. I need this because I do not want to trust namenode's ordering. For applications where network congestion is rare, we should let the client decide which data node to load from. -Kevin On Tue,

DFS. How to read from a specific datanode

2008-08-05 Thread Kevin
from? Best, -Kevin

Re: mapper input file name

2008-08-04 Thread Kevin
OK. I guess I found out how. Override the "configure" method of the user-defined Map class so that you can take note of the filename. -Kevin On Mon, Aug 4, 2008 at 3:53 PM, Kevin <[EMAIL PROTECTED]> wrote: > Is it possible to get this information in user defined map function?

Re: mapper input file name

2008-08-04 Thread Kevin
Is it possible to get this information in user defined map function? i.e., how do we get the JobConf object in map() function? Another way is to subclass RecordReader to embed file-name in the data, which does not look simple. -Kevin On Sun, Aug 3, 2008 at 10:17 PM, Amareshwari Sriramadasu

Re: Examples of using DFS without MapReduce

2008-08-04 Thread Kevin
Thank you! The Java code is exactly what I want. Following your code, I ran into a user permission issue when trying to write to a file. I wonder if the user id could be manipulated in the protocol. -Kevin On Mon, Aug 4, 2008 at 2:27 PM, Michael Bieniosek <[EMAIL PROTECTED]> wrote:

Examples of using DFS without MapReduce

2008-08-04 Thread Kevin
Hi there, I am trying to use the DFS of hadoop in other applications. It is not clear to me how that could be carried out easily. Could anyone point me in the right direction or share examples? Thank you. -Kevin

Is there a network communication counter for mapred?

2008-07-25 Thread Kevin
Hi, Besides knowing "data-local" and "rack-local" map task numbers, I am interested in the size of the data that is transferred over the network. E.g., the size of intermediate map output transferred (not handled locally). I wonder if there is such a counter. Thank you. Best, -Kevin

DFS, write sequence number and consistency

2008-07-21 Thread Kevin
block and apply to every replica. But in hadoop, how is this achieved? If multiple clients write to the same block, what will happen? Moreover, is this scenario possible under current situation? Thanks and regards, -Kevin

Re: Are lines broken in dfs and/or in InputSplit

2008-07-16 Thread Kevin
I tried a bit and it looks like lines are preserved so far. However, is this property supported for sure, or what should I do to keep it working this way? Thank you. -Kevin On Tue, Jul 15, 2008 at 5:07 PM, Kevin <[EMAIL PROTECTED]> wrote: > Hi, > > I was trying to parse text
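The replies in this thread are cut off here, so the following is only a sketch of the general technique that line-oriented readers use to keep lines whole when a file is divided at arbitrary byte offsets. It is not Hadoop's actual reader code. A split that does not start at offset 0 discards the partial line it begins in, and every split reads past its own end to finish its last line. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch: assign each line to the split that contains the line's first byte.
final class SplitLineSketch {
    private SplitLineSketch() {}

    static List<String> linesForSplit(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;
        if (start > 0) {
            // Back up one byte, then skip to just past the next newline. A line
            // that begins exactly at 'start' is kept; a partial one is skipped.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> lines = new ArrayList<>();
        while (pos < end) {
            // Read the whole line, even if it runs past the end of this split.
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }
}
```

With this convention each line is emitted by exactly one split, so no line is broken or duplicated regardless of where the byte boundaries fall.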

Are lines broken in dfs and/or in InputSplit

2008-07-15 Thread Kevin
InputFormat may not preserve lines. If this is the case, is it possible to restore the lines for mapper input, or do I have to drop broken lines? Thank you. Best, -Kevin

Sorting and partitioner

2008-07-15 Thread Kevin
, -Kevin

When does reducer read mapper's intermediate result?

2008-07-14 Thread Kevin
reducer only needs to do merge sort when it gets all the intermediate files from different mappers). Best, -Kevin
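The merge step referred to above can be sketched as a k-way merge over already-sorted runs. This is a plain-Java illustration that assumes each mapper's intermediate output arrives sorted, as the question presupposes. The class and method names are hypothetical, and it is not Hadoop's implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a k-way merge of sorted runs using a min-heap.
final class KWayMergeSketch {
    private KWayMergeSketch() {}

    // Holds the next unread element of one sorted run plus the rest of that run.
    private static final class Head {
        final String value;
        final Iterator<String> rest;

        Head(String value, Iterator<String> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.value.compareTo(b.value));
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Take the smallest head, then refill the heap from the same run.
            Head smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }
}
```

The heap never holds more than one element per run, so memory use depends on the number of runs rather than on their total size.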

Re: How does org.apache.hadoop.mapred.join work?

2008-07-14 Thread Kevin
Thank you, Chris. This solves my questions. -Kevin On Mon, Jul 14, 2008 at 11:17 AM, Chris Douglas <[EMAIL PROTECTED]> wrote: > "Yielding equal partitions" means that each input source will offer n > partitions and for any given partition 0 <= i < n, the records in th

How does org.apache.hadoop.mapred.join work?

2008-07-14 Thread Kevin
ng equal partitions" mean? Thank you. -Kevin

How to add/remove slave nodes on run time

2008-07-11 Thread Kevin
Hi, I searched a bit but could not find the answer. What is the right way to add (and remove) slave nodes at run time? Thank you. -Kevin