Re: Permissions needed to run RandomWriter ?

2009-06-26 Thread Alex Loddengaard
> mapred.job.tracker = hadoop01:9001 > fs.default.name = hdfs://hadoop01:9000 > hadoop.tmp.dir = /data1/hadoop-tmp/ > dfs.data.dir = /data1/hdfs,/data2/hdfs > Any comments

Re: Permissions needed to run RandomWriter ?

2009-06-26 Thread Alex Loddengaard
Hey Stephen, What does your hadoop-site.xml look like? The Exception is in java.io.UnixFileSystem, which makes me think that you're actually creating and modifying directories on your local file system instead of HDFS. Make sure "fs.default.name" looks like "hdfs://your-namenode.domain.com:PORT"
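A minimal hadoop-site.xml fragment, reusing the hadoop01 host from the thread above (swap in your own name node host and port):
    <property>
      <name>fs.default.name</name>
      <!-- must point at the name node, not the local file system -->
      <value>hdfs://hadoop01:9000</value>
    </property>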

Re: Add new Datnodes : Is redistribution of previous data required?

2009-06-24 Thread Alex Loddengaard
Hi, Running the rebalancer script (by the way, you only need to run it once) redistributes all of your data for you. That is, after you've run the rebalancer, your data should be stored evenly among your 10 nodes. Alex On Wed, Jun 24, 2009 at 2:50 PM, asif md wrote: > hello everyone, > > I ha
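For reference, a sketch of kicking off the balancer (the -threshold percentage is optional and illustrative):
    # runs until block distribution is within the threshold, then exits
    bin/start-balancer.sh -threshold 5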

Re: Is it possible? I want to group data blocks.

2009-06-23 Thread Alex Loddengaard
Hi Hyunsik, Unfortunately you can't control the servers that blocks go on. Hadoop does block allocation for you, and it tries its best to distribute data evenly among the cluster, so long as replicated blocks reside on different machines, on different racks (assuming you've made Hadoop rack-aware

Re: HDFS out of space

2009-06-22 Thread Alex Loddengaard
Are you seeing any exceptions because of the disk being at 99% capacity? Hadoop should do something sane here and write new data to the disk with more capacity. That said, it is ideal to be balanced. As far as I know, there is no way to balance an individual DataNode's hard drives (Hadoop does r

Re: Measuring runtime of Map-reduce Jobs

2009-06-22 Thread Alex Loddengaard
What specific information are you interested in? The job history logs show all sorts of great information (look in the "history" sub directory of the JobTracker node's log directory). Alex On Mon, Jun 22, 2009 at 1:23 AM, bharath vissapragada < bhara...@students.iiit.ac.in> wrote: > Hi , > > Ar
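A quick sketch, assuming the default log location on the JobTracker node:
    # raw per-job history files
    ls $HADOOP_HOME/logs/history/
    # or summarize a finished job from its output directory
    bin/hadoop job -history <job output dir>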

Re: Slides/Videos of Hadoop Summit

2009-06-22 Thread Alex Loddengaard
The Cloudera talks are here: < http://www.cloudera.com/blog/2009/06/22/a-great-week-for-hadoop-summit-west-roundup/ > As for the rest, I'm not sure. Alex On Sun, Jun 21, 2009 at 11:46 PM, jaideep dhok wrote: > Hi all, > Are the slides or videos of the talks given at Hadoop Summit available >

Re: "sleep 60" between "start-dfs.sh" and putting files. Is it normal?

2009-06-19 Thread Alex Loddengaard
Hey Pavel, It's also worth checking the number of data nodes that have registered with the name node, depending on what you're trying to do when HDFS is ready. Try this: hadoop dfsadmin -report | grep "Datanodes available" | awk '{ print $3 }' > - or - MIN_NODES=5 > MAX_RETRIES=15 > counter=0 >
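The preview cuts the script off; a sketch of how such a wait loop might continue (MIN_NODES, MAX_RETRIES, and the sleep interval are placeholders to tune):
    MIN_NODES=5
    MAX_RETRIES=15
    counter=0
    while [ "$counter" -lt "$MAX_RETRIES" ]; do
      nodes=$(hadoop dfsadmin -report | grep "Datanodes available" | awk '{ print $3 }')
      [ "$nodes" -ge "$MIN_NODES" ] && break   # enough data nodes have registered
      counter=$((counter + 1))
      sleep 10
    done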

Re: Read/write dependency wrt total data size on hdfs

2009-06-18 Thread Alex Loddengaard
I'm a little confused about what your question is. Are you asking why HDFS has consistent read/write speeds even as your cluster gets more and more data? If so, two HDFS bottlenecks that would change read/write performance as used capacity changes are name node (NN) RAM and the amount of data each of

Re: Hadoop as Cloud Storage

2009-06-16 Thread Alex Loddengaard
Hey Wildan, HDFS is successfully storing well over 50TBs on a single cluster. It's meant to store data that will be analyzed in a MR job, but it can be used for archival storage. You'd probably consider deploying nodes with lots of disk space vs. lots of RAM and processor power. You'll want to

Re: Anyway to sort "keys" before Reduce function in Hadoop ?

2009-06-15 Thread Alex Loddengaard
Hey Kun, Keys arrive at a given reducer instance in sorted order. That is, for a given reducer JVM instance, the reduce function will be called several times, once for each key, and the keys are handed to the reduce function in sorted order. The sorting happens in the shuffle

Re: parsing open xml

2009-06-15 Thread Alex Loddengaard
op> Hope this helps! Alex On Sat, Jun 13, 2009 at 1:42 AM, Alexandre Jaquet wrote: > Thanks Alex, > > Parsing the documents is a task done within the reducer ? we collect the > datas (document input) within a mapper and then parse it ? > > Thanks in advance > > Alexandre J

Re: parsing open xml

2009-06-12 Thread Alex Loddengaard
When you refer to "filesystem," do you mean HDFS? It's very common to store lots of text files in HDFS and run multiple jobs to process / learn about those text files. As for XML support, you can use Java libraries (or Python libraries if you're using Hadoop streaming) to parse the XML; Hadoop it

Re: Hadoop streaming - No room for reduce task error

2009-06-10 Thread Alex Loddengaard
What is mapred.child.ulimit set to? This configuration option specifies how much memory child processes are allowed to have. You may want to up this limit and see what happens. Let me know if that doesn't get you anywhere. Alex On Wed, Jun 10, 2009 at 9:40 AM, Scott wrote: > Complete newby
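For reference, a hedged hadoop-site.xml example (the value is in kilobytes; 2097152, i.e. 2GB, is an arbitrary ceiling, not a recommendation):
    <property>
      <name>mapred.child.ulimit</name>
      <value>2097152</value>
    </property>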

Re: hadoop hardware suggestion

2009-06-09 Thread Alex Loddengaard
100 nodes is certainly overkill for 500MBs of data, but if you have the resources, you might as well use them I suppose (assuming you're already paying for power, network, cooling, etc.). As for your idea of virtualization, it makes sense. I don't know of anyone running a Hadoop cluster on Window

Re: Few Queries..!!!

2009-06-08 Thread Alex Loddengaard
> Also, I am an undergraduate as of now. I want to be a part of this hadoop > project and get into its development of various sub projects undertaken. > Can > that be feasible.?? > > Thanking You, > > > On Fri, Jun 5, 2009 at 11:19 PM, Alex Loddengaard > wrote: > >

Re: Monitoring hadoop?

2009-06-05 Thread Alex Loddengaard
Hyperic is a great monitoring suite as well, as it auto discovers all sorts of daemons and lets you monitor them without much work at all. It's great. A plugin to collect information from Hadoop itself also exists: < http://github.com/hyperic/hq-hadoop/tree/master> Alex On Fri, Jun 5, 2009 at 9

Re: Few Queries..!!!

2009-06-05 Thread Alex Loddengaard
Hi, The throughput of HDFS is good, because each read is basically a stream from several hard drives (each hard drive holds a different block of the file, and these blocks are distributed across many machines). That said, HDFS does not have very good latency, at least compared to local file syste

Re: Customizing machines to use for different jobs

2009-06-04 Thread Alex Loddengaard
Hi Raakhi, Unfortunately there is no built-in way of doing this. You'd have to instantiate two entirely separate Hadoop clusters to accomplish what you're trying to do, which isn't an uncommon thing to do. I'm not sure why you're hoping to have this behavior, but the fair share scheduler might b

Re: *.gz input files

2009-06-03 Thread Alex Loddengaard
Hi Adam, Gzipped files don't play that nicely with Hadoop, because they aren't splittable. Can you use bzip2 instead? bzip2 files play more nicely with Hadoop, because they're splittable. If you're stuck with gzip, then take a look here: . I don
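If you control the inputs, recompressing is a one-liner (file names are placeholders):
    # re-encode a gzip file as splittable bzip2, then load it into HDFS
    zcat input.gz | bzip2 > input.bz2
    bin/hadoop fs -put input.bz2 /data/input.bz2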

Re: New version/API stable?

2009-05-28 Thread Alex Loddengaard
0.19 is considered unstable by us at Cloudera and by the Y! folks; they never deployed it to their clusters. That said, we recommend 0.18.3 as the most stable version of Hadoop right now. Y! has (or will soon) deploy(ed) 0.20, which implies that it's at least stable enough for them to give it a g

Re: hdfs on public internet/wan

2009-05-27 Thread Alex Loddengaard
It sounds like HDFS probably isn't the right application for you. When new nodes add themselves to the cluster, the administrator needs to rebalance the cluster in order for the new nodes to get data. Without rebalancing, new data will be stored on those new nodes, but old data will not be distri

Re: hadoop hardware configuration

2009-05-27 Thread Alex Loddengaard
Whoops. I answered to the wrong list as well. Sorry for the cross-post. Alex On Wed, May 27, 2009 at 12:39 PM, Alex Loddengaard wrote: > Answers in-line. > > Alex > > On Wed, May 27, 2009 at 6:50 AM, Patrick Angeles > wrote: >> Hey all, >> >> I

Re: Username in Hadoop cluster

2009-05-26 Thread Alex Loddengaard
dhadoop/Hadoop/" as > xxx.xx.xx.251 has username as utdhadoop* . > > Any inputs?? > > Thanks > Pankil > > On Wed, May 20, 2009 at 6:30 PM, Todd Lipcon wrote: > > > On Wed, May 20, 2009 at 4:14 PM, Alex Loddengaard > > wrote: > > > > &

Re: Randomize input file?

2009-05-21 Thread Alex Loddengaard
value:line) > > Reducer will sort on Integer.random() giving you a random ordering for your > input file. > > Best > Bhupesh > > > On 5/21/09 11:15 AM, "Alex Loddengaard" wrote: > > > Hi John, > > > > I don't know of a built-in way to do t

Re: Randomize input file?

2009-05-21 Thread Alex Loddengaard
Hi John, I don't know of a built-in way to do this. Depending on how well you want to randomize, you could just run a MapReduce job with at least one map (the more maps, the more random) and no reduces. When you run a job with no reduces, the shuffle phase is skipped entirely, and the intermedia

Re: Username in Hadoop cluster

2009-05-20 Thread Alex Loddengaard
Ah ha! Good point, Todd. Pankil, with Todd's suggestion, you can ignore the first option I proposed. Thanks, Alex On Wed, May 20, 2009 at 4:30 PM, Todd Lipcon wrote: > On Wed, May 20, 2009 at 4:14 PM, Alex Loddengaard > wrote: > > > First of all, if you can get all

Re: Username in Hadoop cluster

2009-05-20 Thread Alex Loddengaard
First of all, if you can get all machines to have the same user, that would greatly simplify things. If, for whatever reason, you absolutely can't get the same user on all machines, then you could do either of the following: 1) Change the *-all.sh scripts to read from a slaves file that has two f
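A sketch of option (1), assuming a modified slaves file with "host user" on each line:
    # conf/slaves lines look like: host1 alice
    while read host user; do
      ssh "$user@$host" "$HADOOP_HOME/bin/hadoop-daemon.sh start datanode"
    done < conf/slaves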

Re: Mysql Load Data Infile with Hadoop?

2009-05-19 Thread Alex Loddengaard
DBOutputFormat will very likely put significantly more load on your MySQL server vs. LOAD DATA INFILE. DBOutputFormat will trounce your MySQL server with at least one connection per reducer. This may be OK if you have a small number of reducers and a small amount of output data. LOAD DATA INFILE
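A hedged sketch of the LOAD DATA INFILE route (database, table, and path names are made up; assumes tab-separated job output):
    # pull the reducers' output down to a single local file
    bin/hadoop fs -getmerge /user/alex/job-output out.tsv
    # bulk-load in one pass instead of per-row inserts from reducers
    mysql mydb -e "LOAD DATA LOCAL INFILE 'out.tsv' INTO TABLE results"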

Re: Hadoop & Python

2009-05-19 Thread Alex Loddengaard
ike the ease of deploying and reading python compared > with Java but want to make sure using python over hadoop is scalable & is > standard practice and not something done only for prototyping and small > scale tests. > > > On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard

Re: Hadoop & Python

2009-05-19 Thread Alex Loddengaard
Streaming is slightly slower than native Java jobs. Otherwise Python works great in streaming. Alex On Tue, May 19, 2009 at 8:36 AM, s d wrote: > Hi, > How robust is using hadoop with python over the streaming protocol? Any > disadvantages (performance? flexibility?) ? It just strikes me that
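For reference, a typical streaming invocation (paths and script names are placeholders):
    bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -input /data/in -output /data/out \
      -mapper mapper.py -reducer reducer.py \
      -file mapper.py -file reducer.py    # ship the scripts to every node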

Re: Optimal Filesystem (and Settings) for HDFS

2009-05-18 Thread Alex Loddengaard
I believe Yahoo! uses ext3, though I know other people have said that XFS has performed better in various benchmarks. We use ext3, though we haven't done any benchmarks to prove its worth. This question has come up a lot, so I think it'd be worth doing a benchmark and writing up the results. I h
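If someone does run that benchmark, one tweak worth including is the noatime mount option, which skips access-time writes on data disks (an assumption to validate, not settled advice):
    # example /etc/fstab entry for an HDFS data disk
    /dev/sdb1  /data1  ext3  defaults,noatime  0 0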

Re: Master crashed

2009-04-29 Thread Alex Loddengaard
I'm confused. Why are you trying to stop things when you're bringing the name node back up? Try running start-all.sh instead. Alex On Tue, Apr 28, 2009 at 4:00 PM, Mayuran Yogarajah < mayuran.yogara...@casalemedia.com> wrote: > The master in my cluster crashed, the dfs/mapred java processes ar

Re: The mechanism of choosing target datanodes

2009-04-23 Thread Alex Loddengaard
I believe the blocks will be distributed across data nodes and not local to only one data node. If this wasn't the case, then running a MR job on the file would only be local to one task tracker. Alex On Thu, Apr 23, 2009 at 2:14 AM, Xie, Tao wrote: > > If a cluster has many datanodes and I wa

Re: NameNode Startup Problem

2009-04-22 Thread Alex Loddengaard
Can you post your hadoop-site.xml? Also, what prompted this problem? Did you bounce the cluster? Alex On Wed, Apr 22, 2009 at 8:16 AM, Tamir Kamara wrote: > Hi, > > After a while working with hadoop I'm now faced with a situation where the > namenode won't start up. I'm working with a patched

Re: How to access data node without a passphrase?

2009-04-22 Thread Alex Loddengaard
RPMs won't work on Ubuntu, but we're almost finished with DEBs, which will work on Ubuntu. Shoot Todd an email if you want to try out our DEBs: Are you asking about choosing a Linux distribution? The problem with Ubuntu is that it changes very frequently and generally uses relatively new softw

Re: How to access data node without a passphrase?

2009-04-21 Thread Alex Loddengaard
I would recommend installing the Hadoop RPMs and avoid the start-all scripts all together. The RPMs ship with init scripts, allowing you to start and stop daemons with /sbin/service (or with a configuration management tool, which I assume you'll be using as your cluster grows). Here's more info o
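Once the init scripts are installed, daemons are managed like any other service (the exact service names depend on the RPMs, so treat these as assumptions):
    sudo /sbin/service hadoop-datanode start
    sudo /sbin/service hadoop-tasktracker status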

Re: Problem with using differnt username

2009-04-17 Thread Alex Loddengaard
I don't think you can tell start-all.sh to log in as a different user on certain nodes. Why not just create the same user on the fourth node? An alternative would be to start the fourth node manually via hadoop-daemon.sh script. Here's an example: bin/hadoop-daemon.sh start datanode bin/hadoop-
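The preview truncates the second command; the pair presumably looks like:
    # start the two worker daemons on the fourth node by hand
    bin/hadoop-daemon.sh start datanode
    bin/hadoop-daemon.sh start tasktracker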

Re: getting DiskErrorException during map

2009-04-16 Thread Alex Loddengaard
tomatically created so is it seems like hadoop.tmp.dir is set > properly. However, hadoop still creates > /tmp/hadoop-jim/mapred/local and uses that directory for the local storage. > > I'm starting to suspect that mapred.local.dir is overwritten to a default > value of /tmp/hadoop-$

Re: Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist

2009-04-15 Thread Alex Loddengaard
Data stored to /tmp has no consistency / reliability guarantees. Your OS can delete that data at any time. Configure hadoop-site.xml to store data elsewhere. Grep for "/tmp" in hadoop-default.xml to see all the configuration options you'll have to change. Here's the list I came up with: hadoop
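A sketch of the kind of overrides involved, assuming a /data1 mount (include every property your grep turns up, not just these two):
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data1/hadoop-tmp</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/data1/dfs/name</value>
    </property>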

Re: getting DiskErrorException during map

2009-04-14 Thread Alex Loddengaard
either. Is there any > other property that I'm supposed to change? I tried searching for "/tmp" in > the hadoop-default.xml file but couldn't find anything else. > > Thanks, > Jim > > > On Tue, Apr 7, 2009 at 9:35 PM, Alex Loddengaard > wrote:

Re: More Replication on dfs

2009-04-14 Thread Alex Loddengaard
> Corrupt blocks:0 > Missing replicas: 675 (128.327 %) > Number of data-nodes: 2 > Number of racks: 1 > > > The filesystem under path '/' is HEALTHY > Please tell what is wrong. > > Aseem > > -Ori

Re: Does the HDFS client read the data from NameNode, or from DataNode directly?

2009-04-10 Thread Alex Loddengaard
Data is streamed directly from the data nodes themselves. The name node is only queried for block locations and other meta data. Alex On Fri, Apr 10, 2009 at 8:33 AM, Stas Oskin wrote: > Hi. > > I wanted to verify a point about HDFS client operations: > > When asking for file, is the all commu

Re: Add second partition to HDFS

2009-04-10 Thread Alex Loddengaard
Make sure you bounce the datanode daemon once you change the configuration file as well. Alex On Fri, Apr 10, 2009 at 8:23 AM, Ravi Phulari wrote: > Add your second disk name in dfs.data.dir . > Refer - http://hadoop.apache.org/core/docs/r0.19.1/cluster_setup.html > > dfs.data.dir = Comma sepa
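Putting the two steps together (mount points are assumptions):
    <property>
      <name>dfs.data.dir</name>
      <!-- existing disk first, new partition second -->
      <value>/data1/hdfs,/data2/hdfs</value>
    </property>
Then bounce the daemon on that node, e.g. bin/hadoop-daemon.sh stop datanode followed by start datanode.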

Re: More Replication on dfs

2009-04-10 Thread Alex Loddengaard
o the question, how does one decide what is the optimal > replication > factor for a cluster. For instance what would be the appropriate > replication > factor for a cluster consisting of 5 nodes. > Mithila > > On Fri, Apr 10, 2009 at 8:20 AM, Alex Loddengaard > wrote: >

Re: More Replication on dfs

2009-04-10 Thread Alex Loddengaard
the optimal replication > factor for a cluster. For instance what would be the appropriate > replication > factor for a cluster consisting of 5 nodes. > Mithila > > On Fri, Apr 10, 2009 at 8:20 AM, Alex Loddengaard > wrote: > > > Did you load any files when

Re: More Replication on dfs

2009-04-09 Thread Alex Loddengaard
Did you load any files when replication was set to 3? If so, you'll have to rebalance: Note that most people run HDFS with a replication factor
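If you also want existing files brought down to the new factor, setrep can do it (a factor of 2 here is only an example):
    # recursively reset replication on everything, waiting for completion
    bin/hadoop dfs -setrep -R -w 2 /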

Re: HDFS as a logfile ??

2009-04-09 Thread Alex Loddengaard
This is a great idea and a common application, Ricky. Scribe is probably useful for you as well: < http://images.google.com/imgres?imgurl=http://farm3.static.flickr.com/2211/2197670659_b42810b8ba.jpg&imgrefurl=http://www.flickr.com/photos/niallkenne

Re: HDFS read/write speeds, and read optimization

2009-04-09 Thread Alex Loddengaard
Answers in-line. Alex On Thu, Apr 9, 2009 at 3:45 PM, Stas Oskin wrote: > Hi. > > I have 2 questions about HDFS performance: > > 1) How fast are the read and write operations over network, in Mbps per > second? Hypertable (a BigTable implementation) has a good KFS vs. HDFS breakdown: < http://

Re: BytesWritable get() returns more bytes then what's stored

2009-04-08 Thread Alex Loddengaard
FYI: this (open) JIRA might be interesting to you: Alex On Wed, Apr 8, 2009 at 7:18 PM, Todd Lipcon wrote: > On Wed, Apr 8, 2009 at 7:14 PM, bzheng wrote: > > > > > Thanks for the clarification. Though I still find it strange why not > have

Re: getting DiskErrorException during map

2009-04-07 Thread Alex Loddengaard
The getLocalPathForWrite function that throws this Exception assumes that you have space on the disks that mapred.local.dir is configured on. Can you verify with `df` that those disks have space available? You might also try moving mapred.local.dir off of /tmp if it's configured to use /tmp right
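A quick check, assuming mapred.local.dir points at /data1 and /data2:
    # confirm the mapred.local.dir disks have free space
    df -h /data1 /data2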

Re: Streaming data into Hadoop

2008-12-08 Thread Alex Loddengaard
This should answer your questions: Alex On Mon, Dec 8, 2008 at 2:19 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote: > Hello all, > > I normally upload files into hadoop via bin/hadoop fs -put file dest. > > However, is there a way to somehow stream dat
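One common trick, if your version's fs -put accepts "-" for stdin (worth verifying first):
    # stream a growing log straight into HDFS without a temp file
    tail -f /var/log/app.log | bin/hadoop fs -put - /logs/app.log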

Re: JDBC input/output format

2008-12-08 Thread Alex Loddengaard
Here are some useful links with regard to reading from and writing to MySQL: Those two issues should answer your questions. Alex On Mon, Dec 8, 2008 at 9:10 AM, Edward Capriolo <[EMAIL PROTECTE

Re: How to install and use chukwa and x-trace?

2008-12-08 Thread Alex Loddengaard
The only Chukwa documentation that I know about is here: However, a big commit just went in that could have some documentation in it. At first glance, it doesn't look like the recent commit included much documentation, though. Here is the recent commit: <

Re: slow shuffle

2008-12-05 Thread Alex Loddengaard
These configuration options will be useful: > mapred.job.shuffle.merge.percent > 0.66 > The usage threshold at which an in-memory merge will be > initiated, expressed as a percentage of the total memory allocated to > storing in-memory map outputs, as defined by > mapred.job.shuffle.i
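For reference, the two knobs as they would appear in hadoop-site.xml (these values are the defaults, shown only as a starting point):
    <property>
      <name>mapred.job.shuffle.merge.percent</name>
      <value>0.66</value>
    </property>
    <property>
      <name>mapred.job.shuffle.input.buffer.percent</name>
      <value>0.70</value>
    </property>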

Re: Strange behavior with bzip2 input files w/release 0.19.0

2008-12-04 Thread Alex Loddengaard
Currently in Hadoop you cannot split bzip2 files: However, gzip files can be split: Hope this helps. Alex On Thu, Dec 4, 2008 at 9:11 AM, Andy Sautins <[EMAIL PROTECTED]>wrote: > > >I'm s

Re: Optimized way

2008-12-04 Thread Alex Loddengaard
Well, Map/Reduce and Hadoop by definition run maps in parallel. I think you're interested in the following two configuration settings: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum These go in hadoop-site.xml and will set the number of map and reduce tasks for each
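For example (the counts are illustrative; size them to your cores and memory):
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>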

Re: Hadoop balancer

2008-12-03 Thread Alex Loddengaard
Have you tried running fsck? fsck will tell you if you have corruption. Alex On Wed, Dec 3, 2008 at 7:37 AM, Ryan LeCompte <[EMAIL PROTECTED]> wrote: > I've tried running the bin/hadoop balance command since I recently > ad
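A sketch of a verbose fsck run:
    # per-file block report; look for CORRUPT or "Missing replicas" in the summary
    bin/hadoop fsck / -files -blocks -locations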

Re: "Lookup" HashMap available within the Map

2008-11-25 Thread Alex Loddengaard
You should use the DistributedCache: < http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/ > and < http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache > Hope this helps! Alex On Tue, Nov 25, 2008 at 11:09 AM, tim robert
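For streaming users, the same mechanism is exposed as -cacheFile (paths here are placeholders):
    bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -input /data/in -output /data/out \
      -mapper mapper.py -file mapper.py \
      -cacheFile hdfs://namenode:9000/lookup/table.txt#lookup.txt   # visible as ./lookup.txt in the task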

Re: Hadoop Installation

2008-11-21 Thread Alex Loddengaard
mons website? > > Thanks > Mithila > On Fri, Nov 21, 2008 at 8:15 PM, Mithila Nagendra <[EMAIL PROTECTED]> > wrote: > > > I tried the 0.18.2 as welll.. it gave me the same exception.. so tried > the > > lower version.. I should check if this works.. Thanks

Re: Hadoop Development Status

2008-11-20 Thread Alex Loddengaard
0 PM > Subject: Re: Hadoop Development Status > > This is very nice. > A suggestion if it is related to the development status. > Do you think guys you can analyze which questions are > discussed most often in the mailing lists, so that we could > update our FAQs based on that. &

Re: Hadoop Installation

2008-11-20 Thread Alex Loddengaard
Maybe try downloading the Apache Commons - Logging jars (< http://commons.apache.org/downloads/download_logging.cgi>) and drop them in to $HADOOP_HOME/lib. Just curious, if you're starting a new cluster, why have you chosen to use 0.17.* and not 0.18.2? It would be a good idea to use 0.18.2 if pos

Hadoop Development Status

2008-11-20 Thread Alex Loddengaard
Some engineers here at Cloudera have been working on a website to report on Hadoop development status, and we're happy to announce that the website is now available! We've written a blog post describing its usefulness, goals, and future, so take a look if you're interested: < http://www.cloudera.

Re: Hadoop Installation

2008-11-19 Thread Alex Loddengaard
None of your emails have had attachments. I think this list might strip them. Can you copy-paste the error? Though I think the error won't be useful. I'm pretty confident your issue is with Java. What UNIX are you using? Alex On Wed, Nov 19, 2008 at 11:38 AM, Mithila Nagendra <[EMAIL PROTECTE

Re: Hadoop Installation

2008-11-19 Thread Alex Loddengaard
By "UNIX" do you mean FreeBSD? The Hadoop configuration is platform agnostic, so your issue is probably related to your Java configuration (classpath, etc). Alex On Wed, Nov 19, 2008 at 10:20 AM, Mithila Nagendra <[EMAIL PROTECTED]> wrote: > I ve attached the screen shots of the exception and ha

Re: What do you do with task logs?

2008-11-18 Thread Alex Loddengaard
You could take a look at Chukwa, which essentially collects and drops your logs to HDFS: The last time I tried to play with Chukwa, it wasn't in a state to be played with yet. If that's still the case, then you can use Scribe to collect all of your logs in a

Re: 0.18.2 release compiled with java 6 ?

2008-11-18 Thread Alex Loddengaard
Or just run `ant jar` from $HADOOP_HOME and grab the jar (postfixed with -dev) in $HADOOP_HOME/build. Alex On Tue, Nov 18, 2008 at 6:30 AM, 柳松 <[EMAIL PROTECTED]> wrote: > You can also rebuild the jar by compiling all the sources in the 'src' > folder with your working jdk. > > > > > > On 2008-11-1

Re: Cleaning up files in HDFS?

2008-11-14 Thread Alex Loddengaard
A Python script that queried HDFS through the command line (use hadoop fs -lsr) would definitely suffice. I don't know of any toolsets or frameworks for pruning HDFS, other than this: Alex On Fri, Nov 14, 2008 at 5:08 PM, Erik Holstad <[EMAIL PR
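A rough pruning sketch (the date filter and reliance on the path being the last lsr column are assumptions to check against your version):
    # delete files whose modification date falls in October 2008
    bin/hadoop fs -lsr /logs | grep " 2008-10-" \
      | awk '{ print $NF }' | xargs -n 1 bin/hadoop fs -rm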

Re: HDFS NameNode and HA: best strategy?

2008-11-14 Thread Alex Loddengaard
rms periodic checkpoints: > > http://wiki.apache.org/hadoop/FAQ?highlight=(secondary)#7 > > Are there any instructions out there on how to copy the FS image and edits > log from the secondary NameNode to a new machine when the original NameNode > fails? > > Bill > > On F

Re: HDFS NameNode and HA: best strategy?

2008-11-14 Thread Alex Loddengaard
HDFS does have a single point of failure, and there is no way around this in its current implementation. The namenode keeps track of an FS image and an edits log. It's common for these to be stored both on the local disk and on an NFS mount. In the case when the namenode fails, a new machine can

Re: Any suggestion on performance improvement ?

2008-11-14 Thread Alex Loddengaard
How big is the data that you're loading and filtering? Your cluster is pretty small, so if you have data on the magnitude of tens or hundreds of GBs, then the performance you're describing is probably to be expected. How many map and reduce tasks are you running on each node? Alex On Thu, Nov 13

Re: Web Proxy to Access DataNodes

2008-11-13 Thread Alex Loddengaard
You could also have your developers set up a SOCKS proxy with the -D option to ssh. Then have them install FoxyProxy. The approach you're proposing will make maintaining access to your datanodes difficult. That is, for each new datanode, you'll have to add a proxy rule to Apache. With the SOCK
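The developer-side setup is a single ssh flag (the port and gateway host are examples):
    # local SOCKS proxy on port 1080, tunneled through the cluster gateway
    ssh -D 1080 user@gateway.example.com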

Re: Hadoop+log4j

2008-11-11 Thread Alex Loddengaard
Have you seen this: Alex On Tue, Nov 11, 2008 at 6:03 PM, ZhiHong Fu <[EMAIL PROTECTED]> wrote: > Hello, > >I'm very sorry to trouble you, I'm developing a MapReduce > Application, And I can get Log.INFO in InputFormat ,but In M

Re: Best way to handle namespace host failures

2008-11-10 Thread Alex Loddengaard
There has been a lot of discussion on this list about handling namenode failover. Generally the most common approach is to backup the namenode to an NFS mount and manually instantiate a new namenode when your current namenode fails. As Hadoop exists today, the namenode is a single point of failure
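In practice that means listing a local path and an NFS path together in dfs.name.dir (paths are examples):
    <property>
      <name>dfs.name.dir</name>
      <!-- image and edits written to both; the NFS copy survives a dead namenode -->
      <value>/data1/dfs/name,/mnt/nfs/dfs/name</value>
    </property>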

Re: hadoop with tomcat

2008-11-10 Thread Alex Loddengaard
Do you know about the jobtracker page? Visit http://<jobtracker host>:50030. This page (served by Jetty) gives you statistics about your cluster and each MR job. Alex On Sun, Nov 9, 2008 at 11:33 PM, ZhiHong Fu <[EMAIL PROTECTED]> wrote: > Hello: > > I have implemented a Map/Reduce job, which will receive

Re: Hadoop

2008-11-06 Thread Alex Loddengaard
Can you say more about what you're looking for? Are you interested in using Hadoop to serve web content? Are you interested in using Hadoop to analyze Internet crawl data? Or are you interested in using Hadoop from a remote data center? Alex On Thu, Nov 6, 2008 at 1:47 PM, Francesc Bruguera <[E

Re: HDFS Login Security

2008-11-04 Thread Alex Loddengaard
Look at the "hadoop.job.ugi" configuration option. You can manually set a user and the groups that user is a part of. Alex On Tue, Nov 4, 2008 at 1:42 PM, Wasim Bari <[EMAIL PROTECTED]> wrote: > Hi, > Do we have any Java class for Login purpose to HDFS programmatically > like traditional Use
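Since it is an ordinary configuration option, it can also be passed per command (the user and group names are made up):
    # act as user 'alice' in groups 'eng' and 'data' for this one command
    bin/hadoop fs -D hadoop.job.ugi=alice,eng,data -ls /user/alice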

Re: Recovery from Failed Jobs

2008-11-04 Thread Alex Loddengaard
With regard to checkpointing, not yet. This JIRA is a prerequisite: I'm a little confused about what you're trying to do with log parsing. You should consider Scribe or Chukwa, though Chukwa isn't ready to be used yet. Learn more here: Chukwa:

Re: Problem while starting Hadoop

2008-11-04 Thread Alex Loddengaard
Does 'ping lca2-s3-pc01' resolve from lca2-s3-pc04 and vise-versa? Are your 'slaves' and 'master' configuration files configured correctly? You can also try stopping everything, deleting all of your Hadoop data on each machine (by default in /tmp), reformating the namenode, and starting all again.
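Spelled out, the reset sequence looks like this (it destroys all HDFS data, so only do it on a fresh cluster):
    bin/stop-all.sh
    rm -rf /tmp/hadoop-*           # on every machine in the cluster
    bin/hadoop namenode -format
    bin/start-all.sh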

Re: How to read mapreduce output in HDFS directory from Web Application

2008-11-02 Thread Alex Loddengaard
Someone else correct me if I'm wrong, but I don't think HBase queries run nearly fast enough to be displayed on a website. You would see long load times, creating a bad user experience. Agreed that you should definitely be concerned with MySQL tables becoming insanely large. MySQL is real

Re: Can anyone recommend me a inter-language data file format?

2008-11-02 Thread Alex Loddengaard
; Any advices? > > On Sun, Nov 2, 2008 at 1:45 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote: > > > Agree, we use Thrift at Rapleaf for this purpose. It's trivial to make a > > ThriftWritable if you want to be crafty, but you can also just use > byte[]s > > a

Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Alex Loddengaard
Take a look at Thrift: Alex On Sat, Nov 1, 2008 at 7:15 PM, Zhou, Yunqing <[EMAIL PROTECTED]> wrote: > The project I focused on has many modules written in different languages > (several modules are hadoop jobs). > So I'd like to utilize a common record b

Re: How to read mapreduce output in HDFS directory from Web Application

2008-11-01 Thread Alex Loddengaard
I suppose it depends on what you're trying to do. One approach would be to output SQL insert statements and import them in to a database that a web app could query. On the other hand, you could output XML or JSON that can be queried by an AJAX app. Read more about MySQL connectivity here:

Re: hostname in logs

2008-10-31 Thread Alex Loddengaard
an <[EMAIL PROTECTED]> wrote: > Alex Loddengaard wrote: > >> I'd like my log messages to display the hostname of the node that they >> were >> outputted on. Sure, this information can be grabbed from the log >> filename, >> but I would like eac

Re: Debugging / Logging in Hadoop?

2008-10-30 Thread Alex Loddengaard
me of the slides in the > video) > > > On Oct 30, 2008, at 1:34 PM, Alex Loddengaard wrote: > > Arun gave a great talk about debugging and tuning at the Rapleaf event. >> Take a look: >> <http://www.vimeo.com/2085477> >> >> Alex >> >>

hostname in logs

2008-10-30 Thread Alex Loddengaard
I'd like my log messages to display the hostname of the node that they were outputted on. Sure, this information can be grabbed from the log filename, but I would like each log message to also have the hostname. I don't think log4j provides support to include the hostname in a log, so I've tried

Re: Debugging / Logging in Hadoop?

2008-10-30 Thread Alex Loddengaard
Arun gave a great talk about debugging and tuning at the Rapleaf event. Take a look: Alex On Thu, Oct 30, 2008 at 6:20 AM, Malcolm Matalka < [EMAIL PROTECTED]> wrote: > I'm not sure of the correct way, but when I need to log a job I have it > print out with some u

Re: Why separate Map/Reduce task limits per node ?

2008-10-28 Thread Alex Loddengaard
son I can think of for having > separate map and reduce task limits, is the default scheduler. > It wants to schedule all map tasks first, so you really need to limit the > number of > them so that reduces have a chance to run. > > Thanks for any insight, > Doug > > >

Re: namenode failure

2008-10-28 Thread Alex Loddengaard
Manually killing a process might create a situation where only a portion of your data is written to disk, and other data in queue to be written is lost. This is what has most likely caused corruption in your name node. Start by reading about bin/hadoop fsck:
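A sketch of triaging after such a crash:
    # check the namespace first; -move sends corrupt files to /lost+found
    bin/hadoop fsck /
    bin/hadoop fsck / -move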

Re: Why separate Map/Reduce task limits per node ?

2008-10-27 Thread Alex Loddengaard
In most jobs, map and reduce tasks are significantly different, and their runtimes vary as well. The number of reducers also determines how many output files you have. So in the case when you would want one output file, having a single generic task limit would mean that you'd also have one mapp

Re: How does an offline Datanode come back up ?

2008-10-27 Thread Alex Loddengaard
I'm pretty sure that failed nodes won't automatically rejoin the cluster after they go down. It's the sysadmin's responsibility to deal with downed nodes and get them back into the cluster. Alex On 10/27/08, wmitchell <[EMAIL PROTECTED]> wrote: > > Hi All, > > Ive been working michael nolls

Re: Using hadoop as storage cluster?

2008-10-25 Thread Alex Loddengaard
1GB - 6GB range (disk images and database backups, mostly). > There would also be a few (comparatively few, that is) configuration files > of a few kB each. > > Thanks for the response; do you know of any other systems with similar > functionality? > > Dave > > >

Re: Using hadoop as storage cluster?

2008-10-24 Thread Alex Loddengaard
What files do you expect to be storing? Generally speaking, HDFS (Hadoop's distributed file system) does not handle small files very efficiently. Instead it's optimized for large files, upwards of 64MB each. Alex On Fri, Oct 24, 2008 at 9:41 AM, David C. Kerber < [EMAIL PROTECTED]> wrote: > Hi

Re: Is it possible to change parameters using org.apache.hadoop.conf.Configuration API?

2008-10-22 Thread Alex Loddengaard
Just to be clear, you want to persist a configuration change to your entire cluster without bringing it down, and you're hoping to use the Configuration API to do so. Did I get your question correct? I don't know of a way to do this without restarting the cluster, because I'm pretty sure Configur

Re: Problems running the Hadoop Quickstart

2008-10-20 Thread Alex Loddengaard
Have you looked at your logs yet? You should look at your logs and post any errors or warnings. Alex On Mon, Oct 20, 2008 at 8:29 PM, Amareshwari Sriramadasu < [EMAIL PROTECTED]> wrote: > Has your task-tracker started? I mean, do you see non-zero nodes on your > job tracker UI? > > -Amareshwari

Re: Does anybody have tried to setup a cluster with multiple namenodes?

2008-10-20 Thread Alex Loddengaard
I believe the common practice is to have a secondary namenode, which by default is enabled. Secondary namenodes serve the purpose of having a redundant backup. However, as far as I'm aware, they are not hot swappable. This means that if your namenode fails, then your cluster will go down until y

Re: Chukwa Support

2008-10-17 Thread Alex Loddengaard
> javadoc should tell you what the methods need to do. > > You start an adaptor by sending a command of the form "add [classname] > > [parameters] 0" to the Chukwa agent over TCP. By default, Chukwa > > listens on port 9093. > > > > I don't believe

Chukwa Support

2008-10-15 Thread Alex Loddengaard
I'm trying to play with Chukwa, but I'm struggling to get anything going. I've been operating off of the wiki entry (< http://wiki.apache.org/hadoop/Chukwa_Quick_Start>), making revisions as I go along. It's unclear to me how to 1) create an adapter and 2) start HICC (see the wiki for more infor

Re: Basic doubts and questions

2008-10-12 Thread Alex Loddengaard
This should answer all of your questions: Or if you run Ubuntu, cross-reference this: < http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) > Alex On Sun, Oct 12, 2008 at 4:33 AM, Amit k. Saha <[EMAIL PROTE

Re: architecture diagram

2008-10-08 Thread Alex Loddengaard
igure this out now and get it to work. I will check back in > if I get it. All that is missing at the moment is in my pivot back mapping > step. Thanks for the help. > > Terrence A. Pietrondi > > > --- On Tue, 10/7/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote: >
