Re: Too many open files Error

2012-01-26 Thread Paul Ho
I think you may need to use ulimit in addition to setting dfs.datanode.max.xcievers. For example, on one of our boxes:
    ~ $ ulimit -a
    core file size (blocks, -c) unlimited
    data seg size (kbytes, -d) unlimited
    file size (blocks, -f) unlimited
    open files
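The open-files limit that ulimit reports can also be read from inside a process. The following is a minimal sketch in Python, assuming a Unix-like host; the helper name open_file_limits is hypothetical and is not part of the original post or of Hadoop.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


if __name__ == "__main__":
    soft, hard = open_file_limits()
    # A value equal to resource.RLIM_INFINITY means "unlimited".
    print(f"open files: soft={soft} hard={hard}")
```

The soft limit is the one a running process actually hits; it can be raised up to the hard limit without extra privileges.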

Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs

2012-01-20 Thread Paul Ho
I think the balancing bandwidth property you are looking for is in hdfs-site.xml: dfs.balance.bandwidthPerSec = 402653184. Set the value that makes the most sense for your NIC. But I thought this was only for balancing. On Jan 20, 2012, at 3:43 PM, Michael Segel wrote: > Ste
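Since dfs.balance.bandwidthPerSec is expressed in bytes per second, a quick conversion helps when sizing it against a NIC. This is a rough sketch; the function names are hypothetical, and the 1000 Mbit/s link speed is only an illustrative assumption, not a figure from the thread.

```python
def bytes_per_sec_to_mib(bytes_per_sec):
    """Convert bytes per second to MiB per second (1 MiB = 1024 * 1024 bytes)."""
    return bytes_per_sec / (1024 * 1024)


def mbit_per_sec_to_bytes_per_sec(mbit):
    """Convert a link speed in megabits per second (10**6 bits) to bytes per second."""
    return mbit * 1_000_000 // 8


if __name__ == "__main__":
    # The value quoted in the post works out to 384 MiB/s.
    print(bytes_per_sec_to_mib(402653184))
    # Assumed example: raw capacity of a 1000 Mbit/s link, in bytes per second.
    print(mbit_per_sec_to_bytes_per_sec(1000))
```

By this arithmetic the quoted value is above the raw capacity of a single gigabit link, so it would only act as a cap on faster hardware.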

getting file position for a LZO file

2012-01-10 Thread Paul Ho
Hi all, For the TextInputFormat class, the input key is a file position. This is working well. But when I switch to LzoTextInputFormat to read LZO files, the key does not make sense. It does not indicate file position. Is the file position supported with LzoTextInputFormat? Here is a job that

Re: Is SAN storage is a good option for Hadoop ?

2011-09-29 Thread Paul Ingles
Our Hadoop journey included a brief stint running on our own virtualised infrastructure. Our pre-Hadoop application was already running on the VM infrastructure so we set up a small cluster as virtual machines on the SAN. It worked ok for a while but as our usage grew we ditched it for a couple

Re: problem regarding the hadoop

2011-06-30 Thread Paul Rimba
sudo chown -R hadoop:hadoop /usr/local/hadoop. That will give the directory ownership over to your hadoop account. On Fri, Jul 1, 2011 at 5:07 AM, Dhruv Kumar wrote: > It is a permission issue. Are you sure that the account "hadoop" has read > and write access to /usr/local/* directories? > > Th

Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-29 Thread Paul Ingles
Hi, I'm not familiar with wukong, but Mandy has some scripts that wrap the hadoop commands; the default behaviour, IIRC, is to package the folder the script is in. This is then distributed so the app carries all its dependencies with it. Happy to hear -files works for you. O

Re: question about processing XML file

2010-10-12 Thread Paul Ingles
and compile in your tree. Hth, Paul On 12 Oct 2010, at 18:10, Steve Lewis wrote: > Look at the classes org.apache.hadoop.mapreduce.lib.input.LineRecordReader > and org.apache.hadoop.mapreduce.lib.input.TextInputFormat > > What you need to do is copy those and

Re: Inputs of Mapreduce

2010-07-13 Thread Paul Ingles
d up using. I also posted something on my blog about it all [2], and a little about my understanding (so far) of input formats and record readers etc. Hope that helps, Paul 1. http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/

Re: Try to mount HDFS

2010-04-22 Thread paul
Just a heads up on this: we ran into problems when trying to use fuse to mount dfs running on port :8020. However, it worked fine when we ran it on :9000. -paul On Thu, Apr 22, 2010 at 7:59 PM, Brian Bockelman wrote: > Hey Christian, > > I've run into this before. >

Tools to automatically setup new slaves (PXE boot?)

2010-03-02 Thread Paul Ingles
our systems guys has recommended using a PXE boot image? Are there any other similar tools that people could recommend? Thanks, Paul

Behaviour of FileSystem.rename() and hadoop fs -mv command

2010-02-19 Thread Paul Ingles
ed invoking directly doesn't seem to create the parent directories; it just returns false? If I want to move a file (creating any parent directories) on HDFS, is there an existing class I can use? Thanks, Paul
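The "create the parents, then move" behaviour asked about above can be sketched as follows. This is only an analogy on the local filesystem in Python, not the HDFS API; the helper name move_creating_parents is hypothetical. On HDFS the same two steps would presumably be a FileSystem.mkdirs call on the destination's parent followed by FileSystem.rename.

```python
import os
import tempfile
from pathlib import Path


def move_creating_parents(src, dst):
    """Move src to dst, creating dst's missing parent directories first."""
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Same-filesystem rename; silently overwrites an existing file at dst.
    os.replace(src, dst)
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "part-00000"
        src.write_text("data\n")
        dst = move_creating_parents(src, Path(tmp) / "a" / "b" / "part-00000")
        print(dst.exists(), src.exists())
```

Creating the parent explicitly keeps the rename itself a single step, so a failure of the move is not confused with a missing directory.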

StreamXmlRecordReader

2010-01-05 Thread Paul Ingles
the data from the other splits? Do I need to write a custom InputFormat to perform splits that honour the record boundaries? Thanks, Paul

Re: What do you use for capturing Disk I/O?

2009-12-14 Thread Paul Smith
fine detail at any point, looking for correlation between metrics. cheers, Paul On 11/12/2009, at 12:40 PM, Matt Massie wrote: > If you're looking for ganglia gmetric scripts for Disk I/O, take a look at > http://ganglia.info/gmetric/ or http://ben.hartshorne.net/ganglia/. At the >

Re: Best CFM Engine for Hadoop

2009-12-09 Thread paul
If your distro is Redhat based, you may also want to consider a system like Spacewalk: http://www.redhat.com/spacewalk/ https://fedorahosted.org/spacewalk/ We've found it very useful in our environment for a lot of purposes. -paul On Wed, Dec 9, 2009 at 2:26 PM, John Martyniak

Re: Confused by new API & MultipleOutputFormats using Hadoop 0.20.1

2009-11-09 Thread Paul Smith
om On Sat, Nov 7, 2009 at 6:45 AM, Xiance SI(司宪策) wrote: I just fall back to old mapred.* APIs, seems MultipleOutputs only works for the old API. wishes, Xiance On Mon, Nov 2, 2009 at 9:12 AM, Paul Smith wrote: Totally stuck here, I can't seem to find a way to resolve this, but I

Confused by new API & MultipleOutputFormats using Hadoop 0.20.1

2009-11-01 Thread Paul Smith
eate a File-per-metric name (there's only 5). thoughts? Paul

Re: Recommended file-system for DataNode

2009-10-08 Thread paul
Check out the bottom of this page: http://wiki.apache.org/hadoop/DiskSetup noatime is all we've done in our environment. I haven't found it worth the time to optimize further since we're CPU bound in most of our jobs. -paul On Thu, Oct 8, 2009 at 3:26 PM, Stas Oski

Re: memcached or existing hbase ?

2009-10-07 Thread Paul Ingles
s, or a couple of gigabytes worth of data. HTH, Paul On 7 Oct 2009, at 10:58, Bob Schulze wrote: I need a cache, that is read by many nodes often, written by a few nodes rarely. It's not too big in size (200.000-2Mio records/1Gb), but may be too big to fit into one node (so keeping local

Re: local node Quotas (for an R&D cluster)

2009-10-01 Thread Paul Smith
On 23/09/2009, at 10:47 AM, Ravi Phulari wrote: Hello Paul, here is a quick answer to your question - You can use the dfs.datanode.du.pct and dfs.datanode.du.reserved properties in the hdfs-site.xml config file to configure maximum local disk space used by hdfs and mapreduce

Re: Which instance type on Amazon EC2?

2009-09-29 Thread Paul Ingles
more of the less powerful instances. During the early days of our experiments with Hadoop and EC2, this was by far and away the most surprising thing (although in retrospect I guess it's not so strange!) Not sure it answers your question, but food for thought hopefully. Thanks, Paul On 2

Re: local node Quotas (for an R&D cluster)

2009-09-25 Thread Paul Smith
On 25/09/2009, at 8:55 PM, Steve Loughran wrote: Paul Smith wrote: On 25/09/2009, at 3:57 PM, Allen Wittenauer wrote: On 9/24/09 7:38 PM, "Paul Smith" wrote: "I think this could be one of these "If we build it, they will come" issues. most of the Hadoop

3D Cluster Performance Visualization

2009-09-25 Thread Paul Smith
interested in asking questions or suggesting crucial feature sets we'd appreciate it. cheers (and thanks for getting this far in the email.. :) ) Paul Smith psmith at aconex.com psmith at apache.org [1] Performance Co-Pilot (PCP) http://oss.sgi.com/projects/pcp/index.html

Re: local node Quotas (for an R&D cluster)

2009-09-24 Thread Paul Smith
On 25/09/2009, at 3:57 PM, Allen Wittenauer wrote: On 9/24/09 7:38 PM, "Paul Smith" wrote: "I think this could be one of these "If we build it, they will come" issues. most of the Hadoop committers are working in large scale homogenous environments (lucky them).

Re: local node Quotas (for an R&D cluster)

2009-09-24 Thread Paul Smith
; systems without problems. Perhaps then the patch will be accepted. In summary, I wouldn't wait for the committers." cheers, Paul

Re: local node Quotas (for an R&D cluster)

2009-09-24 Thread Paul Smith
I can raise one if you like. I've been unwell the last few days and out of the loop, but happy for this to be my first Hadoop JIRA contribution. :) Paul On 24/09/2009, at 2:44 AM, Eli Collins wrote: These values determine how much HDFS is *not* allowed to use. There is no limit o

Re: local node Quotas (for an R&D cluster)

2009-09-22 Thread Paul Smith
with this percentage? ("Only use 75% of available space on the allocated volumes, leaving 25% free for non-DFS usage", is that the correct reading?) If that is the case, I would only use this option, right, and not the 'reserved' one? many thanks for an awesomely quick reply

local node Quotas (for an R&D cluster)

2009-09-22 Thread Paul Smith
es, but this would allow me to set up a reasonable-sized cluster for some good experiments without clobbering existing processes and work that are being done. cheers, Paul Smith

Re: Cluster gets overloaded processing large files via streaming

2009-09-21 Thread paul
like +=, etc, and then watch for a change in key between records. If the current key is different than the last key, print out the last key and its aggregate values. -paul On Mon, Sep 21, 2009 at 3:00 PM, Alex McLintock wrote: > I think the default chunk size you are referring to is ab
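The key-change technique described above can be sketched as a small Python function. It assumes tab-separated "key, value" lines that arrive already sorted by key (as a streaming reducer's input does) and that values are integers to be summed; the function name aggregate_sorted and the sample data are hypothetical.

```python
def aggregate_sorted(lines, sep="\t"):
    """Yield (key, total) pairs from key-sorted 'key<sep>value' lines."""
    last_key = None
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.partition(sep)
        # A change of key means the previous key's records are complete.
        if last_key is not None and key != last_key:
            yield last_key, total
            total = 0
        last_key = key
        total += int(value)
    # Flush the final key, which never sees a "next" key.
    if last_key is not None:
        yield last_key, total


if __name__ == "__main__":
    sample = ["a\t1\n", "a\t2\n", "b\t5\n"]
    for key, total in aggregate_sorted(sample):
        print(f"{key}\t{total}")
```

In a streaming job the function would be fed sys.stdin instead of the sample list. Because only one key's running total is held at a time, memory use stays flat regardless of input size.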

Problem with packaged lib class loading

2009-07-15 Thread Paul Ingles
figure out why it's not being loaded correctly? Thanks as always! Paul

Sorting data sets

2009-07-07 Thread Paul Barmaksezian
ld be able to sort the data properly. I'd like to be able to run it through a mapper and output results as hive tables so we can then run our aggregations from there. Thank you. Paul

Sorting data sets

2009-07-07 Thread Paul B
can this session tagging piece be done using hadoop? I'm a little confused on how a mapper would be able to sort the data properly. I'd like to be able to run it through a mapper and output results as hive tables so we can then run our aggregations from there. Thank you. Paul