RE: Need help in hdfs configuration fully distributed way in Mac OSX...

2008-09-17 Thread souravm
Hi Mafish, Thanks for your suggestions. Finally I could resolve the issue. The *site.xml on the namenode had fs.default.name as localhost, whereas on the data nodes it was the actual IP. I changed localhost to the actual IP on the name node and it started working. Regards, Sourav -Original Message-
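
A minimal sketch of the kind of hadoop-site.xml entry being described, for readers hitting the same symptom; the host and port here are placeholders, not values from this thread:

    <!-- namenode's hadoop-site.xml: fs.default.name should name the
         address the datanodes actually connect to, not localhost -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://192.168.0.10:9000</value>
    </property>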

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-17 Thread Miles Osborne
Hello Chris! (if you are talking about serving language models and/or phrase tables) I had a student look at using HBase for LMs this summer. I don't think it is sufficiently quick to deal with millions of queries per second, but that may be due to blunders on our part. It may be possible that

RE: Trouble with SequenceFileOutputFormat.getReaders

2008-09-17 Thread Palleti, Pallavi
Hi, This is a problem with the Hadoop 0.17.2 release: it creates a _logs directory in the output directory of any map-red job. This _logs is not your output file, so you have to explicitly make sure that it isn't read by your map job. Thanks Pallavi -Original Message- From
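
A minimal sketch of one way to skip _logs when collecting a job's output files, using a PathFilter (method names follow later Hadoop releases than 0.17.2, so treat the exact API as an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    public class SkipLogsDir {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Reject anything starting with "_", e.g. the _logs
            // directory that 0.17.2 writes into every job's output.
            FileStatus[] parts = fs.listStatus(new Path(args[0]),
                new PathFilter() {
                    public boolean accept(Path p) {
                        return !p.getName().startsWith("_");
                    }
                });
            for (FileStatus part : parts) {
                System.out.println(part.getPath());
            }
        }
    }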

Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-17 Thread Chris Dyer
Hi all- One more question. I'm looking for a lightweight way to serve data stored as key-value pairs in a series of MapFiles or SequenceFiles. HBase/Hypertable offer a very robust, powerful solution to this problem with a bunch of extra features like updates and column types, etc., that I don't n
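
For context, the single-machine building block under discussion is roughly the lookup below; a sketch only, and the Text key/value types are my assumption, not the poster's actual types:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // args[0] = MapFile directory, args[1] = key to look up
            MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
            try {
                Text value = new Text();
                // get() binary-searches the in-memory index, then seeks
                // into the data file; returns null if the key is absent.
                if (reader.get(new Text(args[1]), value) != null) {
                    System.out.println(value);
                }
            } finally {
                reader.close();
            }
        }
    }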

Trouble with SequenceFileOutputFormat.getReaders

2008-09-17 Thread Chris Dyer
Hi all- I am having trouble with SequenceFileOutputFormat.getReaders on a Hadoop 0.17.2 cluster. I am trying to open a set of SequenceFiles that were created in one map process that has completed, from within a second map process, by passing in the job configuration for the running map process (not of
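
A sketch of the call in question, with the _logs caveat from Pallavi's reply above folded in as a comment (the output path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class OpenPreviousOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // On 0.17.2 the output directory also contains a _logs
            // subdirectory, which getReaders() will try to open as a
            // SequenceFile unless it is moved or filtered out first.
            SequenceFile.Reader[] readers =
                SequenceFileOutputFormat.getReaders(conf, new Path(args[0]));
            System.out.println(readers.length + " readers opened");
        }
    }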

Re: OutOfMemory Error

2008-09-17 Thread Pallavi Palleti
Hadoop version - 0.17.1 io.sort.factor = 10 The key is of the form "ID:DenseVector representation in Mahout with dimensionality size = 160k". For example: C1:[0.0011, 3.002, ..., 1.001] So, the typical size of the key of the mapper output can be 160K*6 (assuming a double in string is represent
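
Restating that size estimate as arithmetic (my calculation; the ~6 bytes per value is the post's assumption):

    public class KeySizeEstimate {
        public static void main(String[] args) {
            long dims = 160000L;     // dimensionality from the post
            long bytesPerValue = 6L; // assumed size of a double as text
            long keyBytes = dims * bytesPerValue;
            // 960,000 bytes, i.e. nearly 1 MB for a single mapper key
            System.out.println(keyBytes + " bytes per key");
        }
    }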

Re: [ANN] katta-0.1.0 release - distribute lucene indexes in a grid

2008-09-17 Thread 叶双明
Thanks for your work!! 2008/9/18 Stefan Groschupf <[EMAIL PROTECTED]> > After 5 months of work we are happy to announce the first developer preview > release of Katta. > This release contains all the functionality to serve a large, sharded Lucene > index on many servers. > Katta is standing on the should

Re: gridmix on a small cluster?

2008-09-17 Thread Chris Douglas
Yes. If you look at the README, gridmix-env, and the generateData script, you should be able to alter the job mix to match your requirements. In particular, you probably want to look closely at the number of small, medium, and large jobs for each run. For a three-node cluster, you might wan

[ANN] katta-0.1.0 release - distribute lucene indexes in a grid

2008-09-17 Thread Stefan Groschupf
After 5 months of work we are happy to announce the first developer preview release of Katta. This release contains all the functionality to serve a large, sharded Lucene index on many servers. Katta is standing on the shoulders of the giants Lucene, Hadoop and ZooKeeper. Main features: + Plays wel

gridmix on a small cluster?

2008-09-17 Thread Joel Welling
Hi folks, I'd like to try the gridmix benchmark on my small cluster (3 nodes at 8 cores each, Lustre with IB interconnect). The documentation for gridmix suggests that it will take 4 hours on a 500-node cluster, which suggests it would take me something like a week to run. Is there a way to sca

Re: Lots of files in a single hdfs directory?

2008-09-17 Thread Konstantin Shvachko
Lohit is right. File creation will be slow if all 100,000 files are in one directory. Directory entries are implemented as a sorted array (ArrayList), which optimizes lookup (binary search) in the table, but makes entry insertion inefficient because it requires shifting all entries to the left of
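
The cost being described can be sketched with plain java.util collections; illustrative only, not the actual namenode code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class SortedArrayInsert {
        public static void main(String[] args) {
            List<String> entries = new ArrayList<String>();
            String[] names = {"part-00002", "part-00000", "part-00001"};
            for (String name : names) {
                // O(log n) to find the insertion point...
                int pos = Collections.binarySearch(entries, name);
                if (pos < 0) {
                    // ...but O(n) to insert: all later entries shift.
                    entries.add(-pos - 1, name);
                }
            }
            System.out.println(entries); // sorted, but built slowly
        }
    }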

Re: Lots of files in a single hdfs directory?

2008-09-17 Thread Lohit
Last time I tried to load an image with lots of files in the same directory, it was about ten times slower. This has to do with the data structures. My numbers were in the millions, though. Try to have a directory structure. Lohit On Sep 17, 2008, at 11:57 AM, Nathan Marz <[EMAIL PROTECTED]> wrote: Hello all,

javasort takes > 500sec but machines do not appear busy?

2008-09-17 Thread damien . cooke
Hi all, sorry, I am new to Hadoop. I have been watching my 10-node cluster and it appears to take around 500 sec to do a run of the javasort portion of the Gridmix benchmark. That does not worry me, but the machines do not appear to be doing much. iostat -cx extended device sta

Lots of files in a single hdfs directory?

2008-09-17 Thread Nathan Marz
Hello all, Is it bad to have a lot of files in a single HDFS directory (i.e., on the order of hundreds of thousands)? Or should we split our files into a directory structure of some sort? Thanks, Nathan Marz

Re: scp to namenode faster than dfs put?

2008-09-17 Thread pvvpr
Yes, the client was a namenode and also a datanode. Thanks Raghu, will try not running the datanode. - Prasad. On Thursday 18 September 2008 12:00:30 am Raghu Angadi wrote: > pvvpr wrote: > > The time seemed to be around double the time taken to scp. Didn't realize > > it could be due to replication. >

Re: scp to namenode faster than dfs put?

2008-09-17 Thread Raghu Angadi
pvvpr wrote: The time seemed to be around double the time taken to scp. Didn't realize it could be due to replication. Twice as slow is not expected. One possibility is that your client is also one of the datanodes (i.e. you are reading from and writing to the same disk). Raghu. Regd dfs bei

Re: Using JavaSerialzation and SequenceFileInput

2008-09-17 Thread Jason Grey
Cool, thanks for the answer. On Wed, Sep 17, 2008 at 12:35 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote: > The problem is that the Java serialization works for SequenceFile, but > doesn't work with RecordReader. The problem is that Java serialization > always returns a new object and the RecordRea

Re: scp to namenode faster than dfs put?

2008-09-17 Thread pvvpr
The time seemed to be around double the time taken to scp. Didn't realize it could be due to replication. Regarding dfs being faster than scp, the statement came more out of expectation (or wish list) than anything else. Since scp is the most elementary way of copying files, was thinking if the

Re: Using JavaSerialzation and SequenceFileInput

2008-09-17 Thread Owen O'Malley
The problem is that Java serialization works for SequenceFile but doesn't work with RecordReader: Java serialization always returns a new object, while the RecordReader interface looks like: boolean next(Object key, Object value) throws IOException; where the outer cont
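
The mismatch in miniature; a sketch rather than Hadoop source:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;

    public class NewObjectEveryTime {
        // A Java-serialization-backed reader can only do this:
        public static Object readOne(byte[] bytes)
                throws IOException, ClassNotFoundException {
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes));
            // A fresh object on every call -- there is no way to fill
            // in the caller-supplied 'value' that next(key, value)
            // hands to the reader for reuse.
            return in.readObject();
        }
    }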

Re: OutOfMemory Error

2008-09-17 Thread Devaraj Das
On 9/17/08 6:06 PM, "Pallavi Palleti" <[EMAIL PROTECTED]> wrote: > > Hi all, > >I am getting an OutOfMemory error as shown below when I ran map-red on a huge > amount of data: > java.lang.OutOfMemoryError: Java heap space > at > org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBu

Re: scp to namenode faster than dfs put?

2008-09-17 Thread Raghu Angadi
How much slower is 'dfs -put' anyway? How large is the file you are copying? > but shouldn't that > be at least as fast as copying data to namenode from a single machine, It would be "at most" as fast as scp assuming you are not CPU bound. Why would you think dfs would be faster even if it copyin

RE: OutOfMemory Error

2008-09-17 Thread Leon Mergen
Hello, What version of Hadoop are you using? Regards, Leon Mergen > -Original Message- > From: Pallavi Palleti [mailto:[EMAIL PROTECTED] > Sent: Wednesday, September 17, 2008 2:36 PM > To: core-user@hadoop.apache.org > Subject: OutOfMemory Error > > > Hi all, > >I am getting an OutOfM

Re: scp to namenode faster than dfs put?

2008-09-17 Thread pvvpr
Thanks for the replies. So it looks like replication might be the real overhead compared to scp. > Also dfs put copies multiple replicas, unlike scp. > > Lohit > > On Sep 17, 2008, at 6:03 AM, "叶双明" <[EMAIL PROTECTED]> wrote: > > Actually, no. > As you said, I understand that "dfs -put" bre

Re: scp to namenode faster than dfs put?

2008-09-17 Thread Dennis Kubes
While an scp will copy data to the namenode machine, it does *not* store the data in DFS; it simply copies the data to the namenode machine. This is the same as copying data to any other machine. The data isn't in DFS and is not accessible from DFS. If the box running the namenode fails you lo
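
In code terms, the programmatic equivalent of 'dfs -put' is roughly the sketch below (paths are placeholders): it goes through the DFS client, which asks the namenode for block placement and then streams blocks and their replicas to datanodes, unlike scp, which just lands a plain file on the namenode's local disk.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsPut {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Splits the file into blocks and writes each block (and
            // its replicas) out to datanodes.
            fs.copyFromLocalFile(new Path("/local/data.txt"),
                                 new Path("/user/data.txt"));
            fs.close();
        }
    }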

Re: Using JavaSerialzation and SequenceFileInput

2008-09-17 Thread Jason Grey
I read HADOOP-3413 a bit more closely - it updates SequenceFile.Reader, not SequenceFileInputFormat, which is what the M/R framework uses... looks like you have to write your own input format, or have your mappers/reducers take raw bytes, and deseria

Re: Using JavaSerialzation and SequenceFileInput

2008-09-17 Thread Jason Grey
I just found this one this morning, looks like a fix should be in 0.18.0 according to the bug tracker: https://issues.apache.org/jira/browse/HADOOP-3413 I'm going to go double-check all my code, as I'm pretty sure I am on 0.18.0 already -jg- On Tue, Sep 16, 2008 at 9:10 PM, Alex Loddengaard <[
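
A sketch of the "raw bytes" workaround mentioned above, assuming the old mapred API and values stored as BytesWritable; the types and names here are illustrative, not from the thread:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class RawBytesMapper extends MapReduceBase
            implements Mapper<LongWritable, BytesWritable, Text, Text> {
        public void map(LongWritable key, BytesWritable value,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            try {
                // Deserialize the Java-serialized payload by hand.
                ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(value.getBytes(), 0,
                                             value.getLength()));
                Object record = in.readObject();
                out.collect(new Text(key.toString()),
                            new Text(record.toString()));
            } catch (ClassNotFoundException e) {
                throw new IOException(e.toString());
            }
        }
    }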

Re: scp to namenode faster than dfs put?

2008-09-17 Thread Lohit
Also, dfs put copies multiple replicas, unlike scp. Lohit On Sep 17, 2008, at 6:03 AM, "叶双明" <[EMAIL PROTECTED]> wrote: Actually, no. As you said, I understand that "dfs -put" breaks the data into blocks and then copies to datanodes, but scp does not break the data into blocks, and just copies the

Re: scp to namenode faster than dfs put?

2008-09-17 Thread 叶双明
Actually, no. As you said, I understand that "dfs -put" breaks the data into blocks and then copies to datanodes, but scp does not break the data into blocks, and just copies the data to the namenode! 2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>: > > Hello, > I observe that scp of data to the

OutOfMemory Error

2008-09-17 Thread Pallavi Palleti
Hi all, I am getting an OutOfMemory error as shown below when I ran map-red on a huge amount of data: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutp

scp to namenode faster than dfs put?

2008-09-17 Thread Prasad Pingali
Hello, I observe that scp of data to the namenode is faster than actually putting it into dfs (all nodes coming from the same switch and having the same ethernet cards, homogeneous nodes)? I understand that "dfs -put" breaks the data into blocks and then copies to datanodes, but shouldn't that be at least a

jython HBase map/red task

2008-09-17 Thread Dmitry Pushkarev
Hi. I'm writing a mapreduce task in Jython, and I can't launch ToolRunner.run; Jython says "TypeError: integer required" on the ToolRunner.run line and I can't get a more detailed explanation. I guess the error is either in ToolRunner or in setConf. What am I doing wrong? :) And can anyone share a s
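
For comparison, the contract ToolRunner expects in Java; the "integer required" message is at least consistent with run() having to return an int, though this sketch is not a diagnosis of the Jython code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {
        // ToolRunner.run() expects an int exit code back from run();
        // returning anything else (None, in Jython terms) breaks it.
        public int run(String[] args) throws Exception {
            // ... configure and submit the job here ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(),
                                       new MyJob(), args));
        }
    }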