Re: Combiner run specification and questions

2009-01-07 Thread Saptarshi Guha
> So as long as the correctness of the computation doesn't rely on a
> transformation performed in the combiner, it should be OK.

Right, I had the same thought.

> However, this restriction limits the scalability of your solution. It might
> be necessary to work around R's limitations by br
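
A minimal sketch of the pattern under discussion, assuming the standard word-count example with the old org.apache.hadoop.mapred API (class names here are illustrative, not from the thread): because the combiner performs the same associative, commutative sum as the reducer, the final result is identical whether the combiner runs zero, one, or several times.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sum reducer; safe to reuse as a combiner because summing partial
    // counts and then summing those partial sums yields the same total.
    public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    // Driver fragment: here the combiner is purely an optimization.
    //   conf.setReducerClass(SumReducer.class);
    //   conf.setCombinerClass(SumReducer.class);  // may run 0..N times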

Re: Storing/retrieving time series with hadoop

2009-01-07 Thread Mark Chadwick
Brock, I've had good luck storing time-series data with HBase. Its latency for looking up records is orders of magnitude lower than Hadoop's MapReduce (which is more for batch processing), yet it still resides on HDFS, and it has mechanisms to let you MapReduce on your HBase data. You may have a diffic
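
One common row-key layout for time-series data in HBase, sketched here with hypothetical names (not code from this thread): prefix the key with a series identifier and append a reverse timestamp, so rows for one series are stored contiguously and a scan returns the newest points first.

    import java.nio.charset.Charset;

    // Hypothetical helper: builds a row key of the form
    //   <seriesId>#<Long.MAX_VALUE - epochMillis>
    // so that keys for one series sort together, newest first.
    public class TimeSeriesKeys {
      private static final Charset UTF8 = Charset.forName("UTF-8");

      public static byte[] rowKey(String seriesId, long epochMillis) {
        long reversed = Long.MAX_VALUE - epochMillis;
        // Zero-pad so lexicographic byte order matches numeric order.
        return (seriesId + "#" + String.format("%019d", reversed)).getBytes(UTF8);
      }
    }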

Storing/retrieving time series with hadoop

2009-01-07 Thread Brock Judkins
Hi list, I am researching hadoop as a possible solution for my company's data warehousing needs. My question is whether hadoop, possibly in combination with Hive or Pig, is a good solution for time-series data? We basically have a ton of web analytics to store that we display both internally and

Re: TestDFSIO delivers bad values of "throughput" and "average IO rate"

2009-01-07 Thread tienduc_dinh
Hi Konstantin, I think I got it; I forgot one thing from your last post: time = time(0) + ... + time(N-1). So it must be the throughput per client, and I'm happy now that hadoop scales very well on my cluster. Thank you so much and wish you all the best in the new year 2009 :
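
Spelling out the arithmetic that seems to be agreed on here (my reading of the exchange, with N clients each writing one file):

    reported throughput = (size(0) + ... + size(N-1)) / (time(0) + ... + time(N-1))

This is a per-client figure; the aggregate cluster throughput is then roughly the reported value multiplied by the number of clients, assuming they all ran concurrently.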

Re: TestDFSIO delivers bad values of "throughput" and "average IO rate"

2009-01-07 Thread tienduc_dinh
Hi Konstantin, sorry for my mistake, it was not 5012, it was 512. Of course, it is great that the throughput is MB/sec per client, like you said. In this case we have circa 120 MB/sec :clap: But I'm not sure if that is really the case. Please follow my example and calculation of throughput > hadoop-0.

Re: Question about the Namenode edit log and syncing the edit log to disk. 0.19.0

2009-01-07 Thread Konstantin Shvachko
From Java documentation http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.html#force(boolean) "Passing false for this parameter indicates that only updates to the file's content need be written to storage; passing true indicates that updates to both the file's content and metada
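
A small, self-contained illustration of the API being quoted (not code from Hadoop itself): force(false) asks the OS to push the file's contents to the storage device, while force(true) also pushes file metadata such as the last-modified time.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class ForceDemo {
      public static void main(String[] args) throws IOException {
        RandomAccessFile raf = new RandomAccessFile("edits.tmp", "rw");
        FileChannel fc = raf.getChannel();
        fc.write(ByteBuffer.wrap("a log record".getBytes("UTF-8")));
        // Flush file contents only: enough for the record to survive a crash.
        fc.force(false);
        // Flush contents plus metadata (e.g. last-modified time); usually costlier.
        fc.force(true);
        raf.close();
      }
    }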

Re: TestDFSIO delivers bad values of "throughput" and "average IO rate"

2009-01-07 Thread Konstantin Shvachko
tienduc_dinh wrote: Hi Konstantin, thanks so much for your help. I was a little bit confused about why, with mapred.map.tasks = 10 set in hadoop-site.xml, hadoop didn't map anything. So your answer, "In case of TestDFSIO it will be overridden by -nrFiles", is the key. I need now
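
For readers trying to reproduce this, a representative invocation (the exact jar name depends on the Hadoop version installed; not the literal command from the thread): the -nrFiles argument decides how many files, and therefore how many map tasks, TestDFSIO uses, so a mapred.map.tasks value in hadoop-site.xml is effectively ignored.

    hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 512
    hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -read  -nrFiles 10 -fileSize 512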

RE: Concatenating PDF files

2009-01-07 Thread Zak, Richard [USA]
I was able to process 100 pdfs in 4 directories. Now I have moved up to 500 pdfs (started with 700 and I'm working backwards) in 6 directories, and I am getting this error in the console: 09/01/07 14:04:41 INFO mapred.JobClient: Task Id : attempt_200812311556_0034_m_00_0, Status : FAILED java

Re: Question about the Namenode edit log and syncing the edit log to disk. 0.19.0

2009-01-07 Thread Raghu Angadi
Did you look at FSEditLog.EditLogFileOutputStream.flushAndSync()? This code was reorganized some time back, but the guarantees it provides should be exactly the same as before. Please let us know otherwise. Raghu. Jason Venner wrote: I have always assumed (which is clearly my error) that edit lo

Question about the Namenode edit log and syncing the edit log to disk. 0.19.0

2009-01-07 Thread Jason Venner
I have always assumed (which is clearly my error) that edit log writes were flushed to storage to ensure that the edit log was consistent during machine crash recovery. I have been working through FSEditLog.java and I don't see any calls to force(true) on the file channel or sync on the file d

Re: Auditing and accounting with Hadoop

2009-01-07 Thread Doug Cutting
The notion of a client/task ID, independent of IP or username, seems useful for log analysis. DFS's client ID is probably your best bet at the moment, but we might improve its implementation and make the notion more generic. It is currently implemented as: String taskId = conf.get("mapred.ta
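
A hedged sketch of how client code could pick up such an ID for auditing (the property name and the DFSClient-style fallback are assumptions based on this message, not verified against a particular release):

    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;

    public class ClientIdExample {
      // Returns the MapReduce task attempt id when running inside a task,
      // otherwise a random "DFSClient_..." style id for standalone clients.
      public static String clientId(Configuration conf) {
        String taskId = conf.get("mapred.task.id");  // assumed property name
        if (taskId != null) {
          return taskId;
        }
        return "DFSClient_" + new Random().nextInt();
      }
    }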

Auditing and accounting with Hadoop

2009-01-07 Thread Brian Bockelman
Hey, One of our charges is to do auditing and accounting with our file systems (we use the simplifying assumption that the users are non-malicious). Auditing can be done by going through the namenode logs and utilizing the UGI information to track opens/reads/writes back to the users.

Re: TestDFSIO delivers bad values of "throughput" and "average IO rate"

2009-01-07 Thread tienduc_dinh
Hi Konstantin, thanks so much for your help. I was a little bit confused about why, with mapred.map.tasks = 10 set in hadoop-site.xml, hadoop didn't map anything. So your answer, "In case of TestDFSIO it will be overridden by -nrFiles", is the key. I now need your confirmation to know,

We have finally opened Neptune, yet another BigTable-clone project.

2009-01-07 Thread neptune
Dear all, We have finally opened Neptune, yet another BigTable-clone project. Neptune has the following features.
- Basic Data Service
  . Single-row operations: Get, Put
  . Multi-row operations: Like, Between, Scanner
  . Data Uploader: DirectUploader
  . MapReduce: TableInputFormat