Re: Mathematics behind Hadoop-based systems

2010-01-05 Thread Nathan Marz
That's a great way of putting it: "increasing capacity shifts the equilibrium". If you work some examples you'll find that it doesn't take many iterations for a workflow to converge to its stable runtime. There is some minimum capacity you need for there to be an equilibrium though, or else the run
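One way to make that concrete (an illustrative toy model only, not necessarily the one in Nathan's analysis; D, R and C are hypothetical parameters): suppose each run must process a fixed batch of size D plus the data that arrived, at rate R, during the previous run, and the cluster works through data at rate C. The runtime then follows the recurrence

    T_{n+1} = \frac{D + R\,T_n}{C},
    \qquad
    T^{*} = \frac{D}{C - R} \quad \text{(exists only when } C > R\text{)}

Iterating from any starting point converges geometrically to T* as long as C > R, and increasing C pushes that equilibrium down; if C <= R there is no fixed point and the runtime grows without bound, which matches the "minimum capacity" caveat above.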

Matthew McCullough to Speak on Dividing and Conquering Hadoop at GIDS 2010

2010-01-05 Thread Satpal Yadav
Matthew McCullough to Speak on Dividing and Conquering Hadoop at GIDS 2010. Great Indian Developer Summit 2010 – Gold Standard for India's Software Developer Ecosystem. Bangalore, January 04, 2010: Moore's law has finally hit the wall and CPU speeds have actually decreased in the last few

Re: Multiple file output

2010-01-05 Thread Vijay
org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is not part of the released version of 0.20.1 right? Is this expected to be part of 0.20.2 or later? 2010/1/5 Amareshwari Sri Ramadasu > In branch 0.21, You can get the functionality of both > org.apache.hadoop.mapred.lib.MultipleOutputs an

Re: combiner statistics

2010-01-05 Thread Gang Luo
Thanks. What I mean is, the combiner doesn't "intentionally" re-read spilled records back into memory just to combine them. But it does happen that some records will be re-read for the sort, and I think the combiner should work on those records. -Gang ----- Original Message ----- From: Ted Xu To: common-user@hadoop

Re: Multiple file output

2010-01-05 Thread Amareshwari Sri Ramadasu
In branch 0.21, you can get the functionality of both org.apache.hadoop.mapred.lib.MultipleOutputs and org.apache.hadoop.mapred.lib.MultipleOutputFormat in org.apache.hadoop.mapreduce.lib.output.MultipleOutputs. Please see MAPREDUCE-370 for more details. Thanks Amareshwari On 1/5/10 5:56 PM, "
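For readers following along, here is a rough sketch of what that consolidated 0.21-branch API looks like in use (class names and the named output "text" below are illustrative, not from the thread):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class MultiOutExample {
      public static void configure(Job job) {
        // Declare a named output, roughly what org.apache.hadoop.mapred.lib.MultipleOutputs offered.
        MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, Text.class);
      }

      public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
          mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          for (Text value : values) {
            // Write to the named output instead of (or in addition to) context.write().
            mos.write("text", key, value);
          }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
          mos.close();  // flush the extra outputs
        }
      }
    }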

Re: Cannot pass dynamic values by Configuration.Set()

2010-01-05 Thread Farhan Husain
Thanks Steve. I could solve the problem by moving the set() methods before job creation, as Amogh suggested. However, I will also try your solution. On Tue, Jan 5, 2010 at 1:24 PM, Steve Kuo wrote: > There seemed to be a change between 0.20 and 0.19 API in that 0.20 no > longer > set "map.input.
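A minimal sketch of the fix Amogh suggested (the class name and the key "my.custom.param" are hypothetical): the value must be set on the Configuration before the Job is constructed, because the Job takes a copy of the Configuration it is given.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ConfParamExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Must happen before new Job(conf, ...), because Job copies the Configuration.
        conf.set("my.custom.param", "some_value");
        Job job = new Job(conf, "conf-param-example");
        // ... set mapper/reducer classes, input/output paths, then job.waitForCompletion(true)
      }

      public static class MyMapper extends Mapper<Object, Object, Object, Object> {
        private String param;

        @Override
        protected void setup(Context context) {
          // Reads the value that was set before the Job was created.
          param = context.getConfiguration().get("my.custom.param");
        }
      }
    }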

Re: combiner statistics

2010-01-05 Thread Ted Xu
Hi Gang, My understanding of this is that the combiner has to re-read some records > which have already been spilled to disk and combine them with those records > which come later. > I believe the combine operation is done before the map spill and after the reduce merge. Combine only occurs in the memor

Re: Mathematics behind Hadoop-based systems

2010-01-05 Thread Yuri Pradkin
On Sunday 03 January 2010 11:30:29 Nathan Marz wrote: > I did some analysis on the performance of Hadoop-based workflows. Some of > the results are counter-intuitive so I thought the community at large would > be interested: > > http://nathanmarz.com/blog/hadoop-mathematics/ > > Would love to he

CFP for 24th International Conference on Supercomputing (ICS 2010, Tsukuba, Japan)

2010-01-05 Thread Viraj Bhat
Dear Hadoop and Pig Users, This is just to let you know that the submission deadline for ICS'10 ( http://www.ics-conference.org/) is two weeks from today. ICS is a premier forum for research in cloud/distributed computing and for most of the work/research we do in CCDI. The CFP of the conferen

Hadoop User Group (Bay Area) - Jan 20th at Yahoo!

2010-01-05 Thread Dekel Tankel
Hi all, Happy new year! RSVP is now open for the first 2010 Bay Area Hadoop user group at the Yahoo! Sunnyvale Campus, planned for Jan 20th. Registration is available here: http://www.meetup.com/hadoop/calendar/12229988/ The agenda will be posted soon. Looking forward to seeing you there. Dekel

combiner statistics

2010-01-05 Thread Gang Luo
Hi all, when I run a MapReduce job using a combiner, I find that the combiner input # > map output #, and combiner output # > reduce input #. My understanding of this is that the combiner has to re-read some records which have already been spilled to disk and combine them with those records which
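As a side note for anyone reproducing these counters, here is a minimal sketch of wiring in a combiner with the new API (the SumReducer class below is made up for illustration). Because the framework may apply the combiner zero, one, or several times, once per spill and again when spills are merged, the combiner must be commutative and associative and must emit the same key/value types it consumes, which is why applying it repeatedly to already-combined records is harmless.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombinerExample {
      // A sum reducer that is safe to reuse as a combiner: summing partial sums
      // gives the same result as summing the raw values, no matter how often
      // (or on which subsets of the data) the framework applies it.
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void configure(Job job) {
        job.setCombinerClass(SumReducer.class);  // may run once per spill on the map side
        job.setReducerClass(SumReducer.class);   // final aggregation on the reduce side
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
      }
    }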

Re: time outs when accessing port 50010

2010-01-05 Thread Raghu Angadi
On Mon, Dec 21, 2009 at 11:57 AM, dave bayer wrote: > > On Nov 25, 2009, at 11:27 AM, David J. O'Dell wrote: > > I've intermittently seen the following errors on both of my clusters, it >> happens when writing files. >> I was hoping this would go away with the new version but I see the same >> b

Reduce output records Counter not right?

2010-01-05 Thread Yonggang Qiao
Trying a wider audience... the number from the Reduce output records counter doesn't match the actual # of records in the output files, although after I reran it, it did match. Any idea what could be wrong? Thanks, Yonggang

Re: Configuration.set/Configuration.get not working

2010-01-05 Thread Farhan Husain
Hello Amogh, Thanks a lot for the reply. Moving the Set() methods before Job creation solved my problem. I think it should be mentioned somewhere in the API docs or tutorial. Regards, Farhan On Tue, Jan 5, 2010 at 6:09 AM, Amogh Vasekar wrote: > Hi, > > 1. map.input.file in new API is contenti

Re: Security Mechanisms in HDFS

2010-01-05 Thread Owen O'Malley
On Jan 5, 2010, at 7:44 AM, Yu Xi wrote: Could any Hadoop gurus tell me what kinds of security mechanisms are already (or planned to be) implemented in the Hadoop filesystem? It looks like you've found the ones that are already there. You can see my slides about it here: http://www.slideshare.

Re: Cannot pass dynamic values by Configuration.Set()

2010-01-05 Thread Steve Kuo
There seems to have been a change between the 0.19 and 0.20 APIs in that 0.20 no longer sets "map.input.file". config.set(), as far as I can tell, should work. However, I use the following to pass the parameters. String[] params = new String[] { "-D", "tag1=string_value", ...} ToolRunner(new Configuration()
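Filling out Steve's fragment into a self-contained sketch (the driver class MyDriver and the parameter "tag1" are just placeholders): ToolRunner runs GenericOptionsParser over the arguments, so the "-D tag1=string_value" pair lands in the Configuration before run() is invoked.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();     // already contains the -D values
        String tag1 = conf.get("tag1");     // "string_value" in this example
        // ... build and submit the Job using conf
        return 0;
      }

      public static void main(String[] args) throws Exception {
        String[] params = new String[] { "-D", "tag1=string_value" };
        int rc = ToolRunner.run(new Configuration(), new MyDriver(), params);
        System.exit(rc);
      }
    }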

Security Mechanisms in HDFS

2010-01-05 Thread Yu Xi
Hi list, Could any Hadoop gurus tell me what kinds of security mechanisms are already (or planned to be) implemented in the Hadoop filesystem? I know there is a kind of Linux-like 9-bit (i.e. owner, group, other) access control in HDFS. Unfortunately there are no user authentication modules. See

Browse remote cluster on Web UI?

2010-01-05 Thread 松柳
Hi all! I set up a cluster on a remote machine on EC2, and configured MapReduce and HDFS on "localhost" by specifying the properties in core-site.xml, hdfs-site.xml, and mapred-site.xml as the quick start guide shows. I can see the web UI and general task information fine, but when I turned to the log information, or brows

Next Boston Hadoop Meetup, Tuesday, January 19th

2010-01-05 Thread Dan Milstein
Time for another Boston Hadoop Meetup. Next one will be in two weeks, on Tuesday, January 19th, 7 pm, at the HubSpot offices: http://www.meetup.com/bostonhadoop/calendar/12227906/ (HubSpot is at 1 Broadway, Cambridge on the fifth floor. There Will Be Food. There Will Be Beer.) As before

Re: HDFS read/write speeds, and read optimization

2010-01-05 Thread Stas Oskin
Hi again. By the way, I forgot to mention that I run the tests on the same machines that serve as DataNodes, i.e. the same machine acts both as a client and as a DataNode. Regards.

Re: HDFS read/write speeds, and read optimization

2010-01-05 Thread Stas Oskin
Hi. Also, it would be interesting to know the "data.replication" setting you have > for this benchmark? > > data.replication = 2 A bit off topic - is it safe to have such a number? About a year ago I heard that only 3-way replication was fully tested, while 2-way had some issues - was that fixed in subsequent

Re: HDFS read/write speeds, and read optimization

2010-01-05 Thread Stas Oskin
Hi. Well, that all depends on many details, but: > > -) are you really using 4 disks (configured correctly as data > directories?) > > Yes, 4 directories, one per disk. > -) What hdd/connection technology? > > SATA 3Gb/s > -) And 77MB/s would match up curiously well with 1Gbit networking

Re: HDFS read/write speeds, and read optimization

2010-01-05 Thread Stas Oskin
Hi. Can you provide more information about your workload and the > environment? e.g. are you running t.o.a.h.h.BenchmarkThroughput, > TestDFSIO, or timing hadoop fs -put/get to transfer data to hdfs from > another machine, looking at metrics, etc.? What else is running on the > cluster? Have you prof

StreamXmlRecordReader

2010-01-05 Thread Paul Ingles
Hi, We're looking to convert some Ruby/C libxml XML processing code over to Hadoop. Currently reports are transformed into a CSV output that is then easier to consume for the downstream systems. We already use Hadoop (streaming) quite extensively for the rest of our daily batches so we'd

Re: Multiple file output

2010-01-05 Thread 松柳
I'm afraid you have to write it yourself, since there are no equivalent classes in the new API. 2009/12/28 Huazhong Ning > Hi all, > > I need your help on multiple file output. I have many big files and I hope > the processing result of each file is outputted to a separate file. I know > in the o

Re: Configuration.set/Configuration.get not working

2010-01-05 Thread Amogh Vasekar
Hi, 1. map.input.file in the new API is contentious. It doesn't seem to be serialized in 0.20 ( https://issues.apache.org/jira/browse/HADOOP-5973 ). As of now you can use ((FileSplit) context.getInputSplit()).getPath(); there was a post on this some time back. 2. For your own variables in conf, please
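A short sketch of that workaround, under the assumption that the job uses a file-based InputFormat so the split is a FileSplit (the mapper class name below is made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class PathAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Path inputPath;

      @Override
      protected void setup(Context context) {
        // Only valid when the InputSplit really is a FileSplit,
        // i.e. the job reads from a file-based InputFormat.
        inputPath = ((FileSplit) context.getInputSplit()).getPath();
      }
    }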

[ANNOUNCE] London Open Source Search meetup - Tue 12 January / Meetup.com

2010-01-05 Thread René Kriegler
Hi all, We are organising another open source search social evening (OSSSE?) in London on Tuesday the 12th of January. The plan is to get together and chat about search technology, from Lucene to Solr, Hadoop, Mahout, Xapian, Ferret and the like - bringing together people from across the field t

Re: How to reuse the nodes in blacklist ?

2010-01-05 Thread Jeff Zhang
Thanks, it works. Jeff Zhang On Tue, Jan 5, 2010 at 5:00 PM, Amareshwari Sri Ramadasu < amar...@yahoo-inc.com> wrote: > Restarting the trackers makes them un-blacklisted. > > -Amareshwari > > On 1/5/10 2:27 PM, "Jeff Zhang" wrote: > > Hi all, > > Two of my nodes are in the blacklist, and I wan

Re: How to reuse the nodes in blacklist ?

2010-01-05 Thread Amareshwari Sri Ramadasu
Restarting the trackers makes them un-blacklisted. -Amareshwari On 1/5/10 2:27 PM, "Jeff Zhang" wrote: Hi all, Two of my nodes are in the blacklist, and I want to reuse them again. How can I do that ? Thank you. Jeff Zhang

How to reuse the nodes in blacklist ?

2010-01-05 Thread Jeff Zhang
Hi all, Two of my nodes are in the blacklist, and I want to reuse them again. How can I do that ? Thank you. Jeff Zhang