Re: THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle
Hey all, just writing a quick note of thanks -- we had another solid group of people show up! As always, we learned quite a lot about interesting use cases for Hadoop, Lucene, and the rest of the Apache 'Cloud Stack'. I couldn't get it taped, but we talked about:
-Scaling Lucene with Katta and the Katta infrastructure
-The need for low-latency BI on distributed document stores
-Lots and lots of detail on Amazon Elastic MapReduce

We'll be doing it again next month -- July 29th.

On Mon, Jun 22, 2009 at 5:40 PM, Bradford Stephens <bradfordsteph...@gmail.com> wrote: [original meetup announcement snipped]

Cheers, Bradford
http://www.roadtofailure.com -- The Fringes of Distributed Computing, Computer Science, and Social Media.
Re: THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle
Greetings, I've gotten a few replies on this, but I'd really like to know who else is coming. Just send me a quick note :)

Cheers, Bradford

On Mon, Jun 22, 2009 at 5:40 PM, Bradford Stephens <bradfordsteph...@gmail.com> wrote: [original meetup announcement snipped]
Re: Can you tell if a particular mapper was data local ?
Correct me if I'm wrong, but I think you can tell through the Hadoop Web UI -- it shows a count of which map tasks were data-local. You can then click on that count to see a list of all those tasks, and drill down to see which nodes they ran on.

On Tue, Jun 23, 2009 at 6:37 PM, Suratna Budalakoti <sura...@yahoo-inc.com> wrote: Hi all, Is there any way to tell, from logs, or by reading/setting a counter, whether a particular mapper was data-local, i.e., it ran on the same node as its input data? Thanks, Suratna
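Conceptually, "data-local" just means the task was scheduled on a node that holds a replica of its input split's block. A toy sketch of that check (illustration only, not Hadoop's actual scheduler code; node names are invented):

```python
def is_data_local(task_host, replica_hosts):
    # A map task is data-local when the node it ran on
    # also stores a replica of its input split's block.
    return task_host in replica_hosts

# Hypothetical example: a block replicated on three nodes.
replicas = {"node1", "node4", "node7"}
on_replica = is_data_local("node4", replicas)   # task ran on a replica holder
off_replica = is_data_local("node2", replicas)  # input must cross the network
```

The Web UI's data-local map task count is effectively the number of tasks for which this check came out true.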
THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle
Hey all, just a friendly reminder that this is Wednesday! I hope to see everyone there again. Please let me know if there's something interesting you'd like to talk about -- I'll help however I can. You don't even need a PowerPoint presentation -- there are many whiteboards. I'll try to have a video cam, but no promises. Feel free to call at 904-415-3009 if you need directions or have any questions :)

Greetings, On the heels of our smashing success last month, we're going to be convening the Pacific Northwest (Oregon and Washington) Hadoop/HBase/Lucene/etc. meetup on the last Wednesday of June, the 24th. The meeting should start at 6:45, organized chats will end around 8:00, and then there shall be discussion and socializing :)

The meeting will be at the University of Washington in Seattle again. It's in the Computer Science building (not Electrical Engineering!), room 303, located here: http://www.washington.edu/home/maps/southcentral.html?80,70,792,660

If you've ever wanted to learn more about distributed computing, or just see how other people are innovating with Hadoop, you can't miss this opportunity. Our focus is on learning and education, so every presentation must end with a few questions for the group to research and discuss. (But if you're an introvert, we won't mind.) The format is two or three 15-minute deep-dive talks, followed by several 5-minute lightning chats.

We had a few interesting topics last month:
-Building a Social Media Analysis company on the Apache Cloud Stack
-Cancer detection in images using Hadoop
-Real-time OLAP on HBase -- is it possible?
-Video and Network Flow Analysis in Hadoop vs. Distributed RDBMS
-Custom Ranking in Lucene

We already have one deep dive scheduled this month, on truly scalable Lucene with Katta. If you've been looking for a way to handle those large Lucene indices, this is a must-attend! Looking forward to seeing everyone there again.

Cheers, Bradford
http://www.roadtofailure.com -- The Fringes of Distributed Computing, Computer Science, and Social Media.
Re: [ANN] HBase 0.20.0-alpha available for download
Oh sweet. This will be a most excellent party.

On Tue, Jun 16, 2009 at 10:23 PM, stack <st...@duboce.net> wrote: An alpha version of HBase 0.20.0 is available for download at: http://people.apache.org/~stack/hbase-0.20.0-alpha/

We are making this release available to preview what is coming in HBase 0.20.0. In short, 0.20.0 is about performance and high availability. Also, a new, richer API has been added and the old one deprecated. Here is a list of almost 300 issues addressed so far in 0.20.0: http://tinyurl.com/ntvheo

This alpha release contains known bugs. See http://tinyurl.com/kvfsft for the current list. In particular, this alpha release is without a migration script to bring your 0.19.x era data forward to work on HBase 0.20.0. A working, well-tested migration script will be in place before we cut the first HBase 0.20.0 release candidate some time in the next week or so.

After download, please take the time to review the 0.20.0 'Getting Started', also available here: http://people.apache.org/~stack/hbase-0.20.0-alpha/docs/api/overview-summary.html#overview_description. HBase 0.20.0 has new dependencies; in particular, it now depends on ZooKeeper. With ZooKeeper in the mix, a few core HBase configurations have been removed and replaced with ZooKeeper configurations instead.

Also of note, HBase 0.20.0 will include Stargate, an improved REST connector for HBase. The old, bundled REST connector will be deprecated. Stargate is implemented using the Jersey framework. It includes protobuf encoding support, has caching proxy awareness, supports batching for scanners and updates, and in general has the goal of enabling Web-scale storage systems (a la S3) backed by HBase. Currently it's only available up on github, http://github.com/macdiesel/stargate/tree/master. It will be added to a new contrib directory before we cut a release candidate.

Please let us know if you have difficulty with the install, if you find the documentation missing, or if you trip over bugs while hbasing.
Yours, The HBasistas
Re: Seattle / PNW Hadoop + Lucene User Group?
Hey everyone! I just wanted to say a BIG THANKS to everyone who came. We had over a dozen people, and a few got lost at UW :) [I would have sent this update earlier, but I flew to Florida the day after the meeting.]

If you didn't come, you missed quite a bit of learning, on topics such as:
-Building a Social Media Analysis company on the Apache Cloud Stack
-Cancer detection in images using Hadoop
-Real-time OLAP
-Scalable Lucene using Katta and Hadoop
-Video and Network Flow
-Custom Ranking in Lucene

I'm going to update our wiki with the topics, a few questions raised, and the lessons we've learned. The next meetup will be June 24th. Be there, or be... boring :)

Cheers, Bradford

On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens <bradfordsteph...@gmail.com> wrote: Greetings, Would anybody be willing to join a PNW Hadoop and/or Lucene User Group with me in the Seattle area? I can donate some facilities, etc. -- I also always have topics to speak about :) Cheers, Bradford
Re: Seattle / PNW Hadoop + Lucene User Group?
Sorry, no videos this time. The conversation wasn't very structured... next month I'll record it :)

On Wed, Jun 3, 2009 at 1:59 PM, Bhupesh Bansal <bban...@linkedin.com> wrote: Great Bradford, can you post some videos if you have some? Best, Bhupesh

On 6/3/09 11:58 AM, Bradford Stephens <bradfordsteph...@gmail.com> wrote: [meetup recap snipped]
Re: Seattle / PNW Hadoop + Lucene User Group?
Hello everyone! We (finally) have space secured (it's a tough task!): University of Washington, Allen Center Room 303, at 6:45pm on Wednesday, May 27, 2009. I'm going to put together a map, and a wiki so we can collab.

What I'm envisioning is a meetup of about 2 hours: we'll have two in-depth talks of 15-20 minutes each, and then several lightning talks of 5 minutes. We'll then have discussion and 'social time'. Let me know if you're interested in speaking or attending. I'd like to focus on education, so every presentation *needs* to ask some questions at the end. We can talk about these after the presentations, and I'll record what we've learned in a wiki and share it with the rest of us. Looking forward to meeting you all!

Cheers, Bradford

On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens <bradfordsteph...@gmail.com> wrote: Greetings, Would anybody be willing to join a PNW Hadoop and/or Lucene User Group with me in the Seattle area? I can donate some facilities, etc. -- I also always have topics to speak about :) Cheers, Bradford
Re: Free Training at 2009 Hadoop Summit
Hey there, I notice this is already sold out -- any chance of more openings? :)

Cheers, Bradford

On Tue, May 5, 2009 at 6:25 PM, Christophe Bisciglia <christo...@cloudera.com> wrote: Just wanted to follow up on this and let everyone know that Cloudera and Y! are teaming up to offer two day-long training sessions for free on the day after the summit (June 11th). We'll cover Hadoop basics, Pig, Hive, and some new tools Cloudera is releasing for importing data to Hadoop from existing databases. http://hadoopsummit09-training.eventbrite.com

Each of these sessions normally runs about $1000, but we're taking advantage of having so much of the Hadoop community in one place and offering this for free at the 2009 Hadoop Summit. Basic training is appropriate for people just getting started with Hadoop, and the advanced training will focus on augmenting your existing infrastructure with Hadoop and taking advantage of Hadoop's advanced features and related projects. Space is limited, so sign up before time runs out. Hope to see you there! Christophe and the Cloudera Team

On Wed, May 6, 2009 at 6:10 AM, Ajay Anand <aan...@yahoo-inc.com> wrote: This year’s Hadoop Summit (http://developer.yahoo.com/events/hadoopsummit09/) is confirmed for June 10th at the Santa Clara Marriott, and is now open for registration. We have a packed agenda, with three tracks: one for developers, one for administrators, and one focused on new and innovative applications using Hadoop. The presentations include talks from Amazon, IBM, Sun, Cloudera, Facebook, HP, Microsoft, and the Yahoo! team, as well as leading universities including UC Berkeley, CMU, Cornell, U of Maryland, U of Nebraska, and SUNY. From our experience last year with the rush for seats, I would encourage people to register early at http://hadoopsummit09.eventbrite.com/ Looking forward to seeing you at the summit! Ajay

-- get hadoop: cloudera.com/hadoop online training: cloudera.com/hadoop-training blog: cloudera.com/blog twitter: twitter.com/cloudera
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....
I read through the deck and sent it around the company. Good stuff! It's going to be a big help in getting the .NET Enterprise people to wrap their heads around web-scale data.

I must admit Apache Cloud Computing Edition is sort of unwieldy to say out loud, and frankly, Java Enterprise Edition is a taboo phrase at a lot of projects I've worked on. Guilt by association. I think I'll call it the Apache Cloud Stack, and reference Apache Cloud Computing Edition in my deck. When I think Stack, I think of a suite of software that provides all the pieces I need to solve my problem :)

On Tue, May 5, 2009 at 7:00 AM, Steve Loughran <ste...@apache.org> wrote: Bradford Stephens wrote: Hey all, I'm going to be speaking at OSCON about my company's experiences with Hadoop and Friends, but I'm having a hard time coming up with a name for the entire software ecosystem. I'm thinking of calling it the Apache CloudStack. Does this sound legit to you all? :) Is there something more 'official'?

We've been using Apache Cloud Computing Edition for this, to emphasise that this is the successor to Java Enterprise Edition, and that it is cross-language and being built at Apache. If you use the same term, even if you put up a different stack outline than ours, it gives the idea more legitimacy. The slides that Andrew linked to are all in SVN under http://svn.apache.org/repos/asf/labs/clouds/ -- we have a space in the Apache labs for Apache clouds, where we want to do more work integrating things, and bringing the idea of deploying and testing on someone else's infrastructure mainstream across all the Apache products. We would welcome your involvement -- and if you send a draft of your slides out, we'll happily review them. -steve
Re: Seattle / PNW Hadoop + Lucene User Group?
Thanks for the responses, everyone. Where shall we host? My company can offer space in our building in Factoria, but it's not exactly a 'cool' or 'fun' place. I can also reserve a room at a local library. I can bring some beer and light refreshments.

On Mon, Apr 20, 2009 at 7:22 AM, Matthew Hall <mh...@informatics.jax.org> wrote: Same here, sadly there isn't much call for Lucene user groups in Maine. It would be nice though ^^ Matt

Amin Mohammed-Coleman wrote: I would love to come, but I'm afraid I'm stuck in rainy old England :( Amin

On 18 Apr 2009, at 01:08, Bradford Stephens <bradfordsteph...@gmail.com> wrote: OK, we've got 3 people... that's enough for a party? :) Surely there must be dozens more of you guys out there... c'mon, accelerate your knowledge! Join us in Seattle!

On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens <bradfordsteph...@gmail.com> wrote: Greetings, Would anybody be willing to join a PNW Hadoop and/or Lucene User Group with me in the Seattle area? I can donate some facilities, etc. -- I also always have topics to speak about :) Cheers, Bradford

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Using the Stanford NLP with hadoop
Greetings, There's a way you can distribute files along with your MR job as part of its payload, or you could save the file to the same spot on every machine of your cluster with some rsyncing and hard-code loading it. This may be of some help: http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/filecache/DistributedCache.html

On Sat, Apr 18, 2009 at 5:18 AM, hari939 <hari...@gmail.com> wrote: My project of parsing through material for a semantic search engine requires me to use the Stanford NLP parser (http://nlp.stanford.edu/software/lex-parser.shtml) on a Hadoop cluster. To use the Stanford NLP parser, one must create a lexical parser object using an englishPCFG.ser.gz file as a constructor parameter. I have tried loading the file onto the Hadoop DFS in the /user/root/ folder, and have also tried packing the file along with the jar of the Java program. I am new to the Hadoop platform and am not very familiar with some of the salient features of Hadoop. Looking forward to any form of help.
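As a concrete (and hedged) illustration of the DistributedCache approach suggested above: the model file can be registered in the job configuration so Hadoop copies it to each task's local disk. In 0.18-era releases this corresponds roughly to properties like the following; the HDFS path is a made-up example, and the `#name` fragment only produces a working-directory symlink when symlink creation is also enabled -- check the DistributedCache javadoc linked above before relying on either detail:

```xml
<!-- Illustrative only: ship englishPCFG.ser.gz to each task's local disk. -->
<property>
  <name>mapred.cache.files</name>
  <value>hdfs://namenode:9000/user/root/englishPCFG.ser.gz#englishPCFG.ser.gz</value>
</property>
<property>
  <!-- expose the cached file as a symlink in the task's working directory -->
  <name>mapred.create.symlink</name>
  <value>yes</value>
</property>
```

Inside the task, the local copies can then be located via DistributedCache.getLocalCacheFiles(conf) and handed to the Stanford parser's constructor as an ordinary local path.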
Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks
There's definitely a false dichotomy to this paper, and I think it's a tad disingenuous. It's titled "A Comparison of Approaches to Large-Scale Data Analysis" when it should be titled "A Comparison of Parallel RDBMSs to MapReduce for RDBMS-Specific Problems". It's little surprise: the people who wrote the paper have been gunning for Hadoop for quite a while -- they've written papers before which describe MR as a Big Step Backwards. Not to mention the primary authors are the CTO of Vertica, a parallel DB company, and a lead tech from Microsoft.

We all know MapReduce is not meant for non-parallelizable, non-indexed tasks like O(1) access to data, table joins, grepping indexed stuff, etc. MapReduce excels at highly parallelizable tasks, like keyword and document indexing, web crawling, gene sequencing, etc. What would have been *great*, and what I'm working on a whitepaper for, is a study on what classes of problems are ideal for parallel RDBMSs, what are ideal for MapReduce, and then performance timing on those solutions. The study is about as useful as if I had written "A Comparison of Approaches to Operating System File Allocation Table Management" and then compared SQL and ext3. Yes, I'm in one of *those* moods today :)

Cheers, Bradford

On Wed, Apr 15, 2009 at 8:22 AM, Jonathan Gray <jl...@streamy.com> wrote: I agree with you, Andy. This seems to be a great look into what Hadoop MapReduce is not good at. Over in the HBase world, we constantly deal with comparisons like this to RDBMSs, trying to determine if one is better than the other. It's a false choice and completely depends on the use case. Hadoop is not suited for random access, joins, or dealing with subsets of your data; i.e., it is not a relational database! It's designed to distribute a full scan of a large dataset, placing tasks on the same nodes as the data they're processing. The emphasis is on task scheduling, fault tolerance, and very large datasets; low latency has not been a priority. There are no indexes to speak of; that's completely orthogonal to what it does, so of course there is an enormous disparity in cases where indexes make sense. Yes, B-tree indexes are a wonderful breakthrough in data technology :)

In short, I'm using Hadoop (HDFS and MapReduce) for a broad spectrum of applications including batch log processing, web crawling, and a number of machine learning and natural language processing jobs... These may not be tasks that DBMS-X or Vertica would be good at, if they're even capable of them, but they are all things that I would include under "Large-Scale Data Analysis". It would have been really interesting to see how things like Pig, Hive, and Cascading would stack up against DBMS-X/Vertica for very complex, multi-join/sort/etc. queries, across a broad spectrum of use cases and dataset/result sizes. There is a wide variety of solutions to these problems out there. It's important to know the strengths and weaknesses of each, so it's a bit unfortunate that this paper set the stage as it did. JG

On Wed, April 15, 2009 6:44 am, Andy Liu wrote: Not sure if comparing Hadoop to databases is an apples-to-apples comparison. Hadoop is a complete job execution framework, which collocates the data with the computation. I suppose DBMS-X and Vertica do that to a certain extent, by way of SQL, but you're restricted to that. If you want to, say, build a distributed web crawler, or a complex data processing pipeline, Hadoop will schedule those processes across a cluster for you, while Vertica and DBMS-X only deal with the storage of the data. The choice of experiments seemed skewed towards DBMS-X and Vertica. I think everybody is aware that MapReduce is inefficient for handling SQL-like queries and joins. It's also worth noting that 4 out of the 7 authors either currently or at one time worked with Vertica (or C-Store, the precursor to Vertica). Andy

On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio <germog...@gmail.com> wrote: (Hadoop is used in the benchmarks) http://database.cs.brown.edu/sigmod09/ "There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR
Re: Seattle / PNW Hadoop + Lucene User Group?
OK, we've got 3 people... that's enough for a party? :) Surely there must be dozens more of you guys out there... c'mon, accelerate your knowledge! Join us in Seattle! On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens bradfordsteph...@gmail.com wrote: Greetings, Would anybody be willing to join a PNW Hadoop and/or Lucene User Group with me in the Seattle area? I can donate some facilities, etc. -- I also always have topics to speak about :) Cheers, Bradford
Seattle / PNW Hadoop + Lucene User Group?
Greetings, Would anybody be willing to join a PNW Hadoop and/or Lucene User Group with me in the Seattle area? I can donate some facilities, etc. -- I also always have topics to speak about :) Cheers, Bradford
2009 Hadoop Summit?
Hey there, I was just wondering if there's plans for another Hadoop Summit this year? I went last March and learned quite a bit -- I'm excited to see what new things people have done since then. Cheers, Bradford
Avoiding Newline Problems in Hadoop Streaming + StreamXMLRecordReader
Greetings, I have an interesting problem I'm trying to solve. I currently store a bunch of webpages in a large XML file in Hadoop. I'm trying to parse information out of these webpages using a complex C# program that I have running on Mono (I'm in a Linux environment). Therefore, I'm using Hadoop Streaming and the StreamXmlRecordReader in order to get the information to my C# parser. The problem is that even wrapped in XML, Hadoop Streaming ends the records at newlines! This makes the map input data pretty useless. Does anyone have any hints on how to get around this? Here's the XML structure I'm trying to use:

<ContentRecord>
  <RecordURL>http://www.blah/</RecordURL>
  <PageContent><![CDATA[page text would be here, including newlines]]></PageContent>
</ContentRecord>

Any ideas? Cheers, Bradford
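For what it's worth, the record-framing logic needed here -- split the input on the element's begin/end tags and ignore newlines entirely -- can be sketched in a few lines. This is a standalone illustration using the tag names from the post above, not Hadoop's StreamXmlRecordReader itself:

```python
def xml_records(chunks, begin="<ContentRecord>", end="</ContentRecord>"):
    """Yield complete records delimited by begin/end tags, regardless of
    any newlines inside the record body."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            start = buf.find(begin)
            if start == -1:
                break
            stop = buf.find(end, start + len(begin))
            if stop == -1:
                break  # record not complete yet; wait for more input
            yield buf[start:stop + len(end)]
            buf = buf[stop + len(end):]

# A record whose PageContent spans several input lines still comes out whole.
lines = ["<ContentRecord><RecordURL>http://www.blah/</RecordURL>\n",
         "<PageContent>line one\nline two</PageContent></ContentRecord>\n"]
records = list(xml_records(lines))
```

The point of the sketch is just that tag-delimited framing is immune to embedded newlines; in streaming itself this is what the `-inputreader` option (with begin/end tag arguments) is meant to configure, so it's worth checking how that option is being passed.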
Re: Hadoop cluster build, machine specs
Greetings, It really depends on your budget. What are you looking to spend? $5k? $20k? Hadoop is about bringing the calculations to your data, so the more machines you can have, the better. In general, I'd recommend dual-core Opterons and 2-4 GB of RAM with a SATA hard drive. My company just ordered five such machines from Dell for Hadoop goodness, and I think the total came to around eight grand. Another alternative is Amazon EC2 and S3, of course. It all depends on what you want to do.

On Fri, Apr 4, 2008 at 5:27 PM, Ted Dziuba [EMAIL PROTECTED] wrote: Hi all, I'm looking to build a small, 5-10 node cluster to run mostly CPU-bound Hadoop jobs. I'm shying away from the 8-core behemoth type machines for cost reasons. But what about dual-core machines? 32 or 64 bits? I'm still in the planning stages, so any advice would be greatly appreciated. Thanks, Ted
Re: hadoop 0.15.3 r612257 freezes on reduce task
Hey everyone, I'm having a similar problem:

Map output lost, rescheduling: getMapOutput(task_200803281212_0001_m_00_2,0) failed : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find task_200803281212_0001_m_00_2/file.out.index in any of the configured local directories

Then it fails in about 10 minutes. I'm just trying to grep some etexts. New HDFS installation on 2 nodes (one master, one slave). Ubuntu Linux, Dell Core 2 Duo processors, Java 1.5.0. I have a feeling it's a configuration issue. Anyone else run into it?

On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner [EMAIL PROTECTED] wrote: We are running under Linux with DFS on GigE LANs, kernel 2.6.15-1.2054_FC5smp, with a variety of Xeon steppings for our processors. Our replication factor was set to 3. Florian Leibert wrote: Maybe it helps to know that we're running Hadoop inside Amazon's EC2... Thanks, Florian

-- Jason Venner, Attributor - Publish with Confidence, http://www.attributor.com/ -- Attributor is hiring Hadoop Wranglers, contact us if interested
Re: hadoop 0.15.3 r612257 freezes on reduce task
Also, I'm running hadoop 0.16.1 :)

On Fri, Mar 28, 2008 at 1:23 PM, Bradford Stephens [EMAIL PROTECTED] wrote: [previous message snipped]
Re: hadoop 0.15.3 r612257 freezes on reduce task
Thanks for the hint, Devaraj! I was using paths for mapred.local.dir that were based on ~/, so I gave it an absolute path instead. Also, the directory for hadoop.tmp.dir did not exist on one machine :)

On Fri, Mar 28, 2008 at 2:00 PM, Devaraj Das [EMAIL PROTECTED] wrote: Hi Bradford, Could you please check what your mapred.local.dir is set to? Devaraj.

-----Original Message----- From: Bradford Stephens [mailto:[EMAIL PROTECTED]] Sent: Saturday, March 29, 2008 1:54 AM To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Subject: Re: hadoop 0.15.3 r612257 freezes on reduce task [previous message snipped]
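For anyone hitting the same DiskErrorException: the fix described above amounts to making sure these two hadoop-site.xml entries are absolute paths that exist (and are writable) on every node -- no ~/ expansion happens here. The paths below are placeholders for illustration, not the ones from this thread:

```xml
<!-- Placeholders: both paths must be absolute and must exist on every node. -->
<property>
  <name>mapred.local.dir</name>
  <value>/var/hadoop/mapred/local</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/hadoop/tmp</value>
</property>
```

mapred.local.dir is where map outputs like file.out.index are written, which is why a bad or missing directory surfaces as "could not find ... in any of the configured local directories".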
Re: Amazon S3 questions
What sort of performance hit is there for using S3 vs. a local cluster?

On Sat, Mar 1, 2008 at 1:09 PM, Steve Sapovits [EMAIL PROTECTED] wrote: One other note: when you use S3 URIs, you get a "port out of range" error on startup, but that doesn't appear to be fatal. I spent a few hours on that one before I realized it didn't seem to matter. It seems like the S3 URI format, where ':' is used to separate the ID and secret key, is confusing something. -- Steve Sapovits, Invite Media - http://www.invitemedia.com [EMAIL PROTECTED]
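One way to sidestep the ID:secret-in-the-URI parsing confusion described above is to keep the credentials out of the URI altogether: put them in hadoop-site.xml and use a bare s3://bucket/path URI. If memory serves, the S3 filesystem reads these property names (the values below are obviously placeholders):

```xml
<!-- Placeholder credentials; with these set, URIs can be plain s3://mybucket/path -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```

Since the secret key can contain '/' and other characters that break URI parsing, the config-file route is also the safer of the two regardless of the port error.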
Re: MapReduce usage with Lucene Indexing
I'm actually going to be doing something similar with Nutch. I just started learning about Hadoop this week, so I'm interested in what everyone has to say :)

On Jan 24, 2008 5:00 PM, roger dimitri [EMAIL PROTECTED] wrote: Hi, I am very new to Hadoop, and I have a project where I need to use Lucene to index some input given either as a huge collection of Java objects or as one huge Java object. I read about Hadoop's MapReduce utilities, and I want to leverage that feature in the case described above. Can someone please tell me how I can approach this problem? All the Hadoop MapReduce examples out there show only file-based input and don't explicitly deal with data coming in as a huge Java object. Any help is greatly appreciated. Thanks, Roger