Re: THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle

2009-06-25 Thread Bradford Stephens
Hey all,

Just writing a quick note of thanks, we had another solid group of
people show up! As always, we learned quite a lot about interesting
use cases for Hadoop, Lucene, and the rest of the Apache 'Cloud
Stack'.

I couldn't get it taped, but we talked about:

-Scaling Lucene with Katta and its infrastructure
-The need for low-latency BI on distributed document stores
-Lots and lots of detail on Amazon Elastic MapReduce

We'll be doing it again next month -- July 29th.

On Mon, Jun 22, 2009 at 5:40 PM, Bradford Stephens
bradfordsteph...@gmail.com wrote:
 Hey all, just a friendly reminder that this is Wednesday! I hope to see
 everyone there again. Please let me know if there's something interesting
 you'd like to talk about -- I'll help however I can. You don't even need a
 PowerPoint presentation -- there are many whiteboards. I'll try to have a
 video cam, but no promises.
 Feel free to call 904-415-3009 if you need directions or have any questions :)
 ~~~~~~~~~~
 Greetings,

 On the heels of our smashing success last month, we're going to be
 convening the Pacific Northwest (Oregon and Washington)
 Hadoop/HBase/Lucene/etc. meetup on the last Wednesday of June, the
 24th.  The meeting should start at 6:45; organized chats will end
 around 8:00, and then there shall be discussion and socializing :)

 The meeting will be at the University of Washington in
 Seattle again. It's in the Computer Science building (not electrical
 engineering!), room 303, located
 here: http://www.washington.edu/home/maps/southcentral.html?80,70,792,660

 If you've ever wanted to learn more about distributed computing, or
 just see how other people are innovating with Hadoop, you can't miss
 this opportunity. Our focus is on learning and education, so every
 presentation must end with a few questions for the group to research
 and discuss. (But if you're an introvert, we won't mind).

 The format is two or three 15-minute deep dive talks, followed by
 several 5-minute lightning chats. We had a few interesting topics
 last month:

 -Building a Social Media Analysis company on the Apache Cloud Stack
 -Cancer detection in images using Hadoop
 -Real-time OLAP on HBase -- is it possible?
 -Video and Network Flow Analysis in Hadoop vs. Distributed RDBMS
 -Custom Ranking in Lucene

 We already have one deep dive scheduled this month, on truly
 scalable Lucene with Katta. If you've been looking for a way to handle
 those large Lucene indices, this is a must-attend!

 Looking forward to seeing everyone there again.

 Cheers,
 Bradford

 http://www.roadtofailure.com -- The Fringes of Distributed Computing,
 Computer Science, and Social Media.


Re: THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle

2009-06-23 Thread Bradford Stephens
Greetings,

I've gotten a few replies on this, but I'd really like to know who
else is coming. Just send me a quick note :)

Cheers,
Bradford

On Mon, Jun 22, 2009 at 5:40 PM, Bradford Stephens
bradfordsteph...@gmail.com wrote:
 Hey all, just a friendly reminder that this is Wednesday! I hope to see
 everyone there again. Please let me know if there's something interesting
 you'd like to talk about -- I'll help however I can. You don't even need a
 PowerPoint presentation -- there are many whiteboards. I'll try to have a
 video cam, but no promises.
 Feel free to call 904-415-3009 if you need directions or have any questions :)
 ~~~~~~~~~~
 Greetings,

 On the heels of our smashing success last month, we're going to be
 convening the Pacific Northwest (Oregon and Washington)
 Hadoop/HBase/Lucene/etc. meetup on the last Wednesday of June, the
 24th.  The meeting should start at 6:45; organized chats will end
 around 8:00, and then there shall be discussion and socializing :)

 The meeting will be at the University of Washington in
 Seattle again. It's in the Computer Science building (not electrical
 engineering!), room 303, located
 here: http://www.washington.edu/home/maps/southcentral.html?80,70,792,660

 If you've ever wanted to learn more about distributed computing, or
 just see how other people are innovating with Hadoop, you can't miss
 this opportunity. Our focus is on learning and education, so every
 presentation must end with a few questions for the group to research
 and discuss. (But if you're an introvert, we won't mind).

 The format is two or three 15-minute deep dive talks, followed by
 several 5-minute lightning chats. We had a few interesting topics
 last month:

 -Building a Social Media Analysis company on the Apache Cloud Stack
 -Cancer detection in images using Hadoop
 -Real-time OLAP on HBase -- is it possible?
 -Video and Network Flow Analysis in Hadoop vs. Distributed RDBMS
 -Custom Ranking in Lucene

 We already have one deep dive scheduled this month, on truly
 scalable Lucene with Katta. If you've been looking for a way to handle
 those large Lucene indices, this is a must-attend!

 Looking forward to seeing everyone there again.

 Cheers,
 Bradford

 http://www.roadtofailure.com -- The Fringes of Distributed Computing,
 Computer Science, and Social Media.


Re: Can you tell if a particular mapper was data local ?

2009-06-23 Thread Bradford Stephens
(Correct me if I'm wrong), but I think you can tell through the Hadoop
Web UI -- it'll show a count of which map tasks were data-local. You
can then click on that count to see a list of those tasks, and drill
down to see which nodes they ran on.
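
If you want the number programmatically, I believe you can also pull it
from the job counters with the old mapred API. A rough sketch -- the
counter group/name strings are my guess from what the web UI prints, so
double-check them against your version:

  import org.apache.hadoop.mapred.Counters;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RunningJob;

  public class DataLocalityReport {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(DataLocalityReport.class);
      // ... configure input/output paths, mapper, and reducer as usual ...

      // runJob() blocks until the job finishes; then read its counters.
      RunningJob job = JobClient.runJob(conf);
      Counters counters = job.getCounters();

      // Group/counter names guessed from the "Job Counters" section of
      // the JobTracker web UI -- verify against your Hadoop version.
      String group = "org.apache.hadoop.mapred.JobInProgress$Counter";
      long dataLocal = counters.findCounter(group, "DATA_LOCAL_MAPS").getCounter();
      long launched = counters.findCounter(group, "TOTAL_LAUNCHED_MAPS").getCounter();
      System.out.println(dataLocal + " of " + launched + " map tasks were data-local");
    }
  }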

On Tue, Jun 23, 2009 at 6:37 PM, Suratna Budalakoti
sura...@yahoo-inc.com wrote:
 Hi all,

 Is there any way to tell, from logs, or by reading/setting a counter, whether 
 a particular mapper was data local, i.e., it ran on the same node as its 
 input data?

 Thanks,
 Suratna



THIS WEEK: PNW Hadoop / Apache Cloud Stack Users' Meeting, Wed Jun 24th, Seattle

2009-06-22 Thread Bradford Stephens
Hey all, just a friendly reminder that this is Wednesday! I hope to see
everyone there again. Please let me know if there's something interesting
you'd like to talk about -- I'll help however I can. You don't even need a
PowerPoint presentation -- there are many whiteboards. I'll try to have a
video cam, but no promises.
Feel free to call 904-415-3009 if you need directions or have any questions :)

~~~~~~~~~~

Greetings,

On the heels of our smashing success last month, we're going to be
convening the Pacific Northwest (Oregon and Washington)
Hadoop/HBase/Lucene/etc. meetup on the last Wednesday of June, the
24th.  The meeting should start at 6:45; organized chats will end
around 8:00, and then there shall be discussion and socializing :)

The meeting will be at the University of Washington in
Seattle again. It's in the Computer Science building (not electrical
engineering!), room 303, located here:
http://www.washington.edu/home/maps/southcentral.html?80,70,792,660

If you've ever wanted to learn more about distributed computing, or
just see how other people are innovating with Hadoop, you can't miss
this opportunity. Our focus is on learning and education, so every
presentation must end with a few questions for the group to research
and discuss. (But if you're an introvert, we won't mind).

The format is two or three 15-minute deep dive talks, followed by
several 5-minute lightning chats. We had a few interesting topics
last month:

-Building a Social Media Analysis company on the Apache Cloud Stack
-Cancer detection in images using Hadoop
-Real-time OLAP on HBase -- is it possible?
-Video and Network Flow Analysis in Hadoop vs. Distributed RDBMS
-Custom Ranking in Lucene

We already have one deep dive scheduled this month, on truly
scalable Lucene with Katta. If you've been looking for a way to handle
those large Lucene indices, this is a must-attend!

Looking forward to seeing everyone there again.

Cheers,
Bradford

http://www.roadtofailure.com -- The Fringes of Distributed Computing,
Computer Science, and Social Media.


Re: [ANN] HBase 0.20.0-alpha available for download

2009-06-16 Thread Bradford Stephens
Oh sweet. This will be a most excellent party.

On Tue, Jun 16, 2009 at 10:23 PM, stack st...@duboce.net wrote:
 An alpha version of HBase 0.20.0 is available for download at:

  http://people.apache.org/~stack/hbase-0.20.0-alpha/

 We are making this release available to preview what is coming in HBase
 0.20.0.  In short, 0.20.0 is about performance and high-availability.  Also,
 a new, richer API has been added and the old one deprecated.  Here is a list of
 almost 300 issues addressed so far in 0.20.0: http://tinyurl.com/ntvheo

 This alpha release contains known bugs.  See http://tinyurl.com/kvfsft for
 the current list.  In particular, this alpha release is without a migration
 script to bring your 0.19.x era data forward to work on hbase 0.20.0.  A
 working, well-tested migration script will be in place before we cut the
 first HBase 0.20.0 release candidate some time in the next week or so.

 After download, please take the time to review the 0.20.0 'Getting Started'
 also available here:
 http://people.apache.org/~stack/hbase-0.20.0-alpha/docs/api/overview-summary.html#overview_description.
 HBase 0.20.0 has new dependencies, in particular it now depends on
 ZooKeeper.  With ZooKeeper in the mix a few core HBase configurations have
 been removed and replaced with ZooKeeper configurations instead.

 Also of note, HBase 0.20.0 will include Stargate, an improved REST
 connector for HBase.  The old, bundled REST connector will be deprecated.
 Stargate is implemented using the Jersey framework.  It includes protobuf
 encoding support, has caching proxy awareness, supports batching for
 scanners and updates, and in general has the goal of enabling Web scale
 storage systems (a la S3) backed by HBase.  Currently it's only available up
 on github, http://github.com/macdiesel/stargate/tree/master.  It will be
 added to a new contrib directory before we cut a release candidate.

 Please let us know if you have difficulty with the install, if you find the
 documentation missing, or if you trip over bugs while hbasing.

 Yours,
 The HBasistas



Re: Seattle / PNW Hadoop + Lucene User Group?

2009-06-03 Thread Bradford Stephens
Hey everyone!
I just wanted to give a BIG THANKS to everyone who came. We had over a
dozen people, and a few got lost at UW :)  [I would have sent this update
earlier, but I flew to Florida the day after the meeting].

If you didn't come, you missed quite a bit of learning on topics such as:

-Building a Social Media Analysis company on the Apache Cloud Stack
-Cancer detection in images using Hadoop
-Real-time OLAP
-Scalable Lucene using Katta and Hadoop
-Video and Network Flow
-Custom Ranking in Lucene

I'm going to update our wiki with the topics, the questions raised, and
the lessons we've learned.

The next meetup will be June 24th. Be there, or be... boring :)

Cheers,
Bradford

On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens 
bradfordsteph...@gmail.com wrote:

 Greetings,

 Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
 with me in the Seattle area? I can donate some facilities, etc. -- I
 also always have topics to speak about :)

 Cheers,
 Bradford



Re: Seattle / PNW Hadoop + Lucene User Group?

2009-06-03 Thread Bradford Stephens
Sorry, no videos this time. The conversation wasn't very structured... next
month I'll record it :)

On Wed, Jun 3, 2009 at 1:59 PM, Bhupesh Bansal bban...@linkedin.com wrote:

 Great Bradford,

 Can you post some videos if you have some?

 Best
 Bhupesh



 On 6/3/09 11:58 AM, Bradford Stephens bradfordsteph...@gmail.com
 wrote:

  Hey everyone!
  I just wanted to give a BIG THANKS to everyone who came. We had over a
  dozen people, and a few got lost at UW :)  [I would have sent this update
  earlier, but I flew to Florida the day after the meeting].
 
  If you didn't come, you missed quite a bit of learning on topics such
 as:
 
  -Building a Social Media Analysis company on the Apache Cloud Stack
  -Cancer detection in images using Hadoop
  -Real-time OLAP
  -Scalable Lucene using Katta and Hadoop
  -Video and Network Flow
  -Custom Ranking in Lucene
 
  I'm going to update our wiki with the topics, the questions raised,
 and the lessons we've learned.
 
  The next meetup will be June 24th. Be there, or be... boring :)
 
  Cheers,
  Bradford
 
  On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens 
  bradfordsteph...@gmail.com wrote:
 
  Greetings,
 
  Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
  with me in the Seattle area? I can donate some facilities, etc. -- I
  also always have topics to speak about :)
 
  Cheers,
  Bradford
 




Re: Seattle / PNW Hadoop + Lucene User Group?

2009-05-19 Thread Bradford Stephens
Hello everyone! We (finally) have space secured (it's a tough task!):
University of Washington, Allen Center Room 303, at 6:45pm on Wednesday, May
27, 2009.
I'm going to put together a map and a wiki so we can collaborate.

What I'm envisioning is a meetup for about 2 hours: we'll have two in-depth
talks of 15-20 minutes each, and then several lightning talks of 5
minutes. We'll then have discussion and 'social time'.
Let me know if you're interested in speaking or attending.

I'd like to focus on education, so every presentation *needs* to ask some
questions at the end. We can talk about these after the presentations, and
I'll record what we've learned in the wiki and share it with everyone.

Looking forward to meeting you all!

Cheers,
Bradford

On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens 
bradfordsteph...@gmail.com wrote:

 Greetings,

 Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
 with me in the Seattle area? I can donate some facilities, etc. -- I
 also always have topics to speak about :)

 Cheers,
 Bradford



Re: Free Training at 2009 Hadoop Summit

2009-05-11 Thread Bradford Stephens
Hey there,

I notice this is already sold out -- any chance of more openings? :)

Cheers,
Bradford

On Tue, May 5, 2009 at 6:25 PM, Christophe Bisciglia
christo...@cloudera.com wrote:
 Just wanted to follow up on this and let everyone know that Cloudera and Y!
 are teaming up to offer two day-long training sessions for free on the day
 after the summit (June 11th).

 We'll cover Hadoop basics, Pig, Hive and some new tools Cloudera is
 releasing for importing data to Hadoop from existing databases.

 http://hadoopsummit09-training.eventbrite.com

 Each of these sessions normally runs about $1000, but we're taking advantage
 of having so much of the Hadoop community in one place and offering them for
 free at the 2009 Hadoop Summit.

 Basic training is appropriate for people just getting started with Hadoop,
 and the advanced training will focus on augmenting your existing
 infrastructure with Hadoop and taking advantage of Hadoop's advanced
 features and related projects.

 Space is limited, so sign up before time runs out.

 Hope to see you there!

 Christophe and the Cloudera Team

 On Wed, May 6, 2009 at 6:10 AM, Ajay Anand aan...@yahoo-inc.com wrote:
 This year’s Hadoop Summit
 (http://developer.yahoo.com/events/hadoopsummit09/) is confirmed for June
 10th at the Santa Clara Marriott, and is now open for registration.



 We have a packed agenda, with three tracks – one for developers, one for
 administrators, and one focused on new and innovative applications using
 Hadoop. The presentations include talks from Amazon, IBM, Sun, Cloudera,
 Facebook, HP, Microsoft, and the Yahoo! team, as well as leading universities
 including UC Berkeley, CMU, Cornell, U of Maryland, U of Nebraska, and SUNY.



 From our experience last year with the rush for seats, I would encourage
 people to register early at http://hadoopsummit09.eventbrite.com/



 Looking forward to seeing you at the summit!



 Ajay



 --
 get hadoop: cloudera.com/hadoop
 online training: cloudera.com/hadoop-training
 blog: cloudera.com/blog
 twitter: twitter.com/cloudera



Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

2009-05-05 Thread Bradford Stephens
I read through the deck and sent it around the company. Good stuff!
It's going to be a big help for trying to get the .NET Enterprise
people wrapping their heads around web-scale data.

I must admit "Apache Cloud Computing Edition" is sort of unwieldy to
say out loud, and frankly "Java Enterprise Edition" is a taboo phrase
at a lot of projects I've worked on. Guilt by association. I think I'll call
it the "Apache Cloud Stack", and reference "Apache Cloud Computing
Edition" in my deck. When I think "Stack", I think of a suite of
software that provides all the pieces I need to solve my problem :)

On Tue, May 5, 2009 at 7:00 AM, Steve Loughran ste...@apache.org wrote:
 Bradford Stephens wrote:

 Hey all,

 I'm going to be speaking at OSCON about my company's experiences with
 Hadoop and Friends, but I'm having a hard time coming up with a name
 for the entire software ecosystem. I'm thinking of calling it the
 "Apache CloudStack". Does this sound legit to you all? :) Is there
 something more 'official'?

 We've been using "Apache Cloud Computing Edition" for this, to emphasise
 that this is the successor to Java Enterprise Edition, and that it is cross
 language and being built at Apache. If you use the same term, even if you
 put a different stack outline than ours, it gives the idea more legitimacy.

 The slides that Andrew linked to are all in SVN under
 http://svn.apache.org/repos/asf/labs/clouds/

 We have a space in the Apache Labs for Apache Clouds, where we want to do
 more work integrating things and bring the idea of deploy-and-test on
 someone else's infrastructure mainstream across all the Apache products. We
 would welcome your involvement -- and if you send a draft of your slides out,
 we will happily review them.

 -steve



Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-20 Thread Bradford Stephens
Thanks for the responses, everyone. Where shall we host? My company
can offer space in our building in Factoria, but it's not exactly a
'cool' or 'fun' place. I can also reserve a room at a local library. I
can bring some beer and light refreshments.

On Mon, Apr 20, 2009 at 7:22 AM, Matthew Hall mh...@informatics.jax.org wrote:
 Same here; sadly, there isn't much call for Lucene user groups in Maine.  It
 would be nice though ^^

 Matt

 Amin Mohammed-Coleman wrote:

 I would love to come but I'm afraid I'm stuck in rainy old England :(

 Amin

 On 18 Apr 2009, at 01:08, Bradford Stephens bradfordsteph...@gmail.com
 wrote:

 OK, we've got 3 people... that's enough for a party? :)

 Surely there must be dozens more of you guys out there... c'mon,
 accelerate your knowledge! Join us in Seattle!



 On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens
 bradfordsteph...@gmail.com wrote:

 Greetings,

 Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
 with me in the Seattle area? I can donate some facilities, etc. -- I
 also always have topics to speak about :)

 Cheers,
 Bradford






Re: Using the Stanford NLP with hadoop

2009-04-18 Thread Bradford Stephens
Greetings,

There's a way you can distribute files along with your MR job as part
of its payload, or you could save the file to the same spot on every
machine in your cluster with some rsyncing and hard-code the load path.

This may be of some help:
http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/filecache/DistributedCache.html
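
If you go the DistributedCache route, a minimal sketch of the job setup
looks something like this (the HDFS path and the parser constructor are
taken from your message below, so treat them as placeholders):

  import java.net.URI;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.mapred.JobConf;

  public class ParserJobSetup {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ParserJobSetup.class);
      // The model file must already be in HDFS, e.g.:
      //   bin/hadoop dfs -put englishPCFG.ser.gz /user/root/
      DistributedCache.addCacheFile(new URI("/user/root/englishPCFG.ser.gz"), conf);
      // ... set input/output paths and the mapper class, then JobClient.runJob(conf)
    }
  }

Each task can then load its local copy from the Mapper's configure()
method, roughly:

  Path[] cached = DistributedCache.getLocalCacheFiles(conf);
  LexicalizedParser parser = new LexicalizedParser(cached[0].toString());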

On Sat, Apr 18, 2009 at 5:18 AM, hari939 hari...@gmail.com wrote:

 My project of parsing material for a semantic search engine requires me to
 use the Stanford NLP parser (http://nlp.stanford.edu/software/lex-parser.shtml)
 on a Hadoop cluster.

 To use the Stanford NLP parser, one must create a lexical parser object
 using an englishPCFG.ser.gz file as a constructor parameter.
 I have tried loading the file onto the Hadoop DFS in the /user/root/ folder
 and have also tried packing the file along with the jar of the Java program.

 I am new to the Hadoop platform and am not very familiar with some of its
 salient features.

 Looking forward to any form of help.



Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-17 Thread Bradford Stephens
There's definitely a false dichotomy to this paper, and I think it's a
tad disingenuous. It's titled "A Comparison of Approaches to Large-Scale
Data Analysis" when it should be titled "A Comparison of Parallel RDBMSs
to MapReduce for RDBMS-Specific Problems". There's little surprise: the
people who wrote the paper have been gunning for Hadoop for quite a
while -- they've written papers before which describe MR as "a big step
backwards". Not to mention the primary authors are the CTO of Vertica, a
parallel DB company, and a lead tech from Microsoft.

We all know MapReduce is not meant for non-parallelizable, non-indexed
tasks like O(1) access to data, table joins, grepping indexed stuff,
etc. MapReduce excels at highly parallelizable tasks, like keyword and
document indexing, web crawling, gene sequencing, etc.

What would have been *great*, and what I'm working on a whitepaper
for, is a study on which classes of problems are ideal for parallel
RDBMSs, which are ideal for MapReduce, and then performance timings on
those solutions.

The study is about as useful as if I had written "A Comparison of
Approaches to Operating System File Allocation Table Management" and
then compared SQL and Ext3.

Yes, I'm in one of *those* moods today :)

Cheers,
Bradford

On Wed, Apr 15, 2009 at 8:22 AM, Jonathan Gray jl...@streamy.com wrote:
 I agree with you, Andy.

 This seems to be a great look into what Hadoop MapReduce is not good at.

 Over in the HBase world, we constantly deal with comparisons like this to
 RDBMSs, trying to determine if one is better than the other.  It's a false
 choice and completely depends on the use case.

 Hadoop is not suited for random access, joins, or dealing with subsets of
 your data; i.e., it is not a relational database!  It's designed to
 distribute a full scan of a large dataset, placing tasks on the same nodes
 as the data they're processing.  The emphasis is on task scheduling, fault
 tolerance, and very large datasets; low latency has not been a priority.
 There are no indexes to speak of -- they're completely orthogonal to what
 Hadoop does -- so of course there is an enormous disparity in cases where
 an index makes sense.  Yes, B-Tree indexes are a wonderful breakthrough in
 data technology :)

 In short, I'm using Hadoop (HDFS and MapReduce) for a broad spectrum of
 applications including batch log processing, web crawling, and a number of
 machine learning and natural language processing jobs... These may not be
 tasks that DBMS-X or Vertica would be good at, if even capable of them,
 but they are all things I would include under "Large-Scale Data Analysis".

 It would have been really interesting to see how things like Pig, Hive, and
 Cascading would stack up against DBMS-X/Vertica for very complex,
 multi-join/sort/etc queries, across a broad spectrum of use cases and
 dataset/result sizes.

 There are a wide variety of solutions to the problems out there.  It's
 important to know the strengths and weaknesses of each, so it's a bit
 unfortunate that this paper set the stage as it did.

 JG

 On Wed, April 15, 2009 6:44 am, Andy Liu wrote:
 Not sure if comparing Hadoop to databases is an apples-to-apples
 comparison.  Hadoop is a complete job execution framework, which
 collocates the data with the computation.  I suppose DBMS-X and Vertica do
 that to a certain extent, by way of SQL, but you're restricted to that.
 If you want
 If you want
 to say, build a distributed web crawler, or a complex data processing
 pipeline, Hadoop will schedule those processes across a cluster for you,
 while Vertica and DBMS-X only deal with the storage of the data.

 The choice of experiments seemed skewed towards DBMS-X and Vertica.  I
 think everybody is aware that Map-Reduce is inefficient for handling
 SQL-like
 queries and joins.

 It's also worth noting that I think 4 out of the 7 authors either currently
 work or at one time worked with Vertica (or C-Store, the precursor to
 Vertica).


 Andy


 On Tue, Apr 14, 2009 at 10:16 AM, Guilherme Germoglio
germog...@gmail.com wrote:


 (Hadoop is used in the benchmarks)


 http://database.cs.brown.edu/sigmod09/


 There is currently considerable enthusiasm around the MapReduce
 (MR) paradigm for large-scale data analysis [17]. Although the
 basic control flow of this framework has existed in parallel SQL
 database management systems (DBMS) for over 20 years, some have called
 MR a dramatically new computing model [8, 17]. In this paper, we
 describe and compare both paradigms. Furthermore, we evaluate both
 kinds of systems in terms of performance and development complexity.
 To this end, we define a benchmark consisting of a collection of tasks
 that we have run on an open source version of MR as well as on two
 parallel DBMSs. For each task, we measure each system’s performance
 for various degrees of parallelism on a cluster of 100 nodes. Our
 results reveal some interesting trade-offs. Although the process to
 load data into and tune the execution of parallel DBMSs took much
 longer than the MR 

Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-17 Thread Bradford Stephens
OK, we've got 3 people... that's enough for a party? :)

Surely there must be dozens more of you guys out there... c'mon,
accelerate your knowledge! Join us in Seattle!



On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens
bradfordsteph...@gmail.com wrote:
 Greetings,

 Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
 with me in the Seattle area? I can donate some facilities, etc. -- I
 also always have topics to speak about :)

 Cheers,
 Bradford



Seattle / PNW Hadoop + Lucene User Group?

2009-04-16 Thread Bradford Stephens
Greetings,

Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
with me in the Seattle area? I can donate some facilities, etc. -- I
also always have topics to speak about :)

Cheers,
Bradford


2009 Hadoop Summit?

2009-01-29 Thread Bradford Stephens
Hey there,

I was just wondering if there's plans for another Hadoop Summit this
year? I went last March and learned quite a bit -- I'm excited to see
what new things people have done since then.

Cheers,
Bradford


Avoiding Newline Problems in Hadoop Streaming + StreamXMLRecordReader

2008-05-21 Thread Bradford Stephens
Greetings,

I have an interesting problem I'm trying to solve. I currently store a bunch
of webpages in a large XML file in Hadoop. I'm trying to parse information
out of these webpages using a complex C# program that I have running on Mono
(I'm in a Linux environment). Therefore, I'm using Hadoop Streaming and the
StreamXMLRecordReader in order to get the information to my C# parser. The
problem is that even wrapped in XML, Hadoop Streaming ends the records
at newlines! This makes the map input data pretty useless. Does anyone have
any hints on how to get around this?

Here's the XML structure I'm trying to use:

<ContentRecord>
  <RecordURL>http://www.blah/</RecordURL>
  <PageContent><![CDATA[page text would be here, including newlines]]></PageContent>
</ContentRecord>
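
For reference, I'm launching the job roughly like this (paths and the
mapper command are placeholders, and I'm going from the streaming docs
for the -inputreader spelling, so correct me if I have it wrong):

  bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
      -input /data/webpages.xml \
      -output /data/parsed \
      -mapper "mono parser.exe" \
      -inputreader "StreamXmlRecord,begin=<ContentRecord>,end=</ContentRecord>"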

Any ideas?

Cheers,
Bradford


Re: Hadoop cluster build, machine specs

2008-04-04 Thread Bradford Stephens
Greetings,

It really depends on your budget. What are you looking to spend? $5k?
$20k? Hadoop is about bringing the calculations to your data, so the
more machines you can have, the better.

In general, I'd recommend dual-core Opterons and 2-4 GB of RAM with a
SATA hard drive. My company just ordered five such machines from Dell
for Hadoop goodness, and I think the total came to around eight grand.

Another alternative is Amazon EC2 and S3, of course. It all depends on
what you want to do.


On Fri, Apr 4, 2008 at 5:27 PM, Ted Dziuba [EMAIL PROTECTED] wrote:
 Hi all,

  I'm looking to build a small, 5-10 node cluster to run mostly CPU-bound
 Hadoop jobs.  I'm shying away from the 8-core behemoth type machines for
 cost reasons.  But what about dual core machines?  32 or 64 bits?

  I'm still in the planning stages, so any advice would be greatly
 appreciated.

  Thanks,

  Ted



Re: hadoop 0.15.3 r612257 freezes on reduce task

2008-03-28 Thread Bradford Stephens
Hey everyone,

I'm having a similar problem:

Map output lost, rescheduling:
getMapOutput(task_200803281212_0001_m_00_2,0) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
task_200803281212_0001_m_00_2/file.out.index in any of the
configured local directories

Then it fails in about 10 minutes. I'm just trying to grep some etexts.

New HDFS installation on 2 nodes (one master, one slave). Ubuntu
Linux, Dell Core 2 Duo processors, Java 1.5.0.

I have a feeling it's a configuration issue. Anyone else run into it?


On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner [EMAIL PROTECTED] wrote:
 We are running under linux with dfs on GiGE lans,  kernel
  2.6.15-1.2054_FC5smp, with a variety of xeon steppings for our processors.
  Our replication factor was set to 3



  Florian Leibert wrote:
   Maybe it helps to know that we're running Hadoop inside amazon's EC2...
  
   Thanks,
   Florian
  

  --
  Jason Venner
  Attributor - Publish with Confidence http://www.attributor.com/
  Attributor is hiring Hadoop Wranglers, contact if interested



Re: hadoop 0.15.3 r612257 freezes on reduce task

2008-03-28 Thread Bradford Stephens
Also, I'm running hadoop 0.16.1 :)

On Fri, Mar 28, 2008 at 1:23 PM, Bradford Stephens
[EMAIL PROTECTED] wrote:
 Hey everyone,

  I'm having a similar problem:

  Map output lost, rescheduling:
  getMapOutput(task_200803281212_0001_m_00_2,0) failed :

 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
  task_200803281212_0001_m_00_2/file.out.index in any of the
  configured local directories

  Then it fails in about 10 minutes. I'm just trying to grep some etexts.

  New HDFS installation on 2 nodes (one master, one slave). Ubuntu
  Linux, Dell Core 2 Duo processors, Java 1.5.0.

  I have a feeling it's a configuration issue. Anyone else run into it?




  On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner [EMAIL PROTECTED] wrote:
   We are running under linux with dfs on GiGE lans,  kernel
2.6.15-1.2054_FC5smp, with a variety of xeon steppings for our processors.
Our replication factor was set to 3
  
  
  
Florian Leibert wrote:
 Maybe it helps to know that we're running Hadoop inside amazon's EC2...

 Thanks,
 Florian

  
--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested
  



Re: hadoop 0.15.3 r612257 freezes on reduce task

2008-03-28 Thread Bradford Stephens
Thanks for the hint, Devaraj! I was using paths for
mapred.local.dir that were based on ~/, so I gave it an absolute path
instead. Also, the directory for hadoop.tmp.dir did not exist on one
machine :)
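
In other words, something like this in hadoop-site.xml on every node (the
values here are just examples -- the point is absolute paths that exist):

  <property>
    <name>mapred.local.dir</name>
    <value>/home/hadoop/mapred/local</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>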


On Fri, Mar 28, 2008 at 2:00 PM, Devaraj Das [EMAIL PROTECTED] wrote:
 Hi Bradford,
  Could you please check what your mapred.local.dir is set to?
  Devaraj.



   -Original Message-
   From: Bradford Stephens [mailto:[EMAIL PROTECTED]
   Sent: Saturday, March 29, 2008 1:54 AM
   To: core-user@hadoop.apache.org
   Cc: [EMAIL PROTECTED]
   Subject: Re: hadoop 0.15.3 r612257 freezes on reduce task
  
   Hey everyone,
  
   I'm having a similar problem:
  
   Map output lost, rescheduling:
   getMapOutput(task_200803281212_0001_m_00_2,0) failed :
   org.apache.hadoop.util.DiskChecker$DiskErrorException: Could
   not find task_200803281212_0001_m_00_2/file.out.index in
   any of the configured local directories
  
   Then it fails in about 10 minutes. I'm just trying to grep
   some etexts.
  
   New HDFS installation on 2 nodes (one master, one slave).
   Ubuntu Linux, Dell Core 2 Duo processors, Java 1.5.0.
  
   I have a feeling it's a configuration issue. Anyone else run into it?
  
  
   On Tue, Jan 29, 2008 at 11:08 AM, Jason Venner
   [EMAIL PROTECTED] wrote:
We are running under linux with dfs on GiGE lans,  kernel
2.6.15-1.2054_FC5smp, with a variety of xeon steppings for
   our processors.
 Our replication factor was set to 3
   
   
   
 Florian Leibert wrote:
  Maybe it helps to know that we're running Hadoop inside
   amazon's EC2...
 
  Thanks,
  Florian
 
   
 --
 Jason Venner
 Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested
   
  




Re: Amazon S3 questions

2008-03-01 Thread Bradford Stephens
What sort of performance hit is there for using S3 vs. a local cluster?

On Sat, Mar 1, 2008 at 1:09 PM, Steve Sapovits
[EMAIL PROTECTED] wrote:

  One other note: when you use S3 URIs, you get a "port out of range" error
  on startup, but that doesn't appear to be fatal.  I spent a few hours on
  that one before I realized it didn't seem to matter.  It seems like the S3
  URI format, where ':' is used to separate the ID and secret key, is
  confusing someone.



  --
  Steve Sapovits
  Invite Media  -  http://www.invitemedia.com
  [EMAIL PROTECTED]




Re: MapReduce usage with Lucene Indexing

2008-01-24 Thread Bradford Stephens
I'm actually going to be doing something similar, with Nutch. I just
started learning about Hadoop this week, so I'm interested in what
everyone has to say :)

On Jan 24, 2008 5:00 PM, roger dimitri [EMAIL PROTECTED] wrote:
 Hi,
   I am very new to Hadoop, and I have a project where I need to use Lucene
 to index some input given either as a huge collection of Java objects or
 one huge Java object.
   I read about Hadoop's MapReduce utilities, and I want to leverage that
 feature in the case described above.
   Can someone please tell me how I can approach the problem described above?
 All of the Hadoop MapReduce examples out there show only file-based input
 and don't explicitly deal with data coming in as one huge Java object, so
 to speak.

 Any help is greatly appreciated.

 Thanks,
 Roger




   
 