Re: Poor IO performance on a 10 node cluster.

2011-06-01 Thread Ted Dunning
It is also worth using dd to verify your raw disk speeds.

Also, expressing disk transfer rates in bytes per second makes it a bit
easier for most of the disk people I know to figure out what is large or
small.

Each of these disks should do about 100MB/s when driven well.  Hadoop
does OK, but not nearly full capacity, so I would expect more like
40-50MB/s per disk.  Also, if one of your disks is doing double duty you may
have some extra cost.  That gives 40 x 2 x 10 = 800MB per second.  You are doing
100GB / 220 seconds = roughly 0.45 GB/s, which isn't so terribly bad.  It is
definitely less than the theoretically possible 100 x 2 x 10 = 2GB/second, and I
would expect you could tune this up a little, but not a massive amount.

In general, blade servers do not make good Hadoop nodes precisely because the
I/O performance tends to be low when you only have a few spindles.

One other reason that this might be a bit below expectations is that your
files may not be well distributed on your cluster.  Can you say what you
used to upload the files?

On Wed, Jun 1, 2011 at 5:56 PM, hadoopman hadoop...@gmail.com wrote:

 Some things which helped us include setting your vm.swappiness to 0 and
 mounting your disks with noatime,nodiratime options.

 Also make sure your disks aren't setup with RAID (JBOD is recommended)

 You might want to run terasort as you tweak your environment.  It's very
 helpful when checking if a change helped (or hurt) your cluster.

 Hope that helps a bit.

 On 05/30/2011 06:27 AM, Gyuribácsi wrote:


 Hi,

 I have a 10 node cluster (IBM blade servers, 48GB RAM, 2x500GB Disk, 16 HT
 cores).

 I've uploaded 10 files to HDFS. Each file is 10GB. I used the streaming
 jar
 with 'wc -l' as mapper and 'cat' as reducer.

 I use 64MB block size and the default replication (3).

 The wc on the 100 GB took about 220 seconds, which translates to about 3.5
 Gbit/sec processing speed. One disk can do sequential reads at 1Gbit/sec, so
 I would expect something around 20 Gbit/sec (minus some overhead), and I'm
 getting only 3.5.

 Is my expectation valid?

 I checked the jobtracker and it seems all nodes are working, each reading
 the right blocks. I have not played with the number of mappers and reducers
 yet. It seems the number of mappers is the same as the number of blocks and
 the number of reducers is 20 (there are 20 disks). This looks ok to me.

 We also did an experiment with TestDFSIO with similar results. Aggregated
 read io speed is around 3.5Gbit/sec. It is just too far from my
 expectation:(

 Please help!

 Thank you,
 Gyorgy






Re: trying to select technology

2011-05-31 Thread Ted Dunning
To pile on, thousands or millions of documents are well within the range
that is well addressed by Lucene.

Solr may be an even better option than bare Lucene since it handles lots of
the boilerplate problems like document parsing and index update scheduling.
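To make that concrete, here is a rough, untested sketch against the Lucene 3.x-era
API (the index path, field names, and document content are illustrative only, and
Solr would hide most of this behind its HTTP interface):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TinyIndexDemo {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File("/tmp/demo-index"));
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

    // Index one document, storing the original full text alongside the inverted index.
    IndexWriter writer = new IndexWriter(dir, analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("path", "doc-0001.xml", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", "full text of the document goes here",
        Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // Query the index.
    IndexSearcher searcher = new IndexSearcher(dir);
    QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
    TopDocs hits = searcher.search(parser.parse("document"), 10);
    System.out.println("hits: " + hits.totalHits);
    searcher.close();
  }
}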

On Tue, May 31, 2011 at 11:56 AM, Matthew Foley ma...@yahoo-inc.com wrote:

 Sounds like you're looking for a full-text inverted index.  Lucene is a
 good opensource implementation of that.  I believe it has an option for
 storing the original full text as well as the indexes.
 --Matt

 On May 31, 2011, at 10:50 AM, cs230 wrote:


 Hello All,

 I am planning to start a project where I have to do extensive storage of xml
 and text files. On top of that I have to implement an efficient algorithm for
 searching over thousands or millions of files, and also build some indexes to
 make subsequent searches faster.

 I looked into the Oracle database but it delivers very poor results. Can I use
 Hadoop for this? Which Hadoop project would be the best fit for this?

 Is there anything from Google I can use?

 Thanks a lot in advance.
 --
 View this message in context:
 http://old.nabble.com/trying-to-select-technology-tp31743063p31743063.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: Simple change to WordCount either times out or runs 18+ hrs with little progress

2011-05-24 Thread Ted Dunning
itr.nextToken() is inside the if, so once count passes 5 no tokens are ever
consumed, hasMoreTokens() stays true, and the loop never terminates.

On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote:

while (itr.hasMoreTokens()) {
if(count == 5)
{
word.set(itr.nextToken());
output.collect(word, one);
}
count++;
  }
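
(A minimal sketch of the fix being pointed at, assuming the intent is to emit only
the token at position 5 of each line: consume a token on every pass so the loop can
terminate, and keep the counter test only around the emit. The surrounding word,
one, and output variables are the usual WordCount mapper fields.)

int count = 0;
while (itr.hasMoreTokens()) {
  String token = itr.nextToken();   // always advance the tokenizer
  if (count == 5) {
    word.set(token);
    output.collect(word, one);
  }
  count++;
}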



Re: Hadoop and WikiLeaks

2011-05-19 Thread Ted Dunning
ZK started as a sub-project of Hadoop.

On Thu, May 19, 2011 at 7:27 AM, M. C. Srivas mcsri...@gmail.com wrote:

 Interesting to note that Cassandra and ZK are now considered Hadoop
 projects.

 They were independent of Hadoop before the recent update.


 On Thu, May 19, 2011 at 4:18 AM, Steve Loughran ste...@apache.org wrote:

  On 18/05/11 18:05, javam...@cox.net wrote:
 
  Yes!
 
  -Pete
 
   Edward Caprioloedlinuxg...@gmail.com  wrote:
 
  =
  http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F
 
  March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation
  Awards
 
  The Hadoop project won the innovator of the year award from the UK's
  Guardian newspaper, where it was described as having the potential to be a
  greater catalyst for innovation than other nominees including WikiLeaks
  and the iPad.
 
  Does this copy text bother anyone else? Sure winning any award is great
  but
  does hadoop want to be associated with innovation like WikiLeaks?
 
 
 
  Ian updated the page yesterday with changes I'd put in for trademarks,
 and
  I added this news quote directly from the paper. We could strip out the
  quote easily enough.
 
 



Re: matrix-vector multiply in hadoop

2011-05-17 Thread Ted Dunning
Try using the Apache Mahout code that solves exactly this problem.

Mahout has a distributed row-wise matrix that is read one row at a time.
 Dot products with the vector are computed and the results are collected.
 This capability is used extensively in the large scale SVD's in Mahout.
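
For concreteness, here is a rough, untested sketch of the plain map-reduce
formulation described in the quoted message below (this is not the Mahout code; it
assumes the matrix is stored as text lines of the form "row col value" and that the
vector is small enough for each mapper to load from a local side file whose path is
passed in the JobConf; the property name is made up, and in practice you would ship
the file with -files or the DistributedCache):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MatVec {
  // Mapper: for each matrix entry, emit key = row i, value = m[i][j] * v[j].
  public static class MultiplyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    private double[] v;

    @Override
    public void configure(JobConf job) {
      // Load the vector, one value per line, from a local side file.
      // "matvec.vector.path" is a made-up property name for this sketch.
      List<Double> values = new ArrayList<Double>();
      try {
        BufferedReader in = new BufferedReader(new FileReader(job.get("matvec.vector.path")));
        String line;
        while ((line = in.readLine()) != null) {
          values.add(Double.parseDouble(line.trim()));
        }
        in.close();
      } catch (IOException e) {
        throw new RuntimeException("could not read vector", e);
      }
      v = new double[values.size()];
      for (int i = 0; i < v.length; i++) v[i] = values.get(i);
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<IntWritable, DoubleWritable> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\\s+");   // "i j value"
      int i = Integer.parseInt(parts[0]);
      int j = Integer.parseInt(parts[1]);
      double mij = Double.parseDouble(parts[2]);
      out.collect(new IntWritable(i), new DoubleWritable(mij * v[j]));
    }
  }

  // Reducer: sum the partial products for each row to get the i-th result element.
  public static class SumReducer extends MapReduceBase
      implements Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    public void reduce(IntWritable row, Iterator<DoubleWritable> values,
                       OutputCollector<IntWritable, DoubleWritable> out, Reporter reporter)
        throws IOException {
      double sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(row, new DoubleWritable(sum));
    }
  }
}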

On Tue, May 17, 2011 at 1:13 PM, Alexandra Anghelescu 
axanghele...@gmail.com wrote:

 Hi all,

 I was wondering how to go about doing a matrix-vector multiplication using
 hadoop. I have my matrix in one file and my vector in another. All the map
 tasks will need the vector file... basically they need to share it.

 Basically I want my map function to output key-value pairs (i,m[i,j]*v(j)),
 where i is the row number, and j the column number; v(j) is the jth element
 in v. And the reduce function will sum up all the values with the same key
 -
 i, and that will be the ith element of my result vector.

 I don't know how to format the input to do this.. even if I do it in 2 MR
 iterations, first formatting the input and second the actual
 matrix-vector-multiply, I don't have a clear idea.

 If you have any ideas/suggestions I would appreciate it!

 Thanks in advance,
 Alexandra



Re: Suggestions for swapping issue

2011-05-11 Thread Ted Dunning
How is it that 36 processes are not expected if you have configured 48 + 12
= 60 slots available on the machine?

On Wed, May 11, 2011 at 11:11 AM, Adi adi.pan...@gmail.com wrote:

 By our calculations hadoop should not exceed 70% of memory.
 Allocated per node - 48 map slots (24 GB), 12 reduce slots (6 GB), and 1 GB
 each for the DataNode, the TaskTracker, and one JobTracker, totalling 33/34 GB
 of allocation.
 The queues are capped at using only 90% of the capacity allocated, so generally
 10% of slots are always kept free.

 The cluster was running a total of 33 mappers and 1 reducer, so around 8-9
 mappers per node with a 3 GB max limit, and they were utilizing around 2GB each.
 Top was showing 100% memory utilized, which our sys admin says is ok as the
 memory is used for file caching by linux if the processes are not using it.
 No swapping on 3 nodes.
 Then node4 just started swapping after the number of processes shot up
 unexpectedly. The main mystery are these excess number of processes on the
 node which went down. 36 as opposed to expected 11. The other 3 nodes were
 successfully executing the mappers without any memory/swap issues.

 -Adi

 On Wed, May 11, 2011 at 1:40 PM, Michel Segel michael_se...@hotmail.com
 wrote:

  You have to do the math...
  If you have 2gb per mapper, and run 10 mappers per node... That means
 20gb
  of memory.
  Then you have TT and DN running which also take memory...
 
  What did you set as the number of mappers/reducers per node?
 
  What do you see in ganglia or when you run top?
 
  Sent from a remote device. Please excuse any typos...
 
  Mike Segel
 
  On May 11, 2011, at 12:31 PM, Adi adi.pan...@gmail.com wrote:
 
   Hello Hadoop Gurus,
   We are running a 4-node cluster. We just upgraded the RAM to 48 GB. We
  have
   allocated around 33-34 GB per node for hadoop processes. Leaving the
 rest
  of
   the 14-15 GB memory for OS and as buffer. There are no other processes
   running on these nodes.
   Most of the lighter jobs run successfully but one big job is
  de-stabilizing
   the cluster. One node starts swapping and runs out of swap space and
 goes
   offline. We tracked the processes on that node and noticed that it ends
  up
   with more than expected hadoop-java processes.
   The other 3 nodes were running 10 or 11 processes and this node ends up
  with
   36. After killing the job we find these processes still show up and we
  have
   to kill them manually.
   We have tried reducing the swappiness to 6 but saw the same results. It
  also
   looks like hadoop stays well within the memory limits allocated and
 still
   starts swapping.
  
   Some other suggestions we have seen are:
   1) Increase swap size. Current size is 6 GB. The most quoted size is
  'tons
   of swap' but not sure how much it translates to in numbers. Should it
 be
  16
   or 24 GB
   2) Increase overcommit ratio. Not sure if this helps as a few blog
  comments
   mentioned it didn't help
  
   Any other hadoop or linux config suggestions are welcome.
  
   Thanks.
  
   -Adi
 



Re: questions about hadoop map reduce and compute intensive related applications

2011-04-30 Thread Ted Dunning
On Sat, Apr 30, 2011 at 12:18 AM, elton sky eltonsky9...@gmail.com wrote:

 I got 2 questions:

 1. I am wondering how hadoop MR performs when it runs compute intensive
 applications, e.g. computing PI with the Monte Carlo method. There's an example
 in 0.21, QuasiMonteCarlo, but that example doesn't use random numbers and it
 generates pseudo input upfront. If we use distributed random number generation,
 then I guess the performance of hadoop should be similar to a message passing
 framework like MPI. So my guess is that by using the proper method hadoop would
 be good for compute intensive applications compared with MPI.


Not quite sure what algorithms you mean here, but for trivial parallelism,
map-reduce is a fine way to go.

MPI supports node-to-node communications in ways that map-reduce does not,
however, which requires that you iterate map-reduce steps for many
algorithms.   With Hadoop's current implementation, this is horrendously
slow (minimum 20-30 seconds per iteration).

Sometimes you can avoid this by clever tricks.  For instance, random
projection can compute the key step in an SVD decomposition with one
map-reduce while the comparable Lanczos algorithm requires more than one
step per eigenvector (and we often want 100 of them!).

Sometimes, however, there are no known algorithms that avoid the need for
repeated communication.  For these problems, Hadoop as it stands may be a
poor fit.  Help is on the way, however, with the MapReduce 2.0 work because
that will allow much more flexible models of computation.


 2. I am looking for some applications which have large data sets and
 require intensive computation. An application can be divided into a
 workflow, including both map reduce operations and message-passing-like
 operations. For example, in step 1 I use hadoop MR to process 10TB of data
 and generate a small output, say, 10GB. This 10GB can fit into memory and
 is better processed with some interprocess communication, which
 will boost the performance. So in step 2 I will use MPI, etc.


Some machine learning algorithms require features that are much smaller than
the original input.  This leads to exactly the pattern you describe.
 Integrating MPI with map-reduce is currently difficult and/or very ugly,
however.  Not impossible and there are hackish ways to do the job, but they
are hacks.


Re: Serving Media Streaming

2011-04-30 Thread Ted Dunning
Check out S4 http://s4.io/

On Fri, Apr 29, 2011 at 10:13 PM, Luiz Fernando Figueiredo 
luiz.figueir...@auctorita.com.br wrote:

 Hi guys.

 Hadoop is well known for processing large amounts of data, but we think we
 can do much more with it. Our goal is to try to serve pseudo-streaming close to
 what Akamai does (with their platform).

 Does anyone know if any solution for this has already been implemented on top
 of Hadoop?
 Or, are there any real problems with using it that way?

 So far I haven't seen any barrier to applying Hadoop this way, but some people
 suggest just using GlusterFS instead; beyond just storing the multimedia files,
 we want to be able to process some data too.

 Cheers.
 Luiz Fernando



Re: Applications creates bigger output than input?

2011-04-30 Thread Ted Dunning
Cooccurrence analysis is commonly used in recommendations.  Such analyses produce
large intermediates.

Come on over to the Mahout project if you would like to talk to a bunch of
people who work on these problems.

On Fri, Apr 29, 2011 at 9:31 PM, elton sky eltonsky9...@gmail.com wrote:

 Thank you for suggestions:

 Weblog analysis, market basket analysis and generating search index.

 I guess for these applications we need more reducers than mappers, for handling
 the large intermediate output, don't we? Besides, the input split for each map
 should be smaller than usual, because the workload for spill and merge on the
 map's local disk is heavy.

 -Elton

 On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley omal...@apache.org
 wrote:

  On Fri, Apr 29, 2011 at 5:02 AM, elton sky eltonsky9...@gmail.com
 wrote:
 
   For my benchmark purpose, I am looking for some non-trivial, real life
   applications which creates *bigger* output than its input. Trivial
  example
   I
   can think about is cross join...
  
 
  As you say, almost all cross join jobs have that property. The other case
  that almost always fits into that category is generating an index. For
  example, if your input is a corpus of documents and you want to generate
  the
  list of documents that contain each word, the output (and especially the
  shuffle data) is much larger than the input.
 
  -- Owen
 



Re: providing the same input to more than one Map task

2011-04-22 Thread Ted Dunning
I would recommend taking this question to the Mahout mailing list.

The short answer is that matrix multiplication by a column vector is pretty
easy.  Each mapper reads the vector in the configure method and then does a
dot product for each row of the input matrix.  Results are reassembled into
a vector in the reducer.

Mahout has special matrix structures to help with this.
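
For the mechanics of getting the vector to every mapper, here is a rough sketch
using the DistributedCache pointed at in the message below (the HDFS path and the
one-value-per-line file format are illustrative assumptions of the sketch, not
Mahout code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class VectorCacheExample {
  // At job submission time: register the HDFS file holding the vector.
  public static void registerVector(JobConf job) throws IOException {
    DistributedCache.addCacheFile(URI.create("/user/alexandra/vector.txt"), job);
  }

  // In the mapper's configure() method: read the local copy that the
  // framework has already shipped to this node.
  public static double[] loadVector(JobConf job) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(job);
    List<Double> values = new ArrayList<Double>();
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      values.add(Double.parseDouble(line.trim()));   // one component per line (assumed format)
    }
    in.close();
    double[] v = new double[values.size()];
    for (int i = 0; i < v.length; i++) v[i] = values.get(i);
    return v;
  }
}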

On Fri, Apr 22, 2011 at 2:59 PM, Mehmet Tepedelenlioglu 
mehmets...@gmail.com wrote:

 There is a way:


 http://hadoop.apache.org/common/docs/r0.18.3/mapred_tutorial.html#DistributedCache

 Are you working with a sparse matrix, or a full one?


 On Apr 22, 2011, at 2:33 PM, aanghelescu wrote:

 
  Hi all,
 
  I am trying to perform matrix-vector multiplication using Hadoop.
 
  So I have matrix M in a file, and vector v in another file. Obviously,
 files
  are of different sizes. Is it possible to make it so that each Map task
 will
  get the whole vector v and a chunk of matrix M? I know how my map and
 reduce
  functions should look like, but I don't know how to format the input.
 
  Basically I want my map function to output key-value pairs
 (i,m[i,j]*v(j)),
  where i is the row number, and j the column number; v(j) is the jth
 element
  in v. And the reduce function will sum up all the values with the same
 key -
  i, and that will be the ith element of my result vector.
 
  Or can you suggest another way to do it?
 
  Thanks,
  Alexandra
  --
  View this message in context:
 http://old.nabble.com/providing-the-same-input-to-more-than-one-Map-task-tp31459012p31459012.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 




Re: Estimating Time required to compute M/Rjob

2011-04-17 Thread Ted Dunning
Turing completeness isn't the central question here, really.  The truth
is, map-reduce programs are under considerable pressure to be written in a
scalable fashion, which limits them to fairly simple behaviors that
result in pretty linear dependence of run-time on input size for a
given program.

The cool thing about the paper that I linked to the other day is that
there are enough cues about the expected runtime of the program
available to make good predictions *without* looking at the details.
No doubt the estimation facility could make good use of something as
simple as the hash of the jar in question, but even without that it is
possible to produce good estimates.

I suppose that this means that all of us Hadoop programmers are really
just kind of boring folk.  On average, anyway.

On Sun, Apr 17, 2011 at 12:07 PM, Matthew Foley ma...@yahoo-inc.com wrote:
 Since general M/R jobs vary over a huge (Turing problem equivalent!) range of 
 behaviors, a more tractable problem might be to characterize the descriptive 
 parameters needed to answer the question: If the following problem P runs in 
 T0 amount of time on a certain benchmark platform B0, how long T1 will it 
 take to run on a differently configured real-world platform B1 ?



Re: Estimating Time required to compute M/Rjob

2011-04-16 Thread Ted Dunning
Sounds like this paper might help you:

Predicting Multiple Performance Metrics for Queries: Better Decisions
Enabled by Machine Learning, by Archana Ganapathi, Harumi Kuno,
Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, and David
Patterson

http://radlab.cs.berkeley.edu/publication/187

On Sat, Apr 16, 2011 at 1:19 PM, Stephen Boesch java...@gmail.com wrote:

 some additional thoughts about the 'variables' involved in
 characterizing the M/R application itself.


   - the configuration of the cluster for numbers of mappers vs reducers
   compared to the characteristics (amount of work/processing) required in each
   of the map/shuffle/reduce stages


   - is the application using multiple chained M/R stages?  Multi-stage
   M/R's are more difficult to tune properly in terms of keeping all workers
   busy.  That may be challenging to model.

 2011/4/16 Stephen Boesch java...@gmail.com

  You could consider two scenarios / set of requirements for your estimator:
 
 
     1. Allow it to 'learn' from certain input data and then project running
     times of similar (or moderately dissimilar) workloads.  So the first steps
     could be to define a couple of relatively small control M/R jobs on a
     small-ish dataset and throw them at the unknown (cluster-under-test)
     hdfs/M/R cluster.  Try to design the control M/R job in a way that it will
     be able to completely load down all of the available DataNodes in the
     cluster-under-test for at least a brief period of time.  Then you will
     have obtained a decent signal on the capabilities of the cluster under
     test, which may allow a relatively high degree of predictive accuracy for
     even much larger jobs.
     2. If instead it were your goal to drive the predictions off of a
     purely mathematical model - in your terms the application and base file
     system - and without any empirical data - then here is an alternative
     approach.
        - Follow step (1) above against a variety of applications and
        base file systems - especially in configurations for which you wish
        your model to provide high quality predictions.
        - Save the results in structured data.
        - Derive formulas for characterizing the curves of performance via
        those variables that you defined (application / base file system).

  Now you have a trained model.  When it is applied to a new set of
  applications / base file systems it can use the curves you have already
  determined to provide the result without any runtime requirements.

  Obviously the value of this second approach is limited by the degree of
  similarity of the training data to the applications you attempt to model.
  If all of your training data is on a 50 node cluster of machines with
  IDE drives, don't expect good results when asked to model a 1000 node cluster
  using SANs / RAIDs / SCSI.
 
 
  2011/4/16 Sonal Goyal sonalgoy...@gmail.com
 
  What is your MR job doing? What is the amount of data it is processing?
  What
  kind of a cluster do you have? Would you be able to share some details
  about
  what you are trying to do?
 
  If you are looking for metrics, you could look at the Terasort run ..
 
  Thanks and Regards,
  Sonal
  https://github.com/sonalgoyal/hihoHadoop ETL and Data
  Integrationhttps://github.com/sonalgoyal/hiho
  Nube Technologies http://www.nubetech.co
 
  http://in.linkedin.com/in/sonalgoyal
 
 
 
 
 
  On Sat, Apr 16, 2011 at 3:31 PM, real great..
  greatness.hardn...@gmail.comwrote:
 
   Hi,
   As a part of my final year BE final project I want to estimate the time
   required by a M/R job given an application and a base file system.
   Can you folks please help me by posting some thoughts on this issue or
   posting some links here.
  
   --
   Regards,
   R.V.
  
 
 
 


Re: Dynamic Data Sets

2011-04-13 Thread Ted Dunning
Hbase is very good at this kind of thing.

Depending on your aggregation needs OpenTSDB might be interesting since they
store and query against large amounts of time ordered data similar to what
you want to do.

It isn't clear to me whether your data is primarily about current state or
about time-embedded state transitions.  You can easily store both in hbase,
but the arrangements will be a bit different.

On Wed, Apr 13, 2011 at 6:12 PM, Sam Seigal selek...@yahoo.com wrote:

 I have a requirement where I have large sets of incoming data into a
 system I own.

 A single unit of data in this set has a set of immutable attributes +
 state attached to it. The state is dynamic and can change at any time.
 What is the best way to run analytical queries on data of such nature
 ?

 One way is to maintain this data in a separate store, take a snapshot
 in point of time, and then import into the HDFS filesystem for
 analysis using Hadoop Map-Reduce. I do not see this approach scaling,
 since moving data is obviously expensive.
 If i was to directly maintain this data as Sequence Files in HDFS, how
 would updates work ?

 I am new to Hadoop/HDFS , so any suggestions/critique is welcome. I
 know that HBase works around this problem through multi version
 concurrency control techniques. Is that the only option ? Are there
 any alternatives ?

 Also note that all aggregation and analysis I want to do is time based
 i.e. sum of x on pivot y over a day, 2 days, week, month etc. For such
 use cases, is it advisable to use HDFS directly or use systems built
 on top of hadoop like Hive or Hbase ?



Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Kevin,

You present a good discussion of architectural alternatives here.  But my
comment really had more to do with whether a particular HDFS patch would
provide what the original poster seemed to be asking about.  This is
especially pertinent since the patch was intended to scratch a different
itch.

On Tue, Apr 12, 2011 at 5:51 AM, kevin.le...@thomsonreuters.com wrote:

 This is the age old argument of what to share in a partitioned
 environment. IBM and Teradata have always used shared nothing which is
 what only getting one chunk of the file in each hadoop node is doing.
 Oracle has always used shared disk which is not an easy thing to do,
 especially in scale, and seems to have varying results depending on
 application, transaction or dss. Here are a couple of web references.

 http://www.informatik.uni-trier.de/~ley/db/conf/vldb/Bhide88.html

 http://jhingran.typepad.com/anant_jhingrans_musings/2010/02/shared-nothing-vs-shared-disks-the-cloud-sequel.html

 Rather than say shared nothing isn't useful, hadoop should look to how
 others make this work. The two key problems to avoid are data skew where
 one node sees too much data and becomes the slow node, and large
 intra-partition joins where large data is needed from more than one
 partition and potentially gets copied around.

 Rather than hybriding into shared disk, I think hadoop should hybrid
 into the shared data solutions others use, replication of select data,
 for solving intra-partition joins in a shared nothing architecture.
 This may be more database terminology that could be addressed by hbase,
 but I think it is good background for the questions of memory mapping
 files in hadoop.

 Kevin


 -Original Message-
 From: Ted Dunning [mailto:tdunn...@maprtech.com]
 Sent: Tuesday, April 12, 2011 12:09 AM
 To: Jason Rutherglen
 Cc: common-user@hadoop.apache.org; Edward Capriolo
 Subject: Re: Memory mapped resources

 Yes.  But only one such block. That is what I meant by chunk.

 That is fine if you want that chunk but if you want to mmap the entire
 file,
 it isn't real useful.

 On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen 
 jason.rutherg...@gmail.com wrote:

  What do you mean by local chunk?  I think it's providing access to the
  underlying file block?
 
  On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning tdunn...@maprtech.com
  wrote:
   Also, it only provides access to a local chunk of a file which isn't
 very
   useful.
  
   On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo
 edlinuxg...@gmail.com
   wrote:
  
   On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
   jason.rutherg...@gmail.com wrote:
Yes you can however it will require customization of HDFS.  Take
 a
look at HDFS-347 specifically the HDFS-347-branch-20-append.txt
 patch.
 I have been altering it for use with HBASE-3529.  Note that the
 patch
noted is for the -append branch which is mainly for HBase.
   
On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies
bimargul...@gmail.com wrote:
We have some very large files that we access via memory mapping
 in
Java. Someone's asked us about how to make this conveniently
deployable in Hadoop. If we tell them to put the files into
 hdfs, can
we obtain a File for the underlying file on any given node?
   
   
  
    This feature is not yet part of hadoop so doing this is not
   convenient.
  
  
 



Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Well, no.

You could mmap all the blocks that are local to the node your program is on.
 The others you will have to read more conventionally.  If all blocks are
guaranteed local, this would work.  I don't think that guarantee is possible
on a non-trivial cluster.

On Tue, Apr 12, 2011 at 6:32 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 Then one could MMap the blocks pertaining to the HDFS file and piece
 them together.  Lucene's MMapDirectory implementation does just this
 to avoid an obscure JVM bug.

 On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning tdunn...@maprtech.com
 wrote:
  Yes.  But only one such block. That is what I meant by chunk.
  That is fine if you want that chunk but if you want to mmap the entire
 file,
  it isn't real useful.
 
  On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen
  jason.rutherg...@gmail.com wrote:
 
  What do you mean by local chunk?  I think it's providing access to the
  underlying file block?
 
  On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning tdunn...@maprtech.com
  wrote:
   Also, it only provides access to a local chunk of a file which isn't
   very
   useful.
  
   On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo 
 edlinuxg...@gmail.com
   wrote:
  
   On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
   jason.rutherg...@gmail.com wrote:
Yes you can however it will require customization of HDFS.  Take a
look at HDFS-347 specifically the HDFS-347-branch-20-append.txt
patch.
 I have been altering it for use with HBASE-3529.  Note that the
patch
noted is for the -append branch which is mainly for HBase.
   
On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies
bimargul...@gmail.com wrote:
We have some very large files that we access via memory mapping in
Java. Someone's asked us about how to make this conveniently
deployable in Hadoop. If we tell them to put the files into hdfs,
can
we obtain a File for the underlying file on any given node?
   
   
  
    This feature is not yet part of hadoop so doing this is not
    convenient.
  
  
 
 



Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Blocks live where they land when first created.  They can be moved due to
node failure or rebalancing, but it is typically pretty expensive to do
this.  It certainly is slower than just reading the file.

If you really, really want mmap to work, then you need to set up some native
code that builds an mmap'ed region, sets the pages whose corresponding block is
non-local to no access, and maps the pages for local blocks directly onto the
local block files. Then you can intercept the segmentation
violations that occur on page access to non-local data, read that block to
local storage and mmap it into those pages.

This is LOTS of work to get exactly right and must be done in C since Java
can't really handle seg faults correctly.  This pattern is fairly commonly
used in garbage collected languages to allow magical remapping of memory
without explicit tests.

On Tue, Apr 12, 2011 at 8:24 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:


 Interesting.  I'm not familiar with how blocks go local, however I'm
 interested in how to make this occur via a manual oriented call.  Eg,
 is there an option available that guarantees locality, and if not,
 perhaps there's work being done towards that path?


Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Actually, it doesn't become trivial.  It just becomes total fail or total
win instead of almost always being partial win.  It doesn't meet Benson's
need.

On Tue, Apr 12, 2011 at 11:09 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 To get around the chunks or blocks problem, I've been implementing a
 system that simply sets a max block size that is too large for a file
 to reach.  In this way there will only be one block for HDFS file, and
 so MMap'ing or other single file ops become trivial.

 On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies
 bimargul...@gmail.com wrote:
  Here's the OP again.
 
  I want to make it clear that my question here has to do with the
  problem of distributing 'the program' around the cluster, not 'the
  data'. In the case at hand, the issue a system that has a large data
  resource that it needs to do its work. Every instance of the code
  needs the entire model. Not just some blocks or pieces.
 
  Memory mapping is a very attractive tactic for this kind of data
  resource. The data is read-only. Memory-mapping it allows the
  operating system to ensure that only one copy of the thing ends up in
  physical memory.
 
  If we force the model into a conventional file (storable in HDFS) and
  read it into the JVM in a conventional way, then we get as many copies
  in memory as we have JVMs.  On a big machine with a lot of cores, this
  begins to add up.
 
  For people who are running a cluster of relatively conventional
  systems, just putting copies on all the nodes in a conventional place
  is adequate.
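
(For anyone following along, here is a tiny illustrative sketch of the read-only
mapping being discussed, in plain Java. It assumes you already have an ordinary
local path, which is exactly what HDFS does not hand you today; the path shown is
made up.)

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapExample {
  public static void main(String[] args) throws Exception {
    File f = new File("/data/models/big-model.bin");   // illustrative local path
    RandomAccessFile raf = new RandomAccessFile(f, "r");
    FileChannel channel = raf.getChannel();

    // A read-only mapping: every JVM on the box that maps the same file shares
    // the same physical pages, which is the property being described above.
    // Note that a single map() call is limited to 2GB, so very large files need
    // to be covered by several mappings.
    MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

    double first = buf.getDouble(0);   // random access without reading the whole file
    System.out.println("first value: " + first);
    raf.close();
  }
}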
 



Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Benson is actually a pretty sophisticated guy who knows a lot about mmap.

I engaged with him yesterday on this since I know him from Apache.

On Tue, Apr 12, 2011 at 7:16 PM, M. C. Srivas mcsri...@gmail.com wrote:

 I am not sure if you realize, but HDFS is not VM integrated.


Re: Using global reverse lookup tables

2011-04-11 Thread Ted Dunning
Depending on the function that you want to use, it sounds like you want to
use a self join to compute transposed cooccurrence.

That is, it sounds like you want to find all the sets that share elements
with X.  If you have a binary matrix A that represents your set membership
with one row per set and one column per element, then you want to compute A
A'.   It is common for A to be available in row-major form or in the form of
pairs.  With row major form, the easiest way to compute your desired result
is to transpose A in a first map-reduce.  With either the transposed matrix
a second map-reduce can be used in which the mapper reads all of the sets
with a particular element and emits pairs of sets that have a common
element.  The combiner and reducer are basically a pair counter.

This implementation suffers in that it takes time quadratic in the density
of the most dense row.  It is common to downsample such rows to a reasonable
level.  Most uses of the cooccurrence matrix don't care about this
downsampling, and it makes the algorithm much faster.

As an alternative, you can compute a matrix decomposition and use that to
compute A A'.  This can be arranged so as to avoid the downsampling, but the
program required is much more complex.  The Apache Mahout project has
several implementations of such decompositions tuned for different
situations.  Some implementations use map-reduce, some are sequential.
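
To make the second map-reduce above concrete, here is a rough sketch (it assumes
the transposed matrix is stored as text lines of the form "element set1 set2 ...",
which is an assumption of the sketch rather than anything Mahout-specific):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Cooccurrence {
  // Mapper: one input line per element, listing the sets that contain it.
  // Emit every ordered pair of distinct sets that share this element.
  public static class PairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().split("\\s+");
      // fields[0] is the element; fields[1..] are set ids.  This inner loop is
      // where the quadratic cost in row density shows up, hence the downsampling.
      for (int a = 1; a < fields.length; a++) {
        for (int b = 1; b < fields.length; b++) {
          if (a != b) {
            out.collect(new Text(fields[a] + "\t" + fields[b]), ONE);
          }
        }
      }
    }
  }

  // Combiner and reducer: just sum the counts for each pair of sets.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text pair, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(pair, new LongWritable(sum));
    }
  }
}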


On Mon, Apr 11, 2011 at 9:30 AM, W.P. McNeill bill...@gmail.com wrote:

 Are there general approaches to solving this problem?


Re: Memory mapped resources

2011-04-11 Thread Ted Dunning
Also, it only provides access to a local chunk of a file which isn't very
useful.

On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
  Yes you can however it will require customization of HDFS.  Take a
  look at HDFS-347 specifically the HDFS-347-branch-20-append.txt patch.
   I have been altering it for use with HBASE-3529.  Note that the patch
  noted is for the -append branch which is mainly for HBase.
 
  On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies bimargul...@gmail.com
 wrote:
  We have some very large files that we access via memory mapping in
  Java. Someone's asked us about how to make this conveniently
  deployable in Hadoop. If we tell them to put the files into hdfs, can
  we obtain a File for the underlying file on any given node?
 
 

  This feature is not yet part of hadoop so doing this is not convenient.



Re: Memory mapped resources

2011-04-11 Thread Ted Dunning
Yes.  But only one such block. That is what I meant by chunk.

That is fine if you want that chunk but if you want to mmap the entire file,
it isn't real useful.

On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 What do you mean by local chunk?  I think it's providing access to the
 underlying file block?

 On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning tdunn...@maprtech.com
 wrote:
  Also, it only provides access to a local chunk of a file which isn't very
  useful.
 
  On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com
  wrote:
 
  On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
  jason.rutherg...@gmail.com wrote:
   Yes you can however it will require customization of HDFS.  Take a
   look at HDFS-347 specifically the HDFS-347-branch-20-append.txt patch.
I have been altering it for use with HBASE-3529.  Note that the patch
   noted is for the -append branch which is mainly for HBase.
  
   On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies
   bimargul...@gmail.com wrote:
   We have some very large files that we access via memory mapping in
   Java. Someone's asked us about how to make this conveniently
   deployable in Hadoop. If we tell them to put the files into hdfs, can
   we obtain a File for the underlying file on any given node?
  
  
 
   This feature is not yet part of hadoop so doing this is not
  convenient.
 
 



Re: Architectural question

2011-04-10 Thread Ted Dunning
There are no subtle ways to deal with quadratic problems like this.  They
just don't scale.

Your suggestions are roughly on course.  When matching 10GB against 50GB,
the choice of which input to use as input to the mapper depends a lot on how
much you can buffer in memory and how long such a buffer takes to build.

If you can't store the entire 10GB of data in memory at once, then consider
a program like this:

a) split the 50GB of data across as many mappers as you have using standard
methods

b) in the mapper, emit each record several times with keys of the form (i,
j) where i cycles through [0,n) and is incremented once for each record read
and j cycles through [0, m) and is incremented each time you emit a record.
 Choose m so that 1/m of your 10GB data will fit in your reducer's memory.
 Choose n so that n x m is as large as your desired number of reducers.

c) in the reducer, you will get some key (i,j) and an iterator for a number
of records.   Read the j-th segment of your 10GB data and compare each of
the records that the iterator gives you to that data.  If you made n = 1 in
step (b), then you will have at most m-way parallelism in this step.  If n
is large, however, your reducer may need to read the same segment of your
10GB data more than once.  In such conditions you may want to sort the
records and remember which segment you have already read.

In general, though, as I mentioned this is not a scalable process and as
your data grows it is likely to become untenable.

If you can split your data into pieces and estimate which piece each record
should be matched to then you might be able to make the process more
scalable.  Consider indexing techniques to do this rough targeting.  For
instance, if you are trying to find the closes few strings based on edit
distance, you might be able to use n-grams to get approximate matches via a
text retrieval index.  This can substantially reduce the cost of your
algorithm.
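
As a rough illustration of step (b), here is a sketch of the replicating mapper
(text records and a composite key encoded as Text are assumptions of the sketch,
and the property names are made up):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Replicate each record of the 50GB input m times, tagging it with a composite
// (i, j) key.  i spreads records across reducers; j names the segment of the
// 10GB side data that the reducer should compare against.
public class ReplicatingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private int n;                 // how many ways to spread records for extra parallelism
  private int m;                 // how many segments the 10GB side data is cut into
  private long recordCount = 0;

  @Override
  public void configure(JobConf job) {
    // "crossjoin.n" and "crossjoin.m" are made-up property names for this sketch.
    n = job.getInt("crossjoin.n", 1);
    m = job.getInt("crossjoin.m", 20);
  }

  public void map(LongWritable offset, Text record,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    int i = (int) (recordCount++ % n);
    for (int j = 0; j < m; j++) {
      out.collect(new Text(i + "," + j), record);
    }
  }
}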

On Sun, Apr 10, 2011 at 2:10 PM, oleksiy gayduk.a.s...@mail.ru wrote:

 ... Persistent data and input data don't have common keys.

 In my cluster I have 5 data nodes.
 The app does a simple match of every line of input data with every line of
 persistent data.

 ...
 And maybe there is a more subtle way in hadoop to do this work?



Re: Architectural question

2011-04-10 Thread Ted Dunning
The original poster said that there was no common key.  Your suggestion
presupposes that such a key exists.

On Sun, Apr 10, 2011 at 4:29 PM, Mehmet Tepedelenlioglu 
mehmets...@gmail.com wrote:

 My understanding is you have two sets of strings S1, and S2 and you want to
 mark all strings that
 belong to both sets. If this is correct, then:

 Mapper: for all strings K in Si (i is 1 or 2) emit: key K and value i.
 Reducer: For key K, if the list of values includes both 1 and 2, you have a
 match, emit: K MATCH, else emit: K NO_MATCH (or nothing).

 I assume that the load is not terribly unbalanced. The logic goes for
 intersection of any number of sets. Mark the members with their sets, reduce
 over them to see if they belong to every set.

 Good luck.


 On Apr 10, 2011, at 2:10 PM, oleksiy wrote:

 
  Hi all,
  I have some architectural question.
   For my app I have 50 GB of persistent data, which is stored in HDFS; the data
   is a simple CSV format file.
   Also, for my app which should be run over this (50 GB) data I have 10 GB of
   input data, also in CSV format.
   Persistent data and input data don't have common keys.

   In my cluster I have 5 data nodes.
   The app does a simple match of every line of input data with every line of
   persistent data.
 
   For solving this task I see two different approaches:
   1. Distribute the input file to every node using the -files attribute, and
   run the job.
   But in this case every map will go through the 10 GB of input data.
   2. Divide the input file (10 GB) into 5 parts (for instance), run 5 independent
   jobs (one per data node for instance), and for every job we will put 2 GB of
   data. In this case every map should go through 2 GB of data. In other words
   I'll give every map node its own input data. But the drawback of this approach
   is the work which I should do before the job starts and after the job finishes.

   And maybe there is a more subtle way in hadoop to do this work?
 
  --
  View this message in context:
 http://old.nabble.com/Architectural-question-tp31365863p31365863.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 




Re: We are looking to the root of the problem that caused us IOException

2011-04-06 Thread Ted Dunning
yes.  At least periodically.

You now have a situation where the age distribution of blocks in each
datanode is quite different.  This will lead to different evolution of which
files are retained and that is likely to cause imbalances again.  It will
also cause the performance of your system to be degraded since any given
program probably will have a non-uniform distribution of content in your
cluster.

Over time, this effect will probably decrease unless you keep your old files
forever.

On Tue, Apr 5, 2011 at 11:34 PM, Guy Doulberg guy.doulb...@conduit.comwrote:

 Do you think we should always run the balance, with low bandwidth, and not
 only after adding new nodes?



Re: We are looking to the root of the problem that caused us IOException

2011-04-05 Thread Ted Dunning
You can configure the balancer to use higher bandwidth.  That can speed it
up by 10x

On Tue, Apr 5, 2011 at 2:54 AM, Guy Doulberg guy.doulb...@conduit.comwrote:

 We are running the balancer, but it takes a lot of time... during this time the
 cluster is not working



Re: HBase schema design

2011-04-04 Thread Ted Dunning
The hbase list would be more appropriate.

See http://hbase.apache.org/mail-lists.html

There is an active IRC channel, but
your question fits the mailing list better so pop on over and I will give
you some comments.

In the meantime, take a look at OpenTSDB who are doing something very much
like what you want to do.

On Mon, Apr 4, 2011 at 8:43 AM, Miguel Costa miguel-co...@telecom.ptwrote:

 Hi,



 I need some help to a schema design on HBase.



 I have 5 dimensions (Time,Site,Referrer Keyword,Country).

 My row key is Site+Time.



 Now I want to answer questions like: what is the top Referrer by
 Keyword for a site over a period of time?

 Basically I want to cross all the dimensions that I have. And what if I have 30
 dimensions?

 What is the best schema design?



 Please let me know  if this isn’t the right mailing list.



 Thank you for your time.



 Miguel




Re: Reverse Indexing Programming Help

2011-03-31 Thread Ted Dunning
It would help to get a good book.  There are several.

For your program, there are several things that will trip you up:

a) lots of little files is going to be slow.  You want input that is 100MB
per file if you want speed.

b) That file format is a bit cheesy since it is hard to tell URL's from text
if you concatenate lots of files.  Better to use a format like protobufs or
Avro or even sequence files to separate the key and the data unambiguously.

c) I suspect that what you are asking for is to run a mapper so that each
invocation of map gets the URL as key and the text as data.  That map
invocation can then tokenize the data and emit records with the URL as key
and each word as data.  That isn't much use since the reducer will get the
URL and all the words that were emitted for that URL.  If each URL appears
exactly once, then the input already had that.  Perhaps you mean to emit the
word as key and URL as data.  Then the reducer will see the word as key and
an iterator over all the URLs that mentioned the word.
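
To make (c) concrete, here is a rough sketch in the old mapred API (it assumes the
two-line files have already been collapsed into single lines of the form
"url<TAB>text", since the stock TextInputFormat hands a mapper one line at a time):

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertedIndex {
  // Mapper: for each word on the page, emit (word, url).
  public static class WordUrlMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) return;                    // skip malformed records
      Text url = new Text(parts[0]);
      StringTokenizer itr = new StringTokenizer(parts[1]);
      while (itr.hasMoreTokens()) {
        out.collect(new Text(itr.nextToken().toLowerCase()), url);
      }
    }
  }

  // Reducer: gather the distinct URLs that mention each word.
  public static class UrlListReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterator<Text> urls,
                       OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      Set<String> distinct = new HashSet<String>();
      while (urls.hasNext()) distinct.add(urls.next().toString());
      StringBuilder list = new StringBuilder();
      for (String u : distinct) list.append(u).append(' ');
      out.collect(word, new Text(list.toString().trim()));
    }
  }
}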

On Thu, Mar 31, 2011 at 9:48 PM, DoomUs ddu...@nmt.edu wrote:


 I'm just starting out using Hadoop.  I've looked through the java examples,
 and have an idea about what's going on, but don't really get it.

 I'd like to write a program that takes a directory of files.  Contained in
 those files are a URL to a website on the first line, and the second line
 is
 the TEXT from that website.

 The mapper should create a map for each word in the text to that URL, so
 every word found on the website would map to the URL.

 The reducer then, would collect all of the URLs that are mapped to via a
 given word.

 Each Word-URL is then written to a file.

 So, it's simple as a program designed to run on a single system, but I
 want to be able to distribute the computation and whatnot using Hadoop.

 I'm extremely new to Hadoop,  I'm not even sure how to ask all of the
 questions I'd like answers for, I have zero experience in MapReduce, and
 limited experience in functional programming at all.  Any programming tips,
 or if I have my Mapper or Reducer defined incorrectly, corrections, etc
 would be greatly appreciated.

 Questions:
 How do I read (and write) files from hdfs?
 Once I've read them, How do I distribute the files to be mapped?
 I know I need a class to implement the mapper, and one to implement the
 reducer, but how does the class have a return type to output the map?

 Thanks a lot for your help.
 --
 View this message in context:
 http://old.nabble.com/Reverse-Indexing-Programming-Help-tp31292449p31292449.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Chukwa - Lightweight agents

2011-03-20 Thread Ted Dunning
Take a look at openTsDb at http://opentsdb.net/

It provides lots of the capability in a MUCH simpler package.

On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void@gmail.com wrote:

 Sorry but it doesn't look like Chukwa mailing list exists anymore?

 Is there an easy way to set up lightweight agents on cluster of machines
 instead of downloading the full Chukwa source (+50mb)?

 Has anyone build separate RPM's for the agents/collectors?

 Thanks



Re: Chukwa - Lightweight agents

2011-03-20 Thread Ted Dunning
OpenTSDB is purely a monitoring solution, which is the primary mission of
chukwa.

If you are looking for data import, what about Flume?

On Sun, Mar 20, 2011 at 9:59 AM, Mark static.void@gmail.com wrote:

  Thanks but we need Chukwa to aggregate and store files from across our app
 servers into Hadoop. Doesn't really look like opentsdb is meant for that. I
 could be wrong though?


 On 3/20/11 9:49 AM, Ted Dunning wrote:

 Take a look at openTsDb at http://opentsdb.net/

  It provides lots of the capability in a MUCH simpler package.

 On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void@gmail.com wrote:

 Sorry but it doesn't look like Chukwa mailing list exists anymore?

 Is there an easy way to set up lightweight agents on cluster of machines
 instead of downloading the full Chukwa source (+50mb)?

 Has anyone build separate RPM's for the agents/collectors?

 Thanks





Re: Inserting many small files into HBase

2011-03-20 Thread Ted Dunning
Take a look at this:

http://wiki.apache.org/hadoop/Hbase/DesignOverview

then read the bigtable paper.

On Sun, Mar 20, 2011 at 6:39 PM, edward choi mp2...@gmail.com wrote:

 Hi,

 I'm planning to crawl thousands of news rss feeds via MapReduce, and save
 each news article into HBase directly.

 My concern is that Hadoop does not work well with a large number of
 small-size files,

 and if I insert every single news article (which is small-size apparently)
 into HBase, (without separately storing it into HDFS)

 I might end up with millions of files that are only several kilobytes in
 size.

 Or does HBase somehow automatically append each news article into a single
 file, so that it would have only a few files of large-size?

 Ed



Re: decommissioning node woes

2011-03-19 Thread Ted Dunning
Unfortunately this doesn't help much because it is hard to get the ports to
balance the load.

On Fri, Mar 18, 2011 at 8:30 PM, Michael Segel michael_se...@hotmail.comwrote:

 With a 1GBe port, you could go 100Mbs for the bandwidth limit.
 If you bond your ports, you could go higher.



Re: how to build kmeans

2011-03-18 Thread Ted Dunning
These java files are full of HTML.

Are you sure that they are supposed to compile?  How did you get these
files?

On Fri, Mar 18, 2011 at 3:12 AM, MANISH SINGLA coolmanishh...@gmail.comwrote:

 Hii everyone...

 I am trying to run kmeans on a single node... I have the attached
 files with me... I put them in a folder named kmeans... and place it in
 the src/examples directory...
 Then I go to the hadoop home and type 'ant examples' on the command line...
 but I am getting some errors... which I am not able to understand... any
 help would be appreciated...


 ON COMMAND LINE:  ant examples //tHIS i TYPE ON THE COMMAND LINE
 OUTPUT: //AS A result I get all this
   Buildfile: build.xml

 clover.setup:

 clover.info:
 [echo]
 [echo]  Clover not found. Code coverage reports disabled.
 [echo]

 clover:

 ivy-download:
  [get] Getting:

 http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.0.0-rc2/ivy-2.0.0-rc2.jar
  [get] To: /home/master/Desktop/hadoop-0.20.2/ivy/ivy-2.0.0-rc2.jar
  [get] Not modified - so not downloaded

 ivy-init-dirs:

 ivy-probe-antlib:

 ivy-init-antlib:

 ivy-init:
 [ivy:configure] :: Ivy 2.0.0-rc2 - 20081028224207 ::
 http://ant.apache.org/ivy/ ::
 :: loading settings :: file =
 /home/master/Desktop/hadoop-0.20.2/ivy/ivysettings.xml

 ivy-resolve-common:
 [ivy:resolve] :: resolving dependencies ::
 org.apache.hadoop#Hadoop;working@master
 [ivy:resolve]   confs: [common]
 [ivy:resolve]   found commons-logging#commons-logging;1.0.4 in maven2
 [ivy:resolve]   found log4j#log4j;1.2.15 in maven2
 [ivy:resolve]   found commons-httpclient#commons-httpclient;3.0.1 in maven2
 [ivy:resolve]   found commons-codec#commons-codec;1.3 in maven2
 [ivy:resolve]   found commons-cli#commons-cli;1.2 in maven2
 [ivy:resolve]   found xmlenc#xmlenc;0.52 in maven2
 [ivy:resolve]   found net.java.dev.jets3t#jets3t;0.6.1 in maven2
 [ivy:resolve]   found commons-net#commons-net;1.4.1 in maven2
 [ivy:resolve]   found org.mortbay.jetty#servlet-api-2.5;6.1.14 in maven2
 [ivy:resolve]   found oro#oro;2.0.8 in maven2
 [ivy:resolve]   found org.mortbay.jetty#jetty;6.1.14 in maven2
 [ivy:resolve]   found org.mortbay.jetty#jetty-util;6.1.14 in maven2
 [ivy:resolve]   found tomcat#jasper-runtime;5.5.12 in maven2
 [ivy:resolve]   found tomcat#jasper-compiler;5.5.12 in maven2
 [ivy:resolve]   found commons-el#commons-el;1.0 in maven2
 [ivy:resolve]   found junit#junit;3.8.1 in maven2
 [ivy:resolve]   found commons-logging#commons-logging-api;1.0.4 in maven2
 [ivy:resolve]   found org.slf4j#slf4j-api;1.4.3 in maven2
 [ivy:resolve]   found org.eclipse.jdt#core;3.1.1 in maven2
 [ivy:resolve]   found org.slf4j#slf4j-log4j12;1.4.3 in maven2
 [ivy:resolve]   found org.mockito#mockito-all;1.8.0 in maven2
 [ivy:resolve] :: resolution report :: resolve 612ms :: artifacts dl 24ms

  -
|  |modules||   artifacts
 |
|   conf   | number| search|dwnlded|evicted||
 number|dwnlded|

  -
|  common  |   21  |   0   |   0   |   0   ||   21  |   0
 |

  -

 ivy-retrieve-common:
 [ivy:retrieve] :: retrieving :: org.apache.hadoop#Hadoop
 [ivy:retrieve]  confs: [common]
 [ivy:retrieve]  0 artifacts copied, 21 already retrieved (0kB/16ms)
 No ivy:settings found for the default reference 'ivy.instance'.  A
 default instance will be used
 DEPRECATED: 'ivy.conf.file' is deprecated, use 'ivy.settings.file' instead
 :: loading settings :: file =
 /home/master/Desktop/hadoop-0.20.2/ivy/ivysettings.xml

 init:
[touch] Creating /tmp/null1787905831
   [delete] Deleting: /tmp/null1787905831
 [exec] src/saveVersion.sh: 34: svn: not found
 [exec] src/saveVersion.sh: 34: svn: not found

 record-parser:

 compile-rcc-compiler:

 compile-core-classes:

 compile-mapred-classes:
[javac] Compiling 1 source file to
 /home/master/Desktop/hadoop-0.20.2/build/classes

 compile-hdfs-classes:
[javac] Compiling 4 source files to
 /home/master/Desktop/hadoop-0.20.2/build/classes

 compile-core-native:

 check-c++-makefiles:

 create-c++-pipes-makefile:

 create-c++-utils-makefile:

 compile-c++-utils:

 compile-c++-pipes:

 compile-c++:

 compile-core:

 jar:
  [tar] Nothing to do:
 /home/master/Desktop/hadoop-0.20.2/build/classes/bin.tgz is up to
 date.
  [jar] Building jar:
 /home/master/Desktop/hadoop-0.20.2/build/hadoop-0.20.3-dev-core.jar

 compile-tools:

 create-c++-examples-pipes-makefile:

 compile-c++-examples-pipes:

 compile-c++-examples:

 compile-examples:
[javac] Compiling 9 source files to
 /home/master/Desktop/hadoop-0.20.2/build/examples
[javac]
 /home/master/Desktop/hadoop-0.20.2/src/examples/org/apache/hadoop/examples/kmeans/KmeansUtil.java:5:
 class, interface, or enum expected
[javac] !DOCTYPE html
[javac] ^

Re: how to build kmeans

2011-03-18 Thread Ted Dunning
This looks like you took the code from
http://code.google.com/p/kmeans-hadoop/

And it looks like you didn't
actually download the code, but you cut and pasted the HTML rendition of the
code.

First, this code is not from a serious project.  It is more of a status dump
of somebody just trying some stuff out.

If you really want k-means, go get Mahout.  http://mahout.apache.org/

On Fri, Mar 18, 2011 at 3:12 AM, MANISH SINGLA coolmanishh...@gmail.comwrote:

 Hii everyone...

 I am trying to run kmeans on a single node... I have the attached
 files with me... I put them in a folder named kmeans... and place it in
 the src/examples directory...
 Then I go to the hadoop home and type 'ant examples' on the command line...
 but I am getting some errors... which I am not able to understand... any
 help would be appreciated...


 ON COMMAND LINE:  ant examples //tHIS i TYPE ON THE COMMAND LINE
 OUTPUT: //AS A result I get all this
   Buildfile: build.xml

 clover.setup:

 clover.info:
 [echo]
 [echo]  Clover not found. Code coverage reports disabled.
 [echo]

 clover:

 ivy-download:
  [get] Getting:

 http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.0.0-rc2/ivy-2.0.0-rc2.jar
  [get] To: /home/master/Desktop/hadoop-0.20.2/ivy/ivy-2.0.0-rc2.jar
  [get] Not modified - so not downloaded

 ivy-init-dirs:

 ivy-probe-antlib:

 ivy-init-antlib:

 ivy-init:
 [ivy:configure] :: Ivy 2.0.0-rc2 - 20081028224207 ::
 http://ant.apache.org/ivy/ ::
 :: loading settings :: file =
 /home/master/Desktop/hadoop-0.20.2/ivy/ivysettings.xml

 ivy-resolve-common:
 [ivy:resolve] :: resolving dependencies ::
 org.apache.hadoop#Hadoop;working@master
 [ivy:resolve]   confs: [common]
 [ivy:resolve]   found commons-logging#commons-logging;1.0.4 in maven2
 [ivy:resolve]   found log4j#log4j;1.2.15 in maven2
 [ivy:resolve]   found commons-httpclient#commons-httpclient;3.0.1 in maven2
 [ivy:resolve]   found commons-codec#commons-codec;1.3 in maven2
 [ivy:resolve]   found commons-cli#commons-cli;1.2 in maven2
 [ivy:resolve]   found xmlenc#xmlenc;0.52 in maven2
 [ivy:resolve]   found net.java.dev.jets3t#jets3t;0.6.1 in maven2
 [ivy:resolve]   found commons-net#commons-net;1.4.1 in maven2
 [ivy:resolve]   found org.mortbay.jetty#servlet-api-2.5;6.1.14 in maven2
 [ivy:resolve]   found oro#oro;2.0.8 in maven2
 [ivy:resolve]   found org.mortbay.jetty#jetty;6.1.14 in maven2
 [ivy:resolve]   found org.mortbay.jetty#jetty-util;6.1.14 in maven2
 [ivy:resolve]   found tomcat#jasper-runtime;5.5.12 in maven2
 [ivy:resolve]   found tomcat#jasper-compiler;5.5.12 in maven2
 [ivy:resolve]   found commons-el#commons-el;1.0 in maven2
 [ivy:resolve]   found junit#junit;3.8.1 in maven2
 [ivy:resolve]   found commons-logging#commons-logging-api;1.0.4 in maven2
 [ivy:resolve]   found org.slf4j#slf4j-api;1.4.3 in maven2
 [ivy:resolve]   found org.eclipse.jdt#core;3.1.1 in maven2
 [ivy:resolve]   found org.slf4j#slf4j-log4j12;1.4.3 in maven2
 [ivy:resolve]   found org.mockito#mockito-all;1.8.0 in maven2
 [ivy:resolve] :: resolution report :: resolve 612ms :: artifacts dl 24ms

  -
|  |modules||   artifacts
 |
|   conf   | number| search|dwnlded|evicted||
 number|dwnlded|

  -
|  common  |   21  |   0   |   0   |   0   ||   21  |   0
 |

  -

 ivy-retrieve-common:
 [ivy:retrieve] :: retrieving :: org.apache.hadoop#Hadoop
 [ivy:retrieve]  confs: [common]
 [ivy:retrieve]  0 artifacts copied, 21 already retrieved (0kB/16ms)
 No ivy:settings found for the default reference 'ivy.instance'.  A
 default instance will be used
 DEPRECATED: 'ivy.conf.file' is deprecated, use 'ivy.settings.file' instead
 :: loading settings :: file =
 /home/master/Desktop/hadoop-0.20.2/ivy/ivysettings.xml

 init:
[touch] Creating /tmp/null1787905831
   [delete] Deleting: /tmp/null1787905831
 [exec] src/saveVersion.sh: 34: svn: not found
 [exec] src/saveVersion.sh: 34: svn: not found

 record-parser:

 compile-rcc-compiler:

 compile-core-classes:

 compile-mapred-classes:
[javac] Compiling 1 source file to
 /home/master/Desktop/hadoop-0.20.2/build/classes

 compile-hdfs-classes:
[javac] Compiling 4 source files to
 /home/master/Desktop/hadoop-0.20.2/build/classes

 compile-core-native:

 check-c++-makefiles:

 create-c++-pipes-makefile:

 create-c++-utils-makefile:

 compile-c++-utils:

 compile-c++-pipes:

 compile-c++:

 compile-core:

 jar:
  [tar] Nothing to do:
 /home/master/Desktop/hadoop-0.20.2/build/classes/bin.tgz is up to
 date.
  [jar] Building jar:
 /home/master/Desktop/hadoop-0.20.2/build/hadoop-0.20.3-dev-core.jar

 compile-tools:

 create-c++-examples-pipes-makefile:

 compile-c++-examples-pipes:

 compile-c++-examples:

Re: decommissioning node woes

2011-03-18 Thread Ted Dunning
If nobody else more qualified is willing to jump in, I can at least provide
some pointers.

What you describe is a bit surprising.  I have zero experience with any 0.21
version, but decommissioning was working well
in much older versions, so this would be a surprising regression.

The observations you have aren't at all inconsistent with how decommissioning
should work.  The fact that your nodes still show up as live
after starting the decommissioning isn't so strange.  The idea is that no
new data will be put on the node, nor should it be
counted as a replica, but it will help in reading data.

So that isn't such a big worry.

The fact that it takes forever and a day, however, is a big worry.  I cannot
provide any help there just off hand.

What happens when a datanode goes down?  Do you see under-replicated files?
 Does the number of such files decrease over time?

On Fri, Mar 18, 2011 at 4:23 AM, Rita rmorgan...@gmail.com wrote:

 Any help?


 On Wed, Mar 16, 2011 at 9:36 PM, Rita rmorgan...@gmail.com wrote:

  Hello,
 
  I have been struggling with decommissioning data  nodes. I have a 50+
 data
  node cluster (no MR) with each server holding about 2TB of storage. I
 split
  the nodes into 2 racks.
 
 
  I edit the 'exclude' file and then do a -refreshNodes. I see the node
  immediately in 'Decommissioned node' and I also see it as a 'live' node!
  Even though I wait 24+ hours it's still like this. I am suspecting it's a
 bug
  in my version.  The data node process is still running on the node I am
  trying to decommission. So, sometimes I kill -9 the process and I see the
  'under replicated' blocks...this can't be the normal procedure.
 
  There were even times that I had corrupt blocks because I was impatient
 --
  waited 24-34 hours
 
  I am using 23 August, 2010: release 0.21.0 
 http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available
 
   version.
 
  Is this a known bug? Is there anything else I need to do to decommission
 a
  node?
 
 
 
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 



 --
 --- Get your facts first, then you can distort them as you please.--



Re: decommissioning node woes

2011-03-18 Thread Ted Dunning
Unless the last copy is on that node.

Decommissioning is the only safe way to shut off 10 nodes at once.  Doing
them one at a time and waiting for replication to (asymptotically) recover
is painful and error prone.

On Fri, Mar 18, 2011 at 9:08 AM, James Seigel ja...@tynt.com wrote:

 Just a note.  If you just shut the node off, the blocks will replicate
 faster.

 James.


 On 2011-03-18, at 10:03 AM, Ted Dunning wrote:

  If nobody else more qualified is willing to jump in, I can at least
 provide
  some pointers.
 
  What you describe is a bit surprising.  I have zero experience with any
 0.21
  version, but decommissioning was working well
  in much older versions, so this would be a surprising regression.
 
  The observations you have aren't at all inconsistent with how
 decommissioning
  should work.  The fact that your nodes still show up as live
  after starting the decommissioning isn't so strange.  The idea is that no
  new data will be put on the node, nor should it be
  counted as a replica, but it will help in reading data.
 
  So that isn't such a big worry.
 
  The fact that it takes forever and a day, however, is a big worry.  I
 cannot
  provide any help there just off hand.
 
  What happens when a datanode goes down?  Do you see under-replicated
 files?
  Does the number of such files decrease over time?
 
  On Fri, Mar 18, 2011 at 4:23 AM, Rita rmorgan...@gmail.com wrote:
 
  Any help?
 
 
  On Wed, Mar 16, 2011 at 9:36 PM, Rita rmorgan...@gmail.com wrote:
 
  Hello,
 
  I have been struggling with decommissioning data  nodes. I have a 50+
  data
  node cluster (no MR) with each server holding about 2TB of storage. I
  split
  the nodes into 2 racks.
 
 
  I edit the 'exclude' file and then do a -refreshNodes. I see the node
   immediately in 'Decommissioned node' and I also see it as a 'live' node!
   Even though I wait 24+ hours it's still like this. I am suspecting it's a
  bug
  in my version.  The data node process is still running on the node I am
  trying to decommission. So, sometimes I kill -9 the process and I see
 the
  'under replicated' blocks...this can't be the normal procedure.
 
  There were even times that I had corrupt blocks because I was impatient
  --
  waited 24-34 hours
 
  I am using 23 August, 2010: release 0.21.0 
 
 http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available
 
  version.
 
  Is this a known bug? Is there anything else I need to do to
 decommission
  a
  node?
 
 
 
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--
 




Re: hadoop fs -rmr /*?

2011-03-16 Thread Ted Dunning
W.P. is correct, however, that standard techniques like snapshots and mirrors
and point in time backups do not exist in standard hadoop.

This requires a variety of creative work-arounds if you use stock hadoop.

It is not uncommon for people to have memories of either removing everything
themselves or of somebody close to them doing the same thing.

Few people have memories of doing it twice.

On Wed, Mar 16, 2011 at 11:20 AM, David Rosenstrauch dar...@darose.net wrote:

 On 03/16/2011 01:35 PM, W.P. McNeill wrote:

 On HDFS, anyone can run hadoop fs -rmr /* and delete everything.


 Not sure how you have your installation set but on ours (we installed
 Cloudera CDH), only user hadoop has full read/write access to HDFS. Since
 we rarely either login as user hadoop, or run jobs as that user, this forces
 us to explicitly set and chown directory trees in HDFS that only specific
 users can access, thus enforcing file read/write restrictions.

 HTH,

 DR



Re: Why hadoop is written in java?

2011-03-16 Thread Ted Dunning
Note that that comment is now 7 years old.

See Mahout for a more modern take on numerics using Hadoop (and other tools)
for scalable machine learning and data mining.

On Wed, Mar 16, 2011 at 10:43 AM, baloodevil dukek...@hotmail.com wrote:

 See this for comment on java handling numeric calculations like sparse
 matrices...
 http://acs.lbl.gov/software/colt/



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Why-hadoop-is-written-in-java-tp1673148p2688781.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: k-means

2011-03-04 Thread Ted Dunning
Since you asked so nicely:

http://www.manning.com/owen/

On Fri, Mar 4, 2011 at 6:52 AM, Mike Nute mike.n...@gmail.com wrote:

 James,

 Do you know how to get a copy of this book in early access form? Amazon
 doesn't release it until may.  Thanks!

 Mike Nute
 --Original Message--
 From: James Seigel
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: Re: k-means
 Sent: Mar 4, 2011 9:46 AM

 I am not near a computer so I won't be able to give you specifics.  So
 instead, I'd suggest Manning's mahout in action book which is in
 their early access form for some basic direction.

 Disclosure: I have no relation to the publisher or authors.

 Cheers
 James

 Sent from my mobile. Please excuse the typos.

 On 2011-03-04, at 7:37 AM, MANISH SINGLA coolmanishh...@gmail.com wrote:

  are u suggesting me that???  if yes can u plzzz tell me the steps to
  use that...because I havent used it yet...a quick reply will really be
  appreciated...
  Thanx
  Manish
 
  On Fri, Mar 4, 2011 at 7:39 PM, James Seigel ja...@tynt.com wrote:
  Mahout project?
 
  Sent from my mobile. Please excuse the typos.
 
  On 2011-03-04, at 6:41 AM, MANISH SINGLA coolmanishh...@gmail.com
 wrote:
 
  Hey ppl...
  I need some serious help...I m not able to run kmeans code in
  hadoop...does anyone have a running code...that they would have
  tried...
 
  Regards
  MANISH
 




Re: Digital Signal Processing Library + Hadoop

2011-03-04 Thread Ted Dunning
Come on over to the Apache Mahout mailing list for a warm welcome at least.

We don't have a lot of time series stuff but would be very interested in
hearing more about what you need and would like to see if there are some
common issues that we might work on together.

On Fri, Mar 4, 2011 at 9:05 PM, Roger Smith rogersmith1...@gmail.com wrote:

 All -
 I wonder if any of you have integrated a DSP library with Hadoop.
 We are considering using Hadoop to process time series data, but don't
 want to write standard DSP functions.

 Roger.



Re: Performance Test

2011-03-02 Thread Ted Dunning
It will be very difficult to do.  If you have n machines running 4 different
things, you will probably get better results segregating tasks as much as
possible.  Interactions can be very subtle and can have major impact on
performance in a few cases.

Hadoop, in general, will use a lot of the resources if they appear to be
available.  The intent, after all, is to run batch jobs absolutely as fast
as your hardware can handle.

On Wed, Mar 2, 2011 at 7:31 PM, liupei liu...@xingcloud.com wrote:

 Hi,

 I'd like to tune params in hadoop config for my job. But my current cluster
 runs lot of other processes such as mongod, php gateways and some other
 routine hadoop jobs. It is impossible to stop all to get a clear environment
 for testing. Is there any way to get reliable results for my tuning in such
 a mixture environment?

 Thanks




Re: Advice for a new open-source project and a license

2011-03-01 Thread Ted Dunning
Bixo may have some useful components.  The thrust is different, but some of
the pieces are similar.

http://bixo.101tec.com/

On Mon, Feb 28, 2011 at 7:57 PM, Mark Kerzner markkerz...@gmail.com wrote:

 Well, it's more complex than that. I packed all files (or selected
 directories) into zip files, and those zip files go into HDFS, and they are
 processed from there.

 Mark

 On Mon, Feb 28, 2011 at 9:53 PM, Greg Roelofs roel...@yahoo-inc.com
 wrote:

  Mark Kerzner markkerz...@gmail.com wrote:
 
   I am working on an open-source project that would be using
   Hadoop/HDFS/HBase/Tika/Lucene and would make all files on a hard drive
   searchable.
 
  _A_ hard drive?  Hadoop?  Seems like a bad match.
 
  Greg
 



Re: Advice for a new open-source project and a license

2011-02-28 Thread Ted Dunning
Check out http://www.elasticsearch.org/

Not what you are doing, but possibly a
helpful bit of the pie.

Also, Solr integrates Tika and Lucene pretty nicely these days.  No HBase yet,
but it isn't hard to add that.

On Mon, Feb 28, 2011 at 1:01 PM, Mark Kerzner markkerz...@gmail.com wrote:

 Hi,

 I am working on an open-source project that would be using
 Hadoop/HDFS/HBase/Tika/Lucene and would make all files on a hard drive
 searchable. Like Nutch, only applied to hard drives, and like Google
 Desktop
 Search, only I want to output information about every file found. Not a big
 difference though.

 I am looking for an advice on the following

   1. Have you heard of a similar project?
   2. What license should I use? I am thinking of Apache V2.0, because it
   relies on other Apache V2.0 projects;
   3. Any other advice?

 Thank you. Sincerely,
 Mark



Re: Hadoop Case Studies?

2011-02-27 Thread Ted Dunning
Ted,

Greetings back at you.  It has been a while.

Check out Jimmy Lin and Chris Dyer's book about text processing with
hadoop:

http://www.umiacs.umd.edu/~jimmylin/book.html


On Sun, Feb 27, 2011 at 4:34 PM, Ted Pedersen tpede...@d.umn.edu wrote:

 Greetings all,

 I'm teaching an undergraduate Computer Science class that is using
 Hadoop quite heavily, and would like to include some case studies at
 various points during this semester.

 We are using Tom White's Hadoop The Definitive Guide as a text, and
 that includes a very nice chapter of case studies which might even
 provide enough material for my purposes.

 But, I wanted to check and see if there were other case studies out
 there that might provide motivating and interesting examples of how
 Hadoop is currently being used. The idea is to find material that goes
 beyond simply saying X uses Hadoop to explaining in more detail how
 and why X are using Hadoop.

 Any hints would be very gratefully received.

 Cordially,
 Ted

 --
 Ted Pedersen
 http://www.d.umn.edu/~tpederse



Re: Hadoop Case Studies?

2011-02-27 Thread Ted Dunning
At any large company that makes heavy use of Hadoop, you aren't going to
find any concise description of all the ways that hadoop is used.

That said, here is a concise description of some of the ways that hadoop is
(was) used at Yahoo:

http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing

On Sun, Feb 27, 2011 at 7:31 PM, Ted Pedersen tpede...@d.umn.edu wrote:

 Thanks for all these great ideas. These are really very helpful.

 What I'm also hoping to find are articles or papers that describe what
 particular companies or organizations have done with Hadoop. How does
 Facebook use Hadoop for example (that's one of the case studies in the
 White book), or how does last.fm use Hadoop (another of the case
 studies in the White book).

 One interesting resource is the list of powered by Hadoop projects
 available here:

 http://wiki.apache.org/hadoop/PoweredBy

 Some of these entries provide links to more detailed discussions of
 what an organization is doing, as in the following from Twitter
 http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009

 So any additional descriptions of what specific organizations are
 doing with Hadoop (to the extent they are willing to share) would be
 really helpful (these sorts of real world cases tend to be
 particularly motivating).

 Cordially,
 Ted

 On Sun, Feb 27, 2011 at 9:23 PM, Simon gsmst...@gmail.com wrote:
  I think you can also simulate PageRank Algorithm with hadoop.
 
  Simon -
 
  On Sun, Feb 27, 2011 at 9:20 PM, Lance Norskog goks...@gmail.com
 wrote:
 
  This is an exercise that will appeal to undergrads: pull the Craigslist
  personals ads from several cities, and do text classification. Given a
  training set of all the cities, attempt to classify test ads by city.
  (If Peter Harrington is out there, I stole this from you.)
 
  Lance
 
  On Sun, Feb 27, 2011 at 4:55 PM, Ted Dunning tdunn...@maprtech.com
  wrote:
   Ted,
  
   Greetings back at you.  It has been a while.
  
   Check out Jimmy Lin and Chris Dyer's book about text processing with
   hadoop:
  
   http://www.umiacs.umd.edu/~jimmylin/book.html
  
  
   On Sun, Feb 27, 2011 at 4:34 PM, Ted Pedersen tpede...@d.umn.edu
  wrote:
  
   Greetings all,
  
   I'm teaching an undergraduate Computer Science class that is using
   Hadoop quite heavily, and would like to include some case studies at
   various points during this semester.
  
   We are using Tom White's Hadoop The Definitive Guide as a text, and
   that includes a very nice chapter of case studies which might even
   provide enough material for my purposes.
  
   But, I wanted to check and see if there were other case studies out
   there that might provide motivating and interesting examples of how
   Hadoop is currently being used. The idea is to find material that
 goes
   beyond simply saying X uses Hadoop to explaining in more detail how
   and why X are using Hadoop.
  
   Any hints would be very gratefully received.
  
   Cordially,
   Ted
  
   --
   Ted Pedersen
   http://www.d.umn.edu/~tpederse
  
  
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 
 
 
 
  --
  Regards,
  Simon
 



 --
 Ted Pedersen
 http://www.d.umn.edu/~tpederse



Re: Quick question

2011-02-20 Thread Ted Dunning
This is the most important thing that you have said. The map function
is called once per unit of input but the mapper object persists for
many units of input.

You have a little bit of control over how many mapper objects there
are and how many machines they are created on and how many pieces your
input is broken into.  That control is limited, however, unless you
build your own input format. The standard input formats are optimized
for very large inputs and may not give you the flexibility that you
want for your experiments. That is unfortunate for the purpose of
learning about hadoop but hadoop is designed mostly for dealing with
very large data and isn't usually designed to be easy to understand.
Where easy coincides with powerful then easy is good but powerful
isn't always easy.

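For concreteness, here is a minimal sketch of that point using the 0.20 "mapred"
API (the class name is hypothetical): one Mapper object is constructed per map
task, its map method is then called once per input record, so instance fields
persist across all the records of a split.

    // Sketch only: a single Mapper instance handles every record in its split,
    // so instance state survives from one map() call to the next.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private long recordsSeen = 0;   // lives as long as the task, not one record

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> out, Reporter reporter)
          throws IOException {
        recordsSeen++;                // map() is called once per record
        out.collect(line, new LongWritable(recordsSeen));
      }

      @Override
      public void close() throws IOException {
        // Called once per task, after the whole split has been mapped.
        System.err.println("records seen by this mapper instance: " + recordsSeen);
      }
    }
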
On Sunday, February 20, 2011, maha m...@umail.ucsb.edu wrote:
 So first question: is there a difference between Mappers and maps ?


Re: Quick question

2011-02-18 Thread Ted Dunning
The input is effectively split by lines, but under the covers, the actual
splits are by byte.  Each mapper will cleverly scan from the specified start
to the next line after the start point.  At then end, it will over-read to
the end of line that is at or after the end of its specified region.  This
can make the last split be a bit smaller than the others and the first be a
bit larger.

Practically speaking, however, your 2000 line file is extremely unlikely to
be split at all because it is sooo small.

On Fri, Feb 18, 2011 at 11:14 AM, maha m...@umail.ucsb.edu wrote:

 Hi all,

  I want to check if the following statement is right:

  If I use TextInputFormat to process a text file with 2000 lines (each
 ending with \n) with 20 mappers. Then each map will have a sequence of
 COMPLETE LINES .

 In other words,  the input is not split byte-wise but by lines.

 Is that right?


 Thank you,
 Maha


Re: benchmark choices

2011-02-18 Thread Ted Dunning
MalStone looks like a very narrow benchmark.

Terasort is also a very narrow and somewhat idiosyncratic benchmark, but it
has the characteristic that lots of people use it.

You should add PigMix to your list.  There are Java versions of the problems in
PigMix that make a pretty good set of benchmarks independent of Pig itself.

On Fri, Feb 18, 2011 at 1:32 PM, Shrinivas Joshi jshrini...@gmail.com wrote:

 Which workloads are used for serious benchmarking of Hadoop clusters? Do
 you
 care about any of the following workloads :
 TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
 sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.

 Thanks,
 -Shrinivas



Re: benchmark choices

2011-02-18 Thread Ted Dunning
I just read the MalStone report.  They report times for a Java version that
is many times (5x) slower than for a streaming implementation.  That single
fact indicates that the Java code is so appallingly bad that this is a very
bad benchmark.

On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout jim.falg...@pervasive.com wrote:

 We use MalStone and TeraSort. For Hive, you can use TPC-H, at least the
 data and the queries, if not the query generator. There is a Jira issue in
 Hive that discusses the TPC-H benchmark if you're interested. Sorry, I
 don't remember the issue number offhand.

 -Original Message-
 From: Shrinivas Joshi [mailto:jshrini...@gmail.com]
 Sent: Friday, February 18, 2011 3:32 PM
 To: common-user@hadoop.apache.org
 Subject: benchmark choices

 Which workloads are used for serious benchmarking of Hadoop clusters? Do
 you care about any of the following workloads :
 TeraSort, GridMix v1, v2, or v3, MalStone, CloudBurst, MRBench, NNBench,
 sample apps shipped with Hadoop distro like PiEstimator, dbcount etc.

 Thanks,
 -Shrinivas




Re: Hadoop in Real time applications

2011-02-17 Thread Ted Dunning
Remove the 'for' at the end of the URL that got sucked in by the email editor.

On Thu, Feb 17, 2011 at 5:56 AM, Jason Rutherglen 
jason.rutherg...@gmail.com wrote:

 Ted, thanks for the links, the yahoo.com one doesn't seem to exist?

 On Wed, Feb 16, 2011 at 11:48 PM, Ted Dunning tdunn...@maprtech.com
 wrote:
  Unless you go beyond the current standard semantics, this is true.
 
  See here: http://code.google.com/p/hop/  and
  http://labs.yahoo.com/node/476for alternatives.
 
  On Wed, Feb 16, 2011 at 10:30 PM, madhu phatak phatak@gmail.com
 wrote:
 
  Hadoop is not suited for real time applications
 
  On Thu, Feb 17, 2011 at 9:47 AM, Karthik Kumar 
 karthik84ku...@gmail.com
  wrote:
 
   Can Hadoop be used for Real time Applications such as banking
  solutions...
  
   --
   With Regards,
   Karthik
  
 
 



Re: DataCreator

2011-02-16 Thread Ted Dunning
Sounds like Pig.  Or Cascading.  Or Hive.

Seriously, isn't this already available?

On Wed, Feb 16, 2011 at 7:06 AM, Guy Doulberg guy.doulb...@conduit.com wrote:


 Hey all,
 I want to consult with you hadoopers about a Map/Reduce application I want
 to build.

 I want to build a map/reduce job, that read files from HDFS, perform some
 sort of transformation on the file lines, and store them to several
 partitions depending on the source of the file or its data.

 I want this application to be as configurable as possible, so I designed
 interfaces to Parse, Decorate and Partition(On HDFS) the Data.

 I want to be able to configure different data flows, with different
 parsers, decorators and partitioners, using a config file.

 Do you think, you would use such an application? Does it fit an open-source
 project?

 Now, I have some technical questions:
 I was thinking of using reflection, to load all the classes I would need
 according to the configuration during the setup process of the Mapper.
 Do you think it is a good idea?

 Is there a way to send the Mapper objects or interfaces from the Job
 declaration?



  Thanks,




Re: Trying out AvatarNode

2011-02-16 Thread Ted Dunning
http://wiki.apache.org/hadoop/HowToContribute

On Wed, Feb 16, 2011 at 8:23 AM, Mark Kerzner markkerz...@gmail.com wrote:

 if there is a place to learn about patch
 practices, please point me to it.



Re: Hadoop in Real time applications

2011-02-16 Thread Ted Dunning
Unless you go beyond the current standard semantics, this is true.

See here: http://code.google.com/p/hop/  and
http://labs.yahoo.com/node/476for alternatives.

On Wed, Feb 16, 2011 at 10:30 PM, madhu phatak phatak@gmail.com wrote:

 Hadoop is not suited for real time applications

 On Thu, Feb 17, 2011 at 9:47 AM, Karthik Kumar karthik84ku...@gmail.com
 wrote:

  Can Hadoop be used for Real time Applications such as banking
 solutions...
 
  --
  With Regards,
  Karthik
 



Re: recommendation on HDDs

2011-02-15 Thread Ted Dunning
Good idea!

Would you like to create the nucleus of such a page?  (there might already
be something like that)

On Tue, Feb 15, 2011 at 8:49 AM, Shrinivas Joshi jshrini...@gmail.com wrote:

 It would be
 nice to have a wiki page collecting all this good information.



Re: hadoop 0.20 append - some clarifications

2011-02-14 Thread Ted Dunning
HDFS definitely doesn't follow anything like POSIX file semantics.

They may be a vague inspiration for what HDFS does, but generally the
behavior of HDFS is not tightly specified.  Even the unit tests have some
real surprising behavior.

On Mon, Feb 14, 2011 at 7:21 AM, Gokulakannan M gok...@huawei.com wrote:



  I think that in general, the behavior of any program reading data from
 an HDFS file before hsync or close is called is pretty much undefined.



 In Unix, users can read a file in parallel while another user is writing a
 file. And I suppose the sync feature design is based on that.

 So at any point of time during the file write, parallel users should be
 able to read the file.




 https://issues.apache.org/jira/browse/HDFS-142?focusedCommentId=12663958&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12663958
  --

 *From:* Ted Dunning [mailto:tdunn...@maprtech.com]
 *Sent:* Friday, February 11, 2011 2:14 PM
 *To:* common-user@hadoop.apache.org; gok...@huawei.com
 *Cc:* hdfs-u...@hadoop.apache.org; dhr...@gmail.com
 *Subject:* Re: hadoop 0.20 append - some clarifications



 I think that in general, the behavior of any program reading data from an
 HDFS file before hsync or close is called is pretty much undefined.



 If you don't wait until some point where part of the file is defined, you
 can't expect any particular behavior.

 On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M gok...@huawei.com
 wrote:

 I am not concerned about the sync behavior.

 The thing is the reader reading non-flushed(non-synced) data from HDFS as
 you have explained in previous post.(in hadoop 0.20 append branch)

 I identified one specific scenario where the above statement is not holding
 true.

 Following is how you can reproduce the problem.

 1. add debug point at createBlockOutputStream() method in DFSClient and run
 your HDFS write client in debug mode

 2. allow client to write 1 block to HDFS

 3. for the 2nd block, the flow will come to the debug point mentioned in
 1(do not execute the createBlockOutputStream() method). hold here.

 4. parallely, try to read the file from another client

 Now you will get an error saying that file cannot be read.



  _

 From: Ted Dunning [mailto:tdunn...@maprtech.com]
 Sent: Friday, February 11, 2011 11:04 AM
 To: gok...@huawei.com
 Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
 c...@boudnik.org
 Subject: Re: hadoop 0.20 append - some clarifications



 It is a bit confusing.



 SequenceFile.Writer#sync isn't really sync.



 There is SequenceFile.Writer#syncFs which is more what you might expect to
 be sync.



 Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
 you want portable code, you have to reflect a bit to figure out what can be
 done.

 On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M gok...@huawei.com wrote:

 Thanks Ted for clarifying.

 So the sync is to just flush the current buffers to datanode and persist
 the
 block info in namenode once per block, isn't it?



 Regarding reader able to see the unflushed data, I faced an issue in the
 following scenario:

 1. a writer is writing a 10MB file(block size 2 MB)

 2. wrote the file upto 4MB (2 finalized blocks in current and nothing in
 blocksBeingWritten directory in DN) . So 2 blocks are written

 3. client calls addBlock for the 3rd block on namenode and not yet created
 outputstream to DN(or written anything to DN). At this point of time, the
 namenode knows about the 3rd block but the datanode doesn't.

 4. at point 3, a reader is trying to read the file and he is getting
 exception and not able to read the file as the datanode's getBlockInfo
 returns null to the client(of course DN doesn't know about the 3rd block
 yet)

 In this situation the reader cannot see the file. But when the block
 writing
 is in progress , the read is successful.

 Is this a bug that needs to be handled in append branch?



  -Original Message-
  From: Konstantin Boudnik [mailto:c...@boudnik.org]
  Sent: Friday, February 11, 2011 4:09 AM
 To: common-user@hadoop.apache.org
  Subject: Re: hadoop 0.20 append - some clarifications

  You might also want to check append design doc published at HDFS-265



 I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's
 design doc won't apply to it.



  _

 From: Ted Dunning [mailto:tdunn...@maprtech.com]
 Sent: Thursday, February 10, 2011 9:29 PM
 To: common-user@hadoop.apache.org; gok...@huawei.com
 Cc: hdfs-u...@hadoop.apache.org
 Subject: Re: hadoop 0.20 append - some clarifications



 Correct is a strong word here.



 There is actually an HDFS unit test that checks to see if partially written
 and unflushed data is visible.  The basic rule of thumb is that you need to
 synchronize readers and writers outside of HDFS.  There is no guarantee
 that
 data is visible or invisible after writing, but there is a guarantee that
 it
 will become visible after sync or close.

Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread Ted Dunning
Note that the document purports to be from 2008 and, at best, was uploaded just
about a year ago.

That it is still pretty accurate is kind of a tribute to either the
stability of hbase or the stagnation depending on how you read it.

On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner markkerz...@gmail.com wrote:

 As a conclusion, when building an HA HDFS cluster, one needs to follow the
 best
 practices outlined by Tom
 White
 http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf,
 and may still need to resort to specialized NSF filers for running the
 NameNode.



Re: recommendation on HDDs

2011-02-12 Thread Ted Dunning
The original poster also seemed somewhat interested in disk bandwidth.

That is facilitated by having more than one disk in the box.

On Sat, Feb 12, 2011 at 8:26 AM, Michael Segel michael_se...@hotmail.com wrote:

 Since the OP believes that their requirement is 1TB per node... a single
 2TB would be the best choice. It allows for additional space and you really
 shouldn't be too worried about disk i/o being your bottleneck.


Re: Which strategy is proper to run an this enviroment?

2011-02-12 Thread Ted Dunning
This sounds like it will be very inefficient.  There is considerable
overhead in starting Hadoop jobs.  As you describe it, you will be starting
thousands of jobs and paying this penalty many times.

Is there a way that you could process all of the directories in one
map-reduce job?  Can you combine these directories into a single directory
with a few large files?

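To make the one-job alternative concrete, here is a minimal sketch with the
0.20 "mapred" API; the class name, output path, and use of command-line
arguments as directory names are hypothetical, and the mapper/reducer setup is
omitted because the point is only the input configuration.

    // Sketch: feed many existing directories to a single job so the per-job
    // startup cost is paid once rather than once per directory.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ManyDirsOneJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ManyDirsOneJob.class);
        conf.setJobName("all-dirs-in-one-pass");

        // Every directory becomes an input path of the same job; each file in
        // those directories still gets its own splits and mappers.
        for (String dir : args) {
          FileInputFormat.addInputPath(conf, new Path(dir));
        }
        FileOutputFormat.setOutputPath(conf, new Path("/output/all-dirs"));

        // setMapperClass/setReducerClass and the rest are omitted in this sketch.
        JobClient.runJob(conf);
      }
    }
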
On Fri, Feb 11, 2011 at 8:07 PM, Jun Young Kim juneng...@gmail.com wrote:

 Hi.

 I have a small cluster (9 nodes) running Hadoop here.

 On this cluster, Hadoop will process thousands of directories sequentially.

 In each dir there are two input files for the m/r job. Input file sizes range
 from
 1 MB to 5 GB.
 In summary, each Hadoop job will take one of these dirs.

 To get the best performance, which strategy is proper for us?

 Could you suggest an approach?
 Which configuration is best?

 PS) The physical memory size is 12 GB on each node.



Re: hadoop 0.20 append - some clarifications

2011-02-11 Thread Ted Dunning
I think that in general, the behavior of any program reading data from an
HDFS file before hsync or close is called is pretty much undefined.

If you don't wait until some point where part of the file is defined, you
can't expect any particular behavior.

On Fri, Feb 11, 2011 at 12:31 AM, Gokulakannan M gok...@huawei.com wrote:

 I am not concerned about the sync behavior.

 The thing is the reader reading non-flushed(non-synced) data from HDFS as
 you have explained in previous post.(in hadoop 0.20 append branch)

 I identified one specific scenario where the above statement is not holding
 true.

 Following is how you can reproduce the problem.

 1. add debug point at createBlockOutputStream() method in DFSClient and run
 your HDFS write client in debug mode

 2. allow client to write 1 block to HDFS

 3. for the 2nd block, the flow will come to the debug point mentioned in
 1(do not execute the createBlockOutputStream() method). hold here.

 4. parallely, try to read the file from another client

 Now you will get an error saying that file cannot be read.



  _

 From: Ted Dunning [mailto:tdunn...@maprtech.com]
 Sent: Friday, February 11, 2011 11:04 AM
 To: gok...@huawei.com
 Cc: common-user@hadoop.apache.org; hdfs-u...@hadoop.apache.org;
 c...@boudnik.org
 Subject: Re: hadoop 0.20 append - some clarifications



 It is a bit confusing.



 SequenceFile.Writer#sync isn't really sync.



 There is SequenceFile.Writer#syncFs which is more what you might expect to
 be sync.



 Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
 you want portable code, you have to reflect a bit to figure out what can be
 done.

 On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M gok...@huawei.com wrote:

 Thanks Ted for clarifying.

 So the sync is to just flush the current buffers to datanode and persist
 the
 block info in namenode once per block, isn't it?



 Regarding reader able to see the unflushed data, I faced an issue in the
  following scenario:

 1. a writer is writing a 10MB file(block size 2 MB)

 2. wrote the file upto 4MB (2 finalized blocks in current and nothing in
 blocksBeingWritten directory in DN) . So 2 blocks are written

 3. client calls addBlock for the 3rd block on namenode and not yet created
 outputstream to DN(or written anything to DN). At this point of time, the
 namenode knows about the 3rd block but the datanode doesn't.

 4. at point 3, a reader is trying to read the file and he is getting
 exception and not able to read the file as the datanode's getBlockInfo
 returns null to the client(of course DN doesn't know about the 3rd block
 yet)

 In this situation the reader cannot see the file. But when the block
 writing
 is in progress , the read is successful.

 Is this a bug that needs to be handled in append branch?



  -Original Message-
  From: Konstantin Boudnik [mailto:c...@boudnik.org]
  Sent: Friday, February 11, 2011 4:09 AM
 To: common-user@hadoop.apache.org
  Subject: Re: hadoop 0.20 append - some clarifications

  You might also want to check append design doc published at HDFS-265



 I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's
 design doc won't apply to it.



  _

 From: Ted Dunning [mailto:tdunn...@maprtech.com]
 Sent: Thursday, February 10, 2011 9:29 PM
 To: common-user@hadoop.apache.org; gok...@huawei.com
 Cc: hdfs-u...@hadoop.apache.org
 Subject: Re: hadoop 0.20 append - some clarifications



 Correct is a strong word here.



 There is actually an HDFS unit test that checks to see if partially written
 and unflushed data is visible.  The basic rule of thumb is that you need to
 synchronize readers and writers outside of HDFS.  There is no guarantee
 that
 data is visible or invisible after writing, but there is a guarantee that
 it
 will become visible after sync or close.

 On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote:

 Is this the correct behavior or my understanding is wrong?








Re: hadoop 0.20 append - some clarifications

2011-02-10 Thread Ted Dunning
Correct is a strong word here.

There is actually an HDFS unit test that checks to see if partially written
and unflushed data is visible.  The basic rule of thumb is that you need to
synchronize readers and writers outside of HDFS.  There is no guarantee that
data is visible or invisible after writing, but there is a guarantee that it
will become visible after sync or close.

On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote:

 Is this the correct behavior or my understanding is wrong?



Re: recommendation on HDDs

2011-02-10 Thread Ted Dunning
Get bigger disks.  Data only grows and having extra is always good.

You can get 2TB drives for $100 and 1TB for < $75.

As far as transfer rates are concerned, any 3Gb/s SATA drive is going to be
about the same (ish).  Seek times will vary a bit with rotation speed, but
with Hadoop, you will be doing long reads and writes.

Your controller and backplane will have a MUCH bigger vote in getting
acceptable performance.  With only 4 or 5 drives, you don't have to worry
about super-duper backplane, but you can still kill performance with a lousy
controller.

On Thu, Feb 10, 2011 at 12:26 PM, Shrinivas Joshi jshrini...@gmail.com wrote:

 What would be a good hard drive for a 7 node cluster which is targeted to
 run a mix of IO and CPU intensive Hadoop workloads? We are looking for
 around 1 TB of storage on each node distributed amongst 4 or 5 disks. So
 either 250GB * 4 disks or 160GB * 5 disks. Also it should be less than 100$
 each ;)

 I looked at HDD benchmark comparisons on tomshardware, storagereview etc.
 Got overwhelmed with the # of benchmarks and different aspects of HDD
 performance.

 Appreciate your help on this.

 -Shrinivas



Re: hadoop 0.20 append - some clarifications

2011-02-10 Thread Ted Dunning
It is a bit confusing.

SequenceFile.Writer#sync isn't really sync.

There is SequenceFile.Writer#syncFs which is more what you might expect to
be sync.

Then there is HADOOP-6313 which specifies hflush and hsync.  Generally, if
you want portable code, you have to reflect a bit to figure out what can be
done.

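As an illustration of the "reflect a bit" approach, here is a small sketch (the
helper class is hypothetical) that calls SequenceFile.Writer#syncFs when the
running build has it and quietly does nothing when it does not; treat it as a
starting point rather than a definitive recipe.

    // Sketch: portable-ish sync for SequenceFile writers across branches that
    // may or may not provide syncFs (e.g. 0.20-append has it).
    import java.lang.reflect.Method;
    import org.apache.hadoop.io.SequenceFile;

    public final class WriterSync {
      private WriterSync() {}

      /** Calls writer.syncFs() when available; otherwise a no-op. */
      public static void syncIfPossible(SequenceFile.Writer writer) {
        try {
          Method syncFs = writer.getClass().getMethod("syncFs");
          syncFs.invoke(writer);
        } catch (NoSuchMethodException e) {
          // Running on a build without syncFs: readers are only guaranteed to
          // see the data after close().
        } catch (Exception e) {
          throw new RuntimeException("syncFs failed", e);
        }
      }
    }
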
On Thu, Feb 10, 2011 at 8:38 PM, Gokulakannan M gok...@huawei.com wrote:

  Thanks Ted for clarifying.

 So the *sync* is to just flush the current buffers to datanode and persist
 the block info in namenode once per block, isn't it?



 Regarding reader able to see the unflushed data, I faced an issue in the
 following scenario:

 1. a writer is writing a *10MB* file(block size 2 MB)

 2. wrote the file upto 4MB (2 finalized blocks in *current* and nothing in
 *blocksBeingWritten* directory in DN) . So 2 blocks are written

 3. client calls addBlock for the 3rd block on namenode and not yet created
 outputstream to DN(or written anything to DN). At this point of time, the
 namenode knows about the 3rd block but the datanode doesn't.

 4. at point 3, a reader is trying to read the file and he is getting
 exception and not able to read the file as the datanode's getBlockInfo
 returns null to the client(of course DN doesn't know about the 3rd block
 yet)

 In this situation the reader cannot see the file. But when the block
 writing is in progress , the read is successful.

 *Is this a bug that needs to be handled in append branch?*



  -Original Message-
  From: Konstantin Boudnik [mailto:c...@boudnik.org]
  Sent: Friday, February 11, 2011 4:09 AM
 To: common-user@hadoop.apache.org
  Subject: Re: hadoop 0.20 append - some clarifications

  You might also want to check append design doc published at HDFS-265



 I was asking about the hadoop 0.20 append branch. I suppose HDFS-265's
 design doc won't apply to it.


  --

 *From:* Ted Dunning [mailto:tdunn...@maprtech.com]
 *Sent:* Thursday, February 10, 2011 9:29 PM
 *To:* common-user@hadoop.apache.org; gok...@huawei.com
 *Cc:* hdfs-u...@hadoop.apache.org
 *Subject:* Re: hadoop 0.20 append - some clarifications



 Correct is a strong word here.



 There is actually an HDFS unit test that checks to see if partially written
 and unflushed data is visible.  The basic rule of thumb is that you need to
 synchronize readers and writers outside of HDFS.  There is no guarantee that
 data is visible or invisible after writing, but there is a guarantee that it
 will become visible after sync or close.

 On Thu, Feb 10, 2011 at 7:11 AM, Gokulakannan M gok...@huawei.com wrote:

 Is this the correct behavior or my understanding is wrong?





Re: Can single map-reduce solve this problem

2011-02-08 Thread Ted Dunning
mapper should produce (k,1), (1, v) for lines k,v in file1 and should
produce (k,2), (2,v) for lines k,v in file2.  Your partition function should
look at only the first member of the key tuple, but should order on both
members.

Your reducer will get data like this:

   (k,1), [(1,v)]

or like this

   (k, 1), [(1,v1),(2,v2)]

In the first case, it should emit k, v.  In the second, k,v2.  More simply,
it should simply emit the last value in the reduce group.

In actual practice, you should probably use something fancier than an
integer to tag the data.  You will also have to find some kind of
appropriate tuple structure.

Pig, Cascading, Plume and Hive would make this easier than straight Java,
but all techniques would work.

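To make the reducer's job concrete, here is a simplified sketch that tags the
values rather than the keys, so it skips the custom partitioner and ordering
described above at the cost of scanning all values for a key. It assumes the
mapper emits key k with value "1" + TAB + v for lines from file1 and
"2" + TAB + v for lines from file2; class names are hypothetical and the 0.20
"mapred" API is used.

    // Sketch: for each key, emit file2's value when present, else file1's value.
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class PreferFile2Reducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String chosen = null;
        boolean sawFile2 = false;
        while (values.hasNext()) {
          String tagged = values.next().toString();   // e.g. "1\tvalueFromFile1"
          String tag = tagged.substring(0, 1);
          String payload = tagged.substring(2);
          if (tag.equals("2")) {          // file2 always wins
            chosen = payload;
            sawFile2 = true;
          } else if (!sawFile2) {
            chosen = payload;             // keep file1's value as the fallback
          }
        }
        if (chosen != null) {
          out.collect(key, new Text(chosen));
        }
      }
    }

With the tagged-key scheme described above, the same output falls out of simply
emitting the last value in each reduce group, which is cheaper because nothing
has to be buffered or compared.
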

On Tue, Feb 8, 2011 at 4:26 PM, Gururaj S Mayya gsma...@yahoo.com wrote:

 Any pointers as to how this could be done?



Re: Hadoop XML Error

2011-02-07 Thread Ted Dunning
This is due to the security API not being available.  You are crossing from
a cluster with security to one without and that is causing confusion.
 Presumably your client assumes that it is available and your hadoop library
doesn't provide it.

Check your class path very carefully looking for version assumptions and
confusions.

On Mon, Feb 7, 2011 at 11:43 AM, Korb, Michael [USA]
korb_mich...@bah.com wrote:

 We're migrating from CDH3b3 to a recent build of 0.20-append published by
 Ryan Rawson. This isn't something covered by normal upgrade scripts. I've
 tried several commands with different protocols and port numbers, but now
 keep getting the same error:

 11/02/07 14:35:06 INFO tools.DistCp: srcPaths=[hftp://mc1:50070/]
 11/02/07 14:35:06 INFO tools.DistCp: destPath=hdfs://mc0:55310/
 Exception in thread main java.lang.NoSuchMethodError:
 org.apache.hadoop.mapred.JobConf.getCredentials()Lorg/apache/hadoop/security/Credentials;
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:632)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)

 Has anyone seen this before? What might be causing it?

 Thanks,
 Mike


 
 From: Xavier Stevens [xstev...@mozilla.com]
 Sent: Monday, February 07, 2011 1:47 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop XML Error

 You don't need to distcp to upgrade a cluster.  You just need to go
 through the upgrade process.  Bumping from 0.20.2 to 0.20.3 you might
 not even need to do anything other than stop the cluster processes, and
 then restart them using the 0.20.3 install.

 Here's a link to the upgrade and rollback docs:

 http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html#Upgrade+and+Rollback


 -Xavier


 On 2/7/11 10:22 AM, Korb, Michael [USA] wrote:
  Xavier,
 
  Yes, I'm trying to upgrade from 0.20.2 to 0.20.3. Both are running on the
 same cluster. I'm trying to distcp everything from the 0.20.2 instance over
 to the 0.20.3 instance, without any luck yet.
 
  Mike
  
  From: Xavier Stevens [xstev...@mozilla.com]
  Sent: Monday, February 07, 2011 1:20 PM
  To: common-user@hadoop.apache.org
  Subject: Re: Hadoop XML Error
 
  Mike,
 
  Are you just trying to upgrade then?  I've never heard of anyone trying
  to run two versions of hadoop on the same cluster.  I'm don't think
  that's even possible, but maybe someone else knows.
 
  -Xavier
 
 
  On 2/7/11 10:03 AM, Korb, Michael [USA] wrote:
  Xavier,
 
  Both instances of Hadoop are running on the same cluster. I tried the
 command sudo -u hdfs ./hadoop distcp -update hftp://mc1:50070/
 hdfs://mc0:55310 from the hadoop2 bin directory (the 0.20.3 install) on
 mc0 (the port 55310 is specified in core-site.xml). Now I'm getting
 this:
 
  11/02/07 13:03:14 INFO tools.DistCp: srcPaths=[hftp://mc1:50070/]
  11/02/07 13:03:14 INFO tools.DistCp: destPath=hdfs://mc0:55310
  Exception in thread main java.lang.NoSuchMethodError:
 org.apache.hadoop.mapred.JobConf.getCredentials()Lorg/apache/hadoop/security/Credentials;
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:632)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
 
  Thanks,
  Mike
  
  From: Xavier Stevens [xstev...@mozilla.com]
  Sent: Monday, February 07, 2011 12:56 PM
  To: common-user@hadoop.apache.org
  Subject: Re: Hadoop XML Error
 
  Mike,
 
  I've seen this when a directory has been removed or is missing from the
  time distcp starting stating the source files.  You'll probably want to
  make sure that no code or person is messing with the filesystem during
  your copy.  I would make sure you only have one version of hadoop
  installed on your destination cluster.  Also you should use hdfs as the
  destination protocol and run the command as the hdfs user if you're
  using hadoop security.
 
  Example (Running on destination cluster):
 
  sudo -u hdfs /usr/lib/hadoop-0.20.3/bin/hadoop distcp -update
  hftp://mc1:50070/ hdfs://mc0:8020/
 
   Cheers,
 
 
  -Xavier
 
 
  On 2/7/11 9:39 AM, Korb, Michael [USA] wrote:
  I'm trying to copy from 0.20.2 to 0.20.3. I was trying to follow the
 DistCp Guide but I think I know the problem. I'm trying to run the command
 on the destination cluster, but when I call hadoop, I think the path is set
 to run the hadoop1 executable. So I tried going to the hadoop2 install 

Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread Ted Dunning
Option (1) isn't the way that things normally work.  Besides, the map function
is called many times for each mapper object that is constructed.

On Mon, Feb 7, 2011 at 3:38 PM, maha m...@umail.ucsb.edu wrote:

 Hi,

  I would appreciate it if you could give me your thoughts on whether there is
 an effect on efficiency if:

  1) Mappers were per line in a document

  or

  2) Mappers were per block of lines in a document.


  I know the obvious difference I can see is that (1) has more mappers. Does
 that mean (1) will be slower because of scheduling time ?

 Thank you,
 Maha



Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread Ted Dunning
That is quite doable.  One way to do it is to make the max split size quite
small.

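For reference, a small configuration sketch with the 0.20 "mapred" API showing
two ways to get very small splits: NLineInputFormat (not mentioned above, but
built for exactly this) and a large map-count hint, which is one way to make
splits quite small with the old API. The NLineInputFormat property name is
given as I recall it, so verify it against your build.

    // Sketch: two ways to get roughly "one line per mapper".
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class TinySplitConfig {
      public static void configure(JobConf conf) {
        // Option A: NLineInputFormat hands each map task a fixed number of lines.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 1);

        // Option B: keep TextInputFormat but request many maps; the old-API
        // FileInputFormat divides the total input size by this hint to pick a
        // goal split size, so a big hint means many small splits.
        // conf.setNumMapTasks(2000);
      }
    }
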
On Mon, Feb 7, 2011 at 6:14 PM, Mark Kerzner markkerz...@gmail.com wrote:

 Ted,

 I am also interested in this answer.

 I put the name of a zip file on a line in an input file, and I want one
 mapper to read this line, and start working on it (since it now knows the
 path in HDFS). Are you saying it's not doable?

 Thank you,
 Mark

 On Mon, Feb 7, 2011 at 8:10 PM, Ted Dunning tdunn...@maprtech.com wrote:

  Option (1) isn't the way that things normally work.  Besides, mappers are
  called many times for each construction of a mapper.
 
  On Mon, Feb 7, 2011 at 3:38 PM, maha m...@umail.ucsb.edu wrote:
 
   Hi,
  
I would appreciate it if you could give me your thoughts on whether there is
    an effect on efficiency if:
  
1) Mappers were per line in a document
  
or
  
2) Mappers were per block of lines in a document.
  
  
I know the obvious difference I can see is that (1) has more mappers.
  Does
   that mean (1) will be slower because of scheduling time ?
  
   Thank you,
   Maha
  
 



Re: retain state between mappers

2011-02-05 Thread Ted Dunning
Remember that mappers are not executed in a well defined order.

They can be executed in different order or even at the same time.  One
mapper can be run more than once.

There are two ways to get something like what you want, but the question you
asked is ill-posed.

First, you can adapt the input format so that it gives a different integer
to each split as a key.  This does something like what you ask for, since
each mapper will get a different integer and multiply-executed mappers will
get the same key each time they are run.

Secondly, you could use a central coordination server to act as a global
counter.  **THIS IS A REALLY BAD IDEA**  It is bad because it turns a
parallel computation into a partially sequential one and because it doesn't
account for the fact that mappers can be run multiple times.

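A shortcut related to the first option (it avoids writing a custom InputFormat)
is to have each map task read its own task number from the job configuration;
re-executions of the same task see the same number, which matches the caveat
above. The property name is as I recall it for 0.20, so treat this sketch as a
starting point.

    // Sketch: pick up the map task's index within the job and use it as a
    // distinct integer per split.  Hypothetical class name; 0.20 "mapred" API.
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    public abstract class SplitAwareMapperBase extends MapReduceBase {
      protected int taskIndex = -1;

      @Override
      public void configure(JobConf conf) {
        // Set by the framework per task; retries and speculative attempts of the
        // same task get the same value, so this is not a global running counter.
        taskIndex = conf.getInt("mapred.task.partition", -1);
      }
    }
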
On Sat, Feb 5, 2011 at 4:03 AM, ANKITBHATNAGAR abhatna...@vantage.com wrote:

 is there a way I can retain the count between mappers and increment.?



Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-26 Thread Ted Dunning
Yes.  There is still a mismatch.

The degree of mismatch is decreasing as tools like Hive or Pig become more
advanced.  Better packaging of hadoop is also helping.

But your data architect will definitely need support from a java aware
person.  They won't be able to do all of the tasks.

On Wed, Jan 26, 2011 at 7:43 AM, manoranjand manoranj...@rediffmail.com wrote:


 Hi- I have a basic question. Appologies for my ignorance, but is hadoop a
 mis-fit for a data architect with zero java knowledge?
 --
 View this message in context:
 http://old.nabble.com/Hadoop-is-for-whom--Data-architect-or-Java-Architect-or-All-tp30765860p30765860.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
This is a bit lower than it should be, but it is not so far out of line with
what is reasonable.

Did you make sure that you have multiple separate disks for HDFS to use?  With
many disks, you should be able to get local disk write speeds up to a few
hundred MB/s.

Once you involve replication then you need to have data go out the network
interface, back in to another machine, back out and back in to a third
machine.  There are lots of copies going on and if you are writing lots of
files, you will typically be limited to 1/2 of your network bandwidth at
most, but doing a bit less than that is to be expected.  What you are seeing
is lower than it should be, but only by a moderate factor.

On Tue, Jan 25, 2011 at 12:33 PM, Da Zheng zhengda1...@gmail.com wrote:

 Hello,

 I try to measure the performance of HDFS, but the writing rate is quite
 low. When the replication factor is 1, the rate of writing to HDFS is about
 60MB/s. When the replication factor is 3, the rate drops significantly to
 about 15MB/s. Even though the actual rate of writing data to the disk is
 about 45MB/s, it's still much lower than when replication factor is 1. The
 link between two nodes in the cluster is 1Gbps. CPU is Dual-Core AMD
 Opteron(tm) Processor 2212, so CPU isn't bottleneck either. I thought I
 should be able to saturate the disk very easily. I wonder where the
 bottleneck is. What is the throughput for writing on a Hadoop cluster when
 the replication factor is 3?

 Thanks,
 Da



Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
This is a really slow drive or controller.

Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s.  I would
suspect in the absence of real information that your controller is more
likely to be deficient than your drive.  If this is on a laptop or
something, then I withdraw my thought.

On Tue, Jan 25, 2011 at 4:50 PM, Da Zheng zhengda1...@gmail.com wrote:

 The aggregate write-rate can get much higher if you use more drives, but a
 single stream throughput is limited to the speed of one disk spindle.

  You are right. I measure the performance of the hard drive. It seems the
 bottleneck is the hard drive, but the hard drive is a little too slow. The
 average writing rate is 50MB/s.


Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
Perhaps lshw would help you.

ubuntu:~$ sudo lshw
   ...
*-storage
 description: RAID bus controller
 product: SB700/SB800 SATA Controller [Non-RAID5 mode]
 vendor: ATI Technologies Inc
 physical id: 11
 bus info: pci@:00:11.0
 logical name: scsi0
 logical name: scsi1
 version: 00
 width: 32 bits
 clock: 66MHz
 capabilities: storage pm bus_master cap_list emulated
 configuration: driver=ahci latency=64
 resources: irq:22 ioport:b000(size=8) ioport:a000(size=4)
ioport:9000(size=8) ioport:8000(size=4) ioport:7000(size=16)
memory:fe7ffc00-fe7f
   *-disk
description: ATA Disk
product: ST3750528AS
vendor: Seagate
physical id: 0
bus info: scsi@0:0.0.0
   ...

On Tue, Jan 25, 2011 at 6:29 PM, Da Zheng zhengda1...@gmail.com wrote:

 No, each node in the cluster is a powerful server. I was told the nodes are
 Dell
 Poweredge SC1435, but I cannot figure out the configuration of hard drives.
 Dell
 provides several possible hard drives for this model.

 On 1/25/11 7:59 PM, Ted Dunning wrote:
  This is a really slow drive or controller.
 
  Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s.  I would
  suspect in the absence of real information that your controller is more
  likely to be deficient than your drive.  If this is on a laptop or
  something, then I withdraw my thought.
 
  On Tue, Jan 25, 2011 at 4:50 PM, Da Zheng zhengda1...@gmail.com wrote:
 
  The aggregate write-rate can get much higher if you use more drives, but
 a
  single stream throughput is limited to the speed of one disk spindle.
 
   You are right. I measure the performance of the hard drive. It seems
 the
  bottleneck is the hard drive, but the hard drive is a little too slow.
 The
  average writing rate is 50MB/s.
 




Re: Hadoop user event in Europe (interested ? )

2011-01-20 Thread Ted Dunning
If it occurs in early June, it might be possible for US attendees of
Buzzwords to link in a visit to the hadoop meeting.  I certainly would like
to do that.

On Thu, Jan 20, 2011 at 12:22 AM, Asif Jan asif@unige.ch wrote:

 Hi

 wondering if there is interest to organize a hadoop meet-up in Europe (
 Geneva, Switzerland) ? It could be a 2 day event discussing use of Hadoop in
 industry/science project.

 If this interest you please let me know.

 cheers
 asif


Re: When applying a patch, which attachment should I use?

2011-01-11 Thread Ted Dunning
You may also be interested in the append branch:

http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/

On Tue, Jan 11, 2011 at 3:12 AM, edward choi mp2...@gmail.com wrote:

 Thanks for the info.
 I am currently using Hadoop 0.20.2, so I guess I only need apply
 hdfs-630-0.20-append.patch
 https://issues.apache.org/jira/secure/attachment/12446812/hdfs-630-0.20-append.patch
 
 .
 I wasn't familiar with the term trunk. I guess it means the latest
 development.
 Thanks again.

 Best Regards,
 Ed

 2011/1/11 Konstantin Boudnik c...@apache.org

  Yeah, that's pretty crazy all right. In your case looks like that 3
  patches on the top are the latest for 0.20-append branch, 0.21 branch
  and trunk (which perhaps 0.22 branch at the moment). It doesn't look
  like you need to apply all of them - just try the latest for your
  particular branch.
 
  The mess is caused by the fact the ppl are using different names for
  consequent patches (as in file.1.patch, file.2.patch etc) This is
  _very_ confusing indeed, especially when different contributors work
  on the same fix/feature.
  --
Take care,
  Konstantin (Cos) Boudnik
 
 
  On Mon, Jan 10, 2011 at 01:10, edward choi mp2...@gmail.com wrote:
   Hi,
   For the first time I am about to apply a patch to HDFS.
  
   https://issues.apache.org/jira/browse/HDFS-630
  
   Above is the one that I am trying to do.
   But there are like 15 patches and I don't know which one to use.
  
   Could anyone tell me if I need to apply them all or just the one at the
  top?
  
   The whole patching process is just so confusing :-(
  
   Ed
  
 



Re: Import data from mysql

2011-01-10 Thread Ted Dunning
Yes.  Hadoop can definitely help with this.

On Mon, Jan 10, 2011 at 12:00 PM, Brian brian.mcswee...@gmail.com wrote:

 Thus, I would greatly appreciate your opinion on whether or not using
 hadoop for this would make sense in order to parallelize the task if it gets
 too slow.


Re: Import data from mysql

2011-01-09 Thread Ted Dunning
It is, of course, only quadratic, even if you compare all rows to all other
rows.  You can reduce this cost to O(n log n) by ordinary sorting and you
can further reduce the cost to O(n) using radix sort on hashes.

Practically speaking, in either the parallel or non parallel setting try
sorting each batch of inputs and then doing a merge pass to find duplicated
rows.  Hashing your rows and doing the sort will make things faster if your
rows are very long or if you use radix sort.  Unless your data is vast, this
would probably work on a single machine with no need for parallelism since
sorting billions of items would require 10 passes through your data with a
2^16 way radix sort.

To do this with hadoop, simply do the hashing as before and run a typical
word count.  Then the rows that duplicate are simply the ones with count > 1
and these can be preferentially output by the reducer.

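Here is a minimal sketch of that word-count-on-rows job with the 0.20 "mapred"
API (class names are hypothetical; hashing each row first is an optional
optimization for very long rows).

    // Sketch: the map key is the row itself and the reducer keeps only rows
    // that were seen more than once.
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class DuplicateRows {

      public static class RowMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        public void map(LongWritable offset, Text row,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          out.collect(row, ONE);          // or emit a hash of the row instead
        }
      }

      public static class DupReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, NullWritable> {
        public void reduce(Text row, Iterator<IntWritable> counts,
                           OutputCollector<Text, NullWritable> out, Reporter reporter)
            throws IOException {
          int n = 0;
          while (counts.hasNext()) {
            n += counts.next().get();
          }
          if (n > 1) {                    // only duplicated rows are emitted
            out.collect(row, NullWritable.get());
          }
        }
      }
    }

A combiner that just sums the counts (separate from this reducer, since the
output types differ) would cut shuffle traffic when duplicates cluster within a
split.
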
On Sat, Jan 8, 2011 at 3:33 PM, Brian McSweeney
 brian.mcswee...@gmail.com wrote:

 I'm a TOTAL newbie on hadoop. I have an existing webapp that has a growing
 number of rows in a mysql database that I have to compare against one
 another once a day from a batch job. This is an exponential problem as
 every
 row must be compared against every other row. I was thinking of
 parallelizing this computation via hadoop.



Re: Import data from mysql

2011-01-09 Thread Ted Dunning
You still have to knock down the quadratic cost.

Any equality checks you have in your problem can be used to limit the
problem to growing quadratically in the number of records equal by that
comparison.  That may be enough to fix things (for now).  Unfortunately
heavily skewed data are very common so this smaller quadratic will be orders
of magnitude smaller than the original, but still unscalable.  Hadoop makes
this grouping by equality much easier of course and the internal scan can be
done by conventional techniques.

Beyond that, you need to look at more interesting techniques to really make
this a viable option.

I would recommend:

- if the multiplication is part of a cosine similarity measurement, then
look at expressing it as a difference instead and bound the largest
component of the difference.

- take a look at locality sensitive hashing.  This gives you an approximate
nearest neighbor solution that will allow good probabilistic bounds on the
number of cases that you miss in return for a scalable solution.  The error
bounds can be made fairly tight.  See http://www.mit.edu/~andoni/LSH/

- if you decide that LSH is the way to go, check out Mahout which has a
minhash clustering implementation (a toy sketch of the minhash idea follows
this list).

- if you can't restate the problem as non-quadratic, then start over.
 Quadratic algorithms are not scalable as Michael Black has stated
eloquently enough in another thread.

- consider telling the group more about your problem.  You get more if you give
more.
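
A toy sketch of the minhash signature idea mentioned above (this is not the
Mahout implementation; the hashing scheme and names are illustrative only).
Rows whose signatures agree on many components are likely to have high
Jaccard similarity, so only those candidate pairs need the expensive
comparison:

    import java.util.Arrays;
    import java.util.Random;

    public class MinHashSketch {
      private final int[] seedA;
      private final int[] seedB;

      public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        seedA = new int[numHashes];
        seedB = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
          seedA[i] = rnd.nextInt() | 1;   // odd multiplier for mixing
          seedB[i] = rnd.nextInt();
        }
      }

      /** Signature of a row viewed as a set of string features (e.g. its fields). */
      public int[] signature(Iterable<String> features) {
        int[] sig = new int[seedA.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String f : features) {
          int h = f.hashCode();
          for (int i = 0; i < sig.length; i++) {
            int hi = seedA[i] * h + seedB[i];   // cheap per-hash remix
            if (hi < sig[i]) sig[i] = hi;       // keep the minimum per hash
          }
        }
        return sig;
      }
    }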

On Sun, Jan 9, 2011 at 5:26 AM, Brian McSweeney
brian.mcswee...@gmail.comwrote:

 Thanks Ted,

 You're right but I suppose I was too brief in my initial statement. I
 should
 have said that I have to run an operation on all rows with respect to each
 other. It's not a case of just comparing them and thus sorting them so
 unfortunately I don't think this will help much. Some of the values in the
 rows have to be multiplied together, some have to be compared, some have to
 have a function run against them etc.

 cheers,
 Brian

 On Sun, Jan 9, 2011 at 8:55 AM, Ted Dunning tdunn...@maprtech.com wrote:

  It is, of course, only quadratic, even if you compare all rows to all
 other
  rows.  You can reduce this cost to O(n log n) by ordinary sorting, and you
  can further reduce it to O(n) using radix sort on hashes.
 
  Practically speaking, in either the parallel or non parallel setting try
  sorting each batch of inputs and then doing a merge pass to find
 duplicated
  rows.  Hashing your rows and doing the sort will make things faster if
 your
  rows are very long or if you use radix sort.  Unless your data is vast,
  this
  would probably work on a single machine with no need for parallelism
 since
  sorting billions of items would require 10 passes through your data with
 a
  2^16 way radix sort.
 
  To do this with hadoop, simply do the hashing as before and run a typical
  word count.  Then the rows that duplicate are simply the ones with count > 1
  and these can be preferentially output by the reducer.
 
  On Sat, Jan 8, 2011 at 3:33 PM, Brian McSweeney
  brian.mcswee...@gmail.comwrote:
 
   I'm a TOTAL newbie on hadoop. I have an existing webapp that has a
  growing
   number of rows in a mysql database that I have to compare against one
   another once a day from a batch job. This is an exponential problem as
   every
   row must be compared against every other row. I was thinking of
   parallelizing this computation via hadoop.
  
 



 --
 -
 Brian McSweeney

 Technology Director
 Smarter Technology
 web: http://www.smarter.ie
 phone: +353868578212
 -



Re: Rngd

2011-01-04 Thread Ted Dunning
As it normally stands, rngd will only help (it appears) if you have a
hardware RNG.

You need to cheat and use entropy you don't really have.  If you don't mind
hacking your system, you could even do this:

# mv /dev/random /dev/random.orig
# ln /dev/urandom /dev/random

This makes /dev/random behave as if it were /dev/urandom (which it, strictly
speaking, is after you do this).

Don't let your sysadmin see you do this, of course.

On Tue, Jan 4, 2011 at 12:00 PM, Jon Lederman jon2...@mac.com wrote:

 Hi,

 I am trying to locate the source for rngd to build on my embedded processor
  in order to test whether my hadoop setup is stalled due to low entropy.  Do
  you know where I can find this?  I thought it was part of rng-tools but it's
  not.

 Thanks

 Jon

 Sent from my iPhone

 Sent from my iPhone



Re: What is the runtime efficiency of secondary sorting?

2011-01-03 Thread Ted Dunning
As a point of order, you would normally use a combiner with this problem and
you wouldn't sort in either the combiner or the reducer.  Instead, combiner
and reducer would simply scan and keep the smallest item to emit at the end
of the scan.
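
A minimal sketch of that scan (types are illustrative): the same class can be
registered as both the combiner and the reducer because taking a minimum is
associative and commutative.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MinReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text setId, Iterable<IntWritable> values, Context ctx)
          throws IOException, InterruptedException {
        int min = Integer.MAX_VALUE;
        for (IntWritable v : values) {        // single O(n) scan, no sort
          if (v.get() < min) min = v.get();
        }
        ctx.write(setId, new IntWritable(min));
      }
    }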

As a point of information, most of the rank-based statistics like min, max,
median and quantiles can be approximated in an on-line fashion with O(n)
time and O(1) storage.

Back to the question, though.  Let's assume that you would still like to
sort the elements.  The Hadoop sorts are typically merge-sorts which can be
helped if you provide data in order.  Thus, you should consider (and
probably test) providing a combiner that will sort partial results in order
to make the framework sorts run faster.

Another important consideration is whether the Hadoop shuffle and sort steps
will have to deserialize your data in order to sort it.  If so, you will
almost certainly be better off doing the sort yourself.  I don't know if
your combiner would be subject to the round-trip serialization cost, but I
wouldn't be surprised.

On Mon, Jan 3, 2011 at 2:59 PM, W.P. McNeill bill...@gmail.com wrote:

 Say I have a set of unordered sets of integers:

 A: {2,5,7}
 B: {6,1,9}
 C: {3,8,2,1,6}

 I want to use map/reduce to emit the smallest integer in each set.  If my
 input data looks like this:

 A2
 A5
 A7
 B6
 B1
 ...etc...

 I could use an identity mapper and a reducer like the following

 Reduce(setID, [e0, e1, ... ]):
a = sort [e0, e1, ... ]
Emit(setID, a[0])

 Using standard sort algorithms, this has runtime efficiency of O(N log N),
 where N is the length of [e0, e1, ... ].

 I can write a custom partitioner and grouper to do a secondary
 sort
 http://sonerbalkir.blogspot.com/2010/01/simulating-secondary-sort-on-values.html
 ,
 so that Hadoop sees to it that [e0, e1, ... ] comes into my reducer already
 sorted.  When I do this my reducer becomes simply:

 Reduce(setID, [e0, e1, ... ]):
Emit(setID, e0)

 I understand that this makes things faster because I'm parallelizing the
 sort work, but how much faster?  Specifically is my runtime efficiency now
 O(1), amortized O(1), some function of the cluster size, or still O(N log
 N)
 but with smaller constant factors?

 I think (but am not 100% sure) that this is equivalent to the question,
  "What is the runtime efficiency of map-reduce sort?"

 Also, is there an academic paper with this information that I could cite?

 Usual Google searches and manual perusals were fruitless.  Thanks for your
 help.



Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
Yes.  It is stuck as suggested.  See the bolded lines.

You can help avoid this by dumping additional entropy into the machine via
network traffic.  According to the man page for /dev/random you can cheat by
writing goo into /dev/urandom, but I have been unable to verify that by
experiment.

Is it really necessary to use /dev/random here?  Again from the man page,
there is a strong feeling in the community that only very long lived, high
value keys really need to read from /dev/random.  Session keys from
/dev/urandom are fine.

I wrote an adaptation of the secure seed generator that doesn't block for
Mahout.  It is trivial, but might be useful to copy:
http://svn.apache.org/repos/asf/mahout/trunk/math/src/main/java/org/apache/mahout/common/DevURandomSeedGenerator.java
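
For reference, the idea boils down to reading seed bytes from /dev/urandom so
that seeding never blocks.  A minimal standalone sketch of that idea (not the
Mahout class itself):

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class DevURandomSeed {
      public static byte[] seed(int n) throws IOException {
        byte[] bytes = new byte[n];
        DataInputStream in = new DataInputStream(new FileInputStream("/dev/urandom"));
        try {
          in.readFully(bytes);   // /dev/urandom never blocks waiting for entropy
        } finally {
          in.close();
        }
        return bytes;
      }
    }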



On Mon, Jan 3, 2011 at 3:13 PM, Jon Lederman jon2...@gmail.com wrote:

 I have attached the jstack pid of namenode output.  Does it appear to be
 stuck in SecureRandom as you noted as a possibility?  I am not sure whether
 this is indicated in the following output:

 ...

main prio=10 tid=0x000583c8 nid=0xf3f runnable [0xb729d000]
   java.lang.Thread.State: RUNNABLE
 *at java.io.FileInputStream.readBytes(Native Method)
 *at java.io.FileInputStream.read(FileInputStream.java:236)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked 0x70e59ae8 (a java.io.BufferedInputStream)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked 0x70e59970 (a java.io.BufferedInputStream)
at
 sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:469)
at
 sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:140)
at
 sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:135)
 *at
 sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:131)
 *at
 sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:188)




Re: What is the runtime efficiency of secondary sorting?

2011-01-03 Thread Ted Dunning
On Mon, Jan 3, 2011 at 4:00 PM, W.P. McNeill bill...@gmail.com wrote:

 ... If I write a combiner like this, is there any advantage to also doing a
 secondary sort?


The definitive answer is that it depends.



 As for deserialization, the value in my actual application is a Java object
 with a floating point rank field, and I will be sorting these objects by
 this rank.  Does this make deserialization relatively costly?  (I'm
 guessing
 it does, because it's not as simple as a single number.)


If you can define a binary comparator for your value field that extracts
this single number and compares it, then the framework can sort your data
items without (fully) deserializing them.  This can be a big, big
performance win if only because the serialized form is often smaller so the
merge sort needs to recurse less because more data fits into a certain
amount of memory.
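
A hedged sketch of what such a key might look like (the class name and field
layout are assumptions, not the poster's actual code): serialize the
fixed-width rank first so a raw comparator can order records from the bytes
alone.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class RankKey implements WritableComparable<RankKey> {
      private double rank;
      private Text id = new Text();

      public void write(DataOutput out) throws IOException {
        out.writeDouble(rank);     // fixed 8 bytes first, so the offset is known
        id.write(out);
      }

      public void readFields(DataInput in) throws IOException {
        rank = in.readDouble();
        id.readFields(in);
      }

      public int compareTo(RankKey other) {
        return Double.compare(rank, other.rank);
      }

      /** Orders keys by the leading 8 bytes without deserializing them. */
      public static class RawRankComparator extends WritableComparator {
        public RawRankComparator() {
          super(RankKey.class);
        }
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
          return Double.compare(readDouble(b1, s1), readDouble(b2, s2));
        }
      }

      static {
        // Register the raw comparator so the framework picks it up for this key.
        WritableComparator.define(RankKey.class, new RawRankComparator());
      }
    }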

Without a binary comparator, the framework must deserialize your object
completely and then compare it. That can't help but be slower than avoiding
the deserialization.

Some serialization frameworks like Avro even allow some values to be sorted
without any deserialization.  This is even better, of course, than partial
deserialization.


Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
try

   dd if=/dev/random bs=1 count=100 of=/dev/null

This will likely hang for a long time.

There is no way that I know of to change the behavior of /dev/random except
by changing the file itself to point to a different minor device.  That
would be very bad form.

One thing you may be able to do is to pour lots of entropy into the system via
/dev/urandom.  I was not able to demonstrate this, though, when I just tried
that.  It would be nice if there were a config variable to set that would
change this behavior, but right now, a code change is required (AFAIK).

Another thing to do is replace the use of SecureRandom with a version that
uses /dev/urandom.  That is the point of the code that I linked to.  It
provides a plugin replacement that will not block.

On Mon, Jan 3, 2011 at 4:31 PM, Jon Lederman jon2...@gmail.com wrote:


 Could you give me a bit more information on how I can overcome this issue.
  I am running Hadoop on an embedded processor and networking is turned off
 to the embedded processor. Is there a quick way to check whether this is in
 fact blocking on my system?  And, are there some variables or configuration
 options I can set to avoid any potential blocking behavior?




Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
On Mon, Jan 3, 2011 at 4:48 PM, Jon Lederman jon2...@gmail.com wrote:

 Thanks.  Will try that.  One final question, based on the jstack output I
 sent, is it obvious that the system is blocked due to the behavior of
 /dev/random?



I tried to send you a highlighted markup of your jstack output.

The key thing to look for is some thread reading bytes that nests from
SecureRandom.


 If I just let the FS command run (i.e., hadoop fs -ls), is there any
 guarantee it will eventually return in some relatively finite period of time
 such as hours, or could it potentially take days, weeks, years or eternity?


It depends on how quiet your machine is.  If it has stuff happening, then it
will unwedge eventually.


Re: Help for the problem of running lucene on Hadoop

2011-01-02 Thread Ted Dunning
With even a dozen or two servers, it is very easy to flatten a mysql server
with a hadoop cluster.

Also, mysql is typically a very poor storage system for an inverted index
because it doesn't allow for compression of the posting vectors.

Better to copy Katta in this regard and create many independent indexes.

On Fri, Dec 31, 2010 at 9:56 PM, Jander g jande...@gmail.com wrote:

 Thanks for all the above reply.

 Now my idea is: running word segmentation on Hadoop and creating the
 inverted index in mysql. As we know, Hadoop MR supports writing and reading
 to mysql.

 Does this have any problem?

 On Sat, Jan 1, 2011 at 7:49 AM, James Seigel ja...@tynt.com wrote:

  Check out katta for an example



Re: Hadoop RPC call response post processing

2010-12-28 Thread Ted Dunning
Knowing the tenuring distribution will tell a lot about that exact issue.
 Ephemeral collections take on average less than one instruction per
allocation and the allocation itself is generally only a single instruction.
 For ephemeral garbage, it is extremely unlikely that you can beat that.

So the real question is whether you are actually creating so much garbage
that you are over-whelming the collector or whether the data is much longer
lived than it should be. *That* can cause lots of collection costs.

To tell how long data lives, you need to get the tenuring distribution:

-XX:+PrintTenuringDistribution prints details about the tenuring
distribution to standard out. It can be used to show the tenuring threshold
and the ages of objects in the new generation. It is also useful for
observing the lifetime distribution of an application.
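
If you want to see that distribution for a Hadoop daemon itself, the flags
can be added to the daemon JVM options in hadoop-env.sh (variable name as in
the stock 0.20/0.21-era hadoop-env.sh, shown here for the namenode; adjust
for your deployment):

    export HADOOP_NAMENODE_OPTS="-verbose:gc -XX:+PrintTenuringDistribution $HADOOP_NAMENODE_OPTS"
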
On Tue, Dec 28, 2010 at 11:59 AM, Stefan Groschupf s...@101tec.com wrote:

 I don't think the problem is allocation but garbage collection.



Re: help for using mapreduce to run different code?

2010-12-28 Thread Ted Dunning
if you mean running different code in different mappers, I recommend using
an if statement.

On Tue, Dec 28, 2010 at 2:53 PM, Jander g jande...@gmail.com wrote:

 Whether Hadoop supports the map function running different code? If yes,
 how
 to realize this?



Re: how to run jobs every 30 minutes?

2010-12-28 Thread Ted Dunning
Good quote.

On Tue, Dec 28, 2010 at 3:46 PM, Chris K Wensel ch...@wensel.net wrote:


 deprecated is the new stable.

 https://issues.apache.org/jira/browse/MAPREDUCE-1734

 ckw

 On Dec 28, 2010, at 2:56 PM, Jimmy Wan wrote:

  I've been using Cascading to act as make for my Hadoop processes for
 quite
  some time. Unfortunately, even the most recent distribution of Cascading
 was
  written against the deprecated Hadoop APIs (JobConf) that I'm looking to
  replace. Does anyone have an alternative?
 
  On Tue, Dec 14, 2010 at 18:02, Chris K Wensel ch...@wensel.net wrote:
 
 
  Cascading also has the ability to only run 'stale' processes. Think
 'make'
  file. When re-running a job where only one file of many has changed,
 this is
  a big win.
 

 --
 Chris K Wensel
 ch...@concurrentinc.com
 http://www.concurrentinc.com

 -- Concurrent, Inc. offers mentoring, support, and licensing for Cascading




Re: Hadoop RPC call response post processing

2010-12-27 Thread Ted Dunning
I would be very surprised if allocation itself is the problem as opposed to
good old fashioned excess copying.

It is very hard to write an allocator faster than the java generational gc,
especially if you are talking about objects that are ephemeral.

Have you looked at the tenuring distribution?

On Mon, Dec 27, 2010 at 8:07 PM, Stefan Groschupf s...@101tec.com wrote:

 Hi All,
 I've been browsing the RPC code for quite a while now trying to find any entry
  point / interceptor slot that allows me to handle an RPC call response
  writable after it has been sent over the wire.
  Does anybody have an idea how to break into the RPC code from outside? All the
  interesting methods are private. :(

 Background:
 Heavy use of the RPC allocates huge amounts of Writable objects. We saw in
  multiple systems that the garbage collector can get so busy that the jvm
  almost freezes for seconds. Things like zookeeper sessions time out in those
  cases.
  My idea is to create an object pool for writables. Borrowing an object from
  the pool is simple since that happens in our custom code, though we need to
  know when the writable response was sent over the wire so it can be returned
  into the pool.
  A dirty hack would be to overwrite the write(out) method in the writable,
  assuming that is the last thing done with the writable, though it turns out
  that this method is called in other cases too, e.g. to measure throughput.

 Any ideas?

 Thanks,
 Stefan


Re: How to simulate network delay on 1 node

2010-12-26 Thread Ted Dunning
See also https://github.com/toddlipcon/gremlins



On Sun, Dec 26, 2010 at 11:26 AM, Konstantin Boudnik c...@apache.org wrote:

 Hi there.

 What you are looking at is fault injection.
  I am not sure what version of Hadoop you're looking at, but here's what
  you should take a look at in 0.21 and forward:
  - Herriot system testing framework (which does code instrumentation
 to add special APIs) on a real clusters. Here's some starting
 pointers:
- source code is in src/test/system
- http://wiki.apache.org/hadoop/HowToUseSystemTestFramework
  - fault injection framework (should've been ported to 0.20 as well)
- Source code is under src/test/aop
- http://hadoop.apache.org/hdfs/docs/r0.21.0/faultinject_framework.html

 If you are running on simulated infrastructure you don't need to look
 further than fault injection framework. There's a test in HDFS which
 does pretty much what you're looking for but for pipe-lines (look
 under src/test/aop/org/apache/hadoop/hdfs/*).

 If you are on a physical cluster then you need to use a combination of
 1st and 2nd. The implementation of faults in system tests are coming
 into Hadoop at some point of not very distant future, so you might
 want to wait a little bit.
 --
   Take care,
 Konstantin (Cos) Boudnik

 On Sun, Dec 26, 2010 at 04:25, yipeng yip...@gmail.com wrote:
  Hi everyone,
 
  I would like to simulate network delay on 1 node in my cluster, perhaps
 by
  putting the thread to sleep every time it transfers data non-locally. I'm
  looking at the source but am not sure where to place the code. Is there a
  better way to do it... a tool perhaps? Or could someone point me in the
  right direction?
 
  Cheers,
 
  Yipeng
 



Re: Hadoop/Elastic MR on AWS

2010-12-24 Thread Ted Dunning
EMR instances are started near each other.  This increases the bandwidth
between nodes.

There may also be some enhancements in terms of access to the SAN that
supports EBS.

On Fri, Dec 24, 2010 at 4:41 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 - Original Message 
  From: Amandeep Khurana ama...@gmail.com
  To: common-user@hadoop.apache.org
  Sent: Fri, December 10, 2010 1:14:45 AM
  Subject: Re: Hadoop/Elastic MR on AWS
 
  Mark,
 
  Using EMR makes it very easy to start a cluster and add/reduce  capacity
 as
  and when required. There are certain optimizations that make EMR  an
  attractive choice as compared to building your own cluster out. Using
  EMR


 Could you please point out what optimizations you are referring to?



Re: How frequently can I set status?

2010-12-23 Thread Ted Dunning
It is reasonable to update counters often, but I think you are right to
limit the number of status updates.
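
A minimal sketch of that pattern (names are illustrative, not from the
poster's job): bump a counter on every iteration, since counters are cheap
and aggregated locally, but refresh the human-readable status only
occasionally.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ProgressMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        int iterations = 100000;                 // illustrative loop bound
        for (int i = 0; i < iterations; i++) {
          // ... per-iteration work would go here ...
          ctx.getCounter("app", "iterations").increment(1);   // fine to do often
          if (i % 10000 == 0) {
            ctx.setStatus("processed " + i + " of " + iterations);  // throttled
          }
        }
      }
    }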

On Thu, Dec 23, 2010 at 11:15 AM, W.P. McNeill bill...@gmail.com wrote:

 I have a loop that runs over a large number of iterations (order of
 100,000)
 very quickly.  It is nice to do context.setStatus() with an indication of
 where I am in the loop.  Currently I'm only calling setStatus() every
 10,000
 iterations because I don't want to overwhelm the task trackers with lots of
 status messages.  Is this something I should be worried, about or is Hadoop
 designed to handle a high volume of status messages?  If so, I'll just call
 setStatus() every iteration.



Re: breadth-first search

2010-12-22 Thread Ted Dunning
The Mahout math package has a number of basic algorithms that use
algorithmic efficiencies when given sparse graphs.

A number of other algorithms use only the product of a sparse matrix on
another matrix or a vector.  Since these algorithms never change the
original sparse matrix, they are safe against fill-in problems.

The random projection technique avoids O(v^3) algorithms for computing SVD
or related matrix decompositions.  See http://arxiv.org/abs/0909.4061 and
https://issues.apache.org/jira/browse/MAHOUT-376

None of these algorithms are specific to graph theory, but all deal
with methods that are useful with sparse graphs.

On Wed, Dec 22, 2010 at 10:46 AM, Ricky Ho rickyphyl...@yahoo.com wrote:

 Can you point me to Matrix algorithms that is tuned for sparse graph ?
  What I
 mean is from O(v^3) to O(v*e)  where v = number of vertex and e = number of
 edges.



Re: breadth-first search

2010-12-21 Thread Ted Dunning
Ahh... I see what you mean.

This algorithm can be implemented with all of the iterations for all points
proceeding in parallel.  You should only need 4 map-reduce steps, not 400.
 This will still take several minutes on Hadoop, but as your problem
increases in size, this overhead becomes less important.
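
A minimal sketch of one such parallel iteration (the line format and class
names are my own illustration, not Wei's code).  Each node's record carries
its adjacency list plus the best known distance from every seed; the mapper
re-emits the record and proposes distance + 1 to each neighbor, and the
reducer keeps the minimum per seed.  Seed nodes start with a "self:0" entry
in their distance field, and after about 4 iterations (the diameter here)
the distances stop changing.

    // Record format per line:  nodeId <TAB> n1,n2,... <TAB> src1:d1 src2:d2 ...
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MultiSourceBfs {
      public static class BfsMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split("\t", 3);
          String node = parts[0];
          String adjacency = parts.length > 1 ? parts[1] : "";
          String distances = parts.length > 2 ? parts[2] : "";
          // Pass the node's own structure through unchanged.
          ctx.write(new Text(node), new Text("A\t" + adjacency + "\t" + distances));
          if (distances.isEmpty() || adjacency.isEmpty()) return;
          // Propose (source, d + 1) for every known source to every neighbor.
          StringBuilder plusOne = new StringBuilder();
          for (String sd : distances.split(" ")) {
            String[] kv = sd.split(":");
            plusOne.append(kv[0]).append(':').append(Integer.parseInt(kv[1]) + 1).append(' ');
          }
          Text proposal = new Text("D\t" + plusOne.toString().trim());
          for (String neighbor : adjacency.split(",")) {
            ctx.write(new Text(neighbor), proposal);
          }
        }
      }

      public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text node, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          String adjacency = "";
          Map<String, Integer> best = new HashMap<String, Integer>();
          for (Text v : values) {
            String[] parts = v.toString().split("\t");
            String dists = "";
            if (parts[0].equals("A")) {
              adjacency = parts.length > 1 ? parts[1] : "";
              dists = parts.length > 2 ? parts[2] : "";
            } else {
              dists = parts.length > 1 ? parts[1] : "";
            }
            for (String sd : dists.isEmpty() ? new String[0] : dists.split(" ")) {
              String[] kv = sd.split(":");
              int d = Integer.parseInt(kv[1]);
              Integer old = best.get(kv[0]);
              if (old == null || d < old) best.put(kv[0], d);   // keep the shortest
            }
          }
          StringBuilder out = new StringBuilder();
          for (Map.Entry<String, Integer> e : best.entrySet()) {
            out.append(e.getKey()).append(':').append(e.getValue()).append(' ');
          }
          // TextOutputFormat writes key <TAB> value, matching the input format.
          ctx.write(node, new Text(adjacency + "\t" + out.toString().trim()));
        }
      }
    }

Iterate this job a fixed number of times (or until a counter of changed
distances stays at zero); the average separation then comes from one final
pass over the distance fields.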

2010/12/21 Peng, Wei wei.p...@xerox.com

 The graph that my BFS algorithm is running on only needs 4 levels to reach
  all nodes. The reason I say many iterations is that there are 100 source
  nodes, so 400 iterations in total. The algorithm should be right, and I
 cannot think of anything to reduce the number of iterations.

 Ted, I will check out the links that you sent to me.
 I really appreciate your help.

 Wei
 -Original Message-
 From: Edward J. Yoon [mailto:edw...@udanax.org]
 Sent: Tuesday, December 21, 2010 1:27 AM
 To: common-user@hadoop.apache.org
 Subject: Re: breadth-first search

 There's no release yet.

 But, I had tested the BFS using hama and, hbase.

 Sent from my iPhone

 On 2010. 12. 21., at 11:30 AM, Peng, Wei wei.p...@xerox.com wrote:

  Yoon,
 
  Can I use HAMA now, or it is still in development?
 
  Thanks
 
  Wei
 
  -Original Message-
  From: Edward J. Yoon [mailto:edwardy...@apache.org]
  Sent: Monday, December 20, 2010 6:23 PM
  To: common-user@hadoop.apache.org
  Subject: Re: breadth-first search
 
  Check this slide out -
  http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
 
  On Tue, Dec 21, 2010 at 10:49 AM, Peng, Wei wei.p...@xerox.com wrote:
 
   I implemented an algorithm to run hadoop on a 25GB graph data to
  calculate its average separation length.
  The input format is V1(tab)V2 (where V2 is the friend of V1).
  My purpose is to first randomly select some seed nodes, and then for
  each node, calculate the shortest paths from this node to all other
  nodes on the graph.
 
  To do this, I first run a simple python code in a single machine to get
  some random seed nodes.
  Then I run a hadoop job to generate adjacent list for each node as the
  input for the second job.
 
  The second job takes the adjacent list input and output the first level
  breadth-first search result. The nodes which are the friends of the seed
  node have distance 1. Then this output is the input for the next hadoop
  job so on so forth, until all the nodes are reached.
 
  I generated a simulated graph for testing. This data has only 100 nodes.
  Normal python code can find the separation length within 1 second (100
  seed nodes). However, the hadoop took almost 3 hours to do that
  (pseudo-distributed mode on one machine)!!
 
  I wonder if there is a more efficient way to do breadth-first search in
  hadoop? It is very inefficient to output so many intermediate results.
  Totally there would be seedNodeNumber*levelNumber+1 jobs,
  seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow?
 
  Please help.  Thanks!
 
  Wei
 
 
 
 
  --
  Best Regards, Edward J. Yoon
  edwardy...@apache.org
  http://blog.udanax.org



Re: breadth-first search

2010-12-21 Thread Ted Dunning
Absolutely true.  Nobody should pretend otherwise.

On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei wei.p...@xerox.com wrote:

 Hadoop is useful when the data is huge and cannot fit into memory, but it
 does not seem to be a real-time solution.


Re: Friends of friends with MapReduce

2010-12-20 Thread Ted Dunning
On Mon, Dec 20, 2010 at 9:39 AM, Antonio Piccolboni anto...@piccolboni.info
 wrote:

 For an easy solution, use hive. Let's say your record contains userid and
 friendid and the table is called friends
 Then you would do
 select A.userid , B.friendid from friends A join friends B on (A.friendid =
 B.userid)

 This is on top of my mind, sorry if some details are off, but I've done it
 in the past on large datasets (~100M rows).That's it. Do that in java and
 tell me if it isn't at least 50 lines of code.


In raw java it will be a lot of code.

In Plume, it should be just a few lines, most of which will have to do with
reading the data.

Pig and Hive will definitely be the most concise though.


Re: breadth-first search

2010-12-20 Thread Ted Dunning
On Mon, Dec 20, 2010 at 8:16 PM, Peng, Wei wei.p...@xerox.com wrote:

 ... My question is really about what is the efficient way for graph
 computation, matrix computation, algorithms that need many iterations to
 converge (with intermediate results).


Large graph computations usually assume a sparse graph for historical
reasons.  A key property of scalable
algorithms is that the time and space are linear in the input size.  Most
all-pairs path algorithms are not linear because
the result is n x n and is dense.

Some graph path computations can be done indirectly by spectral methods.
 With good random projection algorithms for sparse matrix decomposition,
approximate versions of some of these algorithms can be phrased in a
scalable fashion.  It isn't an easy task, however.


 HAMA looks like a very good solution, but can we use it now and how to
 use it?


I don't think that Hama has produced any usable software yet.


Re: breadth-first search

2010-12-20 Thread Ted Dunning
On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei wei.p...@xerox.com wrote:

 ...
 Currently, most of the matrix data (graph matrix, document-word matrix)
 that we are dealing with are sparse.


Good.


 The matrix decomposition often needs many iterations to converge, then
 intermediate results have to be saved to serve as the input for the next
 iteration.


I think you are thinking of the wrong algorithms.  The ones that I am
talking about converge
in a fixed and small number of steps.  See
https://issues.apache.org/jira/browse/MAHOUT-376 for the work
in progress on this.


 This is super inefficient. As I mentioned, the BFS algorithm written in
 python took 1 second to run, however, hadoop took almost 3 hours. I
 would expect hadoop to be slower, but not this slow.


I think you have a combination of factors here.  But, even accounting for
having too many
iterations in your BFS algorithm, iterations with stock Hadoop take tens of
seconds even if
they do nothing.  If you arrange your computation to need many iterations,
it will be slow.


All the hadoop applications that I saw are all very simple calculations,
 I wonder how it can be applied to machine learning/data mining
 algorithms.


Check out Mahout.  There is a lot of machine learning going on there both on
Hadoop and using other scalable methods.


 Is HAMA the only way to solve it? If it is not ready to use yet, then
 can I assume hadoop is not a good solution for multiple iteration
 algorithms now?


I don't see much evidence that HAMA will ever solve anything, so I wouldn't
recommend pinning your hopes on that.

For fast, iterative map-reduce, you really need to keep your mappers and
reducers live between iterations.  Check out
twister for that:  http://www.iterativemapreduce.org/


Re: InputFormat for a big file

2010-12-17 Thread Ted Dunning
a) this is a small file by hadoop standards.  You should be able to process
it by conventional methods on a single machine in about the same time it
takes to start a hadoop job that does nothing at all.

b) reading a single line at a time is not as inefficient as you might think.
 If you write a mapper that reads each line, converts it to an integer and
outputs a constant integer as the key with that number as the value, the
mapper will process the data reasonably quickly.  If you add a combiner and
a reducer that adds up the numbers in a list, then the amount of data spilled
will be nearly zero.
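
A minimal sketch of that job (names are illustrative): every line becomes
(constant key, value), and one summing class serves as both combiner and
reducer, so almost nothing is spilled or shuffled.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SumNumbers {
      public static class ParseMapper
          extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private static final IntWritable ONE_KEY = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          long n = Long.parseLong(line.toString().trim());
          ctx.write(ONE_KEY, new LongWritable(n));   // constant key -> one group
        }
      }

      /** Used as both combiner and reducer. */
      public static class SumReducer
          extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<LongWritable> values,
            Context ctx) throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) sum += v.get();
          ctx.write(key, new LongWritable(sum));
        }
      }
    }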


On Fri, Dec 17, 2010 at 7:58 AM, madhu phatak phatak@gmail.com wrote:

 Hi
 I have a very large file of size 1.4 GB. Each line of the file is a number.
 I want to find the sum of all those numbers.
 I wanted to use NLineInputFormat as an InputFormat but it sends only one
 line to the Mapper, which is very inefficient.
 So can you guide me to write an InputFormat which splits the file
 into multiple splits so that each mapper can read multiple
 lines from its split?

 Regards
 Madhukar



Re: Please help with hadoop configuration parameter set and get

2010-12-17 Thread Ted Dunning
Statics won't work the way you might think because different mappers and
different reducers are all running in different JVM's.  It might work in
local mode, but don't kid yourself about it working in a distributed mode.
 It won't.
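
A minimal sketch of the counter-based alternative (group and counter names
are illustrative): the mapper bumps a counter whenever it expands a node, and
the driver reads that counter after each job finishes, which works across
JVMs where a static field cannot.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BfsDriver {
      public static void main(String[] args) throws Exception {
        boolean expanding = true;
        while (expanding) {
          Job job = new Job(new Configuration(), "bfs step");
          // job.setMapperClass(...), setReducerClass(...), input/output paths, etc.
          job.waitForCompletion(true);
          long expanded = job.getCounters().findCounter("bfs", "expanded").getValue();
          expanding = expanded > 0;   // stop when no mapper expanded anything
        }
      }
    }

    // In the mapper, whenever a node is expanded:
    //   context.getCounter("bfs", "expanded").increment(1);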

On Fri, Dec 17, 2010 at 8:31 AM, Peng, Wei wei.p...@xerox.com wrote:

 Arindam, how to set this global static Boolean variable?
 I have tried to do something similar yesterday in the following:
 Public class BFSearch
 {
Private static boolean expansion;
Public static class MapperClass {if no nodes expansion = false;}
Public static class ReducerClass
Public static void main {expansion = true; run job;
 print(expansion)}
 }
 In this case, expansion is still true.
 I will look at hadoop counter and report back here later.

 Thank you for all your help
 Wei

 -Original Message-
 From: Arindam Khaled [mailto:akha...@utdallas.edu]
 Sent: Friday, December 17, 2010 10:35 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Please help with hadoop configuration parameter set and get

 I did something like this using a global static boolean variable
 (flag) while I was implementing breadth first IDA*. In my case, I set
 the flag to something else if a solution was found, which was examined
 in the reducer.

 I guess in your case, since you know that if the mappers don't produce
 anything the reducers won't have anything as input, if I am not wrong.

 And I had chaining map-reduce jobs (
 http://developer.yahoo.com/hadoop/tutorial/module4.html
  ) running until a solution was found.


 Kind regards,

 Arindam Khaled





 On Dec 17, 2010, at 12:58 AM, Peng, Wei wrote:

  Hi,
 
 
 
  I am a newbie of hadoop.
 
  Today I was struggling with a hadoop problem for several hours.
 
 
 
  I initialize a parameter by setting job configuration in main.
 
  E.g. Configuration con = new Configuration();
 
  con.set("test", "1");
 
  Job job = new Job(con);
 
 
 
  Then in the mapper class, I want to set test to 2. I did it by
 
  context.getConfiguration().set("test", "2");
 
 
 
  Finally in the main method, after the job is finished, I check the
  test again by
 
  job.getConfiguration().get("test");
 
 
 
  However, the value of test is still 1.
 
 
 
  The reason why I want to change the parameter inside Mapper class is
  that I want to determine when to stop an iteration in the main method.
  For example, for doing breadth-first search, when there is no new
  nodes
  are added for further expansion, the searching iteration should stop.
 
 
 
  Your help will be deeply appreciated. Thank you
 
 
 
  Wei
 




Re: Needs a simple answer

2010-12-16 Thread Ted Dunning
Maha,

Remember that the mapper is not running on the same machine as the main
class.  Thus local files aren't where you think.

On Thu, Dec 16, 2010 at 1:06 PM, maha m...@umail.ucsb.edu wrote:

 Hi all,

   Why would the following lines work in the main class (WordCount) and not
 in the Mapper, even though "myconf" is set in WordCount to point to the
 object returned by getConf()?

 try{
FileSystem hdfs = FileSystem.get(wc.WordCount.myconf);
    hdfs.copyFromLocalFile(new Path("/Users/file"), new
 Path("/tmp/file"));
   }catch(Exception e) { System.err.print("\nError");}


  Also, the print statement will never print on console unless it's in my
 run function..

  Appreciate it :)

Maha




Re: how to run jobs every 30 minutes?

2010-12-13 Thread Ted Dunning
Or even simpler, try Azkaban: http://sna-projects.com/azkaban/

On Mon, Dec 13, 2010 at 9:26 PM, edward choi mp2...@gmail.com wrote:

 Thanks for the tip. I took a look at it.
 Looks similar to Cascading I guess...?
 Anyway thanks for the info!!

 Ed

 2010/12/8 Alejandro Abdelnur t...@cloudera.com

  Or, if you want to do it in a reliable way you could use an Oozie
  coordinator job.
 
  On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote:
   My mistake. Come to think about it, you are right, I can just make an
   infinite loop inside the Hadoop application.
   Thanks for the reply.
  
   2010/12/7 Harsh J qwertyman...@gmail.com
  
   Hi,
  
   On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote:
Hi,
   
I'm planning to crawl a certain web site every 30 minutes.
How would I get it done in Hadoop?
   
In pure Java, I used Thread.sleep() method, but I guess this won't
  work
   in
Hadoop.
  
   Why wouldn't it? You need to manage your post-job logic mostly, but
   sleep and resubmission should work just fine.
  
Or if it could work, could anyone show me an example?
   
Ed.
   
  
  
  
   --
   Harsh J
   www.harshj.com
  
  
 



Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?

2010-12-11 Thread Ted Dunning
Of course.  It is just a set of Hadoop programs.

2010/12/11 edward choi mp2...@gmail.com

 Can I operate Bixo on a cluster other than Amazon EC2?


