Re: Poor IO performance on a 10 node cluster.

2011-06-01 Thread Ted Dunning
It is also worth using dd to verify your raw disk speeds. Also, expressing disk transfer rates in bytes per second makes it a bit easier for most of the disk people I know to figure out what is large or small. Each of these disks should do about 100MB/s when driven well. Hadoop does OK,
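A hedged sketch of the sort of dd test Ted means (the device and file paths are placeholders; oflag=direct bypasses the page cache so you measure the disk rather than RAM):

    # raw sequential read speed of one disk
    dd if=/dev/sdb of=/dev/null bs=1M count=1000
    # sequential write speed through the filesystem, skipping the buffer cache
    dd if=/dev/zero of=/data1/ddtest bs=1M count=1000 oflag=direct

Both should report on the order of 100MB/s for a healthy 7200rpm drive of that era.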

Re: trying to select technology

2011-05-31 Thread Ted Dunning
To pile on, thousands or millions of documents are well within the range that is well addressed by Lucene. Solr may be an even better option than bare Lucene since it handles lots of the boilerplate problems like document parsing and index update scheduling. On Tue, May 31, 2011 at 11:56 AM,

Re: Simple change to WordCount either times out or runs 18+ hrs with little progress

2011-05-24 Thread Ted Dunning
itr.nextToken() is inside the if. On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote: while (itr.hasMoreTokens()) { if(count == 5) { word.set(itr.nextToken()); output.collect(word, one); } count++; }
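That is, the token is consumed only on the iteration where count == 5, so once count passes 5 hasMoreTokens() never changes and the loop spins forever. A minimal fix (a sketch, using the names from the quoted snippet):

    while (itr.hasMoreTokens()) {
        String token = itr.nextToken(); // always advance the iterator
        if (count == 5) {
            word.set(token);
            output.collect(word, one);
        }
        count++;
    }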

Re: Hadoop and WikiLeaks

2011-05-19 Thread Ted Dunning
ZK started as a sub-project of Hadoop. On Thu, May 19, 2011 at 7:27 AM, M. C. Srivas mcsri...@gmail.com wrote: Interesting to note that Cassandra and ZK are now considered Hadoop projects. They were independent of Hadoop before the recent update. On Thu, May 19, 2011 at 4:18 AM, Steve

Re: matrix-vector multiply in hadoop

2011-05-17 Thread Ted Dunning
Try using the Apache Mahout code that solves exactly this problem. Mahout has a distributed row-wise matrix that is read one row at a time. Dot products with the vector are computed and the results are collected. This capability is used extensively in the large scale SVDs in Mahout. On Tue,

Re: Suggestions for swapping issue

2011-05-11 Thread Ted Dunning
How is it that 36 processes are not expected if you have configured 48 + 12 = 60 slots available on the machine? On Wed, May 11, 2011 at 11:11 AM, Adi adi.pan...@gmail.com wrote: By our calculations hadoop should not exceed 70% of memory. Allocated per node - 48 map slots (24 GB) , 12 reduce

Re: questions about hadoop map reduce and compute intensive related applications

2011-04-30 Thread Ted Dunning
On Sat, Apr 30, 2011 at 12:18 AM, elton sky eltonsky9...@gmail.com wrote: I got 2 questions: 1. I am wondering how hadoop MR performs when it runs compute intensive applications, e.g. Monte carlo method compute PI. There's an example in 0.21, QuasiMonteCarlo, but that example doesn't use

Re: Serving Media Streaming

2011-04-30 Thread Ted Dunning
Check out S4 http://s4.io/ On Fri, Apr 29, 2011 at 10:13 PM, Luiz Fernando Figueiredo luiz.figueir...@auctorita.com.br wrote: Hi guys. Hadoop is well known to process large amounts of data but we think that we can do much more than it. Our goal is to try to serve pseudo-streaming near of

Re: Applications creates bigger output than input?

2011-04-30 Thread Ted Dunning
Cooccurrence analysis is commonly used in recommendations. These produce large intermediates. Come on over to the Mahout project if you would like to talk to a bunch of people who work on these problems. On Fri, Apr 29, 2011 at 9:31 PM, elton sky eltonsky9...@gmail.com wrote: Thank you for

Re: providing the same input to more than one Map task

2011-04-22 Thread Ted Dunning
I would recommend taking this question to the Mahout mailing list. The short answer is that matrix multiplication by a column vector is pretty easy. Each mapper reads the vector in the configure method and then does a dot product for each row of the input matrix. Results are reassembled into a
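A hedged sketch of that recipe against the old (0.20) mapred API. The encodings are assumptions for illustration: input is (row index, row) pairs with rows serialized as comma-separated doubles, and the vector rides in on the job configuration (a real job might ship it via the DistributedCache instead):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class RowDotProductMapper extends MapReduceBase
        implements Mapper<IntWritable, Text, IntWritable, DoubleWritable> {

      private double[] v; // the shared vector, loaded once per mapper instance

      public void configure(JobConf job) {
        String[] parts = job.get("matvec.vector").split(",");
        v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
          v[i] = Double.parseDouble(parts[i]);
        }
      }

      public void map(IntWritable rowIndex, Text row,
                      OutputCollector<IntWritable, DoubleWritable> out,
                      Reporter reporter) throws IOException {
        String[] cells = row.toString().split(",");
        double sum = 0;
        for (int i = 0; i < cells.length; i++) {
          sum += Double.parseDouble(cells[i]) * v[i]; // dot product with v
        }
        out.collect(rowIndex, new DoubleWritable(sum)); // element rowIndex of A*v
      }
    }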

Re: Estimating Time required to compute M/Rjob

2011-04-17 Thread Ted Dunning
Turing completeness isn't the central question here, really. The truth is, map-reduce programs are under considerable pressure to be written in a scalable fashion, which limits them to fairly simple behaviors that result in pretty linear dependence of run-time on input size for a given program. The cool

Re: Estimating Time required to compute M/Rjob

2011-04-16 Thread Ted Dunning
Sounds like this paper might help you: Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning by Ganapathi, Archana, Harumi Kuno, Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, David Patterson http://radlab.cs.berkeley.edu/publication/187

Re: Dynamic Data Sets

2011-04-13 Thread Ted Dunning
Hbase is very good at this kind of thing. Depending on your aggregation needs, OpenTSDB might be interesting since they store and query against large amounts of time-ordered data similar to what you want to do. It isn't clear to me whether your data is primarily about current state or about

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
nothing architecture. This may be more database terminology that could be addressed by hbase, but I think it is good background for the questions of memory mapping files in hadoop. Kevin -Original Message- From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Tuesday, April 12, 2011

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
, Jason Rutherglen jason.rutherg...@gmail.com wrote: Then one could MMap the blocks pertaining to the HDFS file and piece them together. Lucene's MMapDirectory implementation does just this to avoid an obscure JVM bug. On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning tdunn...@maprtech.com wrote

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Blocks live where they land when first created. They can be moved due to node failure or rebalancing, but it is typically pretty expensive to do this. It certainly is slower than just reading the file. If you really, really want mmap to work, then you need to set up some native code that builds

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Actually, it doesn't become trivial. It just becomes total fail or total win instead of almost always being partial win. It doesn't meet Benson's need. On Tue, Apr 12, 2011 at 11:09 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: To get around the chunks or blocks problem, I've been

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Benson is actually a pretty sophisticated guy who knows a lot about mmap. I engaged with him yesterday on this since I know him from Apache. On Tue, Apr 12, 2011 at 7:16 PM, M. C. Srivas mcsri...@gmail.com wrote: I am not sure if you realize, but HDFS is not VM integrated.

Re: Using global reverse lookup tables

2011-04-11 Thread Ted Dunning
Depending on the function that you want to use, it sounds like you want to use a self join to compute transposed cooccurrence. That is, it sounds like you want to find all the sets that share elements with X. If you have a binary matrix A that represents your set membership with one row per set
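A one-line sketch of where that construction leads, assuming A is the sets-by-elements 0/1 membership matrix the preview describes:

    (A A')_ij = sum_k A_ik A_jk = |S_i intersect S_j|

so the nonzero entries in row i of the self product A A' name exactly the sets that share at least one element with set i, which is the reverse lookup being asked for.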

Re: Memory mapped resources

2011-04-11 Thread Ted Dunning
Also, it only provides access to a local chunk of a file which isn't very useful. On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Yes you can however it will require customization

Re: Memory mapped resources

2011-04-11 Thread Ted Dunning
providing access to the underlying file block? On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning tdunn...@maprtech.com wrote: Also, it only provides access to a local chunk of a file which isn't very useful. On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote

Re: Architectural question

2011-04-10 Thread Ted Dunning
There are no subtle ways to deal with quadratic problems like this. They just don't scale. Your suggestions are roughly on course. When matching 10GB against 50GB, the choice of which input to use as input to the mapper depends a lot on how much you can buffer in memory and how long such a

Re: Architectural question

2011-04-10 Thread Ted Dunning
The original poster said that there was no common key. Your suggestion presupposes that such a key exists. On Sun, Apr 10, 2011 at 4:29 PM, Mehmet Tepedelenlioglu mehmets...@gmail.com wrote: My understanding is you have two sets of strings S1, and S2 and you want to mark all strings that

Re: We are looking to the root of the problem that caused us IOException

2011-04-06 Thread Ted Dunning
Yes. At least periodically. You now have a situation where the age distribution of blocks in each datanode is quite different. This will lead to different evolution of which files are retained and that is likely to cause imbalances again. It will also cause the performance of your system to be

Re: We are looking to the root of the problem that caused us IOException

2011-04-05 Thread Ted Dunning
You can configure the balancer to use higher bandwidth. That can speed it up by 10x. On Tue, Apr 5, 2011 at 2:54 AM, Guy Doulberg guy.doulb...@conduit.com wrote: We are running the balancer, but it takes a lot of time... in this time the cluster is not working
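For reference, the knob in question in that era is dfs.balance.bandwidthPerSec in hdfs-site.xml, in bytes per second. A minimal sketch giving roughly the 10x Ted mentions over the default:

    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <value>10485760</value> <!-- 10 MB/s; the default is 1048576 (1 MB/s) -->
    </property>

The datanodes read this setting at startup, so it has to be pushed cluster-wide and the datanodes restarted before the balancer sees the higher limit.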

Re: HBase schema design

2011-04-04 Thread Ted Dunning
The hbase list would be more appropriate. See http://hbase.apache.org/mail-lists.html There is an active IRC channel, but your question fits the mailing list better so pop on over and I will give you some comments. In the meantime, take a look at OpenTSDB

Re: Reverse Indexing Programming Help

2011-03-31 Thread Ted Dunning
It would help to get a good book. There are several. For your program, there are several things that will trip you up: a) lots of little files is going to be slow. You want input that is 100MB per file if you want speed. b) That file format is a bit cheesy since it is hard to tell URLs from

Re: Chukwa - Lightweight agents

2011-03-20 Thread Ted Dunning
Take a look at openTsDb at http://opentsdb.net/ It provides lots of the capability in a MUCH simpler package. On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void@gmail.com wrote: Sorry but it doesn't look like Chukwa mailing list exists anymore? Is there an easy way to set up lightweight

Re: Chukwa - Lightweight agents

2011-03-20 Thread Ted Dunning
into Hadoop. Doesn't really look like opentsdb is meant for that. I could be wrong though? On 3/20/11 9:49 AM, Ted Dunning wrote: Take a look at openTsDb at http://opentsdb.net/ It provides lots of the capability in a MUCH simpler package. On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void

Re: Inserting many small files into HBase

2011-03-20 Thread Ted Dunning
Take a look at this: http://wiki.apache.org/hadoop/Hbase/DesignOverview then read the bigtable paper. On Sun, Mar 20, 2011 at 6:39 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl thousands of news rss feeds via MapReduce, and save each news article into HBase directly.

Re: decommissioning node woes

2011-03-19 Thread Ted Dunning
Unfortunately this doesn't help much because it is hard to get the ports to balance the load. On Fri, Mar 18, 2011 at 8:30 PM, Michael Segel michael_se...@hotmail.com wrote: With a 1GbE port, you could go 100Mb/s for the bandwidth limit. If you bond your ports, you could go higher.

Re: how to build kmeans

2011-03-18 Thread Ted Dunning
These java files are full of HTML. Are you sure that they are supposed to compile? How did you get these files? On Fri, Mar 18, 2011 at 3:12 AM, MANISH SINGLA coolmanishh...@gmail.com wrote: Hii everyone... I am trying to run kmeans on a single node... I have the attached files with me...I

Re: how to build kmeans

2011-03-18 Thread Ted Dunning
This looks like you took the code from http://code.google.com/p/kmeans-hadoop/ And it looks like you didn't actually download the code, but you cut and pasted the HTML rendition of the code. First, this code is not from a serious project. It is more of a

Re: decommissioning node woes

2011-03-18 Thread Ted Dunning
If nobody else more qualified is willing to jump in, I can at least provide some pointers. What you describe is a bit surprising. I have zero experience with any 0.21 version, but decommissioning was working well in much older versions, so this would be a surprising regression. The observations

Re: decommissioning node woes

2011-03-18 Thread Ted Dunning
. If you just shut the node off, the blocks will replicate faster. James. On 2011-03-18, at 10:03 AM, Ted Dunning wrote: If nobody else more qualified is willing to jump in, I can at least provide some pointers. What you describe is a bit surprising. I have zero experience with any

Re: hadoop fs -rmr /*?

2011-03-16 Thread Ted Dunning
W.P. is correct, however, that standard techniques like snapshots and mirrors and point-in-time backups do not exist in standard hadoop. This requires a variety of creative work-arounds if you use stock hadoop. It is not uncommon for people to have memories of either removing everything or

Re: Why hadoop is written in java?

2011-03-16 Thread Ted Dunning
Note that that comment is now 7 years old. See Mahout for a more modern take on numerics using Hadoop (and other tools) for scalable machine learning and data mining. On Wed, Mar 16, 2011 at 10:43 AM, baloodevil dukek...@hotmail.com wrote: See this for comment on java handling numeric

Re: k-means

2011-03-04 Thread Ted Dunning
Since you asked so nicely: http://www.manning.com/owen/ On Fri, Mar 4, 2011 at 6:52 AM, Mike Nute mike.n...@gmail.com wrote: James, Do you know how to get a copy of this book in early access form? Amazon doesn't release it until may. Thanks! Mike Nute --Original Message-- From:

Re: Digital Signal Processing Library + Hadoop

2011-03-04 Thread Ted Dunning
Come on over to the Apache Mahout mailing list for a warm welcome at least. We don't have a lot of time series stuff but would be very interested in hearing more about what you need and would like to see if there are some common issues that we might work on together. On Fri, Mar 4, 2011 at 9:05

Re: Performance Test

2011-03-02 Thread Ted Dunning
It will be very difficult to do. If you have n machines running 4 different things, you will probably get better results segregating tasks as much as possible. Interactions can be very subtle and can have major impact on performance in a few cases. Hadoop, in general, will use a lot of the

Re: Advice for a new open-source project and a license

2011-03-01 Thread Ted Dunning
Bixo may have some useful components. The thrust is different, but some of the pieces are similar. http://bixo.101tec.com/ On Mon, Feb 28, 2011 at 7:57 PM, Mark Kerzner markkerz...@gmail.com wrote: Well, it's more complex than that. I packed all files (or selected directories) into zip

Re: Advice for a new open-source project and a license

2011-02-28 Thread Ted Dunning
Check out http://www.elasticsearch.org/ Not what you are doing, but possibly a helpful bit of the pie. Also, Solr integrates Tika and Lucene pretty nicely these days. No Hbase yet, but it isn't hard to add that. On Mon, Feb 28, 2011 at 1:01 PM, Mark Kerzner

Re: Hadoop Case Studies?

2011-02-27 Thread Ted Dunning
Ted, Greetings back at you. It has been a while. Check out Jimmy Lin and Chris Dyer's book about text processing with hadoop: http://www.umiacs.umd.edu/~jimmylin/book.html On Sun, Feb 27, 2011 at 4:34 PM, Ted Pedersen tpede...@d.umn.edu wrote: Greetings all, I'm teaching an undergraduate

Re: Hadoop Case Studies?

2011-02-27 Thread Ted Dunning
Harrington is out there, I stole this from you.) Lance On Sun, Feb 27, 2011 at 4:55 PM, Ted Dunning tdunn...@maprtech.com wrote: Ted, Greetings back at you. It has been a while. Check out Jimmy Lin and Chris Dyer's book about text processing with hadoop: http

Re: Quick question

2011-02-20 Thread Ted Dunning
This is the most important thing that you have said. The map function is called once per unit of input but the mapper object persists across many units of input. You have a little bit of control over how many mapper objects there are and how many machines they are created on and how many

Re: Quick question

2011-02-18 Thread Ted Dunning
The input is effectively split by lines, but under the covers, the actual splits are by byte. Each mapper will cleverly scan from the specified start to the next line after the start point. At the end, it will over-read to the end of the line that is at or after the end of its specified region.
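A concrete illustration (the numbers are made up): given a split covering bytes [64MB, 128MB) of a text file, the record reader seeks to 64MB, discards the partial line up to the first newline (that line belongs to the previous split), then reads whole lines, finishing the line that straddles the 128MB boundary even though its tail lies in the next split. Every line is therefore read exactly once, by exactly one mapper.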

Re: benchmark choices

2011-02-18 Thread Ted Dunning
MalStone looks like a very narrow benchmark. Terasort is also a very narrow and somewhat idiosyncratic benchmark, but it has the characteristic that lots of people use it. You should add PigMix to your list. There are Java versions of the problems in PigMix that make a pretty good set of benchmarks

Re: benchmark choices

2011-02-18 Thread Ted Dunning
I just read the malstone report. They report times for a Java version that is many times (5x) slower than for a streaming implementation. That single fact indicates that the Java code is so appallingly bad that this is a very bad benchmark. On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout

Re: Hadoop in Real time applications

2011-02-17 Thread Ted Dunning
Remove the "for" at the end that got sucked in by the email editor. On Thu, Feb 17, 2011 at 5:56 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Ted, thanks for the links, the yahoo.com one doesn't seem to exist? On Wed, Feb 16, 2011 at 11:48 PM, Ted Dunning tdunn...@maprtech.com wrote

Re: DataCreator

2011-02-16 Thread Ted Dunning
Sounds like Pig. Or Cascading. Or Hive. Seriously, isn't this already available? On Wed, Feb 16, 2011 at 7:06 AM, Guy Doulberg guy.doulb...@conduit.com wrote: Hey all, I want to consult with you hadoopers about a Map/Reduce application I want to build. I want to build a map/reduce job,

Re: Trying out AvatarNode

2011-02-16 Thread Ted Dunning
http://wiki.apache.org/hadoop/HowToContribute On Wed, Feb 16, 2011 at 8:23 AM, Mark Kerzner markkerz...@gmail.com wrote: if there is a place to learn about patch practices, please point me to it.

Re: Hadoop in Real time applications

2011-02-16 Thread Ted Dunning
Unless you go beyond the current standard semantics, this is true. See here: http://code.google.com/p/hop/ and http://labs.yahoo.com/node/476 for alternatives. On Wed, Feb 16, 2011 at 10:30 PM, madhu phatak phatak@gmail.com wrote: Hadoop is not suited for real time applications On Thu,

Re: recommendation on HDDs

2011-02-15 Thread Ted Dunning
Good idea! Would you like to create the nucleus of such a page? (there might already be something like that) On Tue, Feb 15, 2011 at 8:49 AM, Shrinivas Joshi jshrini...@gmail.com wrote: It would be nice to have a wiki page collecting all this good information.

Re: hadoop 0.20 append - some clarifications

2011-02-14 Thread Ted Dunning
-- *From:* Ted Dunning [mailto:tdunn...@maprtech.com] *Sent:* Friday, February 11, 2011 2:14 PM *To:* common-user@hadoop.apache.org; gok...@huawei.com *Cc:* hdfs-u...@hadoop.apache.org; dhr...@gmail.com *Subject:* Re: hadoop 0.20 append - some clarifications I think that in general, the behavior

Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread Ted Dunning
Note that the document purports to be from 2008 and, at best, was uploaded just about a year ago. That it is still pretty accurate is kind of a tribute to either the stability of hbase or the stagnation, depending on how you read it. On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner

Re: recommendation on HDDs

2011-02-12 Thread Ted Dunning
The original poster also seemed somewhat interested in disk bandwidth. That is facilitated by having more than one disk in the box. On Sat, Feb 12, 2011 at 8:26 AM, Michael Segel michael_se...@hotmail.com wrote: Since the OP believes that their requirement is 1TB per node... a single 2TB would

Re: Which strategy is proper to run an this enviroment?

2011-02-12 Thread Ted Dunning
This sounds like it will be very inefficient. There is considerable overhead in starting Hadoop jobs. As you describe it, you will be starting thousands of jobs and paying this penalty many times. Is there a way that you could process all of the directories in one map-reduce job? Can you

Re: hadoop 0.20 append - some clarifications

2011-02-11 Thread Ted Dunning
not execute the createBlockOutputStream() method). hold here. 4. in parallel, try to read the file from another client Now you will get an error saying that file cannot be read. _ From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Friday, February 11, 2011 11:04 AM To: gok

Re: hadoop 0.20 append - some clarifications

2011-02-10 Thread Ted Dunning
Correct is a strong word here. There is actually an HDFS unit test that checks to see if partially written and unflushed data is visible. The basic rule of thumb is that you need to synchronize readers and writers outside of HDFS. There is no guarantee that data is visible or invisible after

Re: recommendation on HDDs

2011-02-10 Thread Ted Dunning
Get bigger disks. Data only grows and having extra is always good. You can get 2TB drives for $100 and 1TB for $75. As far as transfer rates are concerned, any 3Gb/s SATA drive is going to be about the same (ish). Seek times will vary a bit with rotation speed, but with Hadoop, you will be

Re: hadoop 0.20 append - some clarifications

2011-02-10 Thread Ted Dunning
branch. I suppose HDFS-265's design doc won't apply to it. -- *From:* Ted Dunning [mailto:tdunn...@maprtech.com] *Sent:* Thursday, February 10, 2011 9:29 PM *To:* common-user@hadoop.apache.org; gok...@huawei.com *Cc:* hdfs-u...@hadoop.apache.org *Subject:* Re

Re: Can single map-reduce solve this problem

2011-02-08 Thread Ted Dunning
mapper should produce (k,1), (1, v) for lines k,v in file1 and should produce (k,2), (2,v) for lines k,v in file2. Your partition function should look at only the first member of the key tuple, but should order on both members. Your reducer will get data like this: (k,1), [(1,v)] or like
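A sketch of the partition-function piece, assuming for illustration that the (k, tag) keys are serialized as tab-separated Text (the thread does not show an encoding):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class FirstFieldPartitioner implements Partitioner<Text, Text> {
      public void configure(JobConf job) {}

      // Hash only the first member of the (k, tag) key so that (k,1) and (k,2)
      // reach the same reducer; the sort comparator still orders on both
      // members, so the reducer sees file1's record for k ahead of file2's.
      public int getPartition(Text key, Text value, int numPartitions) {
        String k = key.toString().split("\t", 2)[0]; // drop the trailing tag
        return (k.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }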

Re: Hadoop XML Error

2011-02-07 Thread Ted Dunning
This is due to the security API not being available. You are crossing from a cluster with security to one without and that is causing confusion. Presumably your client assumes that it is available and your hadoop library doesn't provide it. Check your class path very carefully looking for

Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread Ted Dunning
Option (1) isn't the way that things normally work. Besides, the map method is called many times for each construction of a mapper object. On Mon, Feb 7, 2011 at 3:38 PM, maha m...@umail.ucsb.edu wrote: Hi, I would appreciate it if you could give me your thoughts if there is an effect on efficiency if:

Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread Ted Dunning
this line, and start working on it (since it now knows the path in HDFS). Are you saying it's not doable? Thank you, Mark On Mon, Feb 7, 2011 at 8:10 PM, Ted Dunning tdunn...@maprtech.com wrote: Option (1) isn't the way that things normally work. Besides, mappers are called many times

Re: retain state between mappers

2011-02-05 Thread Ted Dunning
Remember that mappers are not executed in a well defined order. They can be executed in different order or even at the same time. One mapper can be run more than once. There are two ways to get something like what you want, but the question you asked is ill-posed. First, you can adapt the

Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-26 Thread Ted Dunning
Yes. There is still a mismatch. The degree of mismatch is decreasing as tools like Hive or Pig become more advanced. Better packaging of hadoop is also helping. But your data architect will definitely need support from a java aware person. They won't be able to do all of the tasks. On Wed,

Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
This is a bit lower than it should be, but it is not so far out of line with what is reasonable. Did you make sure that you have multiple separate disks for HDFS to use? With many disks, you should be able to get local disk write speeds up to a few hundred MB/s. Once you involve replication then

Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
This is a really slow drive or controller. Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s. I would suspect in the absence of real information that your controller is more likely to be deficient than your drive. If this is on a laptop or something, then I withdraw my thought.

Re: the performance of HDFS

2011-01-25 Thread Ted Dunning
for this model. On 1/25/11 7:59 PM, Ted Dunning wrote: This is a really slow drive or controller. Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s. I would suspect in the absence of real information that your controller is more likely to be deficient than your drive

Re: Hadoop user event in Europe (interested ? )

2011-01-20 Thread Ted Dunning
If it occurs in early June, it might be possible for US attendees of Buzzwords to link in a visit to the hadoop meeting. I certainly would like to do that. On Thu, Jan 20, 2011 at 12:22 AM, Asif Jan asif@unige.ch wrote: Hi wondering if there is interest to organize a hadoop meet-up in

Re: When applying a patch, which attachment should I use?

2011-01-11 Thread Ted Dunning
You may also be interested in the append branch: http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/ On Tue, Jan 11, 2011 at 3:12 AM, edward choi mp2...@gmail.com wrote: Thanks for the info. I am currently using Hadoop 0.20.2, so I guess I only need apply

Re: Import data from mysql

2011-01-10 Thread Ted Dunning
Yes. Hadoop can definitely help with this. On Mon, Jan 10, 2011 at 12:00 PM, Brian brian.mcswee...@gmail.com wrote: Thus, I would greatly appreciate your opinion on whether or not using hadoop for this would make sense in order to parallelize the task if it gets too slow.

Re: Import data from mysql

2011-01-09 Thread Ted Dunning
It is, of course, only quadratic, even if you compare all rows to all other rows. You can reduce this cost to O(n log n) by ordinary sorting and you can further reduce the cost to O(n) using radix sort on hashes. Practically speaking, in either the parallel or non-parallel setting try

Re: Import data from mysql

2011-01-09 Thread Ted Dunning
them so unfortunately I don't think this will help much. Some of the values in the rows have to be multiplied together, some have to be compared, some have to have a function run against them etc. cheers, Brian On Sun, Jan 9, 2011 at 8:55 AM, Ted Dunning tdunn...@maprtech.com wrote

Re: Rngd

2011-01-04 Thread Ted Dunning
As it normally stands, rngd will only help (it appears) if you have a hardware RNG. You need to cheat and use entropy you don't really have. If you don't mind hacking your system, you could even do this:

    # mv /dev/random /dev/random.orig
    # ln /dev/urandom /dev/random

This makes

Re: What is the runtime efficiency of secondary sorting?

2011-01-03 Thread Ted Dunning
As a point of order, you would normally use a combiner with this problem and you wouldn't sort in either the combiner or the reducer. Instead, combiner and reducer would simply scan and keep the smallest item to emit at the end of the scan. As a point of information, most of the rank-based
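A minimal sketch of that scan-and-keep-the-smallest combiner/reducer (old mapred API; the Text/LongWritable types are assumptions). Because min is associative and commutative, the same class can be registered as both combiner and reducer:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class MinReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> out,
                         Reporter reporter) throws IOException {
        long min = Long.MAX_VALUE;
        while (values.hasNext()) { // single pass, no sorting anywhere
          min = Math.min(min, values.next().get());
        }
        out.collect(key, new LongWritable(min)); // emit only the smallest item
      }
    }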

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
Yes. It is stuck as suggested. See the bolded lines. You can help avoid this by dumping additional entropy into the machine via network traffic. According to the man page for /dev/random you can cheat by writing goo into /dev/urandom, but I have been unable to verify that by experiment. Is it

Re: What is the runtime efficiency of secondary sorting?

2011-01-03 Thread Ted Dunning
On Mon, Jan 3, 2011 at 4:00 PM, W.P. McNeill bill...@gmail.com wrote: ... If I write a combiner like this, is there any advantage to also doing a secondary sort? The definitive answer is that it depends. As for deserialization, the value in my actual application is a Java object with a

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
Try:

    dd if=/dev/random bs=1 count=100 of=/dev/null

This will likely hang for a long time. There is no way that I know of to change the behavior of /dev/random except by changing the file itself to point to a different minor device. That would be very bad form. One thing you may be able to do

Re: Entropy Pool and HDFS FS Commands Hanging System

2011-01-03 Thread Ted Dunning
On Mon, Jan 3, 2011 at 4:48 PM, Jon Lederman jon2...@gmail.com wrote: Thanks. Will try that. One final question, based on the jstack output I sent, is it obvious that the system is blocked due to the behavior of /dev/random? I tried to send you a highlighted markup of your jstack output.

Re: Help for the problem of running lucene on Hadoop

2011-01-02 Thread Ted Dunning
With even a dozen or two servers, it is very easy to flatten a mysql server with a hadoop cluster. Also, mysql is typically a very poor storage system for an inverted index because it doesn't allow for compression of the posting vectors. Better to copy Katta in this regard and create many

Re: Hadoop RPC call response post processing

2010-12-28 Thread Ted Dunning
Knowing the tenuring distribution will tell a lot about that exact issue. Ephemeral collections take on average less than one instruction per allocation and the allocation itself is generally only a single instruction. For ephemeral garbage, it is extremely unlikely that you can beat that. So

Re: help for using mapreduce to run different code?

2010-12-28 Thread Ted Dunning
If you mean running different code in different mappers, I recommend using an if statement. On Tue, Dec 28, 2010 at 2:53 PM, Jander g jande...@gmail.com wrote: Does Hadoop support the map function running different code? If yes, how to realize this?
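A hedged sketch of what the if statement amounts to in practice: branch on which input file the current split came from (old mapred API; the file-name convention is an assumption for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class BranchingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> out,
                      Reporter reporter) throws IOException {
        // Which input file does this split belong to?
        String file = ((FileSplit) reporter.getInputSplit()).getPath().getName();
        if (file.startsWith("typeA")) {
          // ... code for the first kind of input ...
        } else {
          // ... code for the second kind of input ...
        }
      }
    }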

Re: how to run jobs every 30 minutes?

2010-12-28 Thread Ted Dunning
Good quote. On Tue, Dec 28, 2010 at 3:46 PM, Chris K Wensel ch...@wensel.net wrote: deprecated is the new stable. https://issues.apache.org/jira/browse/MAPREDUCE-1734 ckw On Dec 28, 2010, at 2:56 PM, Jimmy Wan wrote: I've been using Cascading to act as make for my Hadoop processes for

Re: Hadoop RPC call response post processing

2010-12-27 Thread Ted Dunning
I would be very surprised if allocation itself is the problem as opposed to good old fashioned excess copying. It is very hard to write an allocator faster than the java generational gc, especially if you are talking about objects that are ephemeral. Have you looked at the tenuring distribution?

Re: How to simulate network delay on 1 node

2010-12-26 Thread Ted Dunning
See also https://github.com/toddlipcon/gremlins On Sun, Dec 26, 2010 at 11:26 AM, Konstantin Boudnik c...@apache.org wrote: Hi there. What you are looking at is fault injection. I am not sure what version of Hadoop you're looking at, but here's what you can take a look at in 0.21 and forward: -

Re: Hadoop/Elastic MR on AWS

2010-12-24 Thread Ted Dunning
EMR instances are started near each other. This increases the bandwidth between nodes. There may also be some enhancements in terms of access to the SAN that supports EBS. On Fri, Dec 24, 2010 at 4:41 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: - Original Message From:

Re: How frequently can I set status?

2010-12-23 Thread Ted Dunning
It is reasonable to update counters often, but I think you are right to limit the number of status updates. On Thu, Dec 23, 2010 at 11:15 AM, W.P. McNeill bill...@gmail.com wrote: I have a loop that runs over a large number of iterations (order of 100,000) very quickly. It is nice to do

Re: breadth-first search

2010-12-22 Thread Ted Dunning
The Mahout math package has a number of basic algorithms that use algorithmic efficiencies when given sparse graphs. A number of other algorithms use only the product of a sparse matrix on another matrix or a vector. Since these algorithms never change the original sparse matrix, they are safe

Re: breadth-first search

2010-12-21 Thread Ted Dunning
Ahh... I see what you mean. This algorithm can be implemented with all of the iterations for all points proceeding in parallel. You should only need 4 map-reduce steps, not 400. This will still take several minutes on Hadoop, but as your problem increases in size, this overhead becomes less

Re: breadth-first search

2010-12-21 Thread Ted Dunning
Absolutely true. Nobody should pretend otherwise. On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei wei.p...@xerox.com wrote: Hadoop is useful when the data is huge and cannot fit into memory, but it does not seem to be a real-time solution.

Re: Friends of friends with MapReduce

2010-12-20 Thread Ted Dunning
On Mon, Dec 20, 2010 at 9:39 AM, Antonio Piccolboni anto...@piccolboni.info wrote: For an easy solution, use hive. Let's say your record contains userid and friendid and the table is called friends Then you would do select A.userid , B.friendid from friends A join friends B on (A.friendid =

Re: breadth-first search

2010-12-20 Thread Ted Dunning
On Mon, Dec 20, 2010 at 8:16 PM, Peng, Wei wei.p...@xerox.com wrote: ... My question is really about what is the efficient way for graph computation, matrix computation, algorithms that need many iterations to converge (with intermediate results). Large graph computations usually assume a

Re: breadth-first search

2010-12-20 Thread Ted Dunning
On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei wei.p...@xerox.com wrote: ... Currently, most of the matrix data (graph matrix, document-word matrix) that we are dealing with are sparse. Good. The matrix decomposition often needs many iterations to converge, then intermediate results have to

Re: InputFormat for a big file

2010-12-17 Thread Ted Dunning
a) this is a small file by hadoop standards. You should be able to process it by conventional methods on a single machine in about the same time it takes to start a hadoop job that does nothing at all. b) reading a single line at a time is not as inefficient as you might think. If you write a

Re: Please help with hadoop configuration parameter set and get

2010-12-17 Thread Ted Dunning
Statics won't work the way you might think because different mappers and different reducers are all running in different JVM's. It might work in local mode, but don't kid yourself about it working in a distributed mode. It won't. On Fri, Dec 17, 2010 at 8:31 AM, Peng, Wei wei.p...@xerox.com
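A sketch of the pattern that does work for read-only parameters, assuming the old mapred API and a made-up property name: set the value on the JobConf at submit time and read it back in configure(), since every task JVM receives its own copy of the serialized configuration.

    // In the driver, before submitting the job:
    JobConf job = new JobConf(MyDriver.class); // MyDriver is hypothetical
    job.set("myapp.threshold", "0.75");        // shipped to every task JVM

    // In the mapper or reducer:
    private double threshold;
    public void configure(JobConf conf) {
      threshold = Double.parseDouble(conf.get("myapp.threshold"));
    }

Note this is one-way: tasks cannot write values back for other tasks to see; for that you need counters or an external store.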

Re: Needs a simple answer

2010-12-16 Thread Ted Dunning
Maha, Remember that the mapper is not running on the same machine as the main class. Thus local files aren't where you think. On Thu, Dec 16, 2010 at 1:06 PM, maha m...@umail.ucsb.edu wrote: Hi all, Why the following lines would work in the main class (WordCount) and not in Mapper ? even

Re: how to run jobs every 30 minutes?

2010-12-13 Thread Ted Dunning
Or even simpler, try Azkaban: http://sna-projects.com/azkaban/ On Mon, Dec 13, 2010 at 9:26 PM, edward choi mp2...@gmail.com wrote: Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! Ed 2010/12/8 Alejandro Abdelnur

Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?

2010-12-11 Thread Ted Dunning
Of course. It is just a set of Hadoop programs. 2010/12/11 edward choi mp2...@gmail.com Can I operate Bixo on a cluster other than Amazon EC2?

  1   2   >