It is also worth using dd to verify your raw disk speeds.
Also, expressing disk transfer rates in bytes per second makes it a bit
easier for most of the disk people I know to figure out what is large or
small.
Each of these disks should do about 100MB/s when driven well. Hadoop
does OK,
To pile on, thousands or millions of documents are well within the range
that is well addressed by Lucene.
Solr may be an even better option than bare Lucene since it handles lots of
the boilerplate problems like document parsing and index update scheduling.
On Tue, May 31, 2011 at 11:56 AM,
itr.nextToken() is inside the if, so the iterator only advances when count
== 5; once count passes 5, the loop keeps spinning without ever consuming
another token.
On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote:
while (itr.hasMoreTokens()) {
  if (count == 5) {
    word.set(itr.nextToken());
    output.collect(word, one);
  }
  count++;
}
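For reference, a minimal stand-alone sketch of the corrected loop, with a plain list standing in for the Hadoop OutputCollector (the class and method names here are made up for illustration, not taken from the original code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class FifthWord {
    // Corrected loop: nextToken() is called on every pass, so the iterator
    // always advances even when the token is not emitted. The original code
    // only consumed a token when count == 5, so it could loop forever.
    static List<String> fifthWords(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        int count = 0;
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken(); // advance unconditionally
            if (count == 5) {
                out.add(token); // stand-in for output.collect(word, one)
            }
            count++;
        }
        return out;
    }
}
```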
ZK started as a sub-project of Hadoop.
On Thu, May 19, 2011 at 7:27 AM, M. C. Srivas mcsri...@gmail.com wrote:
Interesting to note that Cassandra and ZK are now considered Hadoop
projects.
They were independent of Hadoop before the recent update.
On Thu, May 19, 2011 at 4:18 AM, Steve
Try using the Apache Mahout code that solves exactly this problem.
Mahout has a distributed row-wise matrix that is read one row at a time.
Dot products with the vector are computed and the results are collected.
This capability is used extensively in the large scale SVD's in Mahout.
On Tue,
How is it that 36 processes are not expected if you have configured 48 + 12
= 50 slots available on the machine?
On Wed, May 11, 2011 at 11:11 AM, Adi adi.pan...@gmail.com wrote:
By our calculations hadoop should not exceed 70% of memory.
Allocated per node - 48 map slots (24 GB), 12 reduce
On Sat, Apr 30, 2011 at 12:18 AM, elton sky eltonsky9...@gmail.com wrote:
I got 2 questions:
1. I am wondering how hadoop MR performs when it runs compute-intensive
applications, e.g. the Monte Carlo method to compute pi. There's an example
in 0.21,
QuasiMonteCarlo, but that example doesn't use
Check out S4 http://s4.io/
On Fri, Apr 29, 2011 at 10:13 PM, Luiz Fernando Figueiredo
luiz.figueir...@auctorita.com.br wrote:
Hi guys.
Hadoop is well known for processing large amounts of data but we think that
we can do much more with it. Our goal is to try to serve pseudo-streaming near
Cooccurrence analysis is commonly used in recommendations. These produce
large intermediates.
Come on over to the Mahout project if you would like to talk to a bunch of
people who work on these problems.
On Fri, Apr 29, 2011 at 9:31 PM, elton sky eltonsky9...@gmail.com wrote:
Thank you for
I would recommend taking this question to the Mahout mailing list.
The short answer is that matrix multiplication by a column vector is pretty
easy. Each mapper reads the vector in the configure method and then does a
dot product for each row of the input matrix. Results are reassembled into
a
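In plain Java, the per-row work each mapper does is just a dot product; a sequential sketch, with the vector passed as a parameter where the Hadoop version would load it once in configure() (class and method names here are illustrative):

```java
public class RowTimesVector {
    // Each "map" call: dot product of one matrix row with the shared vector.
    static double dot(double[] row, double[] vector) {
        double sum = 0.0;
        for (int i = 0; i < row.length; i++) {
            sum += row[i] * vector[i];
        }
        return sum;
    }

    // Sequential stand-in for the map phase: one result per row, reassembled
    // into a single result vector (what the reduce side would do).
    static double[] multiply(double[][] matrix, double[] vector) {
        double[] result = new double[matrix.length];
        for (int r = 0; r < matrix.length; r++) {
            result[r] = dot(matrix[r], vector);
        }
        return result;
    }
}
```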
Turing completion isn't the central question here, really. The truth
is, map-reduce programs have considerable pressure to be written in a
scalable fashion which limits them to fairly simple behaviors that
result in pretty linear dependence of run-time on input size for a
given program.
The cool
Sounds like this paper might help you:
Predicting Multiple Performance Metrics for Queries: Better Decisions
Enabled by Machine Learning by Ganapathi, Archana, Harumi Kuno,
Umeshwar Dayal, Janet Wiener, Armando Fox, Michael Jordan, David
Patterson
http://radlab.cs.berkeley.edu/publication/187
Hbase is very good at this kind of thing.
Depending on your aggregation needs OpenTSDB might be interesting since they
store and query against large amounts of time ordered data similar to what
you want to do.
It isn't clear whether your data is primarily about current state or
about
nothing architecture.
This may be more database terminology that could be addressed by hbase,
but I think it is good background for the questions of memory mapping
files in hadoop.
Kevin
-Original Message-
From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Tuesday, April 12, 2011
, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
Then one could MMap the blocks pertaining to the HDFS file and piece
them together. Lucene's MMapDirectory implementation does just this
to avoid an obscure JVM bug.
On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning tdunn...@maprtech.com
wrote
Blocks live where they land when first created. They can be moved due to
node failure or rebalancing, but it is typically pretty expensive to do
this. It certainly is slower than just reading the file.
If you really, really want mmap to work, then you need to set up some native
code that builds
Actually, it doesn't become trivial. It just becomes total fail or total
win instead of almost always being partial win. It doesn't meet Benson's
need.
On Tue, Apr 12, 2011 at 11:09 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
To get around the chunks or blocks problem, I've been
Benson is actually a pretty sophisticated guy who knows a lot about mmap.
I engaged with him yesterday on this since I know him from Apache.
On Tue, Apr 12, 2011 at 7:16 PM, M. C. Srivas mcsri...@gmail.com wrote:
I am not sure if you realize, but HDFS is not VM integrated.
Depending on the function that you want to use, it sounds like you want to
use a self join to compute transposed cooccurrence.
That is, it sounds like you want to find all the sets that share elements
with X. If you have a binary matrix A that represents your set membership
with one row per set
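As a tiny in-memory sketch of that self-join (names illustrative, not from any actual Mahout or Hadoop API): for a binary membership matrix A, the entries of A times A-transpose count shared elements, so the nonzero entries of row X give the sets that overlap set X.

```java
public class Cooccurrence {
    // overlap[i][j] = number of elements that sets i and j share; this is
    // A * A^T for the binary membership matrix A (one row per set).
    static int[][] selfJoin(boolean[][] a) {
        int n = a.length, m = a[0].length;
        int[][] overlap = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < m; k++)
                    if (a[i][k] && a[j][k]) overlap[i][j]++;
        return overlap;
    }
}
```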
Also, it only provides access to a local chunk of a file which isn't very
useful.
On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
Yes you can however it will require customization
providing access to the
underlying file block?
On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning tdunn...@maprtech.com
wrote:
On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo edlinuxg...@gmail.com
wrote
There are no subtle ways to deal with quadratic problems like this. They
just don't scale.
Your suggestions are roughly on course. When matching 10GB against 50GB,
the choice of which input to use as input to the mapper depends a lot on how
much you can buffer in memory and how long such a
The original poster said that there was no common key. Your suggestion
presupposes that such a key exists.
On Sun, Apr 10, 2011 at 4:29 PM, Mehmet Tepedelenlioglu
mehmets...@gmail.com wrote:
My understanding is you have two sets of strings S1, and S2 and you want to
mark all strings that
yes. At least periodically.
You now have a situation where the age distribution of blocks in each
datanode is quite different. This will lead to different evolution of which
files are retained and that is likely to cause imbalances again. It will
also cause the performance of your system to be
You can configure the balancer to use higher bandwidth. That can speed it
up by 10x.
On Tue, Apr 5, 2011 at 2:54 AM, Guy Doulberg guy.doulb...@conduit.com wrote:
We are running the balancer, but it takes a lot of time... in this time the
cluster is not working
The hbase list would be more appropriate.
See http://hbase.apache.org/mail-lists.html
There is an active IRC channel, but
your question fits the mailing list better so pop on over and I will give
you some comments.
In the meantime, take a look at OpenTSDB
It would help to get a good book. There are several.
For your program, there are several things that will trip you up:
a) lots of little files is going to be slow. You want input that is 100MB
per file if you want speed.
b) That file format is a bit cheesy since it is hard to tell URLs from
Take a look at openTsDb at http://opentsdb.net/
It provides lots of the capability in a MUCH simpler package.
On Sun, Mar 20, 2011 at 8:43 AM, Mark static.void@gmail.com wrote:
Sorry but it doesn't look like Chukwa mailing list exists anymore?
Is there an easy way to set up lightweight
into Hadoop. Doesn't really look like opentsdb is meant for that. I
could be wrong though?
On 3/20/11 9:49 AM, Ted Dunning wrote:
Take a look at this:
http://wiki.apache.org/hadoop/Hbase/DesignOverview
then read the bigtable paper.
On Sun, Mar 20, 2011 at 6:39 PM, edward choi mp2...@gmail.com wrote:
Hi,
I'm planning to crawl thousands of news rss feeds via MapReduce, and save
each news article into HBase directly.
Unfortunately this doesn't help much because it is hard to get the ports to
balance the load.
On Fri, Mar 18, 2011 at 8:30 PM, Michael Segel michael_se...@hotmail.com wrote:
With a 1GbE port, you could go 100MB/s for the bandwidth limit.
If you bond your ports, you could go higher.
These java files are full of HTML.
Are you sure that they are supposed to compile? How did you get these
files?
On Fri, Mar 18, 2011 at 3:12 AM, MANISH SINGLA coolmanishh...@gmail.com wrote:
Hii everyone...
I am trying to run kmeans on a single node... I have the attached
files with me...I
This looks like you took the code from
http://code.google.com/p/kmeans-hadoop/
And it looks like you didn't
actually download the code, but you cut and pasted the HTML rendition of the
code.
First, this code is not from a serious project. It is more of a
If nobody else more qualified is willing to jump in, I can at least provide
some pointers.
What you describe is a bit surprising. I have zero experience with any 0.21
version, but decommissioning was working well
in much older versions, so this would be a surprising regression.
The observations
If you just shut the node off, the blocks will replicate
faster.
James.
On 2011-03-18, at 10:03 AM, Ted Dunning wrote:
W.P is correct, however, that standard techniques like snapshots and mirrors
and point in time backups do not exist in standard hadoop.
This requires a variety of creative work-arounds if you use stock hadoop.
It is not uncommon for people to have memories of either removing everything
or
Note that that comment is now 7 years old.
See Mahout for a more modern take on numerics using Hadoop (and other tools)
for scalable machine learning and data mining.
On Wed, Mar 16, 2011 at 10:43 AM, baloodevil dukek...@hotmail.com wrote:
See this for comment on java handling numeric
Since you asked so nicely:
http://www.manning.com/owen/
On Fri, Mar 4, 2011 at 6:52 AM, Mike Nute mike.n...@gmail.com wrote:
James,
Do you know how to get a copy of this book in early access form? Amazon
doesn't release it until may. Thanks!
Mike Nute
--Original Message--
From:
Come on over to the Apache Mahout mailing list for a warm welcome at least.
We don't have a lot of time series stuff but would be very interested in
hearing more about what you need and would like to see if there are some
common issues that we might work on together.
On Fri, Mar 4, 2011 at 9:05
It will be very difficult to do. If you have n machines running 4 different
things, you will probably get better results segregating tasks as much as
possible. Interactions can be very subtle and can have major impact on
performance in a few cases.
Hadoop, in general, will use a lot of the
Bixo may have some useful components. The thrust is different, but some of
the pieces are similar.
http://bixo.101tec.com/
On Mon, Feb 28, 2011 at 7:57 PM, Mark Kerzner markkerz...@gmail.com wrote:
Well, it's more complex than that. I packed all files (or selected
directories) into zip
Check out http://www.elasticsearch.org/
Not what you are doing, but possibly a
helpful bit of the pie.
Also, Solr integrates Tika and Lucene pretty nicely these days. No Hbase yet,
but it isn't hard to add that.
On Mon, Feb 28, 2011 at 1:01 PM, Mark Kerzner
Ted,
Greetings back at you. It has been a while.
Check out Jimmy Lin and Chris Dyer's book about text processing with
hadoop:
http://www.umiacs.umd.edu/~jimmylin/book.html
On Sun, Feb 27, 2011 at 4:34 PM, Ted Pedersen tpede...@d.umn.edu wrote:
Greetings all,
I'm teaching an undergraduate
Harrington is out there, I stole this from you.)
Lance
On Sun, Feb 27, 2011 at 4:55 PM, Ted Dunning tdunn...@maprtech.com
wrote:
This is the most important thing that you have said. The map function
is called once per unit of input but the mapper object persists for
many units of input.
You have a little bit of control over how many mapper objects there
are and how many machines they are created on and how many
The input is effectively split by lines, but under the covers, the actual
splits are by byte. Each mapper will cleverly scan from the specified start
to the next line after the start point. At the end, it will over-read to
the end of the line that is at or after the end of its specified region.
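That scanning rule can be sketched in plain Java over an in-memory byte array (readSplit and its signature are illustrative, not Hadoop's actual LineRecordReader API):

```java
public class SplitReader {
    // Return the text a mapper assigned the byte range [start, end) would
    // read: skip to the first line start at or after `start` (unless start
    // is 0), then over-read past `end` to the end of that line.
    // Assumes 0 <= start <= end <= data.length and data ends with '\n'.
    static String readSplit(byte[] data, int start, int end) {
        int begin = start;
        if (start != 0) {
            // advance until the previous byte is a newline, i.e. a line start
            while (begin < data.length && data[begin - 1] != '\n') begin++;
        }
        int stop = end;
        // over-read until the line containing `end` is finished
        while (stop < data.length && (stop == 0 || data[stop - 1] != '\n')) stop++;
        return new String(data, begin, stop - begin);
    }
}
```

Adjacent splits together cover every line exactly once, which is the point of the rule.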
MalStone looks like a very narrow benchmark.
Terasort is also a very narrow and somewhat idiosyncratic benchmark, but it
has the characteristic that lots of people use it.
You should add PigMix to your list. There are Java versions of the problems
in PigMix that make a pretty good set of benchmarks.
I just read the malstone report. They report times for a Java version that
is many (5x) times slower than for a streaming implementation. That single
fact indicates that the Java code is so appallingly bad that this is a very
bad benchmark.
On Fri, Feb 18, 2011 at 2:27 PM, Jim Falgout
Remove the 'for' at the end that got sucked in by the email editor.
On Thu, Feb 17, 2011 at 5:56 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
Ted, thanks for the links, the yahoo.com one doesn't seem to exist?
On Wed, Feb 16, 2011 at 11:48 PM, Ted Dunning tdunn...@maprtech.com
wrote
Sounds like Pig. Or Cascading. Or Hive.
Seriously, isn't this already available?
On Wed, Feb 16, 2011 at 7:06 AM, Guy Doulberg guy.doulb...@conduit.com wrote:
Hey all,
I want to consult with you hadoppers about a Map/Reduce application I want
to build.
I want to build a map/reduce job,
http://wiki.apache.org/hadoop/HowToContribute
On Wed, Feb 16, 2011 at 8:23 AM, Mark Kerzner markkerz...@gmail.com wrote:
if there is a place to learn about patch
practices, please point me to it.
Unless you go beyond the current standard semantics, this is true.
See here: http://code.google.com/p/hop/ and
http://labs.yahoo.com/node/476 for alternatives.
On Wed, Feb 16, 2011 at 10:30 PM, madhu phatak phatak@gmail.com wrote:
Hadoop is not suited for real time applications
On Thu,
Good idea!
Would you like to create the nucleus of such a page? (there might already
be something like that)
On Tue, Feb 15, 2011 at 8:49 AM, Shrinivas Joshi jshrini...@gmail.com wrote:
It would be
nice to have a wiki page collecting all this good information.
--
*From:* Ted Dunning [mailto:tdunn...@maprtech.com]
*Sent:* Friday, February 11, 2011 2:14 PM
*To:* common-user@hadoop.apache.org; gok...@huawei.com
*Cc:* hdfs-u...@hadoop.apache.org; dhr...@gmail.com
*Subject:* Re: hadoop 0.20 append - some clarifications
I think that in general, the behavior
Note that the document purports to be from 2008 and, at best, was uploaded
just about a year ago.
That it is still pretty accurate is kind of a tribute to either the
stability or the stagnation of hbase, depending on how you read it.
On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner
The original poster also seemed somewhat interested in disk bandwidth.
That is facilitated by having more than one disk in the box.
On Sat, Feb 12, 2011 at 8:26 AM, Michael Segel michael_se...@hotmail.com wrote:
Since the OP believes that their requirement is 1TB per node... a single
2TB would
This sounds like it will be very inefficient. There is considerable
overhead in starting Hadoop jobs. As you describe it, you will be starting
thousands of jobs and paying this penalty many times.
Is there a way that you could process all of the directories in one
map-reduce job? Can you
not execute the createBlockOutputStream() method). hold here.
4. In parallel, try to read the file from another client.
Now you will get an error saying that the file cannot be read.
_
From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Friday, February 11, 2011 11:04 AM
To: gok
Correct is a strong word here.
There is actually an HDFS unit test that checks to see if partially written
and unflushed data is visible. The basic rule of thumb is that you need to
synchronize readers and writers outside of HDFS. There is no guarantee that
data is visible or invisible after
Get bigger disks. Data only grows and having extra is always good.
You can get 2TB drives for $100 and 1TB for $75.
As far as transfer rates are concerned, any 3Gb/s SATA drive is going to be
about the same (ish). Seek times will vary a bit with rotation speed, but
with Hadoop, you will be
branch. I suppose HDFS-265's
design doc won't apply to it.
--
*From:* Ted Dunning [mailto:tdunn...@maprtech.com]
*Sent:* Thursday, February 10, 2011 9:29 PM
*To:* common-user@hadoop.apache.org; gok...@huawei.com
*Cc:* hdfs-u...@hadoop.apache.org
*Subject:* Re
mapper should produce (k,1), (1, v) for lines k,v in file1 and should
produce (k,2), (2,v) for lines k,v in file2. Your partition function should
look at only the first member of the key tuple, but should order on both
members.
Your reducer will get data like this:
(k,1), [(1,v)]
or like
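The composite-key trick above can be sketched as a plain Java class (names and helper methods here are illustrative; a real Hadoop job would express the same logic as a WritableComparable, a Partitioner, and a sort comparator):

```java
public class TaggedKey implements Comparable<TaggedKey> {
    final String k;
    final int tag; // 1 = record came from file1, 2 = from file2

    TaggedKey(String k, int tag) { this.k = k; this.tag = tag; }

    // Partition on k only, so both files' records for the same k land in
    // the same reducer.
    int partition(int numReducers) {
        return (k.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort on both members: group by k, with file1 records arriving before
    // file2 records within each group.
    public int compareTo(TaggedKey other) {
        int c = k.compareTo(other.k);
        return c != 0 ? c : Integer.compare(this.tag, other.tag);
    }
}
```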
This is due to the security API not being available. You are crossing from
a cluster with security to one without and that is causing confusion.
Presumably your client assumes that it is available and your hadoop library
doesn't provide it.
Check your class path very carefully looking for
Option (1) isn't the way that things normally work. Besides, the map method
is called many times for each construction of a mapper object.
On Mon, Feb 7, 2011 at 3:38 PM, maha m...@umail.ucsb.edu wrote:
Hi,
I would appreciate it if you could give me your thoughts if there is
affect on efficiency if:
this line, and start working on it (since it now knows the
path in HDFS). Are you saying it's not doable?
Thank you,
Mark
On Mon, Feb 7, 2011 at 8:10 PM, Ted Dunning tdunn...@maprtech.com wrote:
Remember that mappers are not executed in a well defined order.
They can be executed in different order or even at the same time. One
mapper can be run more than once.
There are two ways to get something like what you want, but the question you
asked is ill-posed.
First, you can adapt the
Yes. There is still a mismatch.
The degree of mismatch is decreasing as tools like Hive or Pig become more
advanced. Better packaging of hadoop is also helping.
But your data architect will definitely need support from a java aware
person. They won't be able to do all of the tasks.
On Wed,
This is a bit lower than it should be, but it is not so far out of line with
what is reasonable.
Did you make sure that you have multiple separate disks for HDFS to use? With
many disks, you should be able to get local disk write speeds up to a few
hundred MB/s.
Once you involve replication then
This is a really slow drive or controller.
Consumer grade 3.5 inch 2TB drives typically can handle 100MB/s. I would
suspect in the absence of real information that your controller is more
likely to be deficient than your drive. If this is on a laptop or
something, then I withdraw my thought.
for this model.
On 1/25/11 7:59 PM, Ted Dunning wrote:
If it occurs in early June, it might be possible for US attendees of
Buzzwords to link in a visit to the hadoop meeting. I certainly would like
to do that.
On Thu, Jan 20, 2011 at 12:22 AM, Asif Jan asif@unige.ch wrote:
Hi
wondering if there is interest to organize a hadoop meet-up in
You may also be interested in the append branch:
http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/
On Tue, Jan 11, 2011 at 3:12 AM, edward choi mp2...@gmail.com wrote:
Thanks for the info.
I am currently using Hadoop 0.20.2, so I guess I only need apply
Yes. Hadoop can definitely help with this.
On Mon, Jan 10, 2011 at 12:00 PM, Brian brian.mcswee...@gmail.com wrote:
Thus, I would greatly appreciate your opinion on whether or not using
hadoop for this would make sense in order to parallelize the task if it gets
too slow.
It is, of course, only quadratic, even if you compare all rows to all other
rows. You can reduce this cost to O(n log n) by ordinary sorting and you
can further reduce the cost to O(n) using radix sort on hashes.
Practically speaking, in either the parallel or non-parallel setting try
them so
unfortunately I don't think this will help much. Some of the values in the
rows have to be multiplied together, some have to be compared, some have to
have a function run against them etc.
cheers,
Brian
On Sun, Jan 9, 2011 at 8:55 AM, Ted Dunning tdunn...@maprtech.com wrote
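For the exact-match case, the hashing idea reads like this in plain Java (illustrative only; as the reply above notes, it only helps when rows can be bucketed by equality or by a hash of some key feature, not when arbitrary pairwise functions must be run):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DupFinder {
    // Group row indices by row content in one hashing pass. Only rows in
    // the same bucket can match, so the quadratic all-pairs comparison is
    // avoided entirely.
    static Map<String, List<Integer>> groupByContent(List<String> rows) {
        Map<String, List<Integer>> buckets = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            buckets.computeIfAbsent(rows.get(i), k -> new ArrayList<>()).add(i);
        }
        return buckets;
    }
}
```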
As it normally stands, rngd will only help (it appears) if you have a
hardware RNG.
You need to cheat and use entropy you don't really have. If you don't mind
hacking your system, you could even do this:
# mv /dev/random /dev/random.orig
# ln /dev/urandom /dev/random
This makes
As a point of order, you would normally use a combiner with this problem and
you wouldn't sort in either the combiner or the reducer. Instead, combiner
and reducer would simply scan and keep the smallest item to emit at the end
of the scan.
As a point of information, most of the rank-based
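The scan-and-keep-smallest approach is a one-liner in spirit; a plain Java sketch (class and method names made up for illustration), where the same code serves as both combiner and reducer:

```java
import java.util.Iterator;

public class MinCombiner {
    // One pass, no sort: keep only the smallest value seen so far and emit
    // it at the end of the scan.
    static long smallest(Iterator<Long> values) {
        long min = Long.MAX_VALUE;
        while (values.hasNext()) {
            min = Math.min(min, values.next());
        }
        return min;
    }
}
```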
Yes. It is stuck as suggested. See the bolded lines.
You can help avoid this by dumping additional entropy into the machine via
network traffic. According to the man page for /dev/random you can cheat by
writing goo into /dev/urandom, but I have been unable to verify that by
experiment.
Is it
On Mon, Jan 3, 2011 at 4:00 PM, W.P. McNeill bill...@gmail.com wrote:
... If I write a combiner like this, is there any advantage to also doing a
secondary sort?
The definitive answer is that it depends.
As for deserialization, the value in my actual application is a Java object
with a
try
dd if=/dev/random bs=1 count=100 of=/dev/null
This will likely hang for a long time.
There is no way that I know of to change the behavior of /dev/random except
by changing the file itself to point to a different minor device. That
would be very bad form.
One thing you may be able to do
On Mon, Jan 3, 2011 at 4:48 PM, Jon Lederman jon2...@gmail.com wrote:
Thanks. Will try that. One final question, based on the jstack output I
sent, is it obvious that the system is blocked due to the behavior of
/dev/random?
I tried to send you a highlighted markup of your jstack output.
With even a dozen or two servers, it is very easy to flatten a mysql server
with a hadoop cluster.
Also, mysql is typically a very poor storage system for an inverted index
because it doesn't allow for compression of the posting vectors.
Better to copy Katta in this regard and create many
Knowing the tenuring distribution will tell a lot about that exact issue.
Ephemeral collections take on average less than one instruction per
allocation and the allocation itself is generally only a single instruction.
For ephemeral garbage, it is extremely unlikely that you can beat that.
So
if you mean running different code in different mappers, I recommend using
an if statement.
On Tue, Dec 28, 2010 at 2:53 PM, Jander g jande...@gmail.com wrote:
Does Hadoop support the map function running different code? If yes, how
do I realize this?
Good quote.
On Tue, Dec 28, 2010 at 3:46 PM, Chris K Wensel ch...@wensel.net wrote:
deprecated is the new stable.
https://issues.apache.org/jira/browse/MAPREDUCE-1734
ckw
On Dec 28, 2010, at 2:56 PM, Jimmy Wan wrote:
I've been using Cascading to act as 'make' for my Hadoop processes for
I would be very surprised if allocation itself is the problem as opposed to
good old fashioned excess copying.
It is very hard to write an allocator faster than the java generational gc,
especially if you are talking about objects that are ephemeral.
Have you looked at the tenuring distribution?
See also https://github.com/toddlipcon/gremlins
On Sun, Dec 26, 2010 at 11:26 AM, Konstantin Boudnik c...@apache.org wrote:
Hi there.
What you are looking at is fault injection.
I am not sure what version of Hadoop you're looking at, but here's what
to take a look at in 0.21 and forward:
-
EMR instances are started near each other. This increases the bandwidth
between nodes.
There may also be some enhancements in terms of access to the SAN that
supports EBS.
On Fri, Dec 24, 2010 at 4:41 AM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
- Original Message
From:
It is reasonable to update counters often, but I think you are right to
limit the number of status updates.
On Thu, Dec 23, 2010 at 11:15 AM, W.P. McNeill bill...@gmail.com wrote:
I have a loop that runs over a large number of iterations (order of
100,000)
very quickly. It is nice to do
The Mahout math package has a number of basic algorithms that use
algorithmic efficiencies when given sparse graphs.
A number of other algorithms use only the product of a sparse matrix on
another matrix or a vector. Since these algorithms never change the
original sparse matrix, they are safe
Ahh... I see what you mean.
This algorithm can be implemented with all of the iterations for all points
proceeding in parallel. You should only need 4 map-reduce steps, not 400.
This will still take several minutes on Hadoop, but as your problem
increases in size, this overhead becomes less
Absolutely true. Nobody should pretend otherwise.
On Tue, Dec 21, 2010 at 10:04 AM, Peng, Wei wei.p...@xerox.com wrote:
Hadoop is useful when the data is huge and cannot fit into memory, but it
does not seem to be a real-time solution.
On Mon, Dec 20, 2010 at 9:39 AM, Antonio Piccolboni anto...@piccolboni.info
wrote:
For an easy solution, use hive. Let's say your record contains userid and
friendid and the table is called friends
Then you would do
select A.userid , B.friendid from friends A join friends B on (A.friendid =
On Mon, Dec 20, 2010 at 8:16 PM, Peng, Wei wei.p...@xerox.com wrote:
... My question is really about what is the efficient way for graph
computation, matrix computation, algorithms that need many iterations to
converge (with intermediate results).
Large graph computations usually assume a
On Mon, Dec 20, 2010 at 9:43 PM, Peng, Wei wei.p...@xerox.com wrote:
...
Currently, most of the matrix data (graph matrix, document-word matrix)
that we are dealing with are sparse.
Good.
The matrix decomposition often needs many iterations to converge, then
intermediate results have to
a) this is a small file by hadoop standards. You should be able to process
it by conventional methods on a single machine in about the same time it
takes to start a hadoop job that does nothing at all.
b) reading a single line at a time is not as inefficient as you might think.
If you write a
Statics won't work the way you might think because different mappers and
different reducers are all running in different JVM's. It might work in
local mode, but don't kid yourself about it working in a distributed mode.
It won't.
On Fri, Dec 17, 2010 at 8:31 AM, Peng, Wei wei.p...@xerox.com
Maha,
Remember that the mapper is not running on the same machine as the main
class. Thus local files aren't where you think.
On Thu, Dec 16, 2010 at 1:06 PM, maha m...@umail.ucsb.edu wrote:
Hi all,
Why would the following lines work in the main class (WordCount) and not
in the Mapper? even
Or even simpler, try Azkaban: http://sna-projects.com/azkaban/
On Mon, Dec 13, 2010 at 9:26 PM, edward choi mp2...@gmail.com wrote:
Thanks for the tip. I took a look at it.
Looks similar to Cascading I guess...?
Anyway thanks for the info!!
Ed
2010/12/8 Alejandro Abdelnur
Of course. It is just a set of Hadoop programs.
2010/12/11 edward choi mp2...@gmail.com
Can I operate Bixo on a cluster other than Amazon EC2?