Note that even the Facebook four degrees of separation paper went down to a
single machine running WebGraph (http://webgraph.di.unimi.it/) for the final
steps, after running jobs in their Hadoop cluster to build the dataset for that
final operation.
Just the same as Spark was disrupting the Hadoop ecosystem by changing the
assumption that you can't rely on memory in distributed analytics... now
maybe we are challenging the assumption that big data analytics need to
be distributed?
I've been asking the same question lately, and have similarly seen that
On 30 Mar 2015, at 13:27, jay vyas
jayunit100.apa...@gmail.com wrote:
Just the same as Spark was disrupting the Hadoop ecosystem by changing the
assumption that you can't rely on memory in distributed analytics... now maybe
we are challenging the assumption
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple
of terabytes is not that challenging (depending on the algorithm) these
days, whereas five years ago it was a big challenge. We have a bit over a
petabyte (not using Spark), and using a distributed system is the only
viable way.
Hi Sean,
I think your point about the ETL costs is the winning argument here, but I
would like to see more research on the topic.
What I would like to see researched is the ability to run a specialized set
of common algorithms in fast-local-mode, just as a compiler optimizer
can decide to inline.
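The "inlining" analogy above can be sketched as a small dispatch rule: if the dataset fits comfortably in one machine's memory, run the local implementation; otherwise fall back to the cluster. This is a hypothetical illustration, not an existing API; all names, the 64 GiB figure, and the 50% headroom threshold are invented for the example.

```python
# Hypothetical sketch: an execution planner that "inlines" small jobs,
# running them on a single machine instead of a cluster, the way a
# compiler optimizer inlines small functions. All names are invented.
from collections import Counter

LOCAL_MEMORY_BYTES = 64 * 2**30  # assume one node with 64 GiB of RAM


def plan_execution(dataset_bytes, local_fn, distributed_fn):
    """Pick the single-machine path when the data fits comfortably in RAM."""
    if dataset_bytes < 0.5 * LOCAL_MEMORY_BYTES:  # leave headroom
        return local_fn       # fast-local-mode
    return distributed_fn     # fall back to the cluster


def local_wordcount(lines):
    """In-memory word count: trivial on a single machine."""
    return Counter(w for line in lines for w in line.split())


def distributed_wordcount(lines):
    raise NotImplementedError("submit a cluster job here")


# A 10 MiB input gets dispatched to the local implementation.
runner = plan_execution(10 * 2**20, local_wordcount, distributed_wordcount)
print(runner(["big data", "not so big data"]))  # counts: big=2, data=2, not=1, so=1
```

The interesting research question is where the threshold sits for each algorithm, since (as noted above) "big" keeps becoming "not so big".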
Hello,
Well, every problem you want to solve with technology needs a good
justification for that technology. So the first thing to ask is
which technology fits your current and future problems. This is also what
the article says. Unfortunately, it only provides a vague answer.
Remember that article that went viral on HN, where a guy showed that GraphX
/ Giraph / GraphLab / Spark have worse performance on a 128-core cluster than
on a single-threaded machine? If not, here is the article:
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
Well, as you may
(I bet the Spark implementation could be improved. I bet GraphX could
be optimized.)
Not sure about this one, but in-core benchmarks often start by
assuming that the data is local. In the real world, data is unlikely
to be. The benchmark has to include the cost of bringing all the data
to the
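The point being made can be put as back-of-envelope arithmetic: an honest benchmark of a "fast" single machine must charge for moving the data to it first. The bandwidth and sizes below are illustrative assumptions, not measurements from any of the systems discussed.

```python
# Back-of-envelope sketch: end-to-end time must include data movement,
# not just compute. Numbers are illustrative assumptions.

def end_to_end_seconds(data_bytes, transfer_bytes_per_s, compute_s):
    """Total wall-clock time: transfer the data, then compute on it."""
    transfer_s = data_bytes / transfer_bytes_per_s
    return transfer_s + compute_s


# 2 TB pulled over a 1 Gb/s link (~125 MB/s) dwarfs a 600 s computation:
total = end_to_end_seconds(2 * 10**12, 125 * 10**6, 600)
print(total)  # 16600.0 seconds: 16000 s of transfer + 600 s of compute
```

With numbers like these, the transfer term dominates, which is exactly why a benchmark that assumes local data can flatter the single-machine approach.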