Spark User Meetup in San Francisco

2012-01-30 Thread Matei Zaharia
This is a somewhat late announcement, but I thought it might be interesting to people on this list. We're holding the first user meetup for Spark (www.spark-project.org), the in-memory cluster computing framework that lets you do interactive and iterative data mining on Hadoop data, in San Francisco …

Re: Other than hadoop

2012-01-30 Thread Matei Zaharia
Spark (http://www.spark-project.org) aims to provide a higher-level programming interface as well as higher performance than Hadoop. Matei On Jan 30, 2012, at 2:24 PM, Ronald Petty wrote: > R.V., > > Are you looking for the platforms that do distributed computation or the > larger ecosystems …

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Matei Zaharia
Hi Virajith, The default FIFO scheduler just isn't optimized for locality for small jobs. You should be able to get substantially more locality even with 1 replica if you use the fair scheduler, although the version of the scheduler in 0.20 doesn't contain the locality optimization. Try the Clo…
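For reference, switching the JobTracker from the default FIFO scheduler to the Fair Scheduler in Hadoop 0.20 was a single property change. A minimal sketch of the mapred-site.xml entry, assuming the contrib Fair Scheduler jar is on the JobTracker's classpath (property name as documented for the 0.20 contrib scheduler):

```xml
<!-- mapred-site.xml fragment (sketch): tell the JobTracker to use
     the Fair Scheduler instead of the default FIFO JobQueueTaskScheduler. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```

The JobTracker must be restarted for the scheduler change to take effect.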

Re: High load, low CPU on hard-to-reach instances

2011-07-05 Thread Matei Zaharia
What does the memory load look like on them? The one time I've seen stuff like this happen regularly is with too much memory in use. Matei On Jul 5, 2011, at 9:36 PM, Kai Ju Liu wrote: > Over the past week or two, I've been seeing an issue where hard-to-reach > (i.e. hard to ssh to) instances …

Re: Dynamic Cluster Node addition

2011-06-30 Thread Matei Zaharia
You can have a new TaskTracker or DataNode join the cluster by just starting that daemon on the slave (e.g. bin/hadoop-daemon.sh start tasktracker) and making sure it is configured to connect to the right JobTracker or NameNode (through the mapred.job.tracker and fs.default.name properties in the …
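The two properties named above live in the new slave's configuration files (mapred-site.xml and core-site.xml in 0.20-era layouts). A minimal sketch with placeholder host names and ports; the actual values depend on your cluster and are not from the original message:

```xml
<!-- mapred-site.xml fragment (sketch): point the TaskTracker at the JobTracker. -->
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value>
</property>

<!-- core-site.xml fragment (sketch): point the DataNode at the NameNode. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:9000</value>
</property>
```

Once the daemon starts with these settings, it registers with the master and begins accepting tasks or blocks without restarting the rest of the cluster.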

Re: Is there any way for the reducer to determine the total number of reduce tasks?

2011-06-22 Thread Matei Zaharia
You can implement the configure() method of the Reducer interface and look at the properties in the JobConf. In particular, "mapred.reduce.tasks" is the number of reduce tasks and "mapred.job.tracker" will be set to "local" when running in local mode. Matei On Jun 22, 2011, at 3:12 PM, Steve L…
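A hedged sketch of the pattern described above, using the old `org.apache.hadoop.mapred` API from that era. The class name and the word-count reduce logic are illustrative assumptions; only the configure()/JobConf property lookups come from the message:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: MapReduceBase supplies a no-op configure(),
// which we override to read job properties before reduce() is called.
public class CountingReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private int numReduceTasks;
  private boolean localMode;

  @Override
  public void configure(JobConf job) {
    // Total number of reduce tasks in this job.
    numReduceTasks = job.getInt("mapred.reduce.tasks", 1);
    // The LocalJobRunner sets mapred.job.tracker to "local".
    localMode = "local".equals(job.get("mapred.job.tracker"));
  }

  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // numReduceTasks and localMode are now available to the reduce logic.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
```

Note that this is the pre-0.21 "old" API; the newer `org.apache.hadoop.mapreduce` API exposes the same information through `setup(Context)` instead.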

Re: question about spill/combine

2010-02-23 Thread Matei Zaharia
Hi Adam, It looks like map output records are indeed serialized before being combined and written out. I'm not really sure why this is, except perhaps to simplify the code for the case where you don't know the size of the records. Maybe someone more familiar with this part of Hadoop can explain …

Re: load balanced task distribution

2010-01-07 Thread Matei Zaharia
Hi Michael, The Fair Scheduler's LoadManager was indeed put in place to allow for resource-aware scheduling in the future. Actually, Scott Chen from Facebook is currently working towards this feature. His latest patch related to it is https://issues.apache.org/jira/browse/MAPREDUCE-1218, which …