Re: How to send user variables from Spark client to custom InputFormat or RecordReader ?

2015-02-22 Thread Tom Vacek
The SparkConf doesn't allow you to set arbitrary variables. You can use SparkContext's HadoopRDD and create a JobConf (with whatever variables you want), and then grab them out of the JobConf in your RecordReader. On Sun, Feb 22, 2015 at 4:28 PM, hnahak harihar1...@gmail.com wrote: Hi, I

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Tom Vacek
Spark gives you four of the classical collectives: broadcast, reduce, scatter, and gather. There are also a few additional primitives, mostly based on a join. Spark is certainly less optimized than MPI for these, but maybe that isn't such a big deal. Spark has one theoretical disadvantage

Re: Spark LIBLINEAR

2014-05-16 Thread Tom Vacek
I've done some comparisons with my own implementation of TRON on Spark. From a distributed computing perspective, it does 2x more local work per iteration than LBFGS, so the parallel isoefficiency is improved slightly. I think the truncated Newton solver holds some potential because there have

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
As to your last line: I've used RDD zipping to avoid GC since MyBaseData is large and doesn't change. I think this is a very good solution to what is being asked for. On Mon, Apr 28, 2014 at 10:44 AM, Ian O'Connell i...@ianoconnell.com wrote: A mutable map in an object should do what your

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
I'm not sure what I said came through. RDD zip is not hacky at all, as it only depends on a user not changing the partitioning. Basically, you would keep your losses as an RDD[Double] and zip whose with the RDD of examples, and update the losses. You're doing a copy (and GC) on the RDD of

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
my iPhone On Apr 28, 2014, at 9:45 AM, Tom Vacek minnesota...@gmail.com wrote: I'm not sure what I said came through. RDD zip is not hacky at all, as it only depends on a user not changing the partitioning. Basically, you would keep your losses as an RDD[Double] and zip whose with the RDD

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
Ian, I tried playing with your suggestion, but I get a task not serializable error (and some obvious things didn't fix it). Can you get that working? On Mon, Apr 28, 2014 at 10:58 AM, Tom Vacek minnesota...@gmail.com wrote: As to your last line: I've used RDD zipping to avoid GC since

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Tom Vacek
on loss RDD (copy) ? Chester Sent from my iPhone On Apr 28, 2014, at 9:45 AM, Tom Vacek minnesota...@gmail.com wrote: I'm not sure what I said came through. RDD zip is not hacky at all, as it only depends on a user not changing the partitioning. Basically, you would keep your losses as an RDD

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Tom Vacek
Here are some out-of-the-box ideas: If the elements lie in a fairly small range and/or you're willing to work with limited precision, you could use counting sort. Moreover, you could iteratively find the median using bisection, which would be associative and commutative. It's easy to think of

internship opportunity

2014-04-22 Thread Tom Vacek
Thomson Reuters is looking for a graduate (or possibly advanced undergraduate) summer intern in Eagan, MN. This is a chance to work on an innovative project exploring how big data sets can be used by professionals such as lawyers, scientists and journalists. If you're subscribed to this mailing