Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
of use here? My current method happens to have a large overhead (much more than actual computation time). Also, I am short of memory at the driver when it has to read the entire file. On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: If it’s a flat binary

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
If it’s a flat binary file and each record is the same length (in bytes), you can use Spark’s binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here’s an example in python to show how it works: # write data from an
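A minimal sketch of the fixed-length-record approach described above, not the example from the original message. The record layout (three little-endian float64 values per record) and the helper names `write_records`, `parse_record`, `load_records`, and `read_records_locally` are assumptions made for illustration; `binaryRecords(path, recordLength)` is the SparkContext method named in the message.

```python
import os
import struct
import tempfile

# Assumption: each record holds three little-endian float64 values.
RECORD_FORMAT = "<3d"
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)  # bytes per record


def write_records(path, rows):
    """Write each row as one fixed-length binary record."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(RECORD_FORMAT, *row))


def parse_record(raw):
    """Decode one fixed-length record (a bytes object) into a tuple of floats."""
    return struct.unpack(RECORD_FORMAT, raw)


def load_records(sc, path):
    """Load the file as an RDD of tuples; sc is an existing SparkContext.

    Each RDD element produced by binaryRecords is the raw bytes of one record,
    so the file is never read in full on the driver.
    """
    return sc.binaryRecords(path, RECORD_LENGTH).map(parse_record)


def read_records_locally(path):
    """Spark-free check that splits a small file the same way."""
    with open(path, "rb") as f:
        data = f.read()
    return [parse_record(data[i:i + RECORD_LENGTH])
            for i in range(0, len(data), RECORD_LENGTH)]
```

With a running SparkContext `sc`, `load_records(sc, path).count()` should equal the file size divided by `RECORD_LENGTH`, provided the file size is an exact multiple of the record length.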

Re: Can LBFGS be used on streaming data?

2015-03-19 Thread Jeremy Freeman
splitting it into 5 sets, it gives me a little bit different weights (difference is in decimals). I am still trying to analyse why would this be happening. Any inputs, on why would this be happening? Best Regards, Arunkumar On Tue, Mar 17, 2015 at 11:32 AM, Jeremy Freeman freeman.jer

Re: Can LBFGS be used on streaming data?

2015-03-17 Thread Jeremy Freeman
Hi Arunkumar, That looks like it should work. Logically, it’s similar to the implementation used by StreamingLinearRegression and StreamingLogisticRegression, see this class:

Re: Streaming linear regression example question

2015-03-15 Thread Jeremy Freeman
Hi Margus, thanks for reporting this, I’ve been able to reproduce and there does indeed appear to be a bug. I’ve created a JIRA and have a fix ready, can hopefully include in 1.3.1. In the meantime, you can get the desired result using transform: model.trainOn(trainingData)

Re: MLlib(Logistic Regression) + Spark Streaming.

2014-12-28 Thread Jeremy Freeman
Along with Xiangrui’s suggestion, we will soon be adding an implementation of Streaming Logistic Regression, which will be similar to the current version of Streaming Linear Regression, and continually update the model as new data arrive (JIRA). Hopefully this will be in v1.3. — Jeremy

Re: MLlib + Streaming

2014-12-28 Thread Jeremy Freeman
Hi Fernando, There’s currently no streaming ALS in Spark. I’m exploring a streaming singular value decomposition (JIRA) based on this paper (http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf), which might be one way to think about it. There has also been some cool recent work explicitly on

meetup october 30-31st in SF

2014-10-08 Thread Jeremy Freeman
Hi all, We’re organizing a meetup October 30-31st in downtown SF that might be of interest to the Spark community. The focus is on large-scale data analysis and its role in neuroscience. It will feature several active Spark developers and users, including Xiangrui Meng, Josh Rosen, Reza Zadeh,

Re: Anaconda Spark AMI

2014-07-13 Thread Jeremy Freeman
Hi Ben, This is great! I just spun up an EC2 cluster and tested basic pyspark + ipython/numpy/scipy functionality, and all seems to be working so far. Will let you know if any issues arise. We do a lot with pyspark + scientific computing, and for EC2 usage I think this is a terrific way to

Re: error loading large files in PySpark 0.9.0

2014-06-06 Thread Jeremy Freeman
Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat support. We recently wrote a custom hadoop input format so we can support flat binary files (https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop), and have been testing it in Scala.

Re: error loading large files in PySpark 0.9.0

2014-06-04 Thread Jeremy Freeman
Hey Matei, Wanted to let you know this issue appears to be fixed in 1.0.0. Great work! -- Jeremy

Re: Spark on an HPC setup

2014-05-28 Thread Jeremy Freeman
. -- Jeremy - Jeremy Freeman, PhD Neuroscientist @thefreemanlab On May 28, 2014, at 11:02 AM, Sidharth Kashyap sidharth.n.kash...@outlook.com wrote: Hi, Has anyone tried to get Spark working on an HPC setup? If yes, can you please share your learnings and how you went about

Re: Computing cosine similiarity using pyspark

2014-05-27 Thread Jeremy Freeman
Hi Jamal, One nice feature of PySpark is that you can easily use existing functions from NumPy and SciPy inside your Spark code. For a simple example, the following uses Spark's cartesian operation (which combines pairs of vectors into tuples), followed by NumPy's corrcoef to compute the Pearson
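A rough sketch of the pattern described above, not the code from the original message. It assumes an RDD whose elements are NumPy vectors of equal length; the function names `correlation`, `cosine_similarity`, and `pairwise` are hypothetical. `np.corrcoef` returns the Pearson correlation matrix, and since the thread subject asks about cosine similarity, a cosine variant is included for comparison.

```python
import numpy as np


def correlation(pair):
    """Pearson correlation of the two vectors in a (x, y) tuple."""
    x, y = pair
    return float(np.corrcoef(x, y)[0, 1])


def cosine_similarity(pair):
    """Cosine similarity of the two vectors in a (x, y) tuple."""
    x, y = pair
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def pairwise(rdd, metric):
    """Apply metric to every ordered pair of vectors in the RDD.

    rdd.cartesian(rdd) yields all (x, y) tuples, so the result has n * n
    elements for an RDD of n vectors.
    """
    return rdd.cartesian(rdd).map(metric)
```

With a SparkContext `sc`, usage would look like `pairwise(sc.parallelize(vectors), cosine_similarity).collect()`. The cartesian product grows quadratically with the number of vectors, so this suits small collections.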

spark ec2 error

2014-05-04 Thread Jeremy Freeman
exist, and earlier versions of spark-ec2.py still use deploy_templates from https://github.com/mesos/spark-ec2.git -b v2, which has the new variables. Using the updated spark-ec2.py from master works fine. -- Jeremy - Jeremy Freeman, PhD Neuroscientist @thefreemanlab

Re: spark ec2 error

2014-05-04 Thread Jeremy Freeman
Cool, glad to help! I just tested with 0.8.1 and 0.9.0 and both worked perfectly, so seems to all be good. -- Jeremy

Re: Initial job has not accepted any resources

2014-05-04 Thread Jeremy Freeman
Hey Pedro, From which version of Spark were you running the spark-ec2.py script? You might have run into the problem described here (http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-td5323.html), which Patrick just fixed up to ensure backwards compatibility. With the bug, it

Re: Combining RDD's columns

2014-04-18 Thread Jeremy Freeman
() Array[String] = Array(sign1, sign2, sign3) rdd1.zip(rdd2).collect() Array[(String, String)] = Array((name1,sign1), (name2,sign2), (name3,sign3)) In your case, you might have the first two RDDs calculated from some common raw data through a map. -- Jeremy - Jeremy Freeman, PhD

Re: Scala vs Python performance differences

2014-04-14 Thread Jeremy Freeman
, in our hands, that 40% number is ballpark correct, at least for some basic operations (e.g. textFile, count, reduce). -- Jeremy - Jeremy Freeman, PhD Neuroscientist @thefreemanlab

Re: Spark on other parallel filesystems

2014-04-04 Thread Jeremy Freeman
We run Spark (in Standalone mode) on top of a network-mounted file system (NFS), rather than HDFS, and find it to work great. It required no modification or special configuration to set this up; as Matei says, we just point Spark to data using the file location. -- Jeremy On Apr 4, 2014, at

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Jeremy Freeman
- jeremy freeman, phd neuroscientist @thefreemanlab On Mar 31, 2014, at 2:31 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Nicholas, I'm in Boston and would be interested in a Spark group. Not sure if you know this -- there was a meetup that never got off the ground. Anyway, I'd be +1

Re: error loading large files in PySpark 0.9.0

2014-03-24 Thread Jeremy Freeman
Thanks Matei, unfortunately doesn't seem to fix it. I tried batchSize = 10, 100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls at the same point in each case. -- Jeremy - jeremy freeman, phd neuroscientist @thefreemanlab On Mar 23, 2014, at 9:56

error loading large files in PySpark 0.9.0

2014-03-23 Thread Jeremy Freeman
Hi all, Hitting a mysterious error loading large text files, specific to PySpark 0.9.0. In PySpark 0.8.1, this works: data = sc.textFile(path/to/myfile) data.count() But in 0.9.0, it stalls. There are indications of completion up to: 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in

Re: Machine Learning on streaming data

2014-03-20 Thread Jeremy Freeman
Thanks TD, happy to share my experience with MLlib + Spark Streaming integration. Here's a gist with two examples I have working, one for StreamingLinearRegression and another for StreamingKMeans. https://gist.github.com/freeman-lab/9672685 The goal in each case was to implement a streaming

Re: example of non-line oriented input data?

2014-03-19 Thread Jeremy Freeman
Another vote on this, support for simple SequenceFiles and/or Avro would be terrific, as using plain text can be very space-inefficient, especially for numerical data. -- Jeremy On Mar 19, 2014, at 5:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'd second the request for Avro