of use here?
My current method happens to have a large overhead (much more than the actual
computation time). Also, I run short of memory on the driver when it has to
read the entire file.
On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com
wrote:
If it’s a flat binary file and each record is the same length (in bytes), you
can use Spark’s binaryRecords method (defined on the SparkContext), which loads
records from one or more large flat binary files into an RDD. Here’s an example
in Python to show how it works:
# write data from an
splitting it
into 5 sets, it gives me slightly different weights (the difference is in the
decimals). I am still trying to analyse why this would happen.
Any input on what might be causing it would be appreciated.
Best Regards,
Arunkumar
On Tue, Mar 17, 2015 at 11:32 AM, Jeremy Freeman freeman.jer
Hi Arunkumar,
That looks like it should work. Logically, it’s similar to the implementation
used by StreamingLinearRegression and StreamingLogisticRegression, see this
class:
Hi Margus, thanks for reporting this. I’ve been able to reproduce it, and there
does indeed appear to be a bug. I’ve created a JIRA and have a fix ready, which
can hopefully be included in 1.3.1.
In the meantime, you can get the desired result using transform:
model.trainOn(trainingData)
Along with Xiangrui’s suggestion, we will soon be adding an implementation of
Streaming Logistic Regression, which will be similar to the current version of
Streaming Linear Regression and will continually update the model as new data
arrive (JIRA). Hopefully this will be in v1.3.
— Jeremy
Hi Fernando,
There’s currently no streaming ALS in Spark. I’m exploring a streaming singular
value decomposition (JIRA) based on this paper
(http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf), which might be one way to
think about it.
There has also been some cool recent work explicitly on
Hi all,
We’re organizing a meetup October 30-31st in downtown SF that might be of
interest to the Spark community. The focus is on large-scale data analysis and
its role in neuroscience. It will feature several active Spark developers and
users, including Xiangrui Meng, Josh Rosen, Reza Zadeh,
Hi Ben,
This is great! I just spun up an EC2 cluster and tested basic pyspark +
ipython/numpy/scipy functionality, and all seems to be working so far. Will let
you know if any issues arise.
We do a lot with pyspark + scientific computing, and for EC2 usage I think this
is a terrific way to
Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat
support. We recently wrote a custom Hadoop InputFormat so we can support
flat binary files
(https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop),
and have been testing it in Scala.
Hey Matei,
Wanted to let you know this issue appears to be fixed in 1.0.0. Great work!
-- Jeremy
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049p6985.html
Sent from the Apache Spark User List mailing list
-- Jeremy
-
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab
On May 28, 2014, at 11:02 AM, Sidharth Kashyap sidharth.n.kash...@outlook.com
wrote:
Hi,
Has anyone tried to get Spark working on an HPC setup?
If yes, can you please share your learnings and how you went about
Hi Jamal,
One nice feature of PySpark is that you can easily use existing functions
from NumPy and SciPy inside your Spark code. For a simple example, the
following uses Spark's cartesian operation (which combines pairs of vectors
into tuples), followed by NumPy's corrcoef to compute the Pearson
exist, and earlier versions of spark-ec2.py
still use deploy_templates from https://github.com/mesos/spark-ec2.git -b
v2, which has the new variables.
Using the updated spark-ec2.py from master works fine.
-- Jeremy
Cool, glad to help! I just tested with 0.8.1 and 0.9.0 and both worked
perfectly, so seems to all be good.
-- Jeremy
Hey Pedro,
From which version of Spark were you running the spark-ec2.py script? You
might have run into the problem described here
(http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-error-td5323.html),
which Patrick just fixed up to ensure backwards compatibility.
With the bug, it
()
Array[String] = Array(sign1, sign2, sign3)
rdd1.zip(rdd2).collect()
Array[(String, String)] = Array((name1,sign1), (name2,sign2), (name3,sign3))
In your case, you might have the first two RDDs calculated from some common raw
data through a map.
-- Jeremy
, in our hands,
that 40% number is ballpark correct, at least for some basic operations (e.g.
textFile, count, reduce).
-- Jeremy
We run Spark (in Standalone mode) on top of a network-mounted file system
(NFS), rather than HDFS, and find it to work great. It required no modification
or special configuration to set this up; as Matei says, we just point Spark to
data using the file location.
-- Jeremy
On Apr 4, 2014, at
On Mar 31, 2014, at 2:31 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
Nicholas, I'm in Boston and would be interested in a Spark group. Not
sure if you know this -- there was a meetup that never got off the
ground. Anyway, I'd be +1
Thanks Matei, unfortunately that doesn't seem to fix it. I tried batchSize = 10,
100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls
at the same point in each case.
-- Jeremy
On Mar 23, 2014, at 9:56
Hi all,
Hitting a mysterious error loading large text files, specific to PySpark
0.9.0.
In PySpark 0.8.1, this works:
data = sc.textFile("path/to/myfile")
data.count()
But in 0.9.0, it stalls. There are indications of completion up to:
14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in
Thanks TD, happy to share my experience with MLLib + Spark Streaming
integration.
Here's a gist with two examples I have working, one for
StreamingLinearRegression and another for StreamingKMeans.
https://gist.github.com/freeman-lab/9672685
The goal in each case was to implement a streaming
Another vote on this: support for simple SequenceFiles and/or Avro would be
terrific, as using plain text can be very space-inefficient, especially for
numerical data.
-- Jeremy
On Mar 19, 2014, at 5:24 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
I'd second the request for Avro