Dear Spark users and developers,
I have released version 1.0.0 of the scalable-deeplearning package. This package is
based on the implementation of artificial neural networks in Spark ML. It is
intended for new Spark deep learning features that have not yet been merged into
Spark ML or that are too
Dear Spark developers,
I am working with Spark Streaming 1.6.1. The task is to get RDDs for some
external analytics from each time window. This external function accepts an RDD,
so I cannot use a DStream. I learned that DStream.window.compute(time) returns
Option[RDD]. I am trying to use it in the
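A minimal sketch of one way to hand each window's RDD to an RDD-only function,
assuming a hypothetical externalAnalytics function and an illustrative socket
source; foreachRDD on the windowed DStream avoids calling compute(time) directly:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical external analytics function that accepts an RDD.
def externalAnalytics(rdd: RDD[String]): Unit = { /* ... */ }

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// Each batch of the windowed DStream is an RDD, so foreachRDD hands the
// materialized window to the external function (window sizes are illustrative).
lines.window(Seconds(30), Seconds(10)).foreachRDD { rdd =>
  externalAnalytics(rdd)
}

ssc.start()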
Dear Spark developers,
Could you suggest how to perform pattern matching on the type of the graph edge
in the following scenario? I need to perform some math by means of
aggregateMessages on the graph edges if the edges are of type Double. Here is the code:
def my[VD: ClassTag, ED: ClassTag] (graph:
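A hedged sketch (the method name and body are illustrative, not the original code)
of dispatching on the edge attribute type via its ClassTag:

import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

// Run Double-only math when the edge attribute type is Double, otherwise
// return the graph unchanged.
def withDoubleEdgeMath[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[VD, ED] = {
  if (implicitly[ClassTag[ED]] == ClassTag.Double) {
    val g = graph.asInstanceOf[Graph[VD, Double]]
    // e.g. val sums = g.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.attr), _ + _)
    g.asInstanceOf[Graph[VD, ED]]
  } else {
    graph
  }
}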
-1, due to unresolved https://issues.apache.org/jira/browse/SPARK-15899
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, July 14, 2016 12:00 PM
To: dev@spark.apache.org
Subject: [VOTE] Release Apache Spark 2.0.0 (RC4)
Please vote on releasing the following candidate as Apache Spark
Here is the fix https://github.com/apache/spark/pull/13868
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Wednesday, June 22, 2016 6:43 PM
To: Ulanov, Alexander <alexander.ula...@hpe.com>
Cc: Mark Hamstra <m...@clearstorydata.com>; Marcelo Vanzin
<van...@cloudera.com>; de
4:09 PM
To: Marcelo Vanzin <van...@cloudera.com>
Cc: Ulanov, Alexander <alexander.ula...@hpe.com>; Reynold Xin
<r...@databricks.com>; dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC1)
It's also marked as Minor, not Blocker.
On Wed, Jun 22, 2016 at 4:07
-1
Spark Unit tests fail on Windows. Still not resolved, though marked as resolved.
https://issues.apache.org/jira/browse/SPARK-15893
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, June 21, 2016 6:27 PM
To: dev@spark.apache.org
Subject: [VOTE] Release Apache Spark 2.0.0 (RC1)
, May 13, 2016 12:38 PM
To: Ulanov, Alexander <alexander.ula...@hpe.com>
Cc: dev@spark.apache.org
Subject: Re: Shrinking the DataFrame lineage
Here's a JIRA for it: https://issues.apache.org/jira/browse/SPARK-13346
I don't have a great method currently, but hacks can get around it: c
Dear Spark developers,
Recently, I was trying to switch my code from RDDs to DataFrames in order to
compare the performance. The code computes an RDD in a loop. I use RDD.persist
followed by RDD.count to force Spark to compute the RDD and cache it, so that it
does not need to re-compute it on each
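A minimal sketch of the persist-then-count pattern described above, written for
DataFrames (the step function and iteration count are placeholders):

import org.apache.spark.sql.DataFrame

def iterate(initial: DataFrame, step: DataFrame => DataFrame, numIterations: Int): DataFrame = {
  var df = initial
  for (_ <- 1 to numIterations) {
    val next = step(df)
    next.persist()
    next.count()     // forces evaluation so the result is actually cached
    df.unpersist()   // release the previous iteration's cached data
    df = next
  }
  df
}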
it will involve shuffling.
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, April 26, 2016 2:44 PM
To: Ulanov, Alexander <alexander.ula...@hpe.com>
Cc: dev@spark.apache.org
Subject: Re: Number of partitions for binaryFiles
From what I understand, Spark code was written this way becau
Hi Ted,
I have 36 files of size ~600KB, and the remaining 74 are about 400KB.
Is there a workaround rather than changing Spark's code?
Best regards, Alexander
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, April 26, 2016 1:22 PM
To: Ulanov, Alexander <alexander.ula...@hpe.com>
C
Dear Spark developers,
I have 100 binary files in the local file system that I want to load into a Spark
RDD. I need the data from each file to be in a separate partition. However, I
cannot make this happen:
scala> sc.binaryFiles("/data/subset").partitions.size
res5: Int = 66
The "minPartitions"
Hi Pan,
There is a pull request that is supposed to fix the issue:
https://github.com/apache/spark/pull/9854
There is a workaround for saving/loading a model (however I am not sure if it
will work for the pipeline):
sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
val sameModel =
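A generic sketch of the save/load workaround quoted above (T stands for whatever
serializable model class is being persisted; I have not verified it for Pipeline
models):

import scala.reflect.ClassTag

def saveModel[T: ClassTag](model: T, path: String): Unit =
  sc.parallelize(Seq(model), 1).saveAsObjectFile(path)

def loadModel[T: ClassTag](path: String): T =
  sc.objectFile[T](path).first()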
, Alexander
From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com]
Sent: Thursday, January 21, 2016 3:34 AM
To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley
Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday
Subject: RE: Using CUDA within Spark / boosting linear algebra
Dear all
...@gmail.com]
Sent: Thursday, March 26, 2015 9:27 AM
To: John Canny
Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks;
Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear algebra
John, I have to disagree with you there. Dense matrices come up a lot in
industry
Hi Yanbo,
As long as two models fit into the memory of a single machine, there should be no
problems, so even 16GB machines can handle large models. (The master should have
more memory because it runs LBFGS.) In my experiments, I’ve trained models with
12M and 32M parameters without issues.
Best
is handled by Spark RDD,
i.e. each worker processes a subset of data partitions, and the master serves the
role of a parameter server.
Best regards, Alexander
From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Wednesday, December 30, 2015 4:03 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Hi Disha,
Multilayer perceptron classifier in Spark implements data parallelism.
Best regards, Alexander
From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 12:43 AM
To: dev@spark.apache.org; Ulanov, Alexander
Subject: Data and Model Parallelism in MLPC
Hi,
I
forward and back
propagation. However, this option does not seem very practical to me.
Best regards, Alexander
From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Tuesday, December 08, 2015 11:19 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Data and Model Parallelism in MLPC
Parameter Server is a new feature and thus does not match the goal of 2.0, which is
“to fix things that are broken in the current API and remove certain deprecated
APIs”. At the same time, I would be happy to have that feature.
With regards to machine learning, it would be great to move useful features
a look into how to zip the data sent as an update. Do you know
any options except going from double to single precision (or less)?
Best regards, Alexander
From: Evan Sparks [mailto:evan.spa...@gmail.com]
Sent: Saturday, October 17, 2015 2:24 PM
To: Joseph Bradley
Cc: Ulanov, Alexander; dev
the size of the data and
the model. Also, you have to make sure that all workers own local data, which is
a separate issue from the number of partitions.
Best regards, Alexander
From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Thursday, October 15, 2015 10:13 AM
To: Ulanov, Alexander
Cc
Bradley [mailto:jos...@databricks.com]
Sent: Wednesday, October 14, 2015 11:35 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Gradient Descent with large model size
For those numbers of partitions, I don't think you'll actually use tree
aggregation. The number of partitions needs
to be worthwhile for this rather small
dataset.
Best regards, Alexander
From: Disha Shrivastava [mailto:dishu@gmail.com]
Sent: Sunday, October 11, 2015 9:29 AM
To: Mike Hynes
Cc: dev@spark.apache.org; Ulanov, Alexander
Subject: Re: No speedup in MultiLayerPerceptronClassifier with increase in
number
Thank you, Nitin. This does explain the problem. It seems that the UI should make
this clearer to the user; otherwise it is simply misleading if you read it
as is.
From: Nitin Goyal [mailto:nitin2go...@gmail.com]
Sent: Sunday, October 11, 2015 5:57 AM
To: Ulanov, Alexander
Cc: dev
Dear Spark developers,
I am trying to understand how the Spark UI displays operations on a cached RDD.
For example, the following code caches an RDD:
>> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache
>> rdd.count
The Jobs tab shows me that the RDD is evaluated:
1: count at <console>:24
Hi Ankur,
Could you help with an explanation of the problem below?
Best regards, Alexander
From: Ulanov, Alexander
Sent: Friday, October 02, 2015 11:39 AM
To: 'Robin East'
Cc: dev@spark.apache.org
Subject: RE: GraphX PageRank keeps 3 copies of graph in memory
Hi Robin,
Sounds interesting. I am
]
Sent: Friday, October 02, 2015 12:27 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: GraphX PageRank keeps 3 copies of graph in memory
Alexander,
I’ve just run the benchmark and only end up with 2 sets of RDDs in the Storage
tab. This is on 1.5.0, what version are you using?
Robin
Dear Spark developers,
I would like to understand GraphX caching behavior with regards to PageRank in
Spark, in particular, the following implementation of PageRank:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
On each iteration
Dear Spark developers,
I have created a simple Spark application for spark-submit. It calls a machine
learning library from Spark MLlib that is executed in a number of iterations
that correspond to the same number of tasks in Spark. It seems that Spark
creates an executor for each task and then
of partitions per
node?
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, September 18, 2015 4:37 PM
To: Ulanov, Alexander
Cc: Feynman Liang; dev@spark.apache.org
Subject: Re: One element per node
Use a global atomic boolean and return nothing from that partition if the
boolean is true
Thank you! How can I guarantee that I have only one element per executor (per
worker, or per physical node)?
From: Feynman Liang [mailto:fli...@databricks.com]
Sent: Friday, September 18, 2015 4:06 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: One element per node
Dear Spark developers,
Is it possible (and how to do it if possible) to pick one element per physical
node from an RDD? Let's say the first element of any partition on that node.
The result would be an RDD[element], with the count of elements equal to the
number of nodes that have partitions of the
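A sketch of the "global atomic boolean" idea suggested above: the object below is
created once per executor JVM, so only the first partition processed on each
executor emits its first element. This approximates one element per physical node
when there is a single executor per node.

import java.util.concurrent.atomic.AtomicBoolean
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object FirstOnExecutor {
  val taken = new AtomicBoolean(false)
}

def onePerExecutor[T: ClassTag](rdd: RDD[T]): RDD[T] =
  rdd.mapPartitions { iter =>
    // Only the first partition to flip the flag on this executor emits an element.
    if (FirstOnExecutor.taken.compareAndSet(false, true) && iter.hasNext)
      Iterator.single(iter.next())
    else
      Iterator.empty
  }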
, September 16, 2015 5:35 PM
To: Feynman Liang
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Enum parameter in ML
I've tended to use Strings. Params can be created with a validator (isValid)
which can ensure users get an immediate error if they try to pass an
unsupported String
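A small sketch of a String-valued Param with a validator, along the lines described
above (the trait, param name, and allowed values are illustrative):

import org.apache.spark.ml.param.{Param, Params, ParamValidators}

trait HasSolver extends Params {
  // Users passing an unsupported value get an immediate error from the validator.
  final val solver: Param[String] = new Param[String](this, "solver",
    "solver to use (l-bfgs or gd)", ParamValidators.inArray(Array("l-bfgs", "gd")))

  def getSolver: String = $(solver)
}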
Hi Feynman,
Thank you for the suggestion. How can I ensure that there will be no problems for
Java users? (I only use the Scala API.)
Best regards, Alexander
From: Feynman Liang [mailto:fli...@databricks.com]
Sent: Monday, September 14, 2015 5:27 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Dear Spark developers,
I would like to create a DataFrame with one column. However, the
createDataFrame method expects at least a Product:
val data = Seq(1.0, 2.0)
val rdd = sc.parallelize(data, 2)
val df = sqlContext.createDataFrame(rdd)
[fail]:25: error: overloaded method value
Thank you for the quick response! I’ll use Tuple1
From: Feynman Liang [mailto:fli...@databricks.com]
Sent: Monday, September 14, 2015 11:05 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Data frame with one column
For an example, see the ml-feature word2vec user
guide<ht
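A minimal sketch of the Tuple1 workaround mentioned above: wrapping each value in
Tuple1 gives createDataFrame a Product to work with.

val data = Seq(1.0, 2.0)
val rdd = sc.parallelize(data, 2).map(Tuple1.apply)
val df = sqlContext.createDataFrame(rdd).toDF("value")  // renames the default "_1" column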
Dear Spark developers,
Could you suggest what the intended use of UnsafeRow is (except for Tungsten
groupBy and sort) and give an example of how to use it?
1) Is it intended to be instantiated as a copy of a Row in order to perform
in-place modifications of it?
2) Can I create a new UnsafeRow
:07 AM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
It seems that there is a nice improvement with Tungsten enabled when data is
persisted in memory: 2x and 3x. However, the improvement is not as nice for
Parquet: it is 1.5x. What’s interesting
is not interesting.
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, August 20, 2015 9:24 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Dataframe aggregation with Tungsten unsafe
Not sure what's going on or how you measure the time, but the difference here
is pretty big
)).toDF("key", "value")
data.write.parquet("/scratch/rxin/tmp/alex")
val df = sqlContext.read.parquet("/scratch/rxin/tmp/alex")
val t = System.nanoTime()
val res = df.groupBy("key").agg(sum("value"))
res.count()
println((System.nanoTime() - t) / 1e9)
On Thu, Aug 20, 2015 at 2:57 PM, Ulanov, Alexander
I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe
feature was added to Spark on April 29.)
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, August 20, 2015 5:26 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Dataframe aggregation
)
at
org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at
org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:64)
... 73 more
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, August 20, 2015 4:22 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
, Alexander
Cc: dev@spark.apache.org
Subject: Re: Dataframe aggregation with Tungsten unsafe
Please git pull :)
On Thu, Aug 20, 2015 at 5:35 PM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe
Dear Spark developers,
I am trying to benchmark the new DataFrame aggregation implemented under the
Tungsten project and released with Spark 1.4 (I am using the latest Spark from
the repo, i.e. 1.5):
https://github.com/apache/spark/pull/5725
It says that the aggregation should be faster due to
Dear Spark developers,
Are there any best practices or guidelines for machine learning unit tests in
Spark? After taking a brief look at the unit tests in ML and MLlib, I have
found that each algorithm is tested in a different way. There are a few kinds of
tests:
1) Partial check of internal
: Tuesday, July 28, 2015 12:05 PM
To: Ulanov, Alexander
Cc: Robin East; dev@spark.apache.org
Subject: Re: Two joins in GraphX Pregel implementation
On 27 Jul 2015, at 16:42, Ulanov, Alexander
alexander.ula...@hp.com wrote:
It seems that the mentioned two joins can
.
Do you know the reason why this improvement is not pushed?
CC’ing Dave
From: Robin East [mailto:robin.e...@xense.co.uk]
Sent: Monday, July 27, 2015 9:11 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Two joins in GraphX Pregel implementation
Quite possibly - there is a JIRA open
Dear Spark developers,
Below is the GraphX Pregel code snippet from
https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api:
(it does not contain caching step):
while (activeMessages > 0 && i < maxIterations) {
// Receive the messages:
27, 2015 8:56 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Two joins in GraphX Pregel implementation
What happens to this line of code:
messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts,
activeDir))).cache()
Part of the Pregel ‘contract’ is that vertices
Hi Shivaram,
Thank you for the explanation. Is there a direct way to check the length of the
lineage, i.e. that the computation is repeated?
Best regards, Alexander
From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Friday, July 17, 2015 10:10 AM
To: Ulanov, Alexander
Cc
spark.sql.unsafe.enabled=true removes the GC when persisting/unpersisting the
DataFrame?
Best regards, Alexander
From: Ulanov, Alexander
Sent: Monday, July 13, 2015 11:15 AM
To: shiva...@eecs.berkeley.edu
Cc: dev@spark.apache.org
Subject: RE: Model parallelism with RDD
Below are the average
then.
Best,
Burak
On Wed, Jul 15, 2015 at 3:04 PM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
Hi Burak,
I’ve modified my code as you suggested; however, it still leads to shuffling.
Could you suggest what’s wrong with my code or provide an example code
to me, because it is a direct
analogy from column or row-based data storage in matrices, which is used in
BLAS.
Best regards, Alexander
From: Burak Yavuz [mailto:brk...@gmail.com]
Sent: Tuesday, July 14, 2015 10:14 AM
To: Ulanov, Alexander
Cc: Rakesh Chalasani; dev@spark.apache.org
a local reduce before
aggregating across nodes.
Rakesh
On Mon, Jul 13, 2015 at 9:24 PM Ulanov, Alexander
alexander.ula...@hp.com wrote:
Dear Spark developers,
I am trying to perform BlockMatrix multiplication in Spark. My test is as
follows: 1)create a matrix of N
am missing something or
using it wrong.
Best regards, Alexander
From: Rakesh Chalasani [mailto:vnit.rak...@gmail.com]
Sent: Tuesday, July 14, 2015 9:05 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: BlockMatrix multiplication
Hi Alexander:
Aw, I missed the 'cogroup
From: Burak Yavuz [mailto:brk...@gmail.com]
Sent: Tuesday, July 14, 2015 10:14 AM
To: Ulanov, Alexander
Cc: Rakesh Chalasani; dev@spark.apache.org
Subject: Re: BlockMatrix multiplication
Hi Alexander,
From your example code, using the GridPartitioner, you will have 1 column, and
5 rows. When you
}
println("Avg iteration time: " + avgTime / numIterations)
Best regards, Alexander
From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
Sent: Friday, July 10, 2015 10:04 PM
To: Ulanov, Alexander
Cc: shiva...@eecs.berkeley.edu; dev@spark.apache.org
Subject: Re: Model parallelism with RDD
Dear Spark developers,
I am trying to perform BlockMatrix multiplication in Spark. My test is as
follows: 1) create a matrix of N blocks, so that each row of the block matrix
contains only 1 block and each block resides in a separate partition on a
separate node, 2) transpose the block matrix and
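A small self-contained sketch of this kind of test (the block size and the number
of blocks are illustrative):

import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

val n = 5          // number of row blocks
val blockSize = 2  // rows and columns per block

// One block per row of the grid, each block in its own partition.
val blocks = sc.parallelize(0 until n, n).map { i =>
  ((i, 0), Matrices.rand(blockSize, blockSize, new java.util.Random(i)))
}
val a = new BlockMatrix(blocks, blockSize, blockSize)

// Transpose and multiply; the result is an (n * blockSize) x (n * blockSize) matrix.
val product = a.multiply(a.transpose)
product.validate()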
Hi,
I am interested in how scalable model parallelism can be within Spark.
Suppose the model contains N weights of type Double and N is so large that it
does not fit into the memory of a single node. So we can store the model in an
RDD[Double] across several nodes. To train the model, one needs
],
MapPartitionsRDD[68] at explain at <console>:25
Could Spark SQL developers suggest why this happens?
Best regards, Alexander
From: Stephen Carman [mailto:scar...@coldlight.com]
Sent: Wednesday, June 24, 2015 12:33 PM
To: Ulanov, Alexander
Cc: CC GP; dev@spark.apache.org
Subject: Re: Force inner
Hi,
My Hadoop is configured with a replication factor of 2. I've added
$HADOOP_HOME/config to the PATH as suggested in
http://apache-spark-user-list.1001560.n3.nabble.com/hdfs-replication-on-saving-RDD-td289.html.
Spark (1.4) does rdd.saveAsTextFile with replication=2. However
Hi,
Is there a way to increase the number of partitions of an RDD without causing a
shuffle? I've found JIRA issue https://issues.apache.org/jira/browse/SPARK-5997;
however, there is no implementation yet.
Just in case, I am reading data from ~300 big binary files, which results in
300 partitions,
Testing too. Recently I got a few undelivered mails to the dev list.
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Thursday, May 14, 2015 3:39 PM
To: dev@spark.apache.org
Subject: testing HTML email
Testing html emails ...
Hello
This is bold
This is a link: http://databricks.com/
)
Best regards, Alexander
-Original Message-
From: Ulanov, Alexander
Sent: Monday, May 11, 2015 11:59 AM
To: Olivier Girardot; Michael Armbrust
Cc: Reynold Xin; dev@spark.apache.org
Subject: RE: DataFrame distinct vs RDD distinct
Hi,
Could you suggest alternative way of implementing
Hi,
Could you suggest an alternative way of implementing distinct, e.g. via fold or
aggregate? Both SQL distinct and RDD distinct fail on my dataset due to
overflow of the Spark shuffle disk. I have 7 nodes with 300GB dedicated to Spark
shuffle each. My dataset is 2B rows; the field which I'm
Thank you for suggestions!
From: Reynold Xin [mailto:r...@databricks.com]
Sent: Friday, May 08, 2015 11:10 AM
To: Will Benton
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Easy way to convert Row back to case class
In 1.4, you can do
row.getInt(colName)
In 1.5, some variant
Hi Pramod,
For cluster-like tests you might want to use the same code as in mllib's
LocalClusterSparkContext. You can rebuild only the package that you change and
then run this main class.
Best regards, Alexander
-Original Message-
From: Pramod Biligiri
there was disagreement about how to proceed. So I don't think a 'lock' is
necessary in practice and don't think even signaling has been a problem.
On Apr 23, 2015 6:14 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:
My thinking is that current way of assigning a contributor after
My thinking is that the current way of assigning a contributor after the patch is
done (or almost done) is OK. Parallel efforts are also OK until they are
discussed in the issue's thread. Ilya Ganelin made a good point that it is
about moving the project forward. It also adds a means of competition
: Tuesday, April 07, 2015 3:28 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Regularization in MLlib
1) Norm(weights, N) will return (w_1^N + w_2^N + ...)^(1/N), so norm * norm is
required.
2) This is a bug, as you said. I intend to fix this using weighted regularization
you suggest?
(it seems that the new version of Spark was not tested on Windows; previous
versions worked more or less fine for me)
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Friday, April 03, 2015 1:04 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
: Thursday, April 02, 2015 1:26 PM
To: Joseph Bradley
Cc: Ulanov, Alexander; dev@spark.apache.org
Subject: Re: Stochastic gradient descent performance
I haven't looked closely at the sampling issues, but regarding the aggregation
latency, there are fixed overheads (in local and distributed mode
on this? I do understand that in cluster mode
the network speed will kick in and then one can blame it.
Best regards, Alexander
From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Thursday, April 02, 2015 10:51 AM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Stochastic
Sorry for bothering you again, but I think that it is an important issue for the
applicability of SGD in Spark MLlib. Could Spark developers please comment on
it?
-Original Message-
From: Ulanov, Alexander
Sent: Monday, March 30, 2015 5:00 PM
To: dev@spark.apache.org
Subject: Stochastic
Thanks, sounds interesting! How do you load files to Spark? Did you consider
having multiple files instead of file lines?
From: Hector Yee [mailto:hector@gmail.com]
Sent: Wednesday, April 01, 2015 11:36 AM
To: Ulanov, Alexander
Cc: Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
Subject
...@gmail.com]
Sent: Wednesday, April 01, 2015 1:37 PM
To: Hector Yee
Cc: Ulanov, Alexander; Evan R. Sparks; Stephen Boesch; dev@spark.apache.org
Subject: Re: Storing large data for MLlib machine learning
@Alexander, re: using flat binary and metadata, you raise excellent points! At
least in our case, we
Hi,
It seems to me that there is an overhead in the runMiniBatchSGD function of
MLlib's GradientDescent. In particular, sample and treeAggregate might
take time that is an order of magnitude greater than the actual gradient
computation. For example, for the MNIST dataset of 60K instances, minibatch
@spark.apache.org; Ulanov, Alexander;
jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra
Hi Alex,
Since it is non-trivial to make nvblas work with netlib-java, it would be great
if you can send the instructions to netlib-java as part of the README.
Hopefully we don't need to modify
, Alexander
-Original Message-
From: Ulanov, Alexander
Sent: Tuesday, March 24, 2015 6:57 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra
Hi,
I am trying to use nvblas with netlib
and took you a while to figure out! Would you
mind posting a gist or something of maybe the shell scripts/exports
you used to make this work - I can imagine it being highly useful for others
in the future.
Thanks!
Evan
On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander
alexander.ula...@hp.com
: Wednesday, March 25, 2015 2:55 PM
To: Ulanov, Alexander
Cc: Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R.
Sparks; jfcanny
Subject: Re: Using CUDA within Spark / boosting linear algebra
Alexander,
does using netlib imply that one cannot switch between CPU and GPU blas
/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
-Original Message-
From: Ulanov, Alexander
Sent: Wednesday, March 25, 2015 2:31 PM
To: Sam Halliday
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny
Subject: RE: Using CUDA within Spark / boosting linear algebra
Hi again,
I
trait that will
provide a sink operation (basically memory will be allocated by the user)...
adding more BLAS operators in Breeze will also help in general, as a lot more
operations are defined over there...
On Wed, Mar 18, 2015 at 8:09 PM, Ulanov, Alexander
alexander.ula...@hp.com
Hi,
Currently I am using Breeze within Spark MLlib for linear algebra. I would like
to reuse previously allocated matrices for storing the result of matrix
multiplication, i.e. I need to use the gemm function C := q*A*B + p*C, which is
missing in Breeze (Breeze automatically allocates a new matrix
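A sketch of one way around this (not a Breeze API): call dgemm from netlib-java
directly on the arrays backing Breeze dense matrices, so C := q*A*B + p*C reuses
the already-allocated C. It assumes plain column-major matrices, not views.

import breeze.linalg.DenseMatrix
import com.github.fommil.netlib.BLAS

val blas = BLAS.getInstance()

def gemm(q: Double, a: DenseMatrix[Double], b: DenseMatrix[Double],
         p: Double, c: DenseMatrix[Double]): Unit = {
  require(a.cols == b.rows && c.rows == a.rows && c.cols == b.cols)
  // Leading dimensions equal the row counts because the matrices are contiguous column-major.
  blas.dgemm("N", "N", a.rows, b.cols, a.cols,
    q, a.data, a.rows, b.data, b.rows, p, c.data, c.rows)
}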
Hi,
I am working on artificial neural networks for Spark. It is trained with
gradient descent, so at each step the data is read, the sum of gradients is
calculated for each data partition (on each worker), aggregated (on the driver),
and broadcast back. I noticed that the gradient computation time is
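A sketch of that per-step pattern (the gradient function and step size are
placeholders): broadcast the weights, sum the gradients per partition with
treeAggregate, then update on the driver.

import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

def step(data: RDD[Array[Double]],
         weights: DenseVector[Double],
         gradient: (Array[Double], DenseVector[Double]) => DenseVector[Double],
         stepSize: Double): DenseVector[Double] = {
  val bcWeights = data.context.broadcast(weights)
  // Sum of per-example gradients, computed partition by partition on the workers.
  val gradSum = data.treeAggregate(DenseVector.zeros[Double](weights.length))(
    seqOp = (acc, point) => acc + gradient(point, bcWeights.value),
    combOp = (a, b) => a + b)
  weights - gradSum * stepSize
}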
this and will appreciate any help
from you ☺
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 09, 2015 6:01 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra
Thanks so much
for enabling high performance
binaries on OSX and Linux? Or better, figure out a way for the
system to fetch these automatically.
- Evan
On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
Just to summarize this thread, I was finally able to make all
Hi,
I've implemented class MyClass in MLlib that does some operation on
LabeledPoint. MyClass extends Serializable, so I can map this operation on data
of RDD[LabeledPoint], such as data.map(lp => MyClass.operate(lp)). I write
this class to a file with ObjectOutputStream.writeObject. Then I stop
.
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
-Original Message-
From: Sam Halliday [mailto:sam.halli...@gmail.com]
Sent: Monday, March 02, 2015 1:24 PM
To: Ulanov, Alexander
Subject: Re: Using CUDA within Spark / boosting linear
[mailto:men...@gmail.com]
Sent: Monday, March 02, 2015 11:42 AM
To: Sam Halliday
Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra
On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday sam.halli...@gmail.com wrote:
Also, check
Typo - CPU was 2.5x cheaper (not GPU!)
-Original Message-
From: Ulanov, Alexander
Sent: Thursday, February 26, 2015 2:01 PM
To: Sam Halliday; Xiangrui Meng
Cc: dev@spark.apache.org; Joseph Bradley; Evan R. Sparks
Subject: RE: Using CUDA within Spark / boosting linear algebra
Evan, thank
: Thursday, February 26, 2015 1:56 PM
To: Xiangrui Meng
Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks
Subject: Re: Using CUDA within Spark / boosting linear algebra
Btw, I wish people would stop cheating when comparing CPU and GPU timings for
things like matrix
results.
https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
One thing still needs exploration: does BIDMat-cublas perform copying to/from
machine’s RAM?
-Original Message-
From: Ulanov, Alexander
Sent: Tuesday, February 10, 2015 2:12 PM
: Monday, February 09, 2015 6:06 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
Great - perhaps we can move this discussion off-list and onto a JIRA ticket?
(Here's one: https://issues.apache.org/jira/browse/SPARK-5705
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
I would build OpenBLAS yourself, since good BLAS performance comes from getting
cache sizes, etc. set up correctly for your particular hardware - this is often
a very tricky
a specific BLAS (not a specific wrapper for BLAS).
Btw, I have installed OpenBLAS (yum install openblas), so I suppose that netlib
is using it.
From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev
05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
Hi Alexander,
Using GPUs with Spark would be very exciting. Small comment: Concerning your
question earlier about keeping data stored on the GPU rather
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many
cases.
You might consider taking a look at the codepaths that BIDMat
(https://github.com/BIDData/BIDMat
:29 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra
I'd be surprised if BIDMat+OpenBLAS was significantly faster than
netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout
and fewer levels of indirection