scalable-deeplearning 1.0.0 released

2016-09-09 Thread Ulanov, Alexander
Dear Spark users and developers, I have released version 1.0.0 of scalable-deeplearning package. This package is based on the implementation of artificial neural networks in Spark ML. It is intended for new Spark deep learning features that were not yet merged to Spark ML or that are too

Spark streaming get RDD within the sliding window

2016-08-24 Thread Ulanov, Alexander
Dear Spark developers, I am working with Spark streaming 1.6.1. The task is to get RDDs for some external analytics from each time window. This external function accepts an RDD, so I cannot use DStream. I learned that DStream.window.compute(time) returns Option[RDD]. I am trying to use it in the
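A minimal sketch of one way to hand each window's RDD to an external function, assuming a text stream and a hypothetical externalAnalytics function; window() plus foreachRDD avoids calling compute(time) directly:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def externalAnalytics(rdd: RDD[String]): Unit = { /* hypothetical */ }

    val ssc = new StreamingContext(sc, Seconds(10))
    val stream = ssc.socketTextStream("localhost", 9999)
    // window() yields a DStream whose RDDs cover the sliding window;
    // foreachRDD passes each windowed RDD to the external RDD-based function
    stream.window(Seconds(60), Seconds(10)).foreachRDD { rdd =>
      externalAnalytics(rdd)
    }
    ssc.start()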

Graph edge type pattern matching in GraphX

2016-08-02 Thread Ulanov, Alexander
Dear Spark developers, Could you suggest how to perform pattern matching on the type of the graph edge in the following scenario? I need to perform some math by means of aggregateMessages on the graph edges if the edges are Double. Here is the code: def my[VD: ClassTag, ED: ClassTag] (graph:
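A sketch of one way to do this, assuming ClassTag equality is an acceptable dispatch mechanism: compare the edge ClassTag with classTag[Double], then cast before calling aggregateMessages.

    import scala.reflect.{ClassTag, classTag}
    import org.apache.spark.graphx._

    def my[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Unit = {
      if (classTag[ED] == classTag[Double]) {
        // the cast is safe after the ClassTag check above
        val g = graph.asInstanceOf[Graph[VD, Double]]
        val sums = g.aggregateMessages[Double](
          ctx => ctx.sendToDst(ctx.attr), // send the Double edge attribute
          _ + _)                          // merge messages by summation
        sums.count()
      }
    }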

RE: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Ulanov, Alexander
-1, due to unresolved https://issues.apache.org/jira/browse/SPARK-15899 From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, July 14, 2016 12:00 PM To: dev@spark.apache.org Subject: [VOTE] Release Apache Spark 2.0.0 (RC4) Please vote on releasing the following candidate as Apache Spark

RE: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Ulanov, Alexander
Here is the fix https://github.com/apache/spark/pull/13868 From: Reynold Xin [mailto:r...@databricks.com] Sent: Wednesday, June 22, 2016 6:43 PM To: Ulanov, Alexander <alexander.ula...@hpe.com> Cc: Mark Hamstra <m...@clearstorydata.com>; Marcelo Vanzin <van...@cloudera.com>; de

RE: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Ulanov, Alexander
4:09 PM To: Marcelo Vanzin <van...@cloudera.com> Cc: Ulanov, Alexander <alexander.ula...@hpe.com>; Reynold Xin <r...@databricks.com>; dev@spark.apache.org Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC1) It's also marked as Minor, not Blocker. On Wed, Jun 22, 2016 at 4:07

RE: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Ulanov, Alexander
-1 Spark Unit tests fail on Windows. Still not resolved, though marked as resolved. https://issues.apache.org/jira/browse/SPARK-15893 From: Reynold Xin [mailto:r...@databricks.com] Sent: Tuesday, June 21, 2016 6:27 PM To: dev@spark.apache.org Subject: [VOTE] Release Apache Spark 2.0.0 (RC1)

RE: Shrinking the DataFrame lineage

2016-05-13 Thread Ulanov, Alexander
, May 13, 2016 12:38 PM To: Ulanov, Alexander <alexander.ula...@hpe.com> Cc: dev@spark.apache.org Subject: Re: Shrinking the DataFrame lineage Here's a JIRA for it: https://issues.apache.org/jira/browse/SPARK-13346 I don't have a great method currently, but hacks can get around it: c
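The reply is truncated, but one commonly suggested hack is to rebuild the DataFrame from its materialized RDD and schema so the new plan no longer carries the accumulated lineage; a sketch under that assumption:

    // materialize first so the rebuilt DataFrame reads from the cache
    df.persist()
    df.count()
    val shortLineageDf = sqlContext.createDataFrame(df.rdd, df.schema)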

Shrinking the DataFrame lineage

2016-05-11 Thread Ulanov, Alexander
Dear Spark developers, Recently, I was trying to switch my code from RDDs to DataFrames in order to compare the performance. The code computes an RDD in a loop. I use RDD.persist followed by RDD.count to force Spark to compute the RDD and cache it, so that it does not need to re-compute it on each

RE: Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
it will involve shuffling. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, April 26, 2016 2:44 PM To: Ulanov, Alexander <alexander.ula...@hpe.com> Cc: dev@spark.apache.org Subject: Re: Number of partitions for binaryFiles From what I understand, Spark code was written this way becau

RE: Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
Hi Ted, I have 36 files of size ~600KB and the remaining 74 are about 400KB. Is there a workaround rather than changing Spark's code? Best regards, Alexander From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, April 26, 2016 1:22 PM To: Ulanov, Alexander <alexander.ula...@hpe.com> C

Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
Dear Spark developers, I have 100 binary files in local file system that I want to load into Spark RDD. I need the data from each file to be in a separate partition. However, I cannot make it happen: scala> sc.binaryFiles("/data/subset").partitions.size res5: Int = 66 The "minPartitions"
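A workaround sketch under the constraints discussed in this thread: request at least one partition per file, then repartition to exactly 100 (at the cost of a shuffle, as noted in the replies); exact one-file-per-partition placement is not guaranteed.

    val files = sc.binaryFiles("/data/subset", minPartitions = 100)
    val repartitioned = files.repartition(100) // shuffles
    println(repartitioned.partitions.size)     // 100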

RE: MLPC model can not be saved

2016-03-21 Thread Ulanov, Alexander
Hi Pan, There is a pull request that is supposed to fix the issue: https://github.com/apache/spark/pull/9854 There is a workaround for saving/loading a model (however I am not sure if it will work for the pipeline): sc.parallelize(Seq(model), 1).saveAsObjectFile("path") val sameModel =
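For completeness, a sketch of the full round trip behind the quoted workaround; the load side is an assumption based on the standard objectFile API, with the MLPC model type spelled out:

    import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel

    sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
    val sameModel =
      sc.objectFile[MultilayerPerceptronClassificationModel]("path").first()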

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Ulanov, Alexander
, Alexander From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] Sent: Thursday, January 21, 2016 3:34 AM To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday Subject: RE: Using CUDA within Spark / boosting linear algebra Dear all

RE: Using CUDA within Spark / boosting linear algebra

2016-01-20 Thread Ulanov, Alexander
...@gmail.com] Sent: Thursday, March 26, 2015 9:27 AM To: John Canny Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; Ulanov, Alexander Subject: Re: Using CUDA within Spark / boosting linear algebra John, I have to disagree with you there. Dense matrices come up a lot in industry

RE: SparkML algos limitations question.

2016-01-04 Thread Ulanov, Alexander
Hi Yanbo, As long as two models fit into the memory of a single machine, there should be no problems, so even 16GB machines can handle large models. (The master should have more memory because it runs LBFGS.) In my experiments, I’ve trained models with 12M and 32M parameters without issues. Best

RE: Data and Model Parallelism in MLPC

2016-01-04 Thread Ulanov, Alexander
is handled by Spark RDD, i.e. each worker processes a subset of data partitions, and master serves the role of parameter server. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Wednesday, December 30, 2015 4:03 AM To: Ulanov, Alexander Cc: dev@spark.apache.org

RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
Hi Disha, Multilayer perceptron classifier in Spark implements data parallelism. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Tuesday, December 08, 2015 12:43 AM To: dev@spark.apache.org; Ulanov, Alexander Subject: Data and Model Parallelism in MLPC Hi, I

RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
forward and back propagation. However, this option does not seem very practical to me. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Tuesday, December 08, 2015 11:19 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Data and Model Parallelism in MLPC

RE: A proposal for Spark 2.0

2015-11-12 Thread Ulanov, Alexander
Parameter Server is a new feature and thus does not match the goal of 2.0, which is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time I would be happy to have that feature. With regards to machine learning, it would be great to move useful features

RE: Gradient Descent with large model size

2015-10-19 Thread Ulanov, Alexander
a look into how to zip the data sent as update. Do you know any options except going from double to single precision (or less) ? Best regards, Alexander From: Evan Sparks [mailto:evan.spa...@gmail.com] Sent: Saturday, October 17, 2015 2:24 PM To: Joseph Bradley Cc: Ulanov, Alexander; dev

RE: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-15 Thread Ulanov, Alexander
the size of the data and the model. Also, you have to make sure that all workers own local data, which is a separate thing from the number of partitions. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Thursday, October 15, 2015 10:13 AM To: Ulanov, Alexander Cc

RE: Gradient Descent with large model size

2015-10-15 Thread Ulanov, Alexander
Bradley [mailto:jos...@databricks.com] Sent: Wednesday, October 14, 2015 11:35 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Gradient Descent with large model size For those numbers of partitions, I don't think you'll actually use tree aggregation. The number of partitions needs
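A minimal sketch of the knob under discussion, assuming data is a hypothetical RDD[Double]: treeAggregate only builds a multi-level tree when there are enough partitions, and depth caps the number of levels.

    val total = data.treeAggregate(0.0)(_ + _, _ + _, depth = 4)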

RE: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-12 Thread Ulanov, Alexander
to be worthwhile for this rather small dataset. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Sunday, October 11, 2015 9:29 AM To: Mike Hynes Cc: dev@spark.apache.org; Ulanov, Alexander Subject: Re: No speedup in MultiLayerPerceptronClassifier with increase in number

RE: Operations with cached RDD

2015-10-12 Thread Ulanov, Alexander
Thank you, Nitin. This does explain the problem. It seems that the UI should make this clearer to the user; otherwise it is simply misleading if you read it as is. From: Nitin Goyal [mailto:nitin2go...@gmail.com] Sent: Sunday, October 11, 2015 5:57 AM To: Ulanov, Alexander Cc: dev

Operations with cached RDD

2015-10-09 Thread Ulanov, Alexander
Dear Spark developers, I am trying to understand how the Spark UI displays operations with a cached RDD. For example, the following code caches an rdd: >> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache >> rdd.count The Jobs tab shows me that the RDD is evaluated: 1 count at <console>:24

RE: GraphX PageRank keeps 3 copies of graph in memory

2015-10-07 Thread Ulanov, Alexander
Hi Ankur, Could you help with explanation of the problem below? Best regards, Alexander From: Ulanov, Alexander Sent: Friday, October 02, 2015 11:39 AM To: 'Robin East' Cc: dev@spark.apache.org Subject: RE: GraphX PageRank keeps 3 copies of graph in memory Hi Robin, Sounds interesting. I am

RE: GraphX PageRank keeps 3 copies of graph in memory

2015-10-02 Thread Ulanov, Alexander
] Sent: Friday, October 02, 2015 12:27 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: GraphX PageRank keeps 3 copies of graph in memory Alexander, I’ve just run the benchmark and only end up with 2 sets of RDDs in the Storage tab. This is on 1.5.0, what version are you using? Robin

GraphX PageRank keeps 3 copies of graph in memory

2015-09-30 Thread Ulanov, Alexander
Dear Spark developers, I would like to understand GraphX caching behavior with regards to PageRank in Spark, in particular, the following implementation of PageRank: https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala On each iteration

Too many executors are created

2015-09-29 Thread Ulanov, Alexander
Dear Spark developers, I have created a simple Spark application for spark-submit. It calls a machine learning library from Spark MLlib that is executed in a number of iterations that correspond to the same number of tasks in Spark. It seems that Spark creates an executor for each task and then

RE: One element per node

2015-09-18 Thread Ulanov, Alexander
of partitions per node? From: Reynold Xin [mailto:r...@databricks.com] Sent: Friday, September 18, 2015 4:37 PM To: Ulanov, Alexander Cc: Feynman Liang; dev@spark.apache.org Subject: Re: One element per node Use a global atomic boolean and return nothing from that partition if the boolean is true
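A sketch of the suggestion quoted above, assuming one executor JVM per node: a singleton object is instantiated once per JVM, so an AtomicBoolean in it lets exactly one partition per executor emit an element (rdd is hypothetical).

    import java.util.concurrent.atomic.AtomicBoolean

    object ExecutorFlag {
      val taken = new AtomicBoolean(false) // one instance per executor JVM
    }

    val onePerExecutor = rdd.mapPartitions { iter =>
      if (ExecutorFlag.taken.compareAndSet(false, true)) iter.take(1)
      else Iterator.empty
    }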

RE: One element per node

2015-09-18 Thread Ulanov, Alexander
Thank you! How can I guarantee that I have only one element per executor (per worker, or per physical node)? From: Feynman Liang [mailto:fli...@databricks.com] Sent: Friday, September 18, 2015 4:06 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: One element per node

One element per node

2015-09-18 Thread Ulanov, Alexander
Dear Spark developers, Is it possible (and how to do it if possible) to pick one element per physical node from an RDD? Let's say the first element of any partition on that node. The result would be an RDD[element], where the count of elements is equal to the number of nodes that have partitions of the

RE: Enum parameter in ML

2015-09-16 Thread Ulanov, Alexander
, September 16, 2015 5:35 PM To: Feynman Liang Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Enum parameter in ML I've tended to use Strings. Params can be created with a validator (isValid) which can ensure users get an immediate error if they try to pass an unsupported String
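A sketch of the String-with-validator pattern described in the reply; the trait and parameter names here are made up for illustration:

    import org.apache.spark.ml.param.{Param, ParamValidators, Params}

    trait HasSolver extends Params {
      // unsupported values fail immediately in set(), for Scala and Java alike
      val solver: Param[String] = new Param[String](this, "solver",
        "optimizer to use (l-bfgs or gd)",
        ParamValidators.inArray[String](Array("l-bfgs", "gd")))
      def getSolver: String = $(solver)
    }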

RE: Enum parameter in ML

2015-09-14 Thread Ulanov, Alexander
Hi Feynman, Thank you for suggestion. How can I ensure that there will be no problems for Java users? (I only use Scala API) Best regards, Alexander From: Feynman Liang [mailto:fli...@databricks.com] Sent: Monday, September 14, 2015 5:27 PM To: Ulanov, Alexander Cc: dev@spark.apache.org

Data frame with one column

2015-09-14 Thread Ulanov, Alexander
Dear Spark developers, I would like to create a dataframe with one column. However, the createDataFrame method accepts at least a Product: val data = Seq(1.0, 2.0) val rdd = sc.parallelize(data, 2) val df = sqlContext.createDataFrame(rdd) [fail] <console>:25: error: overloaded method value

RE: Data frame with one column

2015-09-14 Thread Ulanov, Alexander
Thank you for the quick response! I’ll use Tuple1 From: Feynman Liang [mailto:fli...@databricks.com] Sent: Monday, September 14, 2015 11:05 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Data frame with one column For an example, see the ml-feature word2vec user guide<ht
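The resulting pattern, spelled out as a small sketch (the column name is assumed):

    val data = Seq(1.0, 2.0)
    val rdd = sc.parallelize(data, 2).map(Tuple1.apply) // each element is now a Product
    val df = sqlContext.createDataFrame(rdd).toDF("value")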

Use of UnsafeRow

2015-09-01 Thread Ulanov, Alexander
Dear Spark developers, Could you suggest what is the intended use of UnsafeRow (except for Tungsten groupBy and sort) and give an example of how to use it? 1) Is it intended to be instantiated as the copy of the Row in order to perform in-place modifications of it? 2) Can I create a new UnsafeRow

Re: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Ulanov, Alexander
:07 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: It seems that there is a nice improvement with Tungsten enabled given that data is persisted in memory 2x and 3x. However, the improvement is not that nice for parquet, it is 1.5x. What’s interesting

RE: Dataframe aggregation with Tungsten unsafe

2015-08-21 Thread Ulanov, Alexander
is not interesting. From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, August 20, 2015 9:24 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Dataframe aggregation with Tungsten unsafe Not sure what's going on or how you measure the time, but the difference here is pretty big

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
)).toDF("key", "value") data.write.parquet("/scratch/rxin/tmp/alex") val df = sqlContext.read.parquet("/scratch/rxin/tmp/alex") val t = System.nanoTime() val res = df.groupBy("key").agg(sum("value")) res.count() println((System.nanoTime() - t) / 1e9) On Thu, Aug 20, 2015 at 2:57 PM, Ulanov, Alexander

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe feature was added to Spark on April 29.) From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, August 20, 2015 5:26 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Dataframe aggregation

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
) at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30) at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:64) ... 73 more From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, August 20, 2015 4:22 PM To: Ulanov, Alexander Cc: dev@spark.apache.org

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
, Alexander Cc: dev@spark.apache.org Subject: Re: Dataframe aggregation with Tungsten unsafe Please git pull :) On Thu, Aug 20, 2015 at 5:35 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe

Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
Dear Spark developers, I am trying to benchmark the new DataFrame aggregation implemented under the project Tungsten and released with Spark 1.4 (I am using the latest Spark from the repo, i.e. 1.5): https://github.com/apache/spark/pull/5725 It says that the aggregation should be faster due to
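A condensed sketch of the benchmark discussed in this thread, following the code Reynold posts in a reply above; the path is hypothetical, and the flag name (spark.sql.unsafe.enabled) is taken from the related "Model parallelism" thread and may differ between builds:

    import org.apache.spark.sql.functions.sum

    sqlContext.setConf("spark.sql.unsafe.enabled", "true")
    val df = sqlContext.read.parquet("/path/to/data")
    val t = System.nanoTime()
    df.groupBy("key").agg(sum("value")).count()
    println((System.nanoTime() - t) / 1e9 + " s")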

Machine learning unit tests guidelines

2015-07-30 Thread Ulanov, Alexander
Dear Spark developers, Are there any best practices or guidelines for machine learning unit tests in Spark? After taking a brief look at the unit tests in ML and MLlib, I have found that each algorithm is tested in a different way. There are a few kinds of tests: 1) Partial check of internal

RE: Two joins in GraphX Pregel implementation

2015-07-29 Thread Ulanov, Alexander
: Tuesday, July 28, 2015 12:05 PM To: Ulanov, Alexander Cc: Robin East; dev@spark.apache.org Subject: Re: Two joins in GraphX Pregel implementation On 27 Jul 2015, at 16:42, Ulanov, Alexander alexander.ula...@hp.com wrote: It seems that the mentioned two joins can

RE: Two joins in GraphX Pregel implementation

2015-07-28 Thread Ulanov, Alexander
. Do you know the reason why this improvement is not pushed? CC’ing Dave From: Robin East [mailto:robin.e...@xense.co.uk] Sent: Monday, July 27, 2015 9:11 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Two joins in GraphX Pregel implementation Quite possibly - there is a JIRA open

Two joins in GraphX Pregel implementation

2015-07-27 Thread Ulanov, Alexander
Dear Spark developers, Below is the GraphX Pregel code snippet from https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api: (it does not contain the caching step): while (activeMessages > 0 && i < maxIterations) { // Receive the messages:
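For reference, a condensed sketch of the loop body in the 1.x Pregel.scala this thread is about; the two joins per iteration are innerJoin (receive messages) and outerJoinVertices (write updated values back into the graph). Variables (g, messages, vprog, etc.) come from the surrounding implementation.

    while (activeMessages > 0 && i < maxIterations) {
      // Join 1: receive the messages and run the vertex program
      val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
      // Join 2: merge the updated vertices back into the graph
      prevG = g
      g = g.outerJoinVertices(newVerts) { (vid, old, upd) => upd.getOrElse(old) }
      g.cache()
      val oldMessages = messages
      // restrict message sending to edges adjacent to updated vertices
      messages = g.mapReduceTriplets(
        sendMsg, mergeMsg, Some((newVerts, activeDirection))).cache()
      activeMessages = messages.count()
      oldMessages.unpersist(blocking = false)
      i += 1
    }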

RE: Two joins in GraphX Pregel implementation

2015-07-27 Thread Ulanov, Alexander
27, 2015 8:56 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Two joins in GraphX Pregel implementation What happens to this line of code: messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache() Part of the Pregel ‘contract’ is that vertices

RE: Model parallelism with RDD

2015-07-17 Thread Ulanov, Alexander
Hi Shivaram, Thank you for the explanation. Is there a direct way to check the length of the lineage i.e. that the computation is repeated? Best regards, Alexander From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] Sent: Friday, July 17, 2015 10:10 AM To: Ulanov, Alexander Cc
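One direct way to inspect lineage (an assumption that this answers the question above): toDebugString prints the full dependency chain, so its length grows with the lineage.

    println(rdd.toDebugString)
    val approxDepth = rdd.toDebugString.split("\n").length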

RE: Model parallelism with RDD

2015-07-16 Thread Ulanov, Alexander
spark.sql.unsafe.enabled=true removes the GC when persisting/unpersisting the DataFrame? Best regards, Alexander From: Ulanov, Alexander Sent: Monday, July 13, 2015 11:15 AM To: shiva...@eecs.berkeley.edu Cc: dev@spark.apache.org Subject: RE: Model parallelism with RDD Below are the average

RE: BlockMatrix multiplication

2015-07-16 Thread Ulanov, Alexander
then. Best, Burak On Wed, Jul 15, 2015 at 3:04 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi Burak, I’ve modified my code as you suggested, however it still leads to shuffling. Could you suggest what’s wrong with my code or provide an example code

RE: BlockMatrix multiplication

2015-07-15 Thread Ulanov, Alexander
to me, because it is a direct analogy from column or row-based data storage in matrices, which is used in BLAS. Best regards, Alexander From: Burak Yavuz [mailto:brk...@gmail.com] Sent: Tuesday, July 14, 2015 10:14 AM To: Ulanov, Alexander Cc: Rakesh Chalasani; dev@spark.apache.org

Re: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
a local reduce before aggregating across nodes. Rakesh On Mon, Jul 13, 2015 at 9:24 PM Ulanov, Alexander alexander.ula...@hp.com wrote: Dear Spark developers, I am trying to perform BlockMatrix multiplication in Spark. My test is as follows: 1) create a matrix of N

RE: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
am missing something or using it wrong. Best regards, Alexander From: Rakesh Chalasani [mailto:vnit.rak...@gmail.com] Sent: Tuesday, July 14, 2015 9:05 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: BlockMatrix multiplication Hi Alexander: Aw, I missed the 'cogroup

RE: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
From: Burak Yavuz [mailto:brk...@gmail.com] Sent: Tuesday, July 14, 2015 10:14 AM To: Ulanov, Alexander Cc: Rakesh Chalasani; dev@spark.apache.org Subject: Re: BlockMatrix multiplication Hi Alexander, From your example code, using the GridPartitioner, you will have 1 column, and 5 rows. When you

RE: Model parallelism with RDD

2015-07-13 Thread Ulanov, Alexander
} println("Avg iteration time: " + avgTime / numIterations) Best regards, Alexander From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] Sent: Friday, July 10, 2015 10:04 PM To: Ulanov, Alexander Cc: shiva...@eecs.berkeley.edu; dev@spark.apache.org Subject: Re: Model parallelism with RDD

BlockMatrix multiplication

2015-07-13 Thread Ulanov, Alexander
Dear Spark developers, I am trying to perform BlockMatrix multiplication in Spark. My test is as follows: 1) create a matrix of N blocks, so that each row of the block matrix contains only 1 block and each block resides in a separate partition on a separate node, 2) transpose the block matrix and
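A sketch of such a test setup, with sizes made up for illustration (Matrices.rand and BlockMatrix live in mllib.linalg):

    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    val n = 5
    val bs = 1000
    // one block per block-matrix row, each in its own partition
    val blocks = sc.parallelize(0 until n, n).map { i =>
      ((i, 0), Matrices.rand(bs, bs, new java.util.Random(i)))
    }
    val A = new BlockMatrix(blocks, bs, bs)
    val product = A.multiply(A.transpose) // moves blocks between nodes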

Model parallelism with RDD

2015-07-10 Thread Ulanov, Alexander
Hi, I am interested in how scalable model parallelism can be within Spark. Suppose the model contains N weights of type Double and N is so large that it does not fit into the memory of a single node. So, we can store the model in an RDD[Double] across several nodes. To train the model, one needs
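A sketch of the iterate-persist-unpersist pattern such a setup typically needs, with a placeholder update step; without unpersisting the previous version, both lineage and cached copies accumulate:

    var model = sc.parallelize(0 until 1000000).map(_ => 0.0)
    model.persist().count()
    for (i <- 1 to 10) {
      val old = model
      model = old.map(w => w - 0.01 * 1.0).persist() // placeholder gradient step
      model.count()                                  // force materialization
      old.unpersist(blocking = false)
    }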

RE: Force inner join to shuffle the smallest table

2015-06-25 Thread Ulanov, Alexander
], MapPartitionsRDD[68] at explain at <console>:25 Could Spark SQL developers suggest why it happens? Best regards, Alexander From: Stephen Carman [mailto:scar...@coldlight.com] Sent: Wednesday, June 24, 2015 12:33 PM To: Ulanov, Alexander Cc: CC GP; dev@spark.apache.org Subject: Re: Force inner

Force Spark save parquet files with replication factor other than 3 (default one)

2015-06-22 Thread Ulanov, Alexander
Hi, My Hadoop is configured to have replication ratio = 2. I've added $HADOOP_HOME/config to the PATH as suggested in http://apache-spark-user-list.1001560.n3.nabble.com/hdfs-replication-on-saving-RDD-td289.html. Spark (1.4) does rdd.saveAsTextFile with replication=2. However
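A sketch of a commonly suggested workaround: set the replication factor on the Hadoop configuration Spark writes with (paths hypothetical); the open question in this thread is why the parquet path does not honor it the way saveAsTextFile does.

    sc.hadoopConfiguration.set("dfs.replication", "2")
    rdd.saveAsTextFile("hdfs:///tmp/out-text")   // written with replication 2
    df.write.parquet("hdfs:///tmp/out-parquet")  // reportedly still replication 3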

Increase partition count (repartition) without shuffle

2015-06-18 Thread Ulanov, Alexander
Hi, Is there a way to increase the number of partitions of an RDD without causing a shuffle? I've found JIRA issue https://issues.apache.org/jira/browse/SPARK-5997, however there is no implementation yet. Just in case, I am reading data from ~300 big binary files, which results in 300 partitions,

RE: testing HTML email

2015-05-14 Thread Ulanov, Alexander
Testing too. Recently I got a few undelivered mails to the dev list. From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, May 14, 2015 3:39 PM To: dev@spark.apache.org Subject: testing HTML email Testing html emails ... Hello This is bold This is a link http://databricks.com/

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
) Best regards, Alexander -Original Message- From: Ulanov, Alexander Sent: Monday, May 11, 2015 11:59 AM To: Olivier Girardot; Michael Armbrust Cc: Reynold Xin; dev@spark.apache.org Subject: RE: DataFrame distinct vs RDD distinct Hi, Could you suggest alternative way of implementing

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
Hi, Could you suggest alternative way of implementing distinct, e.g. via fold or aggregate? Both SQL distinct and RDD distinct fail on my dataset due to overflow of Spark shuffle disk. I have 7 nodes with 300GB dedicated to Spark shuffle each. My dataset is 2B rows, the field which I'm
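One aggregate-style sketch, assuming duplicates are frequent enough that per-partition deduplication shrinks the shuffle (the per-partition Set must fit in executor memory; the partition count is made up):

    val distinctValues = rdd
      .mapPartitions(_.toSet.iterator)  // local dedup before the shuffle
      .map(x => (x, null))
      .reduceByKey((a, _) => a, 2000)
      .keys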

RE: Easy way to convert Row back to case class

2015-05-11 Thread Ulanov, Alexander
Thank you for the suggestions! From: Reynold Xin [mailto:r...@databricks.com] Sent: Friday, May 08, 2015 11:10 AM To: Will Benton Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Easy way to convert Row back to case class In 1.4, you can do row.getInt(colName) In 1.5, some variant
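One such pattern, sketched with a hypothetical case class: the Row is matched positionally, so column order and types must follow the schema.

    import org.apache.spark.sql.Row

    case class Person(id: Int, name: String)
    val people = df.map { case Row(id: Int, name: String) => Person(id, name) }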

RE: Speeding up Spark build during development

2015-05-01 Thread Ulanov, Alexander
Hi Pramod, For cluster-like tests you might want to use the same code as in mllib's LocalClusterSparkContext. You can rebuild only the package that you change and then run this main class. Best regards, Alexander -Original Message- From: Pramod Biligiri

Re: Should we let everyone set Assignee?

2015-04-24 Thread Ulanov, Alexander
there was disagreement about how to proceed. So I don't think a 'lock' is necessary in practice and don't think even signaling has been a problem. On Apr 23, 2015 6:14 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: My thinking is that the current way of assigning a contributor after

RE: Should we let everyone set Assignee?

2015-04-23 Thread Ulanov, Alexander
My thinking is that the current way of assigning a contributor after the patch is done (or almost done) is OK. Parallel efforts are also OK until they are discussed in the issue's thread. Ilya Ganelin made a good point that it is about moving the project forward. It also adds means of competition

RE: Regularization in MLlib

2015-04-07 Thread Ulanov, Alexander
: Tuesday, April 07, 2015 3:28 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Regularization in MLlib 1) Norm(weights, N) will return (w_1^N + w_2^N + ...)^(1/N), so norm * norm is required. 2) This is a bug, as you said. I intend to fix this using weighted regularization
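A one-line illustration of the point in the reply, using Breeze (regParam is made up): norm(w, 2) returns sqrt(w_1^2 + ... + w_n^2), so the squared-L2 regularizer needs norm * norm.

    import breeze.linalg.{norm, DenseVector}

    val w = DenseVector(1.0, 2.0, 3.0)
    val regParam = 0.1
    val n2 = norm(w, 2.0)
    val l2Reg = 0.5 * regParam * n2 * n2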

RE: Running LocalClusterSparkContext

2015-04-03 Thread Ulanov, Alexander
you suggest? (it seems that the new version of Spark was not tested on Windows. Previous versions worked more or less fine for me) -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Friday, April 03, 2015 1:04 PM To: Ulanov, Alexander Cc: dev@spark.apache.org

RE: Stochastic gradient descent performance

2015-04-02 Thread Ulanov, Alexander
: Thursday, April 02, 2015 1:26 PM To: Joseph Bradley Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Stochastic gradient descent performance I haven't looked closely at the sampling issues, but regarding the aggregation latency, there are fixed overheads (in local and distributed mode

RE: Stochastic gradient descent performance

2015-04-02 Thread Ulanov, Alexander
on this? I do understand that in cluster mode the network speed will kick in and then one can blame it. Best regards, Alexander From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, April 02, 2015 10:51 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Stochastic

RE: Stochastic gradient descent performance

2015-04-01 Thread Ulanov, Alexander
Sorry for bothering you again, but I think that it is an important issue for applicability of SGD in Spark MLlib. Could Spark developers please comment on it. -Original Message- From: Ulanov, Alexander Sent: Monday, March 30, 2015 5:00 PM To: dev@spark.apache.org Subject: Stochastic

RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
Thanks, sounds interesting! How do you load files to Spark? Did you consider having multiple files instead of file lines? From: Hector Yee [mailto:hector@gmail.com] Sent: Wednesday, April 01, 2015 11:36 AM To: Ulanov, Alexander Cc: Evan R. Sparks; Stephen Boesch; dev@spark.apache.org Subject

RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
...@gmail.com] Sent: Wednesday, April 01, 2015 1:37 PM To: Hector Yee Cc: Ulanov, Alexander; Evan R. Sparks; Stephen Boesch; dev@spark.apache.org Subject: Re: Storing large data for MLlib machine learning @Alexander, re: using flat binary and metadata, you raise excellent points! At least in our case, we

Stochastic gradient descent performance

2015-03-30 Thread Ulanov, Alexander
Hi, It seems to me that there is an overhead in the runMiniBatchSGD function of MLlib's GradientDescent. In particular, sample and treeAggregate might take time that is an order of magnitude greater than the actual gradient computation. In particular, for the MNIST dataset of 60K instances, minibatch

RE: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Ulanov, Alexander
@spark.apache.org; Ulanov, Alexander; jfcanny Subject: Re: Using CUDA within Spark / boosting linear algebra Hi Alex, Since it is non-trivial to make nvblas work with netlib-java, it would be great if you can send the instructions to netlib-java as part of the README. Hopefully we don't need to modify

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
, Alexander -Original Message- From: Ulanov, Alexander Sent: Tuesday, March 24, 2015 6:57 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra Hi, I am trying to use nvblas with netlib

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
and took you a while to figure out! Would you mind posting a gist or something of maybe the shell scripts/exports you used to make this work - I can imagine it being highly useful for others in the future. Thanks! Evan On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander alexander.ula...@hp.com

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
: Wednesday, March 25, 2015 2:55 PM To: Ulanov, Alexander Cc: Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny Subject: Re: Using CUDA within Spark / boosting linear algebra Alexander, does using netlib imply that one cannot switch between CPU and GPU blas

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing -Original Message- From: Ulanov, Alexander Sent: Wednesday, March 25, 2015 2:31 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny Subject: RE: Using CUDA within Spark / boosting linear algebra Hi again, I

Re: Which linear algebra interface to use within Spark MLlib?

2015-03-19 Thread Ulanov, Alexander
trait that will provide a sink operation (basically memory will be allocated by user)...adding more BLAS operators in breeze will also help in general as lot more operations are defined over there... On Wed, Mar 18, 2015 at 8:09 PM, Ulanov, Alexander alexander.ula...@hp.com

Which linear algebra interface to use within Spark MLlib?

2015-03-18 Thread Ulanov, Alexander
Hi, Currently I am using Breeze within Spark MLlib for linear algebra. I would like to reuse previously allocated matrices for storing the result of matrix multiplication, i.e. I need to use the gemm function C := q*A*B + p*C, which is missing in Breeze (Breeze automatically allocates a new matrix
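For reference, a sketch of the in-place C := q*A*B + p*C call through netlib-java (the BLAS wrapper MLlib itself uses), with tiny column-major matrices:

    import com.github.fommil.netlib.BLAS.{getInstance => blas}

    val (m, n, k) = (2, 2, 2)
    val A = Array(1.0, 2.0, 3.0, 4.0) // m x k, column-major
    val B = Array(1.0, 0.0, 0.0, 1.0) // k x n
    val C = Array(0.0, 0.0, 0.0, 0.0) // m x n, overwritten in place
    blas.dgemm("N", "N", m, n, k, /* q */ 1.0, A, m, B, k, /* p */ 0.0, C, m)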

Profiling Spark: MemoryStore

2015-03-12 Thread Ulanov, Alexander
Hi, I am working on artificial neural networks for Spark. It is solved with gradient descent, so at each step the data is read, the sum of gradients is calculated for each data partition (on each worker), aggregated (on the driver) and broadcast back. I noticed that the gradient computation time is
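A sketch of the per-iteration pattern the message describes, with a placeholder gradient and data assumed to be an RDD[Array[Double]]: broadcast the weights, sum per-partition gradients with treeAggregate, update on the driver.

    def localGradient(p: Array[Double], w: Array[Double]): Array[Double] =
      w.map(_ => 0.0) // placeholder for the real per-example gradient

    var weights = Array.fill(1000)(0.1)
    for (iter <- 1 to 10) {
      val bcW = sc.broadcast(weights)
      val gradSum = data.treeAggregate(Array.fill(1000)(0.0))(
        (acc, p) => acc.zip(localGradient(p, bcW.value)).map { case (a, g) => a + g },
        (a, b)   => a.zip(b).map { case (x, y) => x + y })
      weights = weights.zip(gradSum).map { case (w, g) => w - 0.01 * g }
      bcW.unpersist()
    }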

RE: Using CUDA within Spark / boosting linear algebra

2015-03-10 Thread Ulanov, Alexander
this and will appreciate any help from you ☺ From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Monday, March 09, 2015 6:01 PM To: Ulanov, Alexander Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra Thanks so much

RE: Using CUDA within Spark / boosting linear algebra

2015-03-09 Thread Ulanov, Alexander
for enabling high performance binaries on OSX and Linux? Or better, figure out a way for the system to fetch these automatically. - Evan On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Just to summarize this thread, I was finally able to make all

Loading previously serialized object to Spark

2015-03-06 Thread Ulanov, Alexander
Hi, I've implemented class MyClass in MLlib that does some operation on LabeledPoint. MyClass extends Serializable, so I can map this operation on data of RDD[LabeledPoint], such as data.map(lp => MyClass.operate(lp)). I write this class to a file with ObjectOutputStream.writeObject. Then I stop

RE: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Ulanov, Alexander
. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing -Original Message- From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Monday, March 02, 2015 1:24 PM To: Ulanov, Alexander Subject: Re: Using CUDA within Spark / boosting linear

RE: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Ulanov, Alexander
[mailto:men...@gmail.com] Sent: Monday, March 02, 2015 11:42 AM To: Sam Halliday Cc: Joseph Bradley; Ulanov, Alexander; dev; Evan R. Sparks Subject: Re: Using CUDA within Spark / boosting linear algebra On Fri, Feb 27, 2015 at 12:33 PM, Sam Halliday sam.halli...@gmail.com wrote: Also, check

RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Ulanov, Alexander
Typo - CPU was 2.5x cheaper (not GPU!) -Original Message- From: Ulanov, Alexander Sent: Thursday, February 26, 2015 2:01 PM To: Sam Halliday; Xiangrui Meng Cc: dev@spark.apache.org; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra Evan, thank

RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Ulanov, Alexander
: Thursday, February 26, 2015 1:56 PM To: Xiangrui Meng Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks Subject: Re: Using CUDA within Spark / boosting linear algebra Btw, I wish people would stop cheating when comparing CPU and GPU timings for things like matrix

RE: Using CUDA within Spark / boosting linear algebra

2015-02-12 Thread Ulanov, Alexander
results. https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing One thing still needs exploration: does BIDMat-cublas perform copying to/from machine’s RAM? -Original Message- From: Ulanov, Alexander Sent: Tuesday, February 10, 2015 2:12 PM

RE: Using CUDA within Spark / boosting linear algebra

2015-02-10 Thread Ulanov, Alexander
: Monday, February 09, 2015 6:06 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Great - perhaps we can move this discussion off-list and onto a JIRA ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705

RE: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Ulanov, Alexander
To: Ulanov, Alexander Cc: Joseph Bradley; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra I would build OpenBLAS yourself, since good BLAS performance comes from getting cache sizes, etc. set up correctly for your particular hardware - this is often a very tricky

RE: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Ulanov, Alexander
a specific blas (not specific wrapper for blas). Btw. I have installed openblas (yum install openblas), so I suppose that netlib is using it. From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Friday, February 06, 2015 5:19 PM To: Ulanov, Alexander Cc: Joseph Bradley; dev

RE: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Ulanov, Alexander
05, 2015 5:29 PM To: Ulanov, Alexander Cc: Evan R. Sparks; dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra Hi Alexander, Using GPUs with Spark would be very exciting.  Small comment: Concerning your question earlier about keeping data stored on the GPU rather

RE: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Ulanov, Alexander
To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many cases. You might consider taking a look at the codepaths that BIDMat (https://github.com/BIDData/BIDMat

RE: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Ulanov, Alexander
:29 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Using CUDA within Spark / boosting linear algebra I'd be surprised if BIDMat+OpenBLAS was significantly faster than netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout and fewer levels of indirection
