Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
Hi, I am currently using Spark 1.0 locally on Windows 7. I would like to use classes from an external jar in the spark-shell. I followed the instructions in: http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCALrNVjWWF6k=c_jrhoe9w_qaacjld4+kbduhfv0pitr8h1f...@mail.gmail.com%3E I

RE: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
On Wed, Jun 11, 2014 at 10:25 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I am currently using Spark 1.0 locally on Windows 7. I would like to use classes from an external jar in the spark-shell. I followed the instructions in: http://mail-archives.apache.org/mod_mbox/spark

RE: Adding external jar to spark-shell classpath in spark 1.0

2014-06-11 Thread Ulanov, Alexander
https://issues.apache.org/jira/browse/SPARK-1919. We haven't found a fix yet, but for now, you can work around this by including your simple class in your application jar. 2014-06-11 10:25 GMT-07:00 Ulanov, Alexander alexander.ula...@hp.com: Hi, I am currently using Spark

Multiclass classification evaluation measures

2014-06-23 Thread Ulanov, Alexander
Hi, I've implemented a class with measures for evaluation of multiclass classification (as well as unit tests). They are per-class and averaged Precision, Recall and F1-measure. As far as I know, in Spark there is only a binary classification evaluator, given that Spark's Bayesian classifier
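
A minimal sketch of such per-class measures, computed from an RDD of (prediction, label) pairs (the function name and structure are illustrative, not the actual patch):

  import org.apache.spark.rdd.RDD

  // Per-class Precision, Recall and F1 from (prediction, label) pairs.
  def perClassMeasures(pl: RDD[(Double, Double)]): Map[Double, (Double, Double, Double)] = {
    val classes = pl.map(_._2).distinct().collect()
    classes.map { c =>
      val tp = pl.filter { case (p, l) => p == c && l == c }.count().toDouble
      val fp = pl.filter { case (p, l) => p == c && l != c }.count().toDouble
      val fn = pl.filter { case (p, l) => p != c && l == c }.count().toDouble
      val precision = if (tp + fp == 0) 0.0 else tp / (tp + fp)
      val recall = if (tp + fn == 0) 0.0 else tp / (tp + fn)
      val f1 = if (precision + recall == 0) 0.0 else 2 * precision * recall / (precision + recall)
      c -> (precision, recall, f1)
    }.toMap
  }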

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread Ulanov, Alexander
Hi, You need to convert your text to a vector space model: http://en.wikipedia.org/wiki/Vector_space_model and then pass it to SVM. As far as I know, in previous versions of MLlib there was a special class for doing this:
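
A minimal sketch of that pipeline using MLlib's HashingTF, assuming a spark-shell session (the file, labeling rule and feature dimension are illustrative):

  import org.apache.spark.mllib.feature.HashingTF
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.classification.SVMWithSGD

  val tf = new HashingTF(10000)  // hashed term-frequency vectors of fixed size
  val training = sc.textFile("docs.txt").map { line =>
    val label = if (line.startsWith("+")) 1.0 else 0.0  // illustrative labeling
    LabeledPoint(label, tf.transform(line.split(" ").toSeq))
  }.cache()
  val model = SVMWithSGD.train(training, 100)  // 100 iterations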

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread Ulanov, Alexander
Hi Imk, There are a number of libraries and scripts to convert text to libsvm format; just type "libsvm format converter" into a search engine. Unfortunately I cannot recommend a specific one, except the one that is built into Weka. I use it for test purposes, and for big experiments it is

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-25 Thread Ulanov, Alexander
Hi Imk, I am not aware of any classifier in MLlib that accepts nominal data. They accept RDDs of LabeledPoints, which are a label plus a vector of Doubles. So, you'll need to convert nominal values to Double. Best regards, Alexander -Original Message- From: lmk
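
A minimal sketch of such a conversion, assuming a small set of nominal values (the case class and field names are illustrative):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  case class Record(label: Double, color: String)
  val data = sc.parallelize(Seq(Record(0.0, "red"), Record(1.0, "blue")))

  // Index each distinct nominal value with a Double.
  val index = data.map(_.color).distinct().collect()
    .zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap
  val points = data.map(r => LabeledPoint(r.label, Vectors.dense(index(r.color))))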

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-26 Thread Ulanov, Alexander
if it will be easy to standardize a libsvm converter on data that can be on hdfs, hbase, cassandra or solr, but of course libsvm, netflix format and csv are standard for algorithms, and mllib supports all 3... On Jun 25, 2014 6:00 AM, Ulanov, Alexander alexander.ula...@hp.com wrote

Logging from the Spark shell

2014-11-05 Thread Ulanov, Alexander
Dear Spark users, I would like to run a long experiment using spark-shell. How can I log my intermediate results (numbers, strings) into a file on the master node? What are the best practices? It is NOT Spark's performance metrics that I want to log every X seconds. Instead, I would like to
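
One simple approach, sketched below: the spark-shell driver runs on the node where the shell was started, so plain Java file I/O in the shell writes to that node (the path and values are illustrative):

  import java.io.{FileWriter, PrintWriter}

  val log = new PrintWriter(new FileWriter("/tmp/experiment.log", true))  // append mode
  log.println("step=1 result=0.93")  // an intermediate result
  log.flush()
  // ... run more steps, then:
  log.close()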

RE: Scalability of group by

2015-04-28 Thread Ulanov, Alexander
Richard, The same problem occurs with sort. I have enough disk space and tmp folder space. The errors in the logs say out of memory. I wonder what it holds in memory? Alexander From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Tuesday, April 28, 2015 7:34 AM To: Ulanov, Alexander Cc: user

Sort (order by) of the big dataset

2015-04-29 Thread Ulanov, Alexander
Hi, I have a 2-billion-record dataset with the schema eventId: String, time: Double, value: Double. It is stored in Parquet format in HDFS, size 23GB. Specs: Spark 1.3, Hadoop 1.2.1, 8 nodes with Xeon CPUs and 16GB RAM, 1TB disk space; each node has 3 workers with 3GB memory. I keep failing to sort the

RE: Group by order by

2015-04-27 Thread Ulanov, Alexander
From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Monday, April 27, 2015 12:47 PM To: Ulanov, Alexander Cc: user@spark.apache.org Subject: Re: Group by order by It's not related to Spark but to the concept of what you are trying to do with the data. Grouping by ID means consolidating

RE: Scalability of group by

2015-04-27 Thread Ulanov, Alexander
It works on a smaller dataset of 100 rows. I could probably find the size at which it fails using binary search. However, that would not help me because I need to work with 2B rows. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Monday, April 27, 2015 6:58 PM To: Ulanov, Alexander Cc: user

RE: Sort (order by) of the big dataset

2015-04-29 Thread Ulanov, Alexander
executor was the first to be lost. The others followed like a house of cards. What's the problem? The number of reducers: for the first task it is equal to the number of partitions, i.e. 2000, but for the second it switched to 200. From: Ulanov, Alexander Sent: Wednesday, April 29, 2015 1:08 PM To: user

Group by order by

2015-04-27 Thread Ulanov, Alexander
Hi, Could you suggest the best way to do group by x order by y in Spark? When I try to perform it with Spark SQL I get the following error (Spark 1.3): val results = sqlContext.sql("select * from sample group by id order by time") org.apache.spark.sql.AnalysisException: expression 'time'

RE: Group by order by

2015-04-27 Thread Ulanov, Alexander
Hi Richard, There are several values of time per id. Is there a way to perform group by id and sort by time in Spark? Best regards, Alexander From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Monday, April 27, 2015 12:20 PM To: Ulanov, Alexander Cc: user@spark.apache.org Subject
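
One way to sketch this with plain RDDs: group by id, then sort each group's times in memory, which assumes every group fits on a single executor (the sample data is illustrative):

  // (id, time) pairs; in practice these would come from the table.
  val pairs = sc.parallelize(Seq(("a", 2.0), ("a", 1.0), ("b", 3.0)))
  val grouped = pairs.groupByKey().mapValues(_.toSeq.sorted)  // times sorted per id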

RE: Multilabel Classification in spark

2015-05-05 Thread Ulanov, Alexander
If you are interested in multilabel (not multiclass) classification, you might want to take a look at SPARK-7015 https://github.com/apache/spark/pull/5830/files. It is supposed to perform a one-versus-all transformation on classes, which is usually how multilabel classifiers are built. Alexander From: Joseph
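
A minimal sketch of the one-versus-all API added by that PR, assuming a DataFrame named training with label and features columns:

  import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

  val ovr = new OneVsRest().setClassifier(new LogisticRegression())
  val ovrModel = ovr.fit(training)  // trains one binary classifier per class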

RE: DataFrame DSL documentation

2015-05-06 Thread Ulanov, Alexander
+1 I had to browse spark-catalyst sources to find what is supported: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala Alexander From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent: Wednesday, May 06, 2015 11:42 AM To: spark

RE: Reading large files

2015-05-06 Thread Ulanov, Alexander
Sent: Wednesday, May 06, 2015 2:23 PM To: Ulanov, Alexander Cc: user@spark.apache.org Subject: Re: Reading large files Thanks. In both cases, does the driver need to have enough memory to contain the entire file? How do both these functions work when, for example, the binary file is 4G and available

How to specify Worker and Master LOG folders?

2015-05-06 Thread Ulanov, Alexander
Hi, How can I specify the Worker and Master LOG folders? If I set SPARK_WORKER_DIR in spark-env, it only affects the Executor logs and the shuffling folder. But the Worker and Master logs still go to the default location: starting org.apache.spark.deploy.master.Master, logging to

RE: Reading large files

2015-05-06 Thread Ulanov, Alexander
SparkContext has two methods for reading binary files: binaryFiles (reads multiple binary files into an RDD) and binaryRecords (reads fixed-size records of a single binary file into an RDD). For example, I have a big binary file split into logical parts, so I can use “binaryFiles”. The possible problem
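
A minimal sketch of both methods (the paths and record length are illustrative):

  // Each file becomes one record: (path, stream); contents are read lazily.
  val files = sc.binaryFiles("hdfs:///data/parts")

  // Fixed-size 512-byte records of one big file, each as Array[Byte].
  val records = sc.binaryRecords("hdfs:///data/big.bin", 512)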

RE: Sort (order by) of the big dataset

2015-05-07 Thread Ulanov, Alexander
The answer for Spark SQL “order by” is setting spark.sql.shuffle.partitions to a bigger number. For RDD.sortBy it works out of the box if the RDD has enough partitions. From: Night Wolf [mailto:nightwolf...@gmail.com] Sent: Thursday, May 07, 2015 5:26 AM To: Ulanov, Alexander Cc: user
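
A minimal sketch of both settings, assuming an existing RDD named rdd and a registered table named sample (the partition count 2000 is illustrative):

  // Spark SQL: more reducers for the sort's shuffle stage.
  sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
  val sorted = sqlContext.sql("select * from sample order by time")

  // Plain RDDs: sortBy takes the target number of partitions directly.
  val sortedRdd = rdd.sortBy(x => x, ascending = true, numPartitions = 2000)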

RE: Sort (order by) of the big dataset

2015-05-07 Thread Ulanov, Alexander
avulanov.blogspot.com, though it does not have more on this particular issue than I already posted. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Thursday, May 07, 2015 6:25 AM To: Ulanov, Alexander Cc: user@spark.apache.org Subject: Re: Sort (order by) of the big dataset Where can I find

RE: Any way to get raw score from MultilayerPerceptronClassificationModel ?

2015-11-17 Thread Ulanov, Alexander
Hi Robert, Raw scores are not available through the public API. It would be great to add this feature; it seems that we overlooked it. The simple way to access the raw predictions currently would be to create a wrapper for mlpModel. This wrapper should be defined in the [ml] package. One needs to
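
A hedged sketch of that wrapper, based on the Spark 1.5/1.6 internals (names may differ by version): rebuild the network from the model's public layers and weights and call its predict, which returns the raw output vector. The code must live under the org.apache.spark.ml package because the ann classes are private[ml].

  package org.apache.spark.ml.util

  import org.apache.spark.ml.ann.FeedForwardTopology
  import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel
  import org.apache.spark.mllib.linalg.Vector

  object MlpRawScores {
    // Raw (softmax) outputs instead of the argmax label.
    def raw(model: MultilayerPerceptronClassificationModel, features: Vector): Vector =
      FeedForwardTopology
        .multiLayerPerceptron(model.layers, true)  // true: softmax on top
        .getInstance(model.weights)
        .predict(features)
  }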

RE: Spark ANN

2015-09-08 Thread Ulanov, Alexander
That is an option too. Implementing convolutions with FFTs should be considered as well http://arxiv.org/pdf/1312.5851.pdf. From: Feynman Liang [mailto:fli...@databricks.com] Sent: Tuesday, September 08, 2015 12:07 PM To: Ulanov, Alexander Cc: Ruslan Dautkhanov; Nick Pentreath; user; na

RE: Spark ANN

2015-09-09 Thread Ulanov, Alexander
considering matrix-matrix multiplication for convolution optimization at least as a first version. It can also take advantage of data batches. From: Feynman Liang [mailto:fli...@databricks.com] Sent: Wednesday, September 09, 2015 12:56 AM To: Ulanov, Alexander Cc: Ruslan Dautkhanov; Nick Pentreath

RE: How to save Multilayer Perceptron Classifier model.

2015-12-14 Thread Ulanov, Alexander
Hi Vadim, As Yanbo pointed out, that feature is not yet merged into the main branch. However, there is a hacky workaround:

  // save model
  sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
  // load model
  val sameModel = sc.objectFile[YourCLASS]("path").first()

Best regards, Alexander From:

RE: SparkML algos limitations question.

2016-01-04 Thread Ulanov, Alexander
Hi Yanbo, As long as two models fit into the memory of a single machine, there should be no problems, so even 16GB machines can handle large models. (The master should have more memory because it runs LBFGS.) In my experiments, I’ve trained models with 12M and 32M parameters without issues. Best

RE: Spark LBFGS Error with ANN

2016-02-16 Thread Ulanov, Alexander
Hi Hayri, The MLP classifier is multi-class (one class per instance) but not multi-label (multiple classes per instance). The top layer of the network is softmax (http://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier), which requires that the outputs sum

RE: Learning Fails with 4 Number of Layes at ANN Training with SGDOptimizer

2016-02-16 Thread Ulanov, Alexander
Hi Hayri, The default MLP optimizer is LBFGS. SGD is available only through the private interface, and its use is discouraged for multiple reasons. With regard to SGD in general, the parameters are very specific to the dataset and network configuration; one needs to find them empirically. The

RE: best way to do deep learning on spark ?

2016-03-18 Thread Ulanov, Alexander
Hi Charles, There is an implementation of multilayer perceptron in Spark (since 1.5): https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier Other features such as autoencoder, convolutional layers, etc. are currently under development. Please

RE: Non-classification neural networks

2016-03-28 Thread Ulanov, Alexander
Hi Jim, It is possible to use raw artificial neural networks by means of FeedForwardTrainer. It is [ml] package private, so your code should be in that package too. Basically, you need to do the same as is done in MultilayerPerceptronClassifier but without encoding the output as one-hot:
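
A hedged sketch of that approach, again based on the Spark 1.5/1.6 internals (names may differ by version); with the softmax top layer disabled, targets are arbitrary vectors rather than one-hot codes:

  package org.apache.spark.ml.util

  import org.apache.spark.ml.ann.{FeedForwardTopology, FeedForwardTrainer}
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  object RawAnn {
    // data: (input, target) vector pairs; layers, e.g. Array(10, 5, 1).
    def train(data: RDD[(Vector, Vector)], layers: Array[Int]) = {
      val topology = FeedForwardTopology.multiLayerPerceptron(layers, false)  // no softmax
      val trainer = new FeedForwardTrainer(topology, layers.head, layers.last)
      trainer.LBFGSOptimizer.setNumIterations(100)
      trainer.train(data)  // returns a trained TopologyModel
    }
  }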

RE: Spark 2.0 error: Wrong FS: file://spark-warehouse, expected: file:///

2016-08-03 Thread Ulanov, Alexander
Hi Sean, I updated the issue, could you check the changes? Best regards, Alexander -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, August 03, 2016 2:49 AM To: Utkarsh Sengar Cc: User Subject: Re: Spark 2.0

RE: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-05 Thread Ulanov, Alexander
Hi Mikhail, I have followed the MLP user guide and used the dataset and network configuration you mentioned. MLP was trained without any issues with the default parameters, that is, a block size of 128 and 100 iterations. Source code: scala> import
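
For reference, a sketch of that configuration following the MLP user guide (the layer sizes match the guide's example dataset; train is assumed to be a DataFrame with label and features columns):

  import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

  val trainer = new MultilayerPerceptronClassifier()
    .setLayers(Array(4, 5, 4, 3))  // input, two hidden, output
    .setBlockSize(128)
    .setMaxIter(100)
    .setSeed(1234L)
  val model = trainer.fit(train)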

scalable-deeplearning 1.0.0 released

2016-09-09 Thread Ulanov, Alexander
Dear Spark users and developers, I have released version 1.0.0 of the scalable-deeplearning package. This package is based on the implementation of artificial neural networks in Spark ML. It is intended for new Spark deep learning features that have not yet been merged into Spark ML or that are too

accessing spark packages through proxy

2016-09-09 Thread Ulanov, Alexander
Dear Spark users, I am trying to use Spark packages, however I get the ivy error listed below. I checked JIRA and Stack Overflow, and it might be a proxy error. However, none of the proposed solutions worked for me. Could you suggest how to solve this issue?

Belief propagation algorithm is open sourced

2016-12-13 Thread Ulanov, Alexander
Dear Spark developers and users, HPE has open sourced the implementation of the belief propagation (BP) algorithm for Apache Spark, a popular message passing algorithm for performing inference in probabilistic graphical models. It provides exact inference for graphical models without loops.

Re: Belief propagation algorithm is open sourced

2016-12-15 Thread Ulanov, Alexander
We were using both LibDAI and our own implementation of BP for GraphLab and as a reference. Best regards, Manish Marwah & Alexander From: Bertrand Dechoux <decho...@gmail.com> Sent: Thursday, December 15, 2016 1:03:49 AM To: Bryan Cutler Cc: Ulanov, Al