Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Burak Yavuz
Could you please open a JIRA for it? The maxBins input is missing from the Python API. Would it be possible for you to use the current master? In the current master, you should be able to use trees with the Pipeline API and DataFrames. Best, Burak On Wed, May 20, 2015 at 2:44 PM, Don Drake wrote: > I

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Burak Yavuz
I think this Spark Package may be what you're looking for! http://spark-packages.org/package/tresata/spark-sorted Best, Burak On Mon, May 4, 2015 at 12:56 PM, Imran Rashid wrote: > oh wow, that is a really interesting observation, Marco & Jerry. > I wonder if this is worth exposing in combineBy

Re: DataFrame filter referencing error

2015-04-30 Thread Burak Yavuz
Is "new" a reserved word for MySQL? On Thu, Apr 30, 2015 at 2:41 PM, Francesco Bigarella < francesco.bigare...@gmail.com> wrote: > Do you know how I can check that? I googled a bit but couldn't find a > clear explanation about it. I also tried to use explain() but it doesn't > really help. > I st

Fwd: Change ivy cache for spark on Windows

2015-04-27 Thread Burak Yavuz
+user -- Forwarded message -- From: Burak Yavuz Date: Mon, Apr 27, 2015 at 1:59 PM Subject: Re: Change ivy cache for spark on Windows To: mj Hi, In your conf file (SPARK_HOME\conf\spark-defaults.conf) you can set: `spark.jars.ivy \your\path` Best, Burak On Mon, Apr 27

Re: Understanding Spark/MLlib failures

2015-04-23 Thread Burak Yavuz
Hi Andrew, I observed similar behavior under high GC pressure when running ALS. What happened to me was that there would be very long Full GC pauses (over 600 seconds at times). These would prevent the executors from sending heartbeats to the driver. Then the driver would think that the executor

Re: Benchmaking col vs row similarities

2015-04-10 Thread Burak Yavuz
Depends... The heartbeat error you received happens due to GC pressure (probably due to Full GC). If you increase the memory too much, GCs may be less frequent, but the Full GCs may take longer. Try increasing the following confs: spark.executor.heartbeatInterval spark.core.connection.ack.wait.tim
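A sketch of raising those timeouts programmatically; the property names are the ones listed above, while the values are placeholders and the accepted units depend on your Spark version:
```
import org.apache.spark.SparkConf

// Placeholder values; tune to your workload and check the docs for your Spark version.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "60000")      // milliseconds in Spark 1.x
  .set("spark.core.connection.ack.wait.timeout", "600")  // seconds
```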

Re: Query REST web service with Spark?

2015-03-31 Thread Burak Yavuz
Hi, If I recall correctly, I've read about people integrating REST calls into Spark Streaming jobs on the user list. I can't think of any reason why it shouldn't be possible. Best, Burak On Tue, Mar 31, 2015 at 1:46 PM, Minnow Noir wrote: > We have some data on Hadoop that needs to be augmented with

Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread Burak Yavuz
Hi David, Can you also try with Spark 1.3 if possible? I believe there was a 2x improvement on K-Means between 1.2 and 1.3. Thanks, Burak On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 wrote: > Hi Jao, > > Sorry to pop up this old thread. I have the same problem as you did. I > want to kn

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Burak Yavuz
Did you build Spark with: -Pnetlib-lgpl? Ref: https://spark.apache.org/docs/latest/mllib-guide.html Burak On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu wrote: > How about pointing LD_LIBRARY_PATH to native lib folder ? > > You need Spark 1.2.0 or higher for the above to work. See SPARK-1719 > > Chee

Re: RDD ordering after map

2015-03-18 Thread Burak Yavuz
Hi, Yes, ordering is preserved with map. Shuffles break ordering. Burak On Wed, Mar 18, 2015 at 2:02 PM, sergunok wrote: > Does map(...) preserve ordering of original RDD? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-ordering-after-map-tp2

Re: Getting incorrect weights for LinearRegression

2015-03-13 Thread Burak Yavuz
Hi, I would suggest you use LBFGS, as I think the step size is hurting you. You can run the same thing in LBFGS as: ``` val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater()) val initialWeights = Vectors.dense(Array.fill(3)( scala.util.Random.nextDouble())) val weights = algo
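A more complete sketch of the L-BFGS alternative described above, assuming `training` is an `RDD[LabeledPoint]` with three features (the feature count and random initial weights are placeholders):
```
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}
import org.apache.spark.rdd.RDD

// LBFGS.optimize expects (label, features) pairs rather than LabeledPoints
val data: RDD[(Double, Vector)] = training.map(lp => (lp.label, lp.features))

val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
val initialWeights = Vectors.dense(Array.fill(3)(scala.util.Random.nextDouble()))
val weights: Vector = algorithm.optimize(data, initialWeights)
```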

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-09 Thread Burak Yavuz
Hi Jaonary, The RowPartitionedMatrix is a special case of the BlockMatrix, where the colsPerBlock = nCols. I hope that helps. Burak On Mar 6, 2015 9:13 AM, "Jaonary Rabarisoa" wrote: > Hi Shivaram, > > Thank you for the link. I'm trying to figure out how can I port this to > mllib. May you can

Re: what are the types of tasks when running ALS iterations

2015-03-09 Thread Burak Yavuz
+user On Mar 9, 2015 8:47 AM, "Burak Yavuz" wrote: > Hi, > In the web UI, you don't see every single task. You see the name of the > last task before the stage boundary (which is a shuffle like a groupByKey), > which in your case is a flatMap. Therefore you only s

Re: How to reuse a ML trained model?

2015-03-07 Thread Burak Yavuz
Hi, There is model import/export for some of the ML algorithms on the current master (and they'll be shipped with the 1.3 release). Burak On Mar 7, 2015 4:17 AM, "Xi Shen" wrote: > Wait...it seem SparkContext does not provide a way to save/load object > files. It can only save/load RDD. What do
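For reference, the import/export API that shipped with Spark 1.3 looks roughly like this for the supported models; the model type and paths below are placeholders:
```
import org.apache.spark.mllib.classification.LogisticRegressionModel

// model: a trained LogisticRegressionModel (or another model implementing Saveable)
model.save(sc, "hdfs:///tmp/my-model")
val sameModel = LogisticRegressionModel.load(sc, "hdfs:///tmp/my-model")
```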

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
Hi Koert, Would you like to register this on spark-packages.org? Burak On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers wrote: > currently spark provides many excellent algorithms for operations per key > as long as the data send to the reducers per key fits in memory. operations > like combineBy

Re: Problem getting program to run on 15TB input

2015-02-27 Thread Burak Yavuz
Hi, Not sure if it can help, but `StorageLevel.MEMORY_AND_DISK_SER` generates many small objects that lead to very long GC time, causing the executor lost, heartbeat not received, and GC overhead limit exceeded messages. Could you try using `StorageLevel.MEMORY_AND_DISK` instead? You can also try
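The suggested change is a one-liner on the cached RDD; a minimal sketch, assuming `rdd` is the RDD being persisted:
```
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps deserialized partitions in memory and spills the rest
// to disk instead of recomputing them.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
```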

Re: Why is RDD lookup slow?

2015-02-19 Thread Burak Yavuz
If your dataset is large, there is a Spark Package called IndexedRDD optimized for lookups. Feel free to check that out. Burak On Feb 19, 2015 7:37 AM, "Ilya Ganelin" wrote: > Hi Shahab - if your data structures are small enough a broadcasted Map is > going to provide faster lookup. Lookup withi

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Burak Yavuz
Thanks a lot! > > Can I ask why this code generates a uniform distribution? > > > > If dist is N(0,1) data should be N(-1, 2). > > > > Let me know. > > Thanks, > > Luca > > > > 2015-02-07 3:00 GMT+00:00 Burak Y

Re: generate a random matrix with uniform distribution

2015-02-06 Thread Burak Yavuz
Hi, You can do the following: ``` import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.random._ // sc is the spark context, numPartitions is the number of partitions you want the RDD to be in val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numParti
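For a uniform (rather than normal) matrix, `RandomRDDs` also provides a uniform generator; a sketch along the same lines, with `n`, `k`, and `numPartitions` as in the snippet above:
```
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD

// Uniform in [0, 1); rescale if a different range is needed, e.g. [-1, 1)
val dist: RDD[Vector] = RandomRDDs.uniformVectorRDD(sc, n, k, numPartitions)
val rescaled: RDD[Vector] = dist.map(v => Vectors.dense(v.toArray.map(x => 2 * x - 1)))
val mat = new RowMatrix(rescaled)
```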

Re: matrix of random variables with spark.

2015-02-06 Thread Burak Yavuz
Forgot to add the more recent training material: https://databricks-training.s3.amazonaws.com/index.html On Fri, Feb 6, 2015 at 12:12 PM, Burak Yavuz wrote: > Hi Luca, > > You can tackle this using RowMatrix (spark-shell example): >

Re: matrix of random variables with spark.

2015-02-06 Thread Burak Yavuz
Hi Luca, You can tackle this using RowMatrix (spark-shell example): ``` import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.random._ // sc is the spark context, numPartitions is the number of partitions you want the RDD to be in val data: RDD[Vector] = RandomR

Re: null Error in ALS model predict

2014-12-24 Thread Burak Yavuz
Hi, The MatrixFactorizationModel consists of two RDDs. When you use the second method, Spark tries to serialize both RDDs for the .map() function, which is not possible, because RDDs are not serializable. Therefore you receive the NullPointerException. You must use the first method. Best, B
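A sketch of the working pattern, assuming `ratings` is an `RDD[Rating]` (the rank and iteration counts are placeholders): predictions are requested through the model on the driver, not by calling `predict` inside a `map` over another RDD.
```
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

val model: MatrixFactorizationModel = ALS.train(ratings, 10, 10)

// Bulk prediction: hand the model an RDD of (user, product) pairs
val usersProducts: RDD[(Int, Int)] = ratings.map(r => (r.user, r.product))
val predictions: RDD[Rating] = model.predict(usersProducts)
```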

Re: How can I make Spark Streaming count the words in a file in a unit test?

2014-12-08 Thread Burak Yavuz
Hi, https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf contains some performance tests for streaming. There are examples of how to generate synthetic files during the test in that repo, maybe you can find some code snippets that you can use there.

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Burak Yavuz
Hi, I've come across this multiple times, but not in a consistent manner. I found it hard to reproduce. I have a jira for it: SPARK-3080 Do you observe this error every single time? Where do you load your data from? Which version of Spark are you running? Figuring out the similarities may help

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Burak Yavuz
Hi Ray, The reduceByKey / collectAsMap does a lot of calculations. Therefore it can take a very long time if: 1) The parameter number of runs is set very high 2) k is set high (you have observed this already) 3) data is not properly repartitioned It seems that it is hanging, but there is a lot of

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Burak Yavuz
Hi, It appears that the step size is so high that the model is diverging with the added noise. Could you try setting the step size to 0.1 or 0.01? Best, Burak - Original Message - From: "Krishna Sankar" To: user@spark.apache.org Sent: Wednesday, October 1, 2014 12:43:20 PM Subj
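A minimal sketch of lowering the step size, assuming `trainingData` is an `RDD[LabeledPoint]`; the iteration count and step size are placeholders:
```
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// train(data, numIterations, stepSize): a smaller step keeps SGD from diverging
val model = LinearRegressionWithSGD.train(trainingData, 100, 0.01)
```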

Re: Spark on EC2

2014-09-18 Thread Burak Yavuz
Hi Gilberto, Could you please attach the driver logs as well, so that we can pinpoint what's going wrong? Could you also add the flag `--driver-memory 4g` while submitting your application and try that as well? Best, Burak - Original Message - From: "Gilberto Lira" To: user@spark.apach

Re: Odd error when using a rdd map within a stream map

2014-09-18 Thread Burak Yavuz
Hi, I believe it's because you're trying to use a Function of an RDD inside an RDD operation, which is not possible. Instead of using a `Function<JavaRDD<Float>, Void>`, could you try `Function<Float, Void>`, with `public Void call(Float arg0) throws Exception { ` and `System.out.println(arg0)` instead. I'm not perfectly sure of the semantics i

Re: Python version of kmeans

2014-09-17 Thread Burak Yavuz
Hi, spark-1.0.1/examples/src/main/python/kmeans.py => Naive example for users to understand how to code in Spark spark-1.0.1/python/pyspark/mllib/clustering.py => Use this!!! Bonus: spark-1.0.1/examples/src/main/python/mllib/kmeans.py => Example on how to call KMeans. Feel free to use it as a t

Re: MLLib: LIBSVM issue

2014-09-17 Thread Burak Yavuz
Hi, The spacing between the inputs should be a single space, not a tab. I feel like your inputs have tabs between them instead of a single space. Therefore the parser cannot parse the input. Best, Burak - Original Message - From: "Sameer Tilak" To: user@spark.apache.org Sent: Wednesda
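For reference, a valid LIBSVM line uses single spaces between `index:value` pairs and can be loaded with `MLUtils`; the path below is a placeholder:
```
import org.apache.spark.mllib.util.MLUtils

// Expected line format, space-separated:  <label> <index1>:<value1> <index2>:<value2> ...
// e.g.  1 1:0.5 3:1.2 7:0.9
val examples = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/data.libsvm")
```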

Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread Burak Yavuz
Hi, Could you try repartitioning the data by .repartition(# of cores on machine) or while reading the data, supply the number of minimum partitions as in sc.textFile(path, # of cores on machine). It may be that the whole data is stored in one block? If it is billions of rows, then the indexing
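Both suggestions in code form; the path and partition count are placeholders:
```
// Supply a minimum number of partitions up front...
val data = sc.textFile("hdfs:///path/to/input", 200)

// ...or repartition an existing RDD so no single partition's data exceeds 2 GB
val repartitioned = data.repartition(200)
```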

Re: How to run kmeans after pca?

2014-09-17 Thread Burak Yavuz
To properly perform PCA, you must left multiply the resulting DenseMatrix with the original RowMatrix. The result will also be a RowMatrix, therefore you can easily access the values by .values, and train KMeans on that. Don't forget to Broadcast the DenseMatrix returned from RowMatrix.computePr
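A sketch of the whole flow, assuming `rows` is an `RDD[Vector]` of the original data; the component count and clustering parameters are placeholders:
```
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(10)          // local matrix of principal components
val projected: RowMatrix = mat.multiply(pc)          // project the original rows onto the components
val clusters = KMeans.train(projected.rows, 5, 20)   // k = 5, maxIterations = 20
```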

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
age, except in Spark Streaming, and some MLlib algorithms. If you can help with the guide, I think it would be a nice feature to have! Burak - Original Message - From: "Andrew Ash" To: "Burak Yavuz" Cc: "Макар Красноперов" , "user" Sent: Wednesday

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
etting the directory will not be enough. Best, Burak - Original Message - From: "Andrew Ash" To: "Burak Yavuz" Cc: "Макар Красноперов" , "user" Sent: Wednesday, September 17, 2014 10:19:42 AM Subject: Re: Spark and disk usage. Hi Burak, Most discussion

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Hi, The files you mentioned are temporary files written by Spark during shuffling. ALS will write a LOT of those files as it is a shuffle heavy algorithm. Those files will be deleted after your program completes as Spark looks for those files in case a fault occurs. Having those files ready allo

Re: Spark SQL

2014-09-14 Thread Burak Yavuz
Hi, I'm not a master on SparkSQL, but from what I understand, the problem is that you're trying to access an RDD inside an RDD here: val xyz = file.map(line => *** extractCurRate(sqlContext.sql("select rate ... *** and here: xyz = file.map(line => *** extractCurRate(sqlContext.sql("select rate

Re: Filter function problem

2014-09-09 Thread Burak Yavuz
Hi, val test = persons.value .map{tuple => (tuple._1, tuple._2 .filter{event => *inactiveIDs.filter(event2 => event2._1 == tuple._1).count() != 0})} Your problem is right between the asterisks. You can't make an RDD operation inside an RDD operation, because RDDs can't be serialized
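A sketch of the usual fix: collect the small RDD to the driver and broadcast it, so that no RDD appears inside another RDD's closure. The types below are hypothetical stand-ins for the ones in the original code:
```
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def filterInactive(sc: SparkContext,
                   persons: RDD[(Long, Seq[String])],
                   inactiveIDs: RDD[(Long, String)]): RDD[(Long, Seq[String])] = {
  // Materialize the (small) inactive IDs once and ship them to the executors
  val inactiveSet = sc.broadcast(inactiveIDs.map(_._1).collect().toSet)
  persons.map { case (id, events) =>
    (id, events.filter(_ => inactiveSet.value.contains(id)))
  }
}
```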

Re: Executor address issue: "CANNOT FIND ADDRESS" (Spark 0.9.1)

2014-09-08 Thread Burak Yavuz
Hi Nicolas, It seems that you are starting to lose executors and then the job starts to fail. Can you please share more information about your application so that we can help you debug it, such as what you're trying to do, and your driver logs please? Best, Burak - Original Message - F

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Burak Yavuz
Hi, Spark uses by default approximately 60% of the executor heap memory to store RDDs. That's why you have 8.6GB instead of 16GB. The 95.5 is therefore the sum of the 8.6 GB across all executors plus the driver memory. Best, Burak - Original Message - From: "SK" To: u...@spark.incubator.ap
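The arithmetic behind those numbers, as a sketch, assuming the pre-1.6 defaults of spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9:
```
val executorHeapGB = 16.0
val storageGB = executorHeapGB * 0.6 * 0.9   // ≈ 8.6 GB of cache space shown in the UI
```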

Re: OutofMemoryError when generating output

2014-08-28 Thread Burak Yavuz
Yeah, saveAsTextFile is an RDD-specific method. If you really want to use that method, just turn the map into an RDD: `sc.parallelize(x.toSeq).saveAsTextFile(...)` Reading through the API docs will show you many more alternative solutions! Best, Burak - Original Message - From: "SK"
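A minimal sketch of that work-around, assuming `x` is a driver-side Scala Map and the output path is a placeholder:
```
// saveAsTextFile lives on RDDs, so turn the local collection into one first
val x = Map("a" -> 1, "b" -> 2)
sc.parallelize(x.toSeq).saveAsTextFile("hdfs:///tmp/output")
```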

Re: Amplab: big-data-benchmark

2014-08-27 Thread Burak Yavuz
Hi Sameer, I've faced this issue before. They don't show up on http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: `sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")` The gotcha is that you also need to supply which dataset you want: crawl, uservisits, or rankings

Re: saveAsTextFile hangs with hdfs

2014-08-26 Thread Burak Yavuz
Hi David, Your job is probably hanging on the groupByKey process. Probably GC is kicking in and the process starts to hang or the data is unbalanced and you end up with stragglers (Once GC kicks in you'll start to get the connection errors you shared). If you don't care about the list of value

Re: OutofMemoryError when generating output

2014-08-26 Thread Burak Yavuz
Hi, The error doesn't occur during saveAsTextFile but rather during the groupByKey as far as I can tell. We strongly urge users to not use groupByKey if they don't have to. What I would suggest is the following work-around: sc.textFile(baseFile)).map { line => val fields = line.split("\t") (
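A sketch of that work-around continued with reduceByKey; the field positions, the sum aggregation, and `outputPath` are assumptions standing in for the original job:
```
val pairs = sc.textFile(baseFile).map { line =>
  val fields = line.split("\t")
  (fields(0), fields(1).toDouble)
}

// reduceByKey combines values map-side, so the full list of values per key
// is never materialized the way it is with groupByKey
val sums = pairs.reduceByKey(_ + _)
sums.saveAsTextFile(outputPath)
```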

Re: Finding Rank in Spark

2014-08-23 Thread Burak Yavuz
Spearman's Correlation requires the calculation of ranks for columns. You can check out the code here and slice out the part you need! https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala Best, Burak - Original Message
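If only the correlations (rather than the raw ranks) are needed, the high-level API built on that code is a one-liner; a sketch, assuming `data` is an `RDD[Vector]` with one observation per row:
```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Spearman ranks each column internally before computing the correlation on the ranks
val corrMatrix = Statistics.corr(data, "spearman")
```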

Re: LDA example?

2014-08-22 Thread Burak Yavuz
You can check out this pull request: https://github.com/apache/spark/pull/476 LDA is on the roadmap for the 1.2 release, hopefully we will officially support it then! Best, Burak - Original Message - From: "Denny Lee" To: user@spark.apache.org Sent: Thursday, August 21, 2014 10:10:35 P

Re: [MLLib]:choosing the Loss function

2014-08-11 Thread Burak Yavuz
Hi, // Initialize the optimizer using logistic regression as the loss function with L2 regularization val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater()) // Set the hyperparameters lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorre
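A fuller sketch of that setup, assuming `data` is an `RDD[(Double, Vector)]` of (label, features) pairs; the hyperparameter values and feature count are placeholders, and the exact setter names can differ slightly across Spark versions:
```
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

val numIterations = 100
val regParam = 0.1
val tol = 1e-4
val numCorrections = 10
val numFeatures = 10

// Logistic loss with L2 regularization, optimized with L-BFGS
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
lbfgs.setNumIterations(numIterations)
  .setRegParam(regParam)
  .setConvergenceTol(tol)
  .setNumCorrections(numCorrections)

val initialWeights = Vectors.dense(new Array[Double](numFeatures))
val weights: Vector = lbfgs.optimize(data, initialWeights)
```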

Re: [MLLib]:choosing the Loss function

2014-08-07 Thread Burak Yavuz
The following code will allow you to run Logistic Regression using L-BFGS: val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater()) lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor) val weights = lbfgs.optimize(data, initi

Re: questions about MLLib recommendation models

2014-08-07 Thread Burak Yavuz
Hi Jay, I've had the same problem you've been having in Question 1 with a synthetic dataset. I thought I wasn't producing the dataset well enough. This seems to be a bug. I will open a JIRA for it. Instead of using: ratings.map{ case Rating(u,m,r) => { val pred = model.predict(u, m) (r

Re: KMeans Input Format

2014-08-07 Thread Burak Yavuz
Hi, Could you try running spark-shell with the flag --driver-memory 2g or more if you have more RAM available and try again? Thanks, Burak - Original Message - From: "AlexanderRiggers" To: u...@spark.incubator.apache.org Sent: Thursday, August 7, 2014 7:37:40 AM Subject: KMeans Input F

Re: Regularization parameters

2014-08-06 Thread Burak Yavuz
Hi, That is interesting. Would you please share some code on how you are setting the regularization type, regularization parameters and running Logistic Regression? Thanks, Burak - Original Message - From: "SK" To: u...@spark.incubator.apache.org Sent: Wednesday, August 6, 2014 6:18:4

Re: Naive Bayes parameters

2014-08-06 Thread Burak Yavuz
Hi, Could you please send the link for the example you are talking about? minPartitions and numFeatures do not exist in the current API for NaiveBayes as far as I know. So, I don't know how to answer your second question. Regarding your first question, guessing blindly, it should be related to

Re: Retrieve dataset of Big Data Benchmark

2014-07-16 Thread Burak Yavuz
Hi Tom, Actually I was mistaken, sorry about that. Indeed on the website, the keys for the datasets you mention are not showing up. However, they are still accessible through the spark-shell, which means that they are there. So in order to answer your questions: - Are the tiny and 1node sets s

Re: Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Burak Yavuz
Hi Tom, If you wish to load the file in Spark directly, you can use sc.textFile("s3n://big-data-benchmark/pavlo/...") where sc is your SparkContext. This can be done because the files should be publicly available and you don't need AWS Credentials to access them. If you want to download the fi

Re: Restarting a Streaming Context

2014-07-09 Thread Burak Yavuz
Someone can correct me if I'm wrong, but unfortunately for now, once a streaming context is stopped, it can't be restarted. - Original Message - From: "Nick Chammas" To: u...@spark.incubator.apache.org Sent: Wednesday, July 9, 2014 6:11:51 PM Subject: Restarting a Streaming Context So
