Fwd: Change ivy cache for spark on Windows

2015-04-27 Thread Burak Yavuz
+user -- Forwarded message -- From: Burak Yavuz brk...@gmail.com Date: Mon, Apr 27, 2015 at 1:59 PM Subject: Re: Change ivy cache for spark on Windows To: mj jone...@gmail.com Hi, In your conf file (SPARK_HOME\conf\spark-defaults.conf) you can set: `spark.jars.ivy \your\path
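For illustration, the corresponding spark-defaults.conf entry might look like the following; the Windows path is purely a hypothetical example of where the Ivy cache could live:

```
# SPARK_HOME\conf\spark-defaults.conf -- the path below is a hypothetical example
spark.jars.ivy    D:\data\spark\ivy-cache
```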

Re: Understanding Spark/MLlib failures

2015-04-23 Thread Burak Yavuz
Hi Andrew, I observed similar behavior under high GC pressure when running ALS. What happened to me was that there would be very long Full GC pauses (over 600 seconds at times). These would prevent the executors from sending heartbeats to the driver. Then the driver would think that the

Re: Benchmarking col vs row similarities

2015-04-10 Thread Burak Yavuz
Depends... The heartbeat issue you're seeing happens due to GC pressure (probably due to Full GC). If you increase the memory too much, the GCs may be less frequent, but the Full GCs may take longer. Try increasing the following confs: spark.executor.heartbeatInterval
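As a hedged illustration, such settings could go into spark-defaults.conf roughly as below; the values are illustrative only, and in Spark 1.x the heartbeat interval is a plain number of milliseconds:

```
# Illustrative values only -- tune for the cluster at hand
spark.executor.heartbeatInterval    60000
spark.executor.memory               8g
```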

Re: Query REST web service with Spark?

2015-03-31 Thread Burak Yavuz
Hi, If I recall correctly, I've read about people integrating REST calls into Spark Streaming jobs in the user list. I don't see any reason why it shouldn't be possible. Best, Burak On Tue, Mar 31, 2015 at 1:46 PM, Minnow Noir minnown...@gmail.com wrote: We have some data on Hadoop that

Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread Burak Yavuz
Hi David, Can you also try with Spark 1.3 if possible? I believe there was a 2x improvement on K-Means between 1.2 and 1.3. Thanks, Burak On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 davidshe...@gmail.com wrote: Hi Jao, Sorry to pop up this old thread. I have the same problem as you

Re: netlib-java cannot load native lib in Windows when using spark-submit

2015-03-22 Thread Burak Yavuz
Did you build Spark with: -Pnetlib-lgpl? Ref: https://spark.apache.org/docs/latest/mllib-guide.html Burak On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu yuzhih...@gmail.com wrote: How about pointing LD_LIBRARY_PATH to native lib folder ? You need Spark 1.2.0 or higher for the above to work. See

Re: RDD ordering after map

2015-03-18 Thread Burak Yavuz
Hi, Yes, ordering is preserved with map. Shuffles break ordering. Burak On Wed, Mar 18, 2015 at 2:02 PM, sergunok ser...@gmail.com wrote: Does map(...) preserve ordering of original RDD? -- View this message in context:

Re: Getting incorrect weights for LinearRegression

2015-03-13 Thread Burak Yavuz
Hi, I would suggest you use LBFGS, as I think the step size is hurting you. You can run the same thing in LBFGS as: ``` val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater()) val initialWeights = Vectors.dense(Array.fill(3)( scala.util.Random.nextDouble())) val weights =
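A minimal, self-contained sketch of how the truncated snippet above could continue; the toy dataset is purely illustrative, and LBFGS.optimize expects an RDD[(Double, Vector)] of (label, features) pairs:

```
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SimpleUpdater}
import org.apache.spark.rdd.RDD

// A toy (label, features) dataset so the sketch runs in spark-shell; replace with the real data
val data: RDD[(Double, Vector)] = sc.parallelize(Seq(
  (1.0, Vectors.dense(0.0, 1.0, 2.0)),
  (2.0, Vectors.dense(1.0, 2.0, 3.0))
))

val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
val initialWeights = Vectors.dense(Array.fill(3)(scala.util.Random.nextDouble()))
// optimize runs L-BFGS and returns the fitted weight vector
val weights: Vector = algorithm.optimize(data, initialWeights)
```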

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-09 Thread Burak Yavuz
Hi Jaonary, The RowPartitionedMatrix is a special case of the BlockMatrix, where the colsPerBlock = nCols. I hope that helps. Burak On Mar 6, 2015 9:13 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi Shivaram, Thank you for the link. I'm trying to figure out how can I port this to mllib.

Re: what are the types of tasks when running ALS iterations

2015-03-09 Thread Burak Yavuz
+user On Mar 9, 2015 8:47 AM, Burak Yavuz brk...@gmail.com wrote: Hi, In the web UI, you don't see every single task. You see the name of the last task before the stage boundary (which is a shuffle like a groupByKey), which in your case is a flatMap. Therefore you only see flatMap in the UI

Re: How to reuse a ML trained model?

2015-03-07 Thread Burak Yavuz
Hi, There is model import/export for some of the ML algorithms on the current master (and they'll be shipped with the 1.3 release). Burak On Mar 7, 2015 4:17 AM, Xi Shen davidshe...@gmail.com wrote: Wait...it seem SparkContext does not provide a way to save/load object files. It can only

Re: spark-sorted, or secondary sort and streaming reduce for spark

2015-03-06 Thread Burak Yavuz
Hi Koert, Would you like to register this on spark-packages.org? Burak On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote: currently spark provides many excellent algorithms for operations per key as long as the data send to the reducers per key fits in memory. operations

Re: Problem getting program to run on 15TB input

2015-02-27 Thread Burak Yavuz
Hi, Not sure if it can help, but `StorageLevel.MEMORY_AND_DISK_SER` generates many small objects that lead to very long GC time, causing the 'executor lost', 'heartbeat not received', and 'GC overhead limit exceeded' messages. Could you try using `StorageLevel.MEMORY_AND_DISK` instead? You can also
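For illustration, a one-line sketch of the suggested switch; `someRdd` is a hypothetical stand-in for whatever RDD the job is caching:

```
import org.apache.spark.storage.StorageLevel

// Switch from MEMORY_AND_DISK_SER to MEMORY_AND_DISK, as suggested above;
// someRdd is a placeholder for the RDD being persisted in the job in question
val cached = someRdd.persist(StorageLevel.MEMORY_AND_DISK)
```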

Re: Why is RDD lookup slow?

2015-02-19 Thread Burak Yavuz
If your dataset is large, there is a Spark Package called IndexedRDD optimized for lookups. Feel free to check that out. Burak On Feb 19, 2015 7:37 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi Shahab - if your data structures are small enough a broadcasted Map is going to provide faster

Re: generate a random matrix with uniform distribution

2015-02-09 Thread Burak Yavuz
wrote: Thanks a lot! Can I ask why this code generates a uniform distribution? If dist is N(0,1) data should be N(-1, 2). Let me know. Thanks, Luca 2015-02-07 3:00 GMT+00:00 Burak Yavuz brk...@gmail.com: Hi, You can do the following: ``` import

Re: matrix of random variables with spark.

2015-02-06 Thread Burak Yavuz
Forgot to add the more recent training material: https://databricks-training.s3.amazonaws.com/index.html On Fri, Feb 6, 2015 at 12:12 PM, Burak Yavuz brk...@gmail.com wrote: Hi Luca, You can tackle this using RowMatrix (spark-shell example): ``` import

Re: matrix of random variables with spark.

2015-02-06 Thread Burak Yavuz
Hi Luca, You can tackle this using RowMatrix (spark-shell example): ``` import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.random._ // sc is the spark context, numPartitions is the number of partitions you want the RDD to be in val data: RDD[Vector] =

Re: generate a random matrix with uniform distribution

2015-02-06 Thread Burak Yavuz
Hi, You can do the following: ``` import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.random._ // sc is the spark context, numPartitions is the number of partitions you want the RDD to be in val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k,
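A hedged spark-shell sketch that completes the truncated call above; the sizes are illustrative, and RandomRDDs also provides uniformVectorRDD if U(0,1) entries are wanted directly:

```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.rdd.RDD

val n = 1000L            // number of rows (illustrative)
val k = 10               // number of columns (illustrative)
val numPartitions = 4    // how many partitions the RDD should have

// i.i.d. N(0, 1) entries; shift/scale the vectors afterwards for other ranges
val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions)
val mat = new RowMatrix(dist)
```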

Re: null Error in ALS model predict

2014-12-24 Thread Burak Yavuz
Hi, The MatrixFactorizationModel consists of two RDDs. When you use the second method, Spark tries to serialize both RDDs for the .map() function, which is not possible, because RDDs are not serializable. Therefore you receive the NullPointerException. You must use the first method. Best,
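As a hedged illustration of the RDD-based call (presumably the "first method" referred to above), which keeps the model's internal RDDs out of any closure:

```
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

// model: MatrixFactorizationModel and userProducts: RDD[(Int, Int)] are assumed to exist;
// predicting on the whole RDD avoids serializing the model inside a map
val predictions: RDD[Rating] = model.predict(userProducts)
```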

Re: How can I make Spark Streaming count the words in a file in a unit test?

2014-12-08 Thread Burak Yavuz
Hi, https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf contains some performance tests for streaming. There are examples of how to generate synthetic files during the test in that repo, maybe you can find some code snippets that you can use there.

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Burak Yavuz
Hi, I've come across this multiple times, but not in a consistent manner. I found it hard to reproduce. I have a jira for it: SPARK-3080 Do you observe this error every single time? Where do you load your data from? Which version of Spark are you running? Figuring out the similarities may

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Burak Yavuz
Hi Ray, The reduceByKey / collectAsMap does a lot of calculations. Therefore it can take a very long time if: 1) The parameter number of runs is set very high 2) k is set high (you have observed this already) 3) data is not properly repartitioned It seems that it is hanging, but there is a lot
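A hedged spark-shell sketch of the knobs mentioned above; the input path, k, and partition count are illustrative:

```
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///path/to/points")   // hypothetical input path
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .repartition(64)                                    // keep the partitions balanced
  .cache()

// Keeping runs at 1 and k moderate limits the work done in the final reduceByKey / collectAsMap
val model = KMeans.train(points, 100, 20, 1)          // k = 100, maxIterations = 20, runs = 1
```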

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Burak Yavuz
Hi, It appears that the step size is too high that the model is diverging with the added noise. Could you try by setting the step size to be 0.1 or 0.01? Best, Burak - Original Message - From: Krishna Sankar ksanka...@gmail.com To: user@spark.apache.org Sent: Wednesday, October 1,

Re: Python version of kmeans

2014-09-18 Thread Burak Yavuz
Hi, spark-1.0.1/examples/src/main/python/kmeans.py = Naive example for users to understand how to code in Spark spark-1.0.1/python/pyspark/mllib/clustering.py = Use this!!! Bonus: spark-1.0.1/examples/src/main/python/mllib/kmeans.py = Example on how to call KMeans. Feel free to use it as a

Re: Odd error when using a rdd map within a stream map

2014-09-18 Thread Burak Yavuz
Hi, I believe it's because you're trying to use a Function of an RDD, in an RDD, which is not possible. Instead of using a `Function<JavaRDD<Float>, Void>`, could you try `Function<Float, Void>`, and `public Void call(Float arg0) throws Exception {` and `System.out.println(arg0)` instead. I'm not perfectly sure

Re: Spark on EC2

2014-09-18 Thread Burak Yavuz
Hi Gilberto, Could you please attach the driver logs as well, so that we can pinpoint what's going wrong? Could you also add the flag `--driver-memory 4g` while submitting your application and try that as well? Best, Burak - Original Message - From: Gilberto Lira g...@scanboo.com.br

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Hi, The files you mentioned are temporary files written by Spark during shuffling. ALS will write a LOT of those files as it is a shuffle heavy algorithm. Those files will be deleted after your program completes as Spark looks for those files in case a fault occurs. Having those files ready

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
the directory will not be enough. Best, Burak - Original Message - From: Andrew Ash and...@andrewash.com To: Burak Yavuz bya...@stanford.edu Cc: Макар Красноперов connector@gmail.com, user user@spark.apache.org Sent: Wednesday, September 17, 2014 10:19:42 AM Subject: Re: Spark and disk usage. Hi

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Streaming, and some MLlib algorithms. If you can help with the guide, I think it would be a nice feature to have! Burak - Original Message - From: Andrew Ash and...@andrewash.com To: Burak Yavuz bya...@stanford.edu Cc: Макар Красноперов connector@gmail.com, user user@spark.apache.org

Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread Burak Yavuz
Hi, Could you try repartitioning the data with .repartition(# of cores on machine) or, while reading the data, supply the minimum number of partitions, as in sc.textFile(path, # of cores on machine). It may be that the whole data is stored in one block? If it is billions of rows, then the indexing
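Both suggestions, as a short hedged sketch; the path and core count are placeholders:

```
val numCores = 8   // cores on the machine (illustrative)

// Option 1: repartition after loading
val repartitioned = sc.textFile("hdfs:///path/to/data").repartition(numCores)

// Option 2: ask for a minimum number of partitions up front while reading
val preSplit = sc.textFile("hdfs:///path/to/data", numCores)
```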

Re: MLLib: LIBSVM issue

2014-09-17 Thread Burak Yavuz
Hi, The spacing between the inputs should be a single space, not a tab. I feel like your inputs have tabs between them instead of a single space. Therefore the parser cannot parse the input. Best, Burak - Original Message - From: Sameer Tilak ssti...@live.com To: user@spark.apache.org

Re: Spark SQL

2014-09-14 Thread Burak Yavuz
Hi, I'm not an expert on SparkSQL, but from what I understand, the problem is that you're trying to access an RDD inside an RDD here: val xyz = file.map(line => *** extractCurRate(sqlContext.sql(select rate ... *** and here: xyz = file.map(line => *** extractCurRate(sqlContext.sql(select rate

Re: Filter function problem

2014-09-09 Thread Burak Yavuz
Hi, val test = persons.value .map{tuple => (tuple._1, tuple._2 .filter{event => *inactiveIDs.filter(event2 => event2._1 == tuple._1).count() != 0*})} Your problem is right between the asterisks. You can't make an RDD operation inside an RDD operation, because RDDs can't be serialized.

Re: OutofMemoryError when generating output

2014-08-28 Thread Burak Yavuz
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that method, just turn the map into an RDD: `sc.parallelize(x.toSeq).saveAsTextFile(...)` Reading through the api-docs will present you many more alternate solutions! Best, Burak - Original Message - From: SK
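A tiny hedged example of that work-around; `x` stands in for whatever local map the thread was trying to save, and the output path is hypothetical:

```
// x is a local (driver-side) Scala collection in this sketch
val x: Map[String, Int] = Map("a" -> 1, "b" -> 2)

// saveAsTextFile is defined on RDDs, so lift the map into an RDD first
sc.parallelize(x.toSeq).saveAsTextFile("hdfs:///tmp/output")
```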

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Burak Yavuz
Hi, By default, Spark uses approximately 60% of the executor heap memory to store RDDs. That's why you have 8.6GB instead of 16GB. 95.5 GB is therefore the sum of all the 8.6 GB of executor memory + the driver memory. Best, Burak - Original Message - From: SK skrishna...@gmail.com To:

Re: Amplab: big-data-benchmark

2014-08-27 Thread Burak Yavuz
Hi Sameer, I've faced this issue before. They don't show up on http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: `sc.textFile(s3n://big-data-benchmark/pavlo/text/tiny/crawl)` The gotcha is that you also need to supply which dataset you want: crawl, uservisits, or rankings
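For reference, a hedged spark-shell snippet showing the direct load of one of those keys; the last path component (crawl, uservisits, or rankings) picks which dataset is read:

```
// The bucket is public, so per the thread no AWS credentials are needed with the s3n:// scheme
val crawl = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")
crawl.take(5).foreach(println)
```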

Re: OutofMemoryError when generating output

2014-08-26 Thread Burak Yavuz
Hi, The error doesn't occur during saveAsTextFile but rather during the groupByKey as far as I can tell. We strongly urge users to not use groupByKey if they don't have to. What I would suggest is the following work-around: sc.textFile(baseFile).map { line => val fields = line.split("\t")
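A hedged sketch of the shape of that work-around, replacing groupByKey with a map-side-combining reduceByKey; the key and the count are illustrative, since the snippet above is cut off:

```
// baseFile is the input path from the thread (assumed to be defined)
val counts = sc.textFile(baseFile)
  .map { line =>
    val fields = line.split("\t")
    (fields(0), 1L)        // key on the first column; aggregate whatever the real job needs
  }
  .reduceByKey(_ + _)      // combines on the map side, unlike groupByKey
```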

Re: saveAsTextFile hangs with hdfs

2014-08-26 Thread Burak Yavuz
Hi David, Your job is probably hanging on the groupByKey process. Probably GC is kicking in and the process starts to hang or the data is unbalanced and you end up with stragglers (Once GC kicks in you'll start to get the connection errors you shared). If you don't care about the list of

Re: Finding Rank in Spark

2014-08-23 Thread Burak Yavuz
Spearman's Correlation requires the calculation of ranks for columns. You can checkout the code here and slice the part you need! https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala Best, Burak - Original

Re: LDA example?

2014-08-22 Thread Burak Yavuz
You can check out this pull request: https://github.com/apache/spark/pull/476 LDA is on the roadmap for the 1.2 release, hopefully we will officially support it then! Best, Burak - Original Message - From: Denny Lee denny.g@gmail.com To: user@spark.apache.org Sent: Thursday, August

Re: [MLLib]:choosing the Loss function

2014-08-11 Thread Burak Yavuz
Hi, // Initialize the optimizer using logistic regression as the loss function with L2 regularization val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater()) // Set the hyperparameters

Re: KMeans Input Format

2014-08-07 Thread Burak Yavuz
Hi, Could you try running spark-shell with the flag --driver-memory 2g or more if you have more RAM available and try again? Thanks, Burak - Original Message - From: AlexanderRiggers alexander.rigg...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 7, 2014 7:37:40

Re: questions about MLLib recommendation models

2014-08-07 Thread Burak Yavuz
Hi Jay, I've had the same problem you've been having in Question 1 with a synthetic dataset. I thought I wasn't producing the dataset well enough. This seems to be a bug. I will open a JIRA for it. Instead of using: ratings.map{ case Rating(u,m,r) => { val pred = model.predict(u, m) (r
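A hedged sketch of the batch alternative: predict for all (user, product) pairs at once with the RDD-based predict and join back to the true ratings, rather than calling model.predict(u, m) once per element; `ratings` and `model` are assumed to exist already:

```
import org.apache.spark.mllib.recommendation.Rating

// ratings: RDD[Rating] and model: MatrixFactorizationModel are assumed to exist
val userProducts = ratings.map { case Rating(u, m, _) => (u, m) }
val predictions = model.predict(userProducts)
  .map { case Rating(u, m, pred) => ((u, m), pred) }
val ratesAndPreds = ratings.map { case Rating(u, m, r) => ((u, m), r) }.join(predictions)
val mse = ratesAndPreds.map { case (_, (r, pred)) => (r - pred) * (r - pred) }.mean()
```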

Re: [MLLib]:choosing the Loss function

2014-08-07 Thread Burak Yavuz
The following code will allow you to run Logistic Regression using L-BFGS: val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater()) lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor) val weights = lbfgs.optimize(data,
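Filling in the truncated call with hedged, illustrative values; the toy dataset is only there to make the sketch runnable, and LBFGS.optimize takes an RDD[(Double, Vector)] of (label, features) pairs:

```
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

val numIterations = 100   // illustrative hyperparameters
val regParam = 0.1
val tol = 1e-4
val numCor = 10
val numFeatures = 5       // must match the feature dimension of the data

// A toy binary-label dataset so the sketch runs in spark-shell; replace with the real data
val data = sc.parallelize(Seq(
  (0.0, Vectors.dense(0.0, 0.1, 0.2, 0.3, 0.4)),
  (1.0, Vectors.dense(1.0, 1.1, 1.2, 1.3, 1.4))
))

val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
lbfgs.setMaxNumIterations(numIterations)
  .setRegParam(regParam)
  .setConvergenceTol(tol)
  .setNumCorrections(numCor)
val initialWeights = Vectors.dense(new Array[Double](numFeatures))
val weights = lbfgs.optimize(data, initialWeights)
```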

Re: Regularization parameters

2014-08-06 Thread Burak Yavuz
Hi, That is interesting. Would you please share some code on how you are setting the regularization type, regularization parameters and running Logistic Regression? Thanks, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Wednesday,

Re: Retrieve dataset of Big Data Benchmark

2014-07-16 Thread Burak Yavuz
Hi Tom, Actually I was mistaken, sorry about that. Indeed on the website, the keys for the datasets you mention are not showing up. However, they are still accessible through the spark-shell, which means that they are there. So in order to answer your questions: - Are the tiny and 1node sets

Re: Retrieve dataset of Big Data Benchmark

2014-07-15 Thread Burak Yavuz
Hi Tom, If you wish to load the file in Spark directly, you can use sc.textFile(s3n://big-data-benchmark/pavlo/...) where sc is your SparkContext. This can be done because the files should be publicly available and you don't need AWS Credentials to access them. If you want to download the

Re: Restarting a Streaming Context

2014-07-09 Thread Burak Yavuz
Someone can correct me if I'm wrong, but unfortunately for now, once a streaming context is stopped, it can't be restarted. - Original Message - From: Nick Chammas nicholas.cham...@gmail.com To: u...@spark.incubator.apache.org Sent: Wednesday, July 9, 2014 6:11:51 PM Subject:
