+user
-- Forwarded message --
From: Burak Yavuz brk...@gmail.com
Date: Mon, Apr 27, 2015 at 1:59 PM
Subject: Re: Change ivy cache for spark on Windows
To: mj jone...@gmail.com
Hi,
In your conf file (SPARK_HOME\conf\spark-defaults.conf) you can set:
`spark.jars.ivy \your\path`
Hi Andrew,
I observed similar behavior under high GC pressure when running ALS. What
happened to me was that there would be very long Full GC pauses (over 600
seconds at times). These would prevent the executors from sending
heartbeats to the driver. Then the driver would think that the
Depends... The heartbeat issue you're seeing happens due to GC pressure (probably
due to Full GC). If you increase the memory too much, the GCs may be less
frequent, but the Full GCs may take longer. Try increasing the following
confs:
spark.executor.heartbeatInterval
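For example, in spark-defaults.conf (the value below is purely illustrative, not a recommendation; in Spark 1.x this conf is specified in milliseconds):

```
spark.executor.heartbeatInterval  60000
```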
Hi,
If I recall correctly, I've read about people integrating REST calls into Spark
Streaming jobs on the user list. I can't think of any reason why it
shouldn't be possible.
Best,
Burak
On Tue, Mar 31, 2015 at 1:46 PM, Minnow Noir minnown...@gmail.com wrote:
We have some data on Hadoop that
Hi David,
Can you also try with Spark 1.3 if possible? I believe there was a 2x
improvement on K-Means between 1.2 and 1.3.
Thanks,
Burak
On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 davidshe...@gmail.com wrote:
Hi Jao,
Sorry to pop up this old thread. I am having the same problem as you
Did you build Spark with: -Pnetlib-lgpl?
Ref: https://spark.apache.org/docs/latest/mllib-guide.html
Burak
On Sun, Mar 22, 2015 at 7:37 AM, Ted Yu yuzhih...@gmail.com wrote:
How about pointing LD_LIBRARY_PATH to native lib folder ?
You need Spark 1.2.0 or higher for the above to work. See
Hi,
Yes, ordering is preserved with map. Shuffles break ordering.
Burak
On Wed, Mar 18, 2015 at 2:02 PM, sergunok ser...@gmail.com wrote:
Does map(...) preserve ordering of original RDD?
--
View this message in context:
Hi,
I would suggest you use LBFGS, as I think the step size is hurting you. You
can run the same thing in LBFGS as:
```
val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
val initialWeights = Vectors.dense(Array.fill(3)(scala.util.Random.nextDouble()))
val weights = algorithm.optimize(data, initialWeights)
```
Hi Jaonary,
The RowPartitionedMatrix is a special case of the BlockMatrix, where the
colsPerBlock = nCols. I hope that helps.
Burak
On Mar 6, 2015 9:13 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi Shivaram,
Thank you for the link. I'm trying to figure out how I can port this to
mllib.
+user
On Mar 9, 2015 8:47 AM, Burak Yavuz brk...@gmail.com wrote:
Hi,
In the web UI, you don't see every single task. You see the name of the
last task before the stage boundary (which is a shuffle like a groupByKey),
which in your case is a flatMap. Therefore you only see flatMap in the UI
Hi,
There is model import/export for some of the ML algorithms on the current
master (and they'll be shipped with the 1.3 release).
Burak
On Mar 7, 2015 4:17 AM, Xi Shen davidshe...@gmail.com wrote:
Wait... it seems SparkContext does not provide a way to save/load object
files. It can only
Hi Koert,
Would you like to register this on spark-packages.org?
Burak
On Fri, Mar 6, 2015 at 8:53 AM, Koert Kuipers ko...@tresata.com wrote:
currently spark provides many excellent algorithms for operations per key
as long as the data sent to the reducers per key fits in memory. operations
Hi,
Not sure if it can help, but `StorageLevel.MEMORY_AND_DISK_SER` generates
many small objects that lead to very long GC times, causing the "executor
lost", "heartbeat not received", and "GC overhead limit exceeded" messages.
Could you try using `StorageLevel.MEMORY_AND_DISK` instead? You can also
If your dataset is large, there is a Spark Package called IndexedRDD
optimized for lookups. Feel free to check that out.
Burak
On Feb 19, 2015 7:37 AM, Ilya Ganelin ilgan...@gmail.com wrote:
Hi Shahab - if your data structures are small enough a broadcasted Map is
going to provide faster
wrote:
Thanks a lot!
Can I ask why this code generates a uniform distribution?
If dist is N(0,1) data should be N(-1, 2).
Let me know.
Thanks,
Luca
2015-02-07 3:00 GMT+00:00 Burak Yavuz brk...@gmail.com:
Hi,
You can do the following:
```
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._
```
Forgot to add the more recent training material:
https://databricks-training.s3.amazonaws.com/index.html
On Fri, Feb 6, 2015 at 12:12 PM, Burak Yavuz brk...@gmail.com wrote:
Hi Luca,
You can tackle this using RowMatrix (spark-shell example):
```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._
import org.apache.spark.rdd.RDD

// sc is the spark context, numPartitions is the number of partitions you
// want the RDD to be in; n rows, k columns
val data: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions)
```
Hi,
You can do the following:
```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.random._
import org.apache.spark.rdd.RDD

// sc is the spark context, numPartitions is the number of partitions you
// want the RDD to be in; n rows, k columns
val dist: RDD[Vector] = RandomRDDs.normalVectorRDD(sc, n, k, numPartitions)
```
Hi,
The MatrixFactorizationModel consists of two RDDs. When you use the second
method, Spark tries to serialize both RDDs for the .map() function,
which is not possible, because RDDs are not serializable. Therefore you
receive the NullPointerException. You must use the first method.
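The distinction, as a hedged sketch (the `model` and `usersProducts` names are placeholders; `predict` signatures per MLlib's MatrixFactorizationModel):

```scala
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

// First method (works): a single driver-side call that takes an RDD of
// (user, product) pairs; no RDD is captured inside a closure
def good(model: MatrixFactorizationModel, usersProducts: RDD[(Int, Int)]) =
  model.predict(usersProducts)

// Second method (fails): calling predict inside map would pull the model's
// internal RDDs into the task closure, and RDDs are not serializable
def bad(model: MatrixFactorizationModel, usersProducts: RDD[(Int, Int)]) =
  usersProducts.map { case (u, p) => model.predict(u, p) }
```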
Best,
Hi,
https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf
contains some performance tests for streaming. There are examples of how to
generate synthetic files during the test in that repo, maybe you
can find some code snippets that you can use there.
Hi,
I've come across this multiple times, but not in a consistent manner. I found
it hard to reproduce. I have a jira for it: SPARK-3080
Do you observe this error every single time? Where do you load your data from?
Which version of Spark are you running?
Figuring out the similarities may
Hi Ray,
The reduceByKey / collectAsMap steps do a lot of computation. Therefore they can
take a very long time if:
1) the number-of-runs parameter is set very high
2) k is set high (you have observed this already)
3) the data is not properly repartitioned
It seems that it is hanging, but there is a lot
Hi,
It appears that the step size is so high that the model is diverging with the
added noise.
Could you try setting the step size to 0.1 or 0.01?
Best,
Burak
- Original Message -
From: Krishna Sankar ksanka...@gmail.com
To: user@spark.apache.org
Sent: Wednesday, October 1,
Hi,
spark-1.0.1/examples/src/main/python/kmeans.py => naive example for users to
understand how to code in Spark
spark-1.0.1/python/pyspark/mllib/clustering.py => use this!!!
Bonus: spark-1.0.1/examples/src/main/python/mllib/kmeans.py => example on how
to call KMeans. Feel free to use it as a
Hi,
I believe it's because you're trying to use a Function of an RDD inside an RDD,
which is not possible. Instead of using a
`Function<JavaRDD<Float>, Void>`, could you try a `Function<Float, Void>`, with
`public Void call(Float arg0) throws Exception {`
and
`System.out.println(arg0);`
instead. I'm not perfectly sure
Hi Gilberto,
Could you please attach the driver logs as well, so that we can pinpoint what's
going wrong? Could you also add the flag
`--driver-memory 4g` while submitting your application and try that as well?
Best,
Burak
- Original Message -
From: Gilberto Lira g...@scanboo.com.br
Hi,
The files you mentioned are temporary files written by Spark during shuffling.
ALS will write a LOT of those files, as it is a shuffle-heavy algorithm.
Those files will be deleted after your program completes, as Spark looks for
those files in case a fault occurs. Having those files ready in
the directory will not be enough.
Best,
Burak
- Original Message -
From: Andrew Ash and...@andrewash.com
To: Burak Yavuz bya...@stanford.edu
Cc: Макар Красноперов connector@gmail.com, user
user@spark.apache.org
Sent: Wednesday, September 17, 2014 10:19:42 AM
Subject: Re: Spark and disk usage.
Hi
Streaming, and some MLlib algorithms.
If you can help with the guide, I think it would be a nice feature to have!
Burak
- Original Message -
From: Andrew Ash and...@andrewash.com
To: Burak Yavuz bya...@stanford.edu
Cc: Макар Красноперов connector@gmail.com, user
user@spark.apache.org
Hi,
Could you try repartitioning the data with .repartition(# of cores on machine), or
supplying the minimum number of partitions while reading the data, as in
sc.textFile(path, # of cores on machine)?
It may be that the whole dataset is stored in one block. If it is billions of
rows, then the indexing
Hi,
The spacing between the inputs should be a single space, not a tab. I feel like
your inputs have tabs between them instead of a single space. Therefore the
parser
cannot parse the input.
Best,
Burak
- Original Message -
From: Sameer Tilak ssti...@live.com
To: user@spark.apache.org
Hi,
I'm not a master on SparkSQL, but from what I understand, the problem is that
you're trying to access an RDD
inside an RDD here: `val xyz = file.map(line => *** extractCurRate(sqlContext.sql("select rate ... ***` and
here: `xyz = file.map(line => *** extractCurRate(sqlContext.sql("select rate`
Hi,
val test = persons.value
  .map{ tuple => (tuple._1, tuple._2
  .filter{ event => *inactiveIDs.filter(event2 => event2._1 ==
    tuple._1).count() != 0* })}
Your problem is right between the asterisks. You can't make an RDD operation
inside an RDD operation, because RDDs can't be serialized.
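One common workaround (a sketch; it assumes `inactiveIDs` is small enough to collect to the driver) is to materialize the inner RDD first and use a plain local collection inside the closure:

```scala
// Bring the small RDD to the driver once; a Set is serializable, so it can be
// shipped inside the closure (or broadcast, for larger sets)
val inactiveSet = inactiveIDs.map(_._1).collect().toSet

val test = persons.value.map { tuple =>
  (tuple._1, tuple._2.filter(event => inactiveSet.contains(tuple._1)))
}
```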
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that
method, just turn the map into an RDD:
`sc.parallelize(x.toSeq).saveAsTextFile(...)`
Reading through the api-docs will present you many more alternate solutions!
Best,
Burak
- Original Message -
From: SK
Hi,
By default, Spark uses approximately 60% of the executor heap memory to store
RDDs. That's why you have 8.6GB instead of 16GB. 95.5 is therefore the sum of
all the 8.6 GB of executor memory plus the driver memory.
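That fraction is controlled by a conf (a sketch for spark-defaults.conf; 0.6 is already the default in this generation of Spark, so set it only to change the split):

```
spark.storage.memoryFraction  0.6
```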
Best,
Burak
- Original Message -
From: SK skrishna...@gmail.com
To:
Hi Sameer,
I've faced this issue before. They don't show up on
http://s3.amazonaws.com/big-data-benchmark/. But you can directly use:
`sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")`
The gotcha is that you also need to supply which dataset you want: crawl,
uservisits, or rankings
Hi,
The error doesn't occur during saveAsTextFile but rather during the groupByKey
as far as I can tell. We strongly urge users to not use groupByKey
if they don't have to. What I would suggest is the following work-around:
sc.textFile(baseFile).map { line =>
  val fields = line.split("\t")
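The message is cut off here; a hedged sketch of where that work-around is presumably headed (the key/value choice and `outputFile` below are hypothetical), replacing groupByKey with reduceByKey so values are combined map-side instead of materializing whole groups:

```scala
sc.textFile(baseFile)
  .map { line =>
    val fields = line.split("\t")
    (fields(0), 1L)        // hypothetical: key on the first field, count occurrences
  }
  .reduceByKey(_ + _)
  .saveAsTextFile(outputFile)
```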
Hi David,
Your job is probably hanging on the groupByKey process. Probably GC is kicking
in and the process starts to hang or the data is unbalanced and you end up with
stragglers (Once GC kicks in you'll start to get the connection errors you
shared). If you don't care about the list of
Spearman's Correlation requires the calculation of ranks for columns. You can
check out the code here and slice out the part you need!
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala
Best,
Burak
- Original
You can check out this pull request: https://github.com/apache/spark/pull/476
LDA is on the roadmap for the 1.2 release, hopefully we will officially support
it then!
Best,
Burak
- Original Message -
From: Denny Lee denny.g@gmail.com
To: user@spark.apache.org
Sent: Thursday, August
Hi,
// Initialize the optimizer using logistic regression as the loss function with
L2 regularization
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
// Set the hyperparameters
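The snippet is truncated; a hedged sketch of the setter chain it presumably continues with (setter names per MLlib's LBFGS of this era; the values are illustrative only):

```scala
lbfgs.setMaxNumIterations(100)
  .setRegParam(0.1)
  .setConvergenceTol(1e-4)
  .setNumCorrections(10)
```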
Hi,
Could you try running spark-shell with the flag --driver-memory 2g or more if
you have more RAM available and try again?
Thanks,
Burak
- Original Message -
From: AlexanderRiggers alexander.rigg...@gmail.com
To: u...@spark.incubator.apache.org
Sent: Thursday, August 7, 2014 7:37:40
Hi Jay,
I've had the same problem you've been having in Question 1 with a synthetic
dataset. I thought I wasn't producing the dataset well enough. This seems to
be a bug. I will open a JIRA for it.
Instead of using:
ratings.map{ case Rating(u,m,r) => {
  val pred = model.predict(u, m)
(r
The following code will allow you to run Logistic Regression using L-BFGS:
val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
lbfgs.setMaxNumIterations(numIterations)
  .setRegParam(regParam)
  .setConvergenceTol(tol)
  .setNumCorrections(numCor)
val weights = lbfgs.optimize(data, initialWeights)
Hi,
That is interesting. Would you please share some code on how you are setting
the regularization type, regularization parameters and running Logistic
Regression?
Thanks,
Burak
- Original Message -
From: SK skrishna...@gmail.com
To: u...@spark.incubator.apache.org
Sent: Wednesday,
Hi Tom,
Actually I was mistaken, sorry about that. Indeed on the website, the keys for
the datasets you mention are not showing up. However,
they are still accessible through the spark-shell, which means that they are
there.
So in order to answer your questions:
- Are the tiny and 1node sets
Hi Tom,
If you wish to load the file in Spark directly, you can use
`sc.textFile("s3n://big-data-benchmark/pavlo/...")`, where sc is your
SparkContext. This can be
done because the files should be publicly available, and you don't need AWS
Credentials to access them.
If you want to download the
Someone can correct me if I'm wrong, but unfortunately for now, once a
streaming context is stopped, it can't be restarted.
- Original Message -
From: Nick Chammas nicholas.cham...@gmail.com
To: u...@spark.incubator.apache.org
Sent: Wednesday, July 9, 2014 6:11:51 PM
Subject: