Hi,
The recently added NNLS implementation in MLlib returns wrong solutions.
This is not data specific; just try any data in R's nnls, and then the same
data in MLlib's NNLS. The results are very different.
Also, the chosen algorithm, Polyak (1969), is not the best one around. The
most popular one
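For anyone wanting to reproduce the comparison, here is a minimal sketch (not MLlib code) that generates a reference NNLS solution with SciPy's nnls, which is analogous to R's nnls; the data here is arbitrary:

```python
# Hedged sketch: build a small reference NNLS solution with SciPy
# (analogous to R's nnls) to compare against MLlib's output.
import numpy as np
from scipy.optimize import nnls

# Tiny reproducible problem; any data would do for the comparison.
rng = np.random.default_rng(42)
A = rng.standard_normal((6, 3))
x_true = np.array([1.0, 0.0, 2.0])   # nonnegative ground truth
b = A @ x_true                       # exact right-hand side

# Solve min ||Ax - b||_2 subject to x >= 0
x_ref, residual = nnls(A, b)
print(x_ref)   # should recover approximately [1, 0, 2]
```

Feeding the same A and b to MLlib's NNLS should give essentially the same vector; a large discrepancy would confirm the bug.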
Hi,
Is there an implementation for Nonnegative Matrix Factorization in Spark? I
understand that MLlib comes with matrix factorization, but it does not seem
to cover the nonnegative case.
On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That’s technically true, but I’d be surprised if there wasn’t a lot of
room for improvement in spark-ec2 regarding cluster launch+config times.
Unfortunately, this is not a spark issue, but an AWS one.
Hi,
Today Google announced their cloud dataflow, which is very similar to spark
in performing batch processing and stream processing.
How does spark compare to Google cloud dataflow? Are they solutions aimed
at the same problem?
Regards
On Thu, May 8, 2014 at 2:58 AM, Aureliano Buendia buendia...@gmail.comwrote:
Please send a pull request; this should be maintained by the community,
in case you do not feel like continuing to maintain it yourself.
Also, it's nice to see that the gce version is shorter than the aws version.
Yes, things get more unstable with larger data. But that's the whole point
of my question:
Why should spark get unstable when data gets larger?
When data gets larger, spark should get *slower*, not more unstable. Lack
of stability makes parameter tuning very difficult, time consuming and a
Hi,
Sometimes the very same spark application binary behaves differently on
every execution.
- The Ganglia profile is different with every execution: sometimes it
takes 0.5 TB of memory, the next time it takes 1 TB of memory, the next
time it is 0.75 TB...
- Spark UI shows
:01 PM, Mark Hamstra m...@clearstorydata.comwrote:
Please file an issue: Spark Project
JIRAhttps://issues.apache.org/jira/browse/SPARK
On Fri, Apr 18, 2014 at 10:25 AM, Aureliano Buendia
buendia...@gmail.com wrote:
Hi,
I just noticed that sc.makeRDD() does not make all values given
Hi,
Since 0.9.0 spark-ec2 has gone unstable. During launch it throws many
errors like:
ssh: connect to host ec-xx-xx-xx-xx.compute-1.amazonaws.com port 22:
Connection refused
Error 255 while executing remote command, retrying after 30 seconds
... and recently, it prompts for passwords:
about
the password request; I haven't seen that on my end.
Regards,
Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466
On Fri, Apr 18, 2014 at 8:57 PM, Aureliano Buendia
buendia...@gmail.comwrote:
Hi,
Since 0.9.0 spark-ec2 has gone unstable. During
Hi,
Spark-ec2 uses rsync to deploy many applications. It seems that over time
more and more applications have been added to the script, which has
significantly slowed down the setup time.
Perhaps the script could be restructured this way: instead of rsyncing
N times, once per application, we could have
Hi,
I noticed spark machine learning examples use training data to validate
regression models. For instance, in the linear regression example
(http://spark.apache.org/docs/0.9.0/mllib-guide.html):
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map {
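To illustrate the concern, here is a hedged sketch in plain NumPy (standing in for MLlib's regression API; all names and numbers are illustrative) of evaluating on a held-out test split rather than on the training data itself:

```python
# Hedged sketch: hold out a test set instead of measuring error on the
# training examples. NumPy least squares stands in for the MLlib model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)   # noisy targets

# 80/20 train/test split
split = 80
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Fit on the training portion only
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Training error (what the example reports) vs held-out error
train_mse = np.mean((X_train @ w - y_train) ** 2)
test_mse = np.mean((X_test @ w - y_test) ** 2)
```

Reporting only train_mse, as the docs example does, hides overfitting; the held-out test_mse is the more honest number.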
On Tue, Mar 18, 2014 at 12:56 PM, Ognen Duzlevski
og...@plainvanillagames.com wrote:
On 3/18/14, 4:49 AM, dmpou...@gmail.com wrote:
On Sunday, 2 March 2014 19:19:49 UTC+2, Aureliano Buendia wrote:
Is there a reason for spark using the older akka?
On Sun, Mar 2, 2014 at 1:53 PM, 1esha
Hi,
Our spark app reduces a few 100 gb of data to a few 100 kb of csv. We
found that a partition number of 1000 is a good number to speed the process
up. However, it does not make sense to have 1000 pieces of csv files each
less than 1 kb.
We used RDD.coalesce(1) to get only 1 csv file, but
through one reduce node for
writing it out. That's probably the fastest it will get. No need to cache
if you do that.
Matei
On Mar 21, 2014, at 4:04 PM, Aureliano Buendia buendia...@gmail.com
wrote:
Hi,
Our spark app reduces a few 100 gb of data to a few 100 kb of csv. We
found
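As an alternative to coalesce(1), one hedged sketch is to let the job write its many small part files in parallel and then concatenate them into a single CSV outside Spark afterwards (the paths and helper name here are illustrative, not Spark API):

```python
# Hedged sketch: merge Spark-style part-* output files into one CSV
# after the job finishes, instead of forcing a single partition.
import glob
import os
import shutil

def merge_part_files(out_dir, merged_path):
    """Concatenate part-* files from out_dir into a single file,
    in lexicographic (i.e. partition) order."""
    with open(merged_path, "wb") as out:
        for part in sorted(glob.glob(os.path.join(out_dir, "part-*"))):
            with open(part, "rb") as f:
                shutil.copyfileobj(f, out)
```

This keeps the write parallel across all 1000 partitions and pays the single-file cost only once, after the job completes.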
I think you bumped the wrong thread.
As I mentioned in the other thread:
saveAsHadoopFile only applies compression when the codec is available, and
it does not seem to respect the global hadoop compression properties.
I'm not sure if this is a feature or a bug in spark.
If this is a feature,
Hi,
After sorting an RDD and writing it to hadoop, would the RDD still be
sorted when reading it back?
Can sorting be guaranteed after reading back, when the RDD was written as 1
partition with rdd.coalesce(1)?
Hi,
Is the spark docker script now mature enough to substitute the spark-ec2
script? Is anyone here using the docker script in production?
Also, in this talk http://www.youtube.com/watch?v=OhpjgaBVUtU on using
spark streaming in production, the author seems to have missed the topic of
how to manage cloud instances.
On Fri, Feb 28, 2014 at 6:48 PM, Aureliano Buendia buendia...@gmail.comwrote:
What's the updated way of deploying
Hi,
Running:
./bin/run-example org.apache.spark.streaming.examples.SimpleZeroMQPublisher
tcp://127.0.1.1:1234 foo
causes over 100% cpu usage on os x. Given that it's just a simple zmq
publisher, this shouldn't be expected. Is there something wrong with that
example?
the spark app, or the
spark cluster?
How is it possible to gracefully shut down a spark app?
(2) buildup of logs in the work/ directory or files in the Spark tmp
directory, and (3) bug in Spark (woo!).
On Tue, Feb 4, 2014 at 5:58 AM, Aureliano Buendia buendia...@gmail.comwrote:
On Mon
source support as well? (Eg kafka
requires setting up zookeeper).
On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia
buendia...@gmail.comwrote:
Hi,
Does the ec2 support for spark 0.9 also include spark streaming? If not,
is there an equivalent?