[Spark ML]: Implement the Conjugate Gradient method for ALS

2017-11-30 Thread Nate Wendt
The conjugate gradient method has been shown to be very efficient at
solving the least squares error problem in matrix factorization:
http://www.benfrederickson.com/fast-implicit-matrix-factorization/.

This post is motivated by:
https://pdfs.semanticscholar.org/bfdf/7af6cf7fd7bb5e6b6db5bbd91be11597eaf0.pdf

Implementing this in Spark could mean a significant speedup in the ALS solve,
since the method's order of growth is smaller than that of the default
(Cholesky) solver. This has the potential to improve the training phase of
collaborative filtering significantly.
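
For concreteness, the per-user subproblem and the CG trick can be sketched in a
few lines of NumPy. This is a standalone illustration of the technique from the
paper above, not the Spark implementation; `solve_user_cg` and every name in it
are made up for the example. The key saving is that Y^T Y (plus the
regularizer) is computed once per sweep, so each CG iteration costs only a few
matrix-vector products over the observed items instead of building and
Cholesky-factorizing a fresh f x f system per user:

```python
import numpy as np

def solve_user_cg(Y, Cu, pu, lam, n_iters=3):
    """Solve (Y^T diag(Cu) Y + lam*I) x = Y^T (Cu * pu) for one user by
    conjugate gradient. Exploits the implicit-feedback structure: most
    confidences equal 1, so the matrix-vector product only needs the
    precomputed Y^T Y plus a correction over the observed items."""
    f = Y.shape[1]
    YtY = Y.T @ Y + lam * np.eye(f)       # shared across all users in a sweep
    x = np.zeros(f)
    r = Y.T @ (Cu * pu)                   # residual b - A @ x, with x = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(n_iters):
        if rs_old < 1e-20:                # already converged
            break
        # Implicit mat-vec: A @ p = (YtY + Y^T diag(Cu - 1) Y) @ p
        Ap = YtY @ p + Y.T @ ((Cu - 1.0) * (Y @ p))
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```

A quick sanity check is to compare the CG answer against a direct dense solve
of the same normal equations; with enough iterations (at most f, in exact
arithmetic) the two agree to numerical precision.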

I've opened a JIRA ticket
 but thought I'd reach
out here as well, since I've implemented the algorithm (for implicit
feedback) and demonstrated its correctness, but I am having trouble
actually seeing a performance speedup, likely due to incorrect handling of
RDD persistence/checkpointing. I wasn't sure of the best way to reach out
to see if there were dev cycles available to collaborate on completing this
solution, but I figure it has the potential to have a big impact within
Spark and MLlib. If there is interest, I can open a pull request with the
functionally correct code I have as of now.

*Note: we are seeing collaborative filtering training times of over 3 hours
within Spark (4 large node instances), compared to ~8 minutes on a single
machine running the Implicit library cited above. It would be great to get
this kind of speedup within Spark and potentially benefit from the added
parallelism.*

Thanks,
Nathaniel


RE: Spark on Apache Ignite?

2016-01-05 Thread nate
We started playing with Ignite backing Hadoop, Hive, and Spark services, and
are looking to move to it as our default for deployments going forward. Still
early, but so far it's been pretty nice, and we're excited about the
flexibility it will provide for our particular use cases.

Would say in general it's worth looking into if your data workloads:

a) are a mix of read/write, or heavy write at times
b) need write/read access to data from services/apps outside of your Spark
workloads (old Hadoop jobs, custom apps, etc.)
c) include strings of Spark jobs that could benefit from caching your data
across them (think similar usage to Tachyon)
d) have SparkSQL queries that could benefit from indexing and mutability
(see pt (a) about mixed read/write)

If your data is read-exclusive and very batch oriented, and your workloads
are strictly Spark based, the benefits will be smaller and Ignite would
probably act as more of a Tachyon replacement, as many of the features
outside of RDD caching won't be leveraged.


-Original Message-
From: unk1102 [mailto:umesh.ka...@gmail.com] 
Sent: Tuesday, January 5, 2016 10:15 AM
To: user@spark.apache.org
Subject: Spark on Apache Ignite?

Hi, has anybody tried and had success with Spark on Apache Ignite? It seems
promising: https://ignite.apache.org/



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-tp25884.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org






Powered by Spark page

2015-11-12 Thread Nate Kupp
Hello,

We are using Spark at Thumbtack for all of our big data work. Would love to
get added to the powered by spark page:

https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Requested info:

*Organization name:* Thumbtack
*URL:* thumbtack.com
*Spark components:* Spark Core, Spark SQL, MLlib
*Use case: *We are using Spark for supporting analytics on both our
relational and event data, building data products, and big data processing.

Thanks!

-Nate


RE: Benchmark results between Flink and Spark

2015-07-05 Thread nate
Maybe some of the Flink benefit comes from the points they outline here:

http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
Probably if the benchmarks were re-run with the 1.5/Tungsten line, the gap
would close a bit (or a lot), with Spark moving towards a similar style of
off-heap memory management and more planner optimizations.

From: Jerry Lam [mailto:chiling...@gmail.com] 
Sent: Sunday, July 5, 2015 6:28 PM
To: Ted Yu
Cc: Slim Baltagi; user
Subject: Re: Benchmark results between Flink and Spark

Hi guys,

I just read the paper too. There is not much information on why Flink is
faster than Spark for data-science-type workloads in the benchmark. It is
very difficult to generalize from the conclusions of a benchmark, from my
point of view. How much experience the authors have with Spark in comparison
to Flink is one of the immediate questions I have. It would be great if they
made the benchmark software available somewhere for other people to
experiment with.

just my 2 cents,

Jerry

On Sun, Jul 5, 2015 at 4:35 PM, Ted Yu yuzhih...@gmail.com wrote:

There was no mention of the versions of Flink and Spark used in the
benchmarking.

The size of the cluster is quite small.

Cheers

On Sun, Jul 5, 2015 at 10:24 AM, Slim Baltagi sbalt...@gmail.com wrote:

Hi

Apache Flink outperforms Apache Spark in processing machine learning & graph
algorithms and relational queries, but not in batch processing!

The results were published in the proceedings of the 18th International
Conference, Business Information Systems 2015, PoznaƄ, Poland, June 24-26,
2015.

Thanks to our friend Google, Chapter 3: 'Evaluating New Approaches of Big
Data Analytics Frameworks' by Norman Spangenberg, Martin Roth and Bogdan
Franczyk is available for preview at http://goo.gl/WocQci on pages 28-37.

Enjoy!

Slim Baltagi
http://www.SparkBigData.com




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Benchmark-results-between-Flink-and-Spark-tp23626.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: Word2Vec with billion-word corpora

2015-05-19 Thread nate
Might also want to look at the Y! post; looks like they are experimenting
with similar efforts in large-scale word2vec:

http://yahooeng.tumblr.com/post/118860853846/distributed-word2vec-on-top-of-pistachio



-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Tuesday, May 19, 2015 1:25 PM
To: Shilad Sen
Cc: user
Subject: Re: Word2Vec with billion-word corpora

With vocabulary size 4M and vector size 400, you need 400 * 4M = 1.6B floats
per copy of the model, about 6.4 GB in single precision, and training holds
more than one copy. We store the model on the driver node in the current
implementation, so I don't think it would work. You might try increasing
minCount to decrease the vocabulary size and reducing the vector size. I'm
interested in learning the trade-off between the model size and the model
quality. If you have done some experiments, please let me know. Thanks!
-Xiangrui
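
For anyone repeating the back-of-the-envelope sizing, a tiny sketch (the
helper name is made up for the example; it assumes 4-byte floats and counts
only the raw vector table, not vocabulary strings or aggregation buffers):

```python
def word2vec_model_bytes(vocab_size, vector_size, bytes_per_float=4, copies=1):
    """Rough size of the in-memory word-vector table: one float per
    (word, dimension) pair, times however many copies training keeps
    around (e.g. input and output vectors, aggregation buffers)."""
    return vocab_size * vector_size * bytes_per_float * copies

# 4M-word vocabulary x 400 dimensions, single copy:
single_copy = word2vec_model_bytes(4_000_000, 400)   # 6.4e9 bytes, ~6.4 GB
```

Even at ~6.4 GB per copy, a driver holding several copies during training is
pushed well past typical heap sizes, which is the practical constraint here.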

On Wed, May 13, 2015 at 11:17 AM, Shilad Sen s...@macalester.edu wrote:
 Hi all,

 I'm experimenting with Spark's Word2Vec implementation for a
 relatively large corpus (5B words, vocabulary size 4M, 400-dimensional
 vectors). Has anybody had success running it at this scale?

 Thanks in advance for your guidance!

 -Shilad

 --
 Shilad W. Sen
 Associate Professor
 Mathematics, Statistics, and Computer Science Dept.
 Macalester College
 s...@macalester.edu
 http://www.shilad.com
 https://www.linkedin.com/in/shilad
 651-696-6273




RE: Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-03 Thread nate
Spark SQL supports JDBC/ODBC connectivity, so if that's the route you
need/want to connect through, you can do so via Java/PHP apps. Haven't used
either, so can't speak to the developer experience; assume it's pretty good,
as it would be the preferred method for lots of third-party enterprise
apps/tooling.

If you prefer using the Thrift server/interface: if they don't already exist
in open-source land, you can use the Thrift definitions to generate client
libs in any supported Thrift language and use those for connectivity. One
known issue with the Thrift server shows up when running in cluster mode; it
seems the issue still exists, but the UX of the error has been cleaned up in
1.3:

https://issues.apache.org/jira/browse/SPARK-5176
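
As a sketch of the client side (shown in Python rather than Java/PHP, but the
same idea applies): the Thrift server speaks the HiveServer2 protocol, so any
HiveServer2 client works. This assumes the third-party `pyhive` package, and
the host, port, and table below are placeholders, not anything from this
thread:

```python
def fetch_row_count(host, table, port=10000, username="spark"):
    """Query a Spark SQL Thrift Server over the HiveServer2 protocol.
    Requires the third-party `pyhive` package; imported lazily so this
    sketch can be loaded without it installed."""
    from pyhive import hive
    conn = hive.connect(host=host, port=port, username=username)
    try:
        cur = conn.cursor()
        cur.execute("SELECT count(*) FROM {}".format(table))
        return cur.fetchone()[0]
    finally:
        conn.close()
```

For Java, the standard route is the Hive JDBC driver against the same port;
for PHP, a generated Thrift client or an ODBC bridge plays the same role.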



-Original Message-
From: fanooos [mailto:dev.fano...@gmail.com] 
Sent: Tuesday, March 3, 2015 11:15 PM
To: user@spark.apache.org
Subject: Connecting a PHP/Java applications to Spark SQL Thrift Server

We have installed a Hadoop cluster with Hive and Spark, and the Spark SQL
thrift server is up and running without any problem.

Now we have a set of applications that need to use the Spark SQL thrift
server to query some data.

Some of these applications are Java applications and the others are PHP
applications.

As an old-fashioned Java developer, I used to connect Java applications to
DB servers like MySQL using a JDBC driver. Is there a corresponding driver
for connecting to the Spark SQL Thrift server? Or what is the library I need
to use to connect to it?

For PHP, what are the ways we can connect PHP applications to the Spark SQL
Thrift Server?





--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Connecting-a-PHP-Java-applications-to-Spark-SQL-Thrift-Server-tp21902.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: Apache Ignite vs Apache Spark

2015-02-26 Thread nate
The Ignite guys spoke at the Bigtop workshop last week at SCALE; slides
posted here:

https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x

A couple of main points from comments made during the preso: although it is
incubating at Apache (the first code drop was last week, I believe), the
tech is battle-tested with many enterprises over the past years (want to say
he said ~8 yrs since GridGain started building the solution).

On the roadmap: believe Ignite provides all the main features of the core
GridGain in-memory offering. There are a slide or two in the preso that mark
enterprise features with an asterisk; think most were around
mgmt./monitoring of production clusters. The presenter also mentioned that,
while he can't promise it at the moment, they are working on open-sourcing
those components as well under the Ignite project.


-Original Message-
From: Jay Vyas [mailto:jayunit100.apa...@gmail.com] 
Sent: Thursday, February 26, 2015 3:40 PM
To: Sean Owen
Cc: Ognen Duzlevski; user@spark.apache.org
Subject: Re: Apache Ignite vs Apache Spark

- https://wiki.apache.org/incubator/IgniteProposal has, I think, been
updated recently and has a good comparison.

- Although GridGain has been around since the early Spark days, Apache
Ignite is quite new and just getting started, I think.

- You will probably want to reach out to the developers for details on
Ignite's roadmap, because there might be interesting details not yet
codified.

 On Feb 26, 2015, at 1:08 PM, Sean Owen so...@cloudera.com wrote:
 
 Ignite is the renaming of GridGain, if that helps. It's like Oracle 
 Coherence, if that helps. These do share some similarities -- fault 
 tolerant, in-memory, distributed processing. The pieces they're built 
 on differ, the architecture differs, the APIs differ. So fairly 
 different in particulars. I never used the above, so can't be much 
 more useful.
 
 On Thu, Feb 26, 2015 at 5:46 PM, Ognen Duzlevski 
 ognen.duzlev...@gmail.com wrote:
 Can someone with experience briefly share or summarize the 
 differences between Ignite and Spark? Are they complementary? Totally
unrelated?
 Overlapping? Seems like ignite has reached version 1.0, I have never 
 heard of it until a few days ago and given what is advertised, it 
 sounds pretty interesting but I am unsure how this relates to or differs
from Spark.
 
 Thanks!
 Ognen
 




RE: Submit Spark applications from a machine that doesn't have Java installed

2015-01-11 Thread Nate D'Amico
Can't speak to the internals of SparkSubmit and how to reproduce it sans
JVM; guess it would depend on whether you want/need to support various
deployment environments (standalone, Mesos, YARN, etc.).

If you just need YARN, or are looking for a starting point, you might want
to look at the capabilities of the YARN ResourceManager REST API:

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_APISubmit_Application
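
As a sketch of what driving that API without a local JVM might look like
(standard library only; the resource-manager host, application id, and launch
command below are placeholders, and a real submission also needs the
new-application handshake, credentials, and local-resource setup the docs
describe):

```python
import json
from urllib import request

def build_submission(app_id, command, queue="default"):
    """Minimal JSON body for POST /ws/v1/cluster/apps (the Cluster
    Applications API). A real request carries more fields: resource
    requests, local files, ACLs, environment, and so on."""
    return {
        "application-id": app_id,
        "application-name": "submitted-via-rest",
        "application-type": "SPARK",
        "queue": queue,
        "am-container-spec": {"commands": {"command": command}},
    }

def submit_app(rm_host, app_id, command, queue="default", port=8088):
    """POST the submission to the ResourceManager; the RM answers
    202 Accepted when it takes the application."""
    body = json.dumps(build_submission(app_id, command, queue)).encode()
    req = request.Request(
        "http://{}:{}/ws/v1/cluster/apps".format(rm_host, port),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)
```

The awkward part in practice is everything around the POST: staging the
application jars into HDFS and obtaining an application id first, which is why
a JVM-free submitter is more work than it first appears.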
From: Nick Chammas [mailto:nicholas.cham...@gmail.com] 
Sent: Sunday, January 11, 2015 1:45 PM
To: user@spark.apache.org
Subject: Submit Spark applications from a machine that doesn't have Java
installed
Is it possible to submit a Spark application to a cluster from a machine
that does not have Java installed?
My impression is that many, many more computers come with Python installed
by default than do with Java.
I want to write a command-line utility
(https://issues.apache.org/jira/browse/SPARK-3499) that submits a Spark
application to a remote cluster. I want that utility to run on as many
machines as possible out-of-the-box, so I want to avoid a dependency on Java
(or a JRE) if possible.
Nick
View this message in context: Submit Spark applications from a machine that
doesn't have Java installed
http://apache-spark-user-list.1001560.n3.nabble.com/Submit-Spark-applications-from-a-machine-that-doesn-t-have-Java-installed-tp21085.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.