Re: mvn build hangs on: Dependency-reduced POM written at bagel/dependency-reduced-pom.xml

2015-07-04 Thread Alec Taylor
Thanks, will just build from spark-1.4.0.tgz in the meantime.

On Sun, Jul 5, 2015 at 2:52 PM, Ted Yu yuzhih...@gmail.com wrote:

 See this thread:


 http://search-hadoop.com/m/q3RTt4CqUGAvnPj2/Spark+master+buildsubj=Re+Can+not+build+master


  On Jul 4, 2015, at 9:44 PM, Alec Taylor alec.tayl...@gmail.com wrote:
 
  Running: `build/mvn -DskipTests clean package` on Ubuntu 15.04 (amd64,
 3.19.0-21-generic) with Apache Maven 3.3.3 starts to build fine, then just
 keeps outputting these lines:
 
  [INFO] Dependency-reduced POM written at:
 /spark/bagel/dependency-reduced-pom.xml
 
  I've kept it running for an hour.
 
  How do I build Spark?
 
  Thanks for all suggestions



mvn build hangs on: Dependency-reduced POM written at bagel/dependency-reduced-pom.xml

2015-07-04 Thread Alec Taylor
Running: `build/mvn -DskipTests clean package` on Ubuntu 15.04 (amd64,
3.19.0-21-generic) with Apache Maven 3.3.3 starts to build fine, then just
keeps outputting these lines:

[INFO] Dependency-reduced POM written at:
/spark/bagel/dependency-reduced-pom.xml

I've kept it running for an hour.

How do I build Spark?

Thanks for all suggestions


Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-05 Thread Alec Taylor
Thanks all. To answer your clarification questions:

- I'm writing this in Python
- A similar problem to my actual one is to find the 30-minute slots
(over the next 12 months) [r] that k users have in common. Total
users: n. Given n=1,000 and r=17,472, the [naïve] time-complexity
is $\mathcal{O}(nr)$; n*r = 17,472,000. I may be able to get
$\mathcal{O}(n \log r)$, if not $\log \log$, from reading the literature
on sequence matching, though this is uncertain (rough sketch below).
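
For concreteness, here is a rough Python sketch of that naive $\mathcal{O}(nr)$
pass. Everything concrete in it (the boolean "one flag per user per 30-minute
slot" representation, the random availability data, k=5, and the derivation
r = 364 * 48 ~= 17,472) is an illustrative assumption, not taken from the
actual problem:

    # Illustrative sketch only: data, k, and the slot count are all assumptions.
    import numpy as np

    R = 364 * 48     # ~17,472 half-hour slots over the next 12 months (assumed derivation)
    N = 1_000        # total users
    K = 5            # a slot "works" if at least K users are free in it

    rng = np.random.default_rng(0)
    # availability[i, j] is True when user i is free in slot j (random stand-in data)
    availability = rng.random((N, R)) < 0.1

    # The naive pass touches all n*r cells (17,472,000 here): count, per slot,
    # how many users are free, then keep the slots shared by at least K users.
    free_per_slot = availability.sum(axis=0)
    common_slots = np.flatnonzero(free_per_slot >= K)
    print(f"{common_slots.size} of {R} slots are free for at least {K} of {N} users")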

So assuming all the other business-logic which needs to be built in,
such as authentication and various other CRUD operations, as well as
this more intensive sequence searching operation, what stack would be
best for me?

Thanks for all suggestions

On Mon, Jan 5, 2015 at 4:24 PM, Jörn Franke jornfra...@gmail.com wrote:
 Hello,

 It really depends on your requirements: what kind of machine-learning
 algorithm, your budget, whether you are doing something genuinely new or
 integrating it with an existing application, etc. You can also run MongoDB as
 a cluster. I don't think this question can be answered in general; it depends
 on the details of your case.

 Best regards

 On 4 Jan 2015 at 01:44, Alec Taylor alec.tayl...@gmail.com wrote:

 In the middle of doing the architecture for a new project, which has
 various machine learning and related components, including:
 recommender systems, search engines and sequence [common intersection]
 matching.

 Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue,
 backed by Redis).

 Though I don't have experience with Hadoop, I was thinking of using
 Hadoop for the machine-learning (as this will become a Big Data
 problem quite quickly). To push the data into Hadoop, I would use a
 connector of some description, or push the MongoDB backups into HDFS
 at set intervals.

 However I was thinking that it might be better to put the whole thing
 in Hadoop, store all persistent data in Hadoop, and maybe do all the
 layers in Apache Spark (with caching remaining in Redis).

 Is that a viable option? - Most of what I see discusses Spark (and
 Hadoop in general) for analytics only. Apache Phoenix exposes a nice
 interface for read/write over HBase, so I might use that if Spark ends
 up being the wrong solution.

 Thanks for all suggestions,

 Alec Taylor

 PS: I need this for both Big and Small data. Note that I am using
 the Cloudera definition of Big Data referring to processing/storage
 across more than 1 machine.




Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-05 Thread Alec Taylor
Thanks Simon, that's a good way to train on incoming events (and handle
the related problem / result computations).

However, does it handle the actual data storage, e.g. CRUD operations on documents?

On Tue, Jan 6, 2015 at 1:18 PM, Simon Chan simonc...@gmail.com wrote:
 Alec,

 If you are looking for a Machine Learning stack that supports
 business logic, you may take a look at PredictionIO:
 http://prediction.io/

 It's based on Spark and HBase.

 Simon


 On Mon, Jan 5, 2015 at 6:14 PM, Alec Taylor alec.tayl...@gmail.com wrote:

 Thanks all. To answer your clarification questions:

 - I'm writing this in Python
 - A similar problem to my actual one is to find the 30-minute slots
 (over the next 12 months) [r] that k users have in common. Total
 users: n. Given n=1,000 and r=17,472, the [naïve] time-complexity
 is $\mathcal{O}(nr)$; n*r = 17,472,000. I may be able to get
 $\mathcal{O}(n \log r)$, if not $\log \log$, from reading the literature
 on sequence matching, though this is uncertain.

 So assuming all the other business-logic which needs to be built in,
 such as authentication and various other CRUD operations, as well as
 this more intensive sequence searching operation, what stack would be
 best for me?

 Thanks for all suggestions

 On Mon, Jan 5, 2015 at 4:24 PM, Jörn Franke jornfra...@gmail.com wrote:
  Hello,
 
  It really depends on your requirements: what kind of machine-learning
  algorithm, your budget, whether you are doing something genuinely new or
  integrating it with an existing application, etc. You can also run MongoDB
  as a cluster. I don't think this question can be answered in general; it
  depends on the details of your case.
 
  Best regards
 
  On 4 Jan 2015 at 01:44, Alec Taylor alec.tayl...@gmail.com wrote:
 
  In the middle of doing the architecture for a new project, which has
  various machine learning and related components, including:
  recommender systems, search engines and sequence [common intersection]
  matching.
 
  Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue,
  backed by Redis).
 
  Though I don't have experience with Hadoop, I was thinking of using
  Hadoop for the machine-learning (as this will become a Big Data
  problem quite quickly). To push the data into Hadoop, I would use a
  connector of some description, or push the MongoDB backups into HDFS
  at set intervals.
 
  However I was thinking that it might be better to put the whole thing
  in Hadoop, store all persistent data in Hadoop, and maybe do all the
  layers in Apache Spark (with caching remaining in Redis).
 
  Is that a viable option? - Most of what I see discusses Spark (and
  Hadoop in general) for analytics only. Apache Phoenix exposes a nice
  interface for read/write over HBase, so I might use that if Spark ends
  up being the wrong solution.
 
  Thanks for all suggestions,
 
  Alec Taylor
 
  PS: I need this for both Big and Small data. Note that I am using
  the Cloudera definition of Big Data referring to processing/storage
  across more than 1 machine.
 



Spark for core business-logic? - Replacing: MongoDB?

2015-01-03 Thread Alec Taylor
In the middle of doing the architecture for a new project, which has
various machine learning and related components, including:
recommender systems, search engines and sequence [common intersection]
matching.

Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue,
backed by Redis).

Though I don't have experience with Hadoop, I was thinking of using
Hadoop for the machine-learning (as this will become a Big Data
problem quite quickly). To push the data into Hadoop, I would use a
connector of some description, or push the MongoDB backups into HDFS
at set intervals.
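
To make the connector route concrete, a hedged PySpark sketch of reading a
MongoDB collection straight into Spark follows. The mongo-hadoop connector,
its class names, the mongo.input.uri key, and the database/collection names
are assumptions to check against whichever connector (and version) is
actually used, and the connector jar would need to be on the classpath
(e.g. via --jars):

    # Rough, unverified sketch: pull a MongoDB collection into Spark through the
    # mongo-hadoop connector instead of shipping backups into HDFS on a schedule.
    # Class names and the config key follow that project's published examples.
    from pyspark import SparkContext

    sc = SparkContext(appName="mongo-to-spark-sketch")

    mongo_conf = {
        # hypothetical database and collection
        "mongo.input.uri": "mongodb://localhost:27017/mydb.events",
    }

    rdd = sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf=mongo_conf,
    )

    print(rdd.take(1))   # inspect one (key, document) pair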

However I was thinking that it might be better to put the whole thing
in Hadoop, store all persistent data in Hadoop, and maybe do all the
layers in Apache Spark (with caching remaining in Redis).
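
And for the "caching remaining in Redis" part, a minimal sketch of that
split: Spark does the heavy computation, Redis holds the small result so the
CRUD/web layer never touches Spark directly. The key name, TTL, and toy
aggregation are invented for illustration:

    # Minimal illustrative sketch; key name, TTL, and the placeholder job are made up.
    import json

    import redis
    from pyspark import SparkContext

    sc = SparkContext(appName="spark-redis-cache-sketch")
    cache = redis.StrictRedis(host="localhost", port=6379, db=0)

    # placeholder aggregation standing in for the real (e.g. sequence-matching) job
    counts = sc.parallelize(["a", "b", "a", "c", "a"]).countByValue()

    # park the small driver-side result in Redis for an hour
    cache.setex("latest_counts", 3600, json.dumps(dict(counts)))
    print(cache.get("latest_counts"))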

Is that a viable option? - Most of what I see discusses Spark (and
Hadoop in general) for analytics only. Apache Phoenix exposes a nice
interface for read/write over HBase, so I might use that if Spark ends
up being the wrong solution.

Thanks for all suggestions,

Alec Taylor

PS: I need this for both Big and Small data. Note that I am using
the Cloudera definition of Big Data referring to processing/storage
across more than 1 machine.
