Re: mvn build hangs on: Dependency-reduced POM written at bagel/dependency-reduced-pom.xml
Thanks, will just build from spark-1.4.0.tgz in the meantime.

On Sun, Jul 5, 2015 at 2:52 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> See this thread:
> http://search-hadoop.com/m/q3RTt4CqUGAvnPj2/Spark+master+buildsubj=Re+Can+not+build+master
>
> On Jul 4, 2015, at 9:44 PM, Alec Taylor <alec.tayl...@gmail.com> wrote:
>> Running: `build/mvn -DskipTests clean package` on Ubuntu 15.04 (amd64, 3.19.0-21-generic) with Apache Maven 3.3.3 starts to build fine, then just keeps outputting these lines:
>>
>>     [INFO] Dependency-reduced POM written at: /spark/bagel/dependency-reduced-pom.xml
>>
>> I've kept it running for an hour. How do I build Spark? Thanks for all suggestions.
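For context: the endlessly repeated "Dependency-reduced POM written at" line was the symptom of a known infinite loop in maven-shade-plugin 2.4, which Spark master briefly depended on. A sketch of the usual workaround is to pin the plugin to a fixed release in the root pom.xml; the 2.4.1 version below is an assumption based on the plugin's later releases, not something stated in this thread, so check the thread Ted links before relying on it:

    <!-- Sketch only: pin maven-shade-plugin past the release with the
         dependency-reduced-pom infinite loop. The 2.4.1 version is an
         assumption, not confirmed in this thread. -->
    <build>
      <pluginManagement>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.1</version>
          </plugin>
        </plugins>
      </pluginManagement>
    </build>

Building from the released spark-1.4.0.tgz, as above, sidesteps the issue entirely, since the release predates the broken plugin version.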
mvn build hangs on: Dependency-reduced POM written at bagel/dependency-reduced-pom.xml
Running: `build/mvn -DskipTests clean package` on Ubuntu 15.04 (amd64, 3.19.0-21-generic) with Apache Maven 3.3.3 starts to build fine, then just keeps outputting these lines:

    [INFO] Dependency-reduced POM written at: /spark/bagel/dependency-reduced-pom.xml

I've kept it running for an hour. How do I build Spark? Thanks for all suggestions.
Re: Spark for core business-logic? - Replacing: MongoDB?
Thanks all. To answer your clarification questions:

- I'm writing this in Python.
- A similar problem to my actual one is to find the common 30-minute slots (over the next 12 months) [r] that k users have in common, out of n total users. Given n = 1000 and r = 17472, the naïve time complexity is $\mathcal{O}(nr)$, and n·r = 17,472,000. I may be able to get $\mathcal{O}(n \log r)$, if not $\mathcal{O}(n \log \log r)$, from the literature on sequence matching, but this is uncertain.

So, assuming all the other business logic which needs to be built in (such as authentication and various other CRUD operations), as well as this more intensive sequence-searching operation, what stack would be best for me? Thanks for all suggestions.

On Mon, Jan 5, 2015 at 4:24 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> Hello,
> It really depends on your requirements: what kind of machine-learning algorithm, your budget, whether you are currently doing something really new or integrating it with an existing application, etc. You can run MongoDB as a cluster as well. I don't think this question can be answered in general; it depends on the details of your case.
> Best regards
>
> On 4 Jan 2015 at 01:44, Alec Taylor <alec.tayl...@gmail.com> wrote:
>> [original post trimmed; quoted in full below under "Spark for core business-logic? - Replacing: MongoDB?"]
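To make the shape of that naïve $\mathcal{O}(nr)$ search concrete, here is a minimal Python sketch. The slot-index representation and the `availability` structure are assumptions for illustration, not something specified in the thread:

    # Naive sketch of the common-slot search described above.
    # Assumptions (not from the thread): each user's availability is a set
    # of slot indices 0..r-1, one index per 30-minute slot over ~12 months.
    from functools import reduce

    SLOTS_PER_DAY = 48
    R = SLOTS_PER_DAY * 364  # r = 17472 slots, matching the figure above

    def common_slots(availability):
        """Return the slot indices shared by every user in `availability`.

        availability: dict of user id -> set of slot indices.
        Worst case touches every slot of every user: O(k * r) for k users.
        """
        if not availability:
            return set()
        return reduce(set.intersection, availability.values())

    # Hypothetical usage for one group of k = 3 users:
    group = {
        "alice": {0, 1, 2, 96, 97},
        "bob":   {1, 2, 97, 200},
        "carol": {1, 97, 98},
    }
    print(sorted(common_slots(group)))  # -> [1, 97]

Set intersection already short-circuits on the smallest set, so in practice this tends to run well below the worst case; the sub-linear bounds mentioned above would come from more specialised sequence-matching structures.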
Re: Spark for core business-logic? - Replacing: MongoDB?
Thanks Simon, that's a good way to train on incoming events (and related problems / result computations). However, does it handle the actual data storage, e.g. CRUD on documents?

On Tue, Jan 6, 2015 at 1:18 PM, Simon Chan <simonc...@gmail.com> wrote:
> Alec,
> If you are looking for a machine-learning stack that supports business logic, you may take a look at PredictionIO: http://prediction.io/
> It's based on Spark and HBase.
> Simon
>
> On Mon, Jan 5, 2015 at 6:14 PM, Alec Taylor <alec.tayl...@gmail.com> wrote:
>> [reply trimmed; quoted in full above under "Re: Spark for core business-logic? - Replacing: MongoDB?"]
Spark for core business-logic? - Replacing: MongoDB?
In the middle of doing the architecture for a new project, which has various machine-learning and related components, including: recommender systems, search engines, and sequence [common intersection] matching.

Usually I use: MongoDB (as db), Redis (as cache) and Celery (as queue, backed by Redis). Though I don't have experience with Hadoop, I was thinking of using Hadoop for the machine learning (as this will become a Big Data problem quite quickly). To push the data into Hadoop, I would use a connector of some description, or push the MongoDB backups into HDFS at set intervals.

However, I was thinking that it might be better to put the whole thing in Hadoop: store all persistent data in Hadoop, and maybe do all the layers in Apache Spark (with caching remaining in Redis). Is that a viable option? Most of what I see discusses Spark (and Hadoop in general) for analytics only. Apache Phoenix exposes a nice interface for read/write over HBase, so I might use that if Spark ends up being the wrong solution.

Thanks for all suggestions,

Alec Taylor

PS: I need this for both Big and Small data. Note that I am using the Cloudera definition of Big Data, referring to processing/storage across more than 1 machine.
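On the "push the MongoDB backups into HDFS at set intervals" idea, a minimal PySpark sketch of reading such a dump back shows Spark serving a read path rather than pure analytics. The HDFS path, the mongoexport-style JSON-lines layout, and the `username` field are assumptions for illustration:

    # Sketch: query a periodic MongoDB dump from HDFS with Spark.
    # Assumes a cron job writes mongoexport output (one JSON doc per line)
    # to the path below; path and schema are illustrative, not real.
    import json
    from pyspark import SparkContext

    sc = SparkContext(appName="mongo-dump-example")

    users = (sc.textFile("hdfs:///backups/mongo/users/latest/*.json")
               .map(json.loads))

    # A document lookup, i.e. the "R" of CRUD, not an aggregate query.
    alice = users.filter(lambda doc: doc.get("username") == "alice").collect()
    print(alice)

    sc.stop()

Note the trade-off this sketch exposes: HDFS dumps are immutable, so the C/U/D side of CRUD still needs a mutable store (MongoDB, or HBase via Phoenix as mentioned above), which is why most discussion treats Spark as the analytics layer rather than the system of record.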