Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-25 Thread Brandon Amos
\cc David Tompkins and Jim Donahue if they have anything to add.
\cc My school email. Please include bamos_cmu.edu for further discussion.

Hi Deb,


Debasish Das wrote:
> Looks very cool... will try it out for ad-hoc analysis of our datasets and
> provide more feedback...
>
> Could you please give a bit more detail about how the Spindle architecture
> differs from the Hue + Spark integration (Python stack) and the Ooyala
> jobserver?
>
> Does Spindle allow sharing a Spark context across multiple Spark jobs, like
> jobserver does?

Great point! I think these job servers would work well with Spindle on larger
clusters.
I've added the following section to the README to note this as an
area of future work.

Regards,
Brandon.

---

## Future Work - Utilizing Spark job servers or resource managers.
Spindle's architecture can likely be improved on larger clusters by
utilizing a job server or resource manager to
maintain a pool of Spark contexts for query execution.
[Ooyala's spark-jobserver][spark-jobserver] provides
a RESTful interface for submitting Spark jobs, which Spindle could
use instead of interfacing with Spark directly.
[YARN][yarn] can also be used to manage Spark's
resources on a cluster, as described in [this article][spark-yarn].
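
To make the job-server option concrete, here is a rough sketch of how Spindle
might submit a query over spark-jobserver's REST interface rather than through
an embedded Spark context. The host, port, application name, and class path
below are illustrative placeholders, not Spindle's actual configuration.

```scala
import java.net.{HttpURLConnection, URL}

object JobServerClient {
  // Build the submission URL for spark-jobserver's POST /jobs endpoint.
  // All parameter values are placeholders for illustration.
  def jobUrl(host: String, appName: String, classPath: String): String =
    s"http://$host:8090/jobs?appName=$appName&classPath=$classPath&sync=true"

  // Submit a query configuration and return the HTTP status code.
  // (Requires a running spark-jobserver; shown only to sketch the interaction.)
  def submitQuery(host: String, appName: String,
                  classPath: String, config: String): Int = {
    val conn = new URL(jobUrl(host, appName, classPath))
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.getOutputStream.write(config.getBytes("UTF-8"))
    conn.getResponseCode
  }
}
```

With `sync=true` the job server blocks until the job finishes and returns the
result in the response body, which matches Spindle's request/response query
model more closely than asynchronous submission would.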

However, allocating resources on the cluster raises additional
questions and engineering work that Spindle can address in future work.
Spindle's current architecture co-locates HDFS and Spark workers
on the same nodes, minimizing the network traffic required
to load data.
How much will performance degrade if the resource manager
allocates a subset of Spark workers that are not co-located
with the HDFS data being accessed?

Furthermore, what would a production-ready caching policy
for a pool of Spark contexts look like?
What if many queries that use the same data are submitted
and executed on different Spark contexts?
Scheduling those queries on the same Spark context and
caching the data between query executions would substantially
improve performance, but how should the scheduler
obtain this information?
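
One possible policy, sketched minimally below: route queries that touch a
dataset to the context that already cached it, falling back to round-robin for
new datasets. The pool, handle, and method names here are hypothetical stand-ins
for real SparkContext management, not Spindle's API.

```scala
import scala.collection.mutable

// Stand-in for a pooled Spark context; tracks which datasets it has cached.
class ContextHandle(val id: Int) {
  val cachedDatasets = mutable.Set.empty[String]
}

class ContextPool(size: Int) {
  private val contexts = Vector.tabulate(size)(i => new ContextHandle(i))
  // Which context last cached each dataset, so repeat queries can reuse it.
  private val byDataset = mutable.Map.empty[String, ContextHandle]
  private var next = 0

  // Prefer a context that already holds `dataset`; otherwise round-robin.
  def acquire(dataset: String): ContextHandle =
    byDataset.getOrElseUpdate(dataset, {
      val ctx = contexts(next % contexts.size)
      next += 1
      ctx.cachedDatasets += dataset
      ctx
    })
}
```

A production version would additionally need eviction under memory pressure,
concurrency control, and per-context load awareness, which is exactly the open
engineering work described above.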

[spark-jobserver]: https://github.com/ooyala/spark-jobserver
[yarn]:
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[spark-yarn]:
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203p12731.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-23 Thread Brandon Amos
\cc David Tompkins and Jim Donahue if they have anything to add. 
\cc My school email. Please include bamos_cmu.edu for further discussion. 

Hi Soumya,


ssimanta wrote:
> The project mentions "process petabytes of data in real-time." I'm
> curious to know whether the architecture implemented in the GitHub repo was
> used to process petabytes?
> If yes, how many nodes did you use for this, and did you run a Spark
> standalone cluster or use YARN/Mesos?
> I'm also interested to know what issues you had with Spray and Akka
> working at this scale.

Great question! I've added the following portion to the README's intro
section to make it clear that Spindle is not yet ready to process
petabytes of data in real-time.

Also, I'd be interested in seeing how Spray/Akka at larger
scales compares to using job or resource managers.
We're currently running Spindle on a standalone Spark cluster.

Regards,
Brandon.

---

This repo contains the Spindle implementation and benchmarking scripts
to observe Spindle's performance while exploring Spark's tuning options.
Spindle's goal is to process petabytes of data on thousands of nodes,
but the current implementation has not yet been tested at this scale.
Our current experimental results use six nodes,
each with 24 cores and 21 GB of Spark memory, to query 13.1 GB of
analytics data.
The trends show that further Spark tuning and optimizations should
be investigated before attempting larger scale deployments.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203p12706.html



Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Matei Zaharia
Thanks for sharing this, Brandon! Looks like a great architecture for people to 
build on.

Matei

On August 15, 2014 at 2:07:06 PM, Brandon Amos (a...@adobe.com) wrote:

Hi Spark community, 

At Adobe Research, we're happy to open source a prototype 
technology called Spindle we've been developing over 
the past few months for processing analytics queries with Spark. 
Please take a look at the repository on GitHub at 
https://github.com/adobe-research/spindle, 
and we welcome any feedback. Thanks! 

Regards, 
Brandon. 



-- 
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Open-sourcing-Spindle-by-Adobe-Research-a-web-analytics-processing-engine-in-Scala-Spark-and-Parquet-tp12203.html
 



Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Debasish Das
Hi Brandon,

Looks very cool...will try it out for ad-hoc analysis of our datasets and
provide more feedback...

Could you please give a bit more detail about how the Spindle
architecture differs from the Hue + Spark integration (Python stack) and the
Ooyala jobserver?

Does Spindle allow sharing a Spark context across multiple Spark jobs, like
jobserver does?

Thanks.
Deb

