[Community] Python support added to Spark Job Server

2016-08-17 Thread Evan Chan
Hi folks, just a friendly note that we have added Python support to the Spark Job Server project. If you are a Python user looking for a RESTful way to manage your Spark jobs, please come have a look at our project! https://github.com/spark-jobserver/spark-jobserver -Evan

Re: Can we use spark inside a web service?

2016-03-14 Thread Evan Chan
...and memory between queries. >>> Note that Mark is running a slightly-modified version of stock Spark. >>> (He's mentioned this in prior posts, as well.) >>> And I have to say that I'm, personally, seeing more and more slightly-modified...

Re: Can we use spark inside a web service?

2016-03-14 Thread Evan Chan
>> This may not be what people want to hear, but it's a trend that I'm seeing lately as more and more users customize Spark to their specific use cases. >> Anyway, thanks for the good discussion, everyone! This is why we have these lists, right? :)

Re: Can we use spark inside a web service?

2016-03-10 Thread Evan Chan
One of the premises here is that if you can restrict your workload to fewer cores - which is easier with FiloDB and careful data modeling - you can make this work for much higher concurrency and lower latency than most typical Spark use cases. The reason why it typically does not work in
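To make the fewer-cores idea concrete, here is a minimal config sketch (my own illustration, not from the thread; the values and the FAIR-mode pairing are assumptions):

    // Hypothetical sketch: cap the cores a long-running query app may use,
    // and let concurrent queries share them fairly. Values are illustrative.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("low-latency-queries")
      .set("spark.cores.max", "8")          // cap total cores for this app
      .set("spark.scheduler.mode", "FAIR")  // concurrent jobs share cores fairly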

Achieving 700 Spark SQL Queries Per Second

2016-03-10 Thread Evan Chan
Hey folks, I just saw a recent thread on here (but can't find it anymore) on using Spark as a web-speed query engine. I want to let you guys know that this is definitely possible! Most folks don't realize how low-latency Spark can actually be. Please check out my blog post below on achieving

Re: SparkSQL - can we add new column(s) to parquet files

2014-11-21 Thread Evan Chan
I would expect an SQL query on c to fail, because c would not be known in the schema of the older Parquet file. What I'd be very interested in is how to add a new column as an incremental new Parquet file, and be able to somehow join the existing and new files in an efficient way, i.e., somehow
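A rough sketch of the incremental-column idea being asked about, assuming the Spark 1.x SchemaRDD API; the paths, column names, and join key are all hypothetical:

    // Hypothetical: the old file has (id, a, b); the new file adds column c.
    val oldData = sqlContext.parquetFile("hdfs:///data/base")
    val newCol  = sqlContext.parquetFile("hdfs:///data/c-addition")
    oldData.registerTempTable("old_data")
    newCol.registerTempTable("new_col")

    // A plain join on the shared key; without co-location this will shuffle.
    val widened = sqlContext.sql(
      "SELECT o.id, o.a, o.b, n.c FROM old_data o JOIN new_col n ON o.id = n.id")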

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Evan Chan
Ashwin, I would say the strategies in general are: 1) Have each user submit a separate Spark app (each with its own SparkContext) with its own resource settings, and share data through HDFS or something like Tachyon for speed. 2) Share a single SparkContext amongst multiple users, using fair scheduler pools (see the sketch below).
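A minimal sketch of strategy 2, assuming a shared SparkContext with FAIR scheduling enabled and an allocation file defining per-user pools (pool and path names are illustrative):

    // In the shared app's config:
    // conf.set("spark.scheduler.mode", "FAIR")
    // conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

    // Each user's thread tags its jobs with its own pool:
    sc.setLocalProperty("spark.scheduler.pool", "user_a_pool")
    val count = sc.textFile("hdfs:///shared/data").count()

    // Reset so later jobs from this thread fall back to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", null)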

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-19 Thread Evan Chan
...at 10:40 PM, Evan Chan velvia.git...@gmail.com wrote: SPARK-1671 looks really promising. Note that even right now, you don't need to un-cache the existing table. You can do something like this: newAdditionRdd.registerTempTable("table2"); sqlContext.cacheTable("table2"); val unionedRdd

Re: Example of Geoprocessing with Spark

2014-09-19 Thread Evan Chan
Hi Abel, Pretty interesting. May I ask how big is your point CSV dataset? It seems you are relying on searching through the FeatureCollection of polygons for which one intersects your point. This is going to be extremely slow. I highly recommend using a SpatialIndex, such as the many that
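A sketch of the spatial-index suggestion using the JTS STRtree (one common choice; the polygons collection and the lookup helper are hypothetical):

    import com.vividsolutions.jts.geom.{Coordinate, Geometry, GeometryFactory}
    import com.vividsolutions.jts.index.strtree.STRtree

    val factory = new GeometryFactory()
    val index = new STRtree()
    // Build the index once over the polygon FeatureCollection.
    polygons.foreach((poly: Geometry) => index.insert(poly.getEnvelopeInternal, poly))

    // Each point now probes only polygons whose bounding boxes overlap it,
    // instead of scanning the whole collection.
    def polygonFor(x: Double, y: Double): Option[Geometry] = {
      val pt = factory.createPoint(new Coordinate(x, y))
      index.query(pt.getEnvelopeInternal).toArray
        .map(_.asInstanceOf[Geometry])
        .find(_.intersects(pt))
    }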

Re: Better way to process large image data set ?

2014-09-19 Thread Evan Chan
What Sean said. You should also definitely turn on Kryo serialization. The default Java serialization is really, really slow if you're going to move around lots of data. Also make sure you use a cluster with high network bandwidth. On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen so...@cloudera.com
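A minimal sketch of turning on Kryo in the Spark 1.x of that era (the registrator and record class names are made up):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class ImageRecord(pixels: Array[Byte])  // hypothetical record class

    // Registering your data classes avoids writing full class names per record.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[ImageRecord])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")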

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-14 Thread Evan Chan
SPARK-1671 looks really promising. Note that even right now, you don't need to un-cache the existing table. You can do something like this: newAdditionRdd.registerTempTable("table2"); sqlContext.cacheTable("table2"); val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2")) When
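Expanded, the incremental-union pattern looks roughly like this (table and RDD names illustrative, Spark 1.x SchemaRDD API assumed):

    // The existing table stays cached, untouched.
    existingRdd.registerTempTable("table1")
    sqlContext.cacheTable("table1")

    // New data arrives: cache it as its own table, then query across both.
    newAdditionRdd.registerTempTable("table2")
    sqlContext.cacheTable("table2")
    val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))
    unionedRdd.registerTempTable("all_data")
    sqlContext.sql("SELECT COUNT(*) FROM all_data")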

Re: Finding previous and next element in a sorted RDD

2014-08-22 Thread Evan Chan
There's no way to avoid a shuffle, since the first and last elements of each partition need to be computed together with elements from neighboring partitions, but I wonder if there is a way to do a minimal shuffle. On Thu, Aug 21, 2014 at 6:13 PM, cjwang c...@cjwang.us wrote: One way is to do zipWithIndex on the RDD. Then use
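A sketch of the zipWithIndex approach from the thread, joining on shifted indices to attach each element's neighbors (variable names illustrative):

    val indexed = sortedRdd.zipWithIndex().map { case (v, i) => (i, v) }
    val prev = indexed.map { case (i, v) => (i + 1, v) }  // predecessor of index i
    val next = indexed.map { case (i, v) => (i - 1, v) }  // successor of index i

    // (value, Option[previous], Option[next]) per element; the joins shuffle.
    val withNeighbors = indexed
      .leftOuterJoin(prev)
      .leftOuterJoin(next)
      .map { case (_, ((v, p), n)) => (v, p, n) }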

Merging two Spark SQL tables?

2014-08-21 Thread Evan Chan
Is it possible to merge two cached Spark SQL tables into a single table so it can be queried with one SQL statement? I.e., can you do schemaRdd1.union(schemaRdd2), then register the new SchemaRDD and run a query over it? Ideally, both schemaRdd1 and schemaRdd2 would be cached, so the union should run

Re: Merging two Spark SQL tables?

2014-08-21 Thread Evan Chan
...@databricks.com wrote: I believe this should work if you run srdd1.unionAll(srdd2). Both RDDs must have the same schema. On Wed, Aug 20, 2014 at 11:30 PM, Evan Chan velvia.git...@gmail.com wrote: Is it possible to merge two cached Spark SQL tables into a single table so it can be queried with one
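In code, the suggestion looks roughly like this (SchemaRDD API; the merged table name is illustrative):

    // Both inputs must have the same schema.
    val merged = srdd1.unionAll(srdd2)
    merged.registerTempTable("merged")
    sqlContext.cacheTable("merged")
    sqlContext.sql("SELECT COUNT(*) FROM merged")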

Writeup on Spark SQL with GDELT

2014-08-21 Thread Evan Chan
I just put up a repo with a write-up on how to import the GDELT public dataset into Spark SQL and play around with it. It has a lot of notes on different import methods and observations about Spark SQL. Feel free to have a look and comment. http://www.github.com/velvia/spark-sql-gdelt

[Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config) scala> val gdeltT = sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/") 14/08/21 19:07:14 INFO : initialize(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005, Configuration: core-default.xml, core-site.xml,

Re: [Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
The underFS is HDFS btw. On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan velvia.git...@gmail.com wrote: Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config) scala> val gdeltT = sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/") 14/08/21 19:07:14 INFO

Re: [Tachyon] Error reading from Parquet files in HDFS

2014-08-21 Thread Evan Chan
And it worked earlier with a non-Parquet directory. On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan velvia.git...@gmail.com wrote: The underFS is HDFS btw. On Thu, Aug 21, 2014 at 12:22 PM, Evan Chan velvia.git...@gmail.com wrote: Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config

Spark-JobServer moving to a new location

2014-08-21 Thread Evan Chan
Dear community, Wow, I remember when we first open-sourced the job server at the first Spark Summit in December. Since then, more and more of you have started using it and contributing to it. It is awesome to see! If you are not familiar with the Spark Job Server, it is a REST API for submitting and managing Spark jobs.

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Evan Chan
That might not be enough. Reflection is used to determine what the fields are, so your class might actually need to have members corresponding to the fields in the table. I heard that a more generic method of defining input schemas is coming. On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer
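A sketch of why a concrete class matters for the reflection-based schema inference of that era (the case class and its fields are made up):

    // Reflection can only find fields on a concrete case class, not on an
    // unbound type parameter T, hence RDD[T] is rejected where RDD[A] works.
    case class Record(name: String, count: Int)

    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD
    val records = sc.parallelize(Seq(Record("a", 1), Record("b", 2)))
    records.registerTempTable("records")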