Hi folks,
Just a friendly message that we have added Python support to the REST
Spark Job Server project. If you are a Python user looking for a
RESTful way to manage your Spark jobs, please come have a look at our
project!
https://github.com/spark-jobserver/spark-jobserver
-Evan
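For anyone curious what job submission over REST looks like, here is a minimal sketch in Python. The endpoint path and query parameters follow the job server's documented API (`POST /jobs?appName=...&classPath=...`), but the host, app name, and class path below are placeholders — check the repo's README for the current API:

```python
# Sketch of building a job-submission URL for Spark Job Server's REST API.
# The /jobs endpoint and appName/classPath/sync parameters follow the
# project's README; the host and names here are illustrative placeholders.
from urllib.parse import urlencode

def job_submit_url(host, app_name, class_path, sync=False):
    """Build the POST URL for running a job on an uploaded app binary."""
    params = {"appName": app_name, "classPath": class_path}
    if sync:
        # sync=true makes the server wait and return the job result directly
        params["sync"] = "true"
    return "http://%s/jobs?%s" % (host, urlencode(params))

url = job_submit_url("localhost:8090", "my-python-app",
                     "example.WordCountJob")
print(url)
```

You would POST to this URL (with any job config in the request body) after uploading your app binary to the server.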
>>>
>>> Note that Mark is running a slightly-modified version of stock Spark.
>>> (He's mentioned this in prior posts, as well.)
>>>
>>> And I have to say that I'm, personally, seeing more and more
>>> slightly-modified Spark deployments.
>>
>> This may not be what people want to hear, but it's a trend I'm seeing
>> lately as more and more teams customize Spark to their specific use cases.
>>
>> Anyway, thanks for the good discussion, everyone! This is why we have
>> these lists, right! :)
>>
>>
One of the premises here is that if you can restrict your workload to
fewer cores - which is easier with FiloDB and careful data modeling -
you can make this work for much higher concurrency and lower latency
than most typical Spark use cases.
The reason why it typically does not work in
Hey folks,
I just saw a recent thread on here (but can't find it anymore) on
using Spark as a web-speed query engine. I want to let you guys know
that this is definitely possible! Most folks don't realize how
low-latency Spark can actually be. Please check out my blog post
below on achieving
I would expect a SQL query on c to fail, because c would not be present in
the schema of the older Parquet file.
What I'd be very interested in is how to add a new column as an incremental
new Parquet file, and be able to somehow join the existing and new files in
an efficient way. I.e., somehow
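The shape of the desired operation (outside Spark, just to illustrate) is a keyed join of the old rows with a separately-stored new column, with missing keys becoming nulls. A toy Python sketch, with all names hypothetical stand-ins for the Parquet files:

```python
# Toy illustration of extending rows from an "old file" with a new
# column stored separately, joined on a row key. The dicts stand in
# for Parquet files; this is a sketch of the idea, not Spark code.
def join_new_column(old_rows, new_col, col_name):
    """old_rows: {key: {col: val}}, new_col: {key: val}."""
    out = {}
    for key, row in old_rows.items():
        merged = dict(row)
        # Keys absent from the new file become None (a nullable column).
        merged[col_name] = new_col.get(key)
        out[key] = merged
    return out

old = {1: {"a": 10, "b": 20}, 2: {"a": 30, "b": 40}}
new = {1: 99}
print(join_new_column(old, new, "c"))
# {1: {'a': 10, 'b': 20, 'c': 99}, 2: {'a': 30, 'b': 40, 'c': None}}
```

The efficiency question in the email — doing this join without materializing both sides — is exactly the part the sketch does not answer.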
Ashwin,
I would say the strategies in general are:
1) Have each user submit a separate Spark app (each with its own
SparkContext) and its own resource settings, and share data through HDFS
or something like Tachyon for speed.
2) Share a single SparkContext amongst multiple users, using fair
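For strategy 2, fair scheduling within one context is turned on through configuration. A minimal spark-defaults.conf sketch — `spark.scheduler.mode` and `spark.scheduler.allocation.file` are real Spark settings, but the file path and any pool definitions are illustrative:

```
# Enable Spark's fair scheduler for a shared context
spark.scheduler.mode              FAIR
# Optional: per-pool weights and minShares defined in an allocation file
spark.scheduler.allocation.file   /path/to/fairscheduler.xml
```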
at 10:40 PM, Evan Chan velvia.git...@gmail.com wrote:
SPARK-1671 looks really promising.
Note that even right now, you don't need to un-cache the existing
table. You can do something like this:
newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd
Hi Abel,
Pretty interesting. May I ask how big is your point CSV dataset?
It seems you are relying on searching through the FeatureCollection of
polygons for which one intersects your point. This is going to be
extremely slow. I highly recommend using a SpatialIndex, such as the
many that
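The win from a spatial index is turning a linear scan over every polygon into a lookup of a handful of candidates. A minimal grid-index sketch in Python (real implementations like an R-tree/STRtree are far better; this just shows the candidate-pruning idea, and all names are illustrative):

```python
# Minimal grid-based spatial index sketch: bucket polygons by the grid
# cells their bounding boxes cover, so a point query only tests the
# few polygons in its cell instead of scanning the whole collection.
from collections import defaultdict

class GridIndex:
    def __init__(self, cell=1.0):
        self.cell = cell
        self.buckets = defaultdict(list)

    def _key(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, poly_id, bbox):
        """bbox = (x0, y0, x1, y1); register in every cell it overlaps."""
        (x0, y0, x1, y1) = bbox
        kx0, ky0 = self._key(x0, y0)
        kx1, ky1 = self._key(x1, y1)
        for kx in range(kx0, kx1 + 1):
            for ky in range(ky0, ky1 + 1):
                self.buckets[(kx, ky)].append((poly_id, bbox))

    def candidates(self, x, y):
        """Polygons whose bbox could contain the point; run the exact
        point-in-polygon test only on these."""
        return [pid for pid, (x0, y0, x1, y1) in self.buckets[self._key(x, y)]
                if x0 <= x <= x1 and y0 <= y <= y1]

idx = GridIndex(cell=10.0)
idx.insert("polyA", (0, 0, 5, 5))
idx.insert("polyB", (50, 50, 60, 60))
print(idx.candidates(3, 3))   # ['polyA']
```

With millions of points, pruning to a few candidates per point instead of intersecting against the whole FeatureCollection is the difference between hours and seconds.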
What Sean said.
You should also definitely turn on Kryo serialization. The default
Java serialization is really, really slow if you're going to move around
lots of data. Also make sure you use a cluster with high network
bandwidth.
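Enabling Kryo is a one-line setting; registering your own classes (optional, but it shrinks serialized output) is a second. A spark-defaults.conf sketch — `spark.serializer` and `spark.kryo.registrator` are real Spark settings, but the registrator class name is a hypothetical example:

```
spark.serializer        org.apache.spark.serializer.KryoSerializer
# Optional: register application classes for smaller serialized output
spark.kryo.registrator  com.example.MyKryoRegistrator
```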
On Thu, Sep 18, 2014 at 3:06 AM, Sean Owen so...@cloudera.com
SPARK-1671 looks really promising.
Note that even right now, you don't need to un-cache the existing
table. You can do something like this:
newAdditionRdd.registerTempTable("table2")
sqlContext.cacheTable("table2")
val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))
When
There's no way to avoid a shuffle entirely, because the first and last
elements of each partition need to be combined with elements from
neighboring partitions, but I wonder if there is a way to do a minimal
shuffle.
On Thu, Aug 21, 2014 at 6:13 PM, cjwang c...@cjwang.us wrote:
One way is to do zipWithIndex on the RDD. Then use
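The "minimal shuffle" idea can be illustrated outside Spark: instead of moving whole partitions, only each partition's boundary elements need to cross partition lines. A toy Python sketch with partitions as plain lists, computing sums of consecutive pairs (a stand-in for whatever the per-pair computation is):

```python
# Toy sketch: compute sums of consecutive pairs over data split into
# "partitions" (plain lists). Pairs entirely inside one partition need
# no data movement; only each partition's first/last elements must be
# shared, which is the "minimal shuffle" of boundary values.
# Note: results come out grouped (local pairs first, then boundary
# pairs), not in original sequence order.
def consecutive_pair_sums(partitions):
    results = []
    # Local work: pairs entirely inside one partition.
    for part in partitions:
        results.extend(a + b for a, b in zip(part, part[1:]))
    # "Shuffled" work: only the boundary elements cross partitions.
    boundaries = [(p[0], p[-1]) for p in partitions if p]
    for (_, last_a), (first_b, _) in zip(boundaries, boundaries[1:]):
        results.append(last_a + first_b)
    return results

print(sorted(consecutive_pair_sums([[1, 2, 3], [4, 5]])))  # [3, 5, 7, 9]
```

In Spark terms, only the tiny boundaries list would need collecting or shuffling; the bulk of the work stays partition-local.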
Is it possible to merge two cached Spark SQL tables into a single
table so it can be queried with one SQL statement?
I.e., can you do schemaRdd1.union(schemaRdd2), then register the new
schemaRdd and run a query over it?
Ideally, both schemaRdd1 and schemaRdd2 would be cached, so the union
should run
...@databricks.com wrote:
I believe this should work if you run srdd1.unionAll(srdd2). Both RDDs must
have the same schema.
On Wed, Aug 20, 2014 at 11:30 PM, Evan Chan velvia.git...@gmail.com wrote:
Is it possible to merge two cached Spark SQL tables into a single
table so it can be queried with one
I just put up a repo with a write-up on how to import the GDELT public
dataset into Spark SQL and play around with it. It has a lot of notes on
different import methods and observations about Spark SQL. Feel free
to have a look and comment.
http://www.github.com/velvia/spark-sql-gdelt
Spark 1.0.2, Tachyon 0.4.1, Hadoop 1.0 (standard EC2 config)
scala> val gdeltT =
sqlContext.parquetFile("tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005/")
14/08/21 19:07:14 INFO :
initialize(tachyon://172.31.42.40:19998/gdelt-parquet/1979-2005,
Configuration: core-default.xml, core-site.xml,
The underFS is HDFS btw.
And it worked earlier with non-parquet directory.
Dear community,
Wow, I remember when we first open sourced the job server, at the
first Spark Summit in December. Since then, more and more of you have
started using it and contributing to it. It is awesome to see!
If you are not familiar with the spark job server, it is a REST API
for
That might not be enough. Reflection is used to determine what the
fields are, thus your class might actually need to have members
corresponding to the fields in the table.
I heard that a more generic method of inputting stuff is coming.
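The mechanism being described — a table schema inferred by reflecting over a class's declared fields — can be illustrated outside Spark. A Python sketch using dataclass reflection, purely as an analogy (the `Person` class and field names are made up):

```python
# Analogy (not Spark itself): inferring a "table schema" by reflecting
# over a class's declared fields, the way Spark SQL inspects case
# classes. The class and its fields here are illustrative.
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

def infer_schema(cls):
    """Return (field_name, type_name) pairs discovered by reflection."""
    return [(f.name, f.type if isinstance(f.type, str) else f.type.__name__)
            for f in fields(cls)]

print(infer_schema(Person))   # [('name', 'str'), ('age', 'int')]
```

The point of the email is the flip side of this: if your class lacks a member for some table column, reflection has nothing to find, and the schema comes out wrong.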
On Tue, Aug 19, 2014 at 6:43 PM, Tobias Pfeiffer