Hey all,
Was messing around with Spark and Google FlatBuffers for fun, and it got me
thinking about Spark and serialization. I know there's been work / talk
about in-memory columnar formats in Spark SQL, so maybe there are ways to
provide this flexibility already that I've missed? Either way, my
Technically you can already use a custom serializer for each shuffle operation
(it is part of the ShuffledRDD). I've seen Matei suggest on JIRA issues
(or GitHub) in the past a storage policy in which you can specify how
data should be stored. I think that would be a great API to have in the
long
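This isn't Spark's actual API, but the idea of picking a serializer per operation (rather than one global default) can be sketched generically; all names below are hypothetical and just illustrate why the choice matters:

```python
import json
import pickle

# Hypothetical sketch: each shuffle-like operation chooses its own
# serializer, loosely analogous to handing a custom Serializer to a
# ShuffledRDD. None of this is Spark's real API.

class PickleSerializer:
    def dumps(self, obj):
        return pickle.dumps(obj)
    def loads(self, data):
        return pickle.loads(data)

class JsonSerializer:
    def dumps(self, obj):
        return json.dumps(obj).encode("utf-8")
    def loads(self, data):
        return json.loads(data.decode("utf-8"))

def shuffle_roundtrip(records, serializer):
    """Serialize each record as it would be written for a shuffle,
    then deserialize it again on the read side."""
    wire = [serializer.dumps(r) for r in records]
    return [serializer.loads(b) for b in wire]

records = [("a", 1), ("b", 2)]
# Pickle preserves the tuples exactly; JSON silently turns them into
# lists -- a small example of why per-operation control is useful.
print(shuffle_roundtrip(records, PickleSerializer()))
print(shuffle_roundtrip(records, JsonSerializer()))
```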
Hi,
I am trying to understand how the
/spark/*/Storage/BlockManagerMaster.askDriverWithReply() works.
def getPeers(blockManagerId: BlockManagerId, numPeers: Int): Seq[BlockManagerId] = {
  val result =
    askDriverWithReply[Seq[BlockManagerId]](GetPeers(blockManagerId, numPeers))
  if (result.length
ask() is a method on every Actor. It comes from the Akka library, which
Spark uses for a lot of the communication between various components.
There is some documentation on ask() here (go to the section on Send
messages):
http://doc.akka.io/docs/akka/2.2.3/scala/actors.html
though if you are
(Whoops, forgot to copy dev@ in my original reply; adding it back)
Yeah, the GraphViz part was mostly for fun and for understanding cyclic
object graphs. In general, an object graph might contain cycles, so for
understanding the overall structure it's handy to have a picture. The
GraphViz thing
+1 (binding)
I see this as a way to increase transparency and efficiency around a
process that already informally exists, with benefits to both new
contributors and committers. For new contributors, it makes clear who they
should ping about a pending patch. For committers, it's a good reference
Who here would be interested in helping to work on an implementation of the
TinkerPop3 Gremlin API for Spark? Is this something that should continue in
the Spark discussion group, or should it migrate to the Gremlin message
group?
Reynold is right that there will be inherent mismatches in the
I’m definitely on board to help / take a portion of this work. I too am
wondering what the proper discussion venue should be moving forward, given
Reynold’s remarks on a community project hosted outside Spark. If I’m
understanding correctly, my take would be:
1. to find a core group of developers
I think if we are going to use GraphX as the query engine in TinkerPop3,
then the TinkerPop3 community is the right platform to further the
discussion.
The reason I asked the question about improving the APIs in GraphX is that
it's not only Gremlin; any graph DSL can exploit the GraphX APIs. Cypher has
Hi,
I have installed spark-1.1.0 and Apache Flume 1.4 for running the streaming
example FlumeEventCount. Previously the code was working fine. Now I am
facing the issues mentioned below. My Flume agent is running properly; it
is able to write the file.
The command I use is
bin/run-example
I just watched Kay's talk from 2013 on Sparrow
https://www.youtube.com/watch?v=ayjH_bG-RC0. Is replacing Spark's native
scheduler with Sparrow still on the books?
The Sparrow repo https://github.com/radlab/sparrow hasn't been updated
recently, and I don't see any JIRA issues about it.
It would
-1 (not binding, +1 for maintainer, -1 for sign off)
Agree with Greg and Vinod. In the beginning, everything is better
(more efficient, more focused), but after some time, fighting begins.
Code style is the hottest topic to fight over (we already saw it in some
PRs). If two committers (one of them is
Hi Nick,
This hasn't yet been directly supported by Spark because of a lack of
demand. The last time I ran a throughput test on the default Spark
scheduler (~1 year ago, so this may have changed), it could launch
approximately 1500 tasks / second. If, for example, you have a cluster of
100 machines, this means the scheduler can launch 150 tasks per machine
per second.
+1 (binding)
I agree with the proposal in that it just formalizes what we have been
doing until now, and it will increase the efficiency and focus of the
review process.
To address Davies' concern, I agree coding style is often a hot topic
of contention. But that is just an indication that our
Sorry for my last email; I misunderstood the proposal. All committers
still have an equal -1 on all code changes.
Also, as mentioned in the proposal, the sign-off only applies to
public APIs and architecture; things like discussions about code style
stay the same.
So, I'd
If, for example, you have a cluster of 100 machines, this means the
scheduler can launch 150 tasks per machine per second.
Did you mean 15 tasks per machine per second here? Or alternatively, 10
machines?
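For what it's worth, the arithmetic behind the question, using the ~1500 tasks/second figure from earlier in the thread:

```python
# Sanity check on the quoted numbers: ~1500 tasks/second cluster-wide,
# spread evenly across 100 machines.
cluster_rate = 1500  # tasks launched per second by the scheduler
machines = 100

per_machine = cluster_rate / machines
print(per_machine)  # -> 15.0 tasks per machine per second, not 150
```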
I don't know of any existing Spark clusters that have a large enough number
of
On Fri, Nov 7, 2014 at 6:20 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
If, for example, you have a cluster of 100 machines, this means the
scheduler can launch 150 tasks per machine per second.
Did you mean 15 tasks per machine per second here? Or alternatively, 10
machines?
Sounds good. I'm looking forward to tracking improvements in this area.
Also, just to connect some more dots here, I just remembered that there is
currently an initiative to add an IndexedRDD
https://issues.apache.org/jira/browse/SPARK-2365 interface. Some
interesting use cases mentioned there
On Fri, Nov 7, 2014 at 8:04 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Sounds good. I'm looking forward to tracking improvements in this area.
Also, just to connect some more dots here, I just remembered that there is
currently an initiative to add an IndexedRDD
Hmm, relevant quote from section 3.3:
newer frameworks like Spark [35] reduce the overhead to 5ms. To support
tasks that complete in hundreds of milliseconds, we argue for reducing
task launch overhead even further to 1ms so that launch overhead
constitutes at most 1% of task runtime. By
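The 1% claim in that passage follows directly from the numbers: a 1 ms launch overhead against a task of at least 100 ms is at most 1% of its runtime. A quick check:

```python
# Launch overhead as a fraction of task runtime, per the quoted figures:
# 1 ms of overhead against tasks of a few hundred milliseconds.
overhead_ms = 1.0
for runtime_ms in (100, 200, 500):
    print(f"{runtime_ms} ms task: {overhead_ms / runtime_ms:.1%} overhead")
# The shortest "hundreds of milliseconds" task (100 ms) is the worst
# case: 1 ms / 100 ms = 1%.
```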
I think Kay might be able to give a better answer. The most recent
benchmark I remember had the number at somewhere between 8.6ms and
14.6ms depending on the Spark version (
https://github.com/apache/spark/pull/2030#issuecomment-52715181). Another
point to note is that this is the total time to
+1 (binding)
On 8 Nov 2014 07:26, Davies Liu dav...@databricks.com wrote:
Sorry for my last email; I misunderstood the proposal. All committers
still have an equal -1 on all code changes.
Also, as mentioned in the proposal, the sign-off only applies to
public APIs and architecture,
I don't have much more info than what Shivaram said. My sense is that,
over time, task launch overhead with Spark has slowly grown as Spark
supports more and more functionality. However, I haven't seen it be as
high as the 100ms Michael quoted (maybe this was for jobs with tasks that
have much
We should take a vector instead, giving the user the flexibility to decide
the data source/type
What do you mean by vector datatype exactly?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Nov 5, 2014 at 6:45 AM,
I noticed that this doesn't compile:
mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package
[error] warning: [options] bootstrap class path not set in conjunction
with -source 1.6
[error]
I bet it doesn't work. +1 on isolating its inclusion to only the
newer YARN APIs.
- Patrick
On Fri, Nov 7, 2014 at 11:43 PM, Sean Owen so...@cloudera.com wrote:
I noticed that this doesn't compile:
mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean
package
[error]
Hm. Problem is, core depends directly on it:
[error]
/Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SecurityManager.scala:25:
object sasl is not a member of package org.apache.spark.network
[error] import org.apache.spark.network.sasl.SecretKeyHolder
[error]