/spark/blob/master/dev/create-release/create-release.sh#L65
On Mon, Jul 28, 2014 at 12:39 PM, Koert Kuipers ko...@tresata.com wrote:
ah ok thanks. guess i am gonna read up about maven-release-plugin then!
On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote
hey we used to publish spark inhouse by simply overriding the publishTo
setting. but now that the sbt build is integrated with maven i cannot find it
anymore.
i tried looking into the pom file, but after reading 1144 lines of xml i
1) havent found anything that looks like publishing
2) i feel
and if i want to change the version, it seems i have to change it in all 23
pom files? mhhh. is it mandatory for these sub-project pom files to repeat
that version info? useful?
spark$ grep 1.1.0-SNAPSHOT * -r | wc -l
23
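(for reference, one way to bump the version in every module at once, not something raised in this thread, is the maven versions plugin; it rewrites the parent pom and all child poms in one go:

mvn versions:set -DnewVersion=1.1.0-SNAPSHOT
)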
On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote
for you by this plugin.
Maven requires artifacts to set a version and it can't inherit one. I
feel like I understood the reason this is necessary at one point.
On Mon, Jul 28, 2014 at 8:33 PM, Koert Kuipers ko...@tresata.com wrote:
and if i want to change the version, it seems i have to change
i have graphx queries running inside a service where i collect the results
to the driver and do not hold any references to the rdds involved in the
queries. my assumption was that with the references gone spark would go and
remove the cached rdds from memory (note, i did not cache them, graphx
hello all,
in case anyone is interested, i just wrote a short blog about using
shapeless in spark to optimize data layout in memory.
blog is here:
http://tresata.com/tresata-open-sources-spark-columnar
code is here:
https://github.com/tresata/spark-columnar
but be aware that spark-defaults.conf is only used if you use spark-submit
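(a minimal sketch, assuming you build the SparkConf yourself when not going through spark-submit: the key from the quoted reply below can also be set programmatically)

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraLibraryPath", "/path/to/native/lib")
val sc = new org.apache.spark.SparkContext(conf)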
On Jul 17, 2014 4:29 PM, Zongheng Yang zonghen...@gmail.com wrote:
One way is to set this in your conf/spark-defaults.conf:
spark.executor.extraLibraryPath /path/to/native/lib
The key is documented here:
:
The UI code is the same in both, but one possibility is that your
executors were given less memory on YARN. Can you check that? Or otherwise,
how do you know that some RDDs were cached?
Matei
On Jul 12, 2014, at 4:12 PM, Koert Kuipers ko...@tresata.com wrote:
hey shuo,
so far all stage links
.
Best,
On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers ko...@tresata.com wrote:
I just tested a long lived application (that we normally run in
standalone mode) on yarn in client mode.
it looks to me like cached rdds are missing in the storage tab of the ui.
accessing the rdd storage
I just tested a long lived application (that we normally run in standalone
mode) on yarn in client mode.
it looks to me like cached rdds are missing in the storage tab of the ui.
accessing the rdd storage information via the spark context shows rdds as
fully cached but they are missing on
not sure I understand why unifying how you submit app for different
platforms and dynamic configuration cannot be part of SparkConf and
SparkContext?
for classpath a simple script similar to hadoop classpath that shows what
needs to be added should be sufficient.
on spark standalone I can launch
did you explicitly cache the rdd? we cache rdds and share them between jobs
just fine within one context in spark 1.0.x. but we do not use the ooyala
job server...
On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote:
Hi,
I using spark 1.0.0 , using Ooyala Job Server, for
we simply hold on to the reference to the rdd after it has been cached. so
we have a single Map[String, RDD[X]] for cached RDDs for the application
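(a minimal sketch of that approach, with illustrative names; X stands in for whatever record type the application uses)

import org.apache.spark.rdd.RDD
import scala.collection.mutable

class RddRegistry[X] {
  // cache each RDD once and hold on to the reference, so later jobs in the
  // same SparkContext re-use the cached copy
  private val cached = mutable.Map[String, RDD[X]]()
  def getOrCache(name: String, build: => RDD[X]): RDD[X] =
    cached.getOrElseUpdate(name, build.cache())
}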
On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote:
Hi,
Yes . I am caching the RDD's by calling cache method..
May i
.
-Suren
On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com
wrote:
not sure I understand why unifying how you submit app for different
platforms and dynamic configuration cannot be part of SparkConf and
SparkContext?
for classpath a simple script similar to hadoop
do you control your cluster and spark deployment? if so, you can try to
rebuild with jetty 9.x
On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter
martingammelsae...@gmail.com wrote:
Digging a bit more I see that there is yet another jetty instance that
is causing the problem, namely the
you could only do the deep check if the hashcodes are the same and design
hashcodes that do not take all elements into account.
the alternative seems to be putting cache statements all over graphx, as is
currently the case, which is trouble for any long lived application where
caching is
i noticed that some algorithms such as graphx liberally cache RDDs for
efficiency, which makes sense. however it can also leave a long trail of
unused yet cached RDDs, that might push other RDDs out of memory.
in a long-lived spark context i would like to decide which RDDs stick
around. would it
spark has a setting to put user jars in front of classpath, which should do
the trick.
however i had no luck with this. see here:
https://issues.apache.org/jira/browse/SPARK-1863
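(the setting referred to above is presumably spark.files.userClassPathFirst, which comes up again later in this archive; a sketch of enabling it in spark-defaults.conf, keeping in mind the JIRA above reports it not always working)

spark.files.userClassPathFirst true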
On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com wrote:
spark-submit includes a spark-assembly
i was testing using the acl for spark ui in secure mode on yarn in client
mode.
it works great. my spark 1.0.0 configuration has:
spark.authenticate = true
spark.ui.acls.enable = true
spark.ui.view.acls = koert
spark.ui.filters =
probably a dumb question, but why is reference equality used for the
indexes?
On Sun, Jul 6, 2014 at 12:43 AM, Ankur Dave ankurd...@gmail.com wrote:
When joining two VertexRDDs with identical indexes, GraphX can use a fast
code path (a zip join without any hash lookups). However, the check
. On
the driver you can just top k the combined top k from each partition
(assuming you have (object, count) for each top k list).
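(a hedged sketch of that approach, with illustrative names and types: top k (object, count) pairs per partition, then the top k of the combined per-partition lists on the driver)

import org.apache.spark.rdd.RDD

def topK(counts: RDD[(String, Long)], k: Int): Seq[(String, Long)] = {
  val perPartition = counts.mapPartitions { it =>
    it.toSeq.sortBy(-_._2).take(k).iterator  // top k within each partition
  }
  perPartition.collect().sortBy(-_._2).take(k)  // combine on the driver
}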
—
Sent from Mailbox https://www.dropbox.com/mailbox
On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers ko...@tresata.com wrote:
my initial approach to taking top k values
thanks for replying. why is joining two vertexrdds without caching slow?
what is recomputed unnecessarily?
i am not sure what is different here from joining 2 regular RDDs (where
nobody seems to recommend to cache before joining i think...)
On Thu, Jul 3, 2014 at 10:52 PM, Ankur Dave
i did the second option: re-implemented .toBreeze as .breeze using pimp
classes
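(a minimal sketch of such a pimp class; this simplified version converts every mllib Vector to a dense breeze vector via toArray, whereas a real implementation would presumably keep sparse vectors sparse)

import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vector

object BreezeConversions {
  implicit class RichVector(val v: Vector) extends AnyVal {
    def breeze: BDV[Double] = new BDV(v.toArray)
  }
}

with import BreezeConversions._ in scope, any mllib vector then gets a .breeze method.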
On Wed, Jul 2, 2014 at 5:00 PM, Thunder Stumpges thunder.stump...@gmail.com
wrote:
I am upgrading from Spark 0.9.0 to 1.0 and I had a pretty good amount of
code working with internals of MLLib. One of the big
its kind of handy to be able to convert stuff to breeze... is there some
other way i am supposed to access that functionality?
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % versionSpark % "provided"
    exclude("org.apache.hadoop", "hadoop-client"),
  "org.apache.hadoop" % "hadoop-client" % versionHadoop % "provided"
)
On Wed, Jun 25, 2014 at 11:26 AM, Robert James srobertja...@gmail.com
wrote:
To add Spark to a SBT
lately i am seeing a lot of this warning in graphx:
org.apache.spark.graphx.impl.ShippableVertexPartitionOps: Joining two
VertexPartitions with different indexes is slow.
i am using Graph.outerJoinVertices to join in data from a regular RDD (that
is co-partitioned). i would like this operation to
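(a hedged sketch of that operation, with made-up attribute types: join per-vertex data from a regular RDD into the graph's vertex attributes)

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

def annotate(graph: Graph[Int, Int], extra: RDD[(VertexId, String)]): Graph[(Int, Option[String]), Int] =
  graph.outerJoinVertices(extra) { (id, attr, opt) => (attr, opt) }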
run your spark app in client mode together with a spray rest service, that
the front end can talk to
On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Hi all,
So far, I run my spark jobs with spark-shell or spark-submit command. I'd
like to go further and I wonder
, Koert Kuipers ko...@tresata.com wrote:
i am trying to understand how yarn-client mode works. i am not using
Application application_1403117970283_0014 failed 2 times due to AM
Container for appattempt_1403117970283_0014_02 exited with exitCode:
-1000 due to: File file:/home/koert/test
i noticed that when i submit a job to yarn it mistakenly tries to upload
files to local filesystem instead of hdfs. what could cause this?
in spark-env.sh i have HADOOP_CONF_DIR set correctly (and spark-submit does
find yarn), and my core-site.xml has a fs.defaultFS that is hdfs, not local
...@cloudera.com
wrote:
Hi Koert,
Could you provide more details? Job arguments, log messages, errors, etc.
On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers ko...@tresata.com wrote:
i noticed that when i submit a job to yarn it mistakenly tries to upload
files to local filesystem instead
ok solved it. as it happened in spark/conf i also had a file called
core.site.xml (with some tachyon related stuff in it) so thats why it
ignored /etc/hadoop/conf/core-site.xml
On Fri, Jun 20, 2014 at 3:24 PM, Koert Kuipers ko...@tresata.com wrote:
i put some logging statements
for development/testing i think its fine to run them side by side as you
suggested, using spark standalone. just be realistic about what size data
you can load with limited RAM.
On Fri, Jun 20, 2014 at 3:43 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
The ideal way to do that is to use a
still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark
standalone.
for example if i have an akka timeout setting that i would like to be
applied to every piece of the spark framework (so spark master, spark
workers, spark executor sub-processes, spark-shell, etc.). i used to do
it to HDFS and specify its location by exporting
its
location as SPARK_JAR.
Kevin Markey
On 06/19/2014 11:22 AM, Koert Kuipers wrote:
i am trying to understand how yarn-client mode works. i am not using
spark-submit, but instead launching a spark job from within my own
application
://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers ko...@tresata.com wrote:
db tsai,
if in yarn-cluster mode the driver runs inside yarn, how can you do a
rdd.collect and bring the results back to your application?
On Thu, Jun 19, 2014
Koert Kuipers ko...@tresata.com:
still struggling with SPARK_JAVA_OPTS being deprecated. i am using spark
standalone.
for example if i have an akka timeout setting that i would like to be
applied to every piece of the spark framework (so spark master, spark
workers, spark executor sub
, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:
from looking at the source code i see executors run in their own jvm
subprocesses.
how long do they live for? as long as the worker/slave? or are they tied
to the sparkcontext and life/die with it?
thx
:
They’re tied to the SparkContext (application) that launched them.
Matei
On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote:
from looking at the source code i see executors run in their own jvm
subprocesses.
how long do they live for? as long as the worker/slave
, Koert Kuipers ko...@tresata.com wrote:
just for my clarification: off heap cannot be java objects, correct? so
we are always talking about serialized off-heap storage?
On May 20, 2014 1:27 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
That's one the main motivation in using Tachyon
why does it need to be a local file? why not do some filter ops on the hdfs file
and save to hdfs, from where you can create an rdd?
you can read a small file in on the driver and use sc.parallelize to
turn it into an RDD
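(a minimal sketch of that suggestion; the file path is just a placeholder)

import scala.io.Source

val lines = Source.fromFile("/path/to/small-file.txt").getLines().toList
val rdd = sc.parallelize(lines)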
On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote:
I found that
from looking at the source code i see executors run in their own jvm
subprocesses.
how long do they live for? as long as the worker/slave? or are they tied to
the sparkcontext and life/die with it?
thx
i used to be able to get all tests to pass.
with java 6 and sbt i get PermGen errors (no matter how high i make the
PermGen). so i have given up on that.
with java 7 i see 1 error in a bagel test and a few in streaming tests. any
ideas? see the error in BagelSuite below.
[info] - large number
...@cloudera.com wrote:
Since the error concerns a timeout -- is the machine slowish?
What about blowing away everything in your local maven repo, do a
clean, etc. to rule out environment issues?
I'm on OS X here FWIW.
On Thu, May 15, 2014 at 5:24 PM, Koert Kuipers ko...@tresata.com wrote:
yeah
)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
On Thu, May 15, 2014 at 3:03 PM, Koert Kuipers ko...@tresata.com wrote:
when i set spark.files.userClassPathFirst=true, i get java
in writing my own RDD i ran into a few issues with respect to stuff being
private in spark.
in compute i would like to return an iterator that respects task killing
(as HadoopRDD does), but the mechanics for that are inside the private
InterruptibleIterator. also the exception i am supposed to
yeah sure. it is ubuntu 12.04 with jdk1.7.0_40
what else is relevant that i can provide?
On Thu, May 15, 2014 at 12:17 PM, Sean Owen so...@cloudera.com wrote:
FWIW I see no failures. Maybe you can say more about your environment, etc.
On Wed, May 7, 2014 at 10:01 PM, Koert Kuipers ko
(JavaSerializer.scala:60)
On Fri, May 16, 2014 at 1:46 PM, Koert Kuipers ko...@tresata.com wrote:
after removing all parameters of class Path from my code, i tried
again. different but related error when i set
spark.files.userClassPathFirst=true
now i dont even use FileInputFormat directly
by the child and somehow this means the companion objects are reset or
something like that because i get NPEs.
On Fri, May 16, 2014 at 3:54 PM, Koert Kuipers ko...@tresata.com wrote:
ok i think the issue is visibility: a classloader can see all classes
loaded by its parent classloader
val conf = new SparkConf().set("spark.akka.frameSize",
"1").setAppName(...).setMaster(...)
val sc = new SparkContext(conf)
- Patrick
On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers ko...@tresata.com wrote:
i have some settings that i think are relevant for my application. they
are
spark.akka settings so i assume they are relevant
i have some settings that i think are relevant for my application. they are
spark.akka settings so i assume they are relevant for both executors and my
driver program.
i used to do:
SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1
now this is deprecated. the alternatives mentioned are:
* some
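(a hedged sketch of one alternative, not necessarily the one the truncated list above named; the value here is only a placeholder: put the key in conf/spark-defaults.conf so spark-submit picks it up, or set it on the SparkConf as shown earlier in this archive)

spark.akka.frameSize 10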
resending... my email somehow never made it to the user list.
On Fri, May 9, 2014 at 2:11 PM, Koert Kuipers ko...@tresata.com wrote:
in writing my own RDD i ran into a few issues with respect to stuff being
private in spark.
in compute i would like to return an iterator that respects task
will do
On May 11, 2014 6:44 PM, Aaron Davidson ilike...@gmail.com wrote:
You got a good point there, those APIs should probably be marked as
@DeveloperAPI. Would you mind filing a JIRA for that (
https://issues.apache.org/jira/browse/SPARK)?
On Sun, May 11, 2014 at 11:51 AM, Koert Kuipers
yes it seems broken. i got only a few emails in last few days
On Fri, May 9, 2014 at 7:24 AM, wxhsdp wxh...@gmail.com wrote:
is there something wrong with the mailing list? very few people see my
thread
Hey Matei,
Not sure i understand that. These are 2 separate jobs. So the second job
takes advantage of the fact that there is map output left somewhere on disk
from the first job, and re-uses that?
On Sat, May 3, 2014 at 8:29 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
Hi Diana,
Apart
not work. I specified it in my email already. But I figured a way
around it by excluding akka dependencies
Shivani
On Tue, Apr 29, 2014 at 12:37 PM, Koert Kuipers ko...@tresata.com wrote:
you need to merge reference.conf files and its no longer an issue.
see the Build for spark itself
SparkContext.getRDDStorageInfo
On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth
andras.nem...@lynxanalytics.com wrote:
Hi,
Is it possible to know from code about an RDD if it is cached, and more
precisely, how many of its partitions are cached in memory and how many are
cached on disk? I
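(a minimal sketch using the getRDDStorageInfo call mentioned above, assuming an existing SparkContext sc and an RDD rdd whose status you want to inspect)

val info = sc.getRDDStorageInfo.find(_.id == rdd.id)
info.foreach { i =>
  println(s"cached ${i.numCachedPartitions} of ${i.numPartitions} partitions, " +
    s"${i.memSize} bytes in memory, ${i.diskSize} bytes on disk")
}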
you need to merge reference.conf files and its no longer an issue.
see the Build for spark itself:
case "reference.conf" => MergeStrategy.concat
On Tue, Apr 29, 2014 at 3:32 PM, Shivani Rao raoshiv...@gmail.com wrote:
Hello folks,
I was going to post this question to spark user group as
.
Thanks again for reporting this. I will push out a fix shortly.
Andrew
On Tue, Apr 8, 2014 at 1:30 PM, Koert Kuipers ko...@tresata.com wrote:
our one cached RDD in this run has id 3
*** onStageSubmitted **
rddInfo: RDD 2 (2) Storage: StorageLevel
Author: Andrew Or andrewo...@gmail.com
Closes #281 from andrewor14/ui-storage-fix and squashes the following
commits:
408585a [Andrew Or] Fix storage UI bug
On Mon, Apr 7, 2014 at 4:21 PM, Koert Kuipers ko...@tresata.com wrote:
got it thanks
On Mon, Apr 7, 2014 at 4:08 PM, Xiangrui
when i start spark-shell i now see
ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or
directory
we do not package a lib_managed with our spark build (never did). maybe the
logic in compute-classpath.sh that searches for datanucleus should check
for the existence of
note that for a cached rdd in the spark shell it all works fine. but
something is going wrong with the spark-shell in our applications that
extensively cache and re-use RDDs
On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote:
i tried again with latest master, which includes
sorry, i meant to say: note that for a cached rdd in the spark shell it all
works fine. but something is going wrong with the SPARK-APPLICATION-UI in
our applications that extensively cache and re-use RDDs
On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote:
note
in the
storage tab.
2) Did you run ./make-distribution.sh after you switched to the current
master?
Xiangrui
On Tue, Apr 8, 2014 at 9:33 AM, Koert Kuipers ko...@tresata.com wrote:
i tried again with latest master, which includes commit below, but ui
page
still shows nothing on storage tab
./make-distribution.sh to
re-compile Spark first. -Xiangrui
On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers ko...@tresata.com wrote:
sorry, i meant to say: note that for a cached rdd in the spark shell it
all works fine. but something is going wrong with the SPARK-APPLICATION-UI
in our
()
The storagelevels you see here are never the ones of my RDDs. and
apparently updateRDDInfo never gets called (i had println in there too).
On Tue, Apr 8, 2014 at 2:13 PM, Koert Kuipers ko...@tresata.com wrote:
yes i am definitely using latest
On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng
yet at same time i can see via our own api:
storageInfo: {
diskSize: 0,
memSize: 19944,
numCachedPartitions: 1,
numPartitions: 1
}
On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote:
i put some println statements in BlockManagerUI
.
Andrew
On Tue, Apr 8, 2014 at 11:26 AM, Koert Kuipers ko...@tresata.com wrote:
yet at same time i can see via our own api:
storageInfo: {
diskSize: 0,
memSize: 19944,
numCachedPartitions: 1,
numPartitions: 1
}
On Tue, Apr 8, 2014 at 2:25 PM
**
_rddInfoMap: Map()
On Tue, Apr 8, 2014 at 4:20 PM, Koert Kuipers ko...@tresata.com wrote:
1) at the end of the callback
2) yes we simply expose sc.getRDDStorageInfo to the user via REST
3) yes exactly. we define the RDDs at startup, all of them are cached.
from
any reason why RDDInfo suddenly became private in SPARK-1132?
we are using it to show users status of rdds
of the developer API. The particular PR that will change this is:
https://github.com/apache/spark/pull/274.
Cheers,
Andrew
On Mon, Apr 7, 2014 at 5:05 PM, Koert Kuipers ko...@tresata.com wrote:
any reason why RDDInfo suddenly became private in SPARK-1132?
we are using it to show users status
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]): RDD[(K, Int)] = {
  rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }
}
On Tue, Apr 1, 2014 at 4:55 PM, Daniel
i have found that i am unable to build/test spark with sbt and java6, but
using java7 works (and it compiles with java target version 1.6 so binaries
are usable from java 6)
On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan manku.ti...@outlook.comwrote:
Thanks for the reply. It turns out that
i dont know anything about arm clusters but it looks great. what are
the specs? the nodes have no local disk at all?
On Tue, Mar 18, 2014 at 10:36 PM, Chanwit Kaewkasi chan...@gmail.comwrote:
Hi all,
We are a small team doing a research on low-power (and low-cost) ARM
clusters. We built
on spark 1.0.0 SNAPSHOT this seems to work. at least so far i have seen no
issues yet.
On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers ko...@tresata.com wrote:
its 0.9 snapshot from january running in standalone mode.
have these fixes been merged into 0.9?
On Thu, Mar 6, 2014 at 12:45 AM
not that long ago there was a nice example on here about how to combine
multiple operations on a single RDD. so basically if you want to do a
count() and something else, how to roll them into a single job. i think
patrick wendell gave the examples.
i cant find them anymore patrick can you
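(a hedged sketch of the idea, not the original example from that thread: compute a count and a sum over an RDD[Long] named rdd in one job with aggregate, instead of running count() and a separate reduce as two jobs)

val (count, sum) = rdd.aggregate((0L, 0L))(
  (acc, x) => (acc._1 + 1, acc._2 + x),   // fold each element into the partition's accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)    // merge accumulators across partitions
)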
/apache/spark/pull/103
On Sun, Mar 9, 2014 at 8:40 PM, Koert Kuipers ko...@tresata.com wrote:
edit last line of sbt/sbt, after which i run:
sbt/sbt test
On Sun, Mar 9, 2014 at 10:24 PM, Sean Owen so...@cloudera.com wrote:
How are you specifying these args?
On Mar 9, 2014 8:55 PM, Koert
hello all,
i am observing a strange result. i have a computation that i run on a
cached RDD in spark-standalone. it typically takes about 4 seconds.
but when other RDDs that are not relevant to the computation at hand are
cached in memory (in same spark context), the computation takes 40 seconds
also be good to see what percent of each GC generation is used.
The concurrent mark-and-sweep GC -XX:+UseConcMarkSweepGC or the G1 GC in
Java 7 (-XX:+UseG1GC) might also avoid these pauses by GCing concurrently
with your application threads.
Matei
On Mar 10, 2014, at 3:18 PM, Koert Kuipers ko
i also noticed that jobs (with a new JobGroupId) which i run after this and
which use the same RDDs get very confused. i see lots of cancelled stages
and retries that go on forever.
On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers ko...@tresata.com wrote:
i have a running job that i cancel while
at 2:40 PM, Koert Kuipers ko...@tresata.com wrote:
SparkContext.cancelJobGroup
On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi mayur.rust...@gmail.comwrote:
How do you cancel the job. Which API do you use?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
If you need quick response re-use your spark context between queries and
cache rdds in memory
On Mar 3, 2014 12:42 AM, polkosity polkos...@gmail.com wrote:
Thanks for the advice Mayur.
I thought I'd report back on the performance difference... Spark
standalone
mode has executors processing
yes, tachyon is in-memory but serialized, which is not as fast as data cached
in memory in spark (not serialized). the difference really depends on your job
type.
On Mon, Mar 3, 2014 at 7:10 PM, polkosity polkos...@gmail.com wrote:
Thats exciting! Will be looking into that, thanks Andrew.
Related