.. What is the config that you are trying to set?
-Andrew
2015-06-12 11:17 GMT-07:00 Peng Cheng pc...@uow.edu.au:
In Spark 1.3.x, the system property of the driver can be set by --conf
option, shared between setting spark properties and system properties.
In Spark 1.4.0 this feature
2015 at 19:39, Ted Yu yuzhih...@gmail.com wrote:
This is the SPARK JIRA which introduced the warning:
[SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties
in spark-shell and spark-submit
On Fri, Jun 12, 2015 at 4:34 PM, Peng Cheng rhw...@gmail.com wrote:
Hi Andrew
In Spark 1.3.x, a system property of the driver can be set with the --conf
option, which is shared between Spark properties and system properties.
In Spark 1.4.0 this feature is removed; the driver instead logs the following
warning:
Warning: Ignoring non-spark config property: xxx.xxx=v
How do
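For reference, the check behind this warning appears to be a plain prefix test on the property key. The following is only a sketch of that behavior in plain Scala, not Spark's actual code: the object and method names are made up for illustration, and the "spark." prefix rule is an assumption based on the description of SPARK-7037.

```scala
object NonSparkPropertyFilter {
  // Hypothetical helper: splits --conf style pairs into the ones spark-submit
  // would keep (keys starting with "spark.") and the ones it would ignore.
  def split(props: Map[String, String]): (Map[String, String], Map[String, String]) =
    props.partition { case (key, _) => key.startsWith("spark.") }

  def main(args: Array[String]): Unit = {
    val (kept, ignored) =
      split(Map("spark.executor.memory" -> "2g", "xxx.xxx" -> "v"))
    // Mimic the warning format quoted in the message above.
    ignored.foreach { case (k, v) =>
      println(s"Warning: Ignoring non-spark config property: $k=$v")
    }
    println(s"Kept: $kept")
  }
}
```

As far as I know, a driver system property can still be passed as a JVM option, for example through spark-submit's --driver-java-options or the spark.driver.extraJavaOptions property.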
I stumbled upon this thread and I conjecture that this may affect restoring a
checkpointed RDD as well:
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928
In my case I have 1600+ fragmented
Looks like this problem has been mentioned before:
http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2
and a temporary solution is to deploy on a dedicated EMR/S3 configuration.
I'll give that one a shot.
--
View this message in context:
Turns out the above thread is unrelated: it was caused by using s3:// instead
of s3n://, which I had already avoided in my checkpointDir configuration.
BTW: the thread dump of my driver's main thread looks like it is stuck
waiting for Amazon S3 bucket metadata for a long time (which may suggest
that I should move the checkpoint directory from S3 to HDFS):
Thread 1: main (RUNNABLE)
java.net.SocketInputStream.socketRead0(Native Method)
I'm deploying a Spark data processing job on an EC2 cluster. The job is small
for the cluster (16 cores with 120G RAM in total): the largest RDD has only
76k+ rows, but it is heavily skewed in the middle (and thus requires repartitioning)
and each row has around 100k of data after serialization. The job
I got exactly the same problem, except that I'm running on a standalone
master. Can you tell me the counterpart parameter on a standalone master for
increasing the same memory overhead?
I'm using Amazon EMR + S3 as my Spark cluster infrastructure. I'm
running a job with periodic checkpointing (it has a long dependency tree, so
truncating by checkpointing is mandatory; each checkpoint has 320
partitions). The job stops halfway, resulting in an exception:
(On driver)
I double-checked the 1.2 feature list and found out that the new sort-based
shuffle manager has nothing to do with HashPartitioner :- Sorry for the
misinformation.
On the other hand, this may explain the increase in shuffle spill as a side effect
of the new shuffle manager, let me revert
Same problem here: shuffle write increased from 10G to over 64G. Since I'm
running on Amazon EC2, this always causes the temporary folder to consume all the
disk space. Still looking for a solution.
BTW, the 64G shuffle write is encountered on shuffling a pairRDD with
HashPartitioner, so it's not
You are right. I've checked the overall stage metrics and looks like the
largest shuffling write is over 9G. The partition completed successfully
but its spilled file can't be removed until all others are finished.
It's very likely caused by a stupid mistake in my design. A lookup table
grows
I'm running a small job on a cluster with 15G of mem and 8G of disk per
machine.
The job always gets into a deadlock where the last error message is:
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at
I got the same problem, maybe the Java serializer is unstable
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-unread-block-data-tp20668p21463.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I'm talking about RDD1 (not persisted or checkpointed) in this situation:
...(somewhere) -> RDD1 -> RDD2
                   |       |
                   V       V
                  RDD3 -> RDD4 -> Action!
To my experience the change RDD1 get
to distribute the parameters. Haven't thought thru yet.
Cheers
k/
On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote:
Does it make sense to use Spark's actor system (e.g. via
SparkContext.env.actorSystem) to create a parameter server?
On Fri, Jan 9, 2015 at 10:09 PM, Peng
You are not the first :) and probably not the fifth to have this question.
A parameter server is not included in the Spark framework, and I've seen all kinds
of hacks to improvise one: REST API, HDFS, Tachyon, etc.
Not sure if an 'official' benchmark implementation will be released soon
On 9 January 2015
I was under the impression that ALS wasn't designed for it :- The famous
eBay online recommender uses SGD.
However, you can try using the previous model as a starting point, and
gradually reduce the number of iterations after the model stabilizes. I have never
verified this idea, so you need to at least
Everything else is there except spark-repl. Can someone check that out this
weekend?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-repl-1-2-0-was-not-uploaded-to-central-maven-repository-tp20799.html
IMHO: cache doesn't provide redundancy, and it's in the same JVM, so it's much
faster.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Tachyon-tp1463p20800.html
In my project I extend a new RDD type that wraps another RDD and some
metadata. The code I use is similar to FilteredRDD implementation:
case class PageRowRDD(
    self: RDD[PageRow],
    @transient keys: ListSet[KeyLike] = ListSet()
){
  override def getPartitions:
I have a Spark application that serializes an object 'Seq[Page]' and saves it to
HDFS/S3, where it is read by another worker to be used elsewhere. The serialization
and deserialization use the same serializer as Spark itself (read from
SparkEnv.get.serializer.newInstance()).
However I sporadically get this
Sorry, it's a timeout duplicate, please remove it
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-sure-a-ClassPath-is-always-shipped-to-workers-tp18018p18020.html
Looks like the only way is to implement that feature. There is no way of
hacking it into working
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p16985.html
Any suggestions? I'm thinking of submitting a feature request for mutable
broadcast. Is it doable?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p15807.html
While Spark already offers support for asynchronous reduce (collect data from
workers, while not interrupting execution of a parallel transformation)
through accumulator, I have made little progress on making this process
reciprocal, namely, to broadcast data from driver to workers to be used by
Hi Sandeep,
would you be interested in joining my open source project?
https://github.com/tribbloid/spookystuff
IMHO Spark is indeed not meant for general-purpose crawling, whose distributed
jobs are highly homogeneous, but it is good enough for directional scraping, which
involves heterogeneous input and
Unfortunately, after some research I found it's just a side effect of how
closures containing a var work in Scala:
http://stackoverflow.com/questions/11657676/how-does-scala-maintains-the-values-of-variable-when-the-closure-was-defined
the closure keeps referring to the var broadcasted wrapper as a pointer,
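The behavior is easy to reproduce without Spark. Here is a minimal sketch in plain Scala (the object and variable names are mine, purely for illustration): the closure holds a reference to the var itself rather than a copy of its value, so it sees whatever the var contains when the closure finally runs.

```scala
object ClosureVarDemo {
  // Returns the closure's result before and after the captured var is reassigned.
  def run(): (Int, Int) = {
    var factor = 2
    // The closure captures the variable `factor`, not the value 2.
    val scale: Int => Int = x => x * factor
    val before = scale(3)
    factor = 10 // reassigned AFTER the closure was defined
    val after = scale(3)
    (before, after)
  }

  def main(args: Array[String]): Unit =
    println(run()) // prints (6,30)
}
```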
I'm curious to see that if you declare the broadcasted wrapper as a var and
overwrite it in the driver program, the modification has a stable impact
on all transformations/actions defined BEFORE the overwrite but executed
lazily AFTER the overwrite:
val a = sc.parallelize(1 to 10)
var
Yeah, thanks a lot. I know that for people who understand lazy execution this seems
straightforward, but for those who don't it may become a liability.
I've only tested its stability on a small example (which seems stable),
hopefully it's not a coincidence. Can a committer confirm this?
Yours Peng
I give up, communication must be blocked by the complex EC2 network topology
(though the error information indeed needs some improvement). It doesn't make
sense to run a client thousands of miles away that communicates frequently with
workers. I have moved everything to EC2 now.
This will be handy for demos and quick prototyping, as the command-line REPL
doesn't support a lot of editor features. Also, you don't need to ssh into
your worker/master if your client is behind a NAT. Since the Spark
codebase has a minimalistic design philosophy I don't think this component
can
That would be really cool with IPython, but I'm still wondering if all
language features are supported. Namely, I need these 2 in particular:
1. importing class and ILoop from external jars (so I can point it to
SparkILoop or Sparkbinding ILoop of Apache Mahout instead of Scala's default
ILoop)
2.
Sorry, I just realized that start-slave is for a different task. Please close
this
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-information-tp8203p8246.html
I'm running a very small job (16 partitions, 2 stages) on a 2-node cluster,
each node with 15G memory. The master page looks all normal:
URL: spark://ec2-54-88-40-125.compute-1.amazonaws.com:7077
Workers: 1
Cores: 2 Total, 2 Used
Memory: 13.9 GB Total, 512.0 MB Used
Applications: 1 Running, 0
Totally agree. Also, there is a class 'SparkSubmit' you can call directly to
replace the shell script
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-as-web-app-backend-tp8163p8248.html
Expanded to 4 nodes and changed the workers to listen on public DNS, but it still
shows the same error (which is obviously wrong). I can't believe I'm the
first to encounter this issue.
I'm trying to link a spark slave with an already-setup master, using:
$SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-32-12:7077
However the result shows that it cannot open a log file it is supposed to
create:
failed to launch org.apache.spark.deploy.worker.Worker:
tail: cannot open
I haven't set up a passwordless login from the slave to the master node yet (I was
under the impression that this is not necessary since they communicate using
port 7077)
Make sure all queries are called through class methods, and wrap your query
info in a class having only simple properties (strings, collections etc).
If you can't find such a wrapper you can also use the SerializableWritable wrapper
out-of-the-box, but it's not recommended. (developer-api and make fat
I've read somewhere that in 1.0 there is a bash tool called 'spark-config.sh'
that allows you to propagate your config files to a number of master and
slave nodes. However, I haven't used it myself
I got 'NoSuchFieldError', which is of the same type. It's definitely a
dependency jar conflict. The Spark driver loads its own jars, which in
recent versions include many dependencies that are 1-2 years old. If your
newer dependency version is in the same package, it will be shadowed (Java's
first come
I'm afraid persisting a connection across two tasks is dangerous, as they
can't be guaranteed to be executed on the same machine. Your ES server may
think it's a man-in-the-middle attack!
I think it's possible to invoke a static method that gives you a connection in
a local 'pool', so nothing
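Something along these lines, as a rough sketch: a Scala singleton object is initialized at most once per JVM, so each executor creates its own connection lazily instead of receiving a serialized one from the driver. Connection here is a hypothetical stand-in for a real client class, not an actual ES API, and the object name is made up.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LocalConnectionPool {
  // Hypothetical stand-in for a real client connection (e.g. to ES).
  final class Connection(val id: Int) {
    def send(msg: String): String = s"conn-$id sent: $msg"
  }

  private val created = new AtomicInteger(0)

  // lazy val: built on first access in each JVM, then reused by later calls.
  lazy val connection: Connection = new Connection(created.incrementAndGet())

  // Number of connections this JVM has created so far.
  def connectionsCreated: Int = created.get()

  def main(args: Array[String]): Unit = {
    println(connection.send("a"))
    println(connection.send("b"))
    println(connectionsCreated)
  }
}
```

Inside a task you would call LocalConnectionPool.connection (for example from mapPartitions) rather than capturing a connection object in the closure.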
I'm deploying a cluster to Amazon EC2 and trying to override its internal IP
addresses with public DNS names.
I start the cluster with the environment parameter SPARK_PUBLIC_DNS=[my EC2
public DNS],
but it doesn't change anything on the web UI; it still shows the internal IP
address:
Spark Master at
Right, problem solved in a most disgraceful manner: just add a package
relocation in the maven shade config.
The downside is that it is not compatible with my IDE (IntelliJ IDEA), where it
causes:
Error:scala.reflect.internal.MissingRequirementError: object scala.runtime
in compiler mirror not found.:
Thanks a lot! Let me check my maven shade plugin config and see if there is a
fix
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064p8073.html
Indeed I see a lot of duplicate package warnings in the maven-shade assembly
package output, so I tried to eliminate them:
First I set scope of dependency to apache-spark to 'provided', as suggested
in this page:
http://spark.apache.org/docs/latest/submitting-applications.html
But spark master
Latest Advancement:
I found the cause of NoClassDef exception: I wasn't using spark-submit,
instead I tried to run the spark application directly with SparkConf set in
the code. (this is handy in local debugging). However the old problem
remains: Even my maven-shade plugin doesn't give any warning
I also found that any buggy application submitted with --deploy-mode cluster
will crash the worker (turning its status to 'DEAD'). This shouldn't really
happen, otherwise nobody will use this mode. It is not yet clear whether all
workers will crash or only the one running the driver will (as I only
Hi Sean,
OK I'm about 90% sure about the cause of this problem: Just another classic
Dependency conflict:
Myproject -> Selenium -> apache.httpcomponents:httpcore 4.3.1 (has ContentType)
Spark -> Spark SQL Hive -> Hive -> Thrift -> apache.httpcomponents:httpcore 4.1.3 (has no ContentType)
Though I
I've tried enabling speculative jobs, and this seems to have partially solved the
problem. However, I'm not sure if it can handle large-scale situations, as it
only starts when 75% of the job is finished.
My transformations or actions have some external tool set dependencies, and
sometimes they just get stuck somewhere and there is no way I can fix them. If I
don't want the job to run forever, do I need to implement several monitor
threads to throw an exception when they get stuck? Or the framework can
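One way to do the monitoring part without hand-written watchdog threads is to run the external call in a Future and wait on it with a deadline. A minimal sketch in plain Scala, with hypothetical names (TimeoutGuard, withTimeout); note that the timeout only stops the waiting, while the stuck call itself keeps running on its own thread.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object TimeoutGuard {
  // Runs `body` on the global execution context and throws
  // java.util.concurrent.TimeoutException if it has not finished within `limit`.
  def withTimeout[T](limit: FiniteDuration)(body: => T): T =
    Await.result(Future(body), limit)

  def main(args: Array[String]): Unit = {
    // A call that finishes in time returns its result normally.
    println(withTimeout(1.second) { 21 * 2 })
    // A call that hangs past the limit surfaces as an exception instead.
    try {
      withTimeout(100.millis) { Thread.sleep(2000); "never returned in time" }
    } catch {
      case _: java.util.concurrent.TimeoutException => println("timed out")
    }
  }
}
```

Throwing from inside the task this way lets the usual task failure handling take over instead of leaving the job waiting forever.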
I wasn't using Spark SQL before.
But by default Spark should retry on the exception 4 times.
I'm curious why it aborted after 1 failure
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-spark-sql-saveAsParquetFile-Error-tp7006p7252.html
I speculate that Spark will only retry on exceptions that are registered with
TaskSetScheduler, so a definitely-will-fail task will fail quickly without
taking more resources. However I haven't found any documentation or web page
on it
I think these failed tasks must have been retried automatically if you can't see any
error in your results. Otherwise the entire application will throw a
SparkException and abort.
Unfortunately I don't know how to do this, my application always aborts.
on this.
On Mon, Jun 9, 2014 at 8:59 AM, Peng Cheng pc...@uow.edu.au wrote:
I speculate that Spark will only retry on exceptions that are registered with
TaskSetScheduler, so a definitely-will-fail task will fail quickly without
taking more resources
Oh, and to make things worse, they forgot '\*' in their regex.
Am I the first to encounter this problem?
On Mon 09 Jun 2014 02:24:43 PM EDT, Peng Cheng wrote:
Thanks a lot! That's very responsive, somebody definitely has
encountered the same problem before, and added two hidden modes
that right away
rather than retrying the task several times and having them worry about why
they get so many errors.
Matei
On Jun 9, 2014, at 11:28 AM, Peng Cheng pc...@uowmail.edu.au wrote:
Oh, and to make things worse, they forgot '\*' in their regex.
Am I the first to encounter