On 12 June 2015 at 19:39, Ted Yu wrote:
> This is the SPARK JIRA which introduced the warning:
>
> [SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties
> in spark-shell and spark-submit
>
> On Fri, Jun 12, 2015 at 4:34 PM, Peng Cheng wrote:
>
>> Hi A
> "spark.". What is the config that you are trying to set?
>
> -Andrew
>
> 2015-06-12 11:17 GMT-07:00 Peng Cheng :
>
>> In Spark <1.3.x, the system property of the driver can be set by --conf
>> option, shared between setting spark properties and system properti
In Spark <1.3.x, a system property of the driver can be set with the --conf
option, which is shared between setting Spark properties and system properties.
In Spark 1.4.0 this feature is removed; the driver instead logs the following
warning:
Warning: Ignoring non-spark config property: xxx.xxx=v
How do s
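For anyone hitting the same warning, below is a rough, untested sketch of two ways to still get a JVM system property onto the driver (xxx.xxx=v is just the placeholder from the warning above):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("sys-prop-workaround")
  // In cluster deploy mode Spark launches the driver JVM, so the property can be
  // passed as a JVM flag instead of a bare --conf entry:
  .set("spark.driver.extraJavaOptions", "-Dxxx.xxx=v")

// In client mode the driver JVM is already running, so the property can simply
// be set before the SparkContext is created:
sys.props("xxx.xxx") = "v"

val sc = new SparkContext(conf)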
I stumbled upon this thread and conjecture that it may affect restoring a
checkpointed RDD as well:
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928
In my case I have 1600+ fragmented che
It turns out the above thread is unrelated: it was caused by using s3:// instead
of s3n://, which I had already avoided in my checkpointDir configuration.
Looks like this problem has been mentioned before:
http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2
and a temporary solution is to deploy on a dedicated EMR/S3 configuration.
I'll give that one a shot.
BTW: my thread dump of the driver's main thread looks like it is stuck waiting
for Amazon S3 bucket metadata for a long time (which may suggest that I should
move the checkpointing directory from S3 to HDFS):
Thread 1: main (RUNNABLE)
java.net.SocketInputStream.socketRead0(Native Method)
java.net
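If I do move it, the change itself is trivial; a rough sketch (the HDFS path is a placeholder):

// Point checkpointing at HDFS instead of the S3 bucket the main thread appears to be blocked on:
sc.setCheckpointDir("hdfs:///user/spark/checkpoints")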
I'm implementing one of my machine learning / graph analysis algorithms on
Apache Spark. The algorithm is very iterative (like most ML algorithms), but it
has a rather unusual workflow: first, a subset of all training data (called the
seeds RDD {S_1}) is randomly selected and loaded; in each iterati
I'm deploying a Spark data processing job on an EC2 cluster. The job is small
for the cluster (16 cores with 120G of RAM in total); the largest RDD has only
76k+ rows, but it is heavily skewed in the middle (thus requiring repartitioning)
and each row carries around 100k of data after serialization. The job alwa
I got exactly the same problem, except that I'm running on a standalone
master. Can you tell me the counterpart parameter on a standalone master for
increasing the same memory overhead?
I'm using Amazon EMR + S3 as my Spark cluster infrastructure. I'm running a job
with periodic checkpointing (it has a long dependency tree, so truncating it by
checkpointing is mandatory; each checkpoint has 320 partitions). The job stops
halfway, resulting in an exception (on the driver):
org.apache.s
I double-checked the 1.2 feature list and found out that the new sort-based
shuffle manager has nothing to do with HashPartitioner :-< Sorry for the
misinformation.
On the other hand, this may explain the increase in shuffle spill as a side
effect of the new shuffle manager; let me revert spark.shuffle.ma
Same problem here: shuffle write increased from 10G to over 64G. Since I'm
running on Amazon EC2, this always causes the temporary folder to consume all
the disk space. Still looking for a solution.
BTW, the 64G shuffle write is encountered when shuffling a pairRDD with a
HashPartitioner, so it's not related
You are right. I've checked the overall stage metrics and it looks like the
largest shuffle write is over 9G. The partition completed successfully,
but its spilled file can't be removed until all the others are finished.
It's very likely caused by a stupid mistake in my design: a lookup table
grows const
I'm running a small job on a cluster with 15G of memory and 8G of disk per
machine.
The job always gets into a deadlock where the last error message is:
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write
I got the same problem; maybe the Java serializer is unstable.
I'm talking about RDD1 (not persisted or checkpointed) in this situation:
...(somewhere) -> RDD1 -> RDD2
                   |       |
                   V       V
                  RDD3 -> RDD4 -> Action!
To my experience the change RDD1 get recalc
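A rough, untested sketch of that DAG (expensiveParse / transformA / transformB are placeholders producing compatible types), showing why RDD1's work runs more than once unless it is persisted:

val rdd1 = sc.textFile("input").map(expensiveParse)   // the un-persisted RDD1
val rdd2 = rdd1.map(transformA)
val rdd3 = rdd1.map(transformB)
val rdd4 = rdd2.union(rdd3)
rdd4.count()   // the Action: both branches re-evaluate rdd1's lineage
// calling rdd1.cache() (or checkpoint) before the action avoids the recalculation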
ribute the parameters. Haven't thought thru yet.
> Cheers
>
>
> On Fri, Jan 9, 2015 at 2:56 PM, Andrei wrote:
>
>> Does it make sense to use Spark's actor system (e.g. via
>> SparkContext.env.actorSystem) to create a parameter server?
>>
>> On Fri,
You are not the first :) and probably not even the fifth to have this question.
A parameter server is not included in the Spark framework, and I've seen all
kinds of hacks to improvise one: REST API, HDFS, Tachyon, etc.
Not sure if an 'official' benchmark & implementation will be released soon.
On 9 January 2015 a
I was under the impression that ALS wasn't designed for it :-< The famous
eBay online recommender uses SGD.
However, you can try using the previous model as a starting point and
gradually reduce the number of iterations after the model stabilizes. I have
never verified this idea, so you need to at least cross-
IMHO: cache doesn't provide redundancy, and it's in the same JVM, so it's much
faster.
Everything else is there except spark-repl. Can someone check that out this
weekend?
In my project I define a new RDD type that wraps another RDD and some
metadata. The code I use is similar to the FilteredRDD implementation:
case class PageRowRDD(
    self: RDD[PageRow],
    @transient keys: ListSet[KeyLike] = ListSet()
  ) extends RDD[PageRow](self) {

  override def getPartitions: Array[Partition] = self.partitions
Thanks a lot! Unfortunately this is not my problem: the Page class is already
in the jar that is shipped to every worker. (I've logged into the workers and
unpacked the jar files, and the class file is right there as intended.)
Also, this error only happens sporadically, not every time. The error was
s
Sorry, it's a timeout duplicate; please remove it.
I have a Spark application that serializes an object of type Seq[Page], saves
it to HDFS/S3, and reads it back in another worker to be used elsewhere. The
serialization and deserialization use the same serializer as Spark itself
(obtained from SparkEnv.get.serializer.newInstance()).
However I sporadically get this erro
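A minimal sketch of the round trip (Page is my own class; saving and loading the raw bytes to HDFS/S3 is left out, this only shows the serializer usage):

import java.nio.ByteBuffer
import org.apache.spark.SparkEnv

def write(pages: Seq[Page]): ByteBuffer =
  SparkEnv.get.serializer.newInstance().serialize(pages)           // same serializer as Spark

def read(bytes: ByteBuffer): Seq[Page] =
  SparkEnv.get.serializer.newInstance().deserialize[Seq[Page]](bytes)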
Looks like the only way is to implement that feature; there is no way of
hacking it into working.
Any suggestions? I'm thinking of submitting a feature request for a mutable
broadcast. Is it doable?
While Spark already offers support for an asynchronous reduce (collecting data
from workers without interrupting the execution of a parallel transformation)
through accumulators, I have made little progress on making this process
reciprocal, namely broadcasting data from the driver to workers to be used by
al
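For context, this is the accumulator-based 'asynchronous reduce' I mean, as a rough sketch (rdd and process are placeholders):

// Workers add to the accumulator as a side effect of a normal transformation;
// no separate collect stage is needed, and the driver reads the value afterwards.
val seen = sc.accumulator(0L, "rows-seen")
val result = rdd.map { row => seen += 1L; process(row) }
result.count()
println(seen.value)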
Hi Sandeep,
would you be interested in joining my open source project?
https://github.com/tribbloid/spookystuff
IMHO Spark is indeed not meant for general-purpose crawling, where the
distributed jobs are highly homogeneous, but it is good enough for directional
scraping, which involves heterogeneous input and
Unfortunately, after some research I found it's just a side effect of how a
closure containing a var works in Scala:
http://stackoverflow.com/questions/11657676/how-does-scala-maintains-the-values-of-variable-when-the-closure-was-defined
the closure keeps referring to the broadcast wrapper var through a pointer,
Yeah, thanks a lot. I know that for people who understand lazy execution this
seems straightforward, but for those who don't it may become a liability.
I've only tested its stability on a small example (which seems stable);
hopefully it's not just a fluke. Can a committer confirm this?
Yours, Peng
-
I'm curious to see that if you declare the broadcast wrapper as a var and
overwrite it in the driver program, the modification has a stable impact
on all transformations/actions defined BEFORE the overwrite but executed
lazily AFTER the overwrite:
val a = sc.parallelize(1 to 10)
var
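Since the snippet above got cut off, here is roughly the full experiment (an untested reconstruction):

val a = sc.parallelize(1 to 10)
var b = sc.broadcast(1)
val mapped = a.map(_ + b.value)   // defined BEFORE the overwrite, not yet executed

b = sc.broadcast(2)               // overwrite the var in the driver
mapped.collect()                  // executed lazily AFTER the overwrite
// the closure captured the var itself, so the result reflects the second broadcast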
That would be really cool with IPython, but I'm still wondering if all
language features are supported; I need these 2 in particular:
1. importing classes and an ILoop from external jars (so I can point it to
SparkILoop or the Spark-binding ILoop of Apache Mahout instead of Scala's default
ILoop)
2. im
This will be handy for demos and quick prototyping, as the command-line REPL
doesn't support a lot of editor features; also, you don't need to ssh into
your worker/master if your client is behind a NAT wall. Since the Spark
codebase has a minimalistic design philosophy, I don't think this component
can m
I give up; communication must be blocked by the complex EC2 network topology
(though the error message indeed needs some improvement). It doesn't make
sense to run a client thousands of miles away that communicates frequently with
workers. I have moved everything to EC2 now.
Expanded to 4 nodes and changed the workers to listen on their public DNS, but it
still shows the same error (which is obviously wrong). I can't believe I'm the
first to encounter this issue.
Totally agree; also, there is a class 'SparkSubmit' you can call directly to
replace the shell script.
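Something along these lines, as a rough sketch (the class name, master URL and jar path are placeholders):

org.apache.spark.deploy.SparkSubmit.main(Array(
  "--master", "spark://master-host:7077",
  "--class", "com.example.MyApp",
  "/path/to/myapp-assembly.jar"
))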
I'm running a very small job (16 partitions, 2 stages) on a 2-node cluster,
each node with 15G of memory. The master page looks all normal:
URL: spark://ec2-54-88-40-125.compute-1.amazonaws.com:7077
Workers: 1
Cores: 2 Total, 2 Used
Memory: 13.9 GB Total, 512.0 MB Used
Applications: 1 Running, 0 Completed
Sorry, I just realized that start-slave is for a different task. Please close
this.
I'm deploying a cluster to Amazon EC2, trying to override its internal IP
addresses with public DNS names.
I started the cluster with the environment parameter SPARK_PUBLIC_DNS=[my EC2
public DNS], but it doesn't change anything on the web UI; it still shows the
internal IP address:
Spark Master at spark://ip-172-3
Has anyone encountered this situation?
Also, I'm very sure my slave and master are in the same security group, with
port 7077 open.
I'm afraid persisting a connection across two tasks is a dangerous act, as they
can't be guaranteed to be executed on the same machine. Your ES server may
think it's a man-in-the-middle attack!
I think it's possible to invoke a static method that gives you a connection from
a local 'pool', so nothing will
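Something like this rough sketch (EsClient / newEsClient are placeholders for whatever ES client is used); the singleton lives once per executor JVM, so tasks on the same machine reuse the connection without it ever being serialized:

object LocalPool {
  lazy val client: EsClient = newEsClient()   // opened at most once per executor JVM
}

rdd.foreachPartition { rows =>
  val c = LocalPool.client                    // resolved on the executor, never shipped from the driver
  rows.foreach(r => c.index(r))
}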
I got a 'NoSuchFieldError', which is of the same type. It's definitely a
dependency jar conflict: the Spark driver loads its own jars first, which in
recent versions pull in many dependencies that are 1-2 years old. And if your
newer dependency is in the same package it will be shadowed (Java's
first-come
I've read somewhere that in 1.0 there is a bash tool called 'spark-config.sh'
that allows you to propagate your config files to a number of master and
slave nodes. However, I haven't used it myself.
Make sure all queries are called through class methods and wrap your query
info in a class having only simple properties (strings, collections, etc.).
If you can't find such a wrapper you can also use the SerializableWritable
wrapper out of the box, but it's not recommended (it's a developer API and makes
fat cl
I encountered the same problem with hadoop.fs.Configuration (a very complex,
unserializable class).
Basically, if your closure contains any instance (not a constant
object/singleton! those are in the jar, not the closure) that doesn't inherit
Serializable, or whose properties don't inherit Serializable, you a
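A rough illustration of both suggestions (QueryInfo and doSomething are made-up names; wrapping the hadoop Configuration in SerializableWritable is, I believe, similar to what Spark does internally):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SerializableWritable

// 1. keep only simple, serializable fields in the closure:
case class QueryInfo(table: String, columns: Seq[String])

// 2. or wrap the whole (unserializable) Configuration and broadcast it:
val wrappedConf = sc.broadcast(new SerializableWritable(new Configuration()))
rdd.map { row =>
  val hadoopConf: Configuration = wrappedConf.value.value
  doSomething(row, hadoopConf)
}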
I haven't set up a passwordless login from the slave to the master node yet (I
was under the impression that this is not necessary since they communicate
using port 7077).
I'm trying to link a Spark slave to an already-set-up master, using:
$SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-32-12:7077
However the result shows that it cannot open a log file it is supposed to
create:
failed to launch org.apache.spark.deploy.worker.Worker:
tail: cannot open
'/opt/spa
Right, problem solved in a most disgraceful manner: just add a package
relocation in the maven-shade config.
The downside is that it is not compatible with my IDE (IntelliJ IDEA); it will
cause:
Error:scala.reflect.internal.MissingRequirementError: object scala.runtime
in compiler mirror not found.: objec
The JVM will quit after spending most of its time on GC (about 95%), but usually
before that you have to wait a long time, particularly if your job is
already at massive scale.
Since it is hard to run profiling online, maybe it's easier for debugging if
you make a lot of partitions (so you can watc
Hi Sean,
OK, I'm about 90% sure about the cause of this problem: just another classic
dependency conflict:
Myproject -> Selenium -> apache.httpcomponents:httpcore 4.3.1 (has
ContentType)
Spark -> Spark SQL Hive -> Hive -> Thrift -> apache.httpcomponents:httpcore
4.1.3 (has no ContentType)
Though I
I also found that any buggy application submitted with --deploy-mode cluster
will crash the worker (turning its status to 'DEAD'). This shouldn't really
happen; otherwise nobody would use this mode. It is still unclear whether all
workers crash or only the one running the driver (as I only hav
Latest advancement:
I found the cause of the NoClassDef exception: I wasn't using spark-submit;
instead I tried to run the Spark application directly, with SparkConf set in
the code (this is handy for local debugging). However the old problem
remains: even my maven-shade plugin doesn't give any warning
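For reference, the 'SparkConf set in the code' approach looks roughly like this (the jar path and master URL are placeholders); the catch is that it bypasses spark-submit's classpath handling:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("debug-run")
  .setMaster("local[8]")                        // or spark://host:7077 for a real cluster
  .setJars(Seq("target/myapp-assembly.jar"))    // spark-submit would normally ship this for you
val sc = new SparkContext(conf)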
Indeed, I see a lot of duplicate-package warnings in the maven-shade assembly
output, so I tried to eliminate them:
First I set the scope of the apache-spark dependency to 'provided', as suggested
on this page:
http://spark.apache.org/docs/latest/submitting-applications.html
But the Spark master gav
Thanks a lot! Let me check my maven shade plugin config and see if there is a
fix
I have a Spark application that runs perfectly in local mode with 8 threads,
but when deployed on a single-node cluster it gives the following error:
ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.42.202: Uncaught exception
Spark assembly has been built with Hive, including Datanucleus jars on
Wow, that sounds like a lot of work (it needs a mini-thread); thanks a lot for the
answer.
It might be a nice-to-have feature.
I've tried enabling speculative execution; this seems to have partially solved
the problem. However, I'm not sure if it can handle large-scale situations, as
it only starts when 75% of the job is finished.
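For reference, these are the knobs I mean, as a rough sketch (0.75 is the default threshold, and it can be lowered):

val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.quantile", "0.5")      // start speculating once 50% of tasks have finished
  .set("spark.speculation.multiplier", "1.5")    // a task must be 1.5x slower than the median to qualify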
My transformations or actions have some external toolset dependencies, and
sometimes they just get stuck somewhere and there is no way I can fix them. If I
don't want the job to run forever, do I need to implement several monitor
threads that throw an exception when they get stuck? Or the framework can alrea
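The closest thing I can think of is wrapping the external-tool call in a future with a timeout, so a stuck call fails the task and Spark's normal retry kicks in; a rough, untested sketch (runExternalTool is a placeholder):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

def withTimeout[T](limit: FiniteDuration)(body: => T): T =
  Await.result(Future(body), limit)   // throws a TimeoutException if body hangs

rdd.map(row => withTimeout(10.minutes) { runExternalTool(row) })
// note: the stuck call itself is not killed; only the task fails and gets retried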
that right away
rather than retrying the task several times and having them worry about why
they get so many errors.
Matei
On Jun 9, 2014, at 11:28 AM, Peng Cheng wrote:
Oh, and to make things worse, they forgot '\*' in their regex.
Am I the first to encounter this problem?
Oh, and to make things worse, they forgot '\*' in their regex.
Am I the first to encounter this problem?
On Mon 09 Jun 2014 02:24:43 PM EDT, Peng Cheng wrote:
Thanks a lot! That's very responsive; somebody has definitely encountered the
same problem before and added two
when running in
local mode. Not exactly sure why, feel free to submit a JIRA on this.
On Mon, Jun 9, 2014 at 8:59 AM, Peng Cheng <pc...@uow.edu.au> wrote:
I speculate that Spark will only retry on exceptions that are
registered with
TaskSetScheduler, so a definitely-wil
I think these failed tasks must have been retried automatically if you can't see
any errors in your results; otherwise the entire application would throw a
SparkException and abort.
Unfortunately I don't know how to do this; my application always aborts.
I speculate that Spark will only retry on exceptions that are registered with
TaskSetScheduler, so a definitely-will-fail task will fail quickly without
taking more resources. However I haven't found any documentation or web page
on it
I wasn't using Spark SQL before.
But by default Spark should retry the failing task 4 times;
I'm curious why it aborted after 1 failure.
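For reference, the retry count I mean is spark.task.maxFailures (4 by default), e.g.:

val conf = new SparkConf().set("spark.task.maxFailures", "8")   // allow up to 8 failures per task before aborting the job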