Re: [Spark 1.4.0] How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
… What is the config that you are trying to set? -Andrew 2015-06-12 11:17 GMT-07:00 Peng Cheng pc...@uow.edu.au: In Spark 1.3.x, the system property of the driver can be set by the --conf option, which is shared between setting Spark properties and system properties. In Spark 1.4.0 this feature

Re: [Spark 1.4.0] How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
2015 at 19:39, Ted Yu yuzhih...@gmail.com wrote: This is the SPARK JIRA which introduced the warning: [SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties in spark-shell and spark-submit On Fri, Jun 12, 2015 at 4:34 PM, Peng Cheng rhw...@gmail.com wrote: Hi Andrew

[Spark 1.4.0] How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
In Spark 1.3.x, the system property of the driver could be set by the --conf option, which was shared between setting Spark properties and system properties. In Spark 1.4.0 this feature is removed; the driver instead logs the following warning: Warning: Ignoring non-spark config property: xxx.xxx=v How do
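
The workaround that generally works (hedged: the exact flag depends on deploy mode) is to pass the system property through the driver's JVM options rather than a bare --conf, e.g. spark-submit --conf "spark.driver.extraJavaOptions=-Dxxx.xxx=v" (or --driver-java-options "-Dxxx.xxx=v" in client mode), and read it back with sys.props. A minimal sketch, with the property name taken from the warning above:

    import org.apache.spark.{SparkConf, SparkContext}

    object SysPropCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sys-prop-check"))
        // The -D flag lands in the driver JVM's system properties, not in SparkConf:
        println("xxx.xxx = " + sys.props.getOrElse("xxx.xxx", "<not set>"))
        sc.stop()
      }
    }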

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2015-05-21 Thread Peng Cheng
I stumbled upon this thread and I conjecture that this may affect restoring a checkpointed RDD as well: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928 In my case I have 1600+ fragmented

Re: Union of checkpointed RDD in Apache Spark has long (>10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Looks like this problem has been mentioned before: http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2 and a temporary solution is to deploy on a dedicated EMR/S3 configuration. I'll give that one a shot. -- View this message in context:

Re: Union of checkpointed RDD in Apache Spark has long (>10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Turns out the above thread is unrelated: it was caused by using s3:// instead of s3n://, which I had already avoided in my checkpointDir configuration. -- View this message in context:
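
For reference, a minimal sketch of the distinction discussed in this thread (bucket and paths are hypothetical, and sc is a spark-shell context):

    sc.setCheckpointDir("s3n://my-bucket/spark-checkpoints") // s3n://, not s3://
    // or sidestep S3 metadata latency entirely by checkpointing to HDFS:
    // sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")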

Re: Union of checkpointed RDD in Apache Spark has long (>10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
BTW: my thread dump of the driver's main thread looks like it is stuck waiting for Amazon S3 bucket metadata for a long time (which may suggest that I should move the checkpointing directory from S3 to HDFS): Thread 1: main (RUNNABLE) java.net.SocketInputStream.socketRead0(Native Method)

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-04-24 Thread Peng Cheng
I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores with 120G RAM in total); the largest RDD has only 76k+ rows, but it is heavily skewed in the middle (thus requiring repartitioning) and each row has around 100k of data after serialization. The job
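
This exception usually surfaces when an executor holding shuffle map outputs is lost (often after running out of memory on a skewed partition). A hedged sketch of the repartitioning mentioned above, with illustrative numbers that are not from the thread:

    val rows = sc.parallelize(1 to 76000)   // stand-in for the 76k-row RDD
    val balanced = rows.repartition(128)    // full shuffle; spreads the heavy rows evenly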

Re: Spark Performance on Yarn

2015-04-20 Thread Peng Cheng
I got exactly the same problem, except that I'm running on a standalone master. Can you tell me the counterpart parameter on a standalone master for increasing the same memory overhead? -- View this message in context:
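
To my knowledge standalone mode has no direct counterpart to spark.yarn.executor.memoryOverhead; the usual move is to raise spark.executor.memory while leaving headroom below the worker's total. A sketch with an illustrative value:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "6g") // hypothetical; keep it below the worker's advertised memory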

How to avoid “Invalid checkpoint directory” error in apache Spark?

2015-04-17 Thread Peng Cheng
I'm using Amazon EMR + S3 as my Spark cluster infrastructure. When I run a job with periodic checkpointing (it has a long dependency tree, so truncating by checkpointing is mandatory; each checkpoint has 320 partitions), the job stops halfway, resulting in an exception: (On driver)
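
A hedged sketch of the periodic-checkpointing pattern described above (the cadence and the per-step transformation are stand-ins, not the original job):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    var rdd = sc.parallelize(1 to 1000)
    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)       // one step of the long dependency tree
      if (i % 10 == 0) {
        rdd.checkpoint()         // marks the RDD for checkpointing...
        rdd.count()              // ...which the next action materializes, truncating lineage
      }
    }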

Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
I double-checked the 1.2 feature list and found out that the new sort-based shuffle manager has nothing to do with HashPartitioner :-( Sorry for the misinformation. On the other hand, this may explain the increase in shuffle spill as a side effect of the new shuffle manager; let me revert
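
For anyone wanting to test the same hypothesis: the revert is a one-line config change, since Spark 1.2 switched the default shuffle manager from hash to sort.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.manager", "hash") // pre-1.2 behavior; "sort" is the 1.2 default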

Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
Same problem here: shuffle write increased from 10G to over 64G. Since I'm running on Amazon EC2 this always causes the temporary folder to consume all the disk space. Still looking for a solution. BTW, the 64G shuffle write was encountered when shuffling a pairRDD with HashPartitioner, so it's not

Re: Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-11 Thread Peng Cheng
You are right. I've checked the overall stage metrics and it looks like the largest shuffle write is over 9G. The partition completed successfully, but its spilled file can't be removed until all others are finished. It's very likely caused by a stupid mistake in my design. A lookup table grows

Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-10 Thread Peng Cheng
I'm running a small job on a cluster with 15G of mem and 8G of disk per machine. The job always gets into a deadlock where the last error message is: java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at
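
One knob worth checking when shuffle spill fills the disk is spark.local.dir, which controls where Spark writes its temporary shuffle files. A sketch with a hypothetical mount point on a larger volume (note that on some deployments SPARK_LOCAL_DIRS overrides this setting):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.local.dir", "/mnt/spark-tmp") // hypothetical path on a bigger disk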

Re: java.lang.IllegalStateException: unread block data

2015-02-02 Thread Peng Cheng
I got the same problem; maybe the Java serializer is unstable. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-unread-block-data-tp20668p21463.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
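
A common experiment when the default Java serializer misbehaves is switching to Kryo; a sketch of one thing to try, not a confirmed fix for this error:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")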

If an RDD appears twice in a DAG whose calculation is triggered by a single action, will this RDD be calculated twice?

2015-01-16 Thread Peng Cheng
I'm talking about RDD1 (not persisted or checkpointed) in this situation:

    ...(somewhere) -> RDD1 -> RDD2
                       |       |
                       V       V
                      RDD3 -> RDD4 -> Action!

In my experience the chance RDD1 gets
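
A sketch of the usual remedy when the answer is "yes, twice": persist the shared RDD before the DAG branches (the expensive step is a hypothetical stand-in):

    val rdd1 = sc.parallelize(1 to 10).map { i => Thread.sleep(10); i } // costly work
    rdd1.cache()                  // without this, both branches below recompute rdd1
    val rdd2 = rdd1.map(_ * 2)
    val rdd3 = rdd1.map(_ * 3)
    val rdd4 = rdd2.zip(rdd3)     // RDD4 depends on both branches
    rdd4.count()                  // one action; rdd1 is computed once thanks to cache()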

Re: DeepLearning and Spark ?

2015-01-09 Thread Peng Cheng
to distribute the parameters. Haven't thought thru yet. Cheers k/ On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote: Does it makes sense to use Spark's actor system (e.g. via SparkContext.env.actorSystem) to create parameter server? On Fri, Jan 9, 2015 at 10:09 PM, Peng

Re: DeepLearning and Spark ?

2015-01-09 Thread Peng Cheng
You are not the first :) and probably not the fifth to have this question. A parameter server is not included in the Spark framework, and I've seen all kinds of hacking to improvise it: REST API, HDFS, Tachyon, etc. Not sure if an 'official' benchmark implementation will be released soon. On 9 January 2015

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-02 Thread Peng Cheng
I was under the impression that ALS wasn't designed for it :-( (the famous eBay online recommender uses SGD). However, you can try using the previous model as a starting point, and gradually reduce the number of iterations after the model stabilizes. I never verified this idea, so you need to at least
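
Since the MLlib 1.x ALS API offers no official warm start, the "reduce iterations after it stabilizes" idea amounts to retraining more cheaply on refreshes. A hedged sketch (input path, format, and hyperparameters are all hypothetical):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(u, p, r) = line.split(',')
      Rating(u.toInt, p.toInt, r.toDouble)
    }
    val initial   = ALS.train(ratings, 10, 20, 0.01) // rank=10, 20 iterations up front
    val refreshed = ALS.train(ratings, 10, 5, 0.01)  // fewer iterations once stable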

spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-20 Thread Peng Cheng
Everything else is there except spark-repl. Can someone check that out this weekend? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-repl-1-2-0-was-not-uploaded-to-central-maven-repository-tp20799.html Sent from the Apache Spark User List mailing list

Re: Spark on Tachyon

2014-12-20 Thread Peng Cheng
IMHO: the cache doesn't provide redundancy, and it's in the same JVM, so it's much faster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Tachyon-tp1463p20800.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to extend a one-to-one RDD of Spark that can be persisted?

2014-12-04 Thread Peng Cheng
In my project I extend a new RDD type that wraps another RDD and some metadata. The code I use is similar to the FilteredRDD implementation: case class PageRowRDD( self: RDD[PageRow], @transient keys: ListSet[KeyLike] = ListSet() ){ override def getPartitions:
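
For comparison, a minimal sketch of a one-to-one wrapper RDD that compiles outside the org.apache.spark package (names are hypothetical; the @transient on the metadata mirrors the snippet above so it isn't dragged into task closures):

    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    class MetadataRDD[T: ClassTag](var prev: RDD[T], @transient val keys: Set[String])
      extends RDD[T](prev) {
      override def getPartitions: Array[Partition] = prev.partitions
      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        prev.iterator(split, context) // pure pass-through; metadata lives on the driver
    }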

How to make sure a ClassPath is always shipped to workers?

2014-11-03 Thread Peng Cheng
I have a Spark application that serializes an object 'Seq[Page]', saves it to HDFS/S3, and has it read back by another worker to be used elsewhere. The serialization and deserialization use the same serializer as Spark itself (read from SparkEnv.get.serializer.newInstance()). However I sporadically get this
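
The usual way to guarantee the class is on every worker's classpath is to ship the application jar explicitly, either with spark-submit --jars or programmatically. A sketch with a hypothetical jar path:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("classpath-sketch")
      .setJars(Seq("target/my-app-assembly.jar")) // shipped to every executor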

Re: How to make sure a ClassPath is always shipped to workers?

2014-11-03 Thread Peng Cheng
Sorry, it's a timeout duplicate; please remove it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-sure-a-ClassPath-is-always-shipped-to-workers-tp18018p18020.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-21 Thread Peng Cheng
Looks like the only way is to implement that feature; there is no way to hack it into working. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p16985.html Sent from the Apache Spark User

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-06 Thread Peng Cheng
Any suggestions? I'm thinking of submitting a feature request for mutable broadcast. Is it doable? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p15807.html Sent from the Apache Spark

Asynchronous Broadcast from driver to workers, is it possible?

2014-10-04 Thread Peng Cheng
While Spark already offers support for asynchronous reduce (collecting data from workers without interrupting execution of a parallel transformation) through accumulators, I have made little progress on making this process reciprocal, namely broadcasting data from the driver to workers to be used by
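
The closest approximation I know of to a mutable broadcast is re-broadcasting: issue a fresh (immutable) broadcast per round and let the next closure capture it. A hedged sketch with hypothetical data:

    var table = sc.broadcast(Map("a" -> 0))
    for (round <- 1 to 3) {
      val snapshot = table                                   // capture the current version
      sc.parallelize(1 to 10).map(_ + snapshot.value("a")).count()
      table.unpersist()                                      // drop stale copies on workers
      table = sc.broadcast(Map("a" -> round))                // new data, pulled lazily by tasks
    }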

Re: Crawler and Scraper with different priorities

2014-09-09 Thread Peng Cheng
Hi Sandeep, would you be interested in joining my open source project? https://github.com/tribbloid/spookystuff IMHO Spark is indeed not for general-purpose crawling, for which the distributed job is highly homogeneous. But it is good enough for directional scraping, which involves heterogeneous input and

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-19 Thread Peng Cheng
Unfortunately, after some research I found it's just a side effect of how closures containing vars work in Scala: http://stackoverflow.com/questions/11657676/how-does-scala-maintains-the-values-of-variable-when-the-closure-was-defined The closure keeps referring to the broadcast wrapper var as a pointer,

Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Peng Cheng
I'm curious to see that if you declare the broadcast wrapper as a var and overwrite it in the driver program, the modification has a stable impact on all transformations/actions defined BEFORE the overwrite but executed lazily AFTER the overwrite: val a = sc.parallelize(1 to 10) var
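
A sketch completing the truncated example (the continuation is my reconstruction, not the original post):

    val a = sc.parallelize(1 to 10)
    var b = sc.broadcast(2)
    val c = a.map(_ * b.value) // defined BEFORE the overwrite, executed lazily
    b = sc.broadcast(3)        // overwrite in the driver
    c.collect()                // the closure reads the var at execution time, so it sees 3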

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Peng Cheng
Yeah, thanks a lot. I know that for people who understand lazy execution this seems straightforward, but for those who don't it may become a liability. I've only tested its stability on a small example (which seems stable); hopefully it's not serendipity. Can a committer confirm this? Yours, Peng

Re: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-27 Thread Peng Cheng
I give up; communication must be blocked by the complex EC2 network topology (though the error message indeed needs some improvement). It doesn't make sense to run a client thousands of miles away that communicates frequently with workers. I have moved everything to EC2 now. -- View this message

Integrate spark-shell into officially supported web ui/api plug-in? What do you think?

2014-06-27 Thread Peng Cheng
This would be handy for demos and quick prototyping, as the command-line REPL doesn't support a lot of editor features; also, you don't need to ssh into your worker/master if your client is behind a NAT wall. Since the Spark codebase has a minimalistic design philosophy, I don't think this component can

Re: Integrate spark-shell into officially supported web ui/api plug-in? What do you think?

2014-06-27 Thread Peng Cheng
That would be really cool with IPython, but I'm still wondering if all language features are supported; namely, I need these 2 in particular: 1. importing classes and ILoop from external jars (so I can point it to SparkILoop or the Spark-bindings ILoop of Apache Mahout instead of Scala's default ILoop) 2.

Re: Spark slave fails to start with weird error information

2014-06-25 Thread Peng Cheng
Sorry, I just realized that start-slave is for a different task. Please close this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-information-tp8203p8246.html Sent from the Apache Spark User List mailing list

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-25 Thread Peng Cheng
I'm running a very small job (16 partitions, 2 stages) on a 2-node cluster, each node with 15G memory. The master page looks all normal: URL: spark://ec2-54-88-40-125.compute-1.amazonaws.com:7077 Workers: 1 Cores: 2 Total, 2 Used Memory: 13.9 GB Total, 512.0 MB Used Applications: 1 Running, 0
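
In my experience this message most often means the application asks for more memory or cores than any worker offers. A sketch of the first thing to check, sized against the numbers above:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "512m") // must fit within the worker's 13.9 GB
      .set("spark.cores.max", "2")          // must not exceed the 2 available cores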

Re: Using Spark as web app backend

2014-06-25 Thread Peng Cheng
Totally agree; also there is a class 'SparkSubmit' you can call directly to replace the shell script. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-as-web-app-backend-tp8163p8248.html Sent from the Apache Spark User List mailing list archive at
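
A hedged sketch of that suggestion; SparkSubmit's main is the CLI entry point rather than a stable public API, and the master URL, class, and jar path here are hypothetical:

    org.apache.spark.deploy.SparkSubmit.main(Array(
      "--master", "spark://master:7077",
      "--class", "com.example.MyApp",
      "target/my-app-assembly.jar"
    ))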

Re: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-25 Thread Peng Cheng
Expanded to 4 nodes and changed the workers to listen on the public DNS, but it still shows the same error (which is obviously wrong). I can't believe I'm the first to encounter this issue. -- View this message in context:

Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
I'm trying to link a spark slave with an already-setup master, using: $SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-32-12:7077 However the result shows that it cannot open a log file it is supposed to create: failed to launch org.apache.spark.deploy.worker.Worker: tail: cannot open

Re: Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
I haven't set up passwordless login from the slave to the master node yet (I was under the impression that this is not necessary since they communicate using port 7077). -- View this message in context:

Re: ElasticSearch enrich

2014-06-24 Thread Peng Cheng
Make sure all queries are called through class methods, and wrap your query info with a class having only simple properties (strings, collections, etc). If you can't find such a wrapper you can also use the SerializableWritable wrapper out of the box, but it's not recommended (developer API, and it makes fat

Re: How to Reload Spark Configuration Files

2014-06-24 Thread Peng Cheng
I've read somewhere that in 1.0 there is a bash tool called 'spark-config.sh' that allows you to propagate your config files to a number of master and slave nodes. However, I haven't used it myself. -- View this message in context:

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Peng Cheng
I got 'NoSuchFieldError', which is of the same type. It's definitely a dependency jar conflict: the Spark driver loads its own jars first, which in recent versions pull in many dependencies that are 1-2 years old, and if your newer-version dependency is in the same package it will be shadowed (Java's first-come

Re: ElasticSearch enrich

2014-06-24 Thread Peng Cheng
I'm afraid persisting a connection across two tasks is a dangerous act, as they can't be guaranteed to be executed on the same machine. Your ES server may think it's a man-in-the-middle attack! I think it's possible to invoke a static method that gives you a connection from a local 'pool', so nothing
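
A minimal sketch of the per-JVM pool idea: a singleton object is initialized lazily on each executor, so every task on that JVM reuses one connection and nothing stateful crosses the closure boundary (the Client class is a hypothetical stand-in for a real ES client):

    class Client { def enrich(doc: String): String = doc + ":enriched" } // stub

    object ESClientPool {
      lazy val client = new Client // built once per executor JVM, on first use
    }

    val enriched = sc.parallelize(Seq("doc1", "doc2")).mapPartitions { docs =>
      val c = ESClientPool.client // only the object reference is in the closure
      docs.map(c.enrich)
    }.collect()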

Does the PUBLIC_DNS environment parameter really work?

2014-06-24 Thread Peng Cheng
I'm deploying a cluster to Amazon EC2, trying to override its internal IP addresses with public DNS names. I start a cluster with the environment parameter SPARK_PUBLIC_DNS=[my EC2 public DNS], but it doesn't change anything on the web UI; it still shows the internal IP address: Spark Master at

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-22 Thread Peng Cheng
Right, problem solved in a most disgraceful manner: just add a package relocation in the maven-shade config. The downside is that it is not compatible with my IDE (IntelliJ IDEA); it will cause: Error:scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.:
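
For reference, the relocation looks roughly like this in the maven-shade configuration (treat it as a sketch; the pattern is chosen to match the httpcore conflict described further down this thread):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.http</pattern>
            <shadedPattern>shaded.org.apache.http</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </plugin>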

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Thanks a lot! Let me check my maven shade plugin config and see if there is a fix -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-throws-NoSuchFieldError-when-testing-on-cluster-mode-tp8064p8073.html Sent from the Apache Spark User List mailing list

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Indeed I see a lot of duplicate package warnings in the maven-shade assembly package output, so I tried to eliminate them. First I set the scope of the apache-spark dependency to 'provided', as suggested on this page: http://spark.apache.org/docs/latest/submitting-applications.html But the spark master

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Latest advancement: I found the cause of the NoClassDef exception: I wasn't using spark-submit; instead I tried to run the Spark application directly with SparkConf set in the code (this is handy in local debugging). However the old problem remains: even my maven-shade plugin doesn't give any warning

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
I also found that any buggy application submitted in --deploy-mode cluster will crash the worker (turning its status to 'DEAD'). This shouldn't really happen; otherwise nobody would use this mode. It is yet unclear whether all workers will crash or only the one running the driver (as I only

Re: Spark throws NoSuchFieldError when testing on cluster mode

2014-06-21 Thread Peng Cheng
Hi Sean, OK I'm about 90% sure about the cause of this problem: just another classic dependency conflict: My project -> Selenium -> apache.httpcomponents:httpcore 4.3.1 (has ContentType); Spark -> Spark SQL Hive -> Hive -> Thrift -> apache.httpcomponents:httpcore 4.1.3 (has no ContentType). Though I

Re: What is the best way to handle transformations or actions that take forever?

2014-06-17 Thread Peng Cheng
I've tried enabling speculative jobs; this seems to have partially solved the problem. However, I'm not sure if it can handle large-scale situations, as it only starts when 75% of the job is finished. -- View this message in context:

What is the best way to handle transformations or actions that take forever?

2014-06-16 Thread Peng Cheng
My transformations or actions have some external toolset dependencies, and sometimes they just get stuck somewhere with no way for me to fix them. If I don't want the job to run forever, do I need to implement several monitor threads that throw an exception when they get stuck? Or can the framework
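
A hedged sketch of the monitor idea: wrap the flaky external call in a Future and fail the task after a deadline, letting Spark's normal retry take over (externalTool is a hypothetical stand-in):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    def externalTool(x: Int): Int = x // stand-in for the dependency that can hang

    val out = sc.parallelize(1 to 100).map { x =>
      Await.result(Future(externalTool(x)), 60.seconds) // TimeoutException if stuck
    }.collect()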

Re: spark1.0 spark sql saveAsParquetFile Error

2014-06-09 Thread Peng Cheng
I wasn't using Spark SQL before, but by default Spark should retry a failed task 4 times. I'm curious why it aborted after 1 failure. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-spark-sql-saveAsParquetFile-Error-tp7006p7252.html Sent from

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
I speculate that Spark will only retry on exceptions that are registered with the TaskSetScheduler, so a definitely-will-fail task will fail quickly without taking more resources. However, I haven't found any documentation or web page on it. -- View this message in context:
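
For what it's worth, the retry knob I'm aware of is spark.task.maxFailures (default 4); a sketch of tuning it in either direction:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.task.maxFailures", "8") // raise to tolerate flaky tasks, or lower to fail fast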

Re: Occasional failed tasks

2014-06-09 Thread Peng Cheng
I think these failed tasks must have been retried automatically if you can't see any error in your results. Otherwise the entire application would throw a SparkException and abort. Unfortunately I don't know how to do this; my application always aborts. -- View this message in context:

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
on this. On Mon, Jun 9, 2014 at 8:59 AM, Peng Cheng pc...@uow.edu.au wrote: I speculate that Spark will only retry on exceptions that are registered with TaskSetScheduler, so a definitely-will-fail task will fail quickly without taking more resources

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
Oh, and to make things worse, they forgot '\*' in their regex. Am I the first to encounter this problem? On Mon 09 Jun 2014 02:24:43 PM EDT, Peng Cheng wrote: Thanks a lot! That's very responsive; somebody definitely has encountered the same problem before, and added two hidden modes

Re: How to enable fault-tolerance?

2014-06-09 Thread Peng Cheng
that right away rather than retrying the task several times and having them worry about why they get so many errors. Matei On Jun 9, 2014, at 11:28 AM, Peng Cheng pc...@uowmail.edu.au wrote: Oh, and to make things worse, they forgot '\*' in their regex. Am I the first to encounter