Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
had the chance to review the attached draft, let us know if there are any questions in the meantime. Again, we welcome the opportunity to work with the teams on this. Best- Craig *From: *Craig Alfieri *Date: *Thursday, Se

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig, Thanks! Please let us know the result! Best, Jerry On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: Hi Craig, Can you please clarify what this bug is and provide sample code causing this issue? HTH Mich Talebzadeh, Distinguished Technologist, Solutions

Unsubscribe

2023-05-01 Thread peng

Re: Portability of dockers built on different cloud platforms

2023-04-05 Thread Ken Peng
ashok34...@yahoo.com.INVALID wrote: Is it possible to use Spark docker built on GCP on AWS without rebuilding from new on AWS? I am using the spark image from bitnami for running on k8s. And yes, it's deployed by helm. -- https://kenpeng.pages.dev/

unsubscribe

2023-01-20 Thread peng
unsubscribe

Re: A simple comparison for three SQL engines

2022-04-09 Thread Wes Peng
May I forward this report to the spark list as well? Thanks. Wes Peng wrote: Hello, This weekend I made a test against a big dataset. spark, drill, mysql, postgresql were involved. This is the final report: https://blog.cloudcache.net/handles-the-file-larger-than-memory/ The simple conclusion

Re: Executorlost failure

2022-04-07 Thread Wes Peng
I just did a test, even for a single node (local deployment), spark can handle the data whose size is much larger than the total memory. My test VM (2g ram, 2 cores): $ free -m total used free shared buff/cache available Mem: 1992 1845

Re: Executorlost failure

2022-04-07 Thread Wes Peng
I once had a file which is 100+GB getting computed in 3 nodes, each node has 24GB memory only. And the job could be done well. So from my experience spark cluster seems to work correctly for big files larger than memory by swapping them to disk. Thanks rajat kumar wrote: Tested this with

Re: Executorlost failure

2022-04-07 Thread Wes Peng
how many executors do you have? rajat kumar wrote: Tested this with executors of size 5 cores, 17GB memory. Data vol is really high around 1TB - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

query time comparison to several SQL engines

2022-04-07 Thread Wes Peng
I made a simple test of query time for several SQL engines including mysql, hive, drill and spark. The report: https://cloudcache.net/data/query-time-mysql-hive-drill-spark.pdf It may have no special meaning, just for fun. :) regards.

Re: Profiling spark application

2022-01-19 Thread Wes Peng
Give a look at this: https://github.com/LucaCanali/sparkMeasure On 2022/1/20 1:18, Prasad Bhalerao wrote: Is there any way we can profile spark applications which will show no. of invocations of spark api and their execution time etc etc just the way jprofiler shows all the details?
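
A minimal PySpark sketch of the sparkMeasure suggestion above (assumptions: the sparkmeasure Python package is installed and a SparkSession named spark already exists; the StageMetrics calls follow sparkMeasure's documented interface, and the workload line is a made-up example):

    from sparkmeasure import StageMetrics

    stagemetrics = StageMetrics(spark)
    stagemetrics.begin()
    spark.range(0, 10000000).selectExpr("sum(id)").show()   # workload to profile (hypothetical)
    stagemetrics.end()
    stagemetrics.print_report()   # stage-level task metrics: run time, shuffle, GC, ...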

Re: [Pyspark] How to download Zip file from SFTP location and put in into Azure Data Lake and unzip it

2022-01-18 Thread Wes Peng
How large is the file? From my experience, reading the excel file from data lake and loading as dataframe, works great. Thanks On 2022-01-18 22:16, Heta Desai wrote: Hello, I have zip files on SFTP location. I want to download/copy those files and put into Azure Data Lake. Once the zip

Re: ivy unit test case filing for Spark

2021-12-21 Thread Wes Peng
Are you using IvyVPN which causes this problem? If the VPN software changes the network URL silently you should avoid using them. Regards. On Wed, Dec 22, 2021 at 1:48 AM Pralabh Kumar wrote: > Hi Spark Team > > I am building a spark in VPN . But the unit test case below is failing. > This is

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Henrik Peng
Congrats and thanks! Gengliang Wang wrote on Tuesday, October 19, 2021 at 10:16 PM: Hi all, Apache Spark 3.2.0 is the third release of the 3.x line. With tremendous contribution from the open-source community, this release managed to resolve in excess of 1,700 Jira tickets. We'd like to thank our

Re: Spark Session error with 30s

2021-04-12 Thread Peng Lei
Hi KhajaAsmath Mohammed Please check the configuration of "spark.speculation.interval", just pass the "30" to it. ''' override def start(): Unit = { backend.start() if (!isLocal && conf.get(SPECULATION_ENABLED)) { logInfo("Starting speculative execution thread")
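
A minimal PySpark sketch of the advice above (an illustration, not the thread's confirmed fix; the assumption is that the point is to pass a bare "30" rather than a value with a unit suffix such as "30s", and the app name is made up):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("speculation-example")              # hypothetical app name
             .config("spark.speculation", "true")
             .config("spark.speculation.interval", "30")  # plain "30", not "30s"
             .getOrCreate())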

Question about how hadoop configurations populated in driver/executor pod

2021-03-22 Thread Yue Peng
Hi, I am trying run sparkPi example via Spark on Kubernetes in my cluster. However, it is consistently failing because of executor does not have the correct hadoop configurations. I could fix it by pre-creating a configmap and mounting it into executor by specifying in pod template. But I do

Re: Unsubscribe

2020-12-22 Thread Wesley Peng
Bhavya Jain wrote: Unsubscribe please send an email to: user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. thanks. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: unsubscribe

2020-06-27 Thread Wesley Peng
please send an empty email to: user-unsubscr...@spark.apache.org to unsubscribe yourself from the list. Sri Kris wrote: Sent from Mail for Windows 10 - To unsubscribe

[ML] [How-to]: How to unload the loaded W2V model in Pyspark?

2020-02-17 Thread Zhefu PENG
Hi all, I'm using pyspark and Spark-ml to train and use Word2Vect model, here is the logic of my program: model = Word2VecModel.load("save path") result_list = model.findSynonymsArray(target, top_N) Then I use the graphframe and result_list to create graph and do some computing. However the

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-02 Thread Wesley Peng
on 2019/9/2 5:54, Dongjoon Hyun wrote: We are happy to announce the availability of Spark 2.4.4! Spark 2.4.4 is a maintenance release containing stability fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable

How to work around NoOffsetForPartitionException when using Spark Streaming

2018-06-01 Thread Martin Peng
Hi, We see the below exception when using Spark Kafka streaming 0.10 on a normal Kafka topic. Not sure why the offset is missing in ZK, but Spark streaming overrides the offset reset policy to none in the code, so I cannot set the reset policy to latest (I don't really care about data loss now). Is there any

spark jdbc postgres query results don't match those of postgres query

2018-03-29 Thread Kevin Peng
I am running into a weird issue in Spark 1.6, which I was wondering if anyone has encountered before. I am running a simple select query from spark using a jdbc connection to postgres: val POSTGRES_DRIVER: String = "org.postgresql.Driver" val srcSql = """select total_action_value, last_updated
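
A minimal PySpark sketch of that kind of JDBC read (the host, database, credentials and table name are placeholders, not taken from the thread; only the driver class and the two column names come from the message above, and a modern SparkSession is used instead of the 1.6-era sqlContext):

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # hypothetical host/db
          .option("driver", "org.postgresql.Driver")
          .option("dbtable", "(select total_action_value, last_updated from my_table) as t")
          .option("user", "spark")                               # hypothetical credentials
          .option("password", "secret")
          .load())
    df.show()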

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-25 Thread Martin Peng
handle task fail, so if the job ended normally, this error can be ignored. Second, when using BypassMergeSortShuffleWriter, it will first write a data file then write an index file. You can check "Failed to delete temporary index file at

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread Martin Peng
Is there anyone who can shed some light on this issue? Thanks Martin 2017-07-21 18:58 GMT-07:00 Martin Peng <wei...@gmail.com>: Hi, I have several Spark jobs including both batch jobs and stream jobs to process the system log and analyze them. We are using Kaf

Spark Job crash due to File Not found when shuffle intermittently

2017-07-21 Thread Martin Peng
Hi, I have several Spark jobs including both batch jobs and stream jobs to process the system log and analyze them. We are using Kafka as the pipeline to connect each job. Once upgraded to Spark 2.1.0 + Spark Kafka Streaming 010, I found some of the jobs (both batch and streaming) throw the below

The stability of Spark Stream Kafka 010

2017-06-29 Thread Martin Peng
Hi, We planned to upgrade our Spark Kafka library to 0.10 from 0.8.1 to simplify our infrastructure code logic. Does anybody know when the 010 version will become stable rather than experimental? May I use this 010 version together with Spark 1.5.1?

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Kevin Peng
Mohini, We set that parameter before we went and played with the number of executors and that didn't seem to help at all. Thanks, KP On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar wrote: > Hi, > > try using this parameter --conf spark.sql.shuffle.partitions=1000

Re: udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
df:
a | b | c
---------
1 | m | n
1 | x | j
2 | m | x
...

import pyspark.sql.functions as F
from pyspark.sql.types import MapType, StringType

def _my_zip(c, d):
    return dict(zip(c, d))

my_zip = F.udf(_my_zip, MapType(StringType(), StringType(), True))
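
A minimal sketch of how such a UDF can be combined with agg() (an assumption for illustration, not an answer quoted from this thread): Spark of this era has no Python UDAFs, so a common workaround is to collect the grouped values with collect_list and then apply an ordinary UDF to the collected arrays; on Spark 1.6, collect_list may require a HiveContext.

    grouped = (df.groupBy("a")
                 .agg(F.collect_list("b").alias("bs"),
                      F.collect_list("c").alias("cs")))
    result = grouped.withColumn("zipped", my_zip("bs", "cs"))  # per-group map of b -> c
    result.show()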

Re: udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
btw, i am using spark 1.6.1 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/udf-of-aggregation-in-pyspark-dataframe-tp27811p27812.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
Hi, is there a way to write a udf in pyspark that supports agg()? I searched all over the docs and internet, and tested it out. Some say yes, some say no, and when I try those yes code examples, it just complains about AnalysisException: u"expression 'pythonUDF' is neither present in the group by,

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
wrote: Hi Kevin, Having given it a first look I do think that you have hit something here and this does not look quite fine. I have to work on the multiple AND conditions in ON

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Having given it a first look I do think that you have hit something here and this does not look quite fine. I have to work on the multiple AND conditions in ON and see whether that is causing any issues. Regards,

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
pe show the same results, which meant that all the rows from left could match at least one row from right, and all the rows from right could match at least one row from left, even though the number of rows from left does not equal that of right. This is the correct result.

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong, Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary. From the above set of queries (the ones mentioned in the above comments), both inner and outer join are producing the same counts. They are basically pulling out selected

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, I wish that was case, but I have done a select count on each of the two tables individually and they return back different number of rows: dps.registerTempTable("dps_pin_promo_lt") swig.registerTempTable("swig_pin_promo_lt") dps.count() RESULT: 42632 swig.count() RESULT: 42034
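
A minimal PySpark sketch of the kind of join being compared in this thread (only the two DataFrame/temp-table names come from the message above; the join key columns are hypothetical, and this is an illustration rather than the query from the thread):

    # (Full) outer join with multiple AND conditions in the ON clause,
    # expressed with the DataFrame API instead of SQL.
    joined = dps.join(
        swig,
        (dps["pin"] == swig["pin"]) & (dps["promo"] == swig["promo"]),  # hypothetical columns
        "outer")
    print(dps.count(), swig.count(), joined.count())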

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, Apologies. I edited my post with this information: Spark version: 1.6 Result from spark shell OS: Linux version 2.6.32-431.20.3.el6.x86_64 ( mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014 Thanks, KP On Mon,

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Kevin Peng
Ted, What triggerAndWait does is perform a REST call to a specified url and then wait until the status message that gets returned by that url in a json field says complete. The issue is I put a println at the very top of the method and that doesn't get printed out, and I know that println

ClassCastException when saving a DataFrame to parquet file (saveAsParquetFile, Spark 1.3.1) using Scala

2015-08-21 Thread Emma Boya Peng
Hi, I was trying to programmatically specify a schema and apply it to a RDD of Rows and save the resulting DataFrame as a parquet file. Here's what I did: 1. Created an RDD of Rows from RDD[Array[String]]: val gameId= Long.valueOf(line(0)) val accountType = Long.valueOf(line(1)) val

ClassCastException when saving a DataFrame to parquet file (saveAsParquetFile, Spark 1.3.1) using Scala

2015-08-21 Thread Emma Boya Peng
Hi, I was trying to programmatically specify a schema and apply it to a RDD of Rows and save the resulting DataFrame as a parquet file, but I got java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long on the last step. Here's what I did: 1. Created an RDD of Rows from

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
the new way to set the same properties? Yours Peng On 12 June 2015 at 14:20, Andrew Or and...@databricks.com wrote: Hi Peng, Setting properties through --conf should still work in Spark 1.4. From the warning it looks like the config you are trying to set does not start with the prefix spark

Re: [Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
2015 at 19:39, Ted Yu yuzhih...@gmail.com wrote: This is the SPARK JIRA which introduced the warning: [SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties in spark-shell and spark-submit On Fri, Jun 12, 2015 at 4:34 PM, Peng Cheng rhw...@gmail.com wrote: Hi Andrew

[Spark 1.4.0]How to set driver's system property using spark-submit options?

2015-06-12 Thread Peng Cheng
In Spark 1.3.x, the system property of the driver could be set by the --conf option, shared between setting spark properties and system properties. In Spark 1.4.0 this feature is removed; the driver instead logs the following warning: Warning: Ignoring non-spark config property: xxx.xxx=v How do
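
A minimal sketch of one way to pass a non-spark JVM system property to the driver (an assumption for illustration, not the answer given in the thread; it only follows the rule from the reply above that --conf accepts keys starting with "spark.", and keeps the placeholder property name xxx.xxx from the message):

    # When launching with spark-submit, setting it on the command line applies it
    # before the driver JVM starts:
    #   --conf "spark.driver.extraJavaOptions=-Dxxx.xxx=v"
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("driver-sysprop-example")                  # hypothetical app name
            .set("spark.driver.extraJavaOptions", "-Dxxx.xxx=v"))  # system property for the driver JVM
    sc = SparkContext(conf=conf)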

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2015-05-21 Thread Peng Cheng
I stumble upon this thread and I conjecture that this may affect restoring a checkpointed RDD as well: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928 In my case I have 1600+ fragmented

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Looks like this problem has been mentioned before: http://qnalist.com/questions/5666463/downloads-from-s3-exceedingly-slow-when-running-on-spark-ec2 and a temporary solution is to deploy on a dedicated EMR/S3 configuration. I'll give that one a shot. -- View this message in context:

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
Turns out the above thread is unrelated: it was caused by using s3:// instead of s3n://. Which I already avoided in my checkpointDir configuration. -- View this message in context:

Re: Union of checkpointed RDD in Apache Spark has long ( 10 hour) between-stage latency

2015-05-17 Thread Peng Cheng
BTW: My thread dump of the driver's main thread looks like it is stuck waiting for Amazon S3 bucket metadata for a long time (which may suggest that I should move the checkpointing directory from S3 to HDFS): Thread 1: main (RUNNABLE) java.net.SocketInputStream.socketRead0(Native Method)

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-04-24 Thread Peng Cheng
I'm deploying a Spark data processing job on an EC2 cluster, the job is small for the cluster (16 cores with 120G RAM in total), the largest RDD has only 76k+ rows. But heavily skewed in the middle (thus requires repartitioning) and each row has around 100k of data after serialization. The job

Re: Spark Performance on Yarn

2015-04-20 Thread Peng Cheng
I got exactly the same problem, except that I'm running on a standalone master. Can you tell me the counterpart parameter on standalone master for increasing the same memory overhead? -- View this message in context:

How to avoid “Invalid checkpoint directory” error in apache Spark?

2015-04-17 Thread Peng Cheng
I'm using Amazon EMR + S3 as my spark cluster infrastructure. When I'm running a job with periodic checkpointing (it has a long dependency tree, so truncating by checkpointing is mandatory; each checkpoint has 320 partitions), the job stops halfway, resulting in an exception: (On driver)

Re: spark there is no space on the disk

2015-03-31 Thread Peng Xia
, Mesos) or LOCAL_DIRS (YARN) On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi Sean, Thanks very much for your reply. I tried to config it with the below code: sf = SparkConf().setAppName("test").set("spark.executor.memory", "45g").set("spark.cores.max", "62").set

Re: refer to dictionary

2015-03-31 Thread Peng Xia
Hi Ted, Thanks very much, yea, using broadcast is much faster. Best, Peng On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu yuzhih...@gmail.com wrote: You can use broadcast variable. See also this thread: http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable
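
A minimal PySpark sketch of the broadcast-variable approach mentioned above (the dictionary contents and RDD are made up for illustration):

    lookup = {"a": 1, "b": 2, "c": 3}          # hypothetical lookup dictionary
    b_lookup = sc.broadcast(lookup)            # shipped to each executor once

    rdd = sc.parallelize(["a", "b", "a", "c"])
    mapped = rdd.map(lambda k: b_lookup.value.get(k, 0))
    print(mapped.collect())                    # [1, 2, 1, 3]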

Re: Can I start multiple executors in local mode?

2015-03-16 Thread xu Peng
Hi David, You can try the local-cluster master. The numbers in local-cluster[2,2,1024] mean 2 workers, 2 cores per worker and 1024 MB of memory per worker. Best Regards Peng Xu 2015-03-16 19:46 GMT+08:00 Xi Shen davidshe...@gmail.com: Hi, In YARN mode you can specify the number of executors. I wonder if we can

Re: spark there is no space on the disk

2015-03-14 Thread Peng Xia
Hi Sean, Thanks very much for your reply. I tried to config it with the below code: sf = SparkConf().setAppName("test").set("spark.executor.memory", "45g").set("spark.cores.max", "62").set("spark.local.dir", "C:\\tmp") But I still get the error. Do you know how I can config this? Thanks, Best, Peng On Sat, Mar

Re: spark there is no space on the disk

2015-03-14 Thread Peng Xia
And I have 2 TB free space on the C drive. On Sat, Mar 14, 2015 at 8:29 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi Sean, Thanks very much for your reply. I tried to config it with the below code: sf = SparkConf().setAppName("test").set("spark.executor.memory", "45g").set("spark.cores.max", "62").set

Re: spark sql writing in avro

2015-03-13 Thread Kevin Peng
). Now you should be able to compile and run. HTH, Markus On 03/12/2015 11:55 PM, Kevin Peng wrote: Dale, I basically have the same maven dependency above, but my code will not compile due to not being able to reference to AvroSaver, though the saveAsAvro reference compiles fine, which

spark there is no space on the disk

2015-03-13 Thread Peng Xia
Hi, I was running a logistic regression algorithm on an 8 node spark cluster; each node has 8 cores and 56 GB RAM (each node is running a Windows system), and the spark installation drive has 1.9 TB capacity. The dataset I was training on has around 40 million records with around 6600

Re: Loading in json with spark sql

2015-03-13 Thread Kevin Peng
Yin, Yup thanks. I fixed that shortly after I posted and it worked. Thanks, Kevin On Fri, Mar 13, 2015 at 8:28 PM, Yin Huai yh...@databricks.com wrote: Seems you want to use array for the field of providers, like providers:[{id: ...}, {id:...}] instead of providers:{{id: ...}, {id:...}}
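
A minimal sketch of the difference being pointed out (the file name and field values are made up, and a modern SparkSession is used instead of the 1.3-era API from the thread):

    # providers.json -- one record per line, with "providers" as a JSON array:
    #   {"name": "x", "providers": [{"id": "1"}, {"id": "2"}]}
    from pyspark.sql import functions as F

    df = spark.read.json("providers.json")
    df.printSchema()   # providers inferred as array<struct<id:string>>
    df.select(F.explode("providers").alias("p")).select("p.id").show()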

Re: spark sql writing in avro

2015-03-12 Thread Kevin Peng
Dale, I basically have the same maven dependency above, but my code will not compile due to not being able to reference AvroSaver, though the saveAsAvro reference compiles fine, which is weird. Even though saveAsAvro compiles for me, it errors out when running the spark job due to it not being

Re: error on training with logistic regression sgd

2015-03-10 Thread Peng Xia
algorithm in python. 3. train a logistic regression model with the converted labeled points. Can anyone give some advice on how to avoid the 2GB limit, if this is the cause? Thanks very much for the help. Best, Peng On Mon, Mar 9, 2015 at 3:54 PM, Peng Xia sparkpeng...@gmail.com wrote: Hi, I

error on training with logistic regression sgd

2015-03-09 Thread Peng Xia
(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) The data are transformed to LabeledPoint and I was using pyspark for this. Can anyone help me on this? Thanks, Best, Peng

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html On Wed, Mar 4, 2015 at 4:34 PM, Kevin Peng kpe...@gmail.com wrote: Ted, I am currently using CDH 5.3 distro, which has Spark 1.2.0, so I am not too sure about the compatibility issues between 1.2.0 and 1.2.1, that is why

Re: Issues with maven dependencies for version 1.2.0 but not version 1.1.0

2015-03-04 Thread Kevin Peng
this thread: http://search-hadoop.com/m/JW1q5Vfe6X1 Cheers On Wed, Mar 4, 2015 at 4:18 PM, Kevin Peng kpe...@gmail.com wrote: Marcelo, Yes that is correct, I am going through a mirror, but 1.1.0 works properly, while 1.2.0 does not. I suspect there is crc in the 1.2.0 pom file. On Wed, Mar

Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
I double checked the 1.2 feature list and found out that the new sort-based shuffle manager has nothing to do with HashPartitioner :- Sorry for the misinformation. On the other hand, this may explain the increase in shuffle spill as a side effect of the new shuffle manager; let me revert

Re: Shuffle write increases in spark 1.2

2015-02-14 Thread Peng Cheng
not related to Spark 1.2.0's new features Yours Peng -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-write-increases-in-spark-1-2-tp20894p21656.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-11 Thread Peng Cheng
You are right. I've checked the overall stage metrics and looks like the largest shuffling write is over 9G. The partition completed successfully but its spilled file can't be removed until all others are finished. It's very likely caused by a stupid mistake in my design. A lookup table grows

Why does spark write huge file into temporary local disk even without on-disk persist or checkpoint?

2015-02-10 Thread Peng Cheng
I'm running a small job on a cluster with 15G of mem and 8G of disk per machine. The job always get into a deadlock where the last error message is: java.io.IOException: No space left on device at java.io.FileOutputStream.writeBytes(Native Method) at

Is LogisticRegressionWithSGD in MLlib scalable?

2015-02-03 Thread Peng Zhang
Hi Everyone, Is LogisticRegressionWithSGD in MLlib scalable? If so, what is the idea behind the scalable implementation? Thanks in advance, Peng - Peng Zhang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-LogisticRegressionWithSGD-in-MLlib

Re: java.lang.IllegalStateException: unread block data

2015-02-02 Thread Peng Cheng
I got the same problem, maybe java serializer is unstable -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-unread-block-data-tp20668p21463.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?

2015-01-16 Thread Peng Cheng
I'm talking about RDD1 (not persisted or checkpointed) in this situation: ...(somewhere) - RDD1 - RDD2 || V V RDD3 - RDD4 - Action! In my experience the chance RDD1 get

Re: DeepLearning and Spark ?

2015-01-09 Thread Peng Cheng
to distribute the parameters. Haven't thought thru yet. Cheers k/ On Fri, Jan 9, 2015 at 2:56 PM, Andrei faithlessfri...@gmail.com wrote: Does it makes sense to use Spark's actor system (e.g. via SparkContext.env.actorSystem) to create parameter server? On Fri, Jan 9, 2015 at 10:09 PM, Peng

Re: DeepLearning and Spark ?

2015-01-09 Thread Peng Cheng
You are not the first :) probably not the fifth to have the question. parameter server is not included in spark framework and I've seen all kinds of hacking to improvise it: REST api, HDFS, tachyon, etc. Not sure if an 'official' benchmark implementation will be released soon On 9 January 2015

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-02 Thread Peng Cheng
I was under the impression that ALS wasn't designed for it :- The famous ebay online recommender uses SGD. However, you can try using the previous model as a starting point, and gradually reduce the number of iterations after the model stabilizes. I never verified this idea, so you need to at least

Re: spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-22 Thread peng
will still move to databricks cloud, which has far more features than that. Many influential projects already depend on the routinely published Scala-REPL (e.g. playFW); it would be strange for Spark not to do the same. What do you think? Yours Peng On 12/22/2014 04:57 PM, Sean Owen wrote

Re: Announcing Spark Packages

2014-12-22 Thread peng
Me 2 :) On 12/22/2014 06:14 PM, Andrew Ash wrote: Hi Xiangrui, That link is currently returning a 503 Over Quota error message. Would you mind pinging back out when the page is back up? Thanks! Andrew On Mon, Dec 22, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com

spark-repl_1.2.0 was not uploaded to central maven repository.

2014-12-20 Thread Peng Cheng
Everything else is there except spark-repl. Can someone check that out this weekend? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-repl-1-2-0-was-not-uploaded-to-central-maven-repository-tp20799.html Sent from the Apache Spark User List mailing list

Re: Spark on Tachyon

2014-12-20 Thread Peng Cheng
IMHO: cache doesn't provide redundancy, and it's in the same JVM, so it's much faster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Tachyon-tp1463p20800.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to extend an one-to-one RDD of Spark that can be persisted?

2014-12-04 Thread Peng Cheng
In my project I extend a new RDD type that wraps another RDD and some metadata. The code I use is similar to FilteredRDD implementation: case class PageRowRDD( self: RDD[PageRow], @transient keys: ListSet[KeyLike] = ListSet() ){ override def getPartitions:

How to make sure a ClassPath is always shipped to workers?

2014-11-03 Thread Peng Cheng
: this error doesn't always happen; sometimes the old Seq[Page] is retrieved properly, sometimes it throws the exception. How could this happen and how do I fix it? Yours Peng -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-sure-a-ClassPath-is-always-shipped

Re: How to make sure a ClassPath is always shipped to workers?

2014-11-03 Thread Peng Cheng
Sorry its a timeout duplicate, please remove it -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-sure-a-ClassPath-is-always-shipped-to-workers-tp18018p18020.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
distinct records (one positive and one negative); the others are all duplicates. Does anyone have any idea why it takes so long on this small data? Thanks, Best, Peng

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
= model.predict(point.features) // (point.label, prediction) // } // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testParsedData.count // println("Training Error = " + trainErr) println(Calendar.getInstance().getTime()) } } Thanks, Best, Peng On Thu, Oct 30, 2014 at 1:23 PM

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Hi Xiangrui, Can you give me some code example about caching, as I am new to Spark. Thanks, Best, Peng On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote: Then caching should solve the problem. Otherwise, it is just loading and parsing data from disk for each iteration
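
A minimal PySpark sketch of the caching advice from this thread (the input path and parsing are made up; only the idea of calling cache() on the training RDD before SVMWithSGD comes from the replies):

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import SVMWithSGD

    raw = sc.textFile("hdfs:///path/to/training.csv")          # hypothetical input path
    parsed = raw.map(lambda line: line.split(",")) \
                .map(lambda p: LabeledPoint(float(p[0]), [float(x) for x in p[1:]]))
    parsed.cache()                                              # reuse across SGD iterations
    model = SVMWithSGD.train(parsed, iterations=100)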

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread peng xia
Thanks Jimmy. I will have a try. Thanks very much for your guys' help. Best, Peng On Thu, Oct 30, 2014 at 8:19 PM, Jimmy ji...@sellpoints.com wrote: sampleRDD. cache() Sent from my iPhone On Oct 30, 2014, at 5:01 PM, peng xia toxiap...@gmail.com wrote: Hi Xiangrui, Can you give me some

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-21 Thread Peng Cheng
Looks like the only way is to implement that feature. There is no way of hacking it into working -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p16985.html Sent from the Apache Spark User

Re: Asynchronous Broadcast from driver to workers, is it possible?

2014-10-06 Thread Peng Cheng
Any suggestions? I'm thinking of submitting a feature request for mutable broadcast. Is it doable? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-tp15758p15807.html Sent from the Apache Spark

Asynchronous Broadcast from driver to workers, is it possible?

2014-10-04 Thread Peng Cheng
While Spark already offers support for asynchronous reduce (collect data from workers, while not interrupting execution of a parallel transformation) through accumulator, I have made little progress on making this process reciprocal, namely, to broadcast data from driver to workers to be used by

[no subject]

2014-09-30 Thread PENG ZANG
Hi, We have a cluster setup with spark 1.0.2 running 4 workers and 1 master with 64G RAM each. In the sparkContext we specify 32G executor memory. However, as long as a task runs longer than approximately 15 mins, all the executors are lost, just like some sort of timeout, no matter if the

Re: Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-16 Thread Kevin Peng
Sean, Thanks. That worked. Kevin On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote: This is more of a Java / Maven issue than Spark per se. I would use the shade plugin to remove signature files in your final META-INF/ dir. As Spark does, in its configuration: filters

Re: Crawler and Scraper with different priorities

2014-09-09 Thread Peng Cheng
and deep graph following extraction. Please drop me a line if you have a use case, as I'll try to integrate it as a feature. Yours Peng -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-Scraper-with-different-priorities-tp13645p13838.html Sent from

Re: Spark Streaming into HBase

2014-09-03 Thread Kevin Peng
On Wed, Sep 3, 2014 at 2:33 PM, Kevin Peng kpe...@gmail.com wrote: Ted, The hbase-site.xml is in the classpath (had worse issues before... until I figured that it wasn't in the path). I get the following error in the spark-shell: org.apache.spark.SparkException: Job aborted due to stage

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-19 Thread Peng Cheng
Unfortunately, after some research I found it's just a side effect of how a closure containing a var works in Scala: http://stackoverflow.com/questions/11657676/how-does-scala-maintains-the-values-of-variable-when-the-closure-was-defined the closure keeps referring to the broadcasted var wrapper as a pointer,

Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Peng Cheng
). This can be useful sometimes but may cause confusion at other times (people can no longer add persist at will just for backup because it may change the result). So far I've found no documentation supporting this feature. So can someone confirm that it's a deliberately designed feature? Yours Peng -- View

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Peng Cheng
Yeah, Thanks a lot. I know for people understanding lazy execution this seems straightforward. But for those who don't it may become a liability. I've only tested its stability on a small example (which seems stable), hopefully it's not a serendipity. Can a committer confirm this? Yours Peng

Re: TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-27 Thread Peng Cheng
I give up, communication must be blocked by the complex EC2 network topology (though the error information indeed needs some improvement). It doesn't make sense to run a client thousands of miles away to communicate frequently with workers. I have moved everything to EC2 now. -- View this message

Integrate spark-shell into officially supported web ui/api plug-in? What do you think?

2014-06-27 Thread Peng Cheng
? Yours Peng -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Integrate-spark-shell-into-officially-supported-web-ui-api-plug-in-What-do-you-think-tp8447.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Integrate spark-shell into officially supported web ui/api plug-in? What do you think?

2014-06-27 Thread Peng Cheng
That would be really cool with IPython, But I' still wondering if all language features are supported, namely I need these 2 in particular: 1. importing class and ILoop from external jars (so I can point it to SparkILoop or Sparkbinding ILoop of Apache Mahout instead of Scala's default ILoop) 2.

Re: Spark slave fails to start with weird error information

2014-06-25 Thread Peng Cheng
Sorry I just realize that start-slave is for a different task. Please close this -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-information-tp8203p8246.html Sent from the Apache Spark User List mailing list

TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-06-25 Thread Peng Cheng
Node Submitted Time User State Duration app-20140625083158- org.tribbloid.spookystuff.example.GoogleImage$ 2 512.0 MB 2014/06/25 08:31:58 peng RUNNING 17 min However when submitting the job in client mode: $SPARK_HOME/bin/spark-submit \ --class
