Re: mlib compilation errors

2015-12-08 Thread wei....@kaiyuandao.com
Probably it is because I ran "./dev/change-scala-version.sh 2.11" after importing these projects in IntelliJ. I reimported the projects later and it works fine. Closing this thread. Thanks. From: wei@kaiyuandao.com Sent: 2015-12-07 16:43 To: dev Subject: mlib compilation errors hi, when

Data and Model Parallelism in MLPC

2015-12-08 Thread Disha Shrivastava
Hi, I would like to know if the implementation of MLPC in the latest released version of Spark (1.5.2) implements model parallelism and data parallelism as done in the DistBelief model implemented by Google

Failed to generate predicate Error when using dropna

2015-12-08 Thread Chang Ya-Hsuan
spark version: spark-1.5.2-bin-hadoop2.6
python version: 2.7.9
os: ubuntu 14.04

code to reproduce error

    # write.py
    import pyspark
    sc = pyspark.SparkContext()
    sqlc = pyspark.SQLContext(sc)
    df = sqlc.range(10)
    df1 = df.withColumn('a', df['id'] * 2)
    df1.write.partitionBy('id').parquet('./data')

Re: Fastest way to build Spark from scratch

2015-12-08 Thread Steve Loughran
On 7 Dec 2015, at 19:07, Jakob Odersky wrote: make-distribution and the second code snippet both create a distribution from a clean state. They therefore require that every source file be compiled and that takes time (you can maybe tweak some

Re: Failed to generate predicate Error when using dropna

2015-12-08 Thread Reynold Xin
Can you create a JIRA ticket for this? Thanks. On Tue, Dec 8, 2015 at 5:25 PM, Chang Ya-Hsuan wrote: > spark version: spark-1.5.2-bin-hadoop2.6 > python version: 2.7.9 > os: ubuntu 14.04 > > code to reproduce error > > # write.py > > import pyspark > sc =

Filter the null before InnerJoin to solve the problem of data skew

2015-12-08 Thread vector
When I join two tables, I find that one table has a data skew problem, and the skewed value of the field is null, so I want to filter out the nulls before the InnerJoin. For example, if a.key is skewed and the skewed value is null: change "select * from a join b on a.key = b.key" to "select * from a
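
The idea in the message above relies on the fact that, in SQL, NULL never compares equal to NULL, so rows with a null join key cannot appear in an inner join result and can be dropped beforehand. The sketch below illustrates this with Python's built-in sqlite3 module rather than Spark. The tables, rows, and the form of the rewritten query are assumptions made for illustration; the poster's actual rewritten query is cut off in the archive and is not reproduced here.

```python
import sqlite3

# Hypothetical tables a and b; the names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key INTEGER, val TEXT);
    CREATE TABLE b (key INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'),
                         (NULL, 'a3'), (NULL, 'a4'), (NULL, 'a5');
    INSERT INTO b VALUES (1, 'b1'), (2, 'b2'), (NULL, 'b3');
""")

# Plain inner join: rows with a NULL key never match, but they are
# still carried into the join.
plain = (
    "SELECT a.key, a.val, b.val "
    "FROM a JOIN b ON a.key = b.key "
    "ORDER BY a.key"
)

# One possible rewrite (an assumption, not the poster's query):
# drop the NULL keys from a before joining.
filtered = (
    "SELECT a.key, a.val, b.val "
    "FROM (SELECT * FROM a WHERE key IS NOT NULL) AS a "
    "JOIN b ON a.key = b.key "
    "ORDER BY a.key"
)

plain_rows = conn.execute(plain).fetchall()
filtered_rows = conn.execute(filtered).fetchall()
print(plain_rows == filtered_rows)
```

Both queries return the same rows, which is why the filter is safe for an inner join. The same would not hold for an outer join, where the unmatched null-key rows are part of the result.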

Re: Fastest way to build Spark from scratch

2015-12-08 Thread Nicholas Chammas
Interesting. As long as Spark's dependencies don't change that often, the same caches could save "from scratch" build time over many months of Spark development. Is that right? On Tue, Dec 8, 2015 at 12:33 PM Josh Rosen wrote: > @Nick, on a fresh EC2 instance a

Re: Fastest way to build Spark from scratch

2015-12-08 Thread Stephen Boesch
I will echo Steve L's comment about having zinc running (with --nailed). That provides at least a 2X speedup - sometimes without it Spark simply does not build for me. 2015-12-08 9:33 GMT-08:00 Josh Rosen : > @Nick, on a fresh EC2 instance a significant chunk of the

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-08 Thread Michael Armbrust
An update: the vote fails due to the -1. I'll post another RC as soon as we've resolved these issues. In the meantime I encourage people to continue testing and post any problems they encounter here. On Sun, Dec 6, 2015 at 6:24 PM, Yin Huai wrote: > -1 > > Two blocker

Re: Fastest way to build Spark from scratch

2015-12-08 Thread Nicholas Chammas
Thanks for the tips, Jakob and Steve. It looks like my original approach is the best for me since I'm installing Spark on newly launched EC2 instances and can't take advantage of incremental compilation. Nick On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran wrote: > On 7

RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
Hi Disha, Multilayer perceptron classifier in Spark implements data parallelism. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Tuesday, December 08, 2015 12:43 AM To: dev@spark.apache.org; Ulanov, Alexander Subject: Data and Model Parallelism in MLPC Hi, I
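
As a rough illustration of what data parallelism means in this context: the training data is split into partitions, each worker computes a gradient on its own partition against the same copy of the weights, and the partial gradients are combined into one update. The sketch below is a toy example in plain Python for a one-weight linear model. It is not Spark's MLPC implementation, and all function names and data are hypothetical.

```python
# Toy data-parallel gradient descent for the model y = w * x.
# Not Spark code; every name here is hypothetical.

def partial_gradient(w, partition):
    # Sum of d/dw (w*x - y)^2 over the examples in one partition.
    return sum(2.0 * (w * x - y) * x for x, y in partition)

def data_parallel_step(w, partitions, lr):
    n = sum(len(p) for p in partitions)
    # Each partition could be handled by a different worker; all
    # workers see the same weight w (the model itself is not split).
    grads = [partial_gradient(w, p) for p in partitions]
    # Combine the partial gradients and apply a single update.
    return w - lr * sum(grads) / n

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
partitions = [data[:2], data[2:]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, partitions, lr=0.05)
print(w)
```

Model parallelism, by contrast, would split the weights themselves across machines, which matters only when the model does not fit in one machine's memory.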

Re: Data and Model Parallelism in MLPC

2015-12-08 Thread Disha Shrivastava
Hi Alexander, Thanks for your response. Can you suggest ways to incorporate Model Parallelism in MLPC? I am trying to do the same in Spark. I got hold of your post http://apache-spark-developers-list.1001551.n3.nabble.com/Model-parallelism-with-RDD-td13141.html where you have divided the weight

RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
Hi Disha, Which use case do you have in mind that would require model parallelism? It should have a large number of weights, so that it could not fit into the memory of a single machine. For example, multilayer perceptron topologies that are used for speech recognition have up to 100M weights.

I filed SPARK-12233

2015-12-08 Thread Fengdong Yu
Hi, I filed an issue, please take a look: https://issues.apache.org/jira/browse/SPARK-12233 It definitely can be reproduced.

Re: Failed to generate predicate Error when using dropna

2015-12-08 Thread Chang Ya-Hsuan
https://issues.apache.org/jira/browse/SPARK-12231 This is my first time creating a JIRA ticket. Is this ticket proper? Thanks. On Tue, Dec 8, 2015 at 9:59 PM, Reynold Xin wrote: > Can you create a JIRA ticket for this? Thanks. > > > On Tue, Dec 8, 2015 at 5:25 PM, Chang

Re: A proposal for Spark 2.0

2015-12-08 Thread Kostas Sakellis
I'd also like to make it a requirement that Spark 2.0 have a stable dataframe and dataset API - we should not leave these APIs experimental in the 2.0 release. We already know of at least one breaking change we need to make to dataframes; now's the time to make any other changes we need to