Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Josh Rosen
Hi Jerry, Do you have speculation enabled? A write which produces one million files / output partitions might be using tons of driver memory via the OutputCommitCoordinator's bookkeeping data structures. On Sun, Oct 25, 2015 at 5:50 PM, Jerry Lam wrote: > Hi spark guys, >
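Josh's diagnostic question can be sketched as a minimal config check (a sketch only; `spark.speculation` is the standard property name, and the `SparkConf` usage here is illustrative, not Jerry's actual code):

```scala
import org.apache.spark.SparkConf

// Sketch: speculative execution multiplies the task attempts that the
// OutputCommitCoordinator must track per output partition, so with ~1M
// output partitions it can inflate driver-side bookkeeping. Verify it is
// off (false is the default):
val conf = new SparkConf()
  .set("spark.speculation", "false")
println(conf.get("spark.speculation", "false"))
```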

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
Column 4 is always constant, so it has no predictive power, resulting in a zero weight. On Sunday, October 25, 2015, Zhiliang Zhu wrote: > Hi DB Tsai, > > Thanks very much for your kind reply help. > > As for your comment, I just modified and tested the key part of the codes: > >

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi Josh, No, I don't have speculation enabled. The driver took about a few hours until it OOMed. Interestingly, all partitions are generated successfully (the _SUCCESS file is written in the output directory). Is there a reason why the driver needs so much memory? The jstack revealed that it called

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi guys, I mentioned that the partitions are generated, so I tried to read the partition data from them. The driver OOMs after a few minutes. The stack trace is below. It looks very similar to the jstack above (note the refresh method). Thanks! Name: java.lang.OutOfMemoryError Message: GC

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi spark guys, I think I hit the same issue as SPARK-8890 https://issues.apache.org/jira/browse/SPARK-8890. It is marked as resolved; however, it is not. I have over a million output directories for a single column in partitionBy. Not sure if this is a regression issue? Do I need to set some
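The workload shape Jerry describes can be sketched as follows (all names and paths here are hypothetical, assuming a DataFrame `df` with a high-cardinality column `id`):

```scala
// Sketch: partitioning by a column with ~1M distinct values yields ~1M
// output directories (id=0/, id=1/, ...), which is the shape of workload
// behind the SPARK-8890-style driver memory pressure described above.
df.write
  .partitionBy("id")        // one directory per distinct value of "id"
  .parquet("/tmp/output")   // hypothetical output path
```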

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
So yes, the individual artifacts are released; however, there is no deployable bundle prebuilt for Spark 1.5.1 and Scala 2.11.7, something like spark-1.5.1-bin-hadoop-2.6_scala-2.11.tgz. The spark site even states this: *Note: Scala 2.11 users should download the Spark source package and build

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
Hi guys, After waiting for a day, it actually causes an OOM on the Spark driver. I configured the driver to have 6GB. Note that I didn't call refresh myself; the method was called when saving the dataframe in parquet format. Also, I'm using partitionBy() on the DataFrameWriter to generate over 1

Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-25 Thread Lin Zhao
I have the issue resolved. In this case, the hostname of my machine was configured to a public domain that resolves to the EC2 machine's public IP. It's not allowed to bind to an elastic IP. I changed the hostname to Amazon's private hostname (ip-72-xxx-xxx), and then it works.

Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Ram Venkatesh
Felix, Missed your reply - agree looks like the same issue, resolved mine as Duplicate. Thanks! Ram On Sun, Oct 25, 2015 at 2:47 PM, Felix Cheung wrote: > > > This might be related to https://issues.apache.org/jira/browse/SPARK-10500 > > > > On Sun, Oct 25, 2015 at

Re: Secondary Sorting in Spark

2015-10-25 Thread swetha kasireddy
Hi, Does the use of custom partitioner in Streaming affect performance? On Mon, Oct 5, 2015 at 1:06 PM, Adrian Tanase wrote: > Great article, especially the use of a custom partitioner. > > Also, sorting by multiple fields by creating a tuple out of them is an > awesome,

RE: How to set memory for SparkR with master="local[*]"

2015-10-25 Thread Sun, Rui
As documented in http://spark.apache.org/docs/latest/configuration.html#available-properties, Note for “spark.driver.memory”: Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead,
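The timing constraint in that note can be sketched with the launcher API (a sketch, assuming `org.apache.spark.launcher.SparkLauncher` as shipped since Spark 1.4; the `"4g"` value is illustrative):

```scala
import org.apache.spark.launcher.SparkLauncher

// Sketch (client mode): give the driver its memory via the launcher, before
// the driver JVM starts. Setting spark.driver.memory on SparkConf inside
// the application is too late, because that JVM is already running.
val launcher = new SparkLauncher()
  .setConf(SparkLauncher.DRIVER_MEMORY, "4g")
// equivalently on the command line: spark-submit --driver-memory 4g ...
```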

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Zhiliang Zhu
On Monday, October 26, 2015 11:26 AM, Zhiliang Zhu wrote: Hi DB Tsai, Thanks very much for your kind help. I get it now. I am sorry that there is another issue: the weight/coefficient result is perfect while A is a triangular matrix; however, while A

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Meihua Wu
please add "setFitIntercept(false)" to your LinearRegression. LinearRegression by default includes an intercept in the model, e.g. label = intercept + features dot weight. To get the result you want, you need to force the intercept to be zero. Just curious, are you trying to solve systems of
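Meihua's suggestion can be sketched against the ML pipeline API (a sketch only: `trainingData` is a hypothetical DataFrame with label/features columns, and the iteration count is illustrative, not the poster's actual code):

```scala
import org.apache.spark.ml.regression.LinearRegression

// Sketch: force the intercept to zero so the fitted model is
// label = features dot weight, with no additive constant term.
val lr = new LinearRegression()
  .setFitIntercept(false) // default is true: label = intercept + features dot weight
  .setMaxIter(100)
val model = lr.fit(trainingData) // trainingData: hypothetical DataFrame
println(model.coefficients)
```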

Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Felix Cheung
This might be related to https://issues.apache.org/jira/browse/SPARK-10500 On Sun, Oct 25, 2015 at 9:57 AM -0700, "Ted Yu" wrote: In zipRLibraries(): // create a zip file from scratch, do not append to existing file. val zipFile = new File(dir, name) I guess

Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Ram Venkatesh
Ted Yu, Agree that either picking up sparkr.zip if it already exists, or creating a zip in a local scratch directory will work. This code is called by the client side job submission logic and the resulting zip is already added to the local resources for the YARN job, so I don't think the

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Zhiliang Zhu
Hi DB Tsai, Thanks very much for your kind reply help. As for your comment, I just modified and tested the key part of the codes:  LinearRegression lr = new LinearRegression()    .setMaxIter(1)    .setRegParam(0)    .setElasticNetParam(0);  //the number could be reset  final

Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Ted Yu
In zipRLibraries(): // create a zip file from scratch, do not append to existing file. val zipFile = new File(dir, name) I guess instead of creating sparkr.zip in the same directory as R lib, the zip file can be created under some directory writable by the user launching the app and
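Ted's suggestion, building the zip under a user-writable scratch directory instead of next to the R lib, can be sketched with plain java.util.zip (all names here are hypothetical illustrations, not the actual zipRLibraries code):

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}

// Hypothetical sketch: write the zip into a user-writable temp directory
// rather than the (possibly read-only) Spark install directory.
def zipToScratch(sourceDir: File, name: String): File = {
  val scratch = new File(System.getProperty("java.io.tmpdir"))
  val zipFile = new File(scratch, name)
  val out = new ZipOutputStream(new FileOutputStream(zipFile))
  for (f <- Option(sourceDir.listFiles()).getOrElse(Array.empty[File]) if f.isFile) {
    out.putNextEntry(new ZipEntry(f.getName))
    val in = new FileInputStream(f)
    val buf = new Array[Byte](8192)
    Iterator.continually(in.read(buf)).takeWhile(_ != -1)
      .foreach(n => out.write(buf, 0, n))
    in.close()
    out.closeEntry()
  }
  out.close()
  zipFile
}
```

The point of the sketch is only the location of `zipFile`: java.io.tmpdir is writable by the launching user, which sidesteps the permission problem on the install directory.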

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Ted Yu
A dependency couldn't be downloaded: [INFO] +- com.h2database:h2:jar:1.4.183:test Have you checked your network settings ? Cheers On Sun, Oct 25, 2015 at 10:22 AM, Bilinmek Istemiyor wrote: > Thank you for the quick reply. You are God Send. I have long not been >

[SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread Zhiliang Zhu
Dear All, I have a program, shown below, whose result leaves me very confused. It is about a multi-dimensional linear regression model; the weight/coefficient is always perfect while the dimension is smaller than 4, otherwise it is wrong all the time. Or, whether the

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Bilinmek Istemiyor
Thank you for the quick reply. You are a godsend. I have long not been programming in Java and know nothing about Maven, Scala, sbt, and Spark stuff. I used Java 7 since the build failed with Java 8. Which Java version do you advise in general for using Spark? I can downgrade the Scala version as well. Can you

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
LinearRegressionWithSGD is not stable. Please use linear regression in ML package instead. http://spark.apache.org/docs/latest/ml-linear-methods.html Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Sun, Oct 25,

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
Hi Bilinmek, Spark 1.5.x does not support Scala 2.11.7, so the easiest thing to do is build it like you're trying. Here are the steps I followed to build it in a Mac OS X 10.10.5 environment; it should be very similar on Ubuntu. 1. set the JAVA_HOME environment variable in my bash session via export

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Sean Owen
Hm, why do you say it doesn't support 2.11? It does. It is not even this difficult; you just need a source distribution, and then run "./dev/change-scala-version.sh 2.11" as you say. Then build as normal On Sun, Oct 25, 2015 at 4:00 PM, Todd Nist wrote: > Hi Bilnmek, > >

Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Yao
I have not been able to start Spark scala shell since 1.5 as it was not able to create the sqlContext during the startup. It complains the metastore_db is already locked: "Another instance of Derby may have already booted the database". The Derby log is attached. I only have this problem with
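A common workaround for the "Another instance of Derby may have already booted the database" error is to clear the stale lock files Derby leaves behind after a crashed session (a sketch, assuming the default metastore_db directory in the shell's working directory; the helper name is hypothetical):

```scala
import java.io.File

// Sketch: Derby's *.lck files (db.lck, dbex.lck) survive a crashed
// spark-shell and block the next session; deleting them lets Derby boot
// the database again. Returns the names of the files removed.
def clearDerbyLocks(metastoreDir: File): Seq[String] =
  Option(metastoreDir.listFiles()).getOrElse(Array.empty[File])
    .filter(_.getName.endsWith(".lck"))
    .flatMap(f => if (f.delete()) Some(f.getName) else None)
    .toSeq
```

This is only safe when no other Derby-backed session is genuinely running against the same metastore_db.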

Re: Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Ted Yu
Have you taken a look at the fix for SPARK-11000 which is in the upcoming 1.6.0 release ? Cheers On Sun, Oct 25, 2015 at 8:42 AM, Yao wrote: > I have not been able to start Spark scala shell since 1.5 as it was not > able > to create the sqlContext during the startup. It

Re: question about HadoopFsRelation

2015-10-25 Thread Koert Kuipers
thanks i will read up on that On Sat, Oct 24, 2015 at 12:53 PM, Ted Yu wrote: > The code below was introduced by SPARK-7673 / PR #6225 > > See item #1 in the description of the PR. > > Cheers > > On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuipers wrote: >

SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Ram Venkatesh
If you run sparkR in yarn-client mode, it fails with Exception in thread "main" java.io.FileNotFoundException: /usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied) at java.io.FileOutputStream.open0(Native Method) at

RE: Spark scala REPL - Unable to create sqlContext

2015-10-25 Thread Ge, Yao (Y.)
Thanks. I wonder why this is not widely reported in the user forum. The REPL shell is basically broken in 1.5.0 and 1.5.1. -Yao From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Sunday, October 25, 2015 12:01 PM To: Ge, Yao (Y.) Cc: user Subject: Re: Spark scala REPL - Unable to create sqlContext

Re: Error building Spark on Windows with sbt

2015-10-25 Thread Ted Yu
If you have a pull request, Jenkins can test your change for you. FYI > On Oct 25, 2015, at 12:43 PM, Richard Eggert wrote: > > Also, if I run the Maven build on Windows or Linux without setting > -DskipTests=true, it hangs indefinitely when it gets to >

Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
Yes, I know, but it would be nice to be able to test things myself before I push commits. On Sun, Oct 25, 2015 at 3:50 PM, Ted Yu wrote: > If you have a pull request, Jenkins can test your change for you. > > FYI > > On Oct 25, 2015, at 12:43 PM, Richard Eggert

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Todd Nist
Sorry Sean, you are absolutely right that it supports 2.11; all I meant is that there is no release available as a standard download and that one has to build it. Thanks for the clarification. -Todd On Sunday, October 25, 2015, Sean Owen wrote: > Hm, why do you say it doesn't support

Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
When I try to start up sbt for the Spark build, or if I try to import it in IntelliJ IDEA as an sbt project, it fails with a "No such file or directory" error when it attempts to "git clone" sbt-pom-reader into .sbt/0.13/staging/some-sha1-hash. If I manually create the expected directory before

Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
Also, if I run the Maven build on Windows or Linux without setting -DskipTests=true, it hangs indefinitely when it gets to org.apache.spark.JavaAPISuite. It's hard to test patches when the build doesn't work. :-/ On Sun, Oct 25, 2015 at 3:41 PM, Richard Eggert wrote:

Re: Error building Spark on Windows with sbt

2015-10-25 Thread Richard Eggert
By "it works", I mean, "It gets past that particular error". It still fails several minutes later with a different error: java.lang.IllegalStateException: impossible to get artifacts when data has not been loaded. IvyNode = org.scala-lang#scala-library;2.10.3 On Sun, Oct 25, 2015 at 3:38 PM,

Re: Newbie Help for spark compilation problem

2015-10-25 Thread Sean Owen
No, 2.11 artifacts are in fact published: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-parent_2.11%22 On Sun, Oct 25, 2015 at 7:37 PM, Todd Nist wrote: > Sorry Sean you are absolutely right it supports 2.11 all o meant is there is > no release available as a