Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-15 Thread Shixiong Zhu
Thanks for reporting it Terry. I submitted a PR to fix it: https://github.com/apache/spark/pull/9132 Best Regards, Shixiong Zhu 2015-10-15 2:39 GMT+08:00 Reynold Xin : > +dev list > > On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > >> All, >> >>

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Koert Kuipers
If DataFrame aspires to be more than a vehicle for SQL, then I think it would be a mistake to allow duplicate column names. It is very confusing. pandas indeed allows this and it has led to many bugs. R does not allow it for data.frame (it renames the duplicate names). I would consider a csv with
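Koert's point about pandas is easy to reproduce (a quick illustration; pandas is assumed to be installed): selecting a duplicated column silently returns a two-column DataFrame instead of the Series most downstream code expects, which is a classic source of bugs.

```python
import pandas as pd

# pandas happily accepts duplicate column names...
df = pd.DataFrame([[1, 2], [3, 4]], columns=["name", "name"])

# ...and selecting "name" now yields a DataFrame containing BOTH columns,
# not a Series.
selected = df["name"]
print(type(selected).__name__)  # DataFrame, not Series
print(selected.shape)           # (2, 2)
```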

Re: Gradient Descent with large model size

2015-10-15 Thread Joseph Bradley
For those numbers of partitions, I don't think you'll actually use tree aggregation. The number of partitions needs to be over a certain threshold (>= 7) before treeAggregate really operates on a tree structure:
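For intuition, here is a small sketch (in Python, loosely paraphrasing the partition-count check that Spark 1.5's Scala `RDD.treeAggregate` performs; the exact cutoff depends on the depth) of when the tree-shaped combine step actually kicks in:

```python
import math

def uses_tree_combine(num_partitions: int, depth: int = 2) -> bool:
    """Rough paraphrase of the loop condition inside RDD.treeAggregate:
    partitions are merged level by level only while there are enough of
    them; with few partitions the loop never runs and the aggregation
    degenerates into a plain reduce on the driver."""
    scale = max(int(math.ceil(num_partitions ** (1.0 / depth))), 2)
    return num_partitions > scale + math.ceil(num_partitions / scale)

for n in (2, 4, 8, 16, 32):
    print(n, uses_tree_combine(n))
```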

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Xiao Li
True. As long as we can ensure the correct message is printed out, users can correct their app easily. For example, Reference 'name' is ambiguous, could be: name#1, name#5.; Thanks, Xiao Li 2015-10-14 23:58 GMT-07:00 Reynold Xin : > That could break a lot of applications.

Re: PMML export for LinearRegressionModel

2015-10-15 Thread Fazlan Nazeem
This is the API doc for LinearRegressionModel. It does not implement PMMLExportable: https://spark.apache.org/docs/latest/api/java/index.html On Thu, Oct 15, 2015 at 3:11 PM, canan chen wrote: > The method toPMML is in trait PMMLExportable > > *LinearRegressionModel has this

PMML export for LinearRegressionModel

2015-10-15 Thread Fazlan Nazeem
Hi, I am trying to export a LinearRegressionModel in PMML format. According to the following resource [1], PMML export is supported for LinearRegressionModel. [1] https://spark.apache.org/docs/latest/mllib-pmml-model-export.html But there is no toPMML method in the LinearRegressionModel class

Re: PMML export for LinearRegressionModel

2015-10-15 Thread Fazlan Nazeem
OK, it turns out I was using the wrong LinearRegressionModel, the one in package org.apache.spark.ml.regression. On Thu, Oct 15, 2015 at 3:23 PM, Fazlan Nazeem wrote: > This is the API doc for LinearRegressionModel. It does not implement > PMMLExportable > >

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Reynold Xin
That could break a lot of applications. In particular, a lot of input data sources (csv, json) don't have clean schemas and can have duplicate column names. For the case of join, maybe a better solution is to ask for a left/right prefix/suffix in the user code, similar to what pandas does. On Wed,
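Reynold's pandas analogy refers to the suffixes that pandas.merge appends to overlapping column names. A minimal stdlib-only sketch of that renaming policy (a hypothetical helper, just to illustrate the idea) could look like:

```python
def disambiguate(left_cols, right_cols, suffixes=("_x", "_y")):
    """Mimic pandas.merge's handling of overlapping column names:
    names unique to one side pass through, shared names get a suffix."""
    overlap = set(left_cols) & set(right_cols)
    left = [c + suffixes[0] if c in overlap else c for c in left_cols]
    right = [c + suffixes[1] if c in overlap else c for c in right_cols]
    return left, right

print(disambiguate(["id", "name"], ["name", "age"]))
# → (['id', 'name_x'], ['name_y', 'age'])
```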

Re: Understanding code/closure shipment to Spark workers

2015-10-15 Thread Lars Francke
Hi Arijit, my understanding is the following: RDD actions will at some point call the runJob method of a SparkContext. That runJob method calls the clean method, which in turn calls ClosureCleaner.clean, which removes unneeded stuff from closures and also checks whether they are serializable. The
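The serializability check at the end is the part that usually bites users. A loose Python analogy (Spark itself does the equivalent with Java serialization inside ClosureCleaner; this is only an illustration of the probe):

```python
import pickle
import threading

def is_serializable(obj) -> bool:
    """Try to serialize an object the way the closure cleaner probes a
    cleaned closure; anything that fails here would fail at job submission."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(is_serializable({"weights": [0.1, 0.2]}))  # plain data: True
print(is_serializable(threading.Lock()))         # a lock: False
```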

Re: PMML export for LinearRegressionModel

2015-10-15 Thread canan chen
The method toPMML is in trait PMMLExportable. LinearRegressionModel has this trait, so you should be able to call LinearRegressionModel#toPMML. On Thu, Oct 15, 2015 at 5:25 PM, Fazlan Nazeem wrote: > Hi > > I am trying to export a LinearRegressionModel in PMML format.

MLlib Contribution

2015-10-15 Thread Kybe67
Hi, I made a clustering algorithm in Scala/Spark during my internship. I would like to contribute it to MLlib, but I don't know how; I am doing my best to follow these instructions:

RE: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-15 Thread Ulanov, Alexander
Hi Disha, This is a good question. We plan to elaborate on it in our talk at the upcoming Spark Summit. Fewer workers means less compute power; more workers means more communication overhead. So there exists an optimal number of workers for solving an optimization problem with batch gradient, given
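Alexander's tradeoff can be made concrete with a toy cost model (illustrative only; the constants are made up): per-iteration time is the compute work divided across n workers plus a communication term that grows with n, so the total is minimized at an intermediate n.

```python
def iteration_time(n, compute=100.0, comm=1.0):
    """Toy model: compute cost is split n ways, communication grows linearly in n."""
    return compute / n + comm * n

# Sweep worker counts and pick the one with the lowest modeled time.
best = min(range(1, 51), key=iteration_time)
print(best, iteration_time(best))  # optimum sits at sqrt(compute/comm) = 10
```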

RE: Gradient Descent with large model size

2015-10-15 Thread Ulanov, Alexander
Hi Joseph, There seems to be no improvement if I run it with more partitions or a bigger depth:
N = 6 Avg time: 13.49157910868
N = 7 Avg time: 8.929480508
N = 8 Avg time: 14.50712347198
N = 9 Avg time: 13.85487164533
Depth = 3:
N = 2 Avg time: 8.85389534633
N = 5 Avg time:

Re: Network-related environmental problem when running JDBCSuite

2015-10-15 Thread Richard Hillegas
Thanks for everyone's patience with this email thread. I have fixed my environmental problem and my tests run cleanly now. This seems to be a problem which afflicts modern JVMs on Mac OSX (and maybe other unix variants). The following can happen on these platforms:
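The usual symptom on Mac OSX is that the machine's hostname is not listed in /etc/hosts, so the JVM's InetAddress.getLocalHost() lookup fails or stalls. The same resolution step can be probed from Python (an analogy for diagnosis, not what Spark itself runs):

```python
import socket

host = socket.gethostname()
try:
    addr = socket.gethostbyname(host)
    print(f"{host} resolves to {addr}")
except socket.gaierror as exc:
    # This branch is what a misconfigured /etc/hosts looks like.
    print(f"{host} does not resolve: {exc}")
```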

Re: Building Spark

2015-10-15 Thread Ted Yu
bq. Access is denied Please check the permissions on the path mentioned. On Thu, Oct 15, 2015 at 3:45 PM, Annabel Melongo < melongo_anna...@yahoo.com.invalid> wrote: > I was trying to build a cloned version of Spark on my local machine using > the command: > mvn -Pyarn -Phadoop-2.4

Building Spark

2015-10-15 Thread Annabel Melongo
I was trying to build a cloned version of Spark on my local machine using the command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package However, I got the error: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.4.1:shade (default)

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-15 Thread Josh Rosen
To clarify, we're asking about the *spark.sql.tungsten.enabled* flag, which was introduced in Spark 1.5 and enables Project Tungsten optimizations in Spark SQL. This option is set to *true* by default in Spark 1.5+ and exists primarily to allow users to disable the new code paths if they encounter
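For anyone who wants to try their workload both ways, the flag can be flipped at submit time (shown here for spark-shell; spark-submit takes the same --conf form):

```shell
# Spark 1.5+: Tungsten is on by default; this disables the new code paths.
bin/spark-shell --conf spark.sql.tungsten.enabled=false

# Note this is a different knob from the shuffle manager setting
# discussed elsewhere in this thread:
# bin/spark-shell --conf spark.shuffle.manager=tungsten-sort
```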

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-15 Thread mkhaitman
My apologies for mixing up what was being referred to in that case! :) Mark. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14629.html Sent from the Apache Spark Developers List mailing list

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-15 Thread mkhaitman
Are you referring to spark.shuffle.manager=tungsten-sort? If so, we saw the default value as still being the regular sort, and since it was only first introduced in 1.5, we were actually waiting a bit to see if anyone ENABLED it as opposed to DISABLING it, since it's disabled by default! :) I

Network-related environmental problem when running JDBCSuite

2015-10-15 Thread Richard Hillegas
I am seeing what look like environmental errors when I try to run a test on a clean local branch which has been sync'd to the head of the development trunk. I would appreciate advice about how to debug or hack around this problem. For the record, the test ran cleanly last week. This is the

Re: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-15 Thread Disha Shrivastava
Hi Alexander, Thanks for your reply. Actually I am working with a modified version of the actual MNIST dataset (maximum samples = 8.2M): https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html. I have been running different sized versions (1, 10, 50, 1M, 8M samples) on

Re: Network-related environmental problem when running JDBCSuite

2015-10-15 Thread Richard Hillegas
Continuing this lively conversation with myself (hopefully this archived thread may be useful to someone else in the future): I set the following environment variable as recommended by this page:

Re: Network-related environmental problem when running JDBCSuite

2015-10-15 Thread sgoodwin
Rick, Try setting the environment variable SPARK_LOCAL_IP=127.0.0.1 in your spark-env.conf (if not done yet) ... Regards, - Steve From: Richard Hillegas Sent: Thursday, October 15, 2015 1:50 PM To: Richard Hillegas Cc: Dev Subject: Re: Network-related environmental problem when running
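For reference, the usual place for this setting on a source checkout is conf/spark-env.sh (the "spark-env.conf" above is presumably a slip for the same file):

```shell
# conf/spark-env.sh — force Spark to bind to loopback instead of the
# (possibly unresolvable) machine hostname.
export SPARK_LOCAL_IP=127.0.0.1
```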

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Michael Armbrust
> > In Hive, the ambiguous name can be resolved by using the table name as a > prefix, but it seems the DataFrame API doesn't support it (I mean the DataFrame API rather > than SparkSQL) You can do the same using pure DataFrames. Seq((1,2)).toDF("a", "b").registerTempTable("y") Seq((1,4)).toDF("a",

Re: Network-related environmental problem when running JDBCSuite

2015-10-15 Thread Richard Hillegas
For the record, I get the same error when I simply try to boot the spark shell:
bash-3.2$ bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See