Cleaning spark memory

2016-06-10 Thread Cesar Flores
Thanks -- Cesar Flores

Integrating spark source in an eclipse project?

2016-06-07 Thread Cesar Flores
I created a Spark application in Eclipse by including the spark-assembly-1.6.0-hadoop2.6.0.jar file in the build path. However, this method does not let me see the Spark source code. Is there an easy way to include the Spark source code for reference in an application developed in Eclipse? Thanks! -- Cesar
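A hedged aside, not from the thread: if the project uses sbt with the sbteclipse plugin, declaring Spark as a managed dependency with withSources() lets Eclipse attach the source jars automatically, instead of hand-adding the assembly jar to the build path.

    // Managed dependency with attached sources (sbt build definition).
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided" withSources()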

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Cesar Flores
Please send it to me too! Thanks!!! Cesar Flores On Tue, May 17, 2016 at 4:55 PM, Femi Anthony <femib...@gmail.com> wrote: > Please send it to me as well. > Thanks > Sent from my iPhone > On May 17, 2016, at 12:09 PM, Raghavendra Pandey <ra

DAG Pipelines?

2016-05-04 Thread Cesar Flores
this functionality may be useful? Thanks -- Cesar Flores

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Cesar Flores
Gourav,

I wish that was the case, but I have done a select count on each of the two tables individually, and they return different numbers of rows:

dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

dps.count()
RESULT: 42632

swig.count()
RESULT: 42034

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> This shows that both the tables have matching records and no mismatches. Therefore obviously you have the same results irrespective of whether you use right or left join.
> I think that there is no problem here, unless I am missing something.
> Regards, Gourav

On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
> Also, the results of the inner query produced the same results:
> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> RESULT: 23747

-- Cesar Flores
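As a hedged aside (a well-known pitfall with this query shape, not necessarily this thread's resolution): in an outer join, a WHERE predicate on a column from a null-producing side filters out the null-extended rows, so the join behaves like an inner join; keeping such predicates in the ON clause preserves outer semantics. A minimal sketch reusing the table names above:

    // WHERE version: null-extended rows fail the date predicates and are
    // dropped, so the full outer join degenerates into an inner join.
    val filteredAfter = sqlContext.sql(
      "SELECT s.date, d.date FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d " +
      "ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) " +
      "WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'")

    // ON version: the date bounds are part of the join condition, so
    // unmatched rows survive with nulls and outer semantics are preserved.
    val trueOuter = sqlContext.sql(
      "SELECT s.date, d.date FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d " +
      "ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad " +
      "AND s.date >= '2016-01-03' AND d.date >= '2016-01-03')")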

Re: how to query the number of running executors?

2016-04-06 Thread Cesar Flores
Thanks Ted: That is the kind of answer I was looking for. Best, Cesar Flores On Wed, Apr 6, 2016 at 3:01 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Have you looked at SparkListener? > /** Called when the driver registers a new executor. */ > def onExe
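A minimal sketch of the SparkListener approach Ted points to (assuming Spark 1.4+, where the executor-added/removed events exist; the class name is illustrative):

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

    // Keeps a live count of executors as the driver registers and removes them.
    class ExecutorCountListener extends SparkListener {
      val liveExecutors = new AtomicInteger(0)
      override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit =
        liveExecutors.incrementAndGet()
      override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
        liveExecutors.decrementAndGet()
    }

    // Register before running jobs, then read liveExecutors.get() when needed:
    // sc.addSparkListener(new ExecutorCountListener)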

how to query the number of running executors?

2016-04-06 Thread Cesar Flores
Hello: I wonder if there is a way to query the number of running executors (not the number of requested executors) inside a Spark job? Thanks -- Cesar Flores

Spark property parameters priority

2016-03-11 Thread Cesar Flores
in the config parameter spark.sql.shuffle.partitions, which I need to modify on the fly for group-by clauses depending on the size of my input. Thanks -- Cesar Flores
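A minimal sketch of changing that parameter on the fly (inputRowCount and the threshold values are illustrative, not from the thread):

    // Pick a shuffle partition count from the observed input size;
    // setConf takes effect for subsequent queries on this context.
    val target = if (inputRowCount > 100000000L) 2000 else 200
    sqlContext.setConf("spark.sql.shuffle.partitions", target.toString)
    val grouped = df.groupBy("key").count()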

performance of personalized page rank

2016-03-01 Thread Cesar Flores
of time (i.e. less than 12 hours). Best -- Cesar Flores

Re: Migrating Transformers from Spark 1.3.1 to 1.5.0

2016-02-15 Thread Cesar Flores
I found my problem. I was calling setParameterValue(defaultValue) more than once in my class hierarchy. Thanks! On Mon, Feb 15, 2016 at 6:34 PM, Cesar Flores <ces...@gmail.com> wrote: > I have a set of transformers (each with specific parameters) in Spark > 1.

Migrating Transformers from Spark 1.3.1 to 1.5.0

2016-02-15 Thread Cesar Flores
Does anyone have any idea of what I may be doing wrong? My guess is that I am doing something weird in my class hierarchy but cannot figure out what. Thanks! -- Cesar Flores

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Cesar Flores
> column on which you are running orderBy? If yes, you are better off not running the orderBy clause.
>
> Maybe someone from the Spark SQL team could answer how the partitioning of the output DF should be handled when doing an orderBy?
>
> Hemant

Optimal way to re-partition from a single partition

2016-02-08 Thread Cesar Flores
with a single partition and around 14 million records:

val newDF = hc.createDataFrame(rdd, df.schema)

This process is really slow. Is there any other way of achieving this task, or of optimizing it (perhaps by tweaking a Spark configuration parameter)? Thanks a lot -- Cesar Flores
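For what it's worth, a shorter route under the same assumptions (Spark 1.3+ DataFrame API), which avoids the RDD round-trip and its schema re-application:

    // Repartition the DataFrame directly; 100 is an illustrative target count.
    val newDF = df.repartition(100)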

A question about sql clustering

2015-11-23 Thread Cesar Flores
very useful for performing joins later). Is that true? And a second question: if I save df just after the query into a Hive table, when I reload this table from Hive, will Spark remember the partitioning? I am using Spark version 1.3.1 at the moment. Thanks -- Cesar Flores

Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)

Thanks a lot -- Cesar Flores

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
> aware of RDD-level partitioning since it's mostly a black box.
>
> 1) could be fixed by adding caching. 2) is on our roadmap (though you'd have to use logical DataFrame expressions to do the partitioning instead of a class-based partitioner).
>
> On Wed, Oct 14, 2015 at 8:45 AM

Is coalesce smart while merging partitions?

2015-10-07 Thread Cesar Flores
to merge is random? Thanks -- Cesar Flores
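For reference, a sketch of the two coalesce modes on the RDD API; without a shuffle, the grouping of parent partitions is locality-based (co-located partitions are merged) rather than random:

    val narrowed = rdd.coalesce(10)                   // no shuffle; merges co-located partitions
    val reshuffled = rdd.coalesce(10, shuffle = true) // full shuffle; same as repartition(10)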

Re: shutdown local hivecontext?

2015-08-06 Thread Cesar Flores
Linux path /home/my_user_name, which fails. On Thu, Aug 6, 2015 at 3:12 PM, Cesar Flores <ces...@gmail.com> wrote: > Well, I tried this approach, and still have issues. Apparently TestHive cannot delete the Hive metastore directory. The complete error that I have is: 15/08/06 15:01:29 ERROR Driver

Re: shutdown local hivecontext?

2015-08-06 Thread Cesar Flores
On Mon, Aug 3, 2015 at 5:56 PM, Michael Armbrust mich...@databricks.com wrote: TestHive takes care of creating a temporary directory for each invocation so that multiple test runs won't conflict. On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores ces...@gmail.com wrote: We are using a local hive
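A minimal sketch of the TestHive pattern described above (assuming the spark-hive test jar is on the test classpath):

    import org.apache.spark.sql.hive.test.TestHive

    // TestHive provisions an isolated temporary metastore/warehouse per JVM;
    // reset() clears registered temp tables and caches between test runs.
    TestHive.reset()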

shutdown local hivecontext?

2015-08-03 Thread Cesar Flores
libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0" % "test",
parallelExecution in Test := false,
fork := true,
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")

We are working under Spark 1.3.0. Thanks -- Cesar Flores

Dataframe in single partition after sorting?

2015-07-02 Thread Cesar Flores
!!! -- Cesar Flores

Dataframe random permutation?

2015-06-01 Thread Cesar Flores
tried also: hc.createDataFrame(df.rdd.repartition(100), df.schema), which appears to be a random permutation. Can someone confirm that the last line is in fact a random permutation, or point me to a better approach? Thanks -- Cesar Flores
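A hedged alternative (assuming Spark 1.4+): sorting on a random key gives an explicit permutation, rather than relying on repartition's shuffle order as a side effect:

    import org.apache.spark.sql.functions.rand

    // Sort on a random column; the seed (illustrative) makes it reproducible.
    val permuted = df.orderBy(rand(42))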

dataframe cumulative sum

2015-05-29 Thread Cesar Flores
cumsum column as the next one:

flag | price            | cumsum_price
-----|------------------|----------------
 1   | 47.808764653746  | 47.808764653746
 1   | 47.808764653746  | 95.6175293075
 1   | 31.9869279512204 | 127.604457259

Thanks -- Cesar Flores
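A minimal sketch of one way to build such a column with window functions (assuming Spark 1.4+ with a HiveContext; column names follow the table above):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // Running sum of price within each flag group; the frame spans from the
    // first row of the partition up to the current row.
    val w = Window.partitionBy("flag").orderBy("price").rowsBetween(Long.MinValue, 0)
    val withCumsum = df.withColumn("cumsum_price", sum("price").over(w))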

Adding an indexed column

2015-05-28 Thread Cesar Flores
as the next one:

flag | price            | index
-----|------------------|------
 1   | 47.808764653746  | 0
 1   | 47.808764653746  | 1
 1   | 31.9869279512204 | 2
 1   | 47.7907893713564 | 3
 1   | 16.7599200038239 | 4
 1   | 16.7599200038239 | 5
 1   | 20.3916014172137 | 6

-- Cesar Flores
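A minimal sketch of one common way to get such an index (zipWithIndex on the underlying RDD; hc stands for a HiveContext/SQLContext, as elsewhere in these threads):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Zip each row with a stable 0-based index, then rebuild the DataFrame
    // with the extra column appended to the original schema.
    val indexedRows = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val schemaWithIndex =
      StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
    val indexed = hc.createDataFrame(indexedRows, schemaWithIndex)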

partitioning after extracting from a hive table?

2015-05-22 Thread Cesar Flores
I have a table in a Hive database partitioned by date. I notice that when I query this table using HiveContext, the created data frame has a specific number of partitions. Does this partitioning correspond to my original table partitioning in Hive? Thanks -- Cesar Flores

Naming a DF aggregated column

2015-05-19 Thread Cesar Flores
on the fly, and not after performing the aggregation? Thanks -- Cesar Flores
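A minimal sketch of naming the aggregate at aggregation time (column names are illustrative):

    import org.apache.spark.sql.functions.sum

    // .as(...) names the result inline, instead of renaming the
    // auto-generated column after the aggregation.
    val agg = df.groupBy("flag").agg(sum("price").as("total_price"))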

dataframe cannot find fields after loading from hive

2015-04-16 Thread Cesar Flores
Can someone tell me if I need to do some post-processing after loading from Hive in order to avoid this kind of error? Thanks -- Cesar Flores

Re: dataframe cannot find fields after loading from hive

2015-04-16 Thread Cesar Flores
Never mind. I found the solution: val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema), which translates to converting the data frame to an RDD and back again to a data frame. Not the prettiest solution, but at least it solves my problem. Thanks, Cesar Flores

ML Pipeline question about caching

2015-03-17 Thread Cesar Flores
-- Cesar Flores

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Cesar Flores
, Cesar Flores <ces...@gmail.com> wrote: > I am new to the SchemaRDD class, and I am trying to decide between using SQL queries and Language Integrated Queries (https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD). Can someone tell me what the main difference

SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Cesar Flores
different syntax? Are they interchangeable? Which one has better performance? Thanks a lot -- Cesar Flores
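For contrast, a sketch of the same query both ways against a registered table (1.2-era SchemaRDD API; people is an illustrative SchemaRDD, and the DSL form needs import sqlContext._ for the symbol-to-column conversion). Both forms resolve to essentially the same query plan:

    import sqlContext._

    val viaSql = sql("SELECT name FROM people WHERE age >= 21")
    val viaDsl = people.where('age >= 21).select('name)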

Data Frame types

2015-03-06 Thread Cesar Flores
) will be able to handle user-defined classes too? Will user classes need to extend it, or will they need to define the same approach? -- Cesar Flores

Re: SchemaRDD.select

2015-02-19 Thread Cesar Flores
to hear the opinion of an expert about it. Thanks. On Thu, Feb 19, 2015 at 12:01 PM, Cesar Flores <ces...@gmail.com> wrote: > I am trying to pass a variable number of arguments to the select function of a SchemaRDD I created, as I want to select the fields at run time: val
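A minimal sketch of the usual varargs pattern for this (assuming the 1.3 DataFrame API; the field list is illustrative):

    // Expand a runtime list of column names into select's varargs
    // using Scala's :_* splat on the tail.
    val fields = Seq("name", "age")
    val projected = df.select(fields.head, fields.tail: _*)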

SchemaRDD.select

2015-02-19 Thread Cesar Flores
will be a better approach for selecting the required fields at run time? Thanks in advance for your help -- Cesar Flores

ML Transformer

2015-02-18 Thread Cesar Flores
is private to the ml package:

private[ml] def transformSchema(schema: StructType, paramMap: ParamMap): StructType

Can any user create their own transformers? If not, will this functionality be added in the future? Thanks -- Cesar Flores
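For reference, later releases opened this method up; a minimal custom-transformer sketch against the Spark 1.6 ml API, where transformSchema is public:

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    // A do-nothing transformer showing the three members a user subclass
    // must provide: transform, transformSchema, and copy.
    class IdentityTransformer(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("identity"))
      override def transform(df: DataFrame): DataFrame = df
      override def transformSchema(schema: StructType): StructType = schema
      override def copy(extra: ParamMap): IdentityTransformer = defaultCopy(extra)
    }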