Re: Union of multiple data frames

2018-04-05 Thread Cesar
Thanks for your answers. The suggested method works when the number of Data Frames is small. However, I am trying to union >30 Data Frames, and the time to create the plan is taking longer than the execution, which should not be the case. Thanks! -- Cesar On Thu, Apr 5, 2018 at 1:29 PM, A
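
A widely used workaround for this plan blow-up (a sketch, assuming a Spark 2.x SparkSession named spark and a hypothetical Seq of same-schema DataFrames): union once at the RDD level, so the optimizer sees a single node instead of a 30-deep tree of unions.

    // Hypothetical: thirty DataFrames with identical schemas.
    val dfs = (1 to 30).map(_ => spark.range(0, 100).toDF("id"))

    // One RDD-level union avoids the deep Catalyst plan that a chain of
    // 30 pairwise union calls produces, so planning time stays flat.
    val unioned = spark.createDataFrame(
      spark.sparkContext.union(dfs.map(_.rdd)),
      dfs.head.schema)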

Union of multiple data frames

2018-04-05 Thread Cesar
frames?* thanks -- Cesar Flores

Unpersist all from memory in spark 2.2

2017-09-25 Thread Cesar
Is there a way to unpersist all data frames, data sets, and/or RDD in Spark 2.2 in a single call? Thanks -- Cesar Flores
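
The closest thing to a single call in Spark 2.x (a sketch; clearCache covers data cached through the Dataset/SQL layer, while RDDs persisted by hand are tracked separately):

    // One call: drop everything cached via the Dataset/SQL cache manager.
    // RDDs persisted directly still need their own unpersist calls.
    spark.catalog.clearCache()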

tuning the spark.locality.wait

2017-01-21 Thread Cesar
that I can try ? Thanks a lot ! -- Cesar Flores
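
For reference, a sketch of the relevant knobs (values are illustrative, not recommendations; the global default is 3s, and the wait can also be overridden per locality level):

    import org.apache.spark.SparkConf

    // Shorter waits trade data locality for cluster utilization;
    // "0s" disables locality waiting entirely.
    val conf = new SparkConf()
      .set("spark.locality.wait", "1s")       // global default: 3s
      .set("spark.locality.wait.node", "1s")  // per-level override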

credentials are not hiding on a jdbc query

2016-12-06 Thread Cesar
am looking more for a hack kind of solution. Thanks a lot ! -- Cesar Flores

does column order matter in dataframe.repartition?

2016-11-17 Thread Cesar
columns variable? Thanks -- Cesar Flores

Logistic Regression Standardization in ML

2016-10-10 Thread Cesar
* @group setParam */ Specifically, I am having trouble understanding why the solution should converge to the same weight values with/without standardization. Thanks! -- Cesar Flores
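
The scaladoc being quoted belongs to setStandardization; the point it makes is that without regularization the two settings should converge to the same solution (standardization only changes the conditioning of the optimization), whereas with L1/L2 the penalty acts on the scaled coefficients, so the optima legitimately differ. A minimal sketch of the knob, assuming Spark ML 1.5+:

    import org.apache.spark.ml.classification.LogisticRegression

    // With regParam = 0.0, training with or without standardization should
    // reach the same weights. With regularization enabled, the penalty is
    // applied in the scaled space, so the solutions differ.
    val lr = new LogisticRegression()
      .setRegParam(0.0)
      .setStandardization(false) // default is true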

releasing memory without stopping the spark context ?

2016-08-31 Thread Cesar
Is there a way to release all persisted RDD's/DataFrame's in Spark without stopping the SparkContext ? Thanks a lot -- Cesar Flores
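
A sketch for that era's APIs (getPersistentRDDs is the driver-side registry of persisted RDDs, and should also cover the batches backing cached DataFrames; sqlContext.clearCache() handles the SQL cache manager):

    // Clear the SQL/DataFrame cache manager...
    sqlContext.clearCache()
    // ...and unpersist every RDD the context is still tracking,
    // without stopping the SparkContext itself.
    sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = true))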

Logistic regression formula string

2016-08-08 Thread Cesar
for something similar to what R output does (where it clearly indicates which weight corresponds to each feature name, including categorical ones). Thanks a lot ! -- Cesar Flores

saving data frame to optimize joins at a later time

2016-08-02 Thread Cesar
SELECT * FROM tableAlias " ) Will the partition information ("id") be stored in whse.someTable such that, when querying that table in a second Spark job, the information will be used for optimizing joins, for example? If this approach does not work, can you suggest one that does? Thanks -- Cesar Flores
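
A sketch of the two write-time layouts (table and column names are taken from the question; bucketBy requires Spark 2.0+). Hive-style partitioning prunes scans on the partition column but does not co-partition data for joins; bucketing is the feature aimed at shuffle-free joins in a later job:

    // Hive-style partitioning: later filters on `id` prune files,
    // but joins still shuffle.
    df.write.partitionBy("id").saveAsTable("whse.someTable")

    // Bucketing (Spark 2.0+): a later job joining on `id` against a table
    // bucketed the same way can avoid the exchange.
    df.write.bucketBy(64, "id").sortBy("id").saveAsTable("whse.someTableBucketed")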

How to check if a data frame is cached?

2016-07-14 Thread Cesar
Is there a simpler way to check if a data frame is cached other than: dataframe.registerTempTable("cachedOutput") assert(hc.isCached("cachedOutput"), "The table was not cached") Thanks! -- Cesar Flores
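
In later releases there is a direct check (hedged: Dataset.storageLevel appeared around Spark 2.1; the catalog variant needs a registered table name):

    import org.apache.spark.storage.StorageLevel

    // Spark 2.1+: ask the DataFrame itself.
    val isCached = df.storageLevel != StorageLevel.NONE

    // Catalog route, given a registered temp table.
    val isTableCached = spark.catalog.isCached("cachedOutput")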

Cleaning spark memory

2016-06-10 Thread Cesar Flores
? Thanks -- Cesar Flores

Integrating spark source in an eclipse project?

2016-06-07 Thread Cesar Flores
I created a spark application in Eclipse by including the spark-assembly-1.6.0-hadoop2.6.0.jar file in the path. However, this method does not allow me to see spark code. Is there an easy way to include spark source code for reference in an application developed in Eclipse? Thanks ! -- Cesar

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Cesar Flores
Please send it to me too ! Thanks ! ! ! Cesar Flores On Tue, May 17, 2016 at 4:55 PM, Femi Anthony <femib...@gmail.com> wrote: > Please send it to me as well. > > Thanks > > Sent from my iPhone > > On May 17, 2016, at 12:09 PM, Raghavendra Pandey < > ra

DAG Pipelines?

2016-05-04 Thread Cesar Flores
this functionality may be useful?* Thanks -- Cesar Flores

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Cesar Flores
> Yong
>
> From: kpe...@gmail.com
> Date: Mon, 2 May 2016 12:11:18 -0700
> Subject: Re: Weird results with Spark SQL Outer joins
> To: gourav.sengu...@gmail.com
> CC: user@spark.apache.org
>
> Gourav,
>
> I wish that was the case, but I have done a select count on each of the two tables individually and they return different numbers of rows:
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
> dps.count()
> RESULT: 42632
>
> swig.count()
> RESULT: 42034
>
> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> This shows that both the tables have matching records and no mismatches. Therefore obviously you have the same results irrespective of whether you use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kpe...@gmail.com> wrote:
>
> Also, the results of the inner query produced the same results:
> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> RESULT: 23747
>
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-- Cesar Flores

Re: how to query the number of running executors?

2016-04-06 Thread Cesar Flores
Thanks Ted: That is the kind of answer I was looking for. Best, Cesar Flores On Wed, Apr 6, 2016 at 3:01 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Have you looked at SparkListener ? > > /** > * Called when the driver registers a new executor. > */ > def onExe

how to query the number of running executors?

2016-04-06 Thread Cesar Flores
Hello: I wonder if there is a way to query the number of running executors (not the number of requested executors) inside a spark job? Thanks -- Cesar Flores
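
Two sketches (the memory-status map has one entry per executor plus one for the driver, hence the -1; the listener route is Ted's suggestion above):

    // Quick driver-side count of currently registered executors.
    val numExecutors = sc.getExecutorMemoryStatus.size - 1

    // Listener route: track executor registrations as they happen.
    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

    val running = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = running.incrementAndGet()
      override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit = running.decrementAndGet()
    })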

Spark property parameters priority

2016-03-11 Thread Cesar Flores
in the config parameter spark.sql.shuffle.partitions, which I need to modify on the fly to do group by clauses depending on the size of my input.* Thanks -- Cesar Flores
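
That particular property can be changed per-session at run time (a sketch, Spark 1.x SQLContext API):

    // Takes effect for subsequent shuffles in this session, not past ones.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    // Equivalent SQL form:
    sqlContext.sql("SET spark.sql.shuffle.partitions=400")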

performance of personalized page rank

2016-03-01 Thread Cesar Flores
of time (i.e. less than 12 hours). Best -- Cesar Flores

Re: Migrating Transformers from Spark 1.3.1 to 1.5.0

2016-02-15 Thread Cesar Flores
I found my problem. I was calling setParameterValue(defaultValue) more than once in the hierarchy of my classes. Thanks! On Mon, Feb 15, 2016 at 6:34 PM, Cesar Flores <ces...@gmail.com> wrote: > > I have a set of transformers (each with specific parameters) in spark > 1.

Migrating Transformers from Spark 1.3.1 to 1.5.0

2016-02-15 Thread Cesar Flores
.* *Does anyone have any idea what I may be doing wrong? My guess is that I am doing something weird in my class hierarchy but cannot figure out what.* Thanks! -- Cesar Flores

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Cesar Flores
>> column on which you are running orderBy? If yes, you are better off not running the orderBy clause. >> >> Maybe someone from the Spark SQL team could answer how the partitioning of the output DF should be handled when doing an orderBy? >> >> Hemant

Optimal way to re-partition from a single partition

2016-02-08 Thread Cesar Flores
with a single partition and around 14 million records val newDF = hc.createDataFrame(rdd, df.schema) This process is really slow. Is there any other way of achieving this task, or of optimizing it (perhaps by tweaking a spark configuration parameter)? Thanks a lot -- Cesar Flores
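
A sketch of the lighter-weight route (the shuffle of ~14 million rows is still paid, but the round-trip through the RDD API and re-application of the schema are not; DataFrame.repartition exists since Spark 1.3):

    // Redistribute a single-partition DataFrame across 100 partitions.
    val newDF = df.repartition(100)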

A question about sql clustering

2015-11-23 Thread Cesar Flores
very useful for performing joins later). Is that true? And a second question: if I save *df* just after the query into a hive table, when I reload this table from hive, will spark remember the partitioning? I am using at the moment the 1.3.1 spark version. Thanks -- Cesar Flores

Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
) val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema) Thanks a lot -- Cesar Flores

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
aware of RDD level partitioning since it's > mostly a blackbox. > > 1) could be fixed by adding caching. 2) is on our roadmap (though you'd > have to use logical DataFrame expressions to do the partitioning instead of > a class-based partitioner). > > On Wed, Oct 14, 2015 at 8:45 AM

Is coalesce smart while merging partitions?

2015-10-07 Thread Cesar Flores
to merge is random? Thanks -- Cesar Flores
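
For reference, a sketch of the contrast (known behavior: coalesce merges existing partitions without a shuffle and is not data-size aware, so the merged partitions can stay skewed):

    // No shuffle: partitions are combined in place; sizes may stay uneven.
    val merged = df.coalesce(10)

    // Full shuffle: data is redistributed roughly evenly across partitions.
    val balanced = df.repartition(10)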

Re: Spark 1.3.1 on Yarn not using all given capacity

2015-10-06 Thread Cesar Berezowski
3 cores* not 8 César. > On 6 Oct 2015, at 19:08, Cesar Berezowski <ce...@adaltas.com> wrote: > > I deployed hdp 2.3.1 and got spark 1.3.1, spark 1.4 is supposed to be > available as a technical preview I think > > vendor's forum? you mean hortonworks' ? >

Job on Yarn not using all given capacity ends up failing

2015-10-05 Thread Cesar Berezowski
Hi, I recently upgraded from 1.2.1 to 1.3.1 (through HDP). I have a job that does a cartesian product on two datasets (2K and 500K lines minimum) to do string matching. I updated it to use Dataframes because the old code wouldn’t run anymore (deprecated RDD functions). It used to run very

Re: shutdown local hivecontext?

2015-08-06 Thread Cesar Flores
linux path /home/my_user_name, which fails. On Thu, Aug 6, 2015 at 3:12 PM, Cesar Flores ces...@gmail.com wrote: Well, I tried this approach and still have issues. Apparently TestHive cannot delete the hive metastore directory. The complete error that I have is: 15/08/06 15:01:29 ERROR Driver

Re: shutdown local hivecontext?

2015-08-06 Thread Cesar Flores
On Mon, Aug 3, 2015 at 5:56 PM, Michael Armbrust mich...@databricks.com wrote: TestHive takes care of creating a temporary directory for each invocation so that multiple test runs won't conflict. On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores ces...@gmail.com wrote: We are using a local hive

shutdown local hivecontext?

2015-08-03 Thread Cesar Flores
: libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0" % "test", parallelExecution in Test := false, fork := true, javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled") We are working under Spark 1.3.0 Thanks -- Cesar Flores

Dataframe in single partition after sorting?

2015-07-02 Thread Cesar Flores
!!! -- Cesar Flores

Time series data

2015-06-26 Thread Caio Cesar Trucolo
Hi everyone! I am working with multiple time series data and in summary I have to adjust each time series (like inserting average values in data gaps) and then train regression models with mllib for each time series. The adjustment step I did with the adjustment function being mapped for each

Dataframe random permutation?

2015-06-01 Thread Cesar Flores
tried also: hc.createDataFrame(df.rdd.repartition(100),df.schema) which appears to be a random permutation. Can some one confirm me that the last line is in fact a random permutation, or point me out to a better approach? Thanks -- Cesar Flores
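
A sketch of the explicit approach (Spark 1.4+): repartition's shuffle placement is an implementation detail rather than a guaranteed uniform permutation, whereas sorting on a random column is unambiguous:

    import org.apache.spark.sql.functions.rand

    // Order rows by a uniform random value: an explicit random permutation.
    val shuffled = df.orderBy(rand())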

dataframe cumulative sum

2015-05-29 Thread Cesar Flores
cumsum column as the next one:

flag | price            | cumsum_price
-----|------------------|----------------
1    | 47.808764653746  | 47.808764653746
1    | 47.808764653746  | 95.6175293075
1    | 31.9869279512204 | 127.604457259

Thanks -- Cesar Flores
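
A sketch with window functions (available from Spark 1.4; `ts` is a hypothetical ordering column, since a running sum needs a well-defined row order):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // Long.MinValue..0 is "unbounded preceding to current row"
    // in the 1.4-era rowsBetween API.
    val w = Window.partitionBy("flag").orderBy("ts").rowsBetween(Long.MinValue, 0)
    val withCumsum = df.withColumn("cumsum_price", sum("price").over(w))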

Adding an indexed column

2015-05-28 Thread Cesar Flores
as the next one:

flag | price            | index
-----|------------------|------
1    | 47.808764653746  | 0
1    | 47.808764653746  | 1
1    | 31.9869279512204 | 2
1    | 47.7907893713564 | 3
1    | 16.7599200038239 | 4
1    | 16.7599200038239 | 5
1    | 20.3916014172137 | 6

-- Cesar Flores
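
A sketch of the usual approach (hedged: zipWithIndex gives consecutive 0..n-1 indices at the cost of an RDD round-trip; the built-in monotonically-increasing id function is cheaper but yields unique, not consecutive, values):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Consecutive indices via the RDD API, then rebuild the DataFrame
    // with the extra column appended to the schema.
    val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val indexed = hc.createDataFrame(
      indexedRdd,
      StructType(df.schema.fields :+ StructField("index", LongType)))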

partitioning after extracting from a hive table?

2015-05-22 Thread Cesar Flores
I have a table in a Hive database partitioned by date. I notice that when I query this table using HiveContext the created data frame has a specific number of partitions. Does this partitioning correspond to my original table partitioning in Hive? Thanks -- Cesar Flores

Naming an DF aggregated column

2015-05-19 Thread Cesar Flores
on the fly, and not after performing the aggregation? thanks -- Cesar Flores
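
A sketch of naming the column at aggregation time (Spark 1.3+ DataFrame API):

    import org.apache.spark.sql.functions.sum

    // Alias the aggregate expression inline instead of renaming afterwards.
    val agg = df.groupBy("flag").agg(sum("price").as("total_price"))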

dataframe can not find fields after loading from hive

2015-04-16 Thread Cesar Flores
. Can someone tell me if I need to do some post-processing after loading from hive in order to avoid this kind of error? Thanks -- Cesar Flores

Re: dataframe can not find fields after loading from hive

2015-04-16 Thread Cesar Flores
Never mind. I found the solution: val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema) which translate to convert the data frame to rdd and back again to data frame. Not the prettiest solution, but at least it solves my problems. Thanks, Cesar Flores

ML Pipeline question about caching

2015-03-17 Thread Cesar Flores
-- Cesar Flores

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Cesar Flores
, Cesar Flores ces...@gmail.com wrote: I am new to the SchemaRDD class, and I am trying to decide between using SQL queries and Language Integrated Queries ( https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ). Can someone tell me what is the main difference

SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Cesar Flores
different syntax? Are they interchangeable? Which one has better performance? Thanks a lot -- Cesar Flores
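
For reference, a sketch of the two styles over the same SchemaRDD (the 1.2-era symbol-based DSL from the docs; both compile to equivalent logical plans, so performance should match):

    // SQL string form (requires the table to be registered):
    people.registerTempTable("people")
    val teens1 = sqlContext.sql(
      "SELECT name FROM people WHERE age >= 13 AND age <= 19")

    // Language-integrated form:
    import sqlContext._
    val teens2 = people.where('age >= 13).where('age <= 19).select('name)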

Data Frame types

2015-03-06 Thread Cesar Flores
) will be able to handle user defined classes too? Will user classes need to extend something, or will they need to define the same approach? -- Cesar Flores

Re: SchemaRDD.select

2015-02-19 Thread Cesar Flores
to hear the opinion of an expert about it. Thanks On Thu, Feb 19, 2015 at 12:01 PM, Cesar Flores ces...@gmail.com wrote: I am trying to pass a variable number of arguments to the select function of a SchemaRDD I created, as I want to select the fields in run time: val

SchemaRDD.select

2015-02-19 Thread Cesar Flores
will be a better approach for selecting the required fields in run time? Thanks in advance for your help -- Cesar Flores
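
A sketch of run-time field selection (shown with the DataFrame API that replaced SchemaRDD in 1.3; SchemaRDD's select takes expression varargs and can be expanded the same way):

    import org.apache.spark.sql.functions.col

    // Expand a list of field names known only at run time into varargs.
    val fields = Seq("name", "age") // hypothetical field list
    val projected = df.select(fields.map(col): _*)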

ML Transformer

2015-02-18 Thread Cesar Flores
is private to the ml package: private[ml] def transformSchema(schema: StructType, paramMap: ParamMap): StructType Can any user create their own transformers? If not, will this functionality be added in the future? Thanks -- Cesar Flores

GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
into that. Anyway, I look forward to a response. Best, -- Cesar Arevalo Software Engineer ❘ Zephyr Health

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
Hey, thanks for your response. And I had seen the triplets, but I'm not quite sure how the triplets would get me that V1 is connected to V4. Maybe I need to spend more time understanding it, I guess. -Cesar On Wed, Aug 20, 2014 at 10:56 AM, glxc r.ryan.mcc...@gmail.com wrote: I don't know

Re: GraphX question about graph traversal

2014-08-20 Thread Cesar Arevalo
to modify. I'll let you know how it goes. -Cesar On Wed, Aug 20, 2014 at 2:14 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-08-20 10:34:50 -0700, Cesar Arevalo ce...@zephyrhealthinc.com wrote: I would like to get the type B vertices that are connected through type A vertices where
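
One way to materialize the "B connected through A" pairs with plain triplets plus a join (a sketch; isTypeA/isTypeB stand for hypothetical predicates on the vertex attribute):

    // Edges from a type-B source into a type-A vertex, keyed by the A vertex.
    val intoA = graph.triplets
      .filter(t => isTypeB(t.srcAttr) && isTypeA(t.dstAttr))
      .map(t => (t.dstId, t.srcId))

    // Edges from a type-A vertex out to a type-B vertex, keyed the same way.
    val outOfA = graph.triplets
      .filter(t => isTypeA(t.srcAttr) && isTypeB(t.dstAttr))
      .map(t => (t.srcId, t.dstId))

    // Joining on the shared A vertex yields (B, B) pairs such as (V1, V4).
    val connectedBs = intoA.join(outOfA).values.filter { case (a, b) => a != b }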

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Cesar Arevalo
. -Cesar On Tue, Aug 19, 2014 at 2:04 PM, Yin Huai huaiyin@gmail.com wrote: Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira tracking this issue. On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo ce...@zephyrhealthinc.com wrote: Thanks, Zhan for the follow up. But, do

NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
/lib_managed/bundles/com.jolbox/bonecp/bonecp-0.7.1.RELEASE.jar:/opt/spark-poc/sbt/ivy/cache/com.datastax.cassandra/cassandra-driver-core/bundles/cassandra-driver-core-2.0.4.jar:/opt/spark-poc/lib_managed/jars/org.json/json/json-20090211.jar Can anybody help me? Best, -- Cesar Arevalo Software

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
Nope, it is NOT null. Check this out: scala> hiveContext == null res2: Boolean = false And thanks for sending that link, but I had already looked at it. Any other ideas? I looked through some of the relevant Spark Hive code and I'm starting to think this may be a bug. -Cesar On Mon, Aug 18

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
is not available. It may be completely missing from the current classpath, omitted more stacktrace for readability... Best, -Cesar On Mon, Aug 18, 2014 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Then definitely it's a jar conflict. Can you try removing this jar from the class path /opt

Re: Broadcast variable in Spark Java application

2014-07-07 Thread Cesar Arevalo
are doing wrong. I've found that following the spark programming guide online usually gives me enough information, but I guess you've already tried that. Best, -Cesar On Jul 7, 2014, at 12:41 AM, Praveen R prav...@sigmoidanalytics.com wrote: I need a variable to be broadcasted from driver
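
The core pattern, as a minimal Scala sketch (the Java API is the same shape via JavaSparkContext.broadcast):

    // Broadcast once from the driver; each executor caches a local copy
    // and tasks read it through .value.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val mapped = sc.parallelize(Seq("a", "b", "c")).map(k => lookup.value.getOrElse(k, 0))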

Re: Spark 1.0 failed on HDP 2.0 with absurd exception

2014-07-05 Thread Cesar Arevalo
.jar I didn't try this, so it may not work. Best, -Cesar On Sat, Jul 5, 2014 at 2:48 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi all, I have cluster with HDP 2.0. I built Spark 1.0 on edge node and trying to run with a command ./bin/spark-submit --class

Re: Spark Streaming on top of Cassandra?

2014-07-04 Thread Cesar Arevalo
-spark-streaming-for-high-velocity-analytics-on-cassandra Best, -Cesar On Jul 4, 2014, at 12:33 AM, zarzyk k.zarzy...@gmail.com wrote: Hi, I bump this thread as I'm also interested in the answer. Can anyone help or point to the information on how to do Spark Streaming from/to Cassandra

Anybody changed their mind about going to the Spark Summit 2014

2014-06-27 Thread Cesar Arevalo
Hi All: I was wondering if anybody had bought a ticket for the upcoming Spark Summit 2014 this coming week and had changed their mind about going. Let me know; since it has sold out and I can't buy a ticket anymore, I would be interested in buying it. Best, -- Cesar Arevalo Software Engineer