Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
… at 5:34 PM To: Andrew Davidson Cc: Andrew Melo, Bjørn Jørgensen, "user @spark" Subject: Re: How is union() implemented? Need to implement column bind Wait, how is all that related to cbind -- very different from what's needed to insert. BigQuery is unrelated to MR or Spark. It …

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
… Kind regards Andy From: Sean Owen Date: Wednesday, April 20, 2022 at 2:31 PM To: Andrew Melo Cc: Andrew Davidson, Bjørn Jørgensen, "user @spark" Subject: Re: How is union() implemented? Need to implement column bind I don't think there's fundamental disapproval (it is implemented i…

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Davidson
…it spark will work well for our need. Kind regards Andy From: Sean Owen Date: Monday, April 18, 2022 at 6:58 PM To: Andrew Davidson Cc: "user @spark" Subject: Re: How is union() implemented? Need to implement column bind A join is the natural answer, but this is a 10114-way join, which …

How is union() implemented? Need to implement column bind

2022-04-18 Thread Andrew Davidson
Hi, I have a hard problem. I have 10114 column vectors, each in a separate file. Each file has 2 columns: the row id and the numeric values. The row ids are identical and in sort order. All the column vectors have the same number of rows. There are over 5 million rows. I need to combine them into a …
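
A minimal sketch of the cbind-style approach under discussion, assuming each file is a two-column CSV (row_id, value); paths and column names are hypothetical. This is exactly the many-way join the replies warn does not scale:

    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    paths = ["vec_0001.csv", "vec_0002.csv"]  # hypothetical; 10114 files in practice

    dfs = [
        spark.read.csv(p, header=True)
             .withColumnRenamed("value", f"value_{i}")  # keep appended column names unique
        for i, p in enumerate(paths)
    ]
    # column bind via join on the shared, sorted row id; each join appends one column
    wide = reduce(lambda left, right: left.join(right, on="row_id"), dfs)

With 10,000+ inputs the resulting query plan is enormous, which is why the rest of this thread discusses alternatives to a naive join.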

Re: pivoting pandas dataframe

2022-03-15 Thread Andrew Davidson
… Tuesday, March 15, 2022 at 2:19 PM To: Andrew Davidson Cc: Mich Talebzadeh, "user @spark" Subject: Re: pivoting pandas dataframe Hi Andrew. Mitch asked, and I answered transpose(): https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.tran…
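
A minimal sketch of the pandas-on-Spark transpose() referenced above; note the linked docs warn it is expensive for large frames:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"id": ["r1", "r2"], "a": [1, 2], "b": [3, 4]})
    transposed = psdf.set_index("id").transpose()  # rows become columns, columns become rows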

Re: pivoting pandas dataframe

2022-03-15 Thread Andrew Davidson
Hi Bjørn, I have been looking for a spark transform for a while. Can you send me a link to the pyspark function? I assume pandas transform is not really an option; I think it will try to pull the entire dataframe into the driver's memory. Kind regards Andy P.S. My real problem is that spark …

Re: Does spark have something like rowsum() in R?

2022-02-09 Thread Andrew Davidson
… at 8:19 AM To: Andrew Davidson Cc: "user @spark" Subject: Re: Does spark have something like rowsum() in R? It really depends on what is running out of memory. You can have all the workers in the world, but if something is blowing up the driver, they won't do anything. You can have a hu…

Re: Does spark have something like rowsum() in R?

2022-02-09 Thread Andrew Davidson
…alternatives in R and Spark; in other languages you might more directly get the array of (numeric?) row values and sum them efficiently. Certainly pandas UDFs would make short work of that. On Tue, Feb 8, 2022 at 10:02 AM Andrew Davidson wrote: As part of my data normalization process I need to cal…

Does spark have something like rowsum() in R?

2022-02-08 Thread Andrew Davidson
As part of my data normalization process I need to calculate row sums. The following code works on smaller test data sets but does not work on my big tables: when I run on a table with over 10,000 columns I get an OOM on a cluster with 2.8 TB. Is there a better way to implement this? Kind …
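
A minimal sketch of the column-sum approach that appears verbatim in the rowSumsImpl snippet later in this archive, assuming df (hypothetical) holds only numeric columns:

    from functools import reduce
    from operator import add
    from pyspark.sql.functions import col

    # sum across every column of each row in a single expression
    row_sums = df.na.fill(0).withColumn(
        "rowSum", reduce(add, [col(c) for c in df.columns])
    )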

Does spark support something like the bind function in R?

2022-02-08 Thread Andrew Davidson
I need to create a single table by selecting one column from thousands of files. The columns are all of the same type and have the same number of rows and row names. I am currently using join. I get OOM on a mega-mem cluster with 2.8 TB. Does spark have something like cbind()? "Take a sequence of …

Re: What are your experiences using google cloud platform

2022-01-24 Thread Andrew Davidson
…/a/54283997/4586180 retDF = countsSparkDF.na.fill(0).withColumn(newColName, reduce(add, [col(x) for x in columnNames])) self.logger.warn("rowSumsImpl END\n") return retDF From: Mich Talebzadeh Date: Monday, January 24, 2022 at 12:54 AM To: Andrew Dav…

What are your experiences using google cloud platform

2022-01-23 Thread Andrew Davidson
Hi, I recently started using GCP Dataproc Spark. I seem to have trouble getting big jobs to complete. I am using checkpoints. I am wondering if maybe I should look for another cloud solution. Kind regards Andy

Re: How to configure log4j in pyspark to get log level, file name, and line number

2022-01-21 Thread Andrew Davidson
…regards Andy From: Andrew Davidson Date: Thursday, January 20, 2022 at 2:32 PM To: "user @spark" Subject: How to configure log4j in pyspark to get log level, file name, and line number Hi, when I use python logging for my unit tests, I am able to control the output format. I get the …

Is user@spark indexed by Google?

2022-01-21 Thread Andrew Davidson
There is a ton of great info in this archive. I noticed that when I do a Google search it does not seem to find results from this source. Kind regards Andy

How to configure log4j in pyspark to get log level, file name, and line number

2022-01-20 Thread Andrew Davidson
Hi, when I use python logging for my unit tests, I am able to control the output format. I get the log level, the file and line number, then the msg: [INFO testEstimatedScalingFactors.py:166 - test_B_convertCountsToInts()] BEGIN. In my spark driver I am able to get the log4j logger: spark …
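
A minimal sketch of driver-side log4j access, with two caveats: spark._jvm is a private interface that may change between versions, and the file/line log4j reports (%F:%L) is the JVM call site, not the python one:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    log4j = spark._jvm.org.apache.log4j            # private py4j handle to the JVM-side log4j
    logger = log4j.LogManager.getLogger("myDriver")
    logger.info("driver BEGIN")

    # to get level/file/line in the output, set a PatternLayout in conf/log4j.properties:
    #   %p = level, %F = file, %L = line, %m = message
    # log4j.appender.console.layout.ConversionPattern=%p %F:%L - %m%n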

java.lang.StackOverflowError: How to sum across rows in a data frame with a large number of columns

2022-01-20 Thread Andrew Davidson
Hi, I have a dataframe of integers with 10409 columns. How can I sum across each row? I get a very long stack trace: rowSums BEGIN 2022-01-20 22:11:24 ERROR __main__:? - An error occurred while calling o93935.withColumn. : java.lang.StackOverflowError at …
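
A hedged sketch of one way around the deep expression tree, assuming Spark 3.1+ (for pyspark.sql.functions.aggregate) and an all-numeric df (hypothetical): fold one flat array instead of nesting thousands of + operators:

    from pyspark.sql.functions import aggregate, array, col, lit

    # one flat array column instead of a deeply nested chain of Add expressions
    nums = array(*[col(c).cast("double") for c in df.columns])
    result = df.withColumn("rowSum", aggregate(nums, lit(0.0), lambda acc, x: acc + x))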

Re: How to add a row number column without reordering my data frame

2022-01-11 Thread Andrew Davidson
Thanks! I will take a look. Andy From: Gourav Sengupta Date: Tuesday, January 11, 2022 at 8:42 AM To: Andrew Davidson Cc: Andrew Davidson, "user @spark" Subject: Re: How to add a row number column without reordering my data frame Hi, I do not think we need to do any of that. …

Re: How to add a row number column without reordering my data frame

2022-01-11 Thread Andrew Davidson
…functions.monotonically_increasing_id.html "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive." Comments and suggestions appreciated. Andy From: Gourav Sengupta Date: Monday, January 10, 2022 at 11:03 AM To: Andrew Davidson Cc: "user @spark" Su…

How to add a row number column without reordering my data frame

2022-01-06 Thread Andrew Davidson
Hi, I am trying to work through an OOM error. I have 10411 files. I want to select a single column from each file and then join them into a single table. The files have a unique row id; however, it is a very long string. The data file with just the name and column of interest is about 470 M. The …
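
A minimal sketch of adding a compact, consecutive row number without a sort or window (df is hypothetical); zipWithIndex numbers rows in the DataFrame's current partition order:

    from pyspark.sql import Row

    # (Row, index) pairs, flattened back into rows with a new row_num field
    with_idx = (
        df.rdd.zipWithIndex()
          .map(lambda pair: Row(row_num=pair[1], **pair[0].asDict()))
          .toDF()
    )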

Re: Newbie pyspark memory mgmt question

2022-01-05 Thread Andrew Davidson
Thanks Sean. Andy From: Sean Owen Date: Wednesday, January 5, 2022 at 3:38 PM To: Andrew Davidson, Nicholas Gustafson Cc: "user @spark" Subject: Re: Newbie pyspark memory mgmt question There is no memory leak, no. You can .cache() or .persist() DataFrames, and that can use me…

Newbie pyspark memory mgmt question

2022-01-05 Thread Andrew Davidson
Hi, I am running into OOM problems. My cluster should be much bigger than I need. I wonder if it has to do with the way I am writing my code. Below are three style cases; I wonder if they cause memory to be leaked? Case 1: df1 = spark.read.load(csv file); df1 = df1.someTransform(); df1 = …
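
A minimal sketch of the point in Sean's reply above (path hypothetical): rebinding a variable does not leak anything; only an explicit cache()/persist() pins data, and unpersist() releases it:

    df = spark.read.csv("counts.csv", header=True)  # hypothetical path
    df = df.na.fill(0)   # rebinding df just replaces the plan; the old one is garbage collected
    df.cache()           # pins the computed data on the executors
    df.count()           # an action materializes the cache
    df.unpersist()       # releases the cached blocks when done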

Joining many tables Re: Pyspark debugging best practices

2022-01-03 Thread Andrew Davidson
…#rawCountsSDF.explain() self.logger.info("END\n") return retNumReadsDF From: David Diebold Date: Monday, January 3, 2022 at 12:39 AM To: Andrew Davidson, "user @spark" Subject: Re: Pyspark debugging best practices Hello Andy, Are you sure you wa…

Re: Pyspark debugging best practices

2021-12-30 Thread Andrew Davidson
… Hi Andrew, Any chance you might give Databricks a try in GCP? The above transformations look complicated to me; why are you adding dataframes to a list? Regards, Gourav Sengupta On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson …

Pyspark debugging best practices

2021-12-26 Thread Andrew Davidson
Hi, I am having trouble debugging my driver. It runs correctly on smaller data sets but fails on large ones. It is very hard to figure out what the bug is. I suspect it may have something to do with the way spark is installed and configured. I am using Google Cloud Platform Dataproc pyspark. The …

Pyspark garbage collection and cache management best practices

2021-12-26 Thread Andrew Davidson
Hi, below is typical pseudo code I find myself writing over and over again. There is only a single action at the very end of the program. The early narrow transformations potentially hold on to a lot of needless data. I have a for loop over join (i.e. a wide transformation), followed by a bunch …
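
A minimal sketch of one way to keep such a loop in check, assuming a hypothetical list dfs of frames sharing a row_id column and a writable checkpoint directory: periodic checkpoints truncate the ever-growing lineage, at the cost of writes to storage:

    spark.sparkContext.setCheckpointDir("gs://my-bucket/checkpoints")  # hypothetical path

    result = dfs[0]
    for i, df in enumerate(dfs[1:], start=1):
        result = result.join(df, on="row_id")  # wide transformation inside the loop
        if i % 100 == 0:
            result = result.checkpoint()       # materializes the result and cuts the plan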

Re: About some Spark technical help

2021-12-24 Thread Andrew Davidson
… Thanks, here's the Github repo to the code and the publication: https://github.com/SamSmithDevs10/paperReplicationForReview Kind regards On Thu, Dec 23, 2021 at 17:58, Andrew Davidson wrote: …

OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Andrew Davidson
…Date: Friday, December 24, 2021 at 8:30 AM To: Gourav Sengupta Cc: Andrew Davidson, Nicholas Gustafson, User Subject: Re: AnalysisException: Trouble using select() to append multiple columns (that's not the situation below we are commenting on) On Fri, Dec 24, 2021, 9:28 AM Gourav Sengupta mail…

Re: About some Spark technical help

2021-12-23 Thread Andrew Davidson
Hi Sam, can you tell us more? What is the algorithm? Can you send us the URL of the publication? Kind regards Andy From: sam smith Date: Wednesday, December 22, 2021 at 10:59 AM To: "user@spark.apache.org" Subject: About some Spark technical help Hello guys, I am replicating a paper's …

Re: ??? INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

2021-12-21 Thread Andrew Davidson
…view with name "rawCounts", spark3 would uncache the previous "rawCounts". Correct me if I'm wrong. Regards On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson wrote: Happy Holidays …
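
A minimal sketch consistent with the reply above (file_paths is hypothetical): the INFO line in the subject is emitted when createOrReplaceTempView() replaces an existing view of the same name:

    for path in file_paths:                        # hypothetical list of the 16,000 files
        df = spark.read.csv(path, header=True)
        df.createOrReplaceTempView("rawCounts")    # logs: Try to uncache `rawCounts` before replacing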

??? INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

2021-12-20 Thread Andrew Davidson
Happy Holidays. I am a newbie. I have 16,000 data files; all files have the same number of rows and columns. The row ids are identical and are in the same order. I want to create a new data frame that contains the 3rd column from each data file. My pyspark script runs correctly when I test on …

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-18 Thread Andrew Davidson
Thanks Nicholas. Andy From: Nicholas Gustafson Date: Friday, December 17, 2021 at 6:12 PM To: Andrew Davidson Cc: "user@spark.apache.org" Subject: Re: AnalysisException: Trouble using select() to append multiple columns Since df1 and df2 are different DataFrames, you will need to …

AnalysisException: Trouble using select() to append multiple columns

2021-12-17 Thread Andrew Davidson
Hi, I am a newbie. I have 16,000 data files; all files have the same number of rows and columns. The row ids are identical and are in the same order. I want to create a new data frame that contains the 3rd column from each data file. I wrote a test program that uses a for loop and join. It works …
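
A minimal sketch of the exception and the fix Nicholas points at above (names hypothetical): select() can only append expressions built from the same DataFrame, so a column from a second DataFrame has to arrive via a join:

    # fine: the appended expression is built from df1 itself
    out = df1.select("*", (df1["count"] * 2).alias("doubled"))

    # raises AnalysisException: df2 is a different DataFrame
    # out = df1.select("*", df2["count"])

    # instead, bring df2's column in through a join on the shared row id
    joined = df1.join(df2.withColumnRenamed("count", "count2"), on="row_id")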