RE: PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-07 Thread Gautham Acharya
() function takes a numPartitions column. What other options can I explore? --gautham -Original Message- From: ZHANG Wei Sent: Thursday, May 7, 2020 1:34 AM To: Gautham Acharya Cc: user@spark.apache.org Subject: Re: PyArrow Exception in Pandas UDF GROUPEDAGG() CAUTION: This email origina

PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-05 Thread Gautham Acharya
Hi everyone, I'm running a job that runs a Pandas UDF to GROUP BY a large matrix. The GROUP BY function runs on a wide dataset. The first column of the dataset contains string labels that are GROUPed on. The remaining columns are numeric values that are aggregated in the Pandas UDF. The dataset

[PySpark] How to write HFiles as an 'append' to the same directory?

2020-03-14 Thread Gautham Acharya
I have a process in Apache Spark that attempts to write HFiles to S3 in a batched process. I want the resulting HFiles in the same directory, as they are in the same column family. However, I'm getting a 'directory already exists' error when I try to run this on AWS EMR. How can I write HFiles v

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
OpenMP than for Spark. On Wed, Jul 17, 2019 at 3:42 PM Gautham Acharya <gauth...@alleninstitute.org> wrote: As I said in my initial message, precomputing is not an option. Retrieving only the top/bottom N most correlated is an option – would that speed up the results? Our S

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
...@dstillery.com] Sent: Wednesday, July 17, 2019 12:39 PM To: Gautham Acharya Cc: Bobby Evans ; Steven Stetzler ; user@spark.apache.org Subject: Re: [Beginner] Run compute on large matrices and return the result in seconds? CAUTION: This email originated from outside the Allen Institute. Please do

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
easily perform this computation in spark? --gautham From: Bobby Evans [mailto:reva...@gmail.com] Sent: Wednesday, July 17, 2019 7:06 AM To: Steven Stetzler Cc: Gautham Acharya ; user@spark.apache.org Subject: Re: [Beginner] Run compute on large matrices and return the result in seconds? CAUTION

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-11 Thread Gautham Acharya
Ping? I would really appreciate advice on this! Thank you! From: Gautham Acharya Sent: Tuesday, July 9, 2019 4:22 PM To: user@spark.apache.org Subject: [Beginner] Run compute on large matrices and return the result in seconds? This is my first email to this mailing list, so I apologize if I

[Beginner] Run compute on large matrices and return the result in seconds?

2019-07-09 Thread Gautham Acharya
This is my first email to this mailing list, so I apologize if I made any errors. My team's going to be building an application and I'm investigating some options for distributed compute systems. We want to perform computations on large matrices. The requirements are as follows: 1.