() function
takes a numPartitions column. What other options can I explore?
--gautham
-----Original Message-----
From: ZHANG Wei
Sent: Thursday, May 7, 2020 1:34 AM
To: Gautham Acharya
Cc: user@spark.apache.org
Subject: Re: PyArrow Exception in Pandas UDF GROUPEDAGG()
CAUTION: This email origina
Hi everyone,
I'm running a job that uses a Pandas UDF to GROUP BY a large matrix.
The GROUP BY function runs on a wide dataset. The first column of the dataset
contains string labels that are GROUPed on. The remaining columns are numeric
values that are aggregated in the Pandas UDF. The dataset
I have a process in Apache Spark that attempts to write HFiles to S3 in a
batched process. I want the resulting HFiles in the same directory, as they are
in the same column family. However, I'm getting a 'directory already exists'
error when I try to run this on AWS EMR. How can I write HFiles v
OpenMP than for Spark.
On Wed, Jul 17, 2019 at 3:42 PM Gautham Acharya
<gauth...@alleninstitute.org> wrote:
As I said in my initial message, precomputing is not an option.
Retrieving only the top/bottom N most correlated is an option – would that
speed up the results?
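
For illustration only, here is a minimal sketch of what "top/bottom N most correlated" could look like for a single target row. It is plain Python rather than Spark, and the names `pearson`, `top_bottom_n`, `matrix`, `target`, and `n` are hypothetical, not taken from the thread. Note that in this naive form every correlation is still computed; only the amount of data returned shrinks.

```python
import heapq
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant rows as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def top_bottom_n(matrix, target, n):
    """Return the n most and n least correlated rows as (corr, row_index) pairs."""
    scored = [
        (pearson(matrix[target], row), i)
        for i, row in enumerate(matrix)
        if i != target  # skip the target row itself
    ]
    # heapq keeps only n candidates at a time instead of sorting everything
    return heapq.nlargest(n, scored), heapq.nsmallest(n, scored)


# Hypothetical toy matrix: each inner list is one row of values
matrix = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [4, 3, 2, 1],
    [1, 3, 2, 4],
]
top, bottom = top_bottom_n(matrix, target=0, n=1)
print(top, bottom)
```

The heap-based selection avoids a full sort of the scores, but the per-row correlation pass remains the dominant cost in this sketch.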
Our S
...@dstillery.com]
Sent: Wednesday, July 17, 2019 12:39 PM
To: Gautham Acharya
Cc: Bobby Evans; Steven Stetzler; user@spark.apache.org
Subject: Re: [Beginner] Run compute on large matrices and return the result in
seconds?
CAUTION: This email originated from outside the Allen Institute. Please do
easily perform this computation in
spark?
--gautham
From: Bobby Evans [mailto:reva...@gmail.com]
Sent: Wednesday, July 17, 2019 7:06 AM
To: Steven Stetzler
Cc: Gautham Acharya; user@spark.apache.org
Subject: Re: [Beginner] Run compute on large matrices and return the result in
seconds?
Ping? I would really appreciate advice on this! Thank you!
From: Gautham Acharya
Sent: Tuesday, July 9, 2019 4:22 PM
To: user@spark.apache.org
Subject: [Beginner] Run compute on large matrices and return the result in
seconds?
This is my first email to this mailing list, so I apologize if I made any
errors.
My team's going to be building an application and I'm investigating some
options for distributed compute systems. We want to be performing computes on
large matrices.
The requirements are as follows:
1.