Hi, All.
Since we use both Apache JIRA and GitHub actively for Apache Spark
contributions, we have lots of JIRAs and PRs consequently. One specific
thing I've been longing to see is `Jira Issue Type` in GitHub.
How about exposing JIRA issue types on GitHub PRs as GitHub `Labels`? There
are two
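One way such a sync could work, as a minimal sketch (the `SPARK-NNNNN`-in-the-title convention is real for Spark PRs, but the mapping table and helper names here are assumptions, not an existing tool): parse the JIRA id out of the PR title, look up its issue type in JIRA, and apply the corresponding label through the GitHub API.

```python
import re

# Hypothetical mapping from JIRA issue types to GitHub label names.
ISSUE_TYPE_LABELS = {
    "Bug": "BUG",
    "Improvement": "IMPROVEMENT",
    "New Feature": "FEATURE",
}

def jira_id_from_title(title):
    """Extract a SPARK-NNNNN JIRA id from a PR title, or None."""
    match = re.search(r"\b(SPARK-\d+)\b", title)
    return match.group(1) if match else None

def label_for(issue_type):
    """Map a JIRA issue type to the GitHub label to apply (or None)."""
    return ISSUE_TYPE_LABELS.get(issue_type)

# A real sync job would fetch the issue type from the JIRA REST API and
# add the label via the GitHub issues API; that wiring is omitted here.
```

The extraction step is the only fragile part: PRs without a JIRA id in the title would simply be skipped.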
Nicholas, thank you for your explanation.
I am also interested in the example that Rishi is asking for. I am sure
mapPartitions may work, but as Vladimir suggests it may not be the best
option in terms of performance.
@Vladimir Prus , are you aware of any example about writing a "custom
Hi,
I am running the Spark DataFrame window function NTILE over a huge dataset -
it spills a lot of data while sorting and eventually fails.
The data is roughly 80 million records, about 4G in size (not sure
whether that's serialized or deserialized) - I am calculating NTILE(10) over
all these records.
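For reference, NTILE(n) just assigns each row of the ordered window to one of n near-equal buckets, with the first (rows mod n) buckets taking one extra row. A pure-Python sketch of that bucketing rule (not Spark's implementation) makes the semantics concrete:

```python
def ntile(n, num_rows):
    """Return the NTILE bucket (1..n) for each row index of an ordered window.

    The first (num_rows % n) buckets get one extra row, matching SQL NTILE.
    """
    base, extra = divmod(num_rows, n)
    buckets = []
    for b in range(1, n + 1):
        size = base + (1 if b <= extra else 0)
        buckets.extend([b] * size)
    return buckets

# 10 rows into 3 buckets: the first bucket gets the one extra row.
print(ntile(3, 10))  # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Note that an NTILE over the whole dataset with no PARTITION BY forces all rows through a single window partition (Spark even logs a warning about moving all data to a single partition), which is a common cause of exactly this kind of sort spill.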
Dear Apache Enthusiast,
(You’re receiving this message because you’re subscribed to one or more
Apache Software Foundation project user mailing lists.)
We’re thrilled to announce the schedule for our upcoming conference,
ApacheCon North America 2019, in Las Vegas, Nevada. See it now at
Hey Jean-Michel,
Looks like it's specific to YARN. As I mentioned, I am running on a
standalone cluster.
Thanks
On Tue 11 Jun 2019 at 10:50, Lourier, Jean-Michel (FIX1) <
jean-michel.lour...@porsche.de> wrote:
> Hi Patrick,
>
> I guess the easiest way is to use log aggregation:
>
Nice find!
Given you already pointed out a previous issue that fixed a similar problem,
it should also be easy for you to craft the patch and verify whether the fix
resolves your issue. Looking forward to seeing your patch.
Thanks,
Jungtaek Lim (HeartSaVioR)
On Wed, Jun 12, 2019 at 8:23 PM Gerard
Oops - linked the wrong JIRA ticket (that other one is related):
https://issues.apache.org/jira/browse/SPARK-28025
On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote:
> Hi!
> I would like to socialize this issue we are currently facing:
> The Structured Streaming default CheckpointFileManager
Hi!
I would like to socialize this issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) apply their cleanup methods.
This results in an unbounded creation of tiny
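To make the leak concrete, here is a small filesystem sketch (the directory layout is illustrative, not Spark's exact state-store layout): cleanup deletes the state file, but the hidden checksum sibling that Hadoop's ChecksumFileSystem writes alongside it is left behind.

```python
import os
import tempfile

# Illustrative checkpoint layout: one state file plus the hidden
# .crc sibling written by the local checksum filesystem.
ckpt = tempfile.mkdtemp(prefix="ckpt_demo_")
state = os.path.join(ckpt, "1.delta")
crc = os.path.join(ckpt, ".1.delta.crc")
for path in (state, crc):
    open(path, "w").close()

# Cleanup removes the state file but knows nothing about the .crc sibling.
os.remove(state)

leaked = [f for f in os.listdir(ckpt) if f.endswith(".crc")]
print(leaked)  # ['.1.delta.crc'] is left behind
```

Repeated over every state-store maintenance cycle, this is how the tiny orphaned files accumulate without bound.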
FYI, I am already using QueryExecutionListener, which satisfies the
requirements.
But that only works for the DataFrame APIs. If someone does
df.rdd().someAction(), QueryExecutionListener is never invoked. I want
something like QueryExecutionListener that works in the case of
df.rdd().someAction() too.
I
Hi,
If your data frame is partitioned by column A, and you want deduplication
by columns A, B and C, then a faster way might be to sort each partition by
A, B and C and then do a linear scan - it is often faster than a group by on
all columns, which requires a shuffle. Sadly, there's no standard way
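A sketch of the linear-scan idea in plain Python (a stand-in for the per-partition pass; function and variable names are mine): once a partition is sorted by the key columns, duplicates are adjacent, so a single pass keeps the first row of each run and no shuffle is needed.

```python
def dedup_sorted(rows, key):
    """Yield one row per distinct key from rows already sorted by that key."""
    prev = object()  # sentinel that compares unequal to any real key
    for row in rows:
        k = key(row)
        if k != prev:
            yield row
            prev = k

# Rows already sorted by (A, B, C) within the partition.
rows = [("a", 1, "x"), ("a", 1, "x"), ("a", 2, "y"), ("b", 1, "x")]
deduped = list(dedup_sorted(rows, key=lambda r: (r[0], r[1], r[2])))
print(deduped)  # [('a', 1, 'x'), ('a', 2, 'y'), ('b', 1, 'x')]
```

In Spark this would correspond to something like `sortWithinPartitions("A", "B", "C")` followed by a `mapPartitions` that applies this one-pass filter.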
Hi all,
As we know, Parquet is stored in columnar format, so filtering on a
column requires reading that column only instead of the complete record.
So creating a Dataset[Class] and doing a group by on a column, versus the
same steps on a DataFrame, performs differently. Operations on
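A toy illustration of why the layout matters (plain Python, not Parquet itself): with column-oriented storage, a group by on one column touches only that column's values, while row-oriented storage materializes every full record.

```python
from collections import Counter

# Row-oriented: each record is materialized in full.
rows = [
    {"dept": "eng", "name": "a", "salary": 100},
    {"dept": "eng", "name": "b", "salary": 120},
    {"dept": "ops", "name": "c", "salary": 90},
]

# Column-oriented: each column is a separate array, as in Parquet.
columns = {
    "dept": ["eng", "eng", "ops"],
    "name": ["a", "b", "c"],
    "salary": [100, 120, 90],
}

# Grouping by dept needs only the "dept" column in the columnar layout...
by_dept_columnar = Counter(columns["dept"])
# ...but touches whole records in the row layout.
by_dept_rows = Counter(r["dept"] for r in rows)
print(by_dept_columnar == by_dept_rows)  # same result, far less data read
```

This is also the usual explanation for the Dataset-vs-DataFrame difference: a typed lambda such as `groupByKey(_.dept)` is opaque to Catalyst, so Spark may deserialize the whole object, whereas the untyped `groupBy("dept")` lets the optimizer prune the read down to that one column.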
Hi All,
Is there any way to receive an event when a DataSourceReader is finished?
I want to do some cleanup after all the DataReaders have finished reading,
and hence need some kind of cleanUp() mechanism at the DataSourceReader
(driver) level.
How can I achieve this?
For instance, in DataSourceWriter
Hi, Sonu.
You can send an email to user-unsubscr...@spark.apache.org with the subject
"(send this email to unsubscribe)" to unsubscribe from this mailing
list [1].
Regards.
[1] https://spark.apache.org/community.html
2019-05-27 2:01 GMT+07:00, Sonu Jyotshna:
--
Warm regards,
The List Administrator
Hi,
My employer (IBM) is interested in hiring people in Hyderabad who are
committers on any Apache project and are interested in Spark and its
ecosystem.
Thanks,
Prashant.