Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Dongjoon Hyun
Hi, All. Since we actively use both Apache JIRA and GitHub for Apache Spark contributions, we consequently have lots of JIRAs and PRs. One specific thing I've long wanted to see is the `Jira Issue Type` in GitHub. How about exposing JIRA issue types on GitHub PRs as GitHub `Labels`? There are two

Re: High level explanation of dropDuplicates

2019-06-12 Thread Yeikel
Nicholas, thank you for your explanation. I am also interested in the example that Rishi is asking for. I am sure mapPartitions may work, but as Vladimir suggests, it may not be the best option in terms of performance. @Vladimir Prus, are you aware of any example about writing a "custom

Spark Dataframe NTILE function

2019-06-12 Thread Subash Prabakar
Hi, I am running the Spark DataFrame NTILE window function over a huge dataset - it spills a lot of data while sorting and eventually fails. The data size is roughly 80 million records, about 4G (not sure whether that is serialized or deserialized) - I am calculating NTILE(10) for all these records
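
A minimal sketch of the pattern in question (`df` and the `score` column are assumptions for illustration). Note that a window with `orderBy` but no `partitionBy` pulls every row into a single partition for the sort, which is a common cause of the spilling described:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.ntile

val spark = SparkSession.builder().getOrCreate()

// Hypothetical input; `score` stands in for whatever column the ranking uses.
val df = spark.read.parquet("/path/to/input")

// No partitionBy: the whole ~80M-row dataset is sorted in one partition,
// so the sort spills once it no longer fits in execution memory.
val w = Window.orderBy("score")
val ranked = df.withColumn("decile", ntile(10).over(w))
```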

ApacheCon North America 2019 Schedule Now Live!

2019-06-12 Thread Rich Bowen
Dear Apache Enthusiast, (You’re receiving this message because you’re subscribed to one or more Apache Software Foundation project user mailing lists.) We’re thrilled to announce the schedule for our upcoming conference, ApacheCon North America 2019, in Las Vegas, Nevada. See it now at

Re: Getting driver logs in Standalone Cluster

2019-06-12 Thread Tomasz Krol
Hey Jean-Michel, Looks like it's specific to YARN. As I mentioned, I am running on a standalone cluster. Thanks On Tue 11 Jun 2019 at 10:50, Lourier, Jean-Michel (FIX1) <jean-michel.lour...@porsche.de> wrote: > Hi Patrick, > > I guess the easiest way is to use log aggregation: >

Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Jungtaek Lim
Nice finding! Given that you already pointed out a previous issue that fixed a similar problem, it should also be easy for you to craft the patch and verify whether the fix resolves your issue. Looking forward to seeing your patch. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Jun 12, 2019 at 8:23 PM Gerard

Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Oops - I linked the wrong JIRA ticket (that other one is related): https://issues.apache.org/jira/browse/SPARK-28025 On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas wrote: > Hi! > I would like to socialize this issue we are currently facing: > The Structured Streaming default CheckpointFileManager

[StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Hi! I would like to socialize this issue we are currently facing: The Structured Streaming default CheckpointFileManager leaks .crc files by leaving them behind after users of this class (like HDFSBackedStateStoreProvider) apply their cleanup methods. This results in an unbounded creation of tiny
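
To illustrate the shape of the problem, a hedged sketch (not Spark's actual code) of deleting a file together with the `.crc` sidecar that Hadoop's ChecksumFileSystem writes next to it - the kind of pairing the clean-up path is missing:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged sketch, not Spark's actual code: delete a checkpoint file and the
// ".NAME.crc" sidecar that Hadoop's ChecksumFileSystem leaves next to it.
def deleteWithChecksum(fs: FileSystem, file: Path): Unit = {
  fs.delete(file, false)
  val crc = new Path(file.getParent, s".${file.getName}.crc")
  if (fs.exists(crc)) {
    fs.delete(crc, false)
  }
}
```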

Re: Clean up method for DataSourceReader

2019-06-12 Thread Shubham Chaurasia
FYI, I am already using QueryExecutionListener, which satisfies the requirements. But that only works for DataFrame APIs. If someone does df.rdd().someAction(), QueryExecutionListener is never invoked. I want something like QueryExecutionListener that also works in the df.rdd().someAction() case. I
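
For reference, a minimal sketch of the QueryExecutionListener approach mentioned above (Spark 2.x signatures); it fires for DataFrame/Dataset actions but, as noted, not for actions issued through df.rdd():

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().getOrCreate()

// Invoked on the driver after each DataFrame/Dataset action completes;
// actions routed through df.rdd() bypass this listener entirely.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    // clean-up hook, e.g. release resources held by a data source
  }
  override def onFailure(funcName: String, qe: QueryExecution, error: Exception): Unit = ()
})
```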

Re: High level explanation of dropDuplicates

2019-06-12 Thread Vladimir Prus
Hi, If your data frame is partitioned by column A, and you want deduplication by columns A, B and C, then a faster way might be to sort each partition by A, B and C and then do a linear scan - it is often faster than a group-by on all columns, which requires a shuffle. Sadly, there's no standard way
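
A hedged sketch of that approach, assuming a Dataset already partitioned by column `a`; the `Rec` type and column names are illustrative:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative record type: `a` is the partitioning column,
// the deduplication key is (a, b, c).
case class Rec(a: String, b: String, c: String, payload: String)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

def dedup(ds: Dataset[Rec]): Dataset[Rec] =
  ds.sortWithinPartitions("a", "b", "c") // local sort only, no shuffle
    .mapPartitions { rows =>
      // Linear scan: keep a row only when its key differs from the previous one.
      var prev: Option[(String, String, String)] = None
      rows.filter { r =>
        val key  = (r.a, r.b, r.c)
        val keep = !prev.contains(key)
        prev = Some(key)
        keep
      }
    }
```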

Performance difference between Dataframe and Dataset especially on parquet data.

2019-06-12 Thread Shivam Sharma
Hi all, As we know, Parquet is stored in a columnar format, so filtering on a column requires reading only that column instead of the complete record. But when we create a Dataset[Class] and do a group-by on a column, versus the same steps on a DataFrame, performance differs. Operations on
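
A small sketch of the contrast being described (the `Person` schema and path are assumptions): a relational group-by lets Catalyst push the projection down to the Parquet reader, while a typed lambda is opaque to the optimizer and forces full deserialization:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative schema and path.
case class Person(id: Long, name: String, age: Int)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val ds = spark.read.parquet("/path/to/people").as[Person]

// Relational group-by: Catalyst pushes the projection down, so the
// Parquet scan reads only the `age` column.
val relational = ds.groupBy($"age").count()

// Typed group-by with a lambda: the function is opaque to the optimizer,
// so every Person is fully deserialized and all columns are read.
val typed = ds.groupByKey(_.age).count()
```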

Clean up method for DataSourceReader

2019-06-12 Thread Shubham Chaurasia
Hi All, Is there any way to receive an event that a DataSourceReader is finished? I want to do some clean-up after all the DataReaders have finished reading, and hence need some kind of cleanUp() mechanism at the DataSourceReader (driver) level. How can I achieve this? For instance, in DataSourceWriter
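
For contrast, a sketch of the write side of DataSourceV2 as of Spark 2.4, which does expose driver-side lifecycle hooks; the reader interface has no equivalent, which is the gap being asked about (`MyWriter` is hypothetical):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataSourceWriter, DataWriterFactory, WriterCommitMessage}

// Hypothetical writer showing the driver-side hooks the reader lacks.
class MyWriter extends DataSourceWriter {
  override def createWriterFactory(): DataWriterFactory[InternalRow] = ??? // omitted

  // Called once on the driver after every DataWriter has committed -
  // a natural clean-up point. DataSourceReader has no such callback.
  override def commit(messages: Array[WriterCommitMessage]): Unit = ()
  override def abort(messages: Array[WriterCommitMessage]): Unit = ()
}
```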

Re: unsubscribe

2019-06-12 Thread B2B Web ID
Hi, Sonu. You can send an email to user-unsubscr...@spark.apache.org with the subject "(send this email to unsubscribe)" to unsubscribe from this mailing list [1]. Regards. [1] https://spark.apache.org/community.html 2019-05-27 2:01 GMT+07.00, Sonu Jyotshna : > > -- -- Warm regards, Admin

Employment opportunities.

2019-06-12 Thread Prashant Sharma
Hi, My employer (IBM) is interested in hiring people in Hyderabad who are committers on any Apache project and are interested in Spark and its ecosystem. Thanks, Prashant.