Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Dongjoon Hyun
Hi, All.

Since we actively use both Apache JIRA and GitHub for Apache Spark
contributions, we consequently have lots of JIRAs and PRs. One specific
thing I've been longing to see is the `Jira Issue Type` in GitHub.

How about exposing JIRA issue types on GitHub PRs as GitHub `Labels`? There
are two main benefits:
1. It gives contributors and reviewers more information and helps communication.
(In some cases, people only visit GitHub to see the PR and its commits.)
2. `Labels` are searchable, so we don't need to visit Apache JIRA to find PRs
of a specific type.
(For example, reviewers can see and review 'BUG' PRs first by using
`is:open is:pr label:BUG`.)

Of course, this can be done automatically without human intervention. Since
we already have a Jenkins job that accesses JIRA/GitHub, that job could add
the labels from the beginning. If needed, I can volunteer to update the
script; a rough sketch of the labeling step is below.
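
This is only an illustration, not the actual Jenkins script: the repo path,
token handling, and the "add labels to an issue" REST endpoint are assumptions
on my part (PRs share the issues namespace, so /issues/{pr}/labels should
apply to PRs as well).

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Rough sketch: attach the JIRA issue type (e.g. "BUG") as a label on a PR.
def addIssueTypeLabel(prNumber: Int, issueType: String, token: String): Int = {
  val payload = s"""{"labels":["${issueType.toUpperCase}"]}"""
  val request = HttpRequest.newBuilder()
    .uri(URI.create(s"https://api.github.com/repos/apache/spark/issues/$prNumber/labels"))
    .header("Authorization", s"token $token")
    .header("Accept", "application/vnd.github.v3+json")
    .POST(HttpRequest.BodyPublishers.ofString(payload))
    .build()
  val response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString())
  response.statusCode() // 200 indicates the label was added
}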

As a demo, I labeled several PRs manually. You can see the result right now
on the Apache Spark PR page.

  - https://github.com/apache/spark/pulls

If you were surprised by those manual changes, I apologize for that. I hope
we can take advantage of the existing GitHub features to serve the Apache
Spark community better than we do today.

What do you think about this specific suggestion?

Bests,
Dongjoon

PS. I saw that the `Request Review` and `Assign` features are already used
for some purposes, but those features are out of scope for this email.


Re: High level explanation of dropDuplicates

2019-06-12 Thread Yeikel
Nicholas, thank you for your explanation.

I am also interested in the example that Rishi is asking for. I am sure
mapPartitions may work, but as Vladimir suggests it may not be the best
option in terms of performance.

@Vladimir Prus, are you aware of any example of writing a "custom physical
exec operator"?

If anyone needs further explanation of the follow-up question Rishi posted,
please see the example below:


import org.apache.spark.sql.types._
import org.apache.spark.sql.Row


val someData = Seq(
  Row(1, 10),
  Row(1, 20),
  Row(1, 11)
)

val schema = List(
  StructField("id", IntegerType, true),
  StructField("score", IntegerType, true)
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(schema)
)

// Goal: drop duplicates using "id" as the primary key and keep the highest "score".

df.sort($"score".desc).dropDuplicates("id").show

== Physical Plan ==
*(2) HashAggregate(keys=[id#191], functions=[first(score#192, false)])
+- Exchange hashpartitioning(id#191, 200)
   +- *(1) HashAggregate(keys=[id#191], functions=[partial_first(score#192, false)])
      +- *(1) Sort [score#192 DESC NULLS LAST], true, 0
         +- Exchange rangepartitioning(score#192 DESC NULLS LAST, 200)
            +- Scan ExistingRDD[id#191,score#192]

This seems to work, but I don't know what the implications are if we use
this approach with a bigger dataset, or what the alternatives are. From the
explain output I can see the two Exchanges, so it may not be the best
approach?
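
For what it's worth, one commonly used alternative (not something from this
thread; it assumes the df and column names from the example above, plus
spark.implicits._ for the $ syntax) keeps exactly one row per "id", chosen by
the highest "score", with a single shuffle:

import org.apache.spark.sql.functions.{max, struct}

// Struct comparison is field-by-field, so max(struct(score, id)) picks the
// row with the highest score for each id.
val deduped = df
  .groupBy($"id")
  .agg(max(struct($"score", $"id")).as("best"))
  .select($"best.id".as("id"), $"best.score".as("score"))

deduped.show()
// The plan for this has a single Exchange (hashpartitioning on id).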

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark Dataframe NTILE function

2019-06-12 Thread Subash Prabakar
Hi,

I am running the Spark DataFrame NTILE function over a huge dataset - it
spills a lot of data while sorting and eventually fails.

The data is roughly 80 million records, about 4G in size (not sure whether
that is serialized or deserialized) - I am calculating NTILE(10) for all
these records, ordered by one metric.

A few stats are below. I need help finding alternatives - has anyone
benchmarked the highest load this function can handle?

The snapshot below shows the NTILE calculation for two columns separately -
each runs until the complete data ends up in one final partition. In other
words, the window function moves everything into a single partition to
compute NTILE - which is 80M records in my case.

[image: Screen Shot 2019-06-13 at 12.51.56 AM.jpg]

Executor memory is 8G, with a shuffle/storage memory fraction of 0.8, so
roughly 5.5G is usable.

In the stage-level metrics I saw the 80M records - it shows as below:

*Shuffle read size / records*
[image: Screen Shot 2019-06-13 at 12.51.33 AM.jpg]

Is there any alternative, or is it not feasible to perform this operation
with Spark SQL functions?
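
One workaround I am considering - just a sketch, not benchmarked; it assumes
the input DataFrame is called df and the ordering column is called "metric" -
is to approximate NTILE(10) by computing decile boundaries with approxQuantile
and bucketing rows against them, which avoids funneling all 80M rows through a
single window partition. The buckets are only approximately equal-sized,
unlike exact NTILE:

import org.apache.spark.sql.functions.{col, lit, when}

// Decile boundaries at the 10th..90th percentiles, with a small relative error.
val probs = (1 to 9).map(_ / 10.0).toArray
val cuts = df.stat.approxQuantile("metric", probs, 0.001)

// Nested CASE WHEN: the first boundary the value does not exceed determines
// the decile; anything above the 90th percentile falls into bucket 10.
val decileCol = cuts.zipWithIndex.foldRight(lit(10)) {
  case ((cut, i), acc) => when(col("metric") <= cut, lit(i + 1)).otherwise(acc)
}

val withDecile = df.withColumn("decile", decileCol)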

Thanks,
Subash


ApacheCon North America 2019 Schedule Now Live!

2019-06-12 Thread Rich Bowen

Dear Apache Enthusiast,

(You’re receiving this message because you’re subscribed to one or more 
Apache Software Foundation project user mailing lists.)


We’re thrilled to announce the schedule for our upcoming conference, 
ApacheCon North America 2019, in Las Vegas, Nevada. See it now at 
https://www.apachecon.com/acna19/schedule.html  The event will be held 
September 9th through 12th at the Flamingo Hotel.  Register today at 
https://www.apachecon.com/acna19/register.html


Our schedule features keynotes by James Gosling, the father of Java; 
Samaira Mehta, founder and CEO of the Billion Kids Can Code project; and 
David Brin, noted science fiction author and futurist. And we’ll have a 
discussion panel featuring some of the founders of The Apache Software 
Foundation, talking about the past as well as their vision for the future.


ApacheCon is the flagship convention of the ASF, and features tracks 
curated by many of our project communities: Apache Drill, Apache Karaf, 
Big Data, TomcatCon, Apache Cloudstack, Integration, Apache Cassandra, 
Streaming, Geospatial software, Graph processing, Internet of Things, 
Community, Machine Learning, Apache Traffic Control, Apache Beam, 
Observability, OFBiz, and Mobile app development.


The Hackathon, which will run all day, every day, is the place to meet 
your project community, and get some serious work knocked out in a short 
focused time. The BarCamp is the place to discuss the topics that are 
important to you, with your colleagues, in an unconference format.


We offer financial assistance for travel and lodging for those who want 
to come to ApacheCon but are unable to afford it. Apply at 
http://apache.org/travel/ by June 21st to be considered for this.


If you’re unable to make it to North America, we’ll also be running 
ApacheCon Europe in Berlin in October. Details of that event are at 
https://aceu19.apachecon.com/


Follow us on Twitter - @ApacheCon - for all the latest updates. See you 
in Las Vegas!


Rich Bowen, VP Conferences, The ASF
rbo...@apache.org
http://apachecon.com/


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Getting driver logs in Standalone Cluster

2019-06-12 Thread Tomasz Krol
Hey Jean-Michel,

Looks like that is specific to YARN. As I mentioned, I am running on a
standalone cluster.

Thanks

On Tue 11 Jun 2019 at 10:50, Lourier, Jean-Michel (FIX1) <
jean-michel.lour...@porsche.de> wrote:

> Hi Patrick,
>
> I guess the easiest way is to use log aggregation:
> https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application
>
> BR
>
> Jean-Michel
>
>
> -----Original Message-----
> From: tkrol
> Sent: Friday, 7 June 2019 16:22
> To: user@spark.apache.org
> Subject: Getting driver logs in Standalone Cluster
>
> Hey Guys,
>
> I am wondering what the best way is to get driver logs in cluster mode on a
> standalone cluster? Normally I used to run in client mode, so I could
> capture the logs from the console.
>
> Now I've started running jobs in cluster mode, and obviously the driver
> runs on a worker, so I can't see the logs.
>
> I would like to store the logs (preferably in HDFS) - is there any easy way
> to do that?
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Tomasz Krol
patric...@gmail.com


Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Jungtaek Lim
Nice finding!

Given that you already pointed to a previous issue which fixed a similar
problem, it should also be easy for you to craft the patch and verify whether
the fix resolves your issue. Looking forward to seeing your patch.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Wed, Jun 12, 2019 at 8:23 PM Gerard Maas  wrote:

> Ooops - linked the wrong JIRA ticket:  (that other one is related)
>
> https://issues.apache.org/jira/browse/SPARK-28025
>
> On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas  wrote:
>
>> Hi!
>> I would like to socialize this issue we are currently facing:
>> The Structured Streaming default CheckpointFileManager leaks .crc files
>> by leaving them behind after users of this class (like
>> HDFSBackedStateStoreProvider) apply their cleanup methods.
>>
>> This results in an unbounded creation of tiny files that eat away storage
>> by the block and, in our case, deteriorates the file system performance.
>>
>> We correlated the processedRowsPerSecond reported by the
>> StreamingQueryProgress against a count of the .crc files in the storage
>> directory (checkpoint + state store). The performance impact we observe is
>> dramatic.
>>
>> We are running on Kubernetes, using GlusterFS as the shared storage
>> provider.
>> [image: out processedRowsPerSecond vs. files in storage_process.png]
>> I have created a JIRA ticket with additional detail:
>>
>> https://issues.apache.org/jira/browse/SPARK-17475
>>
>> This is also related to an earlier discussion about the state store
>> unbounded disk-size growth, which was left unresolved back then:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html
>>
>> If there's any additional detail I should add/research, please let me
>> know.
>>
>> kind regards, Gerard.
>>
>>
>>

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior


Re: [StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Ooops - linked the wrong JIRA ticket:  (that other one is related)

https://issues.apache.org/jira/browse/SPARK-28025

On Wed, Jun 12, 2019 at 1:21 PM Gerard Maas  wrote:

> Hi!
> I would like to socialize this issue we are currently facing:
> The Structured Streaming default CheckpointFileManager leaks .crc files by
> leaving them behind after users of this class (like
> HDFSBackedStateStoreProvider) apply their cleanup methods.
>
> This results in an unbounded creation of tiny files that eat away storage
> by the block and, in our case, deteriorates the file system performance.
>
> We correlated the processedRowsPerSecond reported by the
> StreamingQueryProgress against a count of the .crc files in the storage
> directory (checkpoint + state store). The performance impact we observe is
> dramatic.
>
> We are running on Kubernetes, using GlusterFS as the shared storage
> provider.
> [image: out processedRowsPerSecond vs. files in storage_process.png]
> I have created a JIRA ticket with additional detail:
>
> https://issues.apache.org/jira/browse/SPARK-17475
>
> This is also related to an earlier discussion about the state store
> unbounded disk-size growth, which was left unresolved back then:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html
>
> If there's any additional detail I should add/research, please let me know.
>
> kind regards, Gerard.
>
>
>


[StructuredStreaming] HDFSBackedStateStoreProvider is leaking .crc files.

2019-06-12 Thread Gerard Maas
Hi!
I would like to socialize this issue we are currently facing:
The Structured Streaming default CheckpointFileManager leaks .crc files by
leaving them behind after users of this class (like
HDFSBackedStateStoreProvider) apply their cleanup methods.

This results in an unbounded creation of tiny files that eat away storage
by the block and, in our case, deteriorates the file system performance.
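
As a possible stop-gap until the CheckpointFileManager itself is fixed - this
is only a sketch, assuming it is safe to delete a .crc file whose data file is
already gone - a periodic sweep over the checkpoint directory with the Hadoop
FileSystem API could look like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Delete checksum files whose data file no longer exists, e.g. ".1.delta.crc"
// left behind after "1.delta" was removed by the state store maintenance task.
def sweepOrphanedCrcFiles(checkpointRoot: String): Unit = {
  val root = new Path(checkpointRoot)
  val fs = root.getFileSystem(new Configuration())
  val files = fs.listFiles(root, true) // recursive listing
  while (files.hasNext) {
    val p = files.next().getPath
    if (p.getName.startsWith(".") && p.getName.endsWith(".crc")) {
      val dataName = p.getName.stripPrefix(".").stripSuffix(".crc")
      if (!fs.exists(new Path(p.getParent, dataName))) {
        fs.delete(p, false)
      }
    }
  }
}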

We correlated the processedRowsPerSecond reported by the
StreamingQueryProgress against a count of the .crc files in the storage
directory (checkpoint + state store). The performance impact we observe is
dramatic.

We are running on Kubernetes, using GlusterFS as the shared storage
provider.
[image: out processedRowsPerSecond vs. files in storage_process.png]
I have created a JIRA ticket with additional detail:

https://issues.apache.org/jira/browse/SPARK-17475

This is also related to an earlier discussion about the state store
unbounded disk-size growth, which was left unresolved back then:
http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-State-Store-storage-behavior-for-the-Stream-Deduplication-function-td34883.html

If there's any additional detail I should add/research, please let me know.

kind regards, Gerard.


Re: Clean up method for DataSourceReader

2019-06-12 Thread Shubham Chaurasia
FYI, I am already using QueryExecutionListener, which satisfies the
requirements.

But that only works for the DataFrame APIs. If someone does
df.rdd().someAction(), QueryExecutionListener is never invoked. I want
something like QueryExecutionListener that also works in the
df.rdd().someAction() case.
I explored SparkListener#onJobEnd, but then how do I propagate state from
the DataSourceReader to the SparkListener?
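
One pattern I am considering - purely a sketch with made-up names, and it does
not yet solve the job-to-reader mapping - is to let the DataSourceReader
register its cleanup callback in a shared registry and have a SparkListener
drain it on job end:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
import scala.collection.concurrent.TrieMap

// Hypothetical shared registry: the DataSourceReader registers a cleanup
// callback when it is created, keyed by something it knows (e.g. a scan id).
object ReaderCleanupRegistry {
  private val callbacks = TrieMap.empty[String, () => Unit]
  def register(id: String)(f: () => Unit): Unit = callbacks.put(id, f)
  def drainAll(): Unit = {
    callbacks.values.foreach(cb => cb())
    callbacks.clear()
  }
}

class ReaderCleanupListener extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    // Caveat: this fires for every job, so mapping a job back to the reader
    // that produced it is still the open question.
    ReaderCleanupRegistry.drainAll()
  }
}

// Registration, e.g. during session setup:
// spark.sparkContext.addSparkListener(new ReaderCleanupListener())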

On Wed, Jun 12, 2019 at 2:22 PM Shubham Chaurasia 
wrote:

> Hi All,
>
> Is there any way to receive some event that a DataSourceReader is
> finished?
> I want to do some clean up after all the DataReaders are finished reading
> and hence need some kind of cleanUp() mechanism at DataSourceReader(Driver)
> level.
>
> How to achieve this?
>
> For instance, in DataSourceWriter we can rely on commit() and abort()
> methods to know that all the DataWriters are finished.
>
> Thanks,
> Shubham
>


Re: High level explanation of dropDuplicates

2019-06-12 Thread Vladimir Prus
Hi,

If your data frame is partitioned by column A, and you want deduplication
by columns A, B and C, then a faster way might be to sort each partition by
A, B and C and then do a linear scan - it is often faster than a group by on
all columns, which requires a shuffle. Sadly, there's no standard way to do
it.

One way to do it is via mapPartitions, but that involves serialisation
to/from Row. The best way is to write a custom physical exec operator, but
it's not entirely trivial.
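
A minimal sketch of the mapPartitions route, with a made-up Record case class
and an assumed Dataset[Record] named ds (sorted within partitions by A, B, C;
consecutive duplicates are then dropped in one linear scan):

case class Record(a: Int, b: Int, c: Int)

import spark.implicits._

val deduped = ds
  // .repartition($"a")  // only needed if the data is not already partitioned by a
  .sortWithinPartitions($"a", $"b", $"c")
  .mapPartitions { rows =>
    // Linear scan: keep a row only if its key differs from the previous row's key.
    var prev: Option[(Int, Int, Int)] = None
    rows.filter { r =>
      val key = (r.a, r.b, r.c)
      val keep = !prev.contains(key)
      prev = Some(key)
      keep
    }
  }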

On Mon, 10 Jun 2019, 06:00 Rishi Shah,  wrote:

> Hi All,
>
> Just wanted to check back regarding the best way to perform deduplication.
> Is using dropDuplicates the optimal way to get rid of duplicates? Would it
> be better if we ran operations on the RDD directly?
>
> Also, what about keeping the last value of the group while performing
> deduplication (based on some sorting criteria)?
>
> Thanks,
> Rishi
>
> On Mon, May 20, 2019 at 3:33 PM Nicholas Hakobian <
> nicholas.hakob...@rallyhealth.com> wrote:
>
>> From doing some searching around in the spark codebase, I found the
>> following:
>>
>>
>> https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474
>>
>> So it appears there is no direct operation called dropDuplicates or
>> Deduplicate, but there is an optimizer rule that converts this logical
>> operation to a physical operation that is equivalent to grouping by all the
>> columns you want to deduplicate across (or all columns if you are doing
>> something like distinct), and taking the First() value. So (using a pySpark
>> code example):
>>
>> df = input_df.dropDuplicates(['col1', 'col2'])
>>
>> Is effectively shorthand for saying something like:
>>
>> df = input_df.groupBy('col1',
>> 'col2').agg(first(struct(input_df.columns)).alias('data')).select('data.*')
>>
>> Except I assume that it has some internal optimization so it doesn't need
>> to pack/unpack the column data, and just returns the whole Row.
>>
>> Nicholas Szandor Hakobian, Ph.D.
>> Principal Data Scientist
>> Rally Health
>> nicholas.hakob...@rallyhealth.com
>>
>>
>>
>> On Mon, May 20, 2019 at 11:38 AM Yeikel  wrote:
>>
>>> Hi ,
>>>
>>> I am looking for a high level explanation(overview) on how
>>> dropDuplicates[1]
>>> works.
>>>
>>> [1]
>>>
>>> https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>>>
>>> Could someone please explain?
>>>
>>> Thank you
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Regards,
>
> Rishi Shah
>


Performance difference between Dataframe and Dataset especially on parquet data.

2019-06-12 Thread Shivam Sharma
Hi all,

As we know, Parquet is stored in a columnar format, so filtering on a column
should require reading only that column instead of the complete record.

However, when we create a Dataset[Class] and do a groupBy on a column,
versus the same steps on a DataFrame, the two perform differently:
operations on the Dataset cause OOM issues with the same execution
parameters.
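
For reference, a minimal illustration of the two code paths (the Event case
class and path are placeholders): the typed groupByKey takes a lambda that
Catalyst cannot see into, so the whole object is deserialized per row, while
the untyped groupBy lets the Parquet reader prune down to the grouping column.

case class Event(id: Long, category: String, payload: String)

import spark.implicits._

val ds = spark.read.parquet("/path/to/events").as[Event] // placeholder path

// Typed API: the lambda is opaque to Catalyst, so the whole Event object is
// deserialized for each row before grouping - no Parquet column pruning.
val typedCounts = ds.groupByKey(_.category).count()

// Untyped API: Catalyst sees that only `category` is needed and prunes the
// remaining columns at the Parquet reader, typically using far less memory.
val untypedCounts = ds.toDF().groupBy("category").count()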

Thanks

-- 
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
Jabalpur
Email:- 28shivamsha...@gmail.com
LinkedIn: https://www.linkedin.com/in/28shivamsharma


Clean up method for DataSourceReader

2019-06-12 Thread Shubham Chaurasia
Hi All,

Is there any way to receive some event that a DataSourceReader is finished?
I want to do some clean up after all the DataReaders are finished reading
and hence need some kind of cleanUp() mechanism at DataSourceReader(Driver)
level.

How to achieve this?

For instance, in DataSourceWriter we can rely on commit() and abort()
methods to know that all the DataWriters are finished.

Thanks,
Shubham


Re: unsubscribe

2019-06-12 Thread B2B Web ID
Hi, Sonu.
You can send an email to user-unsubscr...@spark.apache.org with the subject
"(send this email to unsubscribe)" to unsubscribe from this mailing
list[1].

Regards.

[1] https://spark.apache.org/community.html


2019-05-27 2:01 GMT+07.00, Sonu Jyotshna :
>
>


-- 

-- 
Warm regards,
B2B.Web.ID Administrator


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Employment opportunities.

2019-06-12 Thread Prashant Sharma
Hi,

My employer (IBM) is interested in hiring people in Hyderabad who are
committers on any Apache project and are interested in Spark and its
ecosystem.

Thanks,
Prashant.