Re: Resolving all JIRAs affecting EOL releases

2019-05-19 Thread Hyukjin Kwon
Thanks, Shane. The URL I linked somehow didn't work in other people's
browsers. I hope this link works:

https://issues.apache.org/jira/browse/SPARK-23492?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w

I will take action around this time tomorrow, since there were some more
changes to make at the last minute.
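
For anyone who wants to regenerate a link like this instead of hand-encoding
it, here is a minimal Python sketch. It assumes the filter is passed in the
`jql` query parameter of the issue navigator, as in the earlier link in this
thread. The helper name `jira_search_url` is made up for illustration, and the
JQL text is the filter including the "updated" condition.

```python
from urllib.parse import quote, unquote

# The filter discussed in this thread, including the "updated" condition.
JQL = """project = SPARK
  AND status in (Open, "In Progress", Reopened)
  AND (
    affectedVersion = EMPTY OR
    NOT (affectedVersion in versionMatch("^3.*")
      OR affectedVersion in versionMatch("^2.4.*")
      OR affectedVersion in versionMatch("^2.3.*")
    )
  )
  AND updated <= -52w"""


def jira_search_url(jql, base="https://issues.apache.org/jira/issues/"):
    """Build a search link by percent-encoding the JQL text.

    Hypothetical helper: keeping "(", ")" and "*" unescaped is a choice made
    here only to mirror the style of the links pasted in this thread.
    """
    return base + "?jql=" + quote(jql, safe="()*")


if __name__ == "__main__":
    url = jira_search_url(JQL)
    print(url)
    # Decoding the query value gives back the original JQL text.
    print(unquote(url.split("?jql=", 1)[1]) == JQL)
```

Generating the link from the JQL text keeps the two from drifting apart when
the filter is edited at the last minute.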



Re: Access to live data of cached dataFrame

2019-05-19 Thread Tomas Bartalos
I'm trying to re-read, but I still get the cached data (which is a bit
confusing). To re-read, I'm issuing:
spark.read.format("delta").load("/data").groupBy(col("event_hour")).count

The cache seems to be global and also affects new DataFrames.

So the question is: how should I re-read without losing the cached data
(without using unpersist)?

As I mentioned, with SQL it's possible: I can create a cached view, so when I
access the original table I get live data, and when I access the view I get
cached data.

BR,
Tomas

On Fri, 17 May 2019, 8:57 pm Sean Owen wrote:

> A cached DataFrame isn't supposed to change, by definition.
> You can re-read each time or consider setting up a streaming source on
> the table which provides a result that updates as new data comes in.
>
> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote:
> >
> > Hello,
> >
> > I have a cached dataframe:
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
> >
> > I would like to access the "live" data for this data frame without
> > deleting the cache (using unpersist()). Whatever I do I always get the
> > cached data on subsequent queries. Even adding a new column to the query
> > doesn't help:
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
> >
> > I'm able to work around this using a cached SQL view, but I couldn't find a
> > pure DataFrame solution.
> >
> > Thank you,
> > Tomas
>


Object serialization for workers

2019-05-19 Thread R. Tyler Croy

Greetings! I am looking into the possibility of JRuby support for Spark, and
could use some pointers (references?) to orient myself a bit better within the
codebase.

JRuby fat jars load just fine in Spark but where things start to get
predictably dicey is with object serialization for RDDs getting sent to the
workers.

Having worked on something similar for Apache Storm
(https://github.com/jruby-gradle/redstorm), what we ended up doing was shimming
some classes to handle Ruby object/class serialization properly.

I'm expecting to do something similar in Spark but I'm not entirely sure which
interfaces/classes describe the serialization of RDDs. I'm figuring that I'll
need to implement a Ruby equivalent of the org.apache.spark.api.java.function
namespaces, but am not entirely sure where the pieces come together to turn
those into serialized objects.


Appreciate any direction you all might be able to share. In the meantime, I've
got my miner's cap on and am presently digging through core/ :)



Cheers

--
GitHub:  https://github.com/rtyler

GPG Key ID: 0F2298A980EE31ACCA0A7825E5C92681BEF6CEA2



Re: Resolving all JIRAs affecting EOL releases

2019-05-19 Thread Hyukjin Kwon
I will add one more condition on "updated", so the filter additionally skips
issues that were updated within the last year but are still open against EOL
releases.

project = SPARK
  AND status in (Open, "In Progress", Reopened)
  AND (
    affectedVersion = EMPTY OR
    NOT (affectedVersion in versionMatch("^3.*")
      OR affectedVersion in versionMatch("^2.4.*")
      OR affectedVersion in versionMatch("^2.3.*")
    )
  )
  AND updated <= -52w

https://issues.apache.org/jira/issues/?filter=12344168=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w

This still brings the number of open JIRAs under 1000, which was my original
target.
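
To double check what the filter selects, its logic can be approximated in
plain Python. This is only a sketch of one reading of the JQL: it assumes
`versionMatch` does a regex match on version names, that `-52w` means 52 weeks
before now, and that an issue listing any still-supported version is kept. The
function and constant names are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Statuses the filter treats as still open.
OPEN_STATUSES = {"Open", "In Progress", "Reopened"}

# Version patterns copied from the JQL; anything else is treated as EOL here.
SUPPORTED_PATTERNS = [re.compile(p) for p in (r"^3.*", r"^2.4.*", r"^2.3.*")]


def is_bulk_close_candidate(status, affected_versions, updated, now):
    """Return True if an issue would be selected under this reading of the JQL."""
    if status not in OPEN_STATUSES:
        return False
    # "updated <= -52w": last update is at least 52 weeks before `now`.
    if updated > now - timedelta(weeks=52):
        return False
    # "affectedVersion = EMPTY": no affected version recorded at all.
    if not affected_versions:
        return True
    # "NOT (... OR ...)": none of the listed versions is a supported one.
    return not any(
        pattern.search(version)
        for pattern in SUPPORTED_PATTERNS
        for version in affected_versions
    )


if __name__ == "__main__":
    now = datetime(2019, 5, 19)
    old = datetime(2017, 1, 1)
    print(is_bulk_close_candidate("Open", ["2.1.0"], old, now))
    print(is_bulk_close_candidate("Open", ["2.1.0"], datetime(2019, 5, 17), now))
```

Under this reading, an issue like SPARK-27758 (reported against 2.1.0 but
opened only days ago) is skipped because of the "updated" condition.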




Re: Resolving all JIRAs affecting EOL releases

2019-05-19 Thread Sean Owen
I'd only tweak this to perhaps not close JIRAs that have been updated
recently -- even just avoiding things updated in the last month. For
example this would close
https://issues.apache.org/jira/browse/SPARK-27758 which
was opened Friday (though, for other reasons it should probably be closed).
Still I don't mind it under the logic that it has been reported against
2.1.0.

On the other hand, I'd go further and close _anything_ not updated in a
long time, like a year (or 2 if feeling conservative). That is, there's
probably a lot of old cruft out there that wasn't marked with an Affected
Version, before that was required.

On Sat, May 18, 2019 at 10:48 PM Hyukjin Kwon wrote:

> Thanks guys.
>
> This thread got more than 3 PMC votes without any objection. I slightly
> edited JQL from Abdeali's suggestion (thanks, Abdeali).
>
>
> JQL:
>
> project = SPARK
>   AND status in (Open, "In Progress", Reopened)
>   AND (
>     affectedVersion = EMPTY OR
>     NOT (affectedVersion in versionMatch("^3.*")
>       OR affectedVersion in versionMatch("^2.4.*")
>       OR affectedVersion in versionMatch("^2.3.*")
>     )
>   )
>
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)
>
>
> This means we will resolve all JIRAs that have EOL releases as affected
> versions, including those with no affected version specified. This will
> bring the number of open JIRAs under 900.
>
> It looks like I can use the bulk action feature in JIRA. Tomorrow at about
> the same time, I will
> - Label those JIRAs as 'bulk-closed'
> - Resolve them via `Incomplete` status.
>
> Please double check the list and let me know if you have any concerns.
>
>
>
>
>
> On Sat, May 18, 2019 at 12:22 PM Dongjoon Hyun wrote:
>
>> +1, too.
>>
>> Thank you, Hyukjin!
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Fri, May 17, 2019 at 9:07 AM Imran Rashid wrote:
>>
>>> +1, thanks for taking this on
>>>
>>> On Wed, May 15, 2019 at 7:26 PM Hyukjin Kwon wrote:
>>>
>>>> Oh, wait. 'Incomplete' can still make sense in this way then.
>>>> Yes, I am good with 'Incomplete' too.
>>>>
>>>> On Thu, May 16, 2019 at 11:24 AM Hyukjin Kwon wrote:
>>>>
>>>>> I actually used 'Incomplete' a bit recently when a JIRA was
>>>>> basically too poorly formed (like just copying and pasting an error) ...
>>>>>
>>>>> I was thinking about the 'Unresolved' or 'Auto Closed' status too. I
>>>>> double-checked that they can be reopened after resolution as well.
>>>>>
>>>>> [image: Screen Shot 2019-05-16 at 10.35.14 AM.png]
>>>>> [image: Screen Shot 2019-05-16 at 10.35.39 AM.png]
>>>>>
>>>>> On Thu, May 16, 2019 at 11:04 AM Sean Owen wrote:
>>>>>
>>>>>> Agree, anything without an Affected Version should be old enough to
>>>>>> time out.
>>>>>> I might use "Incomplete" or something as the status, as we haven't
>>>>>> otherwise used that. Maybe that's simpler than a label. But, anything
>>>>>> like that sounds good.
>>>>>>
>>>>>> On Wed, May 15, 2019 at 8:40 PM Hyukjin Kwon wrote:
>>>>>>
>>>>>>> BTW, affected version became a required field (I don't remember
>>>>>>> exactly when; I believe it was around when we worked on Spark 2.3):
>>>>>>>
>>>>>>> [image: Screen Shot 2019-05-16 at 10.29.50 AM.png]
>>>>>>>
>>>>>>> So, including all EOL versions and unspecified affected versions
>>>>>>> will roughly work.
>>>>>>> Using "Cannot Reproduce" as the status and a 'bulk-closed' label makes
>>>>>>> the most sense to me.
>>>>>>>
>>>>>>> Okay. I want to leave this open for roughly a week before taking any
>>>>>>> actual action. If there's no more feedback, I will do as I said above
>>>>>>> next week.
>>>>>>>
>>>>>>> On Wed, May 15, 2019 at 11:33 PM Josh Rosen wrote:
>>>>>>>
>>>>>>>> +1 in favor of some sort of JIRA cleanup.
>>>>>>>>
>>>>>>>> My only request is that we attach some sort of 'bulk-closed' label
>>>>>>>> to issues that we close via JIRA filter batch operations (and resolve
>>>>>>>> the issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a
>>>>>>>> label makes it easier to audit what was closed, simplifying the process
>>>>>>>> of identifying and re-opening valid issues caught in our dragnet.
>>>>>>>>
>>>>>>>> On Wed, May 15, 2019 at 7:19 AM Sean Owen wrote:
>>>>>>>>
>>>>>>>>> I gave up looking through JIRAs a long time ago, so, big respect
>>>>>>>>> for continuing to try to triage them. I am afraid we're missing a few
>>>>>>>>> important bug reports in the torrent, but most JIRAs are not
>>>>>>>>> well-formed, just questions, stale, or simply things that won't be
>>>>>>>>> added. I do think it's important to reflect that reality, and so
>>>>>>>>> I'm always in favor of