Re: Resolving all JIRAs affecting EOL releases
Thanks Shane. The URL I linked somehow didn't work in other people's browsers. I hope this link works:

https://issues.apache.org/jira/browse/SPARK-23492?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w

I will take action around this time tomorrow, since there were some more last-minute changes to make.

On Sun, May 19, 2019 at 6:39 PM Hyukjin Kwon wrote:

> I will add one more condition for "updated". So, it will additionally
> avoid things updated within one year but left open against EOL releases.
>
> project = SPARK
>   AND status in (Open, "In Progress", Reopened)
>   AND (
>     affectedVersion = EMPTY OR
>     NOT (affectedVersion in versionMatch("^3.*")
>       OR affectedVersion in versionMatch("^2.4.*")
>       OR affectedVersion in versionMatch("^2.3.*")
>     )
>   )
>   AND updated <= -52w
>
> https://issues.apache.org/jira/issues/?filter=12344168=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w
>
> This still reduces JIRAs under 1000 which I originally targeted.
>
> On Sun, May 19, 2019 at 6:08 PM Sean Owen wrote:
>
>> I'd only tweak this to perhaps not close JIRAs that have been updated
>> recently -- even just avoiding things updated in the last month.
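
Editorial sketch: the percent-encoded search links in this thread can be derived mechanically from the JQL text, which makes it easier to check that a shared link really carries the intended query. The snippet below is a minimal Python illustration using only the standard library. The helper name `jira_search_url` is hypothetical, and the choice to leave `(`, `)` and `*` unescaped is an assumption made so the output resembles the links pasted above. One possible reason the earlier link failed is that it carries the query after `filter=12344168=` rather than in a `jql=` parameter, but that is a guess, not something confirmed in the thread.

```python
from urllib.parse import quote, unquote

# Base search URL as used by the working links in this thread.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/?jql="

# JQL text copied from the message above (layout restored from the encoded link).
JQL = """project = SPARK
  AND status in (Open, "In Progress", Reopened)
  AND (
    affectedVersion = EMPTY OR
    NOT (affectedVersion in versionMatch("^3.*")
      OR affectedVersion in versionMatch("^2.4.*")
      OR affectedVersion in versionMatch("^2.3.*")
    )
  )
  AND updated <= -52w"""


def jira_search_url(jql: str) -> str:
    """Hypothetical helper: percent-encode a JQL string into a search link.

    "(", ")" and "*" are kept literal (an assumption, to resemble the links
    in the thread); spaces, quotes, commas, "=", "<", "^" and newlines are
    percent-encoded.
    """
    return JIRA_SEARCH + quote(jql, safe="()*")


url = jira_search_url(JQL)
print(url)

# Decoding the query part gives back the original JQL text.
print(unquote(url[len(JIRA_SEARCH):]) == JQL)
```

Decoding a received link with `unquote` in the same way is a quick check of what query it actually contains before running a bulk action against it.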
Re: Access to live data of cached dataFrame
I'm trying to re-read, but I'm still getting the cached data (which is a bit confusing). To re-read, I'm issuing:

spark.read.format("delta").load("/data").groupBy(col("event_hour")).count

The cache seems to be global, also influencing new DataFrames.

So the question is: how should I re-read without losing the cached data (without using unpersist)?

As I mentioned, with SQL it's possible: I can create a cached view, so when I access the original table I get live data, and when I access the view I get cached data.

BR,
Tomas

On Fri, 17 May 2019, 8:57 pm Sean Owen wrote:

> A cached DataFrame isn't supposed to change, by definition.
> You can re-read each time or consider setting up a streaming source on
> the table which provides a result that updates as new data comes in.
>
> On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote:
> >
> > Hello,
> >
> > I have a cached dataframe:
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
> >
> > I would like to access the "live" data for this data frame without
> > deleting the cache (using unpersist()). Whatever I do I always get the
> > cached data on subsequent queries. Even adding new column to the query
> > doesn't help:
> >
> > spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy", lit("dummy"))
> >
> > I'm able to workaround this using cached sql view, but I couldn't find a
> > pure dataFrame solution.
> >
> > Thank you,
> > Tomas
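
Editorial sketch: the behaviour described above is consistent with a cache that is looked up by query plan rather than by DataFrame object, so a freshly built query with an equivalent plan (or one that merely adds a column on top of it) is still answered from the cached result. The toy model below is plain Python, not Spark code. `ToySession` and the tuple-based "plans" are hypothetical names invented for illustration, and the model makes no claim about how Spark implements this internally.

```python
# Toy model only: plans are nested tuples, not Spark's real logical plans.
class ToySession:
    def __init__(self, source):
        self.source = source  # mutable "table": a list of event_hour values
        self.cache = {}       # plan -> materialized rows

    def run(self, plan):
        # A cached plan (or cached sub-plan) is served from the cache,
        # regardless of which query object asked for it.
        if plan in self.cache:
            return self.cache[plan]
        op = plan[0]
        if op == "read":
            return list(self.source)
        if op == "count_by_hour":
            counts = {}
            for hour in self.run(plan[1]):
                counts[hour] = counts.get(hour, 0) + 1
            return sorted(counts.items())
        if op == "with_column":
            return [row + (plan[2],) for row in self.run(plan[1])]
        raise ValueError(f"unknown operator: {op}")

    def cache_plan(self, plan):
        self.cache[plan] = self.run(plan)


table = [10, 10, 11]
session = ToySession(table)
agg = ("count_by_hour", ("read", "/data"))
session.cache_plan(agg)

table.append(11)  # new data arrives after caching

fresh_attempt = ("count_by_hour", ("read", "/data"))  # a "new" query, same plan
with_dummy = ("with_column", fresh_attempt, "dummy")  # extra column on top

print(session.run(fresh_attempt))      # [(10, 2), (11, 1)] -> stale
print(session.run(with_dummy))         # [(10, 2, 'dummy'), (11, 1, 'dummy')] -> stale
print(session.run(("read", "/data")))  # [10, 10, 11, 11] -> live
```

In this model only a plan that does not contain the cached sub-plan sees the new row, which mirrors why adding a dummy column did not help in the report above.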
Object serialization for workers
Greetings! I am looking into the possibility of JRuby support for Spark, and could use some pointers (references?) to orient myself a bit better within the codebase.

JRuby fat jars load just fine in Spark, but where things start to get predictably dicey is with object serialization for RDDs getting sent to the workers.

Having worked on something similar for Apache Storm (https://github.com/jruby-gradle/redstorm), what we ended up doing was shimming some classes to handle Ruby object/class serialization properly. I'm expecting to do something similar in Spark, but I'm not entirely sure which interfaces/classes describe the serialization of RDDs. I'm figuring that I'll need to implement a Ruby equivalent of the org.apache.spark.api.java.function namespaces, but I'm not entirely sure where the pieces come together to turn those into serialized objects.

Appreciate any direction you all might be able to share. In the meantime, I've got my miner's cap on and am presently digging through core/ :)

Cheers

--
GitHub: https://github.com/rtyler
GPG Key ID: 0F2298A980EE31ACCA0A7825E5C92681BEF6CEA2
Re: Resolving all JIRAs affecting EOL releases
I will add one more condition for "updated", so the query will additionally skip issues that were updated within the last year but are still open against EOL releases.

project = SPARK
  AND status in (Open, "In Progress", Reopened)
  AND (
    affectedVersion = EMPTY OR
    NOT (affectedVersion in versionMatch("^3.*")
      OR affectedVersion in versionMatch("^2.4.*")
      OR affectedVersion in versionMatch("^2.3.*")
    )
  )
  AND updated <= -52w

https://issues.apache.org/jira/issues/?filter=12344168=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w

This still brings the number of open JIRAs under 1000, which was my original target.

On Sun, May 19, 2019 at 6:08 PM Sean Owen wrote:

> I'd only tweak this to perhaps not close JIRAs that have been updated
> recently -- even just avoiding things updated in the last month. For
> example this would close https://issues.apache.org/jira/browse/SPARK-27758
> which was opened Friday (though, for other reasons it should probably be
> closed). Still I don't mind it under the logic that it has been reported
> against 2.1.0.
>
> On the other hand, I'd go further and close _anything_ not updated in a
> long time, like a year (or 2 if feeling conservative). That is there's
> probably a lot of old cruft out there that wasn't marked with an Affected
> Version, before that was required.
>
> On Sat, May 18, 2019 at 10:48 PM Hyukjin Kwon wrote:
>
>> Thanks guys.
>>
>> This thread got more than 3 PMC votes without any objection. I slightly
>> edited JQL from Abdeali's suggestion (thanks, Abdeali).
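
Editorial sketch: the selection rule discussed in this thread (open status, no affected version on a supported line, not updated for 52 weeks, then label and resolve) can be written out as a small predicate to sanity-check which issues a bulk action would touch. This is a plain-Python reading of the JQL, not a call into any JIRA API. The `Issue` record, the `DEMO-*` keys, the sample dates and the `"Resolved"` status value are all made up for illustration; only the version patterns, the 'bulk-closed' label and the `Incomplete` resolution come from the thread.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Version patterns copied from the JQL in this thread.
SUPPORTED = [re.compile(p) for p in (r"^3.*", r"^2.4.*", r"^2.3.*")]
OPEN_STATUSES = {"Open", "In Progress", "Reopened"}


@dataclass
class Issue:  # hypothetical record, not a JIRA API type
    key: str
    status: str
    affected_versions: list
    updated: datetime
    labels: set = field(default_factory=set)
    resolution: str = ""


def should_bulk_close(issue, now):
    if issue.status not in OPEN_STATUSES:
        return False
    if issue.updated > now - timedelta(weeks=52):
        return False  # mirrors "updated <= -52w"
    # True when no affected version is listed, or when none of the listed
    # versions matches a supported release line.
    return not any(p.match(v) for v in issue.affected_versions for p in SUPPORTED)


def bulk_close(issues, now):
    closed = []
    for issue in issues:
        if should_bulk_close(issue, now):
            issue.labels.add("bulk-closed")  # label proposed in the thread
            issue.status = "Resolved"        # assumed status name
            issue.resolution = "Incomplete"  # resolution proposed in the thread
            closed.append(issue.key)
    return closed


now = datetime(2019, 5, 19)
issues = [
    Issue("DEMO-1", "Open", ["2.1.0"], datetime(2017, 3, 1)),      # EOL and stale
    Issue("DEMO-2", "Open", ["2.1.0"], datetime(2019, 5, 17)),     # EOL but recent
    Issue("DEMO-3", "Open", [], datetime(2016, 1, 10)),            # no version, stale
    Issue("DEMO-4", "Reopened", ["2.4.3"], datetime(2017, 6, 1)),  # supported line
    Issue("DEMO-5", "Resolved", ["1.6.0"], datetime(2015, 2, 2)),  # already resolved
]
closed = bulk_close(issues, now)
print(closed)  # ['DEMO-1', 'DEMO-3']
```

Keeping the label separate from the resolution, as suggested in the thread, means the closed set can later be found again by label alone when auditing or re-opening issues.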
Re: Resolving all JIRAs affecting EOL releases
I'd only tweak this to perhaps not close JIRAs that have been updated recently -- even just avoiding things updated in the last month. For example this would close https://issues.apache.org/jira/browse/SPARK-27758 which was opened Friday (though, for other reasons it should probably be closed). Still I don't mind it under the logic that it has been reported against 2.1.0.

On the other hand, I'd go further and close _anything_ not updated in a long time, like a year (or 2 if feeling conservative). That is there's probably a lot of old cruft out there that wasn't marked with an Affected Version, before that was required.

On Sat, May 18, 2019 at 10:48 PM Hyukjin Kwon wrote:

> Thanks guys.
>
> This thread got more than 3 PMC votes without any objection. I slightly
> edited JQL from Abdeali's suggestion (thanks, Abdeali).
>
> JQL:
>
> project = SPARK
>   AND status in (Open, "In Progress", Reopened)
>   AND (
>     affectedVersion = EMPTY OR
>     NOT (affectedVersion in versionMatch("^3.*")
>       OR affectedVersion in versionMatch("^2.4.*")
>       OR affectedVersion in versionMatch("^2.3.*")
>     )
>   )
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)
>
> It means we will resolve all JIRAs that have EOL releases as affected
> versions, including no version specified in affected versions - this will
> reduce open JIRAs under 900.
>
> Looks I can use a bulk action feature in JIRA. Tomorrow at the similar
> time, I will
> - Label those JIRAs as 'bulk-closed'
> - Resolve them via `Incomplete` status.
>
> Please double check the list and let me know if you guys have any concern.
>
> On Sat, May 18, 2019 at 12:22 PM Dongjoon Hyun wrote:
>
>> +1, too.
>>
>> Thank you, Hyukjin!
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, May 17, 2019 at 9:07 AM Imran Rashid wrote:
>>
>>> +1, thanks for taking this on
>>>
>>> On Wed, May 15, 2019 at 7:26 PM Hyukjin Kwon wrote:
>>>
>>>> oh, wait. 'Incomplete' can still make sense in this way then.
>>>> Yes, I am good with 'Incomplete' too.
>>>>
>>>> On Thu, May 16, 2019 at 11:24 AM Hyukjin Kwon wrote:
>>>>
>>>>> I actually recently used 'Incomplete' a bit when the JIRA is
>>>>> basically too poorly formed (like just copying and pasting an error) ...
>>>>>
>>>>> I was thinking about 'Unresolved' status or `Auto Closed' too. I
>>>>> double checked they can be reopen as well after resolution.
>>>>>
>>>>> [image: Screen Shot 2019-05-16 at 10.35.14 AM.png]
>>>>> [image: Screen Shot 2019-05-16 at 10.35.39 AM.png]
>>>>>
>>>>> On Thu, May 16, 2019 at 11:04 AM Sean Owen wrote:
>>>>>
>>>>>> Agree, anything without an Affected Version should be old enough to
>>>>>> time out.
>>>>>> I might use "Incomplete" or something as the status, as we haven't
>>>>>> otherwise used that. Maybe that's simpler than a label. But, anything
>>>>>> like that sounds good.
>>>>>>
>>>>>> On Wed, May 15, 2019 at 8:40 PM Hyukjin Kwon wrote:
>>>>>>
>>>>>>> BTW, affected version became a required field (I don't remember when
>>>>>>> exactly was .. I believe it's around when we work on Spark 2.3):
>>>>>>>
>>>>>>> [image: Screen Shot 2019-05-16 at 10.29.50 AM.png]
>>>>>>>
>>>>>>> So, including all EOL versions and affected versions not specified
>>>>>>> will roughly work.
>>>>>>> Using "Cannot Reproduce" as its status and 'bulk-closed' label makes
>>>>>>> the best sense to me.
>>>>>>>
>>>>>>> Okie. I want to open this roughly for a week before taking an actual
>>>>>>> action for this. If there's no more feedback, I will do as I said ^ next
>>>>>>> week.
>>>>>>>
>>>>>>> On Wed, May 15, 2019 at 11:33 PM Josh Rosen wrote:
>>>>>>>
>>>>>>>> +1 in favor of some sort of JIRA cleanup.
>>>>>>>> My only request is that we attach some sort of 'bulk-closed' label to
>>>>>>>> issues that we close via JIRA filter batch operations (and resolve the
>>>>>>>> issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label
>>>>>>>> makes it easier to audit what was closed, simplifying the process of
>>>>>>>> identifying and re-opening valid issues caught in our dragnet.
>>>>>>>>
>>>>>>>> On Wed, May 15, 2019 at 7:19 AM Sean Owen wrote:
>>>>>>>>
>>>>>>>>> I gave up looking through JIRAs a long time ago, so, big respect for
>>>>>>>>> continuing to try to triage them. I am afraid we're missing a few
>>>>>>>>> important bug reports in the torrent, but most JIRAs are not
>>>>>>>>> well-formed, just questions, stale, or simply things that won't be
>>>>>>>>> added. I do think it's important to reflect that reality, and so I'm
>>>>>>>>> always in favor of