Re: Resolving all JIRAs affecting EOL releases

2019-05-17 Thread Dongjoon Hyun
+1, too.

Thank you, Hyukjin!

Bests,
Dongjoon.


On Fri, May 17, 2019 at 9:07 AM Imran Rashid 
wrote:

> +1, thanks for taking this on
>
> On Wed, May 15, 2019 at 7:26 PM Hyukjin Kwon  wrote:
>
>> oh, wait. 'Incomplete' can still make sense in this way then.
>> Yes, I am good with 'Incomplete' too.
>>
>> On Thu, May 16, 2019 at 11:24 AM Hyukjin Kwon wrote:
>>
>>> I actually recently used 'Incomplete' a bit when the JIRA is basically
>>> too poorly formed (e.g., just a copied-and-pasted error) ...
>>>
>>> I was also thinking about the 'Unresolved' status or 'Auto Closed'. I
>>> double-checked that they can be reopened after resolution as well.
>>>
>>> [image: Screen Shot 2019-05-16 at 10.35.14 AM.png]
>>> [image: Screen Shot 2019-05-16 at 10.35.39 AM.png]
>>>
>>> On Thu, May 16, 2019 at 11:04 AM Sean Owen wrote:
>>>
 Agree, anything without an Affected Version should be old enough to
 time out.
 I might use "Incomplete" or something as the status, as we haven't
 otherwise used that. Maybe that's simpler than a label. But, anything like
 that sounds good.

 On Wed, May 15, 2019 at 8:40 PM Hyukjin Kwon 
 wrote:

> BTW, affected version became a required field (I don't remember exactly
> when .. I believe it was around when we were working on Spark 2.3):
>
> [image: Screen Shot 2019-05-16 at 10.29.50 AM.png]
>
> So, targeting all EOL versions plus JIRAs with no affected version
> specified should roughly work.
> Using "Cannot Reproduce" as the resolution and a 'bulk-closed' label makes
> the most sense to me.
>
> Okie. I want to leave this open for roughly a week before taking any
> actual action. If there's no more feedback, I will do as I said ^ next
> week.
>
>
> On Wed, May 15, 2019 at 11:33 PM Josh Rosen wrote:
>
>> +1 in favor of some sort of JIRA cleanup.
>>
>> My only request is that we attach some sort of 'bulk-closed' label to
>> issues that we close via JIRA filter batch operations (and resolve the
>> issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label
>> makes it easier to audit what was closed, simplifying the process of
>> identifying and re-opening valid issues caught in our dragnet.
>>
>>
>> On Wed, May 15, 2019 at 7:19 AM Sean Owen  wrote:
>>
>>> I gave up looking through JIRAs a long time ago, so, big respect for
>>> continuing to try to triage them. I am afraid we're missing a few
>>> important bug reports in the torrent, but most JIRAs are not
>>> well-formed, just questions, stale, or simply things that won't be
>>> added. I do think it's important to reflect that reality, and so I'm
>>> always in favor of more aggressively closing JIRAs. I think it's
>>> more standard practice, in projects like TensorFlow/Keras, pandas,
>>> etc., to just automatically drop Issues that don't see activity for N
>>> days. We won't do that, but we are probably, on the other hand, far too
>>> lax in closing them.
>>>
>>> Remember that JIRAs stay searchable and can be reopened, so it's not
>>> like we lose much information.
>>>
>>> I'd close anything that hasn't had activity in 2 years (?), as a
>>> start.
>>> I like the idea of closing things that only affect an EOL release,
>>> but many items aren't marked, so we may need to cast the net wider.
>>>
>>> I think only then does it make sense to look at bothering to
>>> reproduce
>>> or evaluate the 1000s that will still remain.
>>>
>>> On Wed, May 15, 2019 at 4:25 AM Hyukjin Kwon 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I would like to propose resolving all JIRAs that affect EOL releases
>>> > (2.2 and below) or have no affected version specified. I was rather
>>> > against this approach and considered it a last resort when we discussed
>>> > it roughly 3 years ago. Now I think we should go ahead with it. See
>>> > below.
>>> >
>>> > I have been taking care of this almost every day for those 3 years. The
>>> > number of JIRAs keeps increasing and never goes down; it is now over
>>> > 2500. Did you know that in JIRA we can only page through up to 1000
>>> > items? So we currently have difficulty even going through every JIRA;
>>> > we have to manually filter and check each one. The number is growing
>>> > beyond a manageable size.
>>> >
>>> > I am not suggesting this without actually having tried anything. This
>>> > is what we have tried, within my visibility:
>>> >
>>> >   1. Roughly 3 years ago, Sean tried to gather committers and even
>>> > non-committers to sort out this number. At that time, we were only able
>>> > to keep the number flat. After we lost that momentum, it kept
>>> > increasing again.
>>> >   2. At least I scanned _all_ the previous 

Re: [build system] short downtime for 2 ubuntu workers

2019-05-17 Thread shane knapp
actually, amp-jenkins-staging-worker-01 is seriously unhappy and just
crashed.  we will investigate more on monday.

:(

shane

On Fri, May 17, 2019 at 3:19 PM shane knapp  wrote:

> all workers are now up, online and ready to build!
>
> On Fri, May 17, 2019 at 2:55 PM shane knapp  wrote:
>
>> amp-jenkins-staging-worker-02 and ubuntu-testing are back up.
>>
>> -01 is being a little reluctant to boot and we're investigating.
>>
>> On Fri, May 17, 2019 at 2:08 PM shane knapp  wrote:
>>
>>> machines are down, gpus are about to go in.  i expect these workers to be
>>> back up and building in ~30min.
>>>
>>> On Fri, May 17, 2019 at 1:47 PM shane knapp  wrote:
>>>
 we're installing some new GPUs for builds to use for tests...  the
 following workers will be offline for the next couple of hours:

 amp-jenkins-staging-worker-01
 amp-jenkins-staging-worker-02

 the ubuntu-testing worker will also be down, but that only impacts one
 build.

 the GPUs will be used for both lab and spark integration tests, but
 will NOT BE READY for the next couple of weeks.

 i repeat:  even though there will be GPUs, they will not be ready for
 use yet.  ;)

 shane
 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] short downtime for 2 ubuntu workers

2019-05-17 Thread shane knapp
all workers are now up, online and ready to build!

On Fri, May 17, 2019 at 2:55 PM shane knapp  wrote:

> amp-jenkins-staging-worker-02 and ubuntu-testing are back up.
>
> -01 is being a little reluctant to boot and we're investigating.
>
> On Fri, May 17, 2019 at 2:08 PM shane knapp  wrote:
>
>> machines are down, gpus are about to go in.  i expect these workers to be
>> back up and building in ~30min.
>>
>> On Fri, May 17, 2019 at 1:47 PM shane knapp  wrote:
>>
>>> we're installing some new GPUs for builds to use for tests...  the
>>> following workers will be offline for the next couple of hours:
>>>
>>> amp-jenkins-staging-worker-01
>>> amp-jenkins-staging-worker-02
>>>
>>> the ubuntu-testing worker will also be down, but that only impacts one
>>> build.
>>>
>>> the GPUs will be used for both lab and spark integration tests, but will
>>> NOT BE READY for the next couple of weeks.
>>>
>>> i repeat:  even though there will be GPUs, they will not be ready for
>>> use yet.  ;)
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] short downtime for 2 ubuntu workers

2019-05-17 Thread shane knapp
amp-jenkins-staging-worker-02 and ubuntu-testing are back up.

-01 is being a little reluctant to boot and we're investigating.

On Fri, May 17, 2019 at 2:08 PM shane knapp  wrote:

> machines are down, gpus are about to go in.  i expect these workers to be
> back up and building in ~30min.
>
> On Fri, May 17, 2019 at 1:47 PM shane knapp  wrote:
>
>> we're installing some new GPUs for builds to use for tests...  the
>> following workers will be offline for the next couple of hours:
>>
>> amp-jenkins-staging-worker-01
>> amp-jenkins-staging-worker-02
>>
>> the ubuntu-testing worker will also be down, but that only impacts one
>> build.
>>
>> the GPUs will be used for both lab and spark integration tests, but will
>> NOT BE READY for the next couple of weeks.
>>
>> i repeat:  even though there will be GPUs, they will not be ready for use
>> yet.  ;)
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] short downtime for 2 ubuntu workers

2019-05-17 Thread shane knapp
machines are down, gpus are about to go in.  i expect these workers to be
back up and building in ~30min.

On Fri, May 17, 2019 at 1:47 PM shane knapp  wrote:

> we're installing some new GPUs for builds to use for tests...  the
> following workers will be offline for the next couple of hours:
>
> amp-jenkins-staging-worker-01
> amp-jenkins-staging-worker-02
>
> the ubuntu-testing worker will also be down, but that only impacts one
> build.
>
> the GPUs will be used for both lab and spark integration tests, but will
> NOT BE READY for the next couple of weeks.
>
> i repeat:  even though there will be GPUs, they will not be ready for use
> yet.  ;)
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] short downtime for 2 ubuntu workers

2019-05-17 Thread shane knapp
we're installing some new GPUs for builds to use for tests...  the
following workers will be offline for the next couple of hours:

amp-jenkins-staging-worker-01
amp-jenkins-staging-worker-02

the ubuntu-testing worker will also be down, but that only impacts one
build.

the GPUs will be used for both lab and spark integration tests, but will
NOT BE READY for the next couple of weeks.

i repeat:  even though there will be GPUs, they will not be ready for use
yet.  ;)

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Access to live data of cached dataFrame

2019-05-17 Thread Sean Owen
A cached DataFrame isn't supposed to change, by definition.
You can re-read each time or consider setting up a streaming source on
the table which provides a result that updates as new data comes in.
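As a rough illustration of the streaming option (a sketch only; it assumes the
Delta Lake package is available and that the Delta table at /data supports
streaming reads, and the console sink is just for demonstration):

import org.apache.spark.sql.functions.col

// Read the same Delta path as a stream instead of as a static, cached DataFrame.
val liveCounts = spark.readStream
  .format("delta")
  .load("/data")
  .groupBy(col("event_hour"))
  .count()

// "complete" output mode keeps a continuously updated count per event_hour.
liveCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()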

On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos  wrote:
>
> Hello,
>
> I have a cached dataframe:
>
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache
>
> I would like to access the "live" data for this data frame without deleting
> the cache (using unpersist()). Whatever I do, I always get the cached data on
> subsequent queries. Even adding a new column to the query doesn't help:
>
> spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy",
>  lit("dummy"))
>
>
> I'm able to work around this using a cached SQL view, but I couldn't find a
> pure DataFrame solution.
>
> Thank you,
> Tomas

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Access to live data of cached dataFrame

2019-05-17 Thread Tomas Bartalos
Hello,

I have a cached dataframe:

spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache

I would like to access the "live" data for this data frame without deleting
the cache (using unpersist()). Whatever I do, I always get the cached data
on subsequent queries. Even adding a new column to the query doesn't help:

spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.withColumn("dummy",
lit("dummy"))


I'm able to work around this using a cached SQL view, but I couldn't find a
pure DataFrame solution.

Thank you,
Tomas


Re: Resolving all JIRAs affecting EOL releases

2019-05-17 Thread Imran Rashid
+1, thanks for taking this on

On Wed, May 15, 2019 at 7:26 PM Hyukjin Kwon  wrote:

> oh, wait. 'Incomplete' can still make sense in this way then.
> Yes, I am good with 'Incomplete' too.
>
> On Thu, May 16, 2019 at 11:24 AM Hyukjin Kwon wrote:
>
>> I actually recently used 'Incomplete' a bit when the JIRA is basically
>> too poorly formed (e.g., just a copied-and-pasted error) ...
>>
>> I was also thinking about the 'Unresolved' status or 'Auto Closed'. I
>> double-checked that they can be reopened after resolution as well.
>>
>> [image: Screen Shot 2019-05-16 at 10.35.14 AM.png]
>> [image: Screen Shot 2019-05-16 at 10.35.39 AM.png]
>>
>> On Thu, May 16, 2019 at 11:04 AM Sean Owen wrote:
>>
>>> Agree, anything without an Affected Version should be old enough to time
>>> out.
>>> I might use "Incomplete" or something as the status, as we haven't
>>> otherwise used that. Maybe that's simpler than a label. But, anything like
>>> that sounds good.
>>>
>>> On Wed, May 15, 2019 at 8:40 PM Hyukjin Kwon 
>>> wrote:
>>>
 BTW, affected version became a required field (I don't remember exactly
 when .. I believe it was around when we were working on Spark 2.3):

 [image: Screen Shot 2019-05-16 at 10.29.50 AM.png]

 So, targeting all EOL versions plus JIRAs with no affected version
 specified should roughly work (the kind of filter sketched below).
 Using "Cannot Reproduce" as the resolution and a 'bulk-closed' label makes
 the most sense to me.
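(For concreteness, the filter being discussed would presumably look something
like the JQL sketch below. The version list is only a placeholder, and the
"Cannot Reproduce" resolution plus the 'bulk-closed' label would be applied via
JIRA's bulk-change tool rather than the query itself.)

project = SPARK AND resolution = Unresolved
  AND (affectedVersion in (<EOL versions, 2.2.x and below>) OR affectedVersion is EMPTY)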

 Okie. I want to leave this open for roughly a week before taking any
 actual action. If there's no more feedback, I will do as I said ^ next
 week.


 On Wed, May 15, 2019 at 11:33 PM Josh Rosen wrote:

> +1 in favor of some sort of JIRA cleanup.
>
> My only request is that we attach some sort of 'bulk-closed' label to
> issues that we close via JIRA filter batch operations (and resolve the
> issues as "Timed Out" / "Cannot Reproduce", not "Fixed"). Using a label
> makes it easier to audit what was closed, simplifying the process of
> identifying and re-opening valid issues caught in our dragnet.
>
>
> On Wed, May 15, 2019 at 7:19 AM Sean Owen  wrote:
>
>> I gave up looking through JIRAs a long time ago, so, big respect for
>> continuing to try to triage them. I am afraid we're missing a few
>> important bug reports in the torrent, but most JIRAs are not
>> well-formed, just questions, stale, or simply things that won't be
>> added. I do think it's important to reflect that reality, and so I'm
>> always in favor of more aggressively closing JIRAs. I think it's
>> more standard practice, in projects like TensorFlow/Keras, pandas,
>> etc., to just automatically drop Issues that don't see activity for N
>> days. We won't do that, but we are probably, on the other hand, far too
>> lax in closing them.
>>
>> Remember that JIRAs stay searchable and can be reopened, so it's not
>> like we lose much information.
>>
>> I'd close anything that hasn't had activity in 2 years (?), as a
>> start.
>> I like the idea of closing things that only affect an EOL release,
>> but many items aren't marked, so we may need to cast the net wider.
>>
>> I think only then does it make sense to look at bothering to reproduce
>> or evaluate the 1000s that will still remain.
>>
>> On Wed, May 15, 2019 at 4:25 AM Hyukjin Kwon 
>> wrote:
>> >
>> > Hi all,
>> >
>> > I would like to propose resolving all JIRAs that affect EOL releases
>> > (2.2 and below) or have no affected version specified. I was rather
>> > against this approach and considered it a last resort when we discussed
>> > it roughly 3 years ago. Now I think we should go ahead with it. See
>> > below.
>> >
>> > I have been taking care of this almost every day for those 3 years. The
>> > number of JIRAs keeps increasing and never goes down; it is now over
>> > 2500. Did you know that in JIRA we can only page through up to 1000
>> > items? So we currently have difficulty even going through every JIRA;
>> > we have to manually filter and check each one. The number is growing
>> > beyond a manageable size.
>> >
>> > I am not suggesting this without actually having tried anything. This
>> > is what we have tried, within my visibility:
>> >
>> >   1. Roughly 3 years ago, Sean tried to gather committers and even
>> > non-committers to sort out this number. At that time, we were only able
>> > to keep the number flat. After we lost that momentum, it kept
>> > increasing again.
>> >   2. At least I scanned _all_ the previous JIRAs more than twice and
>> > resolved them, roughly once a year. The rest are mostly obsolete, but
>> > there is not enough information to investigate them further.
>> >   3. I strictly stick to 

Out Of Memory while reading a table partition from HIVE

2019-05-17 Thread Shivam Sharma
Hi All,

I am getting an Out Of Memory error due to GC overhead while reading a table
from Hive in Spark, like this:

spark.sql("SELECT * FROM some.table WHERE date='2019-05-14' LIMIT 10").show()


So when I run the above command in spark-shell, it starts processing *1780
tasks* and goes OOM at a specific partition.

1. The table partition (*date='2019-05-14'*) has *4000* files on HDFS, so
ideally 4000 partitions should be created in the Spark DataFrame, if I am not
wrong. I analyzed the table and it actually has *1780* partitions in total
(i.e., 1780 date folders).

2. I checked the size of the files in the table partition (*date='2019-05-14'*);
the max file size is *1.1 GB* and I have given *7GB* to each executor, so if my
reasoning above is right, it should not throw OOM.

3. And when I put *LIMIT 10*, does spark-hive still read all the files? (See
the sketch below.)
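One way to sanity-check points 1 and 3 is to inspect the plan and the number of
scan partitions before collecting anything. A minimal sketch, assuming the same
spark-shell session and the table/partition from the query above:

// Build the DataFrame first, then inspect it before triggering the scan.
val df = spark.sql("SELECT * FROM some.table WHERE date='2019-05-14'")

// The physical plan shows whether the date filter is pushed down, i.e. whether
// only the date='2019-05-14' folder should be scanned.
df.explain()

// Number of scan partitions (and therefore tasks) Spark plans for this query.
println(df.rdd.getNumPartitions)

// Apply the limit on the DataFrame side and fetch a few rows.
df.limit(10).show()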

Thanks

-- 
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing
Jabalpur
Email:- 28shivamsha...@gmail.com
LinkedIn:- https://www.linkedin.com/in/28shivamsharma