What's the root cause of not supporting multiple aggregations in structured streaming?

2019-05-20 Thread 张万新
Hi there,

I'd like to know the root reason why multiple aggregations on a
streaming DataFrame are not allowed, since it's a very useful feature and
Flink has supported it for a long time.

Thanks.


Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2019-05-20 Thread Gabor Somogyi
There is a PR for this, but it is not yet merged.



Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2019-05-20 Thread Arun Mahadevan
Here's the proposal for supporting it in "append" mode:
https://github.com/apache/spark/pull/23576. You could see if it addresses
your requirement and post your feedback on the PR.
For "update" mode it's going to be much harder to support this without first
adding support for "retractions"; otherwise we would end up with wrong
results.
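The need for retractions can be illustrated with a toy example in plain Python (not Spark; all names here are illustrative). The first aggregation counts events per user; the second counts users per count value. In update mode, the first aggregation re-emits only the new row for a key, so unless the superseded row is retracted, the downstream aggregation double-counts:

```python
# Toy illustration (plain Python, not Spark) of why chained aggregations
# in "update" mode need retractions.
from collections import defaultdict

events = ["a", "a", "b"]  # user "a" appears twice, user "b" once

counts = defaultdict(int)                   # first aggregation: events per user
naive_users_per_count = defaultdict(int)    # second aggregation, no retractions
correct_users_per_count = defaultdict(int)  # second aggregation, with retractions

for user in events:
    old = counts[user]
    counts[user] += 1
    new = counts[user]
    # Update-mode output of the first aggregation is just the new row (user, new).
    naive_users_per_count[new] += 1        # old row (user, old) is never retracted
    if old > 0:
        correct_users_per_count[old] -= 1  # retraction of the superseded row
    correct_users_per_count[new] += 1

# Truth after "a","a","b": one user with count 1 ("b"), one with count 2 ("a").
print(dict(naive_users_per_count))    # {1: 2, 2: 1}  -- wrong: "a" still counted under 1
print(dict(correct_users_per_count))  # {1: 1, 2: 1}  -- right, thanks to the retraction
```

Propagating something like the `old > 0` branch above through a streaming plan is exactly the retraction machinery that update mode currently lacks.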

- Arun




Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-20 Thread Michael Heuer
Hello,

Which Hadoop version or versions are compatible with Spark 2.4.3 and Scala 2.12?

The binary distribution spark-2.4.3-bin-without-hadoop-scala-2.12.tgz is 
missing avro-1.8.2.jar, so when attempting to run with Hadoop 2.7.7 there are 
classpath conflicts at runtime, as Hadoop 2.7.7 includes avro-1.7.4.jar.

https://issues.apache.org/jira/browse/SPARK-27781 
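One way to catch this class of problem early is to scan the directories that end up on the classpath for the same artifact at more than one version. A small sketch in plain Python (the function name and directory layout are illustrative, not part of Spark or Hadoop):

```python
# Sketch: detect the same artifact appearing at different versions across
# classpath directories (e.g. Spark's jars/ vs Hadoop's share/ tree).
import os
import re

def find_version_conflicts(jar_dirs, artifact="avro"):
    """Return {version: [paths]} when more than one version of `artifact`
    is present across the given directories, else an empty dict."""
    # Matches e.g. "avro-1.8.2.jar" but not "avro-mapred-1.8.2-hadoop2.jar".
    pat = re.compile(r"^%s-(\d[\w.]*)\.jar$" % re.escape(artifact))
    found = {}
    for d in jar_dirs:
        for name in sorted(os.listdir(d)):
            m = pat.match(name)
            if m:
                found.setdefault(m.group(1), []).append(os.path.join(d, name))
    return found if len(found) > 1 else {}
```

With avro-1.7.4.jar on the Hadoop side and avro-1.8.2.jar on the Spark side, this would report both versions as a conflict.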


   michael

Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-20 Thread Koert Kuipers
we run it without issues on hadoop 2.6 - 2.8, off the top of my head.

we however do some post-processing on the tarball:
1) we fix the ownership of the files inside the tar.gz file (should be
uid/gid 0/0, otherwise untarring by root can lead to ownership by an unknown
user).
2) we add avro-1.8.2.jar and jline-2.14.6.jar to the jars folder. i believe
these jars being missing from the hadoop-provided profile is simply a mistake.
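The two post-processing steps above can be sketched with Python's tarfile module, as an illustrative stand-in for GNU tar's --owner/--group flags plus a file copy (the function name and paths are made up, not the actual build tooling):

```python
# Sketch: rewrite a .tgz so every member is owned by uid/gid 0/0, and drop
# extra jars into the top-level jars/ directory, per the steps above.
import os
import shutil
import tarfile
import tempfile

def repack_with_root_ownership(src_tgz, dst_tgz, extra_jars=()):
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(src_tgz, "r:gz") as src:
            src.extractall(tmp)
        # The distribution tarball has a single top-level directory.
        root = os.path.join(tmp, os.listdir(tmp)[0])
        for jar in extra_jars:
            shutil.copy(jar, os.path.join(root, "jars"))

        def chown_root(info):
            # Force uid/gid 0/0 on every member as it is re-added.
            info.uid = info.gid = 0
            info.uname = info.gname = "root"
            return info

        with tarfile.open(dst_tgz, "w:gz") as dst:
            dst.add(root, arcname=os.path.basename(root), filter=chown_root)
```

With GNU tar the equivalent repack would use `--owner=0 --group=0` when re-creating the archive.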

best,
koert



Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-20 Thread Sean Owen
Re: 1), I think we tried to fix that on the build side, but it requires
flags that not all tar versions (e.g., the one on OS X) have. But that's
tangential.

I think the Avro + Parquet dependency situation is generally
problematic -- see JIRA for some details. But yes, I'm not surprised if
Spark has a different version from Hadoop 2.7.x, and that would cause
problems -- if using Avro. I'm not sure the mistake is that the JARs
are missing, as I think this is supposed to be a 'provided'
dependency, but I haven't looked into it. If there's any easy, obvious
correction to be made there, by all means.

Not sure what the deal is with jline... I'd expect that's in the
"hadoop-provided" distro? That one may be a real issue if it's
considered provided but isn't used that way.





Re: Hadoop version(s) compatible with spark-2.4.3-bin-without-hadoop-scala-2.12

2019-05-20 Thread Koert Kuipers
its somewhat weird because avro-mapred-1.8.2-hadoop2.jar is included in the
hadoop-provided distro, but avro-1.8.2.jar is not. i tried to fix it but i
am not too familiar with the pom file.

regarding jline, you only run into this if you use spark-shell (and it isnt
always reproducible, it seems). see SPARK-25783

best,
koert






Re: Resolving all JIRAs affecting EOL releases

2019-05-20 Thread shane knapp
alright, i found 3 jiras that i was able to close:

   1. SPARK-19612
   2. SPARK-22996
   3. SPARK-22766

On Sun, May 19, 2019 at 6:43 PM Hyukjin Kwon  wrote:

> Thanks Shane .. the URL I linked somehow didn't work in other people's
> browsers. Hope this link works:
>
>
> https://issues.apache.org/jira/browse/SPARK-23492?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w
>
> I will take an action around this time tomorrow considering there were
> some more changes to make at the last minute.
>
>
> On Sun, May 19, 2019 at 6:39 PM, Hyukjin Kwon wrote:
>
>> I will add one more condition for "updated". So, it will additionally
>> avoid things updated within one year but left open against EOL releases.
>>
>> project = SPARK
>>   AND status in (Open, "In Progress", Reopened)
>>   AND (
>> affectedVersion = EMPTY OR
>> NOT (affectedVersion in versionMatch("^3.*")
>>   OR affectedVersion in versionMatch("^2.4.*")
>>   OR affectedVersion in versionMatch("^2.3.*")
>> )
>>   )
>>   AND updated <= -52w
>>
>>
>> https://issues.apache.org/jira/issues/?filter=12344168&jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)%0A%20%20AND%20updated%20%3C%3D%20-52w
>>
>> This still brings the open JIRAs under 1000, which is what I originally targeted.
>>
>>
>>
>> On Sun, May 19, 2019 at 6:08 PM, Sean Owen wrote:
>>
>>> I'd only tweak this to perhaps not close JIRAs that have been updated
>>> recently -- even just avoiding things updated in the last month. For
>>> example this would close
>>> https://issues.apache.org/jira/browse/SPARK-27758 which was opened
>>> Friday (though, for other reasons it should probably be closed). Still I
>>> don't mind it under the logic that it has been reported against 2.1.0.
>>>
>>> On the other hand, I'd go further and close _anything_ not updated in a
>>> long time, like a year (or 2 if feeling conservative). That is there's
>>> probably a lot of old cruft out there that wasn't marked with an Affected
>>> Version, before that was required.
>>>
>>> On Sat, May 18, 2019 at 10:48 PM Hyukjin Kwon 
>>> wrote:
>>>
 Thanks guys.

 This thread got more than 3 PMC votes without any objection. I slightly
 edited JQL from Abdeali's suggestion (thanks, Abdeali).


 JQL:

 project = SPARK
   AND status in (Open, "In Progress", Reopened)
   AND (
 affectedVersion = EMPTY OR
 NOT (affectedVersion in versionMatch("^3.*")
   OR affectedVersion in versionMatch("^2.4.*")
   OR affectedVersion in versionMatch("^2.3.*")
 )
   )


 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20%0A%20%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%0A%20%20AND%20(%0A%20%20%20%20affectedVersion%20%3D%20EMPTY%20OR%0A%20%20%20%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)%0A%20%20%20%20%20%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.3.*%22)%0A%20%20%20%20)%0A%20%20)


 It means we will resolve all JIRAs that have EOL releases as affected
 versions, including no version specified in affected versions - this will
 reduce open JIRAs under 900.

 Looks I can use a bulk action feature in JIRA. Tomorrow at the similar
 time, I will
 - Label those JIRAs as 'bulk-closed'
 - Resolve them via `Incomplete` status.

 Please double check the list and let me know if you guys have any
 concern.





 On Sat, May 18, 2019 at 12:22 PM, Dongjoon Hyun wrote:

> +1, too.
>
> Thank you, Hyukjin!
>
> Bests,
> Dongjoon.
>
>
> On Fri, May 17, 2019 at 9:07 AM Imran Rashid
>  wrote:
>
>> +1, thanks for taking this on
>>
>> On Wed, May 15, 2019 at 7:26 PM Hyukjin Kwon 
>> wrote:
>>
>>> oh, wait. 'Incomplete' can still make sense in this way then.
>>> Yes, I am good with 'Incomplete' too.
>>>
>>> On May 16, 2019
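As an aside for readers: the long issues.apache.org links quoted in this thread are simply the JQL text percent-encoded into the URL's jql query parameter, which can be reproduced with the standard library:

```python
# The long JIRA filter URLs quoted above are just JQL percent-encoded
# into the `jql` query parameter.
from urllib.parse import quote

jql = 'project = SPARK AND status in (Open, "In Progress", Reopened) AND updated <= -52w'
url = "https://issues.apache.org/jira/issues/?jql=" + quote(jql)
print(url)
```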

Re: Resolving all JIRAs affecting EOL releases

2019-05-20 Thread Hyukjin Kwon
I took action on those JIRAs.

The JIRAs that had not been updated in the last year and had an EOL release
among their affected versions are now:
  - Resolved with the 'Incomplete' status
  - Labeled 'bulk-closed'

Thanks guys.

RDD object Out of scope.

2019-05-20 Thread Nasrulla Khan Haris
Hi Spark developers,

Can someone point out the code where RDD objects go out of scope? I found the
ContextCleaner code, in which only persisted RDDs are cleaned up at regular
intervals if the RDD is registered for cleanup. I have not found where a
destructor for the RDD object is invoked. I am trying to understand when RDD
cleanup happens when the RDD is not persisted.

Thanks in advance, appreciate your help.
Nasrulla
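For context: Spark has no explicit RDD destructor. ContextCleaner keeps weak references to registered objects and runs its cleanup tasks once the JVM garbage collector finds them unreachable; an RDD that was never persisted is simply collected as ordinary garbage, with nothing registered to clean. A toy sketch of that weak-reference pattern in plain Python (names are illustrative, not Spark's API):

```python
# Toy sketch (plain Python, not Spark) of the weak-reference pattern that
# ContextCleaner uses: no destructor is ever called on the RDD; instead,
# a cleanup callback fires when the garbage collector reclaims it.
import gc
import weakref

cleaned = []

class ToyRDD:
    def __init__(self, rdd_id):
        self.rdd_id = rdd_id

def register_for_cleanup(rdd):
    # Keep only a weak reference; the callback runs after GC, analogous
    # to ContextCleaner's reference-queue-driven cleanup tasks.
    weakref.finalize(rdd, cleaned.append, rdd.rdd_id)

rdd = ToyRDD(42)
register_for_cleanup(rdd)
del rdd        # the RDD goes out of scope...
gc.collect()   # ...and cleanup happens only when the GC notices
print(cleaned) # [42]
```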